Massive Algorithms: Google Interview

Google Interview - Valid UTF-8 Character
http://algorithmguru.com/blog/?p=148

UTF-8 is a variable-length encoding of letters or runes. If the most significant bit of the first byte is 0, the letter is 1 byte long. Otherwise, its length is the number of leading 1’s in the first byte. If a letter is more than one byte long, all subsequent runes start with 10. Here’s a chart:

UTF-8 encoding scheme is described below:

0XXXXXXX = this is the entire rune
10XXXXXX = this is a continuation of the rune from the previous byte
110XXXXX = this is the start of a 2-byte rune.
1110XXXX = this is the start of a 3-byte rune.
11110XXX = this is the start of a 4-byte rune.
111110XX = this is the start of a 5-byte rune.
1111110X = this is the start of a 6-byte rune.
11111110 = this is the start of a 7-byte rune.
11111111 = this is the start of a 8-byte rune.

For example, a 3-byte rune would be 1110XXXX 10XXXXXX 10XXXXXX.

UTF-8是一种变长的编码方式。它可以使用1~4个字节表示一个符号，根据不同的符号而变化字节长度。UTF-8的编码规则很简单，只有二条：
1）对于单字节的符号，字节的第一位设为0，后面7位为这个符号的unicode码。因此对于英语字母，UTF-8编码和ASCII码是相同的。
2）对于n字节的符号（n>1），第一个字节的前n位都设为1，第n+1位设为0，后面字节的前两位一律设为10。剩下的没有提及的二进制位，全部为这个符号的unicode码。
0xxxxxxx是一个合法的单字节UTF8编码。
110xxxxx 10xxxxxx是一个合法的2字节UTF8编码。
1110xxxx 10xxxxxx 10xxxxxx是一个合法的3字节UTF8编码。
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx是一个合法的4字节UTF8编码。


bool valid_utf8(const vector<unsigned char>& data) {

int size = 0;

for(auto c : data) {

if(size == 0) {

if((c >> 5) == 0b110) size = 1;

else if((c >> 4) == 0b1110) size = 2;

else if((c >> 3) == 0b11110) size = 3;

else if(c >> 7) return false;

} else {

if((c >> 6) != 0b10) return false;

--size;

}

}

return size == 0;

}

https://reeestart.wordpress.com/2016/06/23/google-utf8-validation/

UTF8编码定义为 1~8个byte:
1byte以0开头，0XXXXXXX
2byte以110开头，110XXXXX
3byte以1110开头，1110XXXX
4byte以11110开头，11110XXX
5byte以111110开头，111110XX
6byte以1111110开头，1111110X
7byte以11111110开头
8byte以11111111开头

10XXXXXX这种作为连续多个byte的组成部分，比如3byte 就是1110XXXX 10XXXXXX 10XXXXXX。

要求就是给一个byte[]数组，判断它是否符合编码定义。

[Solution]
http://codereview.stackexchange.com/questions/59428/validating-utf-8-byte-array

stack overflow上很好的解法。

思路就是看开头的byte，根据开头来判断byte数组应该有的长度，再和实际长度比较，以及后面的组成部分是否都start with 10.

而判断开头byte的方法有很多种，最简单就是brute force扫一遍看有几个1. 但是stack overflow上面那个bit manipulation的方法很巧妙。

比如2byte, 就用0b1110000和实际byte[0]做AND, 如果是valid的编码，得到的结果一定是0b11000000:

0b11100000
& 0bXXXXXXXX
= 0b11000000

[Node]
另外注意utf8定义长度超过6的都是不合法的。

  public boolean utf8Validation(byte[] bytes) {

    int expectedLen = 0;

    if (bytes.length == 0) {

      return false;

}

    else if ((bytes[0] & 0b10000000) == 0b00000000) {

      expectedLen = 1;

}

    else if ((bytes[0] & 0b11100000) == 0b11000000) {

      expectedLen = 2;

}

    else if ((bytes[0] & 0b11110000) == 0b11100000) {

      expectedLen = 3;

}

    else if ((bytes[0] & 0b11111000) == 0b11110000) {

      expectedLen = 4;

}

    else if ((bytes[0] & 0b11111100) == 0b11111000) {

      expectedLen = 5;

}

    else if ((bytes[0] & 0b11111110) == 0b11111100) {

      expectedLen = 6;

}

    else {

      return false;

}

    if (expectedLen != bytes.length) {

      return false;

}

    for (int i = 1; i < bytes.length; i++) {

      if ((bytes[i] & 0b11000000) != 0b10000000) {

        return false;

}

}

    return true;

}

Read full article from Google Interview - Valid UTF-8 Character

Google Interview - Valid UTF-8 Character

Labels

Popular Posts