Massive Algorithms: Write Code vs. Write Poetry: UTF-8 Validation (Array)

Write Code vs. Write Poetry: UTF-8 Validation (Array)
Question: A string is encoded in UTF-8 as a sequence of bytes, where each character is composed of one or more bytes in the encoding. To figure out how many bytes are in a character, look at the leading number of ones in the binary representation of the character's first byte:
0xxxxxxx - 1-byte character (i.e., 0xxxxxxx)
10xxxxxx - continuation byte (not valid as a leading byte)
110xxxxx - starts a 2-byte character (i.e., 110xxxxx 10xxxxxx)
1110xxxx - starts a 3-byte character (i.e., 1110xxxx 10xxxxxx 10xxxxxx)
11110xxx - starts a 4-byte character
…
11111111 - starts a 8-byte character
For example:
[11001101, 10000111, 00110000, 11101111, 10010011, 10000110] is a valid three-character string.
First character is 11001101, 10000111.
Second character is 00110000.
Third character is 11101111, 10010011, 10000110.
[01001110, 11000011, 11010101] is an invalid string.
3rd byte should be continuation byte.
[10111000, 00010010] is an invalid string.
1st byte is should be a non-continuation byte.
Write a function that determines whether a given byte array is a valid UTF-8-encoded string.

Idea: Brute force. For each input string array, scan from left to right. If it is a start character, check the next x strings whether it is 10xxxxxx format.

Time: O(n) Space: O(1)

      public boolean checkUTF8(String[] input)  
      {  
           if(input==null||input.length==0)  
                return true;  
           int index=0;  
           while(index<input.length)  
           {  
                int type=getType(input[index]);  
                if(type==0)  
                     return false;  
                int end=index+type-1;  
                if(end>=input.length)  
                     return false;  
                while(end>index)  
                {  
                     if(getType(input[end])!=0)  
                          return false;  
                     end--;  
                }  
                index=index+type;  
           }  
           return true;  
      }  
      public int getType(String s)  
      {  
           int i=0;  
           for(;i<s.length();i++)  
           {  
                if(s.charAt(i)!='1')  
                     break;  
           }  
           if(i==0)  
                return 1;  
           else if(i==1)  
                return 0;  
           else  
                return i;  
      }

https://gist.github.com/gcrfelix/f24af648dee0483a12fe
https://en.wikipedia.org/wiki/UTF-8#Description
Byte 1 Byte 2 Byte 3 Byte 4 Byte 5 Byte 6
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

根据链接里发现一个UTF-8最多有6bytes，这个byte[] input的长度就代表几个byte。我们还发现不同byte长度的UFT8的区别就是第一个byte不一样以及长度
不一样。基于这两个区别，我们可以进行validate。

public static boolean validate(byte[] bytes) {
int expectedLen;
if (bytes.length == 0) return false;
else if ((bytes[0] & 0b10000000) == 0b00000000) expectedLen = 1; // 对第一个byte进行validate，这里0b notation代表二进制
else if ((bytes[0] & 0b11100000) == 0b11000000) expectedLen = 2;
else if ((bytes[0] & 0b11110000) == 0b11100000) expectedLen = 3;
else if ((bytes[0] & 0b11111000) == 0b11110000) expectedLen = 4;
else if ((bytes[0] & 0b11111100) == 0b11111000) expectedLen = 5;
else if ((bytes[0] & 0b11111110) == 0b11111100) expectedLen = 6;
else return false;

if (expectedLen != bytes.length) return false; // 进行长度的validate

for (int i = 1; i < bytes.length; i++) {
if ((bytes[i] & 0b11000000) != 0b10000000) { // 对剩余的byte进行validate
return false;
}
}
return true;
}
http://massivealgorithms.blogspot.com/2015/06/google-interview-valid-utf-8-character.html
Read full article from Write Code vs. Write Poetry: UTF-8 Validation (Array)

Write Code vs. Write Poetry: UTF-8 Validation (Array)

Labels

Popular Posts