Write Code vs. Write Poetry: UTF-8 Validation (Array)
Question: A string is encoded in UTF-8 as a sequence of bytes, where each character is composed of one or more bytes in the encoding. To figure out how many bytes are in a character, look at the leading number of ones in the binary representation of the character's first byte:
0xxxxxxx - 1-byte character (i.e., 0xxxxxxx)
10xxxxxx - continuation byte (not valid as a leading byte)
110xxxxx - starts a 2-byte character (i.e., 110xxxxx 10xxxxxx)
1110xxxx - starts a 3-byte character (i.e., 1110xxxx 10xxxxxx 10xxxxxx)
11110xxx - starts a 4-byte character
…
11111111 - starts a 8-byte character
For example:
[11001101, 10000111, 00110000, 11101111, 10010011, 10000110] is a valid three-character string.
First character is 11001101, 10000111.
Second character is 00110000.
Third character is 11101111, 10010011, 10000110.
[01001110, 11000011, 11010101] is an invalid string.
3rd byte should be continuation byte.
[10111000, 00010010] is an invalid string.
1st byte is should be a non-continuation byte.
Write a function that determines whether a given byte array is a valid UTF-8-encoded string.
Idea: Brute force. For each input string array, scan from left to right. If it is a start character, check the next x strings whether it is 10xxxxxx format.
Time: O(n) Space: O(1)
https://en.wikipedia.org/wiki/UTF-8#Description
Byte 1 Byte 2 Byte 3 Byte 4 Byte 5 Byte 6
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
根据链接里发现一个UTF-8最多有6bytes,这个byte[] input的长度就代表几个byte。我们还发现不同byte长度的UFT8的区别就是第一个byte不一样以及长度
不一样。基于这两个区别,我们可以进行validate。
public static boolean validate(byte[] bytes) {
int expectedLen;
if (bytes.length == 0) return false;
else if ((bytes[0] & 0b10000000) == 0b00000000) expectedLen = 1; // 对第一个byte进行validate,这里0b notation代表二进制
else if ((bytes[0] & 0b11100000) == 0b11000000) expectedLen = 2;
else if ((bytes[0] & 0b11110000) == 0b11100000) expectedLen = 3;
else if ((bytes[0] & 0b11111000) == 0b11110000) expectedLen = 4;
else if ((bytes[0] & 0b11111100) == 0b11111000) expectedLen = 5;
else if ((bytes[0] & 0b11111110) == 0b11111100) expectedLen = 6;
else return false;
if (expectedLen != bytes.length) return false; // 进行长度的validate
for (int i = 1; i < bytes.length; i++) {
if ((bytes[i] & 0b11000000) != 0b10000000) { // 对剩余的byte进行validate
return false;
}
}
return true;
}
http://massivealgorithms.blogspot.com/2015/06/google-interview-valid-utf-8-character.html
Read full article from Write Code vs. Write Poetry: UTF-8 Validation (Array)
Question: A string is encoded in UTF-8 as a sequence of bytes, where each character is composed of one or more bytes in the encoding. To figure out how many bytes are in a character, look at the leading number of ones in the binary representation of the character's first byte:
0xxxxxxx - 1-byte character (i.e., 0xxxxxxx)
10xxxxxx - continuation byte (not valid as a leading byte)
110xxxxx - starts a 2-byte character (i.e., 110xxxxx 10xxxxxx)
1110xxxx - starts a 3-byte character (i.e., 1110xxxx 10xxxxxx 10xxxxxx)
11110xxx - starts a 4-byte character
…
11111111 - starts a 8-byte character
For example:
[11001101, 10000111, 00110000, 11101111, 10010011, 10000110] is a valid three-character string.
First character is 11001101, 10000111.
Second character is 00110000.
Third character is 11101111, 10010011, 10000110.
[01001110, 11000011, 11010101] is an invalid string.
3rd byte should be continuation byte.
[10111000, 00010010] is an invalid string.
1st byte is should be a non-continuation byte.
Write a function that determines whether a given byte array is a valid UTF-8-encoded string.
Idea: Brute force. For each input string array, scan from left to right. If it is a start character, check the next x strings whether it is 10xxxxxx format.
Time: O(n) Space: O(1)
public boolean checkUTF8(String[] input)
{
if(input==null||input.length==0)
return true;
int index=0;
while(index<input.length)
{
int type=getType(input[index]);
if(type==0)
return false;
int end=index+type-1;
if(end>=input.length)
return false;
while(end>index)
{
if(getType(input[end])!=0)
return false;
end--;
}
index=index+type;
}
return true;
}
public int getType(String s)
{
int i=0;
for(;i<s.length();i++)
{
if(s.charAt(i)!='1')
break;
}
if(i==0)
return 1;
else if(i==1)
return 0;
else
return i;
}
https://gist.github.com/gcrfelix/f24af648dee0483a12fehttps://en.wikipedia.org/wiki/UTF-8#Description
Byte 1 Byte 2 Byte 3 Byte 4 Byte 5 Byte 6
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
根据链接里发现一个UTF-8最多有6bytes,这个byte[] input的长度就代表几个byte。我们还发现不同byte长度的UFT8的区别就是第一个byte不一样以及长度
不一样。基于这两个区别,我们可以进行validate。
public static boolean validate(byte[] bytes) {
int expectedLen;
if (bytes.length == 0) return false;
else if ((bytes[0] & 0b10000000) == 0b00000000) expectedLen = 1; // 对第一个byte进行validate,这里0b notation代表二进制
else if ((bytes[0] & 0b11100000) == 0b11000000) expectedLen = 2;
else if ((bytes[0] & 0b11110000) == 0b11100000) expectedLen = 3;
else if ((bytes[0] & 0b11111000) == 0b11110000) expectedLen = 4;
else if ((bytes[0] & 0b11111100) == 0b11111000) expectedLen = 5;
else if ((bytes[0] & 0b11111110) == 0b11111100) expectedLen = 6;
else return false;
if (expectedLen != bytes.length) return false; // 进行长度的validate
for (int i = 1; i < bytes.length; i++) {
if ((bytes[i] & 0b11000000) != 0b10000000) { // 对剩余的byte进行validate
return false;
}
}
return true;
}
http://massivealgorithms.blogspot.com/2015/06/google-interview-valid-utf-8-character.html
Read full article from Write Code vs. Write Poetry: UTF-8 Validation (Array)