The Code Sniper: 187. Repeated DNA Sequences Leetcode
Rolling hash and hashtable.
https://zyfu0408.gitbooks.io/interview-preparation/content/bit-manipulation/storer-information-by-bit.html
A C G T4个character能够用0,1,2,3这个四个数字表示,也就是00,01,01,和11,每个只占用了2位,DNA sequence是10个连续的character,也就是能用20位来表达。一个int能够包含32位,因此我们可以用int来表示DNA sequence
public List<String> findRepeatedDnaSequences(String s) { Set<Integer> words = new HashSet<>(); Set<Integer> doubleWords = new HashSet<>(); List<String> rv = new ArrayList<>(); char[] map = new char[26]; //map['A' - 'A'] = 0; map['C' - 'A'] = 1; map['G' - 'A'] = 2; map['T' - 'A'] = 3; for(int i = 0; i < s.length() - 9; i++) { int v = 0; for(int j = i; j < i + 10; j++) { v <<= 2; v |= map[s.charAt(j) - 'A']; } if(!words.add(v) && doubleWords.add(v)) { rv.add(s.substring(i, i + 10)); } } return rv; }
https://leetcode.com/discuss/46948/accepted-java-easy-to-understand-solution
private int stringToHash(String s) { int numberBuilder = 0; for (int i = 0; i < s.length(); i++) { numberBuilder <<= 2; if (s.charAt(i) == 'A') numberBuilder |= 0b00; else if (s.charAt(i) == 'C') numberBuilder |= 0b01; else if (s.charAt(i) == 'G') numberBuilder |= 0b10; else if (s.charAt(i) == 'T') numberBuilder |= 0b11; } return numberBuilder; }
http://betterpoetrythancode.blogspot.com/2015/02/repeated-dna-sequences-leetcode-bit.html
Idea: Create a HashMap<bit representation of substring, appeared times>. For each 10 letter window, calculate its bit representation as the key. Then slide the window from left to right, if the key shows up exactly twice, append the substring to the result.
Time: O(n) Space: O(n)
// substring is not efficient.
http://www.programcreek.com/2014/03/leetcode-repeated-dna-sequences-java/
http://likesky3.iteye.com/blog/2236004
此题思路是容易想到的,遍历输入字符串的每个长度为10的substring,利用HashMap 检查其出现次数,出现两次或者以上的则加入到结果中。
实现时仅当某个substring第二次出现时加入结果可避免结果中出现重复字符串。但直接实现会得到Memory Limit Exceed,就是程序内存开销太大了。
此题的关键就是要将那些待检查的substring转换为int来节省内存,如何高效的编码substring?共4个字符,ACGT,可用两个bit区分它们,分别是00,01,10,11,
参考解答中的掩码技巧值得学习,使用一个20位的数字0x3ffff称为eraser,每次要更新一位字符时,将老的编码hint & eraser, 然后左移两位,然后加上新字符对应的编码,
这样就得到了新substring的编码,很巧妙~
哈希表法
时间 O(N) 空间 O(N)
最简单的做法,我们可以把位移一位后每个子串都存入哈希表中,如果哈希表中已经有这个子串,而且是第一次重复,则加入结果中。如果已经遇到多次,则不加入结果中。如果哈希表没有这个子串,则把这个子串加入哈希表中。
We can do better.
时间 O(N) 空间 O(N)
思路
实际上我们的哈希表可以不用存整个子串,因为我们知道子串只有10位,且每一位只可能有4种不同的字母,那我们可以用4^10个数字来表示每种不同的序列,因为4^10=2^20<2^32所以我们可以用一个Integer来表示。具体的编码方法是用每两位bit表示一个字符。
public List<String> findRepeatedDnaSequences(String s) { Set seen = new HashSet(), repeated = new HashSet(); for (int i = 0; i + 9 < s.length(); i++) { String ten = s.substring(i, i + 10); if (!seen.add(ten)) repeated.add(ten); } return new ArrayList(repeated); }
LeetCode Extended:
http://www.1point3acres.com/bbs/thread-147555-1-1.html
是要求按大小顺序输出来着,alphabetical order。我直接用的是int array,没有被要求用bit vector来优化了~
hash函数把结果变成整数,范围是0到4^10 - 1, 然后直接用一个这么大的数组来记录结果。之后遍历数组,把找到的结果reverse hash回去~
一共有三种状态吧:0次出现,1次出现,至少两次出现。然后用两个bit来表示这三种状态咯。bit的读取和改变就用and, or之类的bitwise operator来做吧
Read full article from The Code Sniper: 187. Repeated DNA Sequences Leetcode
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
Example:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" Output: ["AAAAACCCCC", "CCCCCAAAAA"]
X. Bit
public List<String> findRepeatedDnaSequences(String s) {
Set<Integer> words = new HashSet<>();
Set<Integer> doubleWords = new HashSet<>();
List<String> rv = new ArrayList<>();
char[] map = new char[26];
//map['A' - 'A'] = 0;
map['C' - 'A'] = 1;
map['G' - 'A'] = 2;
map['T' - 'A'] = 3;
for(int i = 0; i < s.length() - 9; i++) {
int v = 0;
for(int j = i; j < i + 10; j++) {
v <<= 2;
v |= map[s.charAt(j) - 'A'];
}
if(!words.add(v) && doubleWords.add(v)) {
rv.add(s.substring(i, i + 10));
}
}
return rv;
}
X. Rolling hash(bit)
Improve time complexity from O(kn) to O(n)
public List<String> findRepeatedDnaSequences(String s) {
if(s.length() < 11) return new ArrayList<>();
Set<Integer> words1 = new HashSet<>();
Set<Integer> words2 = new HashSet<>();
List<String> res = new ArrayList<>();
char[] map = new char[26];
map['A' - 'A'] = 0;
map['C' - 'A'] = 1;
map['G' - 'A'] = 2;
map['T' - 'A'] = 3;
int val = 0;
for(int i = 0; i < 10; i++){ // first value
val = val << 2;
val = val | map[s.charAt(i) - 'A'];
}
words1.add(val);
for(int i = 1; i < s.length() - 9; i++){
val &= ~(3 << 18);
val = val << 2;
val = val | map[s.charAt(i+9) - 'A'];
if(!words1.add(val) && words2.add(val)){
res.add(s.substring(i, i + 10));
}
}
return res;
}
https://leetcode.com/problems/repeated-dna-sequences/discuss/53952/20-ms-solution-(C%2B%2B)-with-explanation
public List<String> findRepeatedDnaSequences(String s) {
if (s.length() < 10) {
return Collections.emptyList();
}
List<String> ret = new ArrayList<String>();
Map<Character, Integer> map = new HashMap<>();
map.put('A', 0);
map.put('C', 1);
map.put('G', 2);
map.put('T', 3);
Set<Integer> set1 = new HashSet<>();
Set<Integer> set2 = new HashSet<>();
int hash = 0;
for (int i = 0; i < 10; i++) {
hash = hash << 2;
hash |= map.get(s.charAt(i));
}
set1.add(hash);
int mask = (1 << 20) - 1;
System.out.println(Integer.toBinaryString(mask));
for (int i = 10; i < s.length(); i++) {
hash = (hash << 2) & mask;
hash |= map.get(s.charAt(i));
if (!set1.contains(hash)) {
set1.add(hash);
continue;
}
if(!set2.contains(hash)) {
ret.add(s.substring(i - 9, i + 1));
set2.add(hash);
}
}
return ret;
}
https://leetcode.com/problems/repeated-dna-sequences/discuss/53902/Short-Java-"rolling-hash"-solution
The idea is to use rolling hash technique or in case of string search also known as Rabin-Karp algorithm. As our alphabet A consists of only 4 letters we can be not afraid of collisions. The hash for a current window slice could be found in a constant time by subtracting the former first character times size of the A in the power of 9 and updating remaining hash by the standard rule: hash = hash*A.size() + curr_char.
private static final Map<Character, Integer> A = new HashMap<>();
static { A.put('A',0); A.put('C',1); A.put('G',2); A.put('T',3); }
private final int A_SIZE_POW_9 = (int) Math.pow(A.size(), 9);
public List<String> findRepeatedDnaSequences(String s) {
Set<String> res = new HashSet<>();
Set<Integer> hashes = new HashSet<>();
for (int i = 0, rhash = 0; i < s.length(); i++) {
if (i > 9) rhash -= A_SIZE_POW_9 * A.get(s.charAt(i-10));
rhash = A.size() * rhash + A.get(s.charAt(i));
if (i > 8 && !hashes.add(rhash)) res.add(s.substring(i-9,i+1));
}
return new ArrayList<>(res);
}
Rolling hash and hashtable.
Since there are only 4 possible letters, we can build a rolling hash with base of 5 and calculate hashcode for each 10-letter-long subsequences.
If we define code('A'):1, code('C'):2; code('G'):3; code('T'):4, Then the hashcode for the first 10-letter-long sequence will be 1*5^0+1*5^1+1*5^2+...2*5^5+...2*5^9. The advantage of rolling hash method is that we don't need to recalculate the hash code from the beginning of the next 10-letter-long sequence, we can calculate the hash code hc[i] based on last hash code hc[i-1]. hc[i]=hc[i-1]/5+code(s.charAt(i))*5^9. It is guranteed that unique DNA sequence will have unique hashcode. By using a hashtable, we can see whether the sequence is occurred more than once.
public List<String> findRepeatedDnaSequences(String s) {
List<String> res=new ArrayList<String>();
if(s==null || s.length()<10) return res;
HashMap<Integer,Integer> map=new HashMap<Integer,Integer>();
int hashCode=0;
int base=1;
for(int i=0;i<10;i++){
hashCode+=getCode(s.charAt(i))*base;
base*=5;
}
map.put(hashCode,1);
base=base/5;
for(int i=10;i<s.length();i++){
hashCode=hashCode/5+getCode(s.charAt(i))*base;
if(!map.containsKey(hashCode)) map.put(hashCode,1);
else{
if(map.get(hashCode)==1) {
res.add(s.substring(i-9,i+1));
map.put(hashCode,2);
}
}
}
return res;
}
public int getCode(char c){
switch (c){
case 'A': return 1;
case 'C': return 2;
case 'G': return 3;
case 'T': return 4;
default: return 0;
}
}
https://zyfu0408.gitbooks.io/interview-preparation/content/bit-manipulation/storer-information-by-bit.html
A C G T4个character能够用0,1,2,3这个四个数字表示,也就是00,01,01,和11,每个只占用了2位,DNA sequence是10个连续的character,也就是能用20位来表达。一个int能够包含32位,因此我们可以用int来表示DNA sequence
public List<String> findRepeatedDnaSequences(String s) {
List<String> result = new ArrayList<String>();
if (s == null || s.length() < 10) {
return result;
}
Map<Character, Integer> map = new HashMap<Character, Integer>();
map.put('A', 0);
map.put('C', 1);
map.put('G', 2);
map.put('T', 3);
Set<Integer> dna = new HashSet<Integer>();
for (int i = 9; i < s.length(); i++) {
int k = 0, num = 0;
for (int j = i - 9; j <= i; j++) {
int n = map.get(s.charAt(j));
num |= n << k;
k += 2;
}
if (dna.contains(num)) {
String str = s.substring(i - 9, i + 1);
if (!result.contains(str)) {
result.add(str);
}
}
dna.add(num);
}
return result;
}
https://leetcode.com/discuss/25399/clean-java-solution-hashmap-bits-manipulationpublic List<String> findRepeatedDnaSequences(String s) { Set<Integer> words = new HashSet<>(); Set<Integer> doubleWords = new HashSet<>(); List<String> rv = new ArrayList<>(); char[] map = new char[26]; //map['A' - 'A'] = 0; map['C' - 'A'] = 1; map['G' - 'A'] = 2; map['T' - 'A'] = 3; for(int i = 0; i < s.length() - 9; i++) { int v = 0; for(int j = i; j < i + 10; j++) { v <<= 2; v |= map[s.charAt(j) - 'A']; } if(!words.add(v) && doubleWords.add(v)) { rv.add(s.substring(i, i + 10)); } } return rv; }
Improve a little bit. Maintain a size 10 window rather than loop through every time.
public List<String> findRepeatedDnaSequences(String s) {
List<String> result = new ArrayList<>();
Set<Integer> word = new HashSet<>();
Set<Integer> secondWord = new HashSet<>();
int[] map = new int[26];
map['C' - 'A'] = 1;
map['G' - 'A'] = 2;
map['T' - 'A'] = 3;
int value = 0;
for (int i = 0; i < s.length(); i++) {
value <<= 2;
value |= map[s.charAt(i) - 'A'];
value &= 0xfffff;//??
if (i < 9) {
continue;
}
if (!word.add(value) && secondWord.add(value)) {
result.add(s.substring(i - 9, i + 1));
}
}
return result;
}https://leetcode.com/discuss/46948/accepted-java-easy-to-understand-solution
public List<String> findRepeatedDnaSequences(String s) {
List<String> list = new ArrayList<String>();
if (s == null || s.length() < 10) return list;
HashMap<Integer, Boolean> map = new HashMap<Integer, Boolean>();
for (int i = 0; i + 10 <= s.length(); i++ ) {
int hash = stringToHash(s.substring(i, i + 10));
if (map.containsKey(hash)) {
if (!map.get(hash)) {
list.add(s.substring(i, i + 10));
map.put(hash, true);
}
} else {
map.put(hash, false);
}
}
return list;
}
private int stringToHash (String s) {
String numberBuilder = "";
for (int i = 0; i < s.length(); i++) {
if (s.charAt(i) == 'A') numberBuilder += "0";
if (s.charAt(i) == 'C') numberBuilder += "1"; // add else
if (s.charAt(i) == 'G') numberBuilder += "2";
if (s.charAt(i) == 'T') numberBuilder += "3";
}
return Integer.parseInt(numberBuilder, 4);
}
Thanking for the suggestions by TWiStErRob at el. the stringToHash function can be implemented to be more efficient. Here TWiStErRob's code for it
private int stringToHash(String s) {
int numberBuilder = 0;
for (int i = 0; i < s.length(); i++) {
numberBuilder *= 4;
if (s.charAt(i) == 'A') numberBuilder += 0;
else if (s.charAt(i) == 'C') numberBuilder += 1;
else if (s.charAt(i) == 'G') numberBuilder += 2;
else if (s.charAt(i) == 'T') numberBuilder += 3;
}
return numberBuilder;
}
private int stringToHash(String s) { int numberBuilder = 0; for (int i = 0; i < s.length(); i++) { numberBuilder <<= 2; if (s.charAt(i) == 'A') numberBuilder |= 0b00; else if (s.charAt(i) == 'C') numberBuilder |= 0b01; else if (s.charAt(i) == 'G') numberBuilder |= 0b10; else if (s.charAt(i) == 'T') numberBuilder |= 0b11; } return numberBuilder; }
http://betterpoetrythancode.blogspot.com/2015/02/repeated-dna-sequences-leetcode-bit.html
Idea: Create a HashMap<bit representation of substring, appeared times>. For each 10 letter window, calculate its bit representation as the key. Then slide the window from left to right, if the key shows up exactly twice, append the substring to the result.
Time: O(n) Space: O(n)
// substring is not efficient.
public List<String> findRepeatedDnaSequences(String s) {
List<String> result=new ArrayList<String>();
HashMap<Integer,Integer> computed=new HashMap<Integer,Integer>();
for(int i=0;i<=s.length()-10;i++)
{
String sub=s.substring(i,i+10); // slow
int key=getKey(sub);
if(computed.containsKey(key))
{
computed.put(key,computed.get(key)+1);
if(computed.get(key)==2)
result.add(sub);
}
else
computed.put(key,1);
}
return result;
}
public int getKey(String s)
{
int result=0;
for(int i=s.length()-1;i>=0;i--)
{
int b=0;
switch(s.charAt(i))
{
case 'A':
b|=0;
break;
case 'T':
b|=1;
break;
case 'G':
b|=2;
break;
case 'C':
b|=3;
break;
}
result=b|result;
result=result<<2;
}
return result;
}
https://codesolutiony.wordpress.com/2015/02/10/repeated-dna-sequences/
public List<String> findRepeatedDnaSequences(String s) {
List<String> result = new ArrayList<String>();
Map<Character, Integer> map = new HashMap<Character, Integer>();
map.put('A', 0);
map.put('C', 1);
map.put('G', 2);
map.put('T', 3);
if (s == null || s.length() < 10) {
return new ArrayList<String>();
}
Set<Integer> unique = new HashSet<Integer>();
Set<Integer> set = new HashSet<Integer>();
for (int i = 0; i + 10 <= s.length(); i++) {
int value = 0;
for (int index = i; index < i+10; index++) {
value = (value << 2) | map.get(s.charAt(index));
}
if (set.contains(value) && !unique.contains(value)) {
result.add(s.substring(i, i+10));
unique.add(value);
} else {
set.add(value);
}
}
return result;
}
public List<String> findRepeatedDnaSequences(String s) {
Set<String> result = new HashSet<String>();
if (s == null || s.length() < 10) {
return new ArrayList<String>();
}
int targetHash = 0, resHash = 0, base = 7, multiplies = 1;
for (int i = 0; i < 10; i++) {
targetHash = targetHash * base + s.charAt(i);
multiplies *= base;
}
multiplies /= base;
Set<Integer> set = new HashSet<Integer>();
set.add(targetHash);
for (int i = 1; i + 10 <= s.length(); i++) {
targetHash = (targetHash - multiplies * s.charAt(i-1)) * base + s.charAt(i + 9);
if (set.contains(targetHash)) {
result.add(s.substring(i, i+10));
} else {
set.add(targetHash);
}
}
return new ArrayList<String>(result);
}
http://likesky3.iteye.com/blog/2236004
此题思路是容易想到的,遍历输入字符串的每个长度为10的substring,利用HashMap 检查其出现次数,出现两次或者以上的则加入到结果中。
实现时仅当某个substring第二次出现时加入结果可避免结果中出现重复字符串。但直接实现会得到Memory Limit Exceed,就是程序内存开销太大了。
此题的关键就是要将那些待检查的substring转换为int来节省内存,如何高效的编码substring?共4个字符,ACGT,可用两个bit区分它们,分别是00,01,10,11,
参考解答中的掩码技巧值得学习,使用一个20位的数字0x3ffff称为eraser,每次要更新一位字符时,将老的编码hint & eraser, 然后左移两位,然后加上新字符对应的编码,
这样就得到了新substring的编码,很巧妙~
- // Method 2: hashmap store int instead of string to bypass MLE
- public static final int eraser = 0x3ffff;
- public static HashMap<Character, Integer> ati = new HashMap<Character, Integer>();
- static {
- ati.put('A', 0);
- ati.put('C', 1);
- ati.put('G', 2);
- ati.put('T', 3);
- }
- public List<String> findRepeatedDnaSequences(String s) {
- List<String> result = new ArrayList<String>();
- if (s == null || s.length() <= 10)
- return result;
- int N = s.length();
- int hint = 0;
- for (int i = 0; i < 10; i++) {
- hint = (hint << 2) + ati.get(s.charAt(i));
- }
- HashMap<Integer, Integer> checker = new HashMap<Integer, Integer>();
- checker.put(hint, 1);
- for (int i = 10; i < N; i++) {
- hint = ((hint & eraser) << 2) + ati.get(s.charAt(i));
- Integer value = checker.get(hint); // Count Map
- if (value == null) {
- checker.put(hint, 1);
- } else if (value == 1) {
- checker.put(hint, value + 1); // this is not needed.
- result.add(s.substring(i - 9, i + 1)); // could we avoid substring?
- }
- }
- return result;
- }
- // Method 1: Memory Limit Exceed & may contain duplicates
- public List<String> findRepeatedDnaSequences1(String s) {
- HashMap<String, Integer> map = new HashMap<String, Integer>();
- int last = s.length() - 10;
- for (int i = 0; i <= last; i++) {
- String key = s.substring(i, i + 10);
- if (map.containsKey(key)) {
- map.put(key, map.get(key) + 1);
- } else {
- map.put(key, 1);
- }
- }
- List<String> result = new ArrayList<String>();
- for (String key : map.keySet()) {
- if (map.get(key) > 1)
- result.add(key);
- }
- return result;
- }
public List<String> findRepeatedDnaSequences(String s) {
List<String> res = new ArrayList<String>();
if (s == null || s.length() < 10) {
return res;
}
Map<Integer, Integer> map = new HashMap<Integer, Integer>();
// map<> stores int-to-int: first int is "ACGT" -> 1234
// second int is the init position of the pattern in s
int num = 0;
int mask = 0xFFFFFFFF;
mask >>>= (32 - 18);
for (int i = 0; i < s.length(); i++) {
num = mask & num;
// get the last 18 binary digits of num
num = (num << 2) | convertChatToInt(s.charAt(i));
// append num with the number representing charAt(i)
if (i >= 9) {
// insert or search num in the map
if (!map.containsKey(num)) {
map.put(num, (i - 9));
// (i - 9) is the start position of
// the 10-letter-long sequence
} else if (map.get(num) != -1) {
res.add(s.substring(map.get(num), map.get(num) + 10));
map.put(num, -1);
// after the sequence has been added, mark it in the map
} else {
// do nothing
}
}
}
return res;
}
int convertChatToInt(char ch) {
if (ch == 'A') {
return 1;
} else if (ch == 'C') {
return 2;
} else if (ch == 'G') {
return 3;
}
return 0;
}
http://segmentfault.com/a/1190000003922976哈希表法
时间 O(N) 空间 O(N)
最简单的做法,我们可以把位移一位后每个子串都存入哈希表中,如果哈希表中已经有这个子串,而且是第一次重复,则加入结果中。如果已经遇到多次,则不加入结果中。如果哈希表没有这个子串,则把这个子串加入哈希表中。
public List<String> findRepeatedDnaSequences(String s) {
List<String> res = new LinkedList<String>();
HashMap<String, Integer> map = new HashMap<String, Integer>();
for(int index = 10; index <= s.length(); index++){
// 从第10位开始作为结尾,位移一位,比较一次子串
String substr = s.substring(index - 10, index);
if(map.containsKey(substr)){
// 如果是第一次遇到,则加入结果
if(map.get(substr) == 1){
res.add(substr);
}
// 标记为已经遇到过一次了
map.put(substr, 2); // the real count doesn't mater
} else {
// 如果不存在,则加入表中
map.put(substr, 1);
}
}
return res;
}
编码法We can do better.
时间 O(N) 空间 O(N)
思路
实际上我们的哈希表可以不用存整个子串,因为我们知道子串只有10位,且每一位只可能有4种不同的字母,那我们可以用4^10个数字来表示每种不同的序列,因为4^10=2^20<2^32所以我们可以用一个Integer来表示。具体的编码方法是用每两位bit表示一个字符。
public List<String> findRepeatedDnaSequences(String s) {
List<String> res = new LinkedList<String>();
HashMap<Integer, Integer> map = new HashMap<Integer, Integer>();
for(int index = 10; index <= s.length(); index++){
String substr = s.substring(index - 10, index);
int code = encode(substr);
if(map.containsKey(code)){
if(map.get(code) == 1){
res.add(substr);
}
map.put(code, 2);
} else {
map.put(code, 1);
}
}
return res;
}
private int encode(String str){
int code = 0;
for(int i = 0; i < str.length(); i++){
char c = str.charAt(i);
// 每两位表示一个字符
code <<= 2;
switch(c){
case 'A': code += 0; break;
case 'C': code += 1; break;
case 'T': code += 2; break;
case 'G': code += 3; break;
}
}
return code;
}
https://leetcode.com/discuss/64841/7-lines-simple-java-o-npublic List<String> findRepeatedDnaSequences(String s) { Set seen = new HashSet(), repeated = new HashSet(); for (int i = 0; i + 9 < s.length(); i++) { String ten = s.substring(i, i + 10); if (!seen.add(ten)) repeated.add(ten); } return new ArrayList(repeated); }
LeetCode Extended:
http://www.1point3acres.com/bbs/thread-147555-1-1.html
是要求按大小顺序输出来着,alphabetical order。我直接用的是int array,没有被要求用bit vector来优化了~
hash函数把结果变成整数,范围是0到4^10 - 1, 然后直接用一个这么大的数组来记录结果。之后遍历数组,把找到的结果reverse hash回去~
一共有三种状态吧:0次出现,1次出现,至少两次出现。然后用两个bit来表示这三种状态咯。bit的读取和改变就用and, or之类的bitwise operator来做吧
Read full article from The Code Sniper: 187. Repeated DNA Sequences Leetcode