Massive Algorithms: Searching for Patterns | Set 3 (Rabin-Karp Algorithm)

Searching for Patterns | Set 3 (Rabin-Karp Algorithm) | GeeksforGeeks

Rabin-Karp algorithm also slides the pattern one by one, Rabin Karp algorithm matches the hash value of the pattern with the hash value of current substring of text, and if the hash values match then only it starts matching individual characters.

Hash at the next shift must be efficiently computable from the current hash value and next character in text or we can say hash(txt[s+1 .. s+m]) must be efficiently computable fromhash(txt[s .. s+m-1]) and txt[s+m] i.e., hash(txt[s+1 .. s+m])= rehash(txt[s+m], hash(txt[s .. s+m-1])and rehash must be O(1) operation.

The numeric values cannot be practically stored as an integer. Therefore, the numeric value is calculated using modular arithmetic to make sure that the hash values can be stored in an integer variable (can fit in memory words). To do rehashing, we need to take off the most significant digit and add the new least significant digit for in hash value. Rehashing is done using the following formula.

hash( txt[s+1 .. s+m] ) = d ( hash( txt[s .. s+m-1]) – txt[s]*h ) + txt[s + m] ) mod q

d: Number of characters in the alphabet // usually 256
q: A prime number
h: d^(m-1)

The average and best case running time of the Rabin-Karp algorithm is O(n+m), but its worst-case time is O(nm). Worst case of Rabin-Karp algorithm occurs when all characters of pattern and text are same as the hash values of all the substrings of txt[] match with hash value of pat[]. For example pat[] = “AAA” and txt[] = “AAAAAAA”.

http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
Rabin-Karp is an algorithm of choice for multiple pattern search
Java code: http://www.sanfoundry.com/java-program-rabin-karp-algorithm/
http://algs4.cs.princeton.edu/53substring/RabinKarp.java.html

#define d 256 

/*  pat  -> pattern

    txt  -> text

    q    -> A prime number

*/

void search(char *pat, char *txt, int q)

{

    int M = strlen(pat);

    int N = strlen(txt);

    int i, j;

    int p = 0;  // hash value for pattern

    int t = 0; // hash value for txt

    int h = 1;

    // The value of h would be "pow(d, M-1)%q"

    for (i = 0; i < M-1; i++)

        h = (h*d)%q;

    // Calculate the hash value of pattern and first window of text

    for (i = 0; i < M; i++)

{

        p = (d*p + pat[i])%q;

        t = (d*t + txt[i])%q;

}

    // Slide the pattern over text one by one 

    for (i = 0; i <= N - M; i++)

{

        // Chaeck the hash values of current window of text and pattern

        // If the hash values match then only check for characters on by one

        if ( p == t )

{

            /* Check for characters one by one */

            for (j = 0; j < M; j++)

{

                if (txt[i+j] != pat[j])

                    break;

}

            if (j == M)  // if p == t and pat[0...M-1] = txt[i, i+1, ...i+M-1]

{

                printf("Pattern found at index %d \n", i);

}

}

        // Calulate hash value for next window of text: Remove leading digit, 

        // add trailing digit           

        if ( i < N-M )

{

            t = (d*(t - txt[i]*h) + txt[i+M])%q;

            // We might get negative value of t, converting it to positive

            if(t < 0) 

              t = (t + q); 

}

}

}

http://www.sparknotes.com/cs/searching/hashtables/section4.rhtml
Rabin-Karp rolling hash

The Rabin-Karp string search algorithm is normally used with a very simple rolling hash function that only uses multiplications and additions:

H = c_1 a^{k-1} + c_2 a^{k-2} + c_3 a^{k-3} + ... + c_k a^{0}

where

a

is a constant and

c_1, ..., c_k

are the input characters.

In order to avoid manipulating huge

H

values, all math is done modulo

n

. The choice of

a

and

n

is critical to get good hashing; see linear congruential generator for more discussion.

Removing and adding characters simply involves adding or subtracting the first or last term. Shifting all characters by one position to the left requires multiplying the entire sum

H

a

. Shifting all characters by one position to the right requires dividing the entire sum

H

a

. Note that in modulo arithmetic,

a

can be chosen to have a multiplicative inverse

a^{-1}

by which

H

can be multiplied to get the result of the division without actually performing a division.

http://www.quora.com/How-does-the-Rolling-Hash-function-used-in-Rabin-Karp-algorithm-work
Hash(H1) over k characters(c1..ck) could be computed as follows:
H1= c1*a^k-1 + c2*a^k-2+c3*a^k-3+…+ck*a^0
where a is a constant
and c1…ck are input characters.

Lets assume you want to calculate the hash(H2) over same size window k over characters(c2..ck+1) could be computed from similar logic above as follows:
H2 = c2*a^k-1+c3*a^k-2+…+ck+1*a^0
where a is a constant
and c2..ck+1 are input characters.

Now, if we look carefully, H2 = [H1*a] + [ck+1*a^0(i.e. last term of this window)] - [c1*a^k-1(i.e. first term of H1)]

hash( txt[s+1 .. s+m] ) = d ( hash( txt[s .. s+m-1]) – txt[s]*h ) + txt[s + m] ) mod q

hash( txt[s .. s+m-1] ) : Hash value at shift s.
hash( txt[s+1 .. s+m] ) : Hash value at next shift (or shift s+1)
d: Number of characters in the alphabet
q: A prime number
h: d^(m-1)

t = (d*(t - txt[i]*h) + txt[i+M])%q;

// We might get negative value of t, converting it to positive

if(t < 0) 

    t = (t + q); 

http://algs4.cs.princeton.edu/53substring/RabinKarp.java.html
txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;

txtHash = (txtHash*R + txt.charAt(i)) % Q;

http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=stringSearching

A + B = C => (A % M + B % M) % M = C % M
A * B = C => ((A % M) * (B % M)) % M = C % M

We get:

H % M = (((s[0] % M) * (B(m - 1) % M)) % M + ((s[1] % M) * (B(m - 2) % M)) % M +…
…+ ((s[m - 2] % M) * (B1 % M)) % M + ((s[m - 1] % M) * (B0 % M)) % M) % M

H0 = Hs[0]…s[2] = s[0] * B2 + s[1] * B + s[2]

H1 = Hs[1]..s[3] = s[1] * B2 + s[2] * B + s[3]

H1 = (H0 - s[0] * B2 ) * B + s[3]

In general:

Hi = ( Hi - 1 - s[i- 1] * Bm - 1 ) * B + s[i + m - 1]

Applying again the rules of modular arithmetic, we get:

Hi % M = (((( Hi - 1 % M - ((s[i- 1] % M) * (Bm - 1 % M)) % M ) % M) * (B % M)) % M +
+ s[i + m - 1] % M) % M

Obviously the value of (Hi - 1 - s[i - 1] * Bm - 1) may be negative. Again, the rules of modular arithmetic come into play:

A - B = C => (A % M - B % M + k * M) % M = C % M

Since the absolute value of (Hi - 1 - s[i - 1] * Bm - 1) is between 0 and (M - 1), we can safely use a value of 1 for k.

http://blog.giovannibotta.net/2012/08/rabin-karp-algorithm.html
class RabinKarp {
// radix: 65536 for java chars, can actually be much smaller since we are hashing
private final static long d = Character.MAX_VALUE + 1;
// prime number chosen so that d*(q+1) < 2^63-1 (fits in a java long)
private final static long q = 96293;
/**
* Search for the given pattern in the text using the Rabin-Karp algorithm.
* The radix-d base d and the prime number q for hashing are chosen as the
* maximum value a char can have (65536 in java) and such that d*(q+1) fits
* in a java int, i.e., q<32767.
*
* @param text
* Text to search in.
* @param pattern
* Pattern to search.
* @return Index in the text at which the first occurrence of the pattern
* appears.
*/
public List<Integer> findMatches(char[] text, char[] pattern)
{
return findMatches(text, pattern, d, q);
}
/**
* Search for the given pattern in the text using the Rabin-Karp algorithm.
* The radix-d base and the prime number for hashing must be provided as
* arguments.
*
* @param text
* Text to search in.
* @param pattern
* Pattern to search.
* @param d
* Base for the hashing.
* @param q
* Prime number to use in the hashing functions.
* @return Index in the text at which the first occurrence of the pattern
* appears.
*/
public List<Integer> findMatches(char[] text, char[] pattern, long d, long q) {
// length of pattern
int m = pattern.length;
// do I have enough text?
if (m > text.length)
return new LinkedList<>();
LinkedList<Integer> ret = new LinkedList<>();
long p = 0;
long t = 0;
for (int i = 0; i < m; i++) {
// I can get away with that because d*(q+1) < 2^63-1
// otherwise I risk overflowing
p = (p * d + pattern[i]) % q;
t = (t * d + text[i]) % q;
}
// the most significant digit
long h = powMod(d, m - 1, q); // d^(m-1) mod q
int nMinusM = text.length - m;
for (int i = 0; i <= nMinusM; i++) {
if (p == t)
if (compare(text, i, pattern))
ret.add(i);
if (i < nMinusM) {
t -= h * text[i];
// necessary because the previous subtraction most likely leads
// to a negative value and java doesn't know of modulo of a
// negative number.
while (t < 0)
t += q;
t = (d * t + text[i + m]) % q;
}
}
return ret;
}
private boolean compare(char[] text, int start, char[] pattern) {
if (text.length - start < pattern.length)
return false;
for (int i = 0; i < pattern.length; i++)
if (text[i + start] != pattern[i])
return false;
return true;
}
/**
* Computes the power d to the n modulo q.
*
* @param d
* @param n
* @param q
* @return d^n mod q.
*/
private long powMod(long d, long n, long q) {
if (n == 0)
return 1;
if (n == 1)
return d % q;
long j = powMod(d, n / 2, q);
j = (j * j) % q;
if (n % 2 == 0)
return j;
return ((j * d) % q);
}
}
http://www.abrut.us/blog/2014/04/28/rabin-karp-example-in-java/
The code is not well written, but the idea here is that we can use simple hash function hash(a) =a, especially during interview. -> extract hash computation as method.
class RabinKarp {
// Hashes that simply take the int value of a character
public static int hash(char a) {
return (int) a;
}
public static int hash(String a) {
int hash = 0;
for(char ch : a.toCharArray()) {
hash += (int)ch;
}
return hash;
}

static public int strpos(String big, String small) {
if(small.length() > big.length()) return -1;
int desiredHash = hash(small);
int actualHash = -1;
char [] bigArr = big.toCharArray();
for(int i=0; i<big.length(); i++) {
if(i + small.length() > big.length()) break;
// Calculate hash once O(m)
if(actualHash == -1) {
actualHash = hash(big.substring(i, i+small.length()));
}
// Rolling hash, constant time on consecutive hashes
else {
actualHash -= hash(bigArr[i-1]);
actualHash += hash(bigArr[i+small.length()-1]);
}
if(actualHash == desiredHash) {
// Could have collided
if(big.substring(i, i+small.length()).equals(small))
return i;
}
}
return -1;
}
}
Usage of Rolling Hash

http://www.infoarena.ro/blog/rolling-hash

http://blog.teamleadnet.com/2012/10/rabin-karp-rolling-hash-dynamic-sized.html

http://www.codeproject.com/Articles/801608/Using-a-rolling-hash-to-break-up-binary-files

查找與给定字符串是anagram的子串，例如：

GetAnagram("abcdbcsdaqdbahs'', "scdcb'') ==>"cdbcs''。

因为這裏不是查找相等的字符串，而是對應的anagram，所以比較的次數要比上面的第一個示例多，但也可以采用一定的機制來保證盡量少的哈希碰撞，從而減少比對的次數，大大降低复雜度。

Rabin-Karp can be used to match against multiple pattern. This makes it perfect to detect plagiarism even for larger phrases.

Rolling hash, Rabin Karp, palindromes, rsync and others

Given a string of length n, count the number of palindromic substring of S. Solve better than O(n²).

Given a string of length n, find the longest substring that is repeated at least k times. Solve in O(n log n)

Given a bitmap, find out the largest square that's repeated in the bitmap.

Given a string S figure out if the string is periodic. (there is a p such that for any i we have that s[i] == s[i + p]) For example abcabcab is periodic with p = 3. O(n log n)

Given two equal length strings, figure out if one is a rotation of the other. O(n)

Given two polygons find out if they are similar polygons. O(n)

Given a string, find it's longest periodic prefix. O(n log n) For aaabaaabcde the answer is aaabaaab

Given a tree T with character labeled nodes and a given string P count how many downward paths match P. O(n)

3 Reasons Why Rabin-Karp is Cool
1. Good for plagiarism, because it can deal with multiple pattern matching!

2. Not faster than brute force matching in theory, but in practice its complexity is O(n+m)!
3. With a good hashing function it can be quite effective and it’s easy to implement!

2 Reasons Why Rabin-Karp is Not Cool
1. There are lots of string matching algorithms that are faster than O(n+m)

2. It’s practically as slow as brute force matching and it requires additional space

http://codingrecipies.blogspot.com/2014/07/pattern-matching-rabin-karp.html
Read full article from Searching for Patterns | Set 3 (Rabin-Karp Algorithm) | GeeksforGeeks

Searching for Patterns | Set 3 (Rabin-Karp Algorithm) | GeeksforGeeks

3 Reasons Why Rabin-Karp is Cool
1. Good for plagiarism, because it can deal with multiple pattern matching!

2 Reasons Why Rabin-Karp is Not Cool
1. There are lots of string matching algorithms that are faster than O(n+m)

Labels

Popular Posts

Searching for Patterns | Set 3 (Rabin-Karp Algorithm) | GeeksforGeeks

3 Reasons Why Rabin-Karp is Cool1. Good for plagiarism, because it can deal with multiple pattern matching!

2 Reasons Why Rabin-Karp is Not Cool1. There are lots of string matching algorithms that are faster than O(n+m)

Labels

Popular Posts

3 Reasons Why Rabin-Karp is Cool
1. Good for plagiarism, because it can deal with multiple pattern matching!

2 Reasons Why Rabin-Karp is Not Cool
1. There are lots of string matching algorithms that are faster than O(n+m)