Searching for Patterns | Set 3 (Rabin-Karp Algorithm) | GeeksforGeeks


Rabin-Karp algorithm also slides the pattern one by one, Rabin Karp algorithm matches the hash value of the pattern with the hash value of current substring of text, and if the hash values match then only it starts matching individual characters.

Hash at the next shift must be efficiently computable from the current hash value and next character in text or we can say hash(txt[s+1 .. s+m]) must be efficiently computable fromhash(txt[s .. s+m-1]) and txt[s+m] i.e., hash(txt[s+1 .. s+m])= rehash(txt[s+m], hash(txt[s .. s+m-1])and rehash must be O(1) operation.

The numeric values cannot be practically stored as an integer. Therefore, the numeric value is calculated using modular arithmetic to make sure that the hash values can be stored in an integer variable (can fit in memory words). To do rehashing, we need to take off the most significant digit and add the new least significant digit for in hash value. Rehashing is done using the following formula.
hash( txt[s+1 .. s+m] ) = d ( hash( txt[s .. s+m-1]) – txt[s]*h ) + txt[s + m] ) mod q
d: Number of characters in the alphabet // usually 256
q: A prime number
h: d^(m-1)
The average and best case running time of the Rabin-Karp algorithm is O(n+m), but its worst-case time is O(nm). Worst case of Rabin-Karp algorithm occurs when all characters of pattern and text are same as the hash values of all the substrings of txt[] match with hash value of pat[]. For example pat[] = “AAA” and txt[] = “AAAAAAA”.

http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
Rabin-Karp is an algorithm of choice for multiple pattern search
Java code: http://www.sanfoundry.com/java-program-rabin-karp-algorithm/
http://algs4.cs.princeton.edu/53substring/RabinKarp.java.html

#define d 256
/*  pat  -> pattern
    txt  -> text
    q    -> A prime number
*/
void search(char *pat, char *txt, int q)
{
    int M = strlen(pat);
    int N = strlen(txt);
    int i, j;
    int p = 0;  // hash value for pattern
    int t = 0; // hash value for txt
    int h = 1;
  
    // The value of h would be "pow(d, M-1)%q"
    for (i = 0; i < M-1; i++)
        h = (h*d)%q;
  
    // Calculate the hash value of pattern and first window of text
    for (i = 0; i < M; i++)
    {
        p = (d*p + pat[i])%q;
        t = (d*t + txt[i])%q;
    }
  
    // Slide the pattern over text one by one
    for (i = 0; i <= N - M; i++)
    {
        
        // Chaeck the hash values of current window of text and pattern
        // If the hash values match then only check for characters on by one
        if ( p == t )
        {
            /* Check for characters one by one */
            for (j = 0; j < M; j++)
            {
                if (txt[i+j] != pat[j])
                    break;
            }
            if (j == M)  // if p == t and pat[0...M-1] = txt[i, i+1, ...i+M-1]
            {
                printf("Pattern found at index %d \n", i);
            }
        }
         
        // Calulate hash value for next window of text: Remove leading digit,
        // add trailing digit          
        if ( i < N-M )
        {
            t = (d*(t - txt[i]*h) + txt[i+M])%q;
             
            // We might get negative value of t, converting it to positive
            if(t < 0)
              t = (t + q);
        }
    }
}
http://www.sparknotes.com/cs/searching/hashtables/section4.rhtml
Rabin-Karp rolling hash
The Rabin-Karp string search algorithm is normally used with a very simple rolling hash function that only uses multiplications and additions:
H = c_1 a^{k-1} + c_2 a^{k-2} + c_3 a^{k-3} + ... + c_k a^{0} where a is a constant and c_1, ..., c_k are the input characters.
In order to avoid manipulating huge H values, all math is done modulo n. The choice of a and n is critical to get good hashing; see linear congruential generator for more discussion.
Removing and adding characters simply involves adding or subtracting the first or last term. Shifting all characters by one position to the left requires multiplying the entire sum H by a. Shifting all characters by one position to the right requires dividing the entire sum H by a. Note that in modulo arithmetic, a can be chosen to have a multiplicative inverse a^{-1} by which H can be multiplied to get the result of the division without actually performing a division.
http://www.quora.com/How-does-the-Rolling-Hash-function-used-in-Rabin-Karp-algorithm-work
Hash(H1) over k characters(c1..ck) could be computed as follows:
H1= c1*a^k-1 + c2*a^k-2+c3*a^k-3+…+ck*a^0
where a is a constant
and c1…ck are input characters.

Lets assume you want to calculate the hash(H2) over same size window k over characters(c2..ck+1) could be computed from similar logic above as follows:
H2 = c2*a^k-1+c3*a^k-2+…+ck+1*a^0
where a is a constant
and c2..ck+1 are input characters.

Now, if we look carefully, H2 = [H1*a] + [ck+1*a^0(i.e. last term of this window)] - [c1*a^k-1(i.e. first term of H1)]


hash( txt[s+1 .. s+m] ) = d ( hash( txt[s .. s+m-1]) – txt[s]*h ) + txt[s + m] ) mod q
hash( txt[s .. s+m-1] ) : Hash value at shift s.
hash( txt[s+1 .. s+m] ) : Hash value at next shift (or shift s+1)
d: Number of characters in the alphabet
q: A prime number
h: d^(m-1)

t = (d*(t - txt[i]*h) + txt[i+M])%q;
// We might get negative value of t, converting it to positive
if(t < 0)
    t = (t + q);
http://algs4.cs.princeton.edu/53substring/RabinKarp.java.html
txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
txtHash = (txtHash*R + txt.charAt(i)) % Q; 
http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=stringSearching
A + B = C => (A % M + B % M) % M = C % M
A * B = C => ((A % M) * (B % M)) % M = C % M
We get:
H % M = (((s[0] % M) * (B(m - 1) % M)) % M + ((s[1] % M) * (B(m - 2) % M)) % M +…
…+ ((s[m - 2] % M) * (B1 % M)) % M + ((s[m - 1] % M) * (B0 % M)) % M) % M

H0 = Hs[0]…s[2] = s[0] * B2 + s[1] * B + s[2]
H1 = Hs[1]..s[3] = s[1] * B2 + s[2] * B + s[3]
H1 = (H0 - s[0] * B2 ) * B + s[3]
In general:
Hi = ( Hi - 1 - s[i- 1] * Bm - 1 ) * B + s[i + m - 1]
Applying again the rules of modular arithmetic, we get:
Hi % M = (((( Hi - 1 % M - ((s[i- 1] % M) * (Bm - 1 % M)) % M ) % M) * (B % M)) % M +
+ s[i + m - 1] % M) % M
Obviously the value of (Hi - 1 - s[i - 1] * Bm - 1) may be negative. Again, the rules of modular arithmetic come into play:
A - B = C => (A % M - B % M + k * M) % M = C % M
Since the absolute value of (Hi - 1 - s[i - 1] * Bm - 1) is between 0 and (M - 1), we can safely use a value of 1 for k.
http://blog.giovannibotta.net/2012/08/rabin-karp-algorithm.html
class RabinKarp {
      // radix: 65536 for java chars, can actually be much smaller since we are hashing
      private final static long d = Character.MAX_VALUE + 1;
      // prime number chosen so that d*(q+1) < 2^63-1 (fits in a java long)
      private final static long q = 96293;
      /**
       * Search for the given pattern in the text using the Rabin-Karp algorithm.
       * The radix-d base d and the prime number q for hashing are chosen as the
       * maximum value a char can have (65536 in java) and such that d*(q+1) fits
       * in a java int, i.e., q<32767.
       *
       * @param text
       *      Text to search in.
       * @param pattern
       *      Pattern to search.
       * @return Index in the text at which the first occurrence of the pattern
       *     appears.
       */
      public List<Integer> findMatches(char[] text, char[] pattern)
      {
           return findMatches(text, pattern, d, q);
      }
      /**
       * Search for the given pattern in the text using the Rabin-Karp algorithm.
       * The radix-d base and the prime number for hashing must be provided as
       * arguments.
       *
       * @param text
       *      Text to search in.
       * @param pattern
       *      Pattern to search.
       * @param d
       *      Base for the hashing.
       * @param q
       *      Prime number to use in the hashing functions.
       * @return Index in the text at which the first occurrence of the pattern
       *     appears.
       */
      public List<Integer> findMatches(char[] text, char[] pattern, long d, long q) {
           // length of pattern
           int m = pattern.length;
           // do I have enough text?
           if (m > text.length)
                return new LinkedList<>();
           LinkedList<Integer> ret = new LinkedList<>();
           long p = 0;
           long t = 0;
           for (int i = 0; i < m; i++) {
                // I can get away with that because d*(q+1) < 2^63-1
                // otherwise I risk overflowing
                p = (p * d + pattern[i]) % q;
                t = (t * d + text[i]) % q;
           }
           // the most significant digit
           long h = powMod(d, m - 1, q); // d^(m-1) mod q
           int nMinusM = text.length - m;
           for (int i = 0; i <= nMinusM; i++) {
                if (p == t)
                     if (compare(text, i, pattern))
                          ret.add(i);
                if (i < nMinusM) {
                     t -= h * text[i];
                     // necessary because the previous subtraction most likely leads
                     // to a negative value and java doesn't know of modulo of a
                     // negative number.
                     while (t < 0)
                          t += q;
                     t = (d * t + text[i + m]) % q;
                }
           }
           return ret;
      }
      private boolean compare(char[] text, int start, char[] pattern) {
           if (text.length - start < pattern.length)
                return false;
           for (int i = 0; i < pattern.length; i++)
                if (text[i + start] != pattern[i])
                     return false;
           return true;
      }
      /**
       * Computes the power d to the n modulo q.
       *
       * @param d
       * @param n
       * @param q
       * @return d^n mod q.
       */
      private long powMod(long d, long n, long q) {
           if (n == 0)
                return 1;
           if (n == 1)
                return d % q;
           long j = powMod(d, n / 2, q);
           j = (j * j) % q;
           if (n % 2 == 0)
                return j;
           return ((j * d) % q);
      }
 }
http://www.abrut.us/blog/2014/04/28/rabin-karp-example-in-java/
The code is not well written, but the idea here is that we can use simple hash function hash(a) =a, especially during interview. -> extract hash computation as method.
class RabinKarp {
    // Hashes that simply take the int value of a character
    public static int hash(char a) {
        return (int) a;
    }
    public static int hash(String a) {
        int hash = 0;
        for(char ch : a.toCharArray()) {
            hash += (int)ch;
        }
        return hash;
    }

    static public int strpos(String big, String small) {
        if(small.length() > big.length()) return -1;
        int desiredHash = hash(small);
        int actualHash = -1;
        char [] bigArr = big.toCharArray();
        for(int i=0; i<big.length(); i++) {
            if(i + small.length()  > big.length()) break;
            // Calculate hash once O(m)
            if(actualHash == -1) {
                actualHash = hash(big.substring(i, i+small.length()));
            }
            // Rolling hash, constant time on consecutive hashes
            else {
                actualHash -= hash(bigArr[i-1]);
                actualHash += hash(bigArr[i+small.length()-1]);
            }
            if(actualHash == desiredHash) {
                // Could have collided
                if(big.substring(i, i+small.length()).equals(small))
                    return i;
            }
        }
        return -1;
    }
}
Usage of Rolling Hash
查找與给定字符串是anagram的子串,例如:

GetAnagram("abcdbcsdaqdbahs'', "scdcb'') ==>"cdbcs''。

因为這裏不是查找相等的字符串,而是對應的anagram,所以比較的次數要比上面的第一個示例多,但也可以采用一定的機制來保證盡量少的哈希碰撞,從而減少比對的次數,大大降低复雜度。


Rabin-Karp can be used to match against multiple pattern. This makes it perfect to detect plagiarism even for larger phrases.
Rolling hash, Rabin Karp, palindromes, rsync and others

  • Given a string of length n, count the number of palindromic substring of S. Solve better than O(n2).
  • Given a string of length n, find the longest substring that is repeated at least k times. Solve in O(n log n)
  • Given a bitmap, find out the largest square that's repeated in the bitmap.
  • Given a string S figure out if the string is periodic. (there is a p such that for any i we have that s[i] == s[i + p]) For example abcabcab is periodic with p = 3. O(n log n)
  • Given two equal length strings, figure out if one is a rotation of the other. O(n)
  • Given two polygons find out if they are similar polygons. O(n)
  • Given a string, find it's longest periodic prefix. O(n log n) For aaabaaabcde the answer is aaabaaab
  • Given a tree T with character labeled nodes and a given string P count how many downward paths match P. O(n)

  • 3 Reasons Why Rabin-Karp is Cool
    1. Good for plagiarism, because it can deal with multiple pattern matching!

    2. Not faster than brute force matching in theory, but in practice its complexity is O(n+m)!
    3. With a good hashing function it can be quite effective and it’s easy to implement!

    2 Reasons Why Rabin-Karp is Not Cool
    1. There are lots of string matching algorithms that are faster than O(n+m)

    2. It’s practically as slow as brute force matching and it requires additional space
    http://codingrecipies.blogspot.com/2014/07/pattern-matching-rabin-karp.html
    Read full article from Searching for Patterns | Set 3 (Rabin-Karp Algorithm) | GeeksforGeeks

    Labels

    LeetCode (1432) GeeksforGeeks (1122) LeetCode - Review (1067) Review (882) Algorithm (668) to-do (609) Classic Algorithm (270) Google Interview (237) Classic Interview (222) Dynamic Programming (220) DP (186) Bit Algorithms (145) POJ (141) Math (137) Tree (132) LeetCode - Phone (129) EPI (122) Cracking Coding Interview (119) DFS (115) Difficult Algorithm (115) Lintcode (115) Different Solutions (110) Smart Algorithm (104) Binary Search (96) BFS (91) HackerRank (90) Binary Tree (86) Hard (79) Two Pointers (78) Stack (76) Company-Facebook (75) BST (72) Graph Algorithm (72) Time Complexity (69) Greedy Algorithm (68) Interval (63) Company - Google (62) Geometry Algorithm (61) Interview Corner (61) LeetCode - Extended (61) Union-Find (60) Trie (58) Advanced Data Structure (56) List (56) Priority Queue (53) Codility (52) ComProGuide (50) LeetCode Hard (50) Matrix (50) Bisection (48) Segment Tree (48) Sliding Window (48) USACO (46) Space Optimization (45) Company-Airbnb (41) Greedy (41) Mathematical Algorithm (41) Tree - Post-Order (41) ACM-ICPC (40) Algorithm Interview (40) Data Structure Design (40) Graph (40) Backtracking (39) Data Structure (39) Jobdu (39) Random (39) Codeforces (38) Knapsack (38) LeetCode - DP (38) Recursive Algorithm (38) String Algorithm (38) TopCoder (38) Sort (37) Introduction to Algorithms (36) Pre-Sort (36) Beauty of Programming (35) Must Known (34) Binary Search Tree (33) Follow Up (33) prismoskills (33) Palindrome (32) Permutation (31) Array (30) Google Code Jam (30) HDU (30) Array O(N) (29) Logic Thinking (29) Monotonic Stack (29) Puzzles (29) Code - Detail (27) Company-Zenefits (27) Microsoft 100 - July (27) Queue (27) Binary Indexed Trees (26) TreeMap (26) to-do-must (26) 1point3acres (25) GeeksQuiz (25) Merge Sort (25) Reverse Thinking (25) hihocoder (25) Company - LinkedIn (24) Hash (24) High Frequency (24) Summary (24) Divide and Conquer (23) Proof (23) Game Theory (22) Topological Sort (22) Lintcode - Review (21) Tree - Modification (21) Algorithm Game (20) CareerCup (20) Company - Twitter (20) DFS + Review (20) DP - Relation (20) Brain Teaser (19) DP - Tree (19) Left and Right Array (19) O(N) (19) Sweep Line (19) UVA (19) DP - Bit Masking (18) LeetCode - Thinking (18) KMP (17) LeetCode - TODO (17) Probabilities (17) Simulation (17) String Search (17) Codercareer (16) Company-Uber (16) Iterator (16) Number (16) O(1) Space (16) Shortest Path (16) itint5 (16) DFS+Cache (15) Dijkstra (15) Euclidean GCD (15) Heap (15) LeetCode - Hard (15) Majority (15) Number Theory (15) Rolling Hash (15) Tree Traversal (15) Brute Force (14) Bucket Sort (14) DP - Knapsack (14) DP - Probability (14) Difficult (14) Fast Power Algorithm (14) Pattern (14) Prefix Sum (14) TreeSet (14) Algorithm Videos (13) Amazon Interview (13) Basic Algorithm (13) Codechef (13) Combination (13) Computational Geometry (13) DP - Digit (13) LCA (13) LeetCode - DFS (13) Linked List (13) Long Increasing Sequence(LIS) (13) Math-Divisible (13) Reservoir Sampling (13) mitbbs (13) Algorithm - How To (12) Company - Microsoft (12) DP - Interval (12) DP - Multiple Relation (12) DP - Relation Optimization (12) LeetCode - Classic (12) Level Order Traversal (12) Prime (12) Pruning (12) Reconstruct Tree (12) Thinking (12) X Sum (12) AOJ (11) Bit Mask (11) Company-Snapchat (11) DP - Space Optimization (11) Dequeue (11) Graph DFS (11) MinMax (11) Miscs (11) Princeton (11) Quick Sort (11) Stack - Tree (11) 尺取法 (11) 挑战程序设计竞赛 (11) Coin Change (10) DFS+Backtracking (10) Facebook Hacker Cup (10) Fast Slow Pointers (10) HackerRank Easy (10) Interval Tree (10) Limited Range (10) Matrix - Traverse (10) Monotone Queue (10) SPOJ (10) Starting Point (10) States (10) Stock (10) Theory (10) Tutorialhorizon (10) Kadane - Extended (9) Mathblog (9) Max-Min Flow (9) Maze (9) Median (9) O(32N) (9) Quick Select (9) Stack Overflow (9) System Design (9) Tree - Conversion (9) Use XOR (9) Book Notes (8) Company-Amazon (8) DFS+BFS (8) DP - States (8) Expression (8) Longest Common Subsequence(LCS) (8) One Pass (8) Quadtrees (8) Traversal Once (8) Trie - Suffix (8) 穷竭搜索 (8) Algorithm Problem List (7) All Sub (7) Catalan Number (7) Cycle (7) DP - Cases (7) Facebook Interview (7) Fibonacci Numbers (7) Flood fill (7) Game Nim (7) Graph BFS (7) HackerRank Difficult (7) Hackerearth (7) Inversion (7) Kadane’s Algorithm (7) Manacher (7) Morris Traversal (7) Multiple Data Structures (7) Normalized Key (7) O(XN) (7) Radix Sort (7) Recursion (7) Sampling (7) Suffix Array (7) Tech-Queries (7) Tree - Serialization (7) Tree DP (7) Trie - Bit (7) 蓝桥杯 (7) Algorithm - Brain Teaser (6) BFS - Priority Queue (6) BFS - Unusual (6) Classic Data Structure Impl (6) DP - 2D (6) DP - Monotone Queue (6) DP - Unusual (6) DP-Space Optimization (6) Dutch Flag (6) How To (6) Interviewstreet (6) Knapsack - MultiplePack (6) Local MinMax (6) MST (6) Minimum Spanning Tree (6) Number - Reach (6) Parentheses (6) Pre-Sum (6) Probability (6) Programming Pearls (6) Rabin-Karp (6) Reverse (6) Scan from right (6) Schedule (6) Stream (6) Subset Sum (6) TSP (6) Xpost (6) n00tc0d3r (6) reddit (6) AI (5) Abbreviation (5) Anagram (5) Art Of Programming-July (5) Assumption (5) Bellman Ford (5) Big Data (5) Code - Solid (5) Code Kata (5) Codility-lessons (5) Coding (5) Company - WMware (5) Convex Hull (5) Crazyforcode (5) DFS - Multiple (5) DFS+DP (5) DP - Multi-Dimension (5) DP-Multiple Relation (5) Eulerian Cycle (5) Graph - Unusual (5) Graph Cycle (5) Hash Strategy (5) Immutability (5) Java (5) LogN (5) Manhattan Distance (5) Matrix Chain Multiplication (5) N Queens (5) Pre-Sort: Index (5) Quick Partition (5) Quora (5) Randomized Algorithms (5) Resources (5) Robot (5) SPFA(Shortest Path Faster Algorithm) (5) Shuffle (5) Sieve of Eratosthenes (5) Strongly Connected Components (5) Subarray Sum (5) Sudoku (5) Suffix Tree (5) Swap (5) Threaded (5) Tree - Creation (5) Warshall Floyd (5) Word Search (5) jiuzhang (5)

    Popular Posts