geeksforgeeks: Searching for Patterns
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.
Naive Pattern Searching: O(NM)
Slide the pattern over text one by one and check for a match. If a match is found, then slides by 1 again to check for subsequent matches.
Searching for Patterns | Set 2 (KMP Algorithm)
http://www.sanfoundry.com/java-program-knuth-morris-pratt-algorithm/
Rabin-Karp Algorithm
Like the Naive Algorithm, Rabin-Karp algorithm also slides the pattern one by one. But unlike the Naive algorithm, Rabin Karp algorithm matches the hash value of the pattern with the hash value of current substring of text, and if the hash values match then only it starts matching individual characters. So Rabin Karp algorithm needs to calculate hash values for following strings.
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.
Naive Pattern Searching: O(NM)
Slide the pattern over text one by one and check for a match. If a match is found, then slides by 1 again to check for subsequent matches.
void
search(
char
*pat,
char
*txt)
{
int
M =
strlen
(pat);
int
N =
strlen
(txt);
/* A loop to slide pat[] one by one */
for
(
int
i = 0; i <= N - M; i++)
{
int
j;
/* For current index i, check for pattern match */
for
(j = 0; j < M; j++)
{
if
(txt[i+j] != pat[j])
break
;
}
if
(j == M)
// if pat[0...M-1] = txt[i, i+1, ...i+M-1]
{
printf
(
"Pattern found at index %d \n"
, i);
}
}
}
The KMP matching algorithm uses degenerating property (pattern having same sub-patterns appearing more than once in the pattern) of the pattern and improves the worst case complexity to O(n). The basic idea behind KMP’s algorithm is: whenever we detect a mismatch (after some matches), we already know some of the characters in the text (since they matched the pattern characters prior to the mismatch). We take advantage of this information to avoid matching the characters that we know will anyway match.
KMP algorithm does some preprocessing over the pattern pat[] and constructs an auxiliary array lps[] of size m (same as size of pattern). Here name lps indicates longest proper prefix which is also suffix.. For each sub-pattern pat[0…i] where i = 0 to m-1, lps[i] stores length of the maximum matching proper prefix which is also a suffix of the sub-pattern pat[0..i].
KMP algorithm does some preprocessing over the pattern pat[] and constructs an auxiliary array lps[] of size m (same as size of pattern). Here name lps indicates longest proper prefix which is also suffix.. For each sub-pattern pat[0…i] where i = 0 to m-1, lps[i] stores length of the maximum matching proper prefix which is also a suffix of the sub-pattern pat[0..i].
lps[i] = the longest proper prefix of pat[0..i] which is also a suffix of pat[0..i].
Examples:
For the pattern “AABAACAABAA”, lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
Java Implementation:For the pattern “AABAACAABAA”, lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
http://www.sanfoundry.com/java-program-knuth-morris-pratt-algorithm/
void
computeLPSArray(
char
*pat,
int
M,
int
*lps)
{
int
len = 0;
// lenght of the previous longest prefix suffix
int
i;
lps[0] = 0;
// lps[0] is always 0
i = 1;
// the loop calculates lps[i] for i = 1 to M-1
while
(i < M)
{
if
(pat[i] == pat[len])
{
len++;
lps[i] = len;
i++;
}
else
// (pat[i] != pat[len])
{
if
(len != 0)
{
// This is tricky. Consider the example AAACAAAA and i = 7.
len = lps[len-1];
// Also, note that we do not increment i here
}
else
// if (len == 0)
{
lps[i] = 0;
i++;
}
}
}
}
void
KMPSearch(
char
*pat,
char
*txt)
{
int
M =
strlen
(pat);
int
N =
strlen
(txt);
// create lps[] that will hold the longest prefix suffix values for pattern
int
*lps = (
int
*)
malloc
(
sizeof
(
int
)*M);
int
j = 0;
// index for pat[]
// Preprocess the pattern (calculate lps[] array)
computeLPSArray(pat, M, lps);
int
i = 0;
// index for txt[]
while
(i < N)
{
if
(pat[j] == txt[i])
{
j++;
i++;
}
if
(j == M)
{
printf
(
"Found pattern at index %d \n"
, i-j);
j = lps[j-1];
}
// mismatch after j matches
else
if
(i < N && pat[j] != txt[i])
{
// Do not match lps[0..lps[j-1]] characters,
// they will match anyway
if
(j != 0)
j = lps[j-1];
else
i = i+1;
}
}
free
(lps);
// to avoid memory leak
}
Like the Naive Algorithm, Rabin-Karp algorithm also slides the pattern one by one. But unlike the Naive algorithm, Rabin Karp algorithm matches the hash value of the pattern with the hash value of current substring of text, and if the hash values match then only it starts matching individual characters. So Rabin Karp algorithm needs to calculate hash values for following strings.
So Rabin Karp algorithm needs to calculate hash values for following strings.
1) Pattern itself.
2) All the substrings of text of length m.
2) All the substrings of text of length m.
Since we need to efficiently calculate hash values for all the substrings of size m of text, we must have a hash function which has following property.
Hash at the next shift must be efficiently computable from the current hash value and next character in text or we can say hash(txt[s+1 .. s+m]) must be efficiently computable from hash(txt[s .. s+m-1])and txt[s+m] i.e., hash(txt[s+1 .. s+m])= rehash(txt[s+m], hash(txt[s .. s+m-1]) and rehash must be O(1) operation.
Hash at the next shift must be efficiently computable from the current hash value and next character in text or we can say hash(txt[s+1 .. s+m]) must be efficiently computable from hash(txt[s .. s+m-1])and txt[s+m] i.e., hash(txt[s+1 .. s+m])= rehash(txt[s+m], hash(txt[s .. s+m-1]) and rehash must be O(1) operation.
Java version: