Searching for Patterns | Set 2 (KMP Algorithm) - GeeksforGeeks
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.
http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm
https://weblogs.java.net/blog/potty/archive/2012/05/10/string-searching-algorithms-part-ii
https://gist.github.com/shoenig/1430733/250b4184dc4a2dd31aa136e2fbdded5f90489a64
The KMP matching algorithm exploits the degenerating property of the pattern (the same sub-patterns appearing more than once in the pattern) and improves the worst-case complexity of the search to O(n), plus O(m) for preprocessing the pattern. The basic idea behind KMP's algorithm is: whenever we detect a mismatch (after some matches), we already know some of the characters in the text of the next window. We take advantage of this information to avoid re-matching characters that we know will match anyway.
Matching Overview
txt = "AAAAABAAABA"
pat = "AAAA"

We compare the first window of txt with pat:
txt = "AAAAABAAABA"
pat = "AAAA" [Initial position]
We find a match. This is the same as Naive String Matching.

In the next step, we compare the next window of txt with pat:
txt = "AAAAABAAABA"
pat =  "AAAA" [Pattern shifted one position]
This is where KMP optimizes over Naive. In this second window, we only compare the fourth A of the pattern with the fourth character of the current window of text to decide whether the current window matches or not. Since we know the first three characters will match anyway, we skip matching them.

Need of Preprocessing?
An important question arises from the above explanation: how do we know how many characters can be skipped? To know this, we pre-process the pattern and prepare an integer array lps[] that tells us the count of characters to be skipped.
- For each sub-pattern pat[0..i] where i = 0 to m-1, lps[i] stores the length of the longest proper prefix of pat[0..i] that is also a suffix of pat[0..i]. A proper prefix is any prefix other than the whole string.
lps[i] = length of the longest proper prefix of pat[0..i] which is also a suffix of pat[0..i].
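The definition can be checked directly with a deliberately naive sketch that tries every candidate length. The class and method names below are illustrative assumptions, not part of the original article, and the cubic running time is only acceptable for small examples.

```java
import java.util.Arrays;

class LpsByDefinition {
    // lps[i] = length of the longest proper prefix of pat[0..i]
    // that is also a suffix of pat[0..i], computed straight from the definition.
    static int[] lpsBruteForce(String pat) {
        int m = pat.length();
        int[] lps = new int[m];
        for (int i = 0; i < m; i++) {
            String sub = pat.substring(0, i + 1);
            // Try candidate lengths from the longest proper one (i) down to 1.
            for (int len = i; len > 0; len--) {
                String suffix = sub.substring(sub.length() - len);
                if (sub.startsWith(suffix)) {
                    lps[i] = len;
                    break;
                }
            }
        }
        return lps;
    }

    public static void main(String[] args) {
        // prints [0, 1, 2, 0, 1, 2, 3, 3]
        System.out.println(Arrays.toString(lpsBruteForce("AAACAAAA")));
    }
}
```

The result for "AAACAAAA" agrees with the values derived step by step in the preprocessing walk-through of the same pattern.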
Unlike the Naive algorithm, where we slide the pattern by one and compare all characters at each shift, we use a value from lps[] to decide the next characters to be matched. The idea is to not match a character that we know will match anyway.
How to use lps[] to decide the next positions (or to know the number of characters to be skipped)?
- We start comparing pat[j], with j = 0, against the characters of the current window of text.
- We keep matching characters txt[i] and pat[j] and keep incrementing i and j while pat[j] and txt[i] keep matching.
- When we see a mismatch:
  - We know that characters pat[0..j-1] match txt[i-j…i-1] (note that j starts at 0 and is incremented only when there is a match).
  - We also know (from the above definition) that lps[j-1] is the count of characters of pat[0…j-1] that are both a proper prefix and a suffix.
  - From the above two points, we can conclude that we do not need to match these lps[j-1] characters with txt[i-j…i-1] because we know that these characters will match anyway.
  - So we set j = lps[j-1] and continue comparing txt[i] with pat[j]. If j is already 0, we simply move on to the next text character.
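To see what skipping already-known characters buys, the following sketch counts character comparisons for the naive approach and for an lps[]-driven search. The class name, the helper names, and the chosen test strings are assumptions made for illustration; the search assumes a non-empty pattern.

```java
class ComparisonCount {
    // Naive search: slide by one and re-compare from the start of the pattern.
    static long naiveComparisons(String txt, String pat) {
        long count = 0;
        for (int s = 0; s + pat.length() <= txt.length(); s++) {
            for (int j = 0; j < pat.length(); j++) {
                count++;
                if (txt.charAt(s + j) != pat.charAt(j)) break;
            }
        }
        return count;
    }

    static int[] computeLps(String pat) {
        int[] lps = new int[pat.length()];
        int len = 0;
        int i = 1;
        while (i < pat.length()) {
            if (pat.charAt(i) == pat.charAt(len)) {
                lps[i++] = ++len;
            } else if (len != 0) {
                len = lps[len - 1];
            } else {
                lps[i++] = 0;
            }
        }
        return lps;
    }

    // lps-driven search: on a mismatch, fall back to j = lps[j-1] instead of restarting.
    static long kmpComparisons(String txt, String pat) {
        int[] lps = computeLps(pat);
        long count = 0;
        int i = 0, j = 0;
        while (i < txt.length()) {
            count++;
            if (txt.charAt(i) == pat.charAt(j)) {
                i++;
                j++;
                if (j == pat.length()) j = lps[j - 1];
            } else if (j != 0) {
                j = lps[j - 1];
            } else {
                i++;
            }
        }
        return count;
    }

    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < n; k++) sb.append(c);
        return sb.toString();
    }

    public static void main(String[] args) {
        String txt = repeat('A', 1000) + "B";
        String pat = repeat('A', 9) + "B";
        System.out.println("naive: " + naiveComparisons(txt, pat));
        System.out.println("kmp:   " + kmpComparisons(txt, pat));
    }
}
```

Every comparison in the lps[]-driven loop either advances i or strictly decreases j, and j only grows together with i, so the count stays within 2n for a text of length n.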
In the preprocessing part, we calculate values in lps[]. To do that, we keep track of the length of the longest prefix suffix value (we use the len variable for this purpose) for the previous index. We initialize lps[0] and len as 0. If pat[len] and pat[i] match, we increment len by 1 and assign the incremented value to lps[i]. If pat[i] and pat[len] do not match and len is not 0, we update len to lps[len-1] and compare again without advancing i. If they do not match and len is 0, we set lps[i] = 0 and move to the next index.
pat[] = "AAACAAAA"

len = 0, i = 0.
lps[0] is always 0, we move to i = 1.

len = 0, i = 1.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 1, lps[1] = 1, i = 2

len = 1, i = 2.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 2, lps[2] = 2, i = 3

len = 2, i = 3.
Since pat[len] and pat[i] do not match, and len > 0, set len = lps[len-1] = lps[1] = 1.

len = 1, i = 3.
Since pat[len] and pat[i] do not match and len > 0, len = lps[len-1] = lps[0] = 0.

len = 0, i = 3.
Since pat[len] and pat[i] do not match and len = 0, set lps[3] = 0 and i = 4.

len = 0, i = 4.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 1, lps[4] = 1, i = 5

len = 1, i = 5.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 2, lps[5] = 2, i = 6

len = 2, i = 6.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 3, lps[6] = 3, i = 7

len = 3, i = 7.
Since pat[len] and pat[i] do not match and len > 0, set len = lps[len-1] = lps[2] = 2.

len = 2, i = 7.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 3, lps[7] = 3, i = 8

We stop here as we have constructed the whole lps[].
void KMPSearch(String pat, String txt)
{
    int M = pat.length();
    int N = txt.length();

    // create lps[] that will hold the longest
    // prefix suffix values for pattern
    int lps[] = new int[M];
    int j = 0; // index for pat[]

    // Preprocess the pattern (calculate lps[]
    // array)
    computeLPSArray(pat, M, lps);

    int i = 0; // index for txt[]
    while (i < N)
    {
        if (pat.charAt(j) == txt.charAt(i))
        {
            j++;
            i++;
        }
        if (j == M)
        {
            System.out.println("Found pattern " + "at index " + (i - j));
            j = lps[j - 1];
        }
        // mismatch after j matches
        else if (i < N && pat.charAt(j) != txt.charAt(i))
        {
            // Do not match lps[0..lps[j-1]] characters,
            // they will match anyway
            if (j != 0)
                j = lps[j - 1];
            else
                i = i + 1;
        }
    }
}
void computeLPSArray(String pat, int M, int lps[])
{
    // length of the previous longest prefix suffix
    int len = 0;
    int i = 1;
    lps[0] = 0; // lps[0] is always 0

    // the loop calculates lps[i] for i = 1 to M-1
    while (i < M)
    {
        if (pat.charAt(i) == pat.charAt(len))
        {
            len++;
            lps[i] = len;
            i++;
        }
        else // (pat[i] != pat[len])
        {
            // This is tricky. Consider the example.
            // AAACAAAA and i = 7. The idea is similar
            // to search step.
            if (len != 0)
            {
                len = lps[len - 1];

                // Also, note that we do not increment
                // i here
            }
            else // if (len == 0)
            {
                lps[i] = len;
                i++;
            }
        }
    }
}
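For experimenting with the two methods above, a self-contained variant that returns the match positions instead of printing them may be handy. This is a sketch under assumptions: the class name KmpMatches, the list-returning signature, the choice to report nothing for an empty pattern, and the cross-check against String.indexOf are all illustrative additions.

```java
import java.util.ArrayList;
import java.util.List;

class KmpMatches {
    static int[] computeLps(String pat) {
        int[] lps = new int[pat.length()];
        int len = 0;
        int i = 1;
        while (i < pat.length()) {
            if (pat.charAt(i) == pat.charAt(len)) {
                lps[i++] = ++len;
            } else if (len != 0) {
                len = lps[len - 1];
            } else {
                lps[i++] = 0;
            }
        }
        return lps;
    }

    // Returns the start index of every (possibly overlapping) occurrence.
    static List<Integer> kmpSearch(String pat, String txt) {
        List<Integer> hits = new ArrayList<>();
        if (pat.isEmpty()) return hits; // assumption: an empty pattern reports nothing
        int[] lps = computeLps(pat);
        int i = 0, j = 0;
        while (i < txt.length()) {
            if (pat.charAt(j) == txt.charAt(i)) {
                i++;
                j++;
                if (j == pat.length()) {
                    hits.add(i - j);
                    j = lps[j - 1]; // keep the overlap so overlapping matches are found
                }
            } else if (j != 0) {
                j = lps[j - 1];
            } else {
                i++;
            }
        }
        return hits;
    }

    // Reference answer built on the standard library, for cross-checking.
    static List<Integer> indexOfSearch(String pat, String txt) {
        List<Integer> hits = new ArrayList<>();
        if (pat.isEmpty()) return hits;
        for (int at = txt.indexOf(pat); at >= 0; at = txt.indexOf(pat, at + 1)) {
            hits.add(at);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(kmpSearch("AAAA", "AAAAABAAABA"));      // prints [0, 1]
        System.out.println(kmpSearch("AABA", "AABAACAADAABAABA")); // prints [0, 9, 12]
    }
}
```

The second call shows why j is reset to lps[j-1] rather than 0 after a full match: the occurrences at 9 and 12 overlap by one character.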
The Knuth-Morris-Pratt algorithm never re-compares a text symbol that has matched a pattern symbol.
Definition: Let A be an alphabet and x = x_0 ... x_{k-1}, k ∈ ℕ, a string of length k over A.
A prefix u of x or a suffix u of x is called a proper prefix or suffix, respectively, if u ≠ x, i.e. if its length b is less than k.
A border of x is a substring r with
r = x_0 ... x_{b-1} and r = x_{k-b} ... x_{k-1}, where b ∈ {0, ..., k-1}
A border of x is a substring that is both proper prefix and proper suffix of x. We call its length b the width of the border.
Example:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... |
---|---|---|---|---|---|---|---|---|---|---|
a | b | c | a | b | c | a | b | d |   |   |
a | b | c | a | b | d |   |   |   |   |   |
  |   |   | a | b | c | a | b | d |   |   |
The symbols at positions 0, ..., 4 have matched. Comparison c-d at position 5 yields a mismatch. The pattern can be shifted by 3 positions, and comparisons are resumed at position 5.
The shift distance is determined by the widest border of the matching prefix of p. In this example, the matching prefix is abcab, its length is j = 5. Its widest border is ab of width b = 2. The shift distance is j – b = 5 – 2 = 3.
In the preprocessing phase, the width of the widest border of each prefix of the pattern is determined. Then in the search phase, the shift distance can be computed according to the prefix that has matched.
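The shift computation from the example can be reproduced with a small sketch. The names ShiftDistance, widestBorder and shift are hypothetical, and the border is found naively here rather than by the preprocessing phase described in the text.

```java
class ShiftDistance {
    // Width of the widest border of s: the longest proper prefix that is also a suffix.
    static int widestBorder(String s) {
        for (int b = s.length() - 1; b > 0; b--) {
            if (s.startsWith(s.substring(s.length() - b))) {
                return b;
            }
        }
        return 0;
    }

    // Shift distance j - b for a matching prefix of length j with widest border b.
    static int shift(String matchedPrefix) {
        return matchedPrefix.length() - widestBorder(matchedPrefix);
    }

    public static void main(String[] args) {
        System.out.println(widestBorder("abcab")); // prints 2
        System.out.println(shift("abcab"));        // prints 3
    }
}
```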
Theorem: Let r, s be borders of a string x, where |r| < |s|. Then r is a border of s.
Proof: Figure 1 shows a string x with borders r and s. Since r is a prefix of x, it is also a proper prefix of s, because it is shorter than s. But r is also a suffix of x and, therefore, proper suffix of s. Thus r is a border of s.
Figure 1: Borders r, s of a string x
If s is the widest border of x, the next-widest border r of x is obtained as the widest border of s etc.
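The chain of borders described by the theorem can be walked explicitly. The sketch below is an illustration with assumed names (BorderChain, widestBorder, borders); it repeatedly takes the widest border of the previous border until only the empty border is left.

```java
import java.util.ArrayList;
import java.util.List;

class BorderChain {
    // Widest border of s as a string; the empty string if there is no wider one.
    static String widestBorder(String s) {
        for (int b = s.length() - 1; b > 0; b--) {
            if (s.startsWith(s.substring(s.length() - b))) {
                return s.substring(0, b);
            }
        }
        return "";
    }

    // All borders of x in decreasing width, obtained as
    // widest border of x, widest border of that border, and so on.
    static List<String> borders(String x) {
        List<String> result = new ArrayList<>();
        String s = x;
        while (!s.isEmpty()) {
            s = widestBorder(s);
            result.add(s);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(borders("ababa")); // prints [aba, a, ]
    }
}
```

For "ababa" the chain is aba, a, and finally the empty border of width 0.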
Definition: Let x be a string and a ∈ A a symbol. A border r of x can be extended by a, if ra is a border of xa.
Figure 2: Extension of a border
Figure 2 shows that a border r of width j of x can be extended by a, if x_j = a.
In the preprocessing phase an array b of length m+1 is computed. Each entry b[i] contains the width of the widest border of the prefix of length i of the pattern (i = 0, ..., m). Since the prefix ε of length i = 0 has no border, we set b[0] = -1.
Figure 3: Prefix of length i of the pattern with border of width b[i]
Provided that the values b[0], ..., b[i] are already known, the value of b[i+1] is computed by checking if a border of the prefix p_0 ... p_{i-1} can be extended by symbol p_i. This is the case if p_{b[i]} = p_i (Figure 3). The borders to be examined are obtained in decreasing order from the values b[i], b[b[i]] etc.
The preprocessing algorithm comprises a loop with a variable j taking these values. A border of width j can be extended by p_i, if p_j = p_i. If not, the next-widest border is examined by setting j = b[j]. The loop terminates at the latest if no border can be extended (j = -1).
After increasing j by the statement j++ in each case j is the width of the widest border of p_0 ... p_i. This value is written to b[i+1] (to b[i] after increasing i by the statement i++).
Preprocessing algorithm
void kmpPreprocess()
{
    int i=0, j=-1;
    b[i]=j;
    while (i<m)
    {
        while (j>=0 && p[i]!=p[j]) j=b[j];
        i++; j++;
        b[i]=j;
    }
}
Example: For pattern p = ababaa the widths of the borders in array b have the following values. For instance we have b[5] = 3, since the prefix ababa of length 5 has a border of width 3.
j: | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
p[j]: | a | b | a | b | a | a | |
b[j]: | -1 | 0 | 0 | 1 | 2 | 3 | 1 |
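The table can be reproduced with a self-contained version of the preprocessing loop. The wrapper class BorderTable and the helper toLps are assumptions added for illustration; the loop itself follows the preprocessing algorithm given above, with the pattern passed in as a parameter instead of living in a field.

```java
import java.util.Arrays;

class BorderTable {
    // b[i] = width of the widest border of the prefix of length i; b[0] = -1.
    static int[] kmpPreprocess(String p) {
        int m = p.length();
        int[] b = new int[m + 1];
        int i = 0, j = -1;
        b[i] = j;
        while (i < m) {
            while (j >= 0 && p.charAt(i) != p.charAt(j)) j = b[j];
            i++;
            j++;
            b[i] = j;
        }
        return b;
    }

    // Dropping b[0] yields the 0-based lps[] convention: lps[i] == b[i + 1].
    static int[] toLps(int[] b) {
        return Arrays.copyOfRange(b, 1, b.length);
    }

    public static void main(String[] args) {
        int[] b = kmpPreprocess("ababaa");
        System.out.println(Arrays.toString(b));        // prints [-1, 0, 0, 1, 2, 3, 1]
        System.out.println(Arrays.toString(toLps(b))); // prints [0, 0, 1, 2, 3, 1]
    }
}
```

The two conventions that appear in these notes differ only by an index shift: the border array has m+1 entries starting with -1, while lps[] has m entries starting with 0.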
Searching algorithm
void kmpSearch()
{
    int i=0, j=0;
    while (i<n)
    {
        while (j>=0 && t[i]!=p[j]) j=b[j];
        i++; j++;
        if (j==m)
        {
            report(i-j);
            j=b[j];
        }
    }
}
When in the inner while loop a mismatch at position j occurs, the widest border of the matching prefix of length j of the pattern is considered (Figure 5). Resuming comparisons at position b[j], the width of the border, yields a shift of the pattern such that the border matches. If again a mismatch occurs, the next-widest border is considered, and so on, until there is no border left (j = -1) or the next symbol matches. Then we have a new matching prefix of the pattern and continue with the outer while loop.
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... |   |   |
---|---|---|---|---|---|---|---|---|---|---|---|---|
a | b | a | b | b | a | b | a | a |   |   |   |   |
a | b | a | b | a | c |   |   |   |   |   |   |   |
  |   | a | b | a | b | a | c |   |   |   |   |   |
  |   |   |   | a | b | a | b | a | c |   |   |   |
  |   |   |   |   | a | b | a | b | a | c |   |   |
  |   |   |   |   |   |   | a | b | a | b | a | c |
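The successive pattern positions in such a trace can be recorded programmatically. The sketch below instruments the search loop to log the alignment i - j each time j changes through the border array; the class name KmpTrace and the method alignments are hypothetical, and the text "ababbabaa" with pattern "ababac" is taken from the rows of the table above.

```java
import java.util.ArrayList;
import java.util.List;

class KmpTrace {
    static int[] kmpPreprocess(String p) {
        int m = p.length();
        int[] b = new int[m + 1];
        int i = 0, j = -1;
        b[i] = j;
        while (i < m) {
            while (j >= 0 && p.charAt(i) != p.charAt(j)) j = b[j];
            i++;
            j++;
            b[i] = j;
        }
        return b;
    }

    // Offsets in t at which the pattern is (conceptually) placed during the search.
    static List<Integer> alignments(String t, String p) {
        int[] b = kmpPreprocess(p);
        List<Integer> offsets = new ArrayList<>();
        int i = 0, j = 0;
        offsets.add(0); // initial position
        while (i < t.length()) {
            while (j >= 0 && t.charAt(i) != p.charAt(j)) {
                j = b[j];           // fall back to the next-widest border
                offsets.add(i - j); // the pattern now starts at i - j
            }
            i++;
            j++;
            if (j == p.length()) {  // full match: shift by the widest border
                j = b[j];
                offsets.add(i - j);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(alignments("ababbabaa", "ababac")); // prints [0, 2, 4, 5, 7, 8]
    }
}
```

The offset 5 corresponds to j = -1, where no border is left and the pattern moves past the mismatching text symbol.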
http://jakeboxer.com/blog/2009/12/13/the-knuth-morris-pratt-algorithm-in-my-own-words/
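Here’s the partial match table for the pattern “abababca”:
char: | a | b | a | b | a | b | c | a |
---|---|---|---|---|---|---|---|---|
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |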
Proper prefix: All the characters in a string, with one or more cut off the end. “S”, “Sn”, “Sna”, and “Snap” are all the proper prefixes of “Snape”.
Proper suffix: All the characters in a string, with one or more cut off the beginning. “agrid”, “grid”, “rid”, “id”, and “d” are all proper suffixes of “Hagrid”.
With this in mind, I can now give the one-sentence meaning of the values in the partial match table:
The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.
How to use the Partial Match Table
The first time we get a partial match is here:
This is a partial_match_length of 1. The value at table[partial_match_length - 1] (or table[0]) is 0, so we don’t get to skip ahead any.
The next partial match we get is here:
This is a partial_match_length of 5. The value at table[partial_match_length - 1] (or table[4]) is 3. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 5 - table[4] or 5 - 3 or 2) characters.
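The skip rule can be written down as a short sketch. The names PartialMatchTable, buildTable and skip are assumptions made for this illustration; the table is built with the same len/i loop used for lps[] earlier in these notes.

```java
import java.util.Arrays;

class PartialMatchTable {
    // table[i] = length of the longest proper prefix of pattern[0..i]
    // that matches a proper suffix of pattern[0..i].
    static int[] buildTable(String pattern) {
        int[] table = new int[pattern.length()];
        int len = 0;
        int i = 1;
        while (i < pattern.length()) {
            if (pattern.charAt(i) == pattern.charAt(len)) {
                table[i++] = ++len;
            } else if (len != 0) {
                len = table[len - 1];
            } else {
                table[i++] = 0;
            }
        }
        return table;
    }

    // How far the pattern moves after a partial match of the given length (>= 1).
    static int skip(int[] table, int partialMatchLength) {
        return partialMatchLength - table[partialMatchLength - 1];
    }

    public static void main(String[] args) {
        int[] table = buildTable("abababca");
        System.out.println(Arrays.toString(table)); // prints [0, 0, 1, 2, 3, 4, 0, 1]
        System.out.println(skip(table, 5));         // prints 2
    }
}
```

For a partial match of length 1 the formula gives 1 - table[0] = 1, which is just the ordinary one-position move and matches the remark that nothing extra is skipped in that case.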
http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/stringmatchingclasses/KmpStringMatcher.java
http://tekmarathon.com/2013/05/14/algorithm-to-find-substring-in-a-string-kmp-algorithm/
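i: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
p[i]: | a | b | c | a | b | d | a | b | c |   |
b[i]: | -1 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 2 | 3 |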
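b[0] belongs to the empty prefix p[0…-1]; since the empty prefix has no border, b[0] is set to -1.
b[1] is computed by looking at the longest prefix and suffix match of the one-character prefix p[0…0] = "a"; since a single character has no non-empty proper prefix or suffix, b[1] is 0.
Note: For any given pattern, b[0] = -1 and b[1] = 0 are always fixed.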
public int[] preProcessPattern(char[] ptrn) {
    int i = 0, j = -1;
    int ptrnLen = ptrn.length;
    int[] b = new int[ptrnLen + 1];

    b[i] = j;
    while (i < ptrnLen) {
        while (j >= 0 && ptrn[i] != ptrn[j]) {
            // if there is mismatch consider next widest border
            j = b[j];
        }
        i++;
        j++;
        b[i] = j;
    }
    return b;
}
public void searchSubString(char[] text, char[] ptrn) {
    int i = 0, j = 0;
    // pattern and text lengths
    int ptrnLen = ptrn.length;
    int txtLen = text.length;

    // initialize new array and preprocess the pattern
    int[] b = preProcessPattern(ptrn);

    while (i < txtLen) {
        while (j >= 0 && text[i] != ptrn[j]) {
            System.out.println("Mismatch happened, between text char "
                    + text[i] + " and pattern char " + ptrn[j]
                    + ", \nhence jumping the value of " + "j from " + j
                    + " to " + b[j] + " at text index i at " + i
                    + " based on partial match table");
            j = b[j];
        }
        i++;
        j++;

        // a match is found
        if (j == ptrnLen) {
            System.out.println("FOUND SUBSTRING AT i " + i
                    + " and index:" + (i - ptrnLen));
            System.out.println("Setting j from " + j + " to " + b[j]);
            j = b[j];
        }
    }
}
Also check http://algs4.cs.princeton.edu/53substring/KMP.java.html
http://algs4.cs.princeton.edu/53substring/KMPplus.java.html
http://www.sanfoundry.com/java-program-knuth-morris-pratt-algorithm/
https://gist.github.com/shonenada/4266864
// next[i] = width of the widest border of pattern[0..i-1]; next[0] = -1
static int[] getNext(String p) {
    int i = 1, j = 0;
    int[] next = new int[p.length() + 2];
    char[] pattern = p.toCharArray();
    next[0] = -1;
    next[1] = 0;
    // invariant: j == next[i]
    while (i < p.length() - 1) {
        if (pattern[i] == pattern[j]) {
            i++;
            j++;
            // the border is extended by one character; writing next[j] here
            // would drop it and miss matches such as "aab" in "aaab"
            next[i] = j;
        } else if (j == 0) {
            next[i + 1] = 0;
            i++;
        } else {
            j = next[j];
        }
    }
    return next;
}

// Returns the index of the first occurrence of p in str at or after start, or -1.
static int findKMPSub(String str, String p, int start, int next[]) {
    char[] string = str.toCharArray();
    char[] pattern = p.toCharArray();
    int i = start;
    int j = 0, v;
    while (i < str.length() && j < p.length()) {
        if (j == -1 || string[i] == pattern[j]) {
            i++;
            j++;
        } else {
            j = next[j];
        }
    }
    if (j == p.length()) {
        v = i - p.length();
    } else {
        v = -1;
    }
    return v;
}
public static int[] createTable(char[] W) {
    int[] table = new int[W.length];
    int pos = 2; // cur pos to compute in T
    int cnd = 0; // index of W of next character of cur candidate substr

    // first few values are fixed
    table[0] = -1; // table[0] := -1
    if (W.length > 1)
        table[1] = 0; // table[1] := 0 (only exists for patterns longer than one char)

    while (pos < W.length) {
        // first case: substring is still good
        if (W[pos-1] == W[cnd]) {
            cnd += 1; // extend the candidate first, then record its length
            table[pos] = cnd;
            pos += 1;
        } else if (cnd > 0)
            cnd = table[cnd];
        else {
            table[pos] = 0;
            pos += 1;
        }
    }
    return table;
}

// S := text string
// W := word
public static int kmp(char[] S, char[] W) {
    if (W.length == 0) // substr is empty string
        return 0;
    if (S.length == 0) // text is empty, can't be found
        return -1;
    int m = 0; // index of beg. of current match in S
    int i = 0; // pos. of cur char in W
    int[] T = createTable(W); // the table is built from the word, not the text

    while (m+i < S.length) {
        if (W[i] == S[m+i]) {
            if (i == W.length-1)
                return m;
            i += 1;
        } else {
            m = (m+i - T[i]);
            if (T[i] > -1)
                i = T[i];
            else
                i = 0;
        }
    }
    return -1;
}
X. Pat[0]=0;
https://www.fmi.uni-sofia.bg/fmi/logic/vboutchkova/sources/KMPMatch_java.html
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void computeLPSArray(char *pat, int M, int *lps);

void KMPSearch(char *pat, char *txt)
{
    int M = strlen(pat);
    int N = strlen(txt);

    // create lps[] that will hold the longest prefix suffix values for pattern
    int *lps = (int *)malloc(sizeof(int)*M);
    int j = 0; // index for pat[]

    // Preprocess the pattern (calculate lps[] array)
    computeLPSArray(pat, M, lps);

    int i = 0; // index for txt[]
    while (i < N)
    {
        if (pat[j] == txt[i])
        {
            j++;
            i++;
        }
        if (j == M)
        {
            printf("Found pattern at index %d \n", i-j);
            j = lps[j-1];
        }
        // mismatch after j matches
        else if (i < N && pat[j] != txt[i])
        {
            // Do not match lps[0..lps[j-1]] characters,
            // they will match anyway
            if (j != 0)
                j = lps[j-1];
            else
                i = i+1;
        }
    }
    free(lps); // to avoid memory leak
}
void computeLPSArray(char *pat, int M, int *lps)
{
    int len = 0; // length of the previous longest prefix suffix
    int i;

    lps[0] = 0; // lps[0] is always 0
    i = 1;

    // the loop calculates lps[i] for i = 1 to M-1
    while (i < M)
    {
        if (pat[i] == pat[len])
        {
            len++;
            lps[i] = len;
            i++;
        }
        else // (pat[i] != pat[len])
        {
            if (len != 0)
            {
                // This is tricky. Consider the example AAACAAAA and i = 7.
                len = lps[len-1];

                // Also, note that we do not increment i here
            }
            else // if (len == 0)
            {
                lps[i] = 0;
                i++;
            }
        }
    }
}
https://dzone.com/articles/algorithm-week-morris-pratt
Read full article from Searching for Patterns | Set 2 (KMP Algorithm) - GeeksforGeeks