后缀:从母串的某一位置开始到结尾,suffix(i) = Ai,Ai+1...An。
后缀数组:后缀数组SA是个一维数组,它保存1...n的某个排列SA[1],SA[2]...SA[n],并且保证suffix(SA[i])<suffix(SA[i+1]),也就是将S的n个后缀从小到大排好序后的开头位置保存到SA中。
名次数组:名次数组Rank[i]保存的是以i开头的后缀的排名,与SA互为逆。简单的说,后缀数组是“排在第几的是谁”,名次数组是“你排第几”。
为了方便比较,通常在串的末尾添加一个字符,它是从未出现并且最小的字符。
Let the given string be "banana".
0 banana 5 a
1 anana Sort the Suffixes 3 ana
2 nana ----------------> 1 anana
3 ana alphabetically 0 banana
4 na 4 na
5 a 2 nana
So the suffix array for "banana" is {5, 3, 1, 0, 4, 2}
Build Suffix Array
-O(n^2logn)
http://www.acmerblog.com/suffix-array-6150.html
15 | int cmp( struct suffix a, struct suffix b) |
17 | return strcmp (a.suff, b.suff) < 0? 1 : 0; |
21 | int *buildSuffixArray( char *txt, int n) |
24 | struct suffix suffixes[n]; |
26 | for ( int i = 0; i < n; i++) |
28 | suffixes[i].index = i; |
29 | suffixes[i].suff = (txt+i); |
33 | sort(suffixes, suffixes+n, cmp); |
36 | int *suffixArr = new int [n]; |
37 | for ( int i = 0; i < n; i++) |
38 | suffixArr[i] = suffixes[i].index; |
O(nlognlogn) or O(nlogn)
http://www.geeksforgeeks.org/suffix-array-set-2-a-nlognlogn-algorithm/
the above algorithm uses standard sort function and therefore time complexity is O(nLognLogn). We can use Radix Sort here to reduce the time complexity to O(nLogn).
We first sort all suffixes according to first character, then according to first 2 characters, then first 4 characters and so on while the number of characters to be considered is smaller than 2n. The important point is, if we have sorted suffixes according to first 2
i characters, then we can sort suffixes according to first 2
i+1 characters in O(nLogn) time using a nLogn sorting algorithm like Merge Sort. This is possible as two suffixes can be compared in O(1) time (we need to compare only two values, see the below example and code).
The sort function is called O(Logn) times (Note that we increase number of characters to be considered in powers of 2). Therefore overall time complexity becomes O(nLognLogn). See
http://www.stanford.edu/class/cs97si/suffix-array.pdf for more details.
int
cmp(
struct
suffix a,
struct
suffix b)
{
return
(a.rank[0] == b.rank[0])? (a.rank[1] < b.rank[1] ?1: 0):
(a.rank[0] < b.rank[0] ?1: 0);
}
int
*buildSuffixArray(
char
*txt,
int
n)
{
struct
suffix suffixes[n];
for
(
int
i = 0; i < n; i++)
{
suffixes[i].index = i;
suffixes[i].rank[0] = txt[i] -
'a'
;
suffixes[i].rank[1] = ((i+1) < n)? (txt[i + 1] -
'a'
): -1;
}
sort(suffixes, suffixes+n, cmp);
int
ind[n];
for
(
int
k = 4; k < 2*n; k = k*2)
{
int
rank = 0;
int
prev_rank = suffixes[0].rank[0];
suffixes[0].rank[0] = rank;
ind[suffixes[0].index] = 0;
for
(
int
i = 1; i < n; i++)
{
if
(suffixes[i].rank[0] == prev_rank &&
suffixes[i].rank[1] == suffixes[i-1].rank[1])
{
prev_rank = suffixes[i].rank[0];
suffixes[i].rank[0] = rank;
}
else
{
prev_rank = suffixes[i].rank[0];
suffixes[i].rank[0] = ++rank;
}
ind[suffixes[i].index] = i;
}
for
(
int
i = 0; i < n; i++)
{
int
nextindex = suffixes[i].index + k/2;
suffixes[i].rank[1] = (nextindex < n)?
suffixes[ind[nextindex]].rank[0]: -1;
}
sort(suffixes, suffixes+n, cmp);
}
int
*suffixArr =
new
int
[n];
for
(
int
i = 0; i < n; i++)
suffixArr[i] = suffixes[i].index;
return
suffixArr;
}
http://blog.csdn.net/ljsspace/article/details/6613034
https://en.wikipedia.org/wiki/LCP_array
the
longest common prefix array (LCP
array) is an auxiliary
data structure to the
suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.
For example, if A := [aab, ab, abaab, b, baab] is a suffix array, the longest common prefix betweenA[1] = aab and A[2] = ab is a which has length 1, so H[2] = 1 in the LCP array H. Likewise, the LCP ofA[2] = ab and A[3] = abaab is ab, so H[3] = 2.
Augmenting the suffix array with the LCP array allows one to efficiently simulate top-down and bottom-up traversals of the suffix tree, speeds up pattern matching on the suffix array and is a prerequisite for compressed suffix trees.
http://ab.inf.uni-tuebingen.de/teaching/ws08/seqan/sarry.java/view
class LCPArray {
int H[];
LCPArray(String s, int[] A) {
int l = s.length();
H = new int[l];
// build inverse suffix array I:
int[] I = new int[l];
for (int i = 0; i < l; i++) I[A[i]] = i;
// build LCP:
int h = 0; H[0] = 0;
for (int i = 0; i < l; i++) {
if (I[i] != 0) {
while (s.charAt(i+h) == s.charAt(A[I[i]-1]+h)) h++;
H[I[i]] = h--;
if (h < 0) h = 0;
}
}
}
}
Application
Search a pattern using the built Suffix Array
如何在text中查找模式串pattern?有了后缀数组,我们就可以用二分查找来进行搜索。
O(mLogn)
void
search(
char
*pat,
char
*txt,
int
*suffArr,
int
n)
{
int
m =
strlen
(pat);
int
l = 0, r = n-1;
while
(l <= r)
{
int
mid = l + (r - l)/2;
int
res =
strncmp
(pat, txt+suffArr[mid], m);
if
(res == 0)
{
cout <<
"Pattern found at index "
<< suffArr[mid];
return
;
}
if
(res < 0) r = mid - 1;
else
l = mid + 1;
}
cout <<
"Pattern not found"
;
}