Massive Algorithms: Suffix Array

http://yzmduncan.iteye.com/blog/979771

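Suffix: the substring that starts at some position of the original string and runs to its end, suffix(i) = A[i], A[i+1], ..., A[n].

Suffix array: the suffix array SA is a one-dimensional array that holds a permutation SA[1], SA[2], ..., SA[n] of 1...n such that suffix(SA[i]) < suffix(SA[i+1]). In other words, sort the n suffixes of S in ascending order and store their starting positions in SA.

Rank array: Rank[i] holds the rank of the suffix that starts at position i; it is the inverse of SA. Put simply, the suffix array answers "who is ranked i-th", while the rank array answers "what is your rank".

To make comparisons easier, a character that never occurs in the string and is smaller than every other character is usually appended to the end of the string.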

Let the given string be "banana". Positions are 0-based in this example and in the code that follows.

0 banana                          5 a
1 anana     Sort the Suffixes     3 ana
2 nana      ---------------->     1 anana  
3 ana        alphabetically       0 banana  
4 na                              4 na   
5 a                               2 nana

So the suffix array for "banana" is {5, 3, 1, 0, 4, 2}
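
Since the rank array is the inverse permutation of the suffix array, it can be derived in a single pass. The following is a minimal sketch using 0-based positions; the function name buildRankArray is made up for this illustration.

```cpp
#include <vector>

// Build the rank array as the inverse permutation of the suffix array:
// if sa[r] == i, then the suffix starting at i has rank r.
std::vector<int> buildRankArray(const std::vector<int> &sa)
{
    std::vector<int> rank(sa.size());
    for (int r = 0; r < static_cast<int>(sa.size()); r++)
        rank[sa[r]] = r;
    return rank;
}
```

For the suffix array {5, 3, 1, 0, 4, 2} of "banana" this gives the rank array {3, 2, 5, 1, 4, 0}: suffix 0 ("banana") is ranked 3, suffix 5 ("a") is ranked 0, and so on.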

Build Suffix Array
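O(n^2 log n): sort all suffixes with an ordinary comparison sort. The sort makes O(n log n) comparisons, and each comparison of two suffixes costs up to O(n).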
http://www.acmerblog.com/suffix-array-6150.html

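#include <algorithm>
#include <cstring>
#include <vector>
using namespace std;

// A suffix, identified by its start index and a pointer to its first character
struct suffix
{
    int index;
    const char *suff;
};

// Compare two suffixes lexicographically
bool cmp(const suffix &a, const suffix &b)
{
    return strcmp(a.suff, b.suff) < 0;
}

// Build the suffix array of txt (a null-terminated string of length n)
int *buildSuffixArray(char *txt, int n)
{
    // One entry per suffix
    vector<suffix> suffixes(n);

    for (int i = 0; i < n; i++)
    {
        suffixes[i].index = i;
        suffixes[i].suff = txt + i;
    }

    // Sort the suffixes
    sort(suffixes.begin(), suffixes.end(), cmp);

    // "Who is ranked i-th"; the caller must delete[] the returned array
    int *suffixArr = new int[n];
    for (int i = 0; i < n; i++)
        suffixArr[i] = suffixes[i].index;

    return suffixArr;
}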

O(nlognlogn) or O(nlogn)
http://www.geeksforgeeks.org/suffix-array-set-2-a-nlognlogn-algorithm/
The algorithm below uses the standard sort function, so its time complexity is O(n log n log n). Using radix sort instead reduces the time complexity to O(n log n).

We first sort all suffixes by their first character, then by their first 2 characters, then by their first 4 characters, and so on, while the number of characters considered is smaller than 2n. The key point is that once the suffixes are sorted by their first 2^i characters, they can be sorted by their first 2^(i+1) characters in O(n log n) time with an O(n log n) sorting algorithm such as merge sort. This works because two suffixes can then be compared in O(1) time: only two rank values need to be compared (see the code below).
The sort function is called O(log n) times, because the number of characters considered doubles in every round. The overall time complexity is therefore O(n log n log n). See http://www.stanford.edu/class/cs97si/suffix-array.pdf for more details.

#include <algorithm>
#include <vector>
using namespace std;

// A suffix, represented by its start index and a pair of ranks
struct suffix
{
    int index;    // start position of the suffix in txt
    int rank[2];  // rank of the first half and rank of the second half
};

// A comparison function used by sort() to compare two suffixes.
// Compares two rank pairs; returns true if the first pair is smaller
bool cmp(const suffix &a, const suffix &b)
{
    return (a.rank[0] == b.rank[0]) ? (a.rank[1] < b.rank[1])
                                    : (a.rank[0] < b.rank[0]);
}

// This is the main function that takes a string 'txt' of size n as an
// argument, builds and returns the suffix array for the given string.
// The caller must delete[] the returned array
int *buildSuffixArray(char *txt, int n)
{
    // A structure to store suffixes and their indexes
    vector<suffix> suffixes(n);

    // Store suffixes and their indexes in an array of structures.
    // The structure is needed to sort the suffixes alphabetically
    // and maintain their old indexes while sorting.
    // Character codes are used as initial ranks (always >= 0), so -1
    // can safely mean "no next character"
    for (int i = 0; i < n; i++)
    {
        suffixes[i].index = i;
        suffixes[i].rank[0] = (unsigned char)txt[i];
        suffixes[i].rank[1] = ((i+1) < n)? (unsigned char)txt[i + 1]: -1;
    }

    // Sort the suffixes using the comparison function
    // defined above.
    sort(suffixes.begin(), suffixes.end(), cmp);

    // At this point, all suffixes are sorted according to first
    // 2 characters.  Let us sort suffixes according to first 4
    // characters, then first 8 and so on
    vector<int> ind(n);  // This array is needed to get the index in suffixes[]
                         // from original index.  This mapping is needed to get
                         // next suffix.
    for (int k = 4; k < 2*n; k = k*2)
    {
        // Assigning rank and index values to first suffix
        int rank = 0;
        int prev_rank = suffixes[0].rank[0];
        suffixes[0].rank[0] = rank;
        ind[suffixes[0].index] = 0;

        // Assigning rank to suffixes
        for (int i = 1; i < n; i++)
        {
            // If first rank and next ranks are same as that of previous
            // suffix in array, assign the same new rank to this suffix
            if (suffixes[i].rank[0] == prev_rank &&
                    suffixes[i].rank[1] == suffixes[i-1].rank[1])
            {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = rank;
            }
            else // Otherwise increment rank and assign
            {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = ++rank;
            }
            ind[suffixes[i].index] = i;
        }

        // Assign next rank to every suffix
        for (int i = 0; i < n; i++)
        {
            int nextindex = suffixes[i].index + k/2;
            suffixes[i].rank[1] = (nextindex < n)?
                                  suffixes[ind[nextindex]].rank[0]: -1;
        }

        // Sort the suffixes according to first k characters
        sort(suffixes.begin(), suffixes.end(), cmp);
    }

    // Store indexes of all sorted suffixes in the suffix array
    int *suffixArr = new int[n];
    for (int i = 0; i < n; i++)
        suffixArr[i] = suffixes[i].index;

    // Return the suffix array
    return suffixArr;
}
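
The radix-sort variant mentioned above can be sketched as follows. This is not the code from either linked article. It is one common way to do prefix doubling with a counting sort per round, written here under two assumptions: the text contains no '\0' byte, and '\0' is appended as the unique smallest end character described earlier. The names sortCyclicShifts and buildSuffixArrayRadix are made up for this sketch.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Sort the cyclic shifts of s by prefix doubling, using a counting sort
// in every round. Assumes s ends with a unique smallest character, so the
// order of cyclic shifts equals the order of suffixes.
static std::vector<int> sortCyclicShifts(const std::string &s)
{
    const int n = static_cast<int>(s.size());
    const int alphabet = 256;
    std::vector<int> p(n), c(n), cnt(std::max(alphabet, n), 0);

    // Round 0: counting sort by the first character
    for (int i = 0; i < n; i++) cnt[static_cast<unsigned char>(s[i])]++;
    for (int i = 1; i < alphabet; i++) cnt[i] += cnt[i - 1];
    for (int i = 0; i < n; i++) p[--cnt[static_cast<unsigned char>(s[i])]] = i;

    // c[i] is the equivalence class (rank) of the shift starting at i
    c[p[0]] = 0;
    int classes = 1;
    for (int i = 1; i < n; i++) {
        if (s[p[i]] != s[p[i - 1]]) classes++;
        c[p[i]] = classes - 1;
    }

    std::vector<int> pn(n), cn(n);
    for (int len = 1; len < n; len *= 2) {
        // Moving every start back by len yields the shifts already
        // ordered by their second half
        for (int i = 0; i < n; i++) {
            pn[i] = p[i] - len;
            if (pn[i] < 0) pn[i] += n;
        }
        // Stable counting sort by the class of the first half
        std::fill(cnt.begin(), cnt.begin() + classes, 0);
        for (int i = 0; i < n; i++) cnt[c[pn[i]]]++;
        for (int i = 1; i < classes; i++) cnt[i] += cnt[i - 1];
        for (int i = n - 1; i >= 0; i--) p[--cnt[c[pn[i]]]] = pn[i];

        // Recompute classes from the (first half, second half) pairs
        cn[p[0]] = 0;
        classes = 1;
        for (int i = 1; i < n; i++) {
            std::pair<int, int> cur(c[p[i]], c[(p[i] + len) % n]);
            std::pair<int, int> prev(c[p[i - 1]], c[(p[i - 1] + len) % n]);
            if (cur != prev) classes++;
            cn[p[i]] = classes - 1;
        }
        c.swap(cn);
    }
    return p;
}

// Build the suffix array of s in O(n log n) time
std::vector<int> buildSuffixArrayRadix(std::string s)
{
    s.push_back('\0');                        // sentinel (assumed absent from s)
    std::vector<int> p = sortCyclicShifts(s);
    p.erase(p.begin());                       // the sentinel suffix sorts first
    return p;
}
```

Calling buildSuffixArrayRadix("banana") returns {5, 3, 1, 0, 4, 2}, the same suffix array as in the example at the top. Each round costs O(n) instead of O(n log n), which is where the saving over the sort()-based version comes from.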

http://blog.csdn.net/ljsspace/article/details/6613034

https://en.wikipedia.org/wiki/LCP_array

The longest common prefix array (LCP array) is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.

For example, if A := [aab, ab, abaab, b, baab] is a suffix array, the longest common prefix between A[1] = aab and A[2] = ab is a, which has length 1, so H[2] = 1 in the LCP array H. Likewise, the LCP of A[2] = ab and A[3] = abaab is ab, so H[3] = 2.
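
The example can be checked with a direct, quadratic-time computation that follows the definition. The listed suffixes are exactly the suffixes of "abaab", so with 0-based start positions the suffix array is {2, 3, 0, 4, 1}. The function names below are made up for this sketch, and the result is 0-based, so lcp[1] corresponds to H[2] in the text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest common prefix of the suffixes s[i..] and s[j..]
int commonPrefixLength(const std::string &s, int i, int j)
{
    const int n = static_cast<int>(s.size());
    int h = 0;
    while (i + h < n && j + h < n && s[i + h] == s[j + h]) h++;
    return h;
}

// lcp[0] = 0; for k >= 1, lcp[k] is the LCP of the suffixes
// starting at sa[k-1] and sa[k]
std::vector<int> buildLcpNaive(const std::string &s, const std::vector<int> &sa)
{
    std::vector<int> lcp(sa.size(), 0);
    for (std::size_t k = 1; k < sa.size(); k++)
        lcp[k] = commonPrefixLength(s, sa[k - 1], sa[k]);
    return lcp;
}
```

For "abaab" this yields {0, 1, 2, 0, 1}, which agrees with H[2] = 1 and H[3] = 2 above. This version exists only to make the definition concrete; it does not reuse work between neighbouring suffixes, so it is slower than a linear-time construction.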

Augmenting the suffix array with the LCP array allows one to efficiently simulate top-down and bottom-up traversals of the suffix tree, speeds up pattern matching on the suffix array and is a prerequisite for compressed suffix trees.
http://ab.inf.uni-tuebingen.de/teaching/ws08/seqan/sarry.java/view

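class LCPArray {
    int[] H;

    // s is the text, A its suffix array (0-based start positions in sorted order)
    LCPArray(String s, int[] A) {
        int l = s.length();
        H = new int[l];
        if (l == 0) return;

        // build inverse suffix array I: I[i] is the rank of suffix i
        int[] I = new int[l];
        for (int i = 0; i < l; i++) I[A[i]] = i;

        // build LCP: H[r] is the LCP of the suffixes ranked r-1 and r
        int h = 0;
        H[0] = 0;
        for (int i = 0; i < l; i++) {
            if (I[i] != 0) {
                int j = A[I[i] - 1];  // start of the suffix ranked just before suffix i
                // the bounds checks keep charAt() in range when s has no unique end marker
                while (i + h < l && j + h < l && s.charAt(i + h) == s.charAt(j + h)) h++;
                H[I[i]] = h--;
                if (h < 0) h = 0;
            }
        }
    }
}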

Application
Search a pattern using the built Suffix Array
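How do we find a pattern in the text? With the suffix array in hand, we can use binary search over the sorted suffixes.
Time complexity: O(m log n), where m is the length of the pattern and n is the length of the text.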

#include <cstring>
#include <iostream>
using namespace std;

// A suffix-array-based search function that looks for the pattern
// 'pat' in the text 'txt' using the suffix array suffArr[]
void search(const char *pat, const char *txt, const int *suffArr, int n)
{
    int m = strlen(pat);  // length of the pattern, needed for strncmp()

    // Do a simple binary search for pat in txt using the
    // built suffix array
    int l = 0, r = n-1;  // Initialize left and right indexes
    while (l <= r)
    {
        // See if 'pat' is a prefix of the middle suffix in the suffix array
        int mid = l + (r - l)/2;
        int res = strncmp(pat, txt+suffArr[mid], m);

        // If a match is found at the middle, print it and return
        if (res == 0)
        {
            cout << "Pattern found at index " << suffArr[mid];
            return;
        }

        // Move to the left half if the pattern is alphabetically less than
        // the mid suffix
        if (res < 0) r = mid - 1;

        // Otherwise move to the right half
        else l = mid + 1;
    }

    // We reach here if the return statement in the loop is not executed
    cout << "Pattern not found";
}
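
The search above stops at the first match it happens to hit. All suffixes that start with the pattern sit next to each other in the suffix array, so two binary searches can find the whole range. The following is a sketch of that idea using std::lower_bound and std::upper_bound; the function name findAllOccurrences is made up for this illustration, and the suffix array is assumed to hold 0-based start positions.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Return the start positions (in increasing order) of every
// occurrence of pat in txt, given the suffix array sa of txt.
std::vector<int> findAllOccurrences(const std::string &txt,
                                    const std::vector<int> &sa,
                                    const std::string &pat)
{
    const std::size_t m = pat.size();

    // First suffix whose first m characters are >= pat
    auto lo = std::lower_bound(sa.begin(), sa.end(), pat,
        [&](int start, const std::string &p) {
            return txt.compare(start, m, p) < 0;
        });

    // First suffix whose first m characters are > pat
    auto hi = std::upper_bound(lo, sa.end(), pat,
        [&](const std::string &p, int start) {
            return txt.compare(start, m, p) > 0;
        });

    // Everything in [lo, hi) starts with pat
    std::vector<int> hits(lo, hi);
    std::sort(hits.begin(), hits.end());
    return hits;
}
```

With txt = "banana" and sa = {5, 3, 1, 0, 4, 2}, searching for "ana" returns {1, 3} and searching for "na" returns {2, 4}. The two binary searches cost O(m log n); the final sort is only there to report positions in text order.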
