Massive Algorithms: Suffix Array | Set 2 (nLogn Algorithm)

Suffix Array | Set 2 (nLogn Algorithm) - GeeksforGeeks
A suffix array is a sorted array of all suffixes of a given string.
O(nLogn) algorithm

O(nLognLogn) algorithm
Let us first discuss a O(n * Logn * Logn) algorithm for simplicity. The idea is to use the fact that strings that are to be sorted are suffixes of a single string.
We first sort all suffixes according to first character, then according to first 2 characters, then first 4 characters and so on while the number of characters to be considered is smaller than 2n. The important point is, if we have sorted suffixes according to first 2ⁱ characters, then we can sort suffixes according to first 2ⁱ⁺¹ characters in O(nLogn) time using a nLogn sorting algorithm like Merge Sort. This is possible as two suffixes can be compared in O(1) time (we need to compare only two values).
The sort function is called O(Logn) times (Note that we increase number of characters to be considered in powers of 2). Therefore overall time complexity becomes O(nLognLogn).

We can use Radix Sort here to reduce the time complexity to O(nLogn).

Suffix arrays can be constructed in O(n) time also. 

// Structure to store information of a suffix

struct suffix

{

    int index; // To store original index

    int rank[2]; // To store ranks and next rank pair

};

// A comparison function used by sort() to compare two suffixes

// Compares two pairs, returns 1 if first pair is smaller

int cmp(struct suffix a, struct suffix b)

{

    return (a.rank[0] == b.rank[0])? (a.rank[1] < b.rank[1] ?1: 0):

               (a.rank[0] < b.rank[0] ?1: 0);

}

// This is the main function that takes a string 'txt' of size n as an

// argument, builds and return the suffix array for the given string

int *buildSuffixArray(char *txt, int n)

{

    // A structure to store suffixes and their indexes

    struct suffix suffixes[n];

    // Store suffixes and their indexes in an array of structures.

    // The structure is needed to sort the suffixes alphabatically

    // and maintain their old indexes while sorting

    for (int i = 0; i < n; i++)

    {

        suffixes[i].index = i;

        suffixes[i].rank[0] = txt[i] - 'a';

        suffixes[i].rank[1] = ((i+1) < n)? (txt[i + 1] - 'a'): -1;

    }

    // Sort the suffixes using the comparison function

    // defined above.

    sort(suffixes, suffixes+n, cmp);

    // At his point, all suffixes are sorted according to first

    // 2 characters.  Let us sort suffixes according to first 4

    // characters, then first 8 and so on

    int ind[n];  // This array is needed to get the index in suffixes[]

                 // from original index.  This mapping is needed to get

                 // next suffix.

    for (int k = 4; k < 2*n; k = k*2)

    {

        // Assigning rank and index values to first suffix

        int rank = 0;

        int prev_rank = suffixes[0].rank[0];

        suffixes[0].rank[0] = rank;

        ind[suffixes[0].index] = 0;

        // Assigning rank to suffixes

        for (int i = 1; i < n; i++)

        {

            // If first rank and next ranks are same as that of previous

            // suffix in array, assign the same new rank to this suffix

            if (suffixes[i].rank[0] == prev_rank &&

                    suffixes[i].rank[1] == suffixes[i-1].rank[1])

            {

                prev_rank = suffixes[i].rank[0];

                suffixes[i].rank[0] = rank;

            }

            else // Otherwise increment rank and assign

            {

                prev_rank = suffixes[i].rank[0];

                suffixes[i].rank[0] = ++rank;

            }

            ind[suffixes[i].index] = i;

        }

        // Assign next rank to every suffix

        for (int i = 0; i < n; i++)

        {

            int nextindex = suffixes[i].index + k/2;

            suffixes[i].rank[1] = (nextindex < n)?

                                  suffixes[ind[nextindex]].rank[0]: -1;

        }

        // Sort the suffixes according to first k characters

        sort(suffixes, suffixes+n, cmp);

    }

    // Store indexes of all sorted suffixes in the suffix array

    int *suffixArr = new int[n];

    for (int i = 0; i < n; i++)

        suffixArr[i] = suffixes[i].index;

    // Return the suffix array

    return  suffixArr;

}

http://blog.csdn.net/ljsspace/article/details/6613034

Prefix Doubling算法（倍增法）是构造后缀数组一个比较实用的算法。其基本思想是先计算出每个后缀的k-前缀的rank值，然后在此基础上计算每个后缀的2k-前缀rank值，k从1开始。直到每个后缀都排出先后顺序为止（任何两个后缀都不会相等，换句话说，每个后缀终究都能排出先后顺序）。在处理2k-前缀时，只需要使用基数排序（radix sort）算法，先后对两位数字排序（可以采用计数排序算法(counting sort)对每一位数字排序）。在最坏情况下，需要做lgn次基数排序，每一次基数排序的操作次数为2*O(n)，因此它的时间复杂度是O(nlgn)。

https://gist.github.com/calmhandtitan/8119030

http://blog.csdn.net/itegel84/article/details/6019842
nocow 后缀数组
http://www.aiuxian.com/article/p-1831057.html
Video: https://www.youtube.com/watch?v=HKPrVm5FWvg
https://courses.cs.ut.ee/MTAT.03.190/2012_fall/uploads/Main/karkkainen_sanders.pdf
http://algs4.cs.princeton.edu/63suffix/
http://web.stanford.edu/class/cs97si/suffix-array.pdf
Read full article from Suffix Array | Set 2 (nLogn Algorithm) - GeeksforGeeks

Suffix Array | Set 2 (nLogn Algorithm) - GeeksforGeeks

Labels

Popular Posts