Massive Algorithms: LeetCode 1044 - Longest Duplicate Substring

Given a string S, consider all duplicated substrings: (contiguous) substrings of S that occur 2 or more times. (The occurrences may overlap.)

Return any duplicated substring that has the longest possible length. (If S does not have a duplicated substring, the answer is "".)

Example 1:

Input: "banana"
Output: "ana"

Example 2:

Input: "abcd"
Output: ""

Note:

2 <= S.length <= 10^5
S consists of lowercase English letters.

X. Approach 1: Binary Search + Rabin-Karp
https://leetcode.com/articles/longest-duplicate-substring/

String Searching Algorithms

The problem is a follow-up of Longest Repeating Substring, and typically used to check if you're comfortable with string searching algortihms.

Best algorithms have a linear execution time in average. The most popular ones are Aho-Corasick, KMPand Rabin-Karp: Aho-Corasick is used by fgrep, KMP is used for chinese string searching, and Rabin-Karp is used for plagiarism detection and in bioinformatics to look for similarities in two or more proteins.

The first two are optimised for a single pattern search, and Rabin-Karp for a multiple pattern search, that is exactly the case here.

public int search(int L, int a, long modulus, int n, int[] nums) {
  // compute the hash of string S[:L]
  long h = 0;
  for(int i = 0; i < L; ++i) h = (h * a + nums[i]) % modulus;

  // already seen hashes of strings of length L
  HashSet<Long> seen = new HashSet();
  seen.add(h);
  // const value to be used often : a**L % modulus
  long aL = 1;
  for (int i = 1; i <= L; ++i) aL = (aL * a) % modulus;

  for(int start = 1; start < n - L + 1; ++start) {
    // compute rolling hash in O(1) time
    h = (h * a - nums[start - 1] * aL % modulus + modulus) % modulus;
    h = (h + nums[start + L - 1]) % modulus;
    if (seen.contains(h)) return start;
    seen.add(h);
  }
  return -1;
}

public String longestDupSubstring(String S) {
  int n = S.length();
  // convert string to array of integers
  // to implement constant time slice
  int[] nums = new int[n];
  for(int i = 0; i < n; ++i) nums[i] = (int)S.charAt(i) - (int)'a';
  // base value for the rolling hash function
  int a = 26;
  // modulus value for the rolling hash function to avoid overflow
  long modulus = (long)Math.pow(2, 32);

  // binary search, L = repeating string length
  int left = 1, right = n;
  int L;
  while (left != right) {
    L = left + (right - left) / 2;
    if (search(L, a, modulus, n, nums) != -1) left = L + 1;
    else right = L;
  }

  int start = search(left - 1, a, modulus, n, nums);
  return start != -1 ? S.substring(start, start + left - 1) : "";
}

https://leetcode.com/problems/longest-duplicate-substring/discuss/290871/Python-Binary-Search
Suffix array is typical solution for this problem.

The fastest way is to copy a template form the Internet.
The code will be quite long.
Here I want to share a binary search solution.

Explanation

Binary search the length of longest duplicate substring and call the help function test(L).
test(L) slide a window of length L,
rolling hash the string in this window,
record the seen string in a hashset,
and try to find duplicated string.

I give it a big mod for rolling hash and it should be enough for this problem.
Actually there could be hash collision.
One solution is to have two different mod for hash.
Or we can use a hashmap to record the index of string.

Complexity

Binary Search in range 1 and N, so it's O(logN)
Rolling hash O(N)
Overall O(NlogN)
SpaceO(N)

Python:

    def longestDupSubstring(self, S):
        A = [ord(c) - ord('a') for c in S]
        mod = 2**63 - 1

        def test(L):
            p = pow(26, L, mod)
            cur = reduce(lambda x, y: (x * 26 + y) % mod, A[:L], 0)
            seen = {cur}
            for i in xrange(L, len(S)):
                cur = (cur * 26 + A[i] - A[i - L] * p) % mod
                if cur in seen: return i - L + 1
                seen.add(cur)
        res, lo, hi = 0, 0, len(S)
        while lo < hi:
            mi = (lo + hi + 1) / 2
            pos = test(mi)
            if pos:
                lo = mi
                res = pos
            else:
                hi = mi - 1
        return S[res:res + lo]

思路：二分。为了节省内存，将string hash成整数（用了mod，能AC是因为数据量不是很大）

为了降低hash的时间，利用前面substring的hash结果

X. Suffix Array
https://leetcode.com/problems/longest-duplicate-substring/discuss/290852/Suffix-array-clear-solution

LeetCode 1044 - Longest Duplicate Substring

String Searching Algorithms

Explanation

Complexity

Labels

Popular Posts