Massive Algorithms: LeetCode 940 - Distinct Subsequences II

LeetCode 940 - Distinct Subsequences II

Given a string S, count the number of distinct, non-empty subsequences of S .

Since the result may be large, return the answer modulo 10^9 + 7.

Example 1:

Input: "abc"
Output: 7
Explanation: The 7 distinct subsequences are "a", "b", "c", "ab", "ac", "bc", and "abc".

Example 2:

Input: "aba"
Output: 6
Explanation: The 6 distinct subsequences are "a", "b", "ab", "ba", "aa" and "aba".

Example 3:

Input: "aaa"
Output: 3
Explanation: The 3 distinct subsequences are "a", "aa" and "aaa".

Note:

S contains only lowercase letters.
1 <= S.length <= 2000

用一个数组endsWith[26]来表示（截止到当前位置的字符串）中以每个字母结尾的非空sequence的数量。假设当前这个统计是不重不漏的。考虑加入下一个字母S[i]后增加的所有sequence。在不考虑重复的情况下，这个数量是sum(endsWith) + 1（原来的所有sequence后面加上这个字母，还有这个字母自己）。如果要发生重复，显然那些被重复的序列的结尾只能是S[i]；而那些被重复的以S[i]结尾的sequence去掉S[i]的结果必定以某种形式存在于endsWith数组中，因此它们必定都被重复了一遍：因此重复序列的总数就是endsWith[S[i] - 'a']。于是最后的效果是每次更新的时候endsWith[S[i] - 'a'] = sum(endsWith) + 1。

https://leetcode.com/problems/distinct-subsequences-ii/discuss/192017/C++JavaPython-4-lines-O%28N%29-Time-O%281%29-Space

Init an array endswith[26]
endswith[i] to count how many sub sequence that ends with ith character.

Now we have N = sum(endswith) different sub sequence,
add a new character c to each of them,
then we have N different sub sequence that ends with c.

With this idea, we loop on the whole string S,
and we update end[c] = sum(end) + 1 for each character.

We need to plus one here, because "c" itself is also a sub sequence.

Example

Input: "aba"
Current parsed: "ab"

endswith 'a': ["a"]
endswith 'b': ["ab","b"]

"a" -> "aa"
"ab" -> "aba"
"b" -> "ba"
"" -> "a"

endswith 'a': ["aa","aba","ba","a"]
endswith 'b': ["ab","b"]
result: 6

    public int distinctSubseqII(String S) {
        long end[] = new long[26], mod = (long)1e9 + 7;
        for (char c : S.toCharArray())
            end[c - 'a'] = Arrays.stream(end).sum()%mod + 1;
        return (int)(Arrays.stream(end).sum() % mod);
    }

Use a variable to count the sum(end) can avoid repeatly count it.
Improve time complexity from O(26N) to O(N), if you want

    public int distinctSubseqII(String S) {
        int end[] = new int[26], res = 0, added = 0, mod = (int)1e9 + 7;
        for (char c : S.toCharArray()) {
            added = (res + 1 - end[c - 'a']) % mod;
            res = (res + added) % mod;
            end[c - 'a'] = (end[c - 'a'] + added) % mod;
        }
        return (res + mod) % mod;
    }

Furthermore, we can use a sum to represent sum(dp[0], ..., dp[i - 1]).
And also a count array, in which count[S.charAt(i) - 'a'] represents the count of presented subsequence ends with S.charAt(i).
Then dp[i] = sum - count[S.charAt(i) - 'a'].

    public int distinctSubseqII(String S) {
        int n = S.length(), M = (int)1e9 + 7;
        int[] dp = new int[n];
        Arrays.fill(dp, 1);
        int[] count = new int[26];
        int sum = 0;
        for (int i = 0; i < n; i++) {
            int index = S.charAt(i) - 'a';
            dp[i] += sum - count[index];
            dp[i] = (dp[i] + M) % M;
            sum = (sum + dp[i]) % M;
            count[index] = (count[index] + dp[i]) % M;
        }
        return sum;
    }

https://www.jianshu.com/p/02501f516437

最开始的思路是用Trie来表示所有可能subseq.
遍历string S，对Trie中每个节点新建叶节点。
提交后果然答案对了，但是Memory Limit Exceed。
转念一想，没必要存下所有节点，只需要知道当前节点的数量就好了。
Trie中一个节点对应一个seq。
假设当前有N个不同的seq，每个seq加上一个新的字母，又是新的N个不同sequence了。
但新的seq中，有一部分原来就有。
比如新的字母是'a'，那么原来以'a'结尾的seq，每一个都会存在新的这N个seq中。
到这里，解题思路就比较清楚了。
我们需要用一个数组int endsWith[26]，
endsWith[i]来表示以第i个字母结束的sequence数量。
最后修饰一下代码细节。

数组全部初始化为0。
正好按照题目要求，空string不计作subseq。
每次更新的时候，end[i] = sum(end) + 1。
加一的原因是，比如我们遇到新字母'a'，新的N个seq里不包含“a”，需要额外加上

这个方法比较简洁，时间O(26N)，空间O(26)
可以看心情优化的地方是，用一个变量保存当前数组和，避免重复计算。
时间优化到O(N).

https://zxi.mytechroad.com/blog/dynamic-programming/leetcode-940-distinct-subsequences-ii/

counts[i][j] := # of distinct sub sequences of s[1->i] and ends with letter j. (‘a'<= j <= ‘z’)

Initialization:

counts[*][*] = 0

Transition:

counts[i][j] = sum(counts[i-1]) + 1 if s[i] == j else counts[i-1][j]

ans = sum(counts[n])

e.g. S = “abc”

counts[1] = {‘a’ : 1}
counts[2] = {‘a’ : 1, ‘b’ : 1 + 1 = 2}
counts[3] = {‘a’ : 1, ‘b’ : 2, ‘c’: 1 + 2 + 1 = 4}
ans = sum(counts[3]) = 1 + 2 + 4 = 7

Time complexity: O(N*26)

Space complexity: O(N*26) -> O(26)

https://zhanghuimeng.github.io/post/leetcode-940-distinct-subsequences-ii/

X.

这些做法使我感到，这种计数DP是在非常有限的信息下做到不重不漏地统计的过程，所以保留哪些信息是十分关键的。

https://leetcode.com/articles/distinct-subsequences-ii/

public int distinctSubseqII(String S) {

int MOD = 1_000_000_007;

int N = S.length();

int[] dp = new int[N + 1];

dp[0] = 1;

int[] last = new int[26];

Arrays.fill(last, -1);

for (int i = 0; i < N; ++i) {

int x = S.charAt(i) - 'a';

dp[i + 1] = dp[i] * 2 % MOD;

if (last[x] >= 0)

dp[i + 1] -= dp[last[x]];

dp[i + 1] %= MOD;

last[x] = i;

}

dp[N]--;

if (dp[N] < 0)

dp[N] += MOD;

return dp[N];

}

X. https://www.geeksforgeeks.org/count-distinct-subsequences/

The problem of counting distinct subsequences is easy if all characters of input string are distinct. The count is equal to nC0 + nC1 + nC2 + … nCn = 2n.

How to count distinct subsequences when there can be repetition in input string?
A Simple Solution to count distinct subsequences in a string with duplicates is to generate all subsequences. For every subsequence, store it in a hash table if it doesn’t exist already. Time complexity of this solution is exponential and it requires exponential extra space.

An Efficient Solution doesn’t require generation of subsequences.

Let countSub(n) be count of subsequences of 
first n characters in input string. We can
recursively write it as below. 

countSub(n) = 2*Count(n-1) - Repetition

If current character, i.e., str[n-1] of str has
not appeared before, then 
   Repetition = 0

Else:
   Repetition  =  Count(m)
   Here m is index of previous occurrence of
   current character. We basically remove all
   counts ending with previous occurrence of
   current character.

How does this work?
If there are no repetitions, then count becomes double of count for n-1 because we get count(n-1) more subsequences by adding current character at the end of all subsequences possible with n-1 length.
If there repetitions, then we find count of all distinct subsequences ending with previous occurrence. This count can be obtained be recursively calling for index of previous occurrence.

Since above recurrence has overlapping subproblems, we can solve it using Dynamic Programming.

    static final int MAX_CHAR = 256; 

    // Returns count of distinct sunsequences of str. 

    static int countSub(String str) 

    { 

        // Create an array to store index 

        // of last 

        int[] last = new int[MAX_CHAR]; 

        Arrays.fill(last, -1); 

        // Length of input string 

        int n = str.length(); 

        // dp[i] is going to store count of distinct 

        // subsequences of length i. 

        int[] dp = new int[n+1]; 

        // Empty substring has only one subsequence 

        dp[0] = 1; 

        // Traverse through all lengths from 1 to n. 

        for (int i=1; i<=n; i++) 

        { 

            // Number of subsequences with substring 

            // str[0..i-1] 

            dp[i] = 2*dp[i-1]; 

            // If current character has appeared 

            // before, then remove all subsequences 

            // ending with previous occurrence. 

            if (last[(int)str.charAt(i-1)] != -1) 

                dp[i] = dp[i] - dp[last[(int)str.charAt(i-1)]]; 

            // Mark occurrence of current character 

            last[(int)str.charAt(i-1)] = (i-1); 

        } 

        return dp[n]; 

    }

https://leetcode.com/problems/distinct-subsequences-ii/discuss/192030/Java-DP-O(N2)-time-greater-O(N)-time-greater-O(1)-space

dp[i] represents the count of unique subsequence ends with S[i].
dp[i] is initialized to 1 for S[0 ... i]
For each dp[i], we define j from 0 to i - 1, we have:

if s[j] != s[i], dp[i] += dp[j]
if s[j] == s[i], do nothing to avoid duplicates.

Then result = sum(dp[0], ... dp[n - 1])
Time complexity: O(n^2)

    public int distinctSubseqII(String S) {
        int n = S.length(), M = (int)1e9 + 7, result = 0;
        int[] dp = new int[n];
        Arrays.fill(dp, 1);
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < i; j++) {
                if (S.charAt(j) != S.charAt(i)) {
                    dp[i] += dp[j];
                    dp[i] %= M;
                }
            }
            result += dp[i];
            result %= M;
        }
        return result;
    }

X. http://www.noteanddata.com/leetcode-940-Distinct-Subsequences-II-java-solution-note.html

因为是subsequence，所以数量是比较大的，暴力解已经不可能，通常这个时候需要考虑dp
题目要求是unique的，就是要去重。我们可以先思考不去重应该怎么做
这里dp最多只有两个维度, dp[i], 或者dp[i][j]，我们先看一维的dp[i]能不能解决问题
假设dp[i]是以第i个字符串结尾的subsequence的个数
a: dp[0] = 1
ab: dp[0] = 1, dp[1] = 2
abc: dp[0] = 1, dp[1] = 2, dp[2] = 4
abcd: dp[0] = 1, dp[1] = 2, dp[2] = 4, dp[3] = 8

观察规律疑似， dp[i] = dp[0] + … dp[i-1] + 1
因为dp[i]一定是包括下面这些的和
* dp[0] + s.charAt(i)
* dp[1] + s.charAt(i)
* ….
* dp[i-1] + s.charAt(i)
* s.charAt(i)

所以，这个dp递推规则是符合逻辑的。
这时候，我们要去重, 假设j < i, 并且i和j字符重复，那么可以看到，所有的重复都会来自于j之前的，
比如abcc，所有的重复都来自于第4个c和前面的ab任意选择后组成的结果，以及加上c自己，
而这个结果就是dp[j], 也就是所有下面这些会重复，把其中的s.charAt(j)换成s.charAt(i)以后是一样的：
* dp[0]+s.charAt(j)
* dp[1]+s.charAt(j)
* …
* dp[j-1]+s.charAt(j)
* s.charAt(j)

如果这样的话，去重就好办了，对于i和j相等的情况，递推的时候不加上就好了。

public int distinctSubseqII(String s) {
    int[] dp = new int[s.length()];
    for(int i = 0; i < s.length(); ++i) {
        dp[i] = 1;
        for(int j = 0; j < i; ++j) {
            if(s.charAt(i) != s.charAt(j)) {
                dp[i] += dp[j];    
                dp[i] %= 1_000_000_007;
            }                
        }
    }
    int sum = 0;
    for(int i = 0;i < dp.length; ++i) {
        sum += dp[i];
        sum %= 1_000_000_007;
    }
    return sum;
}

数组做了两层循环，所以时间复杂度是O(N^2)
https://www.cnblogs.com/seyjs/p/10412749.html
记dp[i]为以S[i]元素结尾可以组成的子串的个数，很显然dp[0] = 1。显然dp[i]的前一个元素可以是dp[0] ~ dp[i-1]中的任何一个，那么应该有dp[i] = dp[0] + dp[1] +...dp[i-1]。这是对于元素没有重复的情况。假设S[j]是S[0-i]中与S[i]下标最接近的元素并且有S[i] = S[j]，那么在以S[i]结尾的子串中，前一个元素是在S[0]~S[j-1]中的任何一个，都会和以S[j]结尾的子串中并且前一个元素是在S[0]~S[j-1]中的任何一个重复，因此这种情况下dp[i] = dp[j]+dp[j+1] + ... dp[i-1]。最后，返回的结果应该为sum(dp）

    def distinctSubseqII(self, S):
        """
        :type S: str
        :rtype: int
        """
        dp = [1] * len(S)
        for i in range(1,len(S)):
            for j in range(i-1,-1,-1):
                if S[i] != S[j]:
                    dp[i] += dp[j]
                else:
                    dp[i] += dp[j]
                    dp[i] -= 1
                    break
        #print dp
        return sum(dp) % (pow(10,9) + 7)

LeetCode 940 - Distinct Subsequences II

Example

Labels

Popular Posts