Massive Algorithms: LeetCode 820 - Short Encoding of Words

LeetCode 820 - Short Encoding of Words

https://leetcode.com/problems/short-encoding-of-words/description/

Given a list of words, we may encode it by writing a reference string S and a list of indexes A.

For example, if the list of words is ["time", "me", "bell"], we can write it as S = "time#bell#" and indexes = [0, 2, 5].

Then for each index, we will recover the word by reading from the reference string from that index until we reach a "#" character.

What is the length of the shortest reference string S possible that encodes the given words?

Example:

Input: words = ["time", "me", "bell"]
Output: 10
Explanation: S = "time#bell#" and indexes = [0, 2, 5].

Note:

1 <= words.length <= 2000.
1 <= words[i].length <= 7.
Each word has only lowercase letters.

https://leetcode.com/problems/short-encoding-of-words/discuss/125784/Trie-Solution

2. Insert all reversed words to the trie.
For example, for word "time", we insert 'e', 'm', 'i', 't' successively.

3. Return the sum of depth of all leaves.
One way is to write a simple recursive function to do it.
Here I save all last nodes of words and its depth to a list and all leaves should be in this list.
I iterate this list and all nodes without children are leaved.

    public int minimumLengthEncoding(String[] words) {
        TrieNode root = new TrieNode();
        List<TrieNode> leaves = new  ArrayList<TrieNode>();
        for (String w : new HashSet<>(Arrays.asList(words))) {
            TrieNode cur = root;
            for (int i = w.length() - 1; i >= 0; --i) {
                char j = w.charAt(i);
                if (!cur.next.containsKey(j)) cur.next.put(j, new TrieNode());
                cur = cur.next.get(j);
            }
            cur.depth = w.length() + 1;
            leaves.add(cur);
        }
        int res = 0;
        for (TrieNode leaf : leaves) if (leaf.next.isEmpty()) res += leaf.depth;
        return res;
    }

As in Approach #1, the goal is to remove words that are suffixes of another word in the list.

Algorithm

To find whether different words have the same suffix, let's put them backwards into a trie (prefix tree). For example, if we have "time" and "me", we will put "emit" and "em" into our trie.

After, the leaves of this trie (nodes with no children) represent words that have no suffix, and we will count sum(word.length + 1 for word in words).

Time Complexity: $O(\sum w_i)$ , where $w_i$ is the length of words[i].
Space Complexity: $O(\sum w_i)$ , the space used by the trie.

public int minimumLengthEncoding(String[] words) {

TrieNode trie = new TrieNode();

Map<TrieNode, Integer> nodes = new HashMap<>();

for (int i = 0; i < words.length; ++i) {

String word = words[i];

TrieNode cur = trie;

for (int j = word.length() - 1; j >= 0; --j)

cur = cur.get(word.charAt(j));

nodes.put(cur, i);

}

int ans = 0;

for (TrieNode node : nodes.keySet()) {

if (node.count == 0)

ans += words[nodes.get(node)].length() + 1;

}

return ans;

}

class TrieNode {

TrieNode[] children;

int count;

TrieNode() {

children = new TrieNode[26];

count = 0;

}

public TrieNode get(char c) {

if (children[c - 'a'] == null) {

children[c - 'a'] = new TrieNode();

count++;

}

return children[c - 'a'];

}

public int minimumLengthEncoding(String[] words) {

Trie trie = new Trie();

for (String word : words) {

trie.insert(new StringBuilder(word).reverse().toString());

}

return trie.getLongestWords().stream().map(str -> str.length() + 1).mapToInt(i -> (int) i).sum();

}

private static class TrieNode {

public TrieNode[] children = new TrieNode[26];

public boolean isEnd;

public boolean hasChild;

@Override

public String toString() {

return "TrieNode [children=" + Arrays.toString(children) + ", isEnd=" + isEnd + "]";

}

private static class Trie {

public TrieNode root = new TrieNode();

// em, emtia, will only contains emtia

private Set<String> longestWords = new HashSet<>();

public void insert(String word) {

TrieNode curr = root;

StringBuilder currWord = new StringBuilder();

for (char ch : word.toCharArray()) {

int index = ch - 'a';

TrieNode child = curr.children[index];

curr.hasChild = true;

if (child == null) {

child = new TrieNode();

curr.children[index] = child;

}

currWord.append(ch);

if (child.isEnd) {

longestWords.remove(currWord.toString());

// so we don't remove again, but it doesn't matter

// child.isEnd= false;

}

curr = child;

}

curr.isEnd = true;

// Different case: em, emit, or emit, em

if (!curr.hasChild) { // don't forget this BUG

longestWords.add(word);

}

public Set<String> getLongestWords() {

return longestWords;

}

@Override

public String toString() {

return "Trie [root=" + root + ", longestWords=" + longestWords + "]";

}

https://leetcode.com/problems/short-encoding-of-words/discuss/125869/Simple-Concept-using-Trie

Firstly, sort the Strings based on their length in the descending order.

For each word in the String insert the word in Trie from right to left. For "time" insert "emit" in Trie.

Notice, that for all the string that are suffixes of another string, you won't need to create a new TrieNode.

Keep track of whether a new node was created when inserting into the trie. If yes, then add this word into the reference String S else continue.

Ex:
words = ["time", "me", "bell"]

Sort in descending order of length
words = ["time", "bell", "me"]

Let ans = 0

Inserting "time" in the Trie, we insert "emit" and create 4 TrieNodes starting from the root. (ans = 4(length of "time")+1(1 for the #))

Similarly, while inserting "bell" we will create new TrieNodes (ans+= 4 + 1)

However, while inserting "me" we already have an edges from root labelled as "e" and a node from "e" labelled as "m" and hence no new trie node is created.

Hence we get ans=10.

    class TrieNode {
        TrieNode[] child = new TrieNode[26]; 
    }
    
    public boolean adder(String word, TrieNode root){
        boolean newNeed = false; 
        for(int i = word.length() -1; i>=0; --i){
            char c = word.charAt(i); 
            if(root.child[c-'a']==null){
                newNeed = true; 
                root.child[c-'a'] = new TrieNode(); 
            }
            root = root.child[c-'a']; 
        }
        return newNeed; 
    }
    
    public int minimumLengthEncoding(String[] words) {
        if(words == null || words.length < 1)return 0; 
        Arrays.sort(words, (a, b)->b.length() - a.length()); 
        TrieNode root = new TrieNode(); 
        int len = 0 ; 
        for(String word : words){
            if(adder(word, root)){
                len+=word.length()+1; 
            }
        }
        return len; 
    }

X. Store Prefixes
https://leetcode.com/problems/short-encoding-of-words/discuss/125811/C++JavaPython-Easy-Understood-Solution-with-Explanation

我们将words中的所有字符串都放在一个哈希表中，然后对于哈希表中的每个单词，在哈希表中删除掉它的所有后缀（因为它的后缀可以由它本身encode得到）。最后再计算出来哈希表中所有单词的长度和即可。

Base on @awice's idea. This solution is not my intuition but it is really simple to write, compared with Trie solution.

Build a set of words.
Iterate on all words and remove all suffixes of every word from the set.
Finally the set will the set of all encoding words.
Iterate on the set and return sum(word's length + 1 for every word in the set)

Complexity
O(NK^2) for time and 'O(NK)' for space.
It is really kind of K with K <= 7, almost ignorable.
I should have suggested for bigger 'K' cases.
I believe it will take more time for most people to solve this problem if we have a big K.

If a word X is a suffix of Y, then it does not need to be considered, as the encoding of Y in the reference string will also encode X. For example, if "me" and "time" is in words, we can throw out "me" without changing the answer.

If a word Y does not have any other word X (in the list of words) that is a suffix of Y, then Y must be part of the reference string.

Thus, the goal is to remove words from the list such that no word is a suffix of another. The final answer would be sum(word.length + 1 for word in words).

Algorithm

Since a word only has up to 7 suffixes (as words[i].length <= 7), let's iterate over all of them. For each suffix, we'll try to remove it from our words list. For efficiency, we'll make words a set.

Time Complexity: $O(\sum w_i^2)$ , where $w_i$ is the length of words[i].
Space Complexity: $O(\sum w_i)$ , the space used in storing suffixes.

    public int minimumLengthEncoding(String[] words) {
        Set<String> s = new HashSet<>(Arrays.asList(words));
        for (String w : words)
            for (int i = 1; i < w.length(); ++i)
                s.remove(w.substring(i));
        int res = 0;
        for (String w : s) res += w.length() + 1;
        return res;
    }

X. Pre-Sort

https://www.cnblogs.com/seyjs/p/9626129.html

本题考查就是找出一个单词是不是另外一个单词的后缀，如果是的话，就可以Short Encode。所以，我们可以把words中每个单词倒置后排序，然后遍历数组，每个元素只要和其后面相邻的元素比较，如果是后缀则被Short Encode，否则不行。

https://leetcode.com/problems/short-encoding-of-words/discuss/227514/Reverse-the-words-and-sort-them.-Beats-100-Python

    def minimumLengthEncoding(self, words):
        reversed_sorted_words = sorted([word[::-1] for word in words])
  
  # The longest word is always present. So include its length
        min_len = len(reversed_sorted_words[-1]) + 1
        
        for i in range(len(reversed_sorted_words)-1):
            
            word = reversed_sorted_words[i]
            next_word = reversed_sorted_words[i+1]            
            
            if not next_word.startswith(word):
                min_len += len(word) + 1
        
        return min_len

X. https://leetcode.com/problems/short-encoding-of-words/discuss/126239/Java-5-lines-no-reverse-using-index-check

 public int minimumLengthEncoding(String[] words) {
    StringBuilder sb = new StringBuilder();
    Arrays.sort(words, (x,y) -> (y.length()- x.length()));
    for(int i = 0; i<words.length; i++)
    {
        if(sb.toString().indexOf(words[i]+"#") == -1)
            sb.append(words[i]).append("#");
    }
    return sb.toString().length();
}

First of all, sort by length in descending order, since the long ones could contain the short ones, but sort ones cannot contains the longs.
Then keep probing to see if next word is in the result or not.
It won't be too efficient though given using the indexOf method, which is scanning the whole string.

X. Brute Force
https://blog.csdn.net/qq2667126427/article/details/80056268

public int minimumLengthEncoding(String[] words) {

if (words == null || words.length == 0) {

return 1;

}

// 使用set对字符串去重

HashSet<String> strings = new HashSet<>();

// 添加数据，自动去重

Collections.addAll(strings, words);

Iterator<String> strs = strings.iterator();

int size = 0;

// 全部放入到原来的words中，其实新建一个大小为strings.size()的String数组是

// 个更好的选择，为了时间不管了，运行中也没有报错。幸好没有要求给出对应的下标数组

while (strs.hasNext()) {

words[size++] = strs.next();

}

int result = 0, add;

for (int i = 0; i < size; i++) {

// 表示本字符串可以给最终字符串增加的长度

add = words[i].length() + 1;

for (int j = 0; j < size; j++) {

// 如果这个字符串是别的字符串的后缀，那么这个字符串就可以用其表示

// 而不需要在主串后再加一段

if (i != j && words[j].endsWith(words[i])) {

add = 0;

break;

}

result += add;

}

return result;

}

https://leetcode.com/problems/short-encoding-of-words/discuss/125822/C%2B%2B-4-lines-reverse-and-sort

If we reverse all strings and then sort the vector, then strings will appear as ["em", "emit", "bell"]. Now we just need to check if the next string starts with the previous one, so we can encode it.

int minimumLengthEncoding(vector<string>& ws, int res = 0) {
    for (auto i = 0; i < ws.size(); ++i) reverse(ws[i].begin(), ws[i].end());
    sort(ws.begin(), ws.end());
    for (auto i = 0; i < ws.size() - 1; ++i) res += ws[i] == ws[i + 1].substr(0, ws[i].size()) ? 0 : ws[i].size() + 1;
    return res + ws[ws.size() - 1].size() + 1;
}

Insted of sorting, we can load reversed string in a set. A benefit of this approach is that we it automatically eliminates duplicate strings. However, it does not work faster than the first solution, probably due to the BST overhead.
I would recommend though using this approach if we expect loads of duplicated strings.

int minimumLengthEncoding(vector<string>& ws, int res = 0) {
    for (auto i = 0; i < ws.size(); ++i) reverse(ws[i].begin(), ws[i].end());
    set<string> s(ws.begin(), ws.end());
    for (auto it = s.begin(); next(it) != s.end(); ++it) res += *it == next(it)->substr(0, it->size()) ? 0 : it->size() + 1;
    return res + s.rbegin()->size() + 1;
}

We can avoid reversing by using a custom comparer, and check endings instead of beginnings. However, the code is more complicated and it does not work faster, given the current test cases (average runtime is 36 ms for both).
I think that this approach will work best if we have a lot of long strings that differts drammatically from each other.

struct reverseComparer {
  bool operator()(const string& a, const string& b) {
    for (auto pa = a.rbegin(), pb = b.rbegin(); pa != a.rend() && pb != b.rend(); ++pa, ++pb) if (*pa != *pb) return *pa < *pb;
    return a.size() < b.size();
  }
};
int minimumLengthEncoding(vector<string>& ws, int res = 0) {
  sort(ws.begin(), ws.end(), reverseComparer());
  for (auto i = 0; i < ws.size() - 1; ++i) res += ws[i].size() <= ws[i + 1].size() 
    && ws[i] == ws[i + 1].substr(ws[i + 1].size() - ws[i].size()) ? 0 : ws[i].size() + 1;
  return res + ws[ws.size() - 1].size() + 1;
}

LeetCode 820 - Short Encoding of Words

Labels

Popular Posts