Massive Algorithms: Google – Plate and Dictionary

Given a US license plate and a dictionary, find the shortest word in the dictionary that contains all the English letters on the plate (the plate also contains digits).

http://chuansong.me/n/449596249789

License plate RO 1287, ["rolling", "real", "WhaT", "rOad"] => "rOad"

follow up:

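(1) If the dictionary contains millions of words, how can the lookup be sped up?

(2) If the dictionary contains millions of words and you are given thousands of license plates and must return the shortest matching word for each, how would you optimize?

With N words and a plate of length M, find an O(N + M) algorithm.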

[Solution]
[Solution #1]
Brute force is simple: compare the plate against each word in the dictionary, one by one.
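A minimal sketch of the brute-force scan, for reference. The class and method names (`BruteForcePlate`, `containsAllLetters`, `shortestWord`) are made up for illustration, and the sketch assumes matching is case-insensitive and that repeated letters on the plate do not need to repeat in the word.

```java
import java.util.List;

class BruteForcePlate {
    // True if `word` contains every letter that appears on `plate`.
    // Case is ignored; digits and spaces on the plate are skipped.
    static boolean containsAllLetters(String word, String plate) {
        String lowerWord = word.toLowerCase();
        for (char c : plate.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z' && lowerWord.indexOf(c) < 0) {
                return false;
            }
        }
        return true;
    }

    // Scans the whole dictionary and keeps the shortest matching word
    // (the first one seen wins on ties). Returns null if nothing matches.
    static String shortestWord(List<String> dict, String plate) {
        String best = null;
        for (String word : dict) {
            if (containsAllLetters(word, plate)
                    && (best == null || word.length() < best.length())) {
                best = word;
            }
        }
        return best;
    }
}
```

On the example above, `shortestWord` with the plate "RO 1287" returns "rOad". Every query touches every character of every word, which is what the follow-ups try to avoid.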
[Solution #2]
Another approach, similar to Google – Shortest Words Contain Input String, is to build an inverted index.
Building the inverted index takes O(LEN) time, where LEN is the sum of the lengths of all words in the dictionary.
Query time is O(len + LEN), where len is the length of the plate. The worst case occurs when every word in the dictionary contains the plate's letters.
[Solution #3]
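There is also a rather clever method.

Map each of the 26 letters to a distinct prime number.

For each word, multiply together the primes of its letters to get a product, then build a String -> product map over the dictionary.

At query time, compute the plate's product the same way and scan the dictionary: any word whose product is divisible by the plate's product contains the plate's letters.

The query time is then O(len + N), provided the per-word products are precomputed. Watch out for overflow in the products, though.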

  import java.util.HashMap;
  import java.util.Map;

  public String plateAndDict(String[] dict, String plate) {
    Map<Character, Integer> primeMap = new HashMap<>();
    initMap(primeMap);
    plate = plate.toLowerCase();
    // Product of the primes of the plate's letters; digits and spaces are skipped.
    long platePrime = 1L;
    for (char c : plate.toCharArray()) {
      if (primeMap.containsKey(c)) {
        platePrime *= primeMap.get(c);
      }
    }
    String result = null;
    for (String word : dict) {
      String lowerWord = word.toLowerCase();
      // Caution: this long product can overflow for long words.
      long wordPrime = 1L;
      for (char c : lowerWord.toCharArray()) {
        if (primeMap.containsKey(c)) {
          wordPrime *= primeMap.get(c);
        }
      }
      // The word covers the plate when its product is divisible by the plate's product.
      if (wordPrime % platePrime == 0 && (result == null || word.length() < result.length())) {
        result = word;
      }
    }
    return result;
  }

  // Maps 'a'..'z' to the first 26 primes (2, 3, 5, ..., 101).
  private void initMap(Map<Character, Integer> primeMap) {
    primeMap.put('a', 2);
    int prev = 2;
    for (char i = 'b'; i <= 'z'; i++) {
      int curr = nextPrime(prev);
      primeMap.put(i, curr);
      prev = curr;
    }
  }

  // Smallest prime greater than n (for n >= 2).
  private int nextPrime(int n) {
    int result = n + 1;
    while (result % 2 == 0 || !isPrime(result)) {
      result++;
    }
    return result;
  }

  // Trial division up to sqrt(n).
  private boolean isPrime(int n) {
    for (int i = 2; i * i <= n; i++) {
      if (n % i == 0) {
        return false;
      }
    }
    return true;
  }
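The overflow caveat is real: with 'z' mapped to 101, a word containing ten z's already has a product of 101^10, which exceeds the range of a `long`. One possible way around it, sketched below, is to store the products as `BigInteger` and to precompute them once per dictionary, as the String -> product map in the description suggests. The class name `PrimeProductIndex` and its methods are assumptions made for this sketch, not part of the original solution.

```java
import java.math.BigInteger;
import java.util.LinkedHashMap;
import java.util.Map;

class PrimeProductIndex {
    // The first 26 primes, one per letter a..z.
    private static final int[] PRIMES = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
    };

    // word -> product of the primes of its letters, computed once at build time.
    private final Map<String, BigInteger> products = new LinkedHashMap<>();

    PrimeProductIndex(String[] dict) {
        for (String word : dict) {
            products.put(word, productOf(word));
        }
    }

    // Multiplies the primes of all letters in s; non-letters are ignored.
    static BigInteger productOf(String s) {
        BigInteger product = BigInteger.ONE;
        for (char c : s.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                product = product.multiply(BigInteger.valueOf(PRIMES[c - 'a']));
            }
        }
        return product;
    }

    // Scans the precomputed products; a word matches when the plate's
    // product divides the word's product. Returns null if nothing matches.
    String query(String plate) {
        BigInteger plateProduct = productOf(plate);
        String best = null;
        for (Map.Entry<String, BigInteger> e : products.entrySet()) {
            boolean divisible = e.getValue().mod(plateProduct).signum() == 0;
            if (divisible && (best == null || e.getKey().length() < best.length())) {
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Because repeated letters multiply in more than once, divisibility also respects letter counts: a plate with two Z's only matches words with at least two z's. The price is that `BigInteger` arithmetic is slower than `long` arithmetic, and each query is still a full scan over N products.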

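A question for the instructor: the discussion in the original thread mentions mapping each letter to a prime, so that a word is the product of the primes of its letters. A dictionary word whose product is divisible by the input string's product contains that input, and then you just take the shortest. But it seems that each query still has to scan the whole dictionary, so the complexity is not actually reduced.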

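Some students think this problem can be solved with a trie, but a trie does not actually work here.

The problem only asks that the word contain all the letters; it imposes no order. Going by the example, "orad" would also contain "RO".

First, in this type of problem, where a dictionary is given, you need to preprocess the dictionary, since it does not change from query to query. And since this is a Google question, it most likely has to do with inverted indexes. An intuitive solution is to build one.

Take the word "Access": it contains the four letters a, c, e, s, so put Access into the posting lists for a, c, e, and s.

Now suppose a plate arrives with the letters 'ace'. How do you find Access? Pull the posting lists for a, c, and e, then merge them to find the words that appear in all three. To speed this up, sort each posting list at build time by word length, and alphabetically among words of the same length. Then the first word found during the merge that appears in all three lists is the answer.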
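The single-letter inverted index and the merge step can be sketched as follows. This is only one way to realize the idea: the class name `LetterInvertedIndex` is hypothetical, words are identified by their position in the length-sorted array, and the sketch assumes case-insensitive matching with repeated plate letters counted once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class LetterInvertedIndex {
    private final String[] words; // sorted by length, then alphabetically
    // postings.get(letter) holds ascending ids (positions in `words`).
    private final List<List<Integer>> postings = new ArrayList<>();

    LetterInvertedIndex(String[] dict) {
        words = dict.clone();
        Comparator<String> order = Comparator.comparingInt(String::length);
        order = order.thenComparing(String.CASE_INSENSITIVE_ORDER);
        Arrays.sort(words, order);
        for (int i = 0; i < 26; i++) postings.add(new ArrayList<>());
        for (int id = 0; id < words.length; id++) {
            boolean[] seen = new boolean[26];
            for (char c : words[id].toLowerCase().toCharArray()) {
                if (c >= 'a' && c <= 'z' && !seen[c - 'a']) {
                    seen[c - 'a'] = true;
                    postings.get(c - 'a').add(id);
                }
            }
        }
    }

    // Intersects the posting lists of the plate's letters; the smallest
    // common id is the shortest matching word. Returns null if none exists.
    String query(String plate) {
        List<List<Integer>> lists = new ArrayList<>();
        boolean[] seen = new boolean[26];
        for (char c : plate.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z' && !seen[c - 'a']) {
                seen[c - 'a'] = true;
                lists.add(postings.get(c - 'a'));
            }
        }
        if (lists.isEmpty()) return words.length > 0 ? words[0] : null;
        int[] pos = new int[lists.size()];
        int candidate = 0;
        while (true) {
            boolean allMatch = true;
            for (int i = 0; i < lists.size(); i++) {
                List<Integer> list = lists.get(i);
                // Skip ids smaller than the current candidate.
                while (pos[i] < list.size() && list.get(pos[i]) < candidate) pos[i]++;
                if (pos[i] == list.size()) return null; // one list is exhausted
                if (list.get(pos[i]) > candidate) {
                    candidate = list.get(pos[i]);       // raise the candidate and recheck
                    allMatch = false;
                }
            }
            if (allMatch) return words[candidate];
        }
    }
}
```

Because ids follow the length order, the merge can stop at the first id common to all lists instead of computing the full intersection.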

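You may realize that this method is still slow in some cases. With a million words and only 26 distinct letters, each letter's list holds at least 1M/26 words on average (more in practice, since every word lands in several lists), so the merge is not efficient. How can this be fixed? The algorithm above uses a single letter as the key, so the denominator in 1M/26 is small. We need to make that denominator bigger. The answer is to use 2 letters as the key.

Taking Access as an example, its letters are a, c, e, s, so use ac, ae, as, ce, cs, es as inverted-index keys and put Access into the posting lists of these 6 keys. To find which words contain ace, merge the lists for ac and e. That is faster than merging the three lists for a, c, and e.

Going further, you can use 3 letters as the key, access => {ace, acs, ces, aes}, which finds ace in one lookup with no merging.

On further analysis, this is a tradeoff: more letters per key is not always better. If the average word length is L, there are C(L,1) 1-letter keys, C(L,2) 2-letter keys (on the order of L^2), C(L,3) 3-letter keys, and so on.

So the more letters in the key, the more posting lists each word is copied into, and the more storage is needed. Since plates do not have many letters, I think keys of up to about 3 letters are enough.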
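To make the key enumeration concrete, here is a simplified sketch that generates every key of 1 to 3 distinct letters for each word. It departs from the description in one respect, which is an assumption of this sketch: instead of a full posting list it keeps only the shortest word per key, so it can answer plates with at most 3 distinct letters by a single lookup and does no merging. `MultiLetterKeyIndex` and its methods are illustrative names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

class MultiLetterKeyIndex {
    private static final int MAX_KEY_LETTERS = 3;
    // key (sorted distinct letters) -> shortest word containing all of them
    private final Map<String, String> shortestByKey = new HashMap<>();

    MultiLetterKeyIndex(String[] dict) {
        for (String word : dict) {
            addKeys(distinctSortedLetters(word), 0, new StringBuilder(), word);
        }
    }

    // Distinct letters of s in alphabetical order, e.g. "Access" -> "aces".
    static String distinctSortedLetters(String s) {
        TreeSet<Character> set = new TreeSet<>();
        for (char c : s.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') set.add(c);
        }
        StringBuilder sb = new StringBuilder();
        for (char c : set) sb.append(c);
        return sb.toString();
    }

    // Enumerates every combination of 1..MAX_KEY_LETTERS letters and
    // records `word` under it if it is shorter than the current entry.
    private void addKeys(String letters, int start, StringBuilder key, String word) {
        if (key.length() > 0) {
            shortestByKey.merge(key.toString(), word,
                (oldWord, newWord) -> newWord.length() < oldWord.length() ? newWord : oldWord);
        }
        if (key.length() == MAX_KEY_LETTERS) return;
        for (int i = start; i < letters.length(); i++) {
            key.append(letters.charAt(i));
            addKeys(letters, i + 1, key, word);
            key.deleteCharAt(key.length() - 1);
        }
    }

    // Direct lookup; this sketch only handles plates with 1..3 distinct letters.
    String query(String plate) {
        String key = distinctSortedLetters(plate);
        if (key.isEmpty() || key.length() > MAX_KEY_LETTERS) {
            throw new IllegalArgumentException(
                "sketch supports 1.." + MAX_KEY_LETTERS + " distinct letters");
        }
        return shortestByKey.get(key);
    }
}
```

For "Access" this records C(4,1) + C(4,2) + C(4,3) = 14 keys, which illustrates the storage cost discussed above. Plates with more distinct letters would still need the merge step over per-key posting lists.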

http://skyzjkang.blogspot.com/2015/04/license-plate-dictionary.html
Original post: http://www.meetqun.com/thread-2802-1-1.html
http://www.meetqun.com/forum.php?mod=viewthread&tid=4901&ctid=41

Problem:

“AC1234R” => CAR, ARC | CART, CARED not the shortest

“OR4567S” => SORT, SORE | SORTED valid, not the shortest | OR is not valid, missing S

Google phone interview question: 04/13/2015

1 <= letters <= 7

O(100) license plate lookups

O(4M) words in the dictionary

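d1 = abc 000...0111

d2 = bcd 000...1110

s1 = ac1234r 00001...101 (bits for a, c, r)

(s & d) == s means word d contains every letter of plate s

(s1 & d1) != s1 and (s1 & d2) != s1, since neither word contains r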

Code:

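 #include <cctype>
 #include <string>
 #include <vector>
 using namespace std;

 // Builds a 26-bit mask with one bit per letter present in s
 // (case-insensitive; digits and other characters are skipped).
 int convert2Num(const string& s) {
     int res = 0;
     for (size_t i = 0; i < s.size(); ++i) {
         unsigned char c = s[i];
         if (isalpha(c)) {
             res |= 1 << (tolower(c) - 'a');
         }
     }
     return res;
 }

 string find_shortest_word(const string& license, const vector<string>& words) {
     int lis_num = convert2Num(license);
     string res;
     for (size_t i = 0; i < words.size(); ++i) {
         int word_num = convert2Num(words[i]); // conversion
         // Parentheses are required: == binds tighter than &.
         if ((lis_num & word_num) == lis_num) {
             if (res.empty() || res.size() > words[i].size())
                 res = words[i]; // brute force; sorting by length could reduce the comparisons
         }
     }
     return res;
 }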

<key, value> = <conversion number, shortest word with that conversion number>

abc and abccc share the same conversion number, so only <000...111, abc> is kept.
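The preprocessing idea in these notes (keep only the shortest word per bitmask) might look like the sketch below. `MaskIndex`, `toMask`, and `size` are hypothetical names; the sketch assumes case-insensitive matching and ignores letter counts, as the bitmask itself does.

```java
import java.util.HashMap;
import java.util.Map;

class MaskIndex {
    // bitmask of letters -> shortest dictionary word with exactly that mask
    private final Map<Integer, String> shortestByMask = new HashMap<>();

    MaskIndex(String[] dict) {
        for (String word : dict) {
            shortestByMask.merge(toMask(word), word,
                (oldWord, newWord) -> newWord.length() < oldWord.length() ? newWord : oldWord);
        }
    }

    // One bit per letter present in s (bit 0 = 'a', bit 25 = 'z').
    static int toMask(String s) {
        int mask = 0;
        for (char c : s.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                mask |= 1 << (c - 'a');
            }
        }
        return mask;
    }

    // Number of distinct masks kept, at most the number of words.
    int size() {
        return shortestByMask.size();
    }

    // Scans the distinct masks instead of all words. Returns null if
    // no mask covers the plate; ties are broken arbitrarily.
    String query(String plate) {
        int plateMask = toMask(plate);
        String best = null;
        for (Map.Entry<Integer, String> e : shortestByMask.entrySet()) {
            if ((e.getKey() & plateMask) == plateMask
                    && (best == null || e.getValue().length() < best.length())) {
                best = e.getValue();
            }
        }
        return best;
    }
}
```

Each query now costs one integer AND per distinct mask rather than one pass over every word's characters. How much this saves depends on how many dictionary words share the same letter set.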

Follow up:
dictionary = { “BAZ”, “FIZZ”, “BUZZ” } | BAZ only has one Z

vector<int> map(26, 0);

a -> 0, z -> 25;

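map[s-'a']++;

need[26];

Compare the two count arrays letter by letter: a word is rejected when its count of some letter is smaller than the plate's count.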

Code:

 #include <algorithm>

 bool compare(const string& s1, const string& s2) {
     return s1.size() < s2.size();
 }

 string find_shortest_string(string license, vector<string> words) {
     // Shortest words first, so the first match is the answer.
     sort(words.begin(), words.end(), compare);
     int lic_num = convert2Num(license);
     vector<int> map(26, 0); // letter counts required by the plate
     for (unsigned char c : license) {
         if (isalpha(c)) map[tolower(c) - 'a']++;
     }

     for (size_t i = 0; i < words.size(); ++i) {
         int word_num = convert2Num(words[i]);
         // First check that it has all the letters, then check the counts
         if ((lic_num & word_num) != lic_num) continue;

         vector<int> word_map(26, 0); // letter counts of this word
         for (unsigned char c : words[i]) {
             if (isalpha(c)) word_map[tolower(c) - 'a']++;
         }
         int j = 0;
         for (; j < 26; ++j) {
             if (word_map[j] < map[j]) break;
         }
         if (j == 26) return words[i];
     }
     return ""; // no word covers the plate
 }

Follow up:

find_shortest_word(“12345ZZ”, dictionary) 50% => “FIZZ” | 50% => “BUZZ”

Simple approach: collect the tied words, then pick one at random:

vector<string> s = {};

s[rand() % 2];

Alternative: reservoir sampling, which needs no extra list:

FIZZ: s = FIZZ; count = 1;

BUZZ: count++; count = 2;

rand() % count == 0 : s = BUZZ;
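The reservoir-sampling notes can be turned into a short sketch. The names `RandomShortestWord`, `covers`, and `pick` are assumptions for illustration; `covers` uses letter counts so that a plate with two Z's needs a word with two z's, as in the BAZ follow-up.

```java
import java.util.Random;

class RandomShortestWord {
    // True if `word` has at least as many of each letter as `plate`
    // (case-insensitive; non-letters are ignored).
    static boolean covers(String word, String plate) {
        int[] balance = new int[26];
        for (char c : word.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') balance[c - 'a']++;
        }
        for (char c : plate.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z' && --balance[c - 'a'] < 0) return false;
        }
        return true;
    }

    // One pass over the dictionary. Among the shortest matching words,
    // each is returned with equal probability (reservoir of size 1).
    static String pick(String[] dict, String plate, Random rng) {
        String chosen = null;
        int ties = 0; // how many words of the current best length were seen
        for (String word : dict) {
            if (!covers(word, plate)) continue;
            if (chosen == null || word.length() < chosen.length()) {
                chosen = word; // strictly shorter: restart the reservoir
                ties = 1;
            } else if (word.length() == chosen.length()) {
                ties++;
                // Replace the current choice with probability 1/ties.
                if (rng.nextInt(ties) == 0) chosen = word;
            }
        }
        return chosen;
    }
}
```

With the dictionary { "BAZ", "FIZZ", "BUZZ" } and the plate "12345ZZ", BAZ is rejected and FIZZ and BUZZ are each chosen about half of the time. No list of tied candidates is stored, which matters when the dictionary is streamed or very large.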
