Massive Algorithms: Re-Space - Cracking Coding Interview

Re-Space - Cracking Coding Interview

Oh, no! You have accidentally removed all spaces, punctuation, and capitalization in a

lengthy document. A sentence like "I reset the c omputer. It still didn't boot!"

became "iresetthec omputeri tstilldidntboot''. You'll deal with the punctuation and capitalization

later; right now you need to re-insert the spaces. Most of the words are in a dictionary but

a few are not. Given a dictionary (a list of strings) and the document (a string), design an algorithm

to unconcatenate the document in a way that minimizes the number of unrecognized characters.

EXAMPLE

Input jesslookedjustliketimherbrother

Output: jess looked just like tim her brother (7 unrecognized characters)

http://www.cnblogs.com/EdwardLiu/p/4356967.html

用一个int[] res = new int[len+1]; res[i] refers to minimized # of unrecognized chars in first i chars, res[0]=0, res[len]即为所求。

有了维护量，现在需要考虑转移方程，如下：

int unrecogNum = dict.contains(s.substring(j, i))? 0 : i-j; //看index从j到i-1的substring在不在dictionary里，如果不在，unrecogNum=j到i-1的char数
res[i] = Math.min(res[i], res[j]+unrecogNum);

 9         if (s==null || s.length()==0) return 0;
10         int len = s.length();
11         if (dict.isEmpty()) return len;
12         int[] res = new int[len+1]; // res[i] refers to minimized # of unrecognized chars in first i chars
13         Arrays.fill(res, Integer.MAX_VALUE);
14         res[0] = 0;
15         for (int i=1; i<=len; i++) {
16             for (int j=0; j<i; j++) {
17                 String str = s.substring(j, i);
18                 int unrecogNum = dict.contains(str)? 0 : i-j;
19                 res[i] = Math.min(res[i], res[j]+unrecogNum);
20             }
21         }
22         return res[len];
23     }

http://www.shuati123.com/blog/2014/10/01/optimal-unconcatenate-doc/

public static int parse(String doc, Trie dict) {
    int len = doc.length();
    int[] dp = new int[len + 1];
    // dp[i] denotes the number of special chars in first i chars of docs
    for (int i = 1; i <= len; i++) {
        dp[i] =  Integer.MAX_VALUE;
        for (int j = 0; j < i; j++) {
            String str = doc.substring(j, i);
            if (dict.contains(str, true)) {
                // consider (i to j) a valid word
                dp[i] = Math.min(dp[i], dp[j]);
            } else {
                // consider (i to j) special chars
                dp[i] = Math.min(dp[i], dp[j] + i - j);
            }
        }
    }
    return dp[len];
}

http://www.cnblogs.com/grandyang/p/5448539.html

1. 一些递归重复计算了，我们可以使用哈希表来保存之前计算好的最优拆分了，那么在之后的递归中遇到了直接取出来用就行了。

2. 在有些情况下我们可以预测一种拆分方式是否会生成无效单词，比如当我们拆分单词xten，没有单词是以xt开头的，但是当前的算法会计算xt+p(en)，xten+p(n)和xten。每一次我们都会发现这样的单词不存在，这样我们直接在x后面加一个空格就可以了，那么我们怎么样知道没有单词是以xt开始的呢，用前缀树trie，参见parseOptimized()函数。

我们使用哈希表来缓存结果，关键字是单词的起始位置，我们保存了最佳的位置来拆分余下的字符串。我们可以让代码返回拆分好的字符串，但是比较麻烦。我们使用一个外包类Result来返回无效字符的个数和最优字符串，而在C++中就不用这样，因为可以通过引用传递。

O(N^2)
public static String bestSplit(HashSet<String> dictionary, String sentence) {
ParseResult[] memo = new ParseResult[sentence.length()];
ParseResult r = split(dictionary, sentence, 0, memo);
return r == null ? null : r.parsed;
}

public static ParseResult split(HashSet<String> dictionary, String sentence, int start, ParseResult[] memo) {
if (start >= sentence.length()) {
return new ParseResult(0, "");
} if (memo[start] != null) {
return memo[start];
}

int bestInvalid = Integer.MAX_VALUE;
String bestParsing = null;

String partial = "";
int index = start;
while (index < sentence.length()) {
char c = sentence.charAt(index);
partial += c;
int invalid = dictionary.contains(partial) ? 0 :
partial.length();
if (invalid < bestInvalid) { // Short circuit
/* Recurse, putting a space after this character. If this
* is better than the current best option, replace the best
* option. */
ParseResult result = split(dictionary, sentence, index + 1, memo);
if (invalid + result.invalid < bestInvalid) {
bestInvalid = invalid + result.invalid;
bestParsing = partial + " " + result.parsed;
if (bestInvalid == 0) break; // Short circuit
}
}

index++;
}
memo[start] = new ParseResult(bestInvalid, bestParsing);
return memo[start];
}

public static String clean(String str) {
char[] punctuation = {',', '"', '!', '.', '\'', '?', ','};
for (char c : punctuation) {
str = str.replace(c, ' ');
}
return str.replace(" ", "").toLowerCase();
}

Brute Force:
public static String bestSplit(HashSet<String> dictionary, String sentence) {
ParseResult r = split(dictionary, sentence, 0);
return r == null ? null : r.parsed;
}

public static ParseResult split(HashSet<String> dictionary, String sentence, int start) {
if (start >= sentence.length()) {
return new ParseResult(0, "");
}

int bestInvalid = Integer.MAX_VALUE;
String bestParsing = null;

String partial = "";
int index = start;
while (index < sentence.length()) {
char c = sentence.charAt(index);
partial += c;
int invalid = dictionary.contains(partial) ? 0 :
partial.length();
if (invalid < bestInvalid) { // Short circuit
/* Recurse, putting a space after this character. If this
* is better than the current best option, replace the best
* option. */
ParseResult result = split(dictionary, sentence, index + 1);
if (invalid + result.invalid < bestInvalid) {
bestInvalid = invalid + result.invalid;
bestParsing = partial + " " + result.parsed;
if (bestInvalid == 0) break;
}
}

index++;
}
return new ParseResult(bestInvalid, bestParsing);
}

public static String clean(String str) {
char[] punctuation = {',', '"', '!', '.', '\'', '?', ','};
for (char c : punctuation) {
str = str.replace(c, ' ');
}
return str.replace(" ", "").toLowerCase();
}

Re-Space - Cracking Coding Interview

Labels

Popular Posts