Related:
[CC150v5] 17.14 Optimal Way to Unconcatenate Doc
https://github.com/careercup/CtCI-6th-Edition/tree/master/Java/Ch%2017.%20Hard/Q17_13_ReSpace
[CC150v5] 17.14 Optimal Way to Unconcatenate Doc
https://github.com/careercup/CtCI-6th-Edition/tree/master/Java/Ch%2017.%20Hard/Q17_13_ReSpace
Oh, no! You have accidentally removed all spaces, punctuation, and capitalization in a
lengthy document. A sentence like "I reset the c omputer. It still didn't boot!"
became "iresetthec omputeri tstilldidntboot''. You'll deal with the punctuation and capitalization
later; right now you need to re-insert the spaces. Most of the words are in a dictionary but
a few are not. Given a dictionary (a list of strings) and the document (a string), design an algorithm
to unconcatenate the document in a way that minimizes the number of unrecognized characters.
EXAMPLE
Input jesslookedjustliketimherbrother
Output: jess looked just like tim her brother (7 unrecognized characters)
http://www.cnblogs.com/EdwardLiu/p/4356967.html
用一个int[] res = new int[len+1]; res[i] refers to minimized # of unrecognized chars in first i chars, res[0]=0, res[len]即为所求。
有了维护量,现在需要考虑转移方程,如下:
int unrecogNum = dict.contains(s.substring(j, i))? 0 : i-j; //看index从j到i-1的substring在不在dictionary里,如果不在,unrecogNum=j到i-1的char数
res[i] = Math.min(res[i], res[j]+unrecogNum);
res[i] = Math.min(res[i], res[j]+unrecogNum);
9 if (s==null || s.length()==0) return 0; 10 int len = s.length(); 11 if (dict.isEmpty()) return len; 12 int[] res = new int[len+1]; // res[i] refers to minimized # of unrecognized chars in first i chars 13 Arrays.fill(res, Integer.MAX_VALUE); 14 res[0] = 0; 15 for (int i=1; i<=len; i++) { 16 for (int j=0; j<i; j++) { 17 String str = s.substring(j, i); 18 int unrecogNum = dict.contains(str)? 0 : i-j; 19 res[i] = Math.min(res[i], res[j]+unrecogNum); 20 } 21 } 22 return res[len]; 23 }
public static int parse(String doc, Trie dict) {
int len = doc.length();
int[] dp = new int[len + 1];
// dp[i] denotes the number of special chars in first i chars of docs
for (int i = 1; i <= len; i++) {
dp[i] = Integer.MAX_VALUE;
for (int j = 0; j < i; j++) {
String str = doc.substring(j, i);
if (dict.contains(str, true)) {
// consider (i to j) a valid word
dp[i] = Math.min(dp[i], dp[j]);
} else {
// consider (i to j) special chars
dp[i] = Math.min(dp[i], dp[j] + i - j);
}
}
}
return dp[len];
}
http://www.cnblogs.com/grandyang/p/5448539.html
1. 一些递归重复计算了,我们可以使用哈希表来保存之前计算好的最优拆分了,那么在之后的递归中遇到了直接取出来用就行了。
2. 在有些情况下我们可以预测一种拆分方式是否会生成无效单词,比如当我们拆分单词xten,没有单词是以xt开头的,但是当前的算法会计算xt+p(en),xten+p(n)和xten。每一次我们都会发现这样的单词不存在,这样我们直接在x后面加一个空格就可以了,那么我们怎么样知道没有单词是以xt开始的呢,用前缀树trie,参见parseOptimized()函数。
我们使用哈希表来缓存结果,关键字是单词的起始位置,我们保存了最佳的位置来拆分余下的字符串。我们可以让代码返回拆分好的字符串,但是比较麻烦。我们使用一个外包类Result来返回无效字符的个数和最优字符串,而在C++中就不用这样,因为可以通过引用传递。
public static String bestSplit(HashSet<String> dictionary, String sentence) {
ParseResult[] memo = new ParseResult[sentence.length()];
ParseResult r = split(dictionary, sentence, 0, memo);
return r == null ? null : r.parsed;
}
public static ParseResult split(HashSet<String> dictionary, String sentence, int start, ParseResult[] memo) {
if (start >= sentence.length()) {
return new ParseResult(0, "");
} if (memo[start] != null) {
return memo[start];
}
int bestInvalid = Integer.MAX_VALUE;
String bestParsing = null;
String partial = "";
int index = start;
while (index < sentence.length()) {
char c = sentence.charAt(index);
partial += c;
int invalid = dictionary.contains(partial) ? 0 :
partial.length();
if (invalid < bestInvalid) { // Short circuit
/* Recurse, putting a space after this character. If this
* is better than the current best option, replace the best
* option. */
ParseResult result = split(dictionary, sentence, index + 1, memo);
if (invalid + result.invalid < bestInvalid) {
bestInvalid = invalid + result.invalid;
bestParsing = partial + " " + result.parsed;
if (bestInvalid == 0) break; // Short circuit
}
}
index++;
}
memo[start] = new ParseResult(bestInvalid, bestParsing);
return memo[start];
}
public static String clean(String str) {
char[] punctuation = {',', '"', '!', '.', '\'', '?', ','};
for (char c : punctuation) {
str = str.replace(c, ' ');
}
return str.replace(" ", "").toLowerCase();
}
Brute Force:
public static String bestSplit(HashSet<String> dictionary, String sentence) {
ParseResult r = split(dictionary, sentence, 0);
return r == null ? null : r.parsed;
}
public static ParseResult split(HashSet<String> dictionary, String sentence, int start) {
if (start >= sentence.length()) {
return new ParseResult(0, "");
}
int bestInvalid = Integer.MAX_VALUE;
String bestParsing = null;
String partial = "";
int index = start;
while (index < sentence.length()) {
char c = sentence.charAt(index);
partial += c;
int invalid = dictionary.contains(partial) ? 0 :
partial.length();
if (invalid < bestInvalid) { // Short circuit
/* Recurse, putting a space after this character. If this
* is better than the current best option, replace the best
* option. */
ParseResult result = split(dictionary, sentence, index + 1);
if (invalid + result.invalid < bestInvalid) {
bestInvalid = invalid + result.invalid;
bestParsing = partial + " " + result.parsed;
if (bestInvalid == 0) break;
}
}
index++;
}
return new ParseResult(bestInvalid, bestParsing);
}
public static String clean(String str) {
char[] punctuation = {',', '"', '!', '.', '\'', '?', ','};
for (char c : punctuation) {
str = str.replace(c, ' ');
}
return str.replace(" ", "").toLowerCase();
}