Programming Interview Questions 28: Longest Compound Word | Arden DertatArden Dertat
Using Trie, which let words to share prefixes
Let’s illustrate the process with an example, the first word is cat and we add it to the trie. Since it doesn’t have a prefix, we continue. Second word is cats, we add it to the trie and check whether it has a prefix, and yes it does, the word cat. So we append the pair <’cats’, ‘s’> to the queue, which is the current word and the suffix. The third word is catsdogcats, we again insert it to the trie and see that it has 2 prefixes, cat and cats. So we add 2 pairs <’catsdogcats’, ‘sdogcats’> and <’catsdogcats’, ‘dogcats’> where the former suffix correspond to the prefix cat and the latter to cats. We continue like this by adding <’catxdogcatsrat’, ‘xdogcatsrat’> to the queue and so on. And here’s the trie formed by adding example the words in the problem definition:
After building the trie and the queue, then we start processing the queue by popping a pair from the beginning. As explained above, the pair contains the original word and the suffix we want to validate. We check whether the suffix is a valid or compound word. If it’s a valid word and the original word is the longest up to now, we store the result. Otherwise we discard the pair. The suffix may be a compound word itself, so we check if the it has any prefixes. If it does, then we apply the above procedure by adding the original word and the new suffix to the queue. If the suffix of the original popped pair is neither a valid word nor has a prefix, we simply discard that pair.
The complexity of this algorithm is O(kN) where N is the number of words in the input list, and k the maximum number of words in a compound word. The number k may vary from one list to another, but it’ll generally be a constant number like 5 or 10. So, the algorithm is linear in number of words in the list, which is an optimal solution to the problem.
https://github.com/szn1992/Longest-Compound-Word/blob/master/src/LongestCompoundWord.java
String word; // a word
List<Integer> sufIndices; // indices of suffixes of a word
// read words from the file
// fill up the queue with words which have suffixes, who are
// candidates to be compound words
// insert each word in trie
while (s.hasNext()) {
word = s.next();
sufIndices = trie.getSuffixesStartIndices(word);
for (int i : sufIndices) {
if (i >= word.length()) // if index is out of bound
break; // it means suffixes of the word has
// been added to the queue if there is any
queue.add(new Pair<String>(word, word.substring(i)));
}
trie.insert(word);
}
Pair<String> p; // a pair of word and its remaining suffix
int maxLength = 0; // longest compound word length
//int sec_maxLength = 0; // second longest compound word length
String longest = ""; // longest compound word
String sec_longest = ""; // second longest compound word
while (!queue.isEmpty()) {
p = queue.removeFirst();
word = p.second();
sufIndices = trie.getSuffixesStartIndices(word);
// if no suffixes found, which means no prefixes found
// discard the pair and check the next pair
if (sufIndices.isEmpty()) {
continue;
}
//System.out.println(word);
for (int i : sufIndices) {
if (i > word.length()) { // sanity check
break;
}
if (i == word.length()) { // no suffix, means it is a compound word
// check if the compound word is the longest
// if it is update both longest and second longest
// words records
if (p.first().length() > maxLength) {
//sec_maxLength = maxLength;
sec_longest = longest;
maxLength = p.first().length();
longest = p.first();
}
compoundWords.add(p.first()); // the word is compound word
} else {
queue.add(new Pair<String>(p.first(), word.substring(i)));
}
}
}
// print out the results
System.out.println("Longest Compound Word: " + longest);
System.out.println("Second Longest Compound Word: " + sec_longest);
System.out.println("Total Number of Compound Words: " + compoundWords.size());
https://github.com/szn1992/Longest-Compound-Word/blob/master/src/Trie.java
private class TrieNode {
@SuppressWarnings("unused")
private char val; // character stored in the node
private HashMap<Character, TrieNode> children; // map name of string to the node
// which has the string as value
private boolean isWord; // if the node is at the end of a word
// constructor
public TrieNode(char val) {
this.val = val;
children = new HashMap<Character, TrieNode>();
isWord = false;
}
// add child to trienode
public void addChild(char child) {
children.put(child, new TrieNode(child));
}
// get child of trienode that has the same character as the give one
public TrieNode getChild(char child) {
if (!children.keySet().contains(child)) {
return null;
}
return children.get(child);
}
// return true if child exists
public boolean containsChild(char child) {
return children.keySet().contains(child);
}
}
private TrieNode root;
private TrieNode curr;
public Trie() {
root = new TrieNode(' '); // root
curr = root;
}
// insert a word to trie
public void insert(String s) {
char letter;
curr = root;
// traverse every letter of a word
// update trie if a letter is not in the structure
for (int i = 0; i < s.length(); i++) {
letter = s.charAt(i);
if (!curr.containsChild(letter)) {
curr.addChild(letter);
}
curr = curr.getChild(letter);
}
// mark last letter as the end of a word
curr.isWord = true;
}
// return starting indices of all suffixes of a word
public List<Integer> getSuffixesStartIndices(String s) {
List<Integer> indices = new LinkedList<Integer>(); // store indices
char letter;
curr = root; // start from root
for (int i = 0; i < s.length(); i++) {
letter = s.charAt(i);
// if the current letter doesn't have one letter as child
// which means trie currently doesn't have the relationship
// returns indices of suffixes
if (!curr.containsChild(letter))
return indices;
// move on to the child node
curr = curr.getChild(letter);
// if the letter is an end to a word, it means after the letter is a suffix
// update indices
if (curr.isWord)
indices.add(i + 1);
}
return indices;
}
https://github.com/peiweige/LongestCompoundWord/blob/master/src/LongestCompoundWord/FindLongestCompoundWord.java
https://github.com/subhendu-git/compound-words/blob/master/src/com/foo/pkg/Main.java
https://amazoninterviewmaker.wordpress.com/find-the-longest-word-made-of-other-words/
Using trie and queue :-
X. DFS-Recursive Version:
http://www.hawstein.com/posts/20.7.html
http://blog.csdn.net/navyifanr/article/details/24396961
if word made of any number of words:-
==> kind of strange: why use hashmap?
http://codeanytime.blogspot.com/2014/12/longest-compound-word.html
http://www.shuatiblog.com/blog/2014/10/02/longest-word-made-from-other/
https://github.com/yxjiang/algorithms/blob/master/src/main/java/algorithm/cc150/chapter18/Question7.java
http://www.geeksforgeeks.org/word-formation-using-concatenation-of-two-dictionary-words/
https://medium.com/@jessgreb01/longest-concatenated-word-algorithm-34934b864e3e
Read full article from Programming Interview Questions 28: Longest Compound Word | Arden DertatArden Dertat
Given a sorted list of words, find the longest compound word in the list that is constructed by concatenating the words in the list. For example, if the input list is: ['cat', 'cats', 'catsdogcats', 'catxdogcatsrat', 'dog', 'dogcatsdog', 'hippopotamuses', 'rat', 'ratcat', 'ratcatdog', 'ratcatdogcat']. Then the longest compound word is 'ratcatdogcat' with 12 letters. Note that the longest individual words are 'catxdogcatsrat' and 'hippopotamuses' with 14 letters, but they're not fully constructed by other words. Former one has an extra 'x' letter, and latter is an individual word by itself not a compound word.
Using Trie, which let words to share prefixes
Let’s illustrate the process with an example, the first word is cat and we add it to the trie. Since it doesn’t have a prefix, we continue. Second word is cats, we add it to the trie and check whether it has a prefix, and yes it does, the word cat. So we append the pair <’cats’, ‘s’> to the queue, which is the current word and the suffix. The third word is catsdogcats, we again insert it to the trie and see that it has 2 prefixes, cat and cats. So we add 2 pairs <’catsdogcats’, ‘sdogcats’> and <’catsdogcats’, ‘dogcats’> where the former suffix correspond to the prefix cat and the latter to cats. We continue like this by adding <’catxdogcatsrat’, ‘xdogcatsrat’> to the queue and so on. And here’s the trie formed by adding example the words in the problem definition:
After building the trie and the queue, then we start processing the queue by popping a pair from the beginning. As explained above, the pair contains the original word and the suffix we want to validate. We check whether the suffix is a valid or compound word. If it’s a valid word and the original word is the longest up to now, we store the result. Otherwise we discard the pair. The suffix may be a compound word itself, so we check if the it has any prefixes. If it does, then we apply the above procedure by adding the original word and the new suffix to the queue. If the suffix of the original popped pair is neither a valid word nor has a prefix, we simply discard that pair.
https://github.com/szn1992/Longest-Compound-Word/blob/master/src/LongestCompoundWord.java
String word; // a word
List<Integer> sufIndices; // indices of suffixes of a word
// read words from the file
// fill up the queue with words which have suffixes, who are
// candidates to be compound words
// insert each word in trie
while (s.hasNext()) {
word = s.next();
sufIndices = trie.getSuffixesStartIndices(word);
for (int i : sufIndices) {
if (i >= word.length()) // if index is out of bound
break; // it means suffixes of the word has
// been added to the queue if there is any
queue.add(new Pair<String>(word, word.substring(i)));
}
trie.insert(word);
}
Pair<String> p; // a pair of word and its remaining suffix
int maxLength = 0; // longest compound word length
//int sec_maxLength = 0; // second longest compound word length
String longest = ""; // longest compound word
String sec_longest = ""; // second longest compound word
while (!queue.isEmpty()) {
p = queue.removeFirst();
word = p.second();
sufIndices = trie.getSuffixesStartIndices(word);
// if no suffixes found, which means no prefixes found
// discard the pair and check the next pair
if (sufIndices.isEmpty()) {
continue;
}
//System.out.println(word);
for (int i : sufIndices) {
if (i > word.length()) { // sanity check
break;
}
if (i == word.length()) { // no suffix, means it is a compound word
// check if the compound word is the longest
// if it is update both longest and second longest
// words records
if (p.first().length() > maxLength) {
//sec_maxLength = maxLength;
sec_longest = longest;
maxLength = p.first().length();
longest = p.first();
}
compoundWords.add(p.first()); // the word is compound word
} else {
queue.add(new Pair<String>(p.first(), word.substring(i)));
}
}
}
// print out the results
System.out.println("Longest Compound Word: " + longest);
System.out.println("Second Longest Compound Word: " + sec_longest);
System.out.println("Total Number of Compound Words: " + compoundWords.size());
https://github.com/szn1992/Longest-Compound-Word/blob/master/src/Trie.java
private class TrieNode {
@SuppressWarnings("unused")
private char val; // character stored in the node
private HashMap<Character, TrieNode> children; // map name of string to the node
// which has the string as value
private boolean isWord; // if the node is at the end of a word
// constructor
public TrieNode(char val) {
this.val = val;
children = new HashMap<Character, TrieNode>();
isWord = false;
}
// add child to trienode
public void addChild(char child) {
children.put(child, new TrieNode(child));
}
// get child of trienode that has the same character as the give one
public TrieNode getChild(char child) {
if (!children.keySet().contains(child)) {
return null;
}
return children.get(child);
}
// return true if child exists
public boolean containsChild(char child) {
return children.keySet().contains(child);
}
}
private TrieNode root;
private TrieNode curr;
public Trie() {
root = new TrieNode(' '); // root
curr = root;
}
// insert a word to trie
public void insert(String s) {
char letter;
curr = root;
// traverse every letter of a word
// update trie if a letter is not in the structure
for (int i = 0; i < s.length(); i++) {
letter = s.charAt(i);
if (!curr.containsChild(letter)) {
curr.addChild(letter);
}
curr = curr.getChild(letter);
}
// mark last letter as the end of a word
curr.isWord = true;
}
// return starting indices of all suffixes of a word
public List<Integer> getSuffixesStartIndices(String s) {
List<Integer> indices = new LinkedList<Integer>(); // store indices
char letter;
curr = root; // start from root
for (int i = 0; i < s.length(); i++) {
letter = s.charAt(i);
// if the current letter doesn't have one letter as child
// which means trie currently doesn't have the relationship
// returns indices of suffixes
if (!curr.containsChild(letter))
return indices;
// move on to the child node
curr = curr.getChild(letter);
// if the letter is an end to a word, it means after the letter is a suffix
// update indices
if (curr.isWord)
indices.add(i + 1);
}
return indices;
}
https://github.com/peiweige/LongestCompoundWord/blob/master/src/LongestCompoundWord/FindLongestCompoundWord.java
https://github.com/subhendu-git/compound-words/blob/master/src/com/foo/pkg/Main.java
https://amazoninterviewmaker.wordpress.com/find-the-longest-word-made-of-other-words/
Using trie and queue :-
X. DFS-Recursive Version:
http://www.hawstein.com/posts/20.7.html
- 按单词的长度从大到小排序。(先寻找最长的单词)
- 不断地取单词的前缀s,当s存在于单词数组中,递归调用该函数, 判断剩余串是否可以由其它单词组成。如果可以,返回true。
inline bool cmp(string s1, string s2){//按长度从大到小排
return s2.length() < s1.length();
}
bool MakeOfWords(string word, int length){
//cout<<"curr: "<<word<<endl;
int len = word.length();
//cout<<"len:"<<len<<endl;
if(len == 0) return true;
for(int i=1; i<=len; ++i){
if(i == length) return false;//取到原始串,即自身
string str = word.substr(0, i);
//cout<<str<<endl;
if(hash.find((char*)&str[0])){
if(MakeOfWords(word.substr(i), length))
return true;
}
}
return false;
}
void PrintLongestWord(string word[], int n){
for(int i=0; i<n; ++i)
hash.insert((char*)&word[i][0]);
sort(word, word+n, cmp);
for(int i=0; i<n; ++i){
if(MakeOfWords(word[i], word[i].length())){
cout<<"Longest Word: "<<word[i]<<endl;
return;
}
}
}
注意上述代码中有一句:
if(i == length) return false;//取到原始串,即自身
意思是当我们取一个单词前缀,最后取到整个单词时, 这种情况就认为是没有其它单词可以组成它。如果不要这一句, 那么你在哈希表中总是能查找到和它自身相等的串(就是它自己),从而返回true。 而这明显不是我们想要的。我们要的是其它单词来组成它,不包括它自己。
===> ask whether there may be duplicate in the input.
这样一来,又引出一个问题,如果单词中就是存在两个相同的单词。 比如一个单词数组中最长的单词是abcdefg,并且存在2个,而它又不能被更小的单词组成, 那么我们可以认为这个abcdefg是由另一个abcdefg组成的吗? 关于这一点,你可以和面试官进行讨论。(上述代码认为是不能的。)
由于使用哈希表会占用较多空间,一种不使用额外空间的算法是直接在单词数组中查找, 由于单词数组已经按长度从大小到排,因此单次查找时间为O(n)。一共有n个单词, 平均长度为d,所以总共需要的时间为O(nd*n)=O(dn2 )。 如果我们再开一个数组来保存所有单词,并将它按字典序排序, 那么我们可以使用二分查找,单次查找时间变为O(logn),总共需要O(dnlogn)。
http://techinterviewsolutions.net/2015/06/15/longest-compound-word/
private
static
String longestCompundWord(String[] words) {
Set<String> dictionary =
new
HashSet<String>();
TreeMap<String, Integer> wordTree =
new
TreeMap<String, Integer>(
new
Comparator<String>() {
@Override
public
int
compare(String o1, String o2) {
return
o1.length() - o2.length();
}
});
for
(
int
i =
0
; i < words.length; i++) {
wordTree.put(words[i], i);
dictionary.add(words[i]);
}
System.out.println(dictionary);
return
findLongestCompoundWord(wordTree, dictionary);
}
private
static
boolean
isCompoundWord(String word, Set<String> dictionary) {
if
(dictionary.contains(word))
return
true
;
for
(
int
i =
1
; i < word.length(); i++) {
String prefix = word.substring(
0
, i);
if
(isCompoundWord(prefix, dictionary)) {
String remainder = word.substring(i, word.length());
if
(remainder.length() ==
0
)
return
true
;
return
isCompoundWord(remainder, dictionary);
}
}
return
false
;
}
public
static
String findLongestCompoundWord(TreeMap<String, Integer> wordTree, ArrayList<String> dictionary) {
while
(wordTree.size() >
0
) {
String word = wordTree.lastKey();
wordTree.remove(word);
dictionary.remove(word);
if
(isCompoundWord(word, dictionary))
return
word;
}
return
""
;
}
static class LengthComparator implements Comparator<String>{ public int compare(String o1,String o2){ return o2.length()-o1.length(); } } public static String printLongestWord(String[] arr){ HashMap<String,Boolean> map=new HashMap<String,Boolean>(); for(String str:arr){ map.put(str,true); } Arrays.sort(arr,new LengthComparator()); for(String s:arr){ if(canBuildWord(s,true,map)){ System.out.println(s); return s; } } return ""; } // DFS深搜 public static boolean canBuildWord(String str,boolean isOriginalWord,HashMap<String,Boolean> map){ if(map.containsKey(str)&&!isOriginalWord){ // 一个词只能由其他词组成,自己不能组成自己! return map.get(str); } for(int i=1;i<str.length();i++){ String left=str.substring(0,i); String right=str.substring(i); if(map.containsKey(left)&&map.get(left)==true&&canBuildWord(right,false,map)){ return true; } } map.put(str,false); // 记录str不能被其他的单词表示,如果以后再遇到str就可以不继续找了 return false; }if word made of two words:-
String getLongestWord(String[] list)
{
String[] array = list.SortByLength();
/* Create map for easy lookup */
HashMap<String, Boolean> map =
new
HashMap<String, Boolean>;
for
(String str : array)
{
map.put(str,
true
);
}
for
(String s : array)
{
// Divide into every possible pair
for
(
int
i = Ij i < s.length(); i++)
{
String left = s.substring^, i);
String right = s.substring(i);
// Check if both sides are in the array
if
(map[left] ==
true
&& map[right] ==
true
)
{
return
s;
}
}
}
return
str;
}
==> kind of strange: why use hashmap?
String printl_ongestWord(String arr[])
{
HashMap<String, Boolean> map =
new
HashMap<String, Boolean>();
for
(String str : arr)
{
map.put(str,
true
);
}
Arrays.sort(arr,
new
LengthComparatorQ);
// Sort by length
for
(String s : arr)
{
if
(canBuildWord(s,
true
, map))
{
System.out.println(s);
return
s;
}
}
return
""
;
}
boolean
canBuildWord(String str,
boolean
isOriginalWord,
HashMap<String, Boolean> map)
{
if
(map.containsKey(str) && !isOriginalWord)
{
return
map.get(str);
}
for
(
int
1
=
1
; i < str.lengthQ; i++)
{
String left = str.substring(
0
, i);
String right = str.substring(i);
if
(map.containsKey(left) && map.get(left) ==
true
&&
canBuildWord(right,
false
, map))
{
return
true
;
}
}
map.put(str,
false
);
return
false
;
}
http://codeanytime.blogspot.com/2014/12/longest-compound-word.html
http://www.shuatiblog.com/blog/2014/10/02/longest-word-made-from-other/
public static void printLongestWord(String[] arr) {
Arrays.sort(arr, new LengthComparator());
HashSet<String> set = new HashSet<String>();
for (String str : arr) {
set.add(str);
}
for (String word : arr) {
if (canDivide(word, 0, set)) {
System.out.println(word);
return;
}
}
System.out.println("can not find such word");
}
private static boolean canDivide(String word, int from, HashSet<String> set) {
if (from == word.length()) {
return true;
}
for (int i = from; i < word.length(); i++) {
String str = word.substring(from, i + 1);
if (from == 0 && i == word.length() - 1) {
continue;
} else if (!set.contains(str)) {
continue;
if (canDivide(word, i + 1, set)) {
return true;
}
}
return false;
}
http://www.geeksforgeeks.org/word-formation-using-concatenation-of-two-dictionary-words/
Given a dictionary find out if given word can be made by two words in the dictionary.
Note: Words in the dictionary must be unique and the word to be formed should not be a repetition of same words that are present in the Trie.
Input : dictionary[] = {"news", "abcd", "tree", "geeks", "paper"} word = "newspaper" Output : Yes
The idea is store all words of dictionary in a Trie. We do prefix search for given word. Once we find a prefix, we search for rest of the word.
Nor correct
Nor correct
struct
TrieNode
{
TrieNode *children[SIZE];
// isLeaf is true if the node represents
// end of a word
bool
isLeaf;
};
// Returns new trie node (initialized to NULLs)
TrieNode *getNode()
{
TrieNode * newNode =
new
TrieNode;
newNode->isLeaf =
false
;
for
(
int
i =0 ; i< SIZE ; i++)
newNode->children[i] = NULL;
return
newNode;
}
// If not present, inserts key into Trie
// If the key is prefix of trie node, just
// mark leaf node
void
insert(TrieNode *root, string Key)
{
int
n = Key.length();
TrieNode * pCrawl = root;
for
(
int
i=0; i<n; i++)
{
int
index = char_int(Key[i]);
if
(pCrawl->children[index] == NULL)
pCrawl->children[index] = getNode();
pCrawl = pCrawl->children[index];
}
// make last node as leaf node
pCrawl->isLeaf =
true
;
}
// Searches a prefix of key. If prefix is present,
// returns its ending position in string. Else
// returns -1.
int
findPrefix(
struct
TrieNode *root, string key)
{
int
pos = -1, level;
struct
TrieNode *pCrawl = root;
for
(level = 0; level < key.length(); level++)
{
int
index = char_int(key[level]);
if
(pCrawl->isLeaf ==
true
)
pos = level;
if
(!pCrawl->children[index])
return
pos;
pCrawl = pCrawl->children[index];
}
if
(pCrawl != NULL && pCrawl->isLeaf)
return
level;
}
// Function to check if word formation is possible
// or not
bool
isPossible(
struct
TrieNode* root, string word)
{
// Search for the word in the trie and
// store its position upto which it is matched
int
len = findPrefix(root, word);
// print not possible if len = -1 i.e. not
// matched in trie
if
(len == -1)
return
false
;
// If word is partially matched in the dictionary
// as another word
// search for the word made after splitting
// the given word up to the length it is
// already,matched
string split_word(word, len, word.length()-(len));
int
split_len = findPrefix(root, split_word);
// check if word formation is possible or not
return
(len + split_len == word.length());
}
Exercise :
A generalized version of the problem is to check if a given word can be formed using concatenation of 1 or more dictionary words. Write code for the generalized version.
http://techinterviewsolutions.net/2014/09/longest-compound-word-java-solution/A generalized version of the problem is to check if a given word can be formed using concatenation of 1 or more dictionary words. Write code for the generalized version.
https://medium.com/@jessgreb01/longest-concatenated-word-algorithm-34934b864e3e
Read full article from Programming Interview Questions 28: Longest Compound Word | Arden DertatArden Dertat