Word Distance - Cracking the coding interview--Q20.5


Cracking the coding interview--Q20.5
You have a large text file containing words. Given any two words, find the shortest distance (in terms of number of words) between them in the file. Can you make the searching operation in O(1) time? What about the space complexity for your solution?
译文:
有一个很大的文本文件,里面包含许多英文单词。给出两个单词,找到它们的最短距离 (以它们之间隔了多少个单词计数)。你能在O(1)的时间内返回任意两个单词间的最短距离吗? 你的解法空间复杂度是多少?

首先,我们遇到的第一个问题是:是否要考虑顺序?我们求的是is和name间的距离, 那么文本中先出现name再出现is的情况要不要算进来。这一点是要和面试官进行交流确认的。 这里我们假设不考虑顺序,并且认为本文中只有单词,没有标点。 为了进一步简化问题,我们可以用一个字符串数组来保存单词, 接下来考虑如何计算两个单词间的最短距离。

如果只要找一次就用第一种O(n)解法.
如果要找多次就多用一个Hashtable,把所有的组合都保存起来.
https://tianrunhe.wordpress.com/2012/06/04/shortest-distances-between-two-words-in-a-file/
X. We can build and store the mapping between pairs of words to their shortest distance. That is space complexity. But the query is constant since it’s just one step of look-up.
private static Map<HashSet<String>, Integer> distances =
        new HashMap<HashSet<String>, Integer>();
public static int query(String word1, String word2) {
    return distances.get(new Pair<String>(word1, word2));
}
public static void buildMap(String text) {
    // clean up the text
    String[] temp = text.split("[\\s\\p{Punct}]");
    List<String> words = new ArrayList<String>();
    for (String word : temp) {
        word = word.trim();
        if (!word.isEmpty()) {
            words.add(word);
        }
    }
    // build the mapping between pairs of words to
    // their shortest distances
    for (int i = 0; i < words.size(); ++i) {
        for (int j = i + 1; j < words.size(); ++j) {
            if (!words.get(i).equals(words.get(j))) {
                HashSet<String> pair = new HashSet<String>();
                pair.add(words.get(i));
                pair.add(words.get(j));
                if (distances.keySet().contains(pair)) {
                    int curr = distances.get(pair);
                    if (j - i < curr)
                        distances.put(pair, j - i);
                } else {
                    distances.put(pair, j - i);
                }
            }
        }
    }
}
方法一
遍历一次文本,用哈希函数将每个单词映射到不同结点,结点后保存该单词出现的位置。 比如对于上面的例子
What is your name My name is Hawstein
0    1  2    3    4  5    6  7  
遍历一次并处理后,我们得到每个单词在文本中出现的位置:(哈希值是随便写的,示意用)
单词    哈希值   出现位置
What:     3       0
is:       7       1, 6
your:     13      2
name:     14      3, 5
My:       25      4
Hawstein: 27      7
求两个单词间的最小距离时,首先用O(1)时间通过哈希函数映射到指定结点, 然后对于其中一个单词的每个位置,去与第二个单词的所有位置比较,找到最小的差值。 由于位置是递增的,因此可以修改二分查找进行搜索。
该方法的平均查找复杂度应该是O(1)的,但最坏情况下无法保证O(1)的查找时间, 考虑一种极端情况,文本中的单词就只有is和name,它们的数量各为(½)n, 使用这种算法,我们需要O(nlogn)的时间。

X. O(N)
int ShortestDist(string text[], int n, string word1, string word2){
    int min = kMaxInt / 2;
    int pos1 = -min;
    int pos2 = -min;

    for(int pos=0; pos<n; ++pos){
        if(text[pos] == word1){
            pos1 = pos;
            int dist = pos1 - pos2;
            if(dist < min)
                min = dist;
        }
        else if(text[pos] == word2){
            pos2 = pos;
            int dist = pos2 - pos1;
            if(dist < min)
                min = dist;
        }
    }

    return min;
}

https://github.com/careercup/CtCI-6th-Edition/tree/master/Java/Ch%2017.%20Hard/Q17_11_Word_Distance
public static HashMapList<String, Integer> getWordLocations(String[] words) {
HashMapList<String, Integer> locations = new HashMapList<String, Integer>();
for (int i = 0; i < words.length; i++) {
locations.put(words[i], i);
}
return locations;
}

public static LocationPair findMinDistancePair(ArrayList<Integer> array1, ArrayList<Integer> array2) {
if (array1 == null || array2 == null || array1.size() == 0 || array2.size() == 0) {
return null;
}

int index1 = 0;
int index2 = 0;
LocationPair best = new LocationPair(array1.get(0), array2.get(0));
LocationPair current = new LocationPair(array1.get(0), array2.get(0));

while (index1 < array1.size() && index2 < array2.size()) {
current.setLocations(array1.get(index1), array2.get(index2));
best.updateWithMin(current);
if (current.location1 < current.location2) {
index1++;
} else {
index2++;
}
}

return best;
}

public static LocationPair findClosest(String word1, String word2, HashMapList<String, Integer> locations) {
ArrayList<Integer> locations1 = locations.get(word1);
ArrayList<Integer> locations2 = locations.get(word2);
return findMinDistancePair(locations1, locations2);
}


public static LocationPair findClosest(String[] words, String word1, String word2) {
LocationPair best = new LocationPair(-1, -1);
LocationPair current = new LocationPair(-1, -1);
for (int i = 0; i < words.length; i++) {
String word = words[i];
if (word.equals(word1)) {
current.location1 = i;
best.updateWithMin(current);
} else if (word.equals(word2)) {
current.location2 = i;
best.updateWithMin(current);
}
}
return best;
}

Read full article from Cracking the coding interview--Q20.5

Labels

LeetCode (1432) GeeksforGeeks (1122) LeetCode - Review (1067) Review (882) Algorithm (668) to-do (609) Classic Algorithm (270) Google Interview (237) Classic Interview (222) Dynamic Programming (220) DP (186) Bit Algorithms (145) POJ (141) Math (137) Tree (132) LeetCode - Phone (129) EPI (122) Cracking Coding Interview (119) DFS (115) Difficult Algorithm (115) Lintcode (115) Different Solutions (110) Smart Algorithm (104) Binary Search (96) BFS (91) HackerRank (90) Binary Tree (86) Hard (79) Two Pointers (78) Stack (76) Company-Facebook (75) BST (72) Graph Algorithm (72) Time Complexity (69) Greedy Algorithm (68) Interval (63) Company - Google (62) Geometry Algorithm (61) Interview Corner (61) LeetCode - Extended (61) Union-Find (60) Trie (58) Advanced Data Structure (56) List (56) Priority Queue (53) Codility (52) ComProGuide (50) LeetCode Hard (50) Matrix (50) Bisection (48) Segment Tree (48) Sliding Window (48) USACO (46) Space Optimization (45) Company-Airbnb (41) Greedy (41) Mathematical Algorithm (41) Tree - Post-Order (41) ACM-ICPC (40) Algorithm Interview (40) Data Structure Design (40) Graph (40) Backtracking (39) Data Structure (39) Jobdu (39) Random (39) Codeforces (38) Knapsack (38) LeetCode - DP (38) Recursive Algorithm (38) String Algorithm (38) TopCoder (38) Sort (37) Introduction to Algorithms (36) Pre-Sort (36) Beauty of Programming (35) Must Known (34) Binary Search Tree (33) Follow Up (33) prismoskills (33) Palindrome (32) Permutation (31) Array (30) Google Code Jam (30) HDU (30) Array O(N) (29) Logic Thinking (29) Monotonic Stack (29) Puzzles (29) Code - Detail (27) Company-Zenefits (27) Microsoft 100 - July (27) Queue (27) Binary Indexed Trees (26) TreeMap (26) to-do-must (26) 1point3acres (25) GeeksQuiz (25) Merge Sort (25) Reverse Thinking (25) hihocoder (25) Company - LinkedIn (24) Hash (24) High Frequency (24) Summary (24) Divide and Conquer (23) Proof (23) Game Theory (22) Topological Sort (22) Lintcode - Review (21) Tree - Modification (21) Algorithm Game (20) CareerCup (20) Company - Twitter (20) DFS + Review (20) DP - Relation (20) Brain Teaser (19) DP - Tree (19) Left and Right Array (19) O(N) (19) Sweep Line (19) UVA (19) DP - Bit Masking (18) LeetCode - Thinking (18) KMP (17) LeetCode - TODO (17) Probabilities (17) Simulation (17) String Search (17) Codercareer (16) Company-Uber (16) Iterator (16) Number (16) O(1) Space (16) Shortest Path (16) itint5 (16) DFS+Cache (15) Dijkstra (15) Euclidean GCD (15) Heap (15) LeetCode - Hard (15) Majority (15) Number Theory (15) Rolling Hash (15) Tree Traversal (15) Brute Force (14) Bucket Sort (14) DP - Knapsack (14) DP - Probability (14) Difficult (14) Fast Power Algorithm (14) Pattern (14) Prefix Sum (14) TreeSet (14) Algorithm Videos (13) Amazon Interview (13) Basic Algorithm (13) Codechef (13) Combination (13) Computational Geometry (13) DP - Digit (13) LCA (13) LeetCode - DFS (13) Linked List (13) Long Increasing Sequence(LIS) (13) Math-Divisible (13) Reservoir Sampling (13) mitbbs (13) Algorithm - How To (12) Company - Microsoft (12) DP - Interval (12) DP - Multiple Relation (12) DP - Relation Optimization (12) LeetCode - Classic (12) Level Order Traversal (12) Prime (12) Pruning (12) Reconstruct Tree (12) Thinking (12) X Sum (12) AOJ (11) Bit Mask (11) Company-Snapchat (11) DP - Space Optimization (11) Dequeue (11) Graph DFS (11) MinMax (11) Miscs (11) Princeton (11) Quick Sort (11) Stack - Tree (11) 尺取法 (11) 挑战程序设计竞赛 (11) Coin Change (10) DFS+Backtracking (10) Facebook Hacker Cup (10) Fast Slow Pointers (10) HackerRank Easy (10) Interval Tree (10) Limited Range (10) Matrix - Traverse (10) Monotone Queue (10) SPOJ (10) Starting Point (10) States (10) Stock (10) Theory (10) Tutorialhorizon (10) Kadane - Extended (9) Mathblog (9) Max-Min Flow (9) Maze (9) Median (9) O(32N) (9) Quick Select (9) Stack Overflow (9) System Design (9) Tree - Conversion (9) Use XOR (9) Book Notes (8) Company-Amazon (8) DFS+BFS (8) DP - States (8) Expression (8) Longest Common Subsequence(LCS) (8) One Pass (8) Quadtrees (8) Traversal Once (8) Trie - Suffix (8) 穷竭搜索 (8) Algorithm Problem List (7) All Sub (7) Catalan Number (7) Cycle (7) DP - Cases (7) Facebook Interview (7) Fibonacci Numbers (7) Flood fill (7) Game Nim (7) Graph BFS (7) HackerRank Difficult (7) Hackerearth (7) Inversion (7) Kadane’s Algorithm (7) Manacher (7) Morris Traversal (7) Multiple Data Structures (7) Normalized Key (7) O(XN) (7) Radix Sort (7) Recursion (7) Sampling (7) Suffix Array (7) Tech-Queries (7) Tree - Serialization (7) Tree DP (7) Trie - Bit (7) 蓝桥杯 (7) Algorithm - Brain Teaser (6) BFS - Priority Queue (6) BFS - Unusual (6) Classic Data Structure Impl (6) DP - 2D (6) DP - Monotone Queue (6) DP - Unusual (6) DP-Space Optimization (6) Dutch Flag (6) How To (6) Interviewstreet (6) Knapsack - MultiplePack (6) Local MinMax (6) MST (6) Minimum Spanning Tree (6) Number - Reach (6) Parentheses (6) Pre-Sum (6) Probability (6) Programming Pearls (6) Rabin-Karp (6) Reverse (6) Scan from right (6) Schedule (6) Stream (6) Subset Sum (6) TSP (6) Xpost (6) n00tc0d3r (6) reddit (6) AI (5) Abbreviation (5) Anagram (5) Art Of Programming-July (5) Assumption (5) Bellman Ford (5) Big Data (5) Code - Solid (5) Code Kata (5) Codility-lessons (5) Coding (5) Company - WMware (5) Convex Hull (5) Crazyforcode (5) DFS - Multiple (5) DFS+DP (5) DP - Multi-Dimension (5) DP-Multiple Relation (5) Eulerian Cycle (5) Graph - Unusual (5) Graph Cycle (5) Hash Strategy (5) Immutability (5) Java (5) LogN (5) Manhattan Distance (5) Matrix Chain Multiplication (5) N Queens (5) Pre-Sort: Index (5) Quick Partition (5) Quora (5) Randomized Algorithms (5) Resources (5) Robot (5) SPFA(Shortest Path Faster Algorithm) (5) Shuffle (5) Sieve of Eratosthenes (5) Strongly Connected Components (5) Subarray Sum (5) Sudoku (5) Suffix Tree (5) Swap (5) Threaded (5) Tree - Creation (5) Warshall Floyd (5) Word Search (5) jiuzhang (5)

Popular Posts