Puzzles, Maths and Algorithms: Finding Relevant Keywords from Large Corpus of Text
Problem 1: Given a large paragraph of words containing n words (separated by space), and k keywords. Find the smallest distance between these keywords in the paragraph.
To illustrate the problem, consider that the paragraph is A B A C E D O A B. Let us say that we are interested in two keywords A and D. Then the closest they appear is at position 8 and 6 (D O A) and the minimum distance is 1 (i.e. number of non-keyword words between them).
Build an efficient algorithm with O(k) space and O(n log k) running cost.
To solve this problem, we need to register the last seen position of each of k keywords, as we read the paragraph word-by-word. If the next word, and if that matches a keyword, we need to update the last seen position of the matching keyword and also compute the minimum of the position register. That helps us find the smallest distance.
In order to implement this efficiently, we need a hash map and min heap (tree based implementation, not array based). Following is the code to solve this problem.
Problem 2: Given a large paragraph of words (separated by space). Consider N tuples of names and interests. If a given person and their interest appears within K words then its a match else not. Find all matches from the paragraph.
To illustrate the problem, consider that the paragraph is "John likes swimming and Jack likes Tennis". Let the interest map is <john swimming>, <john tennis>, <jack swimming>, <jack tennis>. For k = 3, 4, 5, 6 the matches found in the paragraph are <john swimming>, <jack swimming>, <jack tennis>.
Propose an efficient algorithm for two cases: k is small, k is very large.
Read full article from Puzzles, Maths and Algorithms: Finding Relevant Keywords from Large Corpus of Text
Problem 1: Given a large paragraph of words containing n words (separated by space), and k keywords. Find the smallest distance between these keywords in the paragraph.
To illustrate the problem, consider that the paragraph is A B A C E D O A B. Let us say that we are interested in two keywords A and D. Then the closest they appear is at position 8 and 6 (D O A) and the minimum distance is 1 (i.e. number of non-keyword words between them).
Build an efficient algorithm with O(k) space and O(n log k) running cost.
To solve this problem, we need to register the last seen position of each of k keywords, as we read the paragraph word-by-word. If the next word, and if that matches a keyword, we need to update the last seen position of the matching keyword and also compute the minimum of the position register. That helps us find the smallest distance.
In order to implement this efficiently, we need a hash map and min heap (tree based implementation, not array based). Following is the code to solve this problem.
Let K[i] indicate ith keyword. Let W[i] indicate ith word in the paragraph. // init map and heap map = {} heap = [] for i in 1 to k // init heap with -Inf as position of the keyword node = heap_node() node.val = -Inf // map stores the keyword and its pointer to node in the heap heap.put(node) map.put(K[i], node) end idx_start = 0 idx_end = 0 min_dist = Inf for i in 1 to n w = W[i] // check if w is one of the keywords, if not ignore if !map.contains_key(w) continue end // get the heap node corresponding to the // key W[i] and update its position node = map.get(w) node.val = i // re-heapify the min heap heap.heapify() // min of all positions idx = heap.min() // if idx = -Inf then some keywords are not seen yet. if idx == -Inf continue end // compute distance between keywords dist = i - idx - k + 1 if dist < min_dist idx_start = idx idx_end = i min_dist = dist end endThe above algorithm requires O(k) memory for the map and the heap. For each word, it requires O(log k) to re-heapify the min heap.
Problem 2: Given a large paragraph of words (separated by space). Consider N tuples of names and interests. If a given person and their interest appears within K words then its a match else not. Find all matches from the paragraph.
To illustrate the problem, consider that the paragraph is "John likes swimming and Jack likes Tennis". Let the interest map is <john swimming>, <john tennis>, <jack swimming>, <jack tennis>. For k = 3, 4, 5, 6 the matches found in the paragraph are <john swimming>, <jack swimming>, <jack tennis>.
Propose an efficient algorithm for two cases: k is small, k is very large.
Read full article from Puzzles, Maths and Algorithms: Finding Relevant Keywords from Large Corpus of Text