Massive Algorithms: Puzzles, Maths and Algorithms: Finding Relevant Keywords from Large Corpus of Text

Puzzles, Maths and Algorithms: Finding Relevant Keywords from Large Corpus of Text
Problem 1: Given a large paragraph of words containing n words (separated by space), and k keywords. Find the smallest distance between these keywords in the paragraph.

To illustrate the problem, consider that the paragraph is A B A C E D O A B. Let us say that we are interested in two keywords A and D. Then the closest they appear is at position 8 and 6 (D O A) and the minimum distance is 1 (i.e. number of non-keyword words between them).

Build an efficient algorithm with O(k) space and O(n log k) running cost.
To solve this problem, we need to register the last seen position of each of k keywords, as we read the paragraph word-by-word. If the next word, and if that matches a keyword, we need to update the last seen position of the matching keyword and also compute the minimum of the position register. That helps us find the smallest distance.

In order to implement this efficiently, we need a hash map and min heap (tree based implementation, not array based). Following is the code to solve this problem.

Let K[i] indicate ith keyword.
Let W[i] indicate ith word in the paragraph. 

// init map and heap
map = {}
heap = []
for i in 1 to k

  // init heap with -Inf as position of the keyword
  node = heap_node()
  node.val = -Inf

  // map stores the keyword and its pointer to node in the heap
  heap.put(node)
  map.put(K[i], node)
end

idx_start = 0
idx_end = 0
min_dist = Inf
for i in 1 to n
  w = W[i]

  // check if w is one of the keywords, if not ignore
  if !map.contains_key(w)
    continue
  end

  // get the heap node corresponding to the
  // key W[i] and update its position
  node = map.get(w)
  node.val = i

  // re-heapify the min heap
  heap.heapify()

  // min of all positions
  idx = heap.min()

  // if idx = -Inf then some keywords are not seen yet.
  if idx == -Inf
    continue
  end

  // compute distance between keywords
  dist = i - idx - k + 1
  if dist < min_dist
    idx_start = idx
    idx_end = i
    min_dist = dist
  end 
end

The above algorithm requires O(k) memory for the map and the heap. For each word, it requires O(log k) to re-heapify the min heap.

Problem 2: Given a large paragraph of words (separated by space). Consider N tuples of names and interests. If a given person and their interest appears within K words then its a match else not. Find all matches from the paragraph.

To illustrate the problem, consider that the paragraph is "John likes swimming and Jack likes Tennis". Let the interest map is <john swimming>, <john tennis>, <jack swimming>, <jack tennis>. For k = 3, 4, 5, 6 the matches found in the paragraph are <john swimming>, <jack swimming>, <jack tennis>.

Propose an efficient algorithm for two cases: k is small, k is very large.
Read full article from Puzzles, Maths and Algorithms: Finding Relevant Keywords from Large Corpus of Text

Puzzles, Maths and Algorithms: Finding Relevant Keywords from Large Corpus of Text

Labels

Popular Posts