Massive Algorithms: Reservoir Sampling

Reservoir Sampling | GeeksforGeeks

Reservoir Sampling | GeeksforGeeks
Reservoir sampling is a family of randomized algorithms for randomly choosing k samples from a list of n items, where n is either a very large or unknown number.

So we are given a big array (or stream) of numbers (to simplify), and we need to write an efficient function to randomly select k numbers where 1 <= k <= n. Let the input array be stream[].

It can be solved in O(n) time. The solution also suits well for input in the form of stream. The idea is similar to this post. Following are the steps.
1) Create an array reservoir[0..k-1] and copy first k items of stream[] to it.
2) Now one by one consider all items from (k+1)th item to nth item.
…a) Generate a random number from 0 to i where i is index of current item in stream[]. Let the generated random number is j.
…b) If j is in range 0 to k-1, replace reservoir[j] with arr[i]

Proof
https://en.wikipedia.org/wiki/Reservoir_sampling
To see this, consider the following proof by induction. After the

(i-1)

^th round, let us assume, the probability of a number being in the reservoir array is

k/(i-1)

. Since the probability of the number being replaced in the

i

^th round is

1/i

, the probability that it survives the

i

^th round is

(i-1)/i

. Thus, the probability that a given number is in the reservoir after the

i

^th round is the product of these two probabilities, i.e. the probability of being in the reservoir after the

(i-1)

^th round, and surviving replacement in the

i

^th round. This is

(k/(i-1))*((i-1)/i)=k/i

. Hence, the result holds for

i

, and is therefore true by induction.
https://gist.github.com/Nan-Zhang/9730320
http://the-paper-trail.org/blog/reservoir-sampling/
Say we want to generate a set of s elements and that we have already seen n>s elements.

Let's assume that our current s elements have already each been chosen with probability s/n.

By the definition of the algorithm, we choose element n+1 with probability s/(n+1).

Each element already part of our result set has a probability 1/s of being replaced.

The probability that an element from the n-seen result set is replaced in the n+1-seen result set is therefore (1/s)*s/(n+1)=1/(n+1). Conversely, the probability that an element is not replaced is 1-1/(n+1)=n/(n+1).

Thus, the n+1-seen result set contains an element either if it was part of the n-seen result set and was not replaced---this probability is (s/n)*n/(n+1)=s/(n+1)---or if the element was chosen---with probability s/(n+1).
蓄水池算法
要求从N个元素中随机的抽取k个元素，其中N无法确定(N是个流，可能无穷大)。

这种应用的场景一般是数据流的情况下，由于数据只能被读取一次，而且数据量很大，并不能全部保存，因此数据量N是无法在抽样开始时确定的；但又要保持随机性，于是有了这个问题。所以搜索网站有时候会问这样的问题。

解决方案就是蓄水库抽样（reservoid sampling）。主要思想就是保持一个集合（这个集合中的每个数字出现），作为蓄水池，依次遍历所有数据的时候以一定概率替换这个蓄水池中的数字。
1、程序的开始就是把前k个元素都放到水库中。
2、从第k+1个开始，以k/i的概率替换掉这个水库中的某一个元素。

void selectKItems(int stream[], int n, int k)

{

    int i;  // index for elements in stream[]

    // reservoir[] is the output array. Initialize it with

    // first k elements from stream[]

    int reservoir[k];

    for (i = 0; i < k; i++)

        reservoir[i] = stream[i];

    // Use a different seed value so that we don't get

    // same result each time we run this program

    srand(time(NULL));

    // Iterate from the (k+1)th element to nth element

    for (; i < n; i++)

{

        // Pick a random index from 0 to i.

        int j = rand() % (i+1);

        // If the randomly  picked index is smaller than k, then replace

        // the element present at the index with new element from stream

        if (j < k)

          reservoir[j] = stream[i];

}

    printf("Following are k randomly selected items \n");

    printArray(reservoir, k);

}

How does this work?
We must prove that the probability that any item stream[i]where 0 <= i < n will be in final reservoir[] is k/n.

ase 1: For last n-k stream items, i.e., for stream[i] where k <= i < n
For every such stream item stream[i], we pick a random index from 0 to i and if the picked index is one of the first k indexes, we replace the element at picked index with stream[i]

To simplify the proof, let us first consider the last item. The probability that the last item is in final reservoir = The probability that one of the first k indexes is picked for last item = k/n (the probability of picking one of the k items from a list of size n)

Let us now consider the second last item. The probability that the second last item is in final reservoir[]= [Probability that one of the first k indexes is picked in iteration for stream[n-2]] X [Probability that the index picked in iteration for stream[n-1] is not same as index picked for stream[n-2] ] = [k/(n-1)]*[(n-1)/n] = k/n.

Similarly, we can consider other items for all stream items from stream[n-1] to stream[k] and generalize the proof.

Case 2: For first k stream items, i.e., for stream[i] where 0 <= i < k
The first k items are initially copied to reservoir[] and may be removed later in iterations for stream[k]to stream[n].
The probability that an item from stream[0..k-1] is in final array = Probability that the item is not picked when items stream[k], stream[k+1], …. stream[n-1] are considered = [k/(k+1)] x [(k+1)/(k+2)] x [(k+2)/(k+3)] x … x [(n-1)/n] = k/n

http://medriscoll.com/post/78238180948/knuths-reservoir-sampling-in-python-and-perl
http://n00tc0d3r.blogspot.com/2013/08/reservoir-sampling-and-variants.html

public void selectKSamples(int[] data, int k) {  
   int n = data.length;  
   if (n < k) return;  
   if (n == k) {  
     printArray(data);  
     return;  
   }  
   
   int[] pool = new int[k];  
   for (int i=0; i<k; ++i) {  
     pool[i] = data[i];  
   }  
   // random pick  
   for (int i=k; i<n; ++i) {  
     int rand = (int)(Math.random() * (i+1)); // generate a random number of [0,i]  
     if (rand < k) {  
       pool[rand] = data[i];  
     }  
   }  
   
   printArray(pool);  
 }

http://shatteredterminal.com/2009/11/random-sample-from-a-long-linked-list/

public static <T> List<T> sample(
LinkedList<T> list, int m) {
ArrayList<T> samples = new ArrayList<T>(m);
int k = 0;
for (T element : list) {
int pos = k++;
if (pos < m) {
samples.add(element);
} else if ((pos = rand.nextInt(k)) < m) {
samples.set(pos, element);
}
}
return samples;
}

Random Sampling in Linked List
http://n00tc0d3r.blogspot.com/2013/08/reservoir-sampling-and-variants.html
Given a linked list, generate a random node from the list. You are allowed to use only O(1) space.

 public ListNode selectASamples(ListNode head) {  
   ListNode sample = null;  
   for (int count=1; head!=null; head = head.next, ++count) {  
     // generate a randome number of [0, count-1]  
     if (count == 1 || (int)(Math.random() * count) == count - 1) {  
       sample = head;  
     }  
   }  
   return sample;  
 }

http://jeremykun.com/2013/07/05/reservoir-sampling/
def reservoirSample(stream):
for k,x in enumerate(stream, start=1):
if random.random() < 1.0 / k:
chosen = x

return chosen

Related:

给定一个二叉树，要求随机选择树上的一个节点。

解法：遍历树的过程中，随机选择一个节点即可

http://bylijinnan.iteye.com/blog/1468985
http://www.cnblogs.com/HappyAngel/archive/2011/02/07/1949762.html
　　How could you select one of n objects at random, where you see the objects sequentially but you do not know the value of n beforehand? For concreteness, how would you read a text file, and select and print one random line, when you don’t know the number of lines in advance?

　　问题定义可以简化如下：在不知道文件总行数的情况下，如何从文件中随机的抽取一行？

import random

def reservoirSample(stream):

   for k,x in enumerate(stream, start=1):

      if random.random() < 1.0 / k:

         chosen = x

   return chosen

Facebook interview question
1. 一个数组，找最大数，可能有重复，要求random输出最大index，
比如[ 1 2 3 4 5 6 6 6], 最大的是6， index可能是5，6， 7。每次call这个
function的时候，random输出5，6，7.

So scan through the array, from 0 to k, record the max, number of same max and the index, using the same reservoir sampling method above. we can easily random output the index.

http://prismoskills.appspot.com/lessons/Brain_Teasers/Equal_probability_always.jsp
http://n1b-algo.blogspot.com/2009/01/picking-k-elements-randomly-from-stream.html
Case 1: N is known
In this case problem can be solved in O(N) time using O(K) space. The idea is that pick the first element with probability K/N. If the element is picked then pick the next element with probability (K-1)/(N-1) otherwise pick it with probability K/(N-1). To see that the second element is still picked with probability K/N. There are two cases.

First element is picked. So second element is picked with probability = K/N * (K-1)/(N-1).
First element is not picked. So second element is picked with probability = (1-K/N) * K/(N-1).

So the selection probability of second element is = K/N*(K-1)/(N-1) + (1-K/N)*K/(N-1) = K/N. Similar logic extends further. The following code implements this approach.

for i = 1 to N
   if rand <= K/(N+1-i)
      print A[i]
      K -= 1
      if K == 0
         break
      end
   end
end
if K != 0
  print A[N-K:N]
end

http://n1b-algo.blogspot.com/2009/01/picking-element-randomly-from-stream.html
f we know the length of the stream, then problem is trivial. Simply generate a random number r between 1 and N. Declare the number at index r as the desired random number.

For an unknown but finite N, we need that elements are selected with equal probability. Let us first build the loop invariant property for the solution. Say we have seen k elements so far, the selected number (say n) must be selected with probability 1/k. As we see a new element (say n1), we should set n=n1 with probability 1/(k+1). This ensures that n1 is selected with probability 1/(k+1). In the other case, n (the one NOT replaced by n1) is selected with probability = 1/k * (1-1/(k+1)) = 1/(k+1). Hence this strategy ensures that a number a selected with probability 1/N for N>=k. This loop invariant is adhered by the following code:

n=0, N=0
while Stream.hasElement()
  ++N
  if rand <= 1/N 
    n = Stream.nextElement()
return n

After the algorithm processes all elements, n would contain the randomly selected element and N would contain the number of elements in the stream.
https://gist.github.com/gcrfelix/621c3b71e298de6758f6
Reservoir Sampling：从N个数中随机抽取k个元素，保证每个元素被选中的概率相等，N不知道有多大。

public void selectKitems(int[] stream, int k) {
int i=0;
int n = stream.length;
int[] reservoir = new int[k];
for(int i=0; i<k; i++) {
reservoir[i] = stream[i];
}

// Iterate from the (k+1)th element to nth element
for(int i=k; i<n; i++) {

// Pick a random index from 0 to i.
int rand = Random.nextInt(i) + 1;

// If the randomly picked index is smaller than k, then replace
// the element present at the index with new element from stream
if(rand < k) {
resevior[rand] = stream[i];
}
}
}

modification:
Given a infinite stream of number, return a random element with equal probability.
然后她直接给了我方法头：public int getRandom(int[] arr) {}

解法：
主要是reservoir sampling这个concept和思想不太懂。
这题就是给Reservoir Sampling做一个modify， k = 1, 每次call的时候，当前长度为n的话，random出来一个1 到 n的数，
如果是n就返回最后这个，如果是 1到(n - 1)就返回上次的那个。
(n-1)/n * (1/(n-1)) = 1/n
Read full article from Reservoir Sampling | GeeksforGeeks

Reservoir Sampling | GeeksforGeeks

Labels

Popular Posts