http://blogs.msdn.com/b/spt/archive/2008/02/05/reservoir-sampling.aspx
http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/
http://gregable.com/2007/10/reservoir-sampling.html
Distributed Reservoir Sampling Variation
This is the problem that got me researching the weighted sample above. In both of the above algorithms, I can process the stream in O(N) time, where N is the length of the stream; in other words, in a single pass. If I want to break the problem up across, say, 10 machines and solve it close to 10 times faster, how can I do that?
The answer is to have each of the 10 machines take roughly 1/10th of the input to process and generate its own reservoir sample from its subset of the data using the weighted variation above. Then, a final process must take the 10 output reservoirs and merge them.
The trick is that the final process must reuse the original "key" weights computed in the first pass. For example, if one of your 10 machines processed only 10 items (with a size-10 reservoir) and the other nine machines each processed 1 million items, you would expect the keys on that one machine's 10 items to be smaller on average, and hence less likely to be selected in the final output. If you instead recompute keys in the final process, all of the input items would be treated equally when they shouldn't be.
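To make that concrete, here is a minimal Java sketch of the key-based reservoir, under my reading of the post: each machine keeps the k items with the largest random keys, and merging reuses the keys drawn in the first pass instead of redrawing them. Class and method names here are mine, not from the post; for weighted items the key would be drawn as Math.pow(u, 1.0 / weight) rather than plain u.

import java.util.PriorityQueue;
import java.util.Random;

// An item tagged with the random key drawn when it was first seen.
class KeyedItem<V> {
    final double key;
    final V value;
    KeyedItem(double key, V value) { this.key = key; this.value = value; }
}

class KeyedReservoir<V> {
    private final int k;
    private final Random rng = new Random();
    // Min-heap on key: the smallest key is the first candidate for eviction,
    // so the heap always holds the k largest keys seen so far.
    private final PriorityQueue<KeyedItem<V>> heap =
            new PriorityQueue<>((a, b) -> Double.compare(a.key, b.key));

    KeyedReservoir(int k) { this.k = k; }

    // First pass on each machine: draw a fresh key per item.
    void offer(V value) {
        offerKeyed(new KeyedItem<>(rng.nextDouble(), value));
    }

    // Keep the item only if its key beats the current smallest key.
    private void offerKeyed(KeyedItem<V> item) {
        if (heap.size() < k) {
            heap.add(item);
        } else if (item.key > heap.peek().key) {
            heap.poll();
            heap.add(item);
        }
    }

    // Final pass: merge another machine's reservoir, REUSING its keys.
    void merge(KeyedReservoir<V> other) {
        for (KeyedItem<V> item : other.heap) offerKeyed(item);
    }
}

Because the machine that saw 1 million items holds the top 10 keys out of a million draws, its keys crowd out the keys of the machine that saw only 10 items, which is exactly the bias the merge needs.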
http://hicammie.blogspot.com/2007/12/reservoir-sampling.html
http://had00b.blogspot.com/2013/07/random-subset-in-mapreduce.html
Let's say we have a number of files where each line is one of the input elements (the line counts over all files sum to n), and we'd like to select exactly k of those lines.
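One straightforward way to get exactly k lines, which I'd guess is the spirit of the linked post (the sketch and its names are mine): tag each line with a uniform random key and keep the k lines with the smallest keys. This parallelizes naturally, since each worker can keep a local top-k and a single final step merges them.

import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ExactKLines {
    // Keep the k lines with the smallest random keys. Sorting costs
    // O(n log n); a bounded max-heap would reduce this to O(n log k).
    static List<String> sample(Stream<String> lines, int k, Random rng) {
        return lines
                .map(line -> new SimpleEntry<>(rng.nextDouble(), line))
                .sorted(Map.Entry.comparingByKey())
                .limit(k)
                .map(SimpleEntry::getValue)
                .collect(Collectors.toList());
    }
}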
https://ballsandbins.wordpress.com/2014/04/13/distributedparallel-reservoir-sampling/
Split the data stream into partitions, one for each node. Apply reservoir sampling with reservoir size s, the final reservoir size, on each of the partitions. Finally, aggregate the sub-reservoirs into a final reservoir sample by carrying out reservoir sampling on their union.
Let's say you split data of size n across 2 nodes, so each partition has size n/2. The sub-reservoirs R1 and R2 are each of size s.
The probability that a given record ends up in its sub-reservoir (whether R1 or R2) is:
s / (n/2) = 2s/n
The probability that a record ends up in the final reservoir, given that it is in a sub-reservoir, is s / (2s) = 1/2, since the final pass samples s items from the 2s pooled items.
It follows that the probability that any given record ends up in the final reservoir is:
2s/n * 1/2 = s/n
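Here is a quick empirical check of that s/n result (a sketch of my own, not from the post): run the two-node scheme repeatedly and count how often one fixed record survives into the final reservoir.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class TwoNodeCheck {
    static final Random RNG = new Random();

    // Plain single-pass reservoir sampling (Algorithm R), reservoir size s.
    static List<Integer> reservoir(List<Integer> stream, int s) {
        List<Integer> r = new ArrayList<>();
        for (int i = 0; i < stream.size(); i++) {
            if (i < s) {
                r.add(stream.get(i));
            } else {
                int j = RNG.nextInt(i + 1);    // keep item with prob s/(i+1)
                if (j < s) r.set(j, stream.get(i));
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int n = 100, s = 5, trials = 100_000, hits = 0;
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < n; i++) data.add(i);
        for (int t = 0; t < trials; t++) {
            // Two equal partitions, each sampled down to s items...
            List<Integer> r1 = reservoir(data.subList(0, n / 2), s);
            List<Integer> r2 = reservoir(data.subList(n / 2, n), s);
            // ...then the pooled 2s items sampled down to s.
            List<Integer> pooled = new ArrayList<>(r1);
            pooled.addAll(r2);
            if (reservoir(pooled, s).contains(0)) hits++;
        }
        System.out.printf("empirical %.4f vs expected s/n = %.4f%n",
                          (double) hits / trials, (double) s / n);
    }
}

Record 0 should show up with frequency close to s/n = 0.05, matching the derivation above.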
The same scheme in pseudocode, generalized to sub-streams of unequal lengths (R and S are the two sub-reservoirs, built from sub-streams of observed lengths m and n; T is the merged size-k sample):

for (sub-stream s : sub-streams) do in parallel {
    simple sequential reservoir sampling and count length of s;
}
double p = (double) m / (m + n);
for (int i = 0; i < k; ++i) {
    j = rand.nextDouble();
    if (j <= p)
        move a random item from R to T;
    else
        move a random item from S to T;
}
return T;
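A runnable Java take on that merge step (my own sketch; the class and method names are hypothetical). One deliberate difference: it recomputes p after every draw using the decremented counts, which yields the exact without-replacement distribution, whereas the pseudocode above fixes p once up front.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class ReservoirMerge {
    // R: size-k sample of a stream of length m.
    // S: size-k sample of a stream of length n.
    // Returns a size-k sample of the combined m + n items.
    static <E> List<E> merge(List<E> R, long m, List<E> S, long n,
                             int k, Random rng) {
        List<E> r = new ArrayList<>(R);  // local copies: we remove as we draw
        List<E> s = new ArrayList<>(S);
        List<E> t = new ArrayList<>(k);
        for (int i = 0; i < k; i++) {
            // Draw from R with probability proportional to the number of
            // not-yet-consumed items in its original stream.
            double p = (double) m / (m + n);
            if (rng.nextDouble() < p) {
                t.add(r.remove(rng.nextInt(r.size())));
                m--;
            } else {
                t.add(s.remove(rng.nextInt(s.size())));
                n--;
            }
        }
        return t;
    }
}

With m = n this reduces to the 50/50 coin flips of the equal-partition analysis above.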