Massive Algorithms: Weighted Reservoir Sampling

Weighted Reservoir Sampling

Weighted Reservoir Sampling
https://shanzi.gitbooks.io/algorithm-notes/content/problem_solutions/weighted_random_distribution.html

How many items to be randomly chosen from?
Do the members of item list change frequently?
Does weight value of an item change frequently?
How much extra space we can use?
How is the precision of weight value? That is, is the value an integer or decimal? if it is a float number, how many decimal spaces is allowed?

Expanding
Expanding is the easiest way to solve this problem. The idea is simple, we repeat an item as many as the weight value after translation the value into integer. This method is proper for that an item won't be repeat too many time.
In this solution, you can add a new item easily. But removing or changing the value will still cost a lot.

class ExpandingWeightRandom {
    ArrayList<Integer> items = new ArrayList<Integer>();

    void addItem(int item, int weight) {
        for(int i = 0; i < weight; i++) {
            items.add(item);
        }
    }

    void removeItem(int item) {
        int start = -1;
        for(int i = 0; i < items.size(); i++) {
            if (items.get(i) == item){
                start = i;
                break;
            }
        }
        int end = start + 1;
        for (; end < items.size(); end++) {
            if (items.get(i) != item) break;
        }
        items.removeRange(start, end);
    }

    int chooseOne() {
        if (items.isEmpty()) return 0;

        return items.get(Random.nextInt(items.size() - 1))
    }
}

In-place (Unsorted)
As it is said above, if the weight value is very large, say 412308312, it will be not even possible to use expanding at all. Here we have another simple solution.

Instead of using an ArrayList, we use a HashMap to keep the item as well as its weight value in record. Each time we generate a random number between zero and the total weight value in storage. Then we iterate through the items, sum up the weight values processed. The iterate should terminated when we have an item that, before added the weight value of it, the sum of processed weight is smaller or equal to the random number, and after that, the sum exceed.

class InPlaceWeightRandom {
    HashMap<Integer, Integer> weightsMap = new HashMap<Integer, Integer>();
    int totalWeight = 0;

    void putItem(int item, int weight) {
        if (weightsMap.containsKey(item)) {
            totalWeight -= weight;
        }
        weightsMap.put(item, weight);
        totalWeight += weight;
    }

    void removeItem(int item) {
        if (weightsMap.containsKey(item)) {
            totalWeights -= weightMap.get(item);
            weightsMap.remove(item);
        }
    }

    int chooseOne() {
        int sum = 0; 
        int random = Random.nextInt(totalWeights - 1);
        for(Map.Entry<Integer, Integer> entry: weightsMap) {// no order
            sum += entry.getValue();
            if (sum > random) return entry.getKey();
        }
        return 0;
    }
}

In-place(sorted)
this algorithm is in fact not a good one. But we should pay attention to the idea that we use greedy strategy to speed up the summing process.

X. Pre-calculated
The main idea is, we pre-calculate the sum of weights and decide the boundary to choose each item. Each time we'd like to pick an element, we use binary search technique to find and return the element fast.

class PreCalculateWeightsRandom{
    int totalWeight = 0;
    ArrayList<Integer> itemList = new ArrayList<Integer>();
    ArrayList<Integer> weightsList = new ArrayList<Integer>();

    void addItem(int item, int weight) {
        itemList.add(item);
        totalWeights += weight;
        weightList.add(totalWeights);
    }

    void removeItem(int item) {
        int index = itemList.indexOf(item);
        itemList.remove(index);
        int weight = weightList.get(index);
        weightsList.remove(index);
        for (int i = index; i < weightsList.size(); i++) {
            weightList.get(i) = weightList.get(i) - weight;
        }
    }

    int chooseOne() {
        int random = Random.nextInt(totalWeights - 1);
        int index = binarySearchBoundary(weightList, random);
        return itemList.get(index);
    }

    int binarySearchBoundary(ArrayList<Integer> list, int value) {
        int l = 0;
        int r = list.size() - 1;

        while (l < r) {
            int mid = (l + r) / 2;
            if (list.get(mid) < value) l = mid + 1;
            else r = mid;
        }

        return r;
    }
}

http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/
The second algorithm is even more interesting: a weighted distributed reservoir sample, where every item in the set has an associated weight, and we want to sample such that the probability that an item is selected is proportional to its weight. It wasn’t even clear whether or not this was even possible until Pavlos Efraimidis and Paul Spirakis figured out a way to do it and published it in the 2005 paper “Weighted Random Sampling with a Reservoir.” The solution is as simple as it is elegant, and it is based on the same idea as the distributed reservoir sampling algorithm described above. For each item in the stream, we compute a score as follows: first, generate a random number R between 0 and 1, and then take the nth root of R, where n is the weight of the current item. Return the K items with the highest score as the sample. Items with higher weights will tend to have scores that are closer to 1, and are thus more likely to be picked than items with smaller weights.

public class WeightedReservoirSampling {

    public static int[] weightedReservoirSampling(int[][] items, int m){
        if(items==null || m > items.length){
            System.out.println("incorrect input.");
            return null;
        }
        PriorityQueue<WrsElement> heap = new PriorityQueue<WrsElement>(10);
        /** Transform weight into a double between (0,1). Put it in min heap. **/
        for(int i=0; i < items.length ; i++){
            double key = Math.pow((Math.random()), 1/(double)items[i][1]);
            WrsElement element = new WrsElement(items[i][0], key);
            heap.add(element);
            if(heap.size() > m){
                heap.poll();
            }
        }
        /** Output result. **/
        int[] result = new int[m];
        for(int i=0;i<m; i++){
            result[i] = heap.poll().value;
        }
        return result;
    }

    static class WrsElement implements Comparable<WrsElement>{

        int value;
        double key;

        public WrsElement(int value, double key){
            this.value = value;
            this.key = key;
        }

        public int compareTo(WrsElement wrsElement) {
            return Double.compare(this.key, wrsElement.key);
        }

        @Override
        public String toString() {
            return "WrsElement{" +
                    "value=" + value +
                    ", key=" + key +
                    '}';
        }
    }

    public static void main(String[] args) {
        /** each item = {id, weight} **/
        int[][] items = {
                {1, 1}, {2, 5}, {3, 20}, {4, 5}, {5, 10}, {6, 3}, {7, 3}, {8, 3}
        };
        int[] result = weightedReservoirSampling(items, 3);
        System.out.println(Arrays.toString(result));
    }

    public static void test(){
        int[][] items = {{0, 2}, {1, 1}, {2, 1}, {3, 1}, {4, 1} };
        int m = 2;
        int[] probabilities = new int[items.length];
        for(int i=0; i< 1000; i++) {
            int[] result = weightedReservoirSampling(items, m);
            for(int j = 0; j < m; j++){
                probabilities[result[j]]++;
            }
        }
        System.out.println(Arrays.toString(probabilities));
    }
}

http://blog.gainlo.co/index.php/2016/11/11/uber-interview-question-weighted-random-numbers/

Write a function that returns values randomly, according to their weight.

Let me give you an example. Suppose we have 3 elements with their weights: A (1), B (1) and C (2). The function should return A with probability 25%, B with 25% and C with 50% based on the weights.

a naive solution is to duplicate each element with the time of its weight and then do the random sampling from the new set. There are two potential problems here, maybe you can think about it first.

Two problems of the above solution:

This won’t work when the weight is not an integer.
If the weight is large, it uses too much memory as it stores the same element for many times.

In essence, we can denote the new set above as a horizontal line like this:

The sampling is like randomly select a point and see which area it falls into. Going into this idea, we can have the following algorithm:

W is the sum of all the weights (length of the horizontal line)
Get a random number R from [0, W] (randomly select a point)
Go over each element in order and keep the sum of weights of visited elements. Once the sum is larger than R, return the current element. This is finding which area includes the point.

The time/space complexity should be extremely straightforward. Since we are not using extra memory, the space complexity is O(1). Time complexity is O(N), because we have to iterate over all the elements in both step 1 and 3. It’s worth to note that even if we can put step 1 in preprocessing, we still have to do the traversal in step 3.

Any way to optimize this?

If making step 3 to O(1) is hard, let’s try make it O(logN). The first thing we should consider is binary search. If we keep an array that each element is the accumulative sum of weights of all the previous elements, we can do a binary search to locate the element. This improves the algorithm to O(logN) (excluding the preprocessing).

If we want to make it O(1), we may need to add some constraints. As we’ve been mentioning for so many times, the tradeoff between time and space complexity provides a lot of hints about optimization. Obviously, we’d like to speed it up and we can think about how to use more memories here.

The most time-consuming part is at step 3 and we need to have an approach of O(1) to get the corresponding area. Let’s say if we preprocessing all the elements once by keeping an array M where M[i] stores the element corresponding to point i. This allows us to return the area immediately from the array.

In essence, this approach is like storing results for every single input in a hashmap and it has space complexity O[W] where W is the sum of all the weights.

Be careful about the edge case. When comparing the current sum to the random number, should you use “>”, “>=” or it makes no difference?
If you could define a class, it’ll be better to have the preprocessing code that calculates the sum of all the weights.

An extra question, how do you test your function?

https://gist.github.com/gcrfelix/99db26e4128c88e5c324
题目：给一组id和表示每个id出现概率的数组，概率之和为1.要求随机生成id，使得随机出的id满足之前的概率数组

所给ID： 1, 2, 3
所给概率：0.1， 0.3， 0.6.
现在我们要随机生成一个ID，这个生成方法要求按照概率来生成。
也就是10% 给出1, 30% 给出2， 60% 给出3.

一个数组
1,4,2,6.....
每次调⽤用一个函数，按照数组里面的数字的大小，返回相应的Index。
比如，上面的例⼦子就是
1/13 的概率返回0,
4/13的概率返回1

解题方法：
先找cumulative probability［1, 5, 7, 13]，然后弄个0～13之间的random数字⽐比较过去找它的位置就好 (binary search)
Read full article from Weighted Reservoir Sampling

Weighted Reservoir Sampling

Labels

Popular Posts