Massive Algorithms: Count-Min Sketch

http://lkozma.net/blog/sketching-data-structures/
The Count-Min (CM) sketch is less well known than the Bloom filter, but the two are similar (the CM sketch especially resembles the counting variants of the Bloom filter).

The problem here is to store a numerical value associated with each element, say the number of occurrences of the element in a stream (for example, when counting accesses to a server from different IP addresses). Surprisingly, this can be done using less space than the number of elements, with the trade-off that the result can sometimes be slightly off; the error matters mostly for elements with small counts. As with the Bloom filter, the parameters of the data structure can be chosen so as to obtain a desired accuracy.

CM works as follows: we have k different hash functions and k different tables, each indexed by the output of one of these functions (note that the Bloom filter can be implemented in this way as well). The fields in the tables are now integer values. Initially all fields are set to 0 (all unseen elements have count 0). When we increase the count of an element, we increment the k corresponding fields, one in each table, at the positions given by the hash values of the element. If a decrease operation is allowed (which makes things more difficult), we similarly subtract a value from all k fields.

To obtain the count of an element, we take the minimum of the k fields that correspond to that element (as given by the hashes). This makes intuitive sense: out of the k values, some have probably also been incremented for other elements (wherever the hash values collided). However, as long as at least one of the k fields has not been hit by any other element, the minimum gives the correct value.
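The update and query steps described above can be sketched for a general k as follows. This is a minimal illustration, not code from the linked article: the class name `SimpleCountMin`, the seeded `(a * x + b) mod p` hash family, and the use of `hashCode()` as the element key are all assumptions made for the example.

```java
import java.util.Random;

/** Minimal Count-Min sketch: k rows of counters, one hash function per row. */
class SimpleCountMin {
    // Assumed modulus for the illustrative hash family: the prime 2^31 - 1.
    private static final long PRIME = 2147483647L;

    private final int width;       // counters per row
    private final long[][] table;  // table[row][column]
    private final long[] a;        // per-row hash multipliers
    private final long[] b;        // per-row hash offsets

    SimpleCountMin(int k, int width, long seed) {
        this.width = width;
        this.table = new long[k][width];
        this.a = new long[k];
        this.b = new long[k];
        Random rnd = new Random(seed);
        for (int i = 0; i < k; i++) {
            a[i] = 1 + rnd.nextInt((int) PRIME - 1); // in [1, PRIME - 1]
            b[i] = rnd.nextInt((int) PRIME);         // in [0, PRIME - 1]
        }
    }

    /** Column of the given row that this item maps to. */
    private int column(int row, Object item) {
        long x = item.hashCode() & 0x7fffffffL;      // non-negative 31-bit key
        return (int) (((a[row] * x + b[row]) % PRIME) % width);
    }

    /** Adds count to the item's field in every row. */
    void add(Object item, long count) {
        for (int row = 0; row < table.length; row++) {
            table[row][column(row, item)] += count;
        }
    }

    /** Returns the minimum of the item's k fields. */
    long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < table.length; row++) {
            min = Math.min(min, table[row][column(row, item)]);
        }
        return min;
    }
}
```

With only non-negative updates, `estimate` never returns less than the true count, because every field an element maps to has received at least that element's own increments.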

For example, suppose we want to notice if an IP address is responsible for a lot of traffic (to further investigate whether there is a problem or some kind of attack). The CM structure allows us to do this without storing a record for each address. When we increment the fields corresponding to an address, we simultaneously check whether the minimum is above some threshold, and if it is, we perform some costly operation (which might turn out to be a false alert). On the other hand, the real count can never be larger than the reported number, so if the minimum is small, we don’t have to do anything (this holds for the simple variant presented here, which does not allow decreases). As the example shows, the CM sketch is most useful for detecting “heavy hitters” in a stream.
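The threshold check from this example might look roughly like the sketch below. The class name `HeavyHitterMonitor`, the `record` method, and the ad hoc per-row hash are hypothetical and chosen only for illustration; a real deployment would use a proper pairwise-independent hash family.

```java
/** Flags keys whose estimated count reaches a threshold (false alerts possible, misses not). */
class HeavyHitterMonitor {
    private final int[][] counts;  // counts[row][column]
    private final int threshold;

    HeavyHitterMonitor(int rows, int width, int threshold) {
        this.counts = new int[rows][width];
        this.threshold = threshold;
    }

    // Ad hoc per-row hash, for illustration only (an assumption of this sketch).
    private int column(int row, String key) {
        int h = key.hashCode() * (2 * row + 1) + row * 0x9E3779B1;
        h ^= h >>> 16;
        return Math.floorMod(h, counts[row].length);
    }

    /** Counts one access and returns true if the key now looks like a heavy hitter. */
    boolean record(String key) {
        int min = Integer.MAX_VALUE;
        for (int row = 0; row < counts.length; row++) {
            int c = ++counts[row][column(row, key)]; // increment, then read
            min = Math.min(min, c);
        }
        return min >= threshold;
    }
}
```

Because the minimum never undercounts, an address whose true count has reached the threshold is always flagged; a `true` result for a rarely seen address is the false alert mentioned above and would be resolved by the costly follow-up check.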

Java code:
https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java
http://www.sanfoundry.com/java-program-implement-count-min-sketch/

```java
class CountMinSketch
{
    private static final int DEFAULT_SIZE = 11;

    private int[] h1;
    private int[] h2;
    private int[] h3;
    private int size;

    /** Constructor **/
    public CountMinSketch()
    {
        size = DEFAULT_SIZE;
        h1 = new int[ size ];
        h2 = new int[ size ];
        h3 = new int[ size ];
    }

    /** Function to clear all counters **/
    public void clear()
    {
        h1 = new int[ size ];
        h2 = new int[ size ];
        h3 = new int[ size ];
    }

    /** Function to insert value **/
    public void insert(int val)
    {
        int hash1 = hashFunc1(val);
        int hash2 = hashFunc2(val);
        int hash3 = hashFunc3(val);
        /** increment counters **/
        h1[ hash1 ]++;
        h2[ hash2 ]++;
        h3[ hash3 ]++;
    }

    /** Function to get sketch count **/
    public int sketchCount(int val)
    {
        int hash1 = hashFunc1(val);
        int hash2 = hashFunc2(val);
        int hash3 = hashFunc3(val);
        return min( h1[ hash1 ], h2[ hash2 ], h3[ hash3 ] );
    }

    private int min(int a, int b, int c)
    {
        return Math.min(a, Math.min(b, c));
    }

    /** Hash function 1 **/
    private int hashFunc1(int val)
    {
        // floorMod keeps the index in [0, size) even for negative val
        return Math.floorMod(val, size);
    }

    /** Hash function 2 **/
    private int hashFunc2(int val)
    {
        // computed in long so that val * (val + 3) cannot overflow int
        return (int) Math.floorMod((long) val * (val + 3L), (long) size);
    }

    /** Hash function 3 **/
    private int hashFunc3(int val)
    {
        // mirror image of hashFunc1, so it collides on exactly the same inputs;
        // acceptable for a toy example only
        return (size - 1) - Math.floorMod(val, size);
    }
}
```

Ruminations of a Programmer: Count-Min Sketch - A Data Structure for Stream Mining Applications
The idea is quite simple: the data structure relies on probabilistic techniques to serve various types of queries on streaming data. It is parameterized by two factors - ε and δ - where the error in answering a query is within a factor of ε with probability at least 1 − δ. So you can tune these parameters based on the space you can afford, trading memory against the accuracy of the results that the data structure can serve you.
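As a rough illustration of how these parameters translate into space, the standard analysis of the Count-Min sketch uses ⌈e/ε⌉ counters per row and ⌈ln(1/δ)⌉ rows, which bounds the overestimate by ε times the total count with probability at least 1 − δ. The helper class `SketchSizing` below is a hypothetical name used only for this sketch.

```java
/** Derives Count-Min table dimensions from the error parameters. */
class SketchSizing {
    /** Counters per row: ceil(e / epsilon). */
    static int width(double epsilon) {
        return (int) Math.ceil(Math.E / epsilon);
    }

    /** Number of rows (one hash function each): ceil(ln(1 / delta)). */
    static int depth(double delta) {
        return (int) Math.ceil(Math.log(1.0 / delta));
    }
}
```

For ε = 0.01 and δ = 0.01 this gives 272 counters per row and 5 rows, that is 1,360 counters in total, regardless of how many distinct elements appear in the stream.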

https://dzone.com/articles/count-min-sketch-data
One widely used technique for storing a subset of data is through Random Sampling, where the data stored is selected through some stochastic mechanism.
