https://en.wikipedia.org/wiki/HyperLogLog
HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset (the cardinality).
http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html?currentPage=2
One common approach to this problem is the use of bitmaps. Bitmaps can be used to quickly and accurately get the cardinality of a given input. The basic idea with a bitmap is to map the input dataset to a bit field so that each input element maps to exactly one bit (directly for integer IDs, or via a hash function otherwise). With a one-to-one mapping there are no collisions, and the space required to count each unique element shrinks to 1 bit. While bitmaps drastically reduce the space requirements of a naive set implementation (e.g., a hash set of the raw elements), they are still problematic when the cardinality is very high and/or you have a very large number of different sets to count. For example, to count to one billion with a bitmap you need one billion bits, or roughly 120 megabytes per counter. Sparse bitmaps can be compressed to gain space efficiency, but that is not always helpful.
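As a rough illustration of the bitmap approach, here is a minimal Java sketch built on java.util.BitSet; it assumes the elements are integer IDs drawn from a known range so that each ID maps directly to its own bit (the class and method names are made up for this example):

```java
import java.util.BitSet;

// Minimal bitmap counter sketch: assumes elements are integer IDs in a known
// range [0, maxId). Each ID maps to exactly one bit, so there are no
// collisions and the distinct count is exact.
public class BitmapCounter {
    private final BitSet bits;

    public BitmapCounter(int maxId) {
        this.bits = new BitSet(maxId);
    }

    public void add(int id) {
        bits.set(id);                    // mark this ID as seen; duplicates are idempotent
    }

    public int distinctCount() {
        return bits.cardinality();       // number of set bits = number of distinct IDs
    }

    public static void main(String[] args) {
        BitmapCounter counter = new BitmapCounter(1_000_000);
        int[] stream = {42, 7, 42, 99_999, 7};
        for (int id : stream) counter.add(id);
        System.out.println(counter.distinctCount()); // prints 3
    }
}
```

One such BitSet per counter is exactly the ~120 MB-per-billion-IDs cost described above, which is what motivates the move to HLL.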
http://blog.mparticle.com/a-must-know-for-data-scientists-hyperloglog-algorithm/
HyperLogLog is a space-efficient algorithm for approximate cardinality estimation. Nowadays a lot of data products use it, including Redis, AWS Redshift, and Druid. Redis's implementation of HLL uses at most 12 KB of memory to estimate the cardinality of practically any set.
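As a usage illustration (not taken from the sources above), here is a sketch of driving Redis's built-in HLL from Java via the Jedis client and the PFADD/PFCOUNT commands; the key name and values are made up for the example:

```java
import redis.clients.jedis.Jedis;

// Rough sketch of using Redis's HyperLogLog through the Jedis client.
public class RedisHllExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // PFADD feeds elements into the HLL stored at the given key.
            jedis.pfadd("visitors:2024-01-01", "user-1", "user-2", "user-1");
            // PFCOUNT returns the approximate number of distinct elements.
            long approxDistinct = jedis.pfcount("visitors:2024-01-01");
            System.out.println(approxDistinct); // ~2
        }
    }
}
```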
http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
The basic idea of HLL is as follows. Imagine you flip a coin 100 times and count the number of consecutive heads before the first tail. Let's call this one experiment. If you perform a few experiments, you may get the following series (H for heads and T for tails):
experiment 1: HHTHTTHH….
experiment 2: HTHHHTTT….
experiment 3: TTHTHHHT….
The number of consecutive heads at the beginning of each series is 2, 1, and 0, respectively. If you only do a small number of experiments, chances are that the maximum number of consecutive heads is small. However, if you have the patience to perform a lot of experiments, you may get a long run of consecutive heads from at least one experiment.
HLL essentially hashes every element in a set, counts the number of consecutive leading 0’s from each hash, and estimates the cardinality based on the largest count of consecutive leading 0’s out of the hashes.
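To make the analogy concrete, here is a deliberately oversimplified single-value sketch in Java (the class name and the FNV-1a hash are arbitrary choices for illustration): hash each element, remember the longest run of leading zero bits, and read off 2^(longest run) as a crude estimate. A real HLL averages many such observations, as described below, because a single maximum has very high variance.

```java
import java.nio.charset.StandardCharsets;

// Single-register sketch of the core idea: hash each element, track the
// longest run of leading zero bits, and estimate the cardinality as
// 2^(longest run). This is only the intuition; it is not the full HLL.
public class LeadingZeroSketch {
    private int maxLeadingZeros = 0;

    public void add(String element) {
        long hash = fnv1a64(element.getBytes(StandardCharsets.UTF_8));
        int leadingZeros = Long.numberOfLeadingZeros(hash);
        maxLeadingZeros = Math.max(maxLeadingZeros, leadingZeros);
    }

    public double estimate() {
        return Math.pow(2, maxLeadingZeros);   // very rough: high variance without averaging
    }

    // FNV-1a: a simple, well-known 64-bit hash; any good hash function works here.
    private static long fnv1a64(byte[] data) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}
```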
- Estimate the cardinality of the union of multiple sets. It is natural to combine multiple HLLs: simply take the largest count of consecutive leading 0's seen across the HLLs (with the multi-register variant, the maximum is taken register by register; see the sketch after this list).
- Estimate the overlap of two sets. Since |A ∩ B| = |A| + |B| - |A ∪ B|, the overlap of two sets can be calculated from the cardinality of each set and the cardinality of their union.
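A short sketch of both set operations, assuming each HLL is represented as an array of per-bucket registers (the representation of the multi-register variant discussed further down); the class and method names are made up for illustration:

```java
// Merging and intersection sketches over register-array HLLs.
public class HllSetOperations {

    // Union: take the per-register maximum (assumes both HLLs use the same
    // number of registers). The merged registers are exactly what a single
    // HLL would hold had it processed both streams.
    public static int[] union(int[] a, int[] b) {
        int[] merged = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            merged[i] = Math.max(a[i], b[i]);
        }
        return merged;
    }

    // Intersection via inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|.
    public static double intersection(double estA, double estB, double estUnion) {
        return estA + estB - estUnion;
    }
}
```

Note that the intersection estimate inherits the errors of all three cardinality estimates, so it can degrade quickly when the overlap is small relative to the sets.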
Bit-pattern observables: these are based on certain patterns of bits occurring at the beginning of the (binary) S-values. For instance, observing in the stream S at the beginning of a string a bit-pattern 0^(ρ−1)1 is more or less a likely indication that the cardinality n of S is at least 2^ρ.
Order statistics observables: these are based on order statistics, like the smallest (real) values, that appear in S. For instance, if X = min(S), we may legitimately hope that n is roughly of the order of 1/X…
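To spell out the intuition behind the two observables: a uniformly distributed hash value starts with the bit-pattern 0^(ρ−1)1 with probability 2^(−ρ), so among n distinct hashed values we expect about n · 2^(−ρ) such prefixes, and seeing the pattern at least once suggests n is on the order of 2^ρ. The order-statistics view is the dual: the minimum of n independent uniform values on (0, 1) has expectation 1/(n + 1), so observing X = min(S) lets us read off n ≈ 1/X.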
http://opensourceconnections.com/blog/2015/02/04/its-log-its-log-its-big-its-hyper-its-good/
HyperLogLog is based on analyzing bit patterns in the hashed value of an item. We look at the length of the run of leading (most significant) zero bits. The maximum length of such a zero-bit run across all the hashed values is indicative of the unique item count.
In reality, to improve the quality of the result, many observations are averaged rather than relying on a single maximum: practical HLL splits each hash into a bucket index and a remainder (some descriptions instead speak of multiple independent hash functions), keeps the longest run of leading zero bits per bucket, and combines the per-bucket values with a harmonic mean to produce the final unique-count estimate.
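Putting the pieces together, here is a minimal multi-register sketch under that stochastic-averaging formulation (one hash function; the top p bits pick a register, the remaining bits feed the leading-zero count). The class name and hash choice are illustrative, and the small- and large-range bias corrections from the paper are omitted:

```java
import java.nio.charset.StandardCharsets;

// Minimal HyperLogLog-style sketch with m = 2^p registers and the raw
// harmonic-mean estimator; bias corrections are left out for brevity.
public class MiniHyperLogLog {
    private final int p;           // number of index bits
    private final int m;           // number of registers, m = 2^p
    private final int[] registers;

    public MiniHyperLogLog(int p) {
        this.p = p;
        this.m = 1 << p;
        this.registers = new int[m];
    }

    public void add(String element) {
        long hash = fnv1a64(element.getBytes(StandardCharsets.UTF_8));
        int index = (int) (hash >>> (64 - p));           // top p bits pick the register
        long rest = hash << p;                           // remaining bits of the hash
        int rank = Long.numberOfLeadingZeros(rest) + 1;  // position of the first 1-bit
        registers[index] = Math.max(registers[index], rank);
    }

    public double estimate() {
        double sum = 0.0;
        for (int r : registers) {
            sum += Math.pow(2, -r);                      // harmonic mean of 2^register
        }
        double alpha = 0.7213 / (1 + 1.079 / m);         // constant from the HLL paper (m >= 128)
        return alpha * m * m / sum;
    }

    private static long fnv1a64(byte[] data) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}
```

With p = 14 (m = 16384 registers of a few bits each, about 12 KB), the relative standard error is roughly 1.04/√m ≈ 0.8%, which is where the Redis figure quoted above comes from.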
Twitter's algebird
Java code: