Massive Algorithms: Bloom filters in Java: example implementation

Bloom filters in Java: example implementation
Bloom Filters: an example Java implementation Having introduced the concept of a Bloom filter , we now look at an example Java implementation. We will consider the case of representing a set of strings. Generating hash codes The basic component that we need to implement our Bloom filter is a way of generating an arbitrary number of good-quality hash codes for a given string. Previously, we discussed a strong hash code implementation which produced 64-bit hash codes based on a table of random 64-bit numbers.

http://www.cnblogs.com/dolphin0520/archive/2012/11/10/2755089.html
布隆过滤器（Bloom Filter），它是由Burton Howard Bloom在1970年提出的，它结合了位图和Hash表两者的优点，位图的优点是节省空间，但是只能处理整型值一类的问题，无法处理字符串一类的问题，而Hash表却恰巧解决了位图无法解决的问题，然而Hash太浪费空间。针对这个问题，布隆提出了一种基于二进制向量和一系列随机函数的数据结构-布隆过滤器。它的空间利用率和时间效率是很多算法无法企及的，但是它也有一些缺点，就是会有一定的误判率并且不支持删除操作。

　　尤其要注意的是，布隆过滤器是不允许删除元素的，因为若删除一个元素，可能会发生漏判的情况。不过有一种布隆过滤器的变体Counter Bloom Filter，可以支持删除元素

　　布隆过滤器在很多场合能发挥很好的效果，比如：网页URL的去重，垃圾邮件的判别，集合重复元素的判别，查询加速（比如基于key-value的存储系统）等，下面举几个例子：

　　1.有两个URL集合A,B，每个集合中大约有1亿个URL，每个URL占64字节，有1G的内存，如何找出两个集合中重复的URL。

　　很显然，直接利用Hash表会超出内存限制的范围。这里给出两种思路：

　　第一种：如果不允许一定的错误率的话，只有用分治的思想去解决，将A,B两个集合中的URL分别存到若干个文件中{f1,f2...fk}和{g1,g2....gk}中，然后取f1和g1的内容读入内存，将f1的内容存储到hash_map当中，然后再取g1中的url，若有相同的url，则写入到文件中，然后直到g1的内容读取完毕，再取g2...gk。然后再取f2的内容读入内存。。。依次类推，知道找出所有的重复url。

　　第二种：如果允许一定错误率的话，则可以用布隆过滤器的思想。

　　2.在进行网页爬虫时，其中有一个很重要的过程是重复URL的判别，如果将所有的url存入到数据库中，当数据库中URL的数量很多时，在判重时会造成效率低下，此时常见的一种做法就是利用布隆过滤器，还有一种方法是利用berkeley db来存储url，Berkeley db是一种基于key-value存储的非关系数据库引擎，能够大大提高url判重的效率。

http://www.fgdsb.com/2015/01/03/estimate-the-number-of-unique-strings-with-limited-memory/

Estimate the number of unique strings with limited memory
Given a large array of strings S = [s1, s2, … sN], determine Uniq(S) = how many unique strings there are in S.

If it’s only required to estimate the number with limited memory, It’s possible to use a simplified hash table to do this.
We can define a boolean array S
For each iteration we calculate the hash function H of current string, and check if the S[H] is true which means that we have inserted the same string before current iteration, so we do not count the string.
If the hash function have no collision, the algorithm can always get the accurate result.
To minimize the collision, we can use a group of hash functions like a bloom filter ….(each time we check S1[H1] S2[H2] .. SN[HN], if all those variable have been set, we assert that the string is duplicated)

Assuming good hashing scheme, this can do 4 billion strings in 512MB memory (ideal case). This means ~64 billion strings in 8GB. If N becomes two large, a distributed hash table can be used to span the data across multiple servers. Another alternative could be to store the data on disk and maintain indexes for faster seeks and updates

Also read https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/hash/BloomFilter.java
Read full article from Bloom filters in Java: example implementation

Bloom filters in Java: example implementation

Labels

Popular Posts