Wednesday, June 13, 2012

Bloom Filters

Bloom filters are probabilistic data-structures built with the aim of handling specific usecases of huge DataSets while keeping the memory consumption minimum. Bloom filters are created by parsing the dataset once and once the entire dataset is parsed, the Bloom filters can be used to quickly Query if a particular Data item is there in the dataset or not.

An important thing to note is that Bloom filters can return False Positives but never False negatives. That means if a Query made on a Bloom Filter returns that an item doesnt exist in a dataset we can be sure about this result, but if the Query made on the Bloom Filter returns that data exists in the dataset, there is a small chance that the element might not exist in the dataset.

BloomFilters consist of a Number of hashtables of fixed sizes. Initially all the bits are set to zero in all the tables.

Each word of the dataset is hashed individually into the Bloom Filters using a single Hash Function or separate hash Functions, and the mod of the value returned by the hash is % with the size of the hashtable and the result key is set to 1.

The data-structure is probabilistic because there is a low but finite probability that two words will have collisions in each of the hash tables and Query might return true for one of them even though it might not be present in the original dataset. However, the chances of such errors are very low and BloomFilters can be customized, configured and optimized to better suit the given DataSet.

Many kinds of variations exist. One which use a single hash function for all tables has tables of varying lengths, so the % comes out to be different. Another implementation uses multiple hash functions but just a single hash table, where all the generated hashes are set.

From http://www.javamex.com/tutorials/collections/bloom_filter.shtml -

  • we allocate m bits to represent the set data;
  • we write a hash function that, instead of a single hash code, produces k hash codes for a given object;
  • to add an object to the set, we derive bit indexes from all k hash codes and set those bits;
  • to determine if an object is in the set, we again calculate the corresponding hash codes and bit indexes, and say that it is present if and only if all corresponding bits are set.

 

No comments:

Powered By Blogger
Custom Search