What are the Bloom Filters?
- Suppose you are creating an account on a website, you want to enter a cool username, you enter it and get a message, “Username is already taken”. You added your birth date along with your username, still no luck. Now you have added your university roll number also, still got “Username is already taken”. It’s really frustrating, isn’t it?
- But have you ever thought about how quickly a website checks availability of username by searching millions of username registered with it.
- Bloom Filter is a data structure that can do this job.
- A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
- For example, checking availability of username is a set membership problem, where the set is the list of all registered usernames.
- The price we pay for efficiency is that it is probabilistic in nature that means, there might be some False Positive results.
- False positive means, it might tell that a given username is already taken but actually it’s not.
Properties
- Bloom filters never generate false negative results, i.e., telling you that a username doesn’t exist when it actually exists.
- Deleting elements from a filter is not possible because, if we delete a single element by clearing bits at indices generated by k hash functions, it might cause deletion of few other elements.
- Adding an element never fails. However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result.
Working of Bloom Filters
- A empty bloom filter is a bit array of m bits, all set to zero, like this –
- We need k number of hash functions to calculate the hashes for a given input. When we want to add an item in the filter, the bits at k indices h1(x), h2(x), … hk(x) are set, where indices are calculated using hash functions.
- Example – Suppose we want to enter “geeks” in the filter, we are using 3 hash functions and a bit array of length 10, all set to 0 initially. First we’ll calculate the hashes as follows:
h1(“geeks”) % 10 = 1
h2(“geeks”) % 10 = 4
h3(“geeks”) % 10 = 7
- Now we will set the bits at indices 1, 4 and 7 to 1
- Again we want to enter “nerd”, similarly, we’ll calculate hashes
h1(“nerd”) % 10 = 3
h2(“nerd”) % 10 = 5
h3(“nerd”) % 10 = 4
- Set the bits at indices 3, 5 and 4 to 1
- Now we want to check if “geeks” is present in the filter or not. We’ll do the same process but this time in reverse order.
- We calculate respective hashes using h1, h2 and h3 and check if all these indices are set to 1 in the bit array.
- If all the bits are set then we can say that “geeks” is probably present. If any of the bit at these indices are 0 then “geeks” is definitely not present.
False Positive case in Bloom Filters
- The question is why we said “probably present”, why this uncertainty. Let’s understand this with an example. Suppose we want to check whether “cat” is present or not. We’ll calculate hashes using h1, h2 and h3.
h1(“cat”) % 10 = 1
h2(“cat”) % 10 = 3
h3(“cat”) % 10 = 7
- If we check the bit array, bits at these indices are set to 1 but we know that “cat” was never added to the filter. Bit at index 1 and 7 was set when we added “geeks” and bit 3 was set we added “nerd”.
- So, because bits at calculated indices are already set by some other item, bloom filter erroneously claims that “cat” is present and generates a false positive result. Depending on the application, it could be a huge downside or relatively okay.
- We can control the probability of getting a false positive by controlling the size of the Bloom filter. More space means fewer false positives.
- If we want to decrease the probability of false positive result, we have to use more hash functions and larger bit arrays.
NOTE : We cannot delete an element in Bloom Filter.
Application
- Medium uses bloom filters for recommending posts to users by filtering posts which have been seen by users.
- Quora implemented a shared bloom filter in the feed backend to filter out stories that people have seen before.
- The Google Chrome web browser used to use a Bloom filter to identify malicious URLs
- Google BigTable, Apache HBase and Apache Cassandra, and Postgresql use Bloom filters to reduce the disk lookups for non-existent rows or columns.