We run multiple elaborate data pipelines here at Semantics3. So, keeping track of what has already been indexed in a resource-efficient manner is quite important to us. To this end, we use nifty sketching data structures wherever possible.
In this post I will talk about one such data structure — the bloom filter.
What’s a bloom filter?
A bloom filter is a probabilistic data structure that allows for membership lookups in constant space & time. It is really useful when you want to do membership lookups with a reasonable margin of error in a space-constrained environment. To distill the idea, a bloom filter is akin to a hash table except that it is a lot more space efficient.
Let us take a bit array of size m with all the m bits set to 0, a hash function h such that the modulo of the resultant hash is an index in the array and a set of n items.
Whenever a new value from the set is to be inserted, we hash the value using the hash function h and mark the bit in the corresponding index to 1.
def insert(val): index = hash(val) % m bit_array[index] = 1
Now when we want to check if a particular value exists, we simply hash the value to be checked and look up against the index in the bit array — if the bit at the corresponding index is 0, the value definitely does not exist (false negatives impossible). However, if the bit is 1, it means that the filter has previously seen a value which resulted in the same index (false positives possible).
def lookup(val): index = hash(val) % m if bit_array[index]: return True else: return False
Plotting the false positive probability against the size of the bit array for a fixed number of elements to be inserted, we can see that the rate of false positives decreases sharply as bit array size increases. From the graph for p < 0.01, we would need a bit array of size 100x the number of values(n) if we used only one hash function.
An important idea behind a bloom filter is that if we instead use k hash functions, the space requirement(m) can be greatly reduced.
def insert(val): for hash in hash_functions: index = hash(val) % m bit_array[index] = 1
Whenever a new value is to be added to the filter, it is hashed with each one of the k hash functions (h1, h2, … hk) and the corresponding bit array indices are set to 1. Similarly, when we need to check if the given value exists, we hash it using each of the hash functions and check if every corresponding bit is set to 1. Even if one bit isn’t 1, it is concluded that the value doesn’t exist.
def lookup(val): results =  for hash in hash_functions: index = hash(val) % m if bit_array[index]: results.append(True) else: results.append(False) return reduce(lambda a, b : a & b, results)
The graph suggests that increasing the number of hash functions(k) from 1 to 5 has achieved p < 0.01 in a space(m) 10x smaller.
Tuning ’em filters
Depending on the number of values to be inserted and the maximum permissible false positive rate, we can arrive at the optimal number of hash functions. There is a good amount of literature on how to choose good hash functions.
From the graph above we can see that p < 0.01 at (k=1, m=1B), (k=5, m=100M) and (k=8, m=100M). In this case(n=10M, p < 0.01) our ideal choice would be (k=5, m=100M) as it minimizes both the space and time/compute requirements.
A bloom filter, by its very design and requirements, aims to minimize the size of the bit array and the number of hash functions.
The optimal size of the bit array m for a given intended capacity n and false positive probability p can be calculated as:
def m(p, n): return -(math.log(p) * n)/(math.log(2) ** 2)
Now, for a given value of m and n, the minimum number of hash functions required (k) can be calculated as:
def k(m, n): return (m/n) * math.log(2)
If you notice carefully, (m/n) i.e. the number of bits allocated for each value that we want to insert to is directly proportional to k and it becomes necessary that our choice of k be less than (m/n). In addition, k also influences the number of lookups required to check if a value exists.
The time taken for a lookup is O(k), especially since the value we are looking up needs to be hashed k times.
Applying the formula above the optimal values for m and k can be calculated as (k=6, m=95850584) which is close to (k=5, m=100M) seen in the graph.
Most bloom filter libraries do these optimizations by default given a maximum capacity(n) and target error rate(p).
- A bloom filter which is over its capacity would affect the false positive rate.
- From the graph we see the false positives steadily increase as the filter gets overfilled to a point where it is no longer useful for lookups. Setting an arbitrarily large bit array size to solve this issue would only make the bloom filter sparse and slow down the lookups.
- When the number of elements in the filter hits the ceiling, expanding the size of the filter isn’t straightforward rendering them to be very rigid.
- The rigidity can, however, be tackled by adding a new bloom filter whenever (m/n) crosses a certain threshold. Additional elements, when inserted, go directly into the new filter, while lookups would have to go via every single filter in the chain. This leads to multiplicative increases in the overall false positive probability.
- Chaining bloom filters, while solving the scaling problem, would still require tighter constraints on false positive rates to maintain the overall error rate at the desired level, due to a multiplicative effect in the chaining.
For our use we needed the implementation to be:
- Scalable as new elements are added without compromising too much on memory.
- Accessible via a network socket, so that multiple clients in a cluster can use a shared bloom filter.
- Fairly standardized such that it is very easy for any team member to start using it in their micro-services.
After considering varied implementations from using bloom filters in application memory, to bloom filters stored on disk, to pyreBloom (Redis-backed), we settled on bloomd. Back when we adopted it in early 2013, bloomd was still a relatively nascent project(Armon was yet to join Hashicorp as its co-founder!). In fact, we even wrote one of the earliest Perl libraries to interface with bloomd.
bloomd is like Redis for bloom filters - it decouples all the complexity of persisting data structures from the application.
We really like it for the following reasons:
- It has a neat Redis-like API i.e. CREATE, SET, CHECK semantics.
- It has first-class support for scalable bloom filters. Scaling is as simple as setting the scale_size.
- It is blazing fast.
- It is straightforward to Dockerize. Therefore, it fits nicely into our micro-services architecture.
Using a simple, high-performance implementation like bloomd, it is easy to integrate bloom filters in to existing workflows, and reap the benefits of this nifty little data structure.