The Bloom filter was proposed by Burton Howard Bloom in 1970. It is essentially a very long binary vector plus a series of random mapping functions, and it is mainly used to determine whether an element is in a set.
We often run into business scenarios where we need to determine whether an element is in a collection. The usual approach is to store all the elements of the collection and then check by comparison; linked lists, trees, and hash tables all follow this idea. But as the number of elements grows, the storage space we need grows linearly and eventually hits a bottleneck. Retrieval also gets slower and slower: the lookup time complexity of these three structures is O(n), O(log n), and O(1) respectively.
At this point, the Bloom Filter was born.
Before looking at how a Bloom filter works, let's first review how a hash function works.
A hash function is a function that converts input data of any size into output data of a fixed size; the converted data is called a hash value or hash code. Here is a schematic diagram:
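As a quick illustration (a minimal sketch of my own, using only the JDK's MessageDigest), inputs of very different sizes all map to a digest of the same fixed size:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class HashDemo {
    public static void main(String[] args) throws Exception {
        // MD5 always produces a 16-byte (128-bit) digest,
        // no matter how long the input is.
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (String input : new String[]{"a", "hello bloom filter", "a".repeat(10_000)}) {
            byte[] digest = md5.digest(input.getBytes(StandardCharsets.UTF_8));
            System.out.printf("input length = %5d, digest length = %d bytes%n",
                    input.length(), digest.length);
        }
    }
}
```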
All hash functions have the following basic characteristics:
However, when a hash table is used to store a large amount of data, its space efficiency is still low, and with only one hash function, hash collisions are very likely to occur.
A Bloom filter is composed of a fixed-length binary vector (a bitmap) and a series of mapping functions.
In the initial state, for a bit array of length m, all its bits are set to 0, as shown below:
*(figure: a bit array of length m with all bits initialized to 0)*
When a variable is added to the set, K mapping functions map it to K points in the bitmap and set those points to 1 (suppose there are two variables, each passed through 3 mapping functions).
When we query a variable, we only need to check whether all of its mapped points are 1. If any point is 0, the variable is definitely not in the set; if all points are 1, the variable is very probably in the set.
Why only probable, rather than certain? Because the mapping functions are themselves hash functions, and hash functions can collide.
A Bloom filter's false positives occur because multiple inputs can hash to the same bit and set it to 1, so there is no way to tell which input actually set it. In other words, the root cause of false positives is that the same bit may be mapped and set to 1 multiple times.
This is also why deletion is a problem for Bloom filters: the bits of a Bloom filter are not exclusive to a single element, and it is very likely that multiple elements share the same bit position (e.g., bit 3 in the figure above). If we simply clear such a bit, we affect all the other elements that map to it.
Bloom filters have a huge advantage over other data structures in terms of space and time: storage space and insertion/query time are both constant, O(k). In addition, the hash functions are independent of one another, which makes them easy to parallelize in hardware. A Bloom filter also does not need to store the elements themselves, which is an advantage when confidentiality is strict.
A Bloom filter can also represent the full set in a way that no other data structure can.
However, the disadvantages of Bloom filters are just as obvious as the advantages, and the false positive rate is one of them: as more elements are inserted, the false positive rate rises. Of course, if the number of elements is small, a plain hash table is sufficient.
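For reference, the standard approximation for the false positive rate of a filter with m bits, k hash functions, and n inserted elements is p ≈ (1 − e^(−kn/m))^k; for a given m and n, it is minimized by choosing roughly k = (m/n) · ln 2 hash functions.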
Additionally, elements generally cannot be removed from a Bloom filter. It is tempting to turn the bit array into an array of counters, incrementing a counter for each inserted element and decrementing it on deletion (a counting Bloom filter). However, deleting elements safely is not that simple: first we must be sure the element is actually in the filter, which the filter alone cannot guarantee, and counter overflow can also cause problems.
There has been quite a bit of work on reducing the false positive rate, which has produced a number of Bloom filter variants.
In the programming world, Bloom filters are a great tool for quickly solving some of the trickier problems in a project, such as web page URL de-duplication, spam identification, detecting duplicate elements in large data sets, and preventing cache penetration.
Typical applications of Bloom filters are:
Now that we understand how a Bloom filter works and where it is used, we can implement a simple one.
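Here is a minimal sketch of such an implementation (the class name, bit-array size, and seeds are all arbitrary choices of mine): a BitSet of m bits plus k seeded hash functions, following the add/query procedure described above.

```java
import java.util.BitSet;

/** A minimal Bloom filter sketch: a BitSet of size m plus k seeded hash functions. */
public class SimpleBloomFilter {

    private static final int BIT_SIZE = 1 << 24;                // m: size of the bit array
    private static final int[] SEEDS = {7, 11, 13, 31, 37, 61}; // one seed per hash function (k = 6)

    private final BitSet bits = new BitSet(BIT_SIZE);

    /** Map the value to k positions and set each of them to 1. */
    public void add(String value) {
        for (int seed : SEEDS) {
            bits.set(hash(value, seed));
        }
    }

    /** If any of the k positions is 0 the value is definitely absent; otherwise it is probably present. */
    public boolean mightContain(String value) {
        for (int seed : SEEDS) {
            if (!bits.get(hash(value, seed))) {
                return false;
            }
        }
        return true;
    }

    /** A simple seeded string hash, reduced to the range [0, BIT_SIZE). */
    private int hash(String value, int seed) {
        int h = 0;
        for (int i = 0; i < value.length(); i++) {
            h = seed * h + value.charAt(i);
        }
        return (h & 0x7FFFFFFF) % BIT_SIZE;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter();
        filter.add("https://example.com/page1");
        System.out.println(filter.mightContain("https://example.com/page1")); // true
        System.out.println(filter.mightContain("https://example.com/page2")); // false (with high probability)
    }
}
```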
In a distributed environment, the Bloom filter also needs to be a shared resource, which naturally brings Redis to mind; and indeed, Redis implements a Bloom filter.
Of course, we can also write the Bloom filter to a file via bloomFilter.writeTo() and put it into an object store such as OSS or S3.
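The writeTo() call suggests Guava's com.google.common.hash.BloomFilter; assuming that is the implementation in use, saving and restoring it might look like this sketch:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;

public class GuavaBloomFilterDemo {
    public static void main(String[] args) throws Exception {
        // Create a filter expecting ~1,000,000 insertions with a 1% false positive rate.
        BloomFilter<String> bloomFilter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);
        bloomFilter.put("https://example.com/page1");

        // Serialize the filter to a local file (which could then be uploaded to OSS/S3).
        try (FileOutputStream out = new FileOutputStream("bloom.bin")) {
            bloomFilter.writeTo(out);
        }

        // Later (or on another machine), read it back.
        try (FileInputStream in = new FileInputStream("bloom.bin")) {
            BloomFilter<String> restored = BloomFilter.readFrom(
                    in, Funnels.stringFunnel(StandardCharsets.UTF_8));
            System.out.println(restored.mightContain("https://example.com/page1")); // true
        }
    }
}
```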
Redis itself provides bitmaps, which can be used to implement a Bloom filter, but you have to design the mapping functions and handle many details yourself, which is no different from building a custom one.
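For example, a rough sketch with the Jedis client and Redis's SETBIT/GETBIT commands could look like the following (key name, bit size, and seeds are arbitrary); notice how much of the logic we still have to write ourselves:

```java
import redis.clients.jedis.Jedis;

/** A hand-rolled Bloom filter on top of a Redis bitmap via SETBIT/GETBIT. */
public class RedisBitmapBloomFilter {

    private static final long BIT_SIZE = 1L << 24;
    private static final int[] SEEDS = {7, 11, 13, 31, 37, 61};

    private final Jedis jedis = new Jedis("127.0.0.1", 6379);
    private final String key = "bloom:urls";

    public void add(String value) {
        for (int seed : SEEDS) {
            jedis.setbit(key, offset(value, seed), true);
        }
    }

    public boolean mightContain(String value) {
        for (int seed : SEEDS) {
            if (!jedis.getbit(key, offset(value, seed))) {
                return false;
            }
        }
        return true;
    }

    /** Seeded string hash reduced to a bit offset in [0, BIT_SIZE). */
    private long offset(String value, int seed) {
        long h = 0;
        for (int i = 0; i < value.length(); i++) {
            h = seed * h + value.charAt(i);
        }
        return (h & Long.MAX_VALUE) % BIT_SIZE;
    }
}
```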
The official Redis Bloom filter did not make its debut until Redis 4.0, which made it available as a module. RedisBloom is loaded into the Redis server as a module, giving Redis powerful Bloom-filter de-duplication capabilities.
With Redis already installed, there are two ways to install RedisBloom:
Install by compiling directly
Install using Docker
The basic Bloom filter commands are bf.add, which adds an element, and bf.exists, which checks whether an element may exist.
With only these few elements added, there will certainly be no false positives; as the number of elements gradually increases, a certain amount of false positives will appear, so we will not run that experiment here.
The Bloom filter used above is just one with default parameters; it is created automatically the first time we call add.
Redis also lets us create a Bloom filter with custom parameters: bf.reserve filter_name error_rate initial_size.
However, this must be done explicitly before the first add; if the corresponding key already exists, bf.reserve reports an error.
I'm a Javaer, so of course I want to use this from Java. There are many Redis clients for Java, and some of them do not yet provide a command extension mechanism, but I know that Redisson and Lettuce can be used with Bloom filters.

Redisson
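Redisson exposes an RBloomFilter object, which as far as I know is implemented on top of Redis bitmaps rather than the RedisBloom module; a minimal usage sketch (the address and filter name are placeholders) looks like this:

```java
import org.redisson.Redisson;
import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class RedissonBloomFilterDemo {
    public static void main(String[] args) {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://127.0.0.1:6379");
        RedissonClient redisson = Redisson.create(config);

        RBloomFilter<String> bloomFilter = redisson.getBloomFilter("url:bloom");
        // Expected number of insertions and desired false positive probability.
        bloomFilter.tryInit(1_000_000L, 0.01);

        bloomFilter.add("https://example.com/page1");
        System.out.println(bloomFilter.contains("https://example.com/page1")); // true
        System.out.println(bloomFilter.contains("https://example.com/page2")); // false (probably)

        redisson.shutdown();
    }
}
```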
To solve the problem that a Bloom filter cannot delete elements, the Cuckoo filter was proposed. The authors of the paper "Cuckoo Filter: Better Than Bloom" provide an in-depth comparison of the Cuckoo filter and the Bloom filter. Compared with the Cuckoo filter, the Bloom filter has the following shortcomings: weaker query performance, lower space efficiency, no support for the reverse operation (deletion), and no support for counting.