One quick question regarding Bloom filters:
If I allocate the Bloom filter to be exactly the same size as the number of elements to be inserted, and also use unique hash functions, can I ensure that it won't produce any false positives?
Note that in my case I know the number of elements to be inserted well in advance, before the Bloom filter is created.
thanks
Prabu
Yes, you can: create a hash function that does a 1:1 mapping. But in that case there is no point in using a Bloom filter. The whole point of a Bloom filter is to save space.
Related
Is there an algorithm similar to the Bloom filter that allows you to:
Compactly represent two (large) sets independently of each other and
probabilistically check for disjointness between them using their lossy-compressed representations?
In the special case where one set contains only a single element and you don't compress it, this problem reduces to probabilistic set membership, for which one can consider the Bloom filter.
OK, so here's a terrible answer just to get the discussion going:
You could compress both sets as Bloom filters. Choose a very large number of random samples from the universe that both sets are drawn from.
For each sample you pick, look it up in both Bloom filters.
If both indicate "probable member", then you have found a "probable intersection" and the sets are "probably overlapping". Otherwise you declare that the sets are "probably not overlapping".
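Here is a rough sketch of that sampling idea, just to make it concrete. The `BloomFilter` class, its parameters, and the sample count are illustrative assumptions rather than a tuned implementation; any real Bloom filter library could stand in for it.

```python
import hashlib
import random

class BloomFilter:
    """Tiny illustrative Bloom filter; any real implementation would do."""
    def __init__(self, num_bits=1 << 16, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from a single digest (double-hashing style).
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

def probably_overlapping(filter_a, filter_b, universe, num_samples=100_000):
    """Sample the universe (a sequence); if any sample is a 'probable member'
    of both filters, report a probable intersection."""
    samples = random.sample(universe, min(num_samples, len(universe)))
    for item in samples:
        if item in filter_a and item in filter_b:
            return True      # "probably overlapping"
    return False             # "probably not overlapping"
```

Note that this only catches intersections that the random samples happen to hit, so for small intersections the number of samples has to be very large, which is exactly why it is a "terrible answer".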
The Apache datasketches library has interesting functionality:
https://datasketches.apache.org/docs/DistinctCountFeaturesMatrix.html
"Theta sketches" support intersection, and count operators.
Suppose we have to build a Bloom filter with 10^12 buckets on one machine with 32 GB RAM and a hard drive. Assume the keys are small and already on the hard drive. How could we build it in an efficient way?
My guess is to split the Bloom filter into 4 parts (125 GB / 4 fits into 32 GB). Then pass over the data 4 times, each time hashing and updating the corresponding slice in memory. Finally, concatenate the 4 slices to get the complete Bloom filter. Is this correct?
Why do you need such a big filter? Are you deliberately overestimating its size in order to handle unbounded data, e.g. from a streaming source? If so, read about the Stable Bloom filter and the Scalable Bloom filter; both are better suited to that type of data than the classical Bloom filter.
To answer your question: if you split the filter, what you describe should work. But make sure you handle the indexes correctly. If, for instance, a bit vector of 4 elements is split across 2 nodes, the first is responsible for indexes (0, 1) and the second for (2, 3). You will probably complicate this a little and store somewhere the mapping of which range is stored on which node, and modify both the read and the write path accordingly.
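A minimal sketch of that multi-pass, sliced build, assuming one bit per bucket, a SHA-256-based double-hashing scheme, and a simple on-disk key file; all of these are illustrative choices, not requirements.

```python
import hashlib

NUM_BITS = 10**12        # total filter size in bits (~125 GB)
NUM_SLICES = 4           # each slice is ~31.25 GB, which fits in 32 GB of RAM
NUM_HASHES = 7
SLICE_BITS = NUM_BITS // NUM_SLICES

def bit_positions(key, num_bits=NUM_BITS, num_hashes=NUM_HASHES):
    digest = hashlib.sha256(key).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]

def iter_keys_from_disk(path):
    # Placeholder: stream the small keys already stored on the hard drive.
    with open(path, "rb") as f:
        for line in f:
            yield line.rstrip(b"\n")

def build_sliced_filter(path, out_prefix):
    for slice_idx in range(NUM_SLICES):
        lo = slice_idx * SLICE_BITS            # this slice owns bits [lo, hi)
        hi = lo + SLICE_BITS
        slice_bits = bytearray(SLICE_BITS // 8)
        for key in iter_keys_from_disk(path):  # one full pass over the data per slice
            for pos in bit_positions(key):
                if lo <= pos < hi:             # only set bits owned by this slice
                    local = pos - lo
                    slice_bits[local // 8] |= 1 << (local % 8)
        # Concatenating the slice files in order yields the complete filter.
        with open(f"{out_prefix}.part{slice_idx}", "wb") as out:
            out.write(slice_bits)
```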
You can also look for an existing implementation of such a distributed Bloom filter. It may raise further points to consider, or, instead of developing your solution from scratch, let you quickly test how it behaves with your data pipeline.
In any case, it would be great if you could leave short feedback here about how you handled the problem and whether you ultimately chose another solution.
I am trying to create a data structure for a fixed size set that should support the following operations:
Query whether an element is in the set (false positives are ok, false negatives are not)
Replace one element of the set with another element
In my case, the size of the set is likely to be very small (4-16 elements), but lookups must be as fast as possible and read as few bits as possible. It also needs to be space efficient. Replacements (i.e. operation 2) are likely to be rare. I looked into the following options:
Bloom filters: This is the standard solution. However, it is difficult to delete elements, and as such difficult to implement operation 2.
Counting Bloom filters: The space requirement becomes much higher (~3-4x that of the standard Bloom filter) for no decrease in the false positive rate.
Simply storing a list of hashes of all the elements: Gives a better false positive rate than a counting Bloom filter for similar space requirements, but is expensive to look up (in the worst case all bits will be read).
The previous idea with perfect hashing for the location: I don't know of fast perfect hashes for small sets of elements.
Additional Information:
The elements are 64-bit numbers.
Any ideas on how to solve this?
The cuckoo filter is an option that should be considered. To quote the authors' abstract:
Cuckoo Filter: Practically Better Than Bloom
We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. For applications that store many items and
target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.
https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf
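For illustration, here is a minimal cuckoo filter sketch using partial-key cuckoo hashing (fingerprints stored in buckets of 4, with the alternate bucket derived from the fingerprint). It is not the paper's implementation; the sizes and the fingerprint scheme are assumptions chosen for brevity.

```python
import hashlib
import random

class CuckooFilter:
    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        # num_buckets should be a power of two so the XOR-based alternate
        # index below is its own inverse.
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        return digest[0] | 1   # short (8-bit) fingerprint, never zero

    def _index(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        return int.from_bytes(digest[1:9], "big") % self.num_buckets

    def _alt_index(self, index, fp):
        # Partial-key cuckoo hashing: the alternate bucket depends only on the
        # current index and the fingerprint, so it can be recomputed on delete.
        h = int.from_bytes(hashlib.sha256(bytes([fp])).digest()[:8], "big")
        return (index ^ h) % self.num_buckets

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: evict a random fingerprint and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            victim = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][victim] = self.buckets[i][victim], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False   # filter is considered full

    def contains(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

Deleting an item removes one copy of its fingerprint, which is what makes operation 2 (replace) straightforward: delete the old element, insert the new one.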
Well, note the following:
Using a standard hash table with a decent hash function (since the elements are numbers, there are plenty of standard hash functions) and 4|S| entries will require on average fewer than 2 lookups (assuming unbiased numbers as input), though it might deteriorate to a terrible worst case of 4|S|. Of course you can bound it as follows:
- If the number of cells searched exceeds k, abort and return true (this causes false positives with some probability that you can calculate, and gives faster worst-case performance).
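A small sketch of that bounded-probe table; the 4|S| table size and the probe limit k come from the description above, while the rest (linear probing, the specific integer hash) are illustrative assumptions.

```python
EMPTY = None

class BoundedProbeSet:
    def __init__(self, capacity, k):
        self.size = 4 * capacity          # 4|S| slots
        self.k = k                        # probe limit
        self.slots = [EMPTY] * self.size

    def _start(self, x):
        # The elements are 64-bit numbers; any decent integer hash works here.
        return ((x * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF) % self.size

    def add(self, x):
        i = self._start(x)
        while self.slots[i] is not EMPTY and self.slots[i] != x:
            i = (i + 1) % self.size       # linear probing
        self.slots[i] = x

    def contains(self, x):
        i = self._start(x)
        for _ in range(self.k):           # abort after k probes
            if self.slots[i] is EMPTY:
                return False
            if self.slots[i] == x:
                return True
            i = (i + 1) % self.size
        # "Probably present": may be a false positive, but the worst case is
        # bounded at k probes.
        return True
```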
Regarding counting Bloom filters - this is the way to do it, IMO. Note that a standard Bloom filter requires 154 bits for a false positive probability of 1%, or 100 bits for a false positive probability of 5% (for n = 16 elements, the upper end of your range). (*)
So, if you need 4 times this number, you get 616 bits / 400 bits. Note that on most modern machines this is small enough to fit in a couple of CPU cache lines, which means (depending on the machine) that reading all these bits could take less than 10 cycles.
IMO you cannot do anything to beat it without accepting a much higher FP rate.
(*) Calculated according to:
m = -n ln(p) / (ln 2)^2
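For completeness, a quick numeric check of that formula (assuming n = 16, the upper end of the question's range):

```python
import math

def bloom_bits(n, p):
    # m = -n * ln(p) / (ln 2)^2
    return math.ceil(-n * math.log(p) / math.log(2) ** 2)

for p in (0.01, 0.05):
    m = bloom_bits(16, p)
    k = round(m / 16 * math.log(2))   # near-optimal number of hash functions
    print(f"p = {p}: m = {m} bits, k = {k} hashes")
# p = 0.01: m = 154 bits, k = 7 hashes
# p = 0.05: m = 100 bits, k = 4 hashes
```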
P.S. If you can guarantee that each element is removed at most once, you can use a variation of the Bloom filter with double the space that has a slightly better FP rate but also some false negatives: simply use 2 Bloom filters, one for inserted elements and one for deleted elements. An element is in the set if it is in the "regular" filter and NOT in the "deleted" filter.
This improves the FP rate at the expense of also introducing FNs.
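A minimal sketch of that two-filter scheme; the tiny Bloom class inside is illustrative, not tuned.

```python
import hashlib

class TinyBloom:
    def __init__(self, num_bits=128, num_hashes=4):
        self.m, self.k, self.bits = num_bits, num_hashes, 0   # int used as a bitset

    def _positions(self, item):
        d = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

class DeletableBloom:
    """Double the space: one filter for inserted keys, one for deleted keys.
    Only valid if each element is removed at most once after being added."""
    def __init__(self):
        self.regular = TinyBloom()
        self.deleted = TinyBloom()

    def add(self, item):
        self.regular.add(item)

    def remove(self, item):
        self.deleted.add(item)

    def __contains__(self, item):
        # In the set iff (probably) inserted and NOT (probably) deleted.
        # False positives in `deleted` are the source of the false negatives
        # mentioned above.
        return item in self.regular and item not in self.deleted
```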
Check out succinct data structures, for example Membership in Constant Time and Minimum Space.
"There are many situations in dealing with a subset chosen from the bounded universe, in which the size of the subset is relatively big but not big enough to use a bit map."
I wanted to ask: does the order in which I apply filters matter in image processing? If I apply a median filter first and then some low-pass filter, will the result be different from applying the low-pass filter first and then the median filter?
How can we explain this conceptually?
Yes, the results will be different. This is because a median filter is not an LTI system, and thus the operations cannot be arbitrarily reordered.
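A tiny numeric illustration (assuming NumPy and SciPy are available): a 1-D signal with a single spike shows that median-then-average differs from average-then-median.

```python
import numpy as np
from scipy.ndimage import median_filter, uniform_filter

signal = np.array([0., 0., 0., 9., 0., 0., 0.])   # single impulse ("salt" noise)

median_then_lowpass = uniform_filter(median_filter(signal, size=3), size=3)
lowpass_then_median = median_filter(uniform_filter(signal, size=3), size=3)

print(median_then_lowpass)   # the median removes the spike first -> all zeros
print(lowpass_then_median)   # the spike is smeared before the median sees it
print(np.allclose(median_then_lowpass, lowpass_then_median))   # False
```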
Since they fill up and the percentage of false positives increases, what are some of the techniques used to keep them from saturating? It seems like you cannot clear bits, since that would cause an immediate false negative for data stored in that node.
Even if you have a set of known size, in a data store that uses Bloom filters, like Cassandra, what confuses me is that the data in a node is going to be added and removed, right? But when you remove a key you cannot set its Bloom filter buckets to 0, since that might create a false negative for other data in the node that hashes to one or more of the same buckets as the removed key. So over time, it is as if the filter fills up.
I think you need to set an upper bound on the size of the set that the bloom filter covers. If the set exceeds that size, you need to recalculate the bloom filter.
As used in Cassandra, the size of the set covered by the Bloom filter is known before the filter is created, so this is not an issue.
Another approach is Scalable Bloom Filters.
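A rough sketch of the "upper bound, then rebuild" idea: track how many keys have been added and rebuild the filter from the backing store once the bound is exceeded. The backing-store callback and the bits-per-key constant are illustrative assumptions.

```python
import hashlib

class Bloom:
    def __init__(self, num_bits, num_hashes=7):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):          # key is bytes
        d = hashlib.sha256(key).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(key))

class RebuildingBloom:
    BITS_PER_KEY = 10   # roughly a 1% false positive rate

    def __init__(self, capacity, live_keys):
        self.capacity = capacity
        self.live_keys = live_keys          # callable yielding the current live keys
        self.count = 0
        self.filter = Bloom(capacity * self.BITS_PER_KEY)

    def add(self, key):
        self.filter.add(key)
        self.count += 1
        if self.count > self.capacity:      # upper bound exceeded: rebuild
            self.capacity *= 2
            self.filter = Bloom(self.capacity * self.BITS_PER_KEY)
            self.count = 0
            for k in self.live_keys():      # deleted keys simply drop out here
                self.filter.add(k)
                self.count += 1

    def __contains__(self, key):
        return key in self.filter
```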
The first thing you should realize is that Bloom filters are only additive. There are some approaches to approximate deletion:
- Rewriting the Bloom filter
  - You have to keep the old data
  - You pay a performance price
- A negative Bloom filter
  - Much cheaper than the above; also helps deal with false positives if you can detect them.
- Counting Bloom filters (decrement the count)
- Buckets
  - Keep multiple categorized Bloom filters, discarding a category when it is no longer needed (e.g. 'Tuesday', 'Wednesday', 'Thursday', ...)
- Others?
If you have time-limited data, it may be efficient to use buckets, and discard filters that are too old.
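A short sketch of that bucketed approach: one small filter per time bucket, dropped wholesale when the bucket expires. To keep the sketch short, plain Python sets stand in for the per-bucket Bloom filters, and the bucket keys are illustrative.

```python
from collections import OrderedDict

class BucketedFilter:
    def __init__(self, max_buckets=7):
        self.max_buckets = max_buckets
        self.buckets = OrderedDict()        # bucket key -> per-bucket filter

    def add(self, bucket_key, item):
        # e.g. bucket_key = "Tuesday" or "2024-05-21"; a set stands in for a
        # Bloom filter here.
        filt = self.buckets.setdefault(bucket_key, set())
        filt.add(item)
        while len(self.buckets) > self.max_buckets:
            self.buckets.popitem(last=False)   # discard the oldest bucket whole

    def __contains__(self, item):
        # The item "might be present" if any live bucket reports it.
        return any(item in filt for filt in self.buckets.values())
```

Discarding a whole bucket removes all of its keys at once, which is how this scheme sidesteps the "cannot clear individual bits" problem discussed above.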