Any possible ways to implement a "reliable Bloom filter"? - data-structures

I know Bloom filters can help check whether an element is in a set while using considerably less storage than keeping every element in a container, such as std::set, and searching it.
I also understand that Bloom filters are a probabilistic data structure, where the accuracy, i.e. the probability of a false positive, is governed by a known formula. I was wondering whether there is a data structure that is as space-efficient as a Bloom filter, probably with some inevitable compromise in lookup time, but that delivers a 100% certain positive judgment, excluding any chance of false positives.
I checked out cuckoo filters and XOR filters and came to realize that they cannot be the answer either, because they are probabilistic in the same way.
Is there any kind of data structure that satisfies my requirements? Or is such a data structure simply impossible to implement? Are there any directions I could research further, and if so, could you name some keywords?
Sincere thanks for your patience!

Related

Best filter for MANET

I know about the Bloom filter. It is very useful where storage is limited and we only need to check whether an element "definitely does not exist" or "may exist", e.g. on mobile devices or in browser memory.
As in the example given by Tarun.
I would like to know at least two or three filters that are better and faster than the Bloom filter where little storage is required.
I need a filter, or any technique better than a Bloom filter, that can be useful in a mobile ad hoc network (MANET) for storing device IP addresses and identifying address collisions.
Not that much better than a Bloom filter, but you can take a look at cuckoo filters. However, it will be harder for you to find an open-source implementation; here is one in Go.
Citing from the original Cuckoo Filter paper:
Cuckoo filters improve upon Bloom filters in three ways: (1) support for deleting items dynamically; (2) better lookup performance; and (3) better space efficiency for applications requiring low false positive rates (< 3%).
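For illustration, here is a minimal cuckoo filter sketch in Python (partial-key cuckoo hashing with small fingerprints, roughly as described in the paper). The class name, bucket sizes, and hash choices below are my own illustrative assumptions, not taken from the Go implementation mentioned above:

```python
import random
import hashlib

class CuckooFilter:
    def __init__(self, n_buckets=1024, bucket_size=4, max_kicks=500):
        # n_buckets must be a power of two so the XOR index trick below
        # stays in range and is its own inverse.
        self.n_buckets = n_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(n_buckets)]

    def _hash(self, data: bytes) -> int:
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")

    def _fingerprint(self, item: bytes) -> int:
        return (self._hash(item) >> 32) & 0xFF or 1  # short, non-zero fingerprint

    def _indices(self, item: bytes, fp: int):
        i1 = self._hash(item) % self.n_buckets
        i2 = i1 ^ (self._hash(bytes([fp])) % self.n_buckets)  # derived only from i1 and fp
        return i1, i2

    def insert(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both candidate buckets full: evict a random fingerprint and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            victim = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][victim] = self.buckets[i][victim], fp
            i ^= self._hash(bytes([fp])) % self.n_buckets  # evicted fingerprint's other bucket
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is effectively full

    def contains(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        for i in (i1, i2):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

Because only fingerprints are stored, lookups can still return false positives, but deletion works by removing a matching fingerprint from one of an item's two candidate buckets.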

What is the purpose of a Bloomier filter?

This question is about the Bloomier filter, which is not the same as a standard Bloom filter.
I'm learning about the Bloomier filter and I don't see the advantage of using one. As far as I can tell, a Bloomier filter is a generalization of a Bloom filter: it can return the specific items themselves.
However, you could accomplish this by simply using hash tables, and they seem faster and more space-efficient.
Given this, what's the purpose of a Bloomier filter?
There is a talk by Michael Mitzenmacher available here. On slide 41 he mentions the following about the Bloomier filter:
Bloomier filter [Chazelle, Kilian, Rubinfeld, Tal]:
Extend to handle approximate functions.
Each element of set has associated function value.
Non-set elements should return null.
Want to always return correct function value for set elements.
A false positive returns a function value for a non-set element.
You might find the following analogy helpful:
Bloom filter is to set as Bloomier filter is to map.
A Bloom filter can be thought of as a way of storing a compressed version of a set of items in significantly less space than the original set would normally take. The tradeoff involved is that you might have some false positives - items that aren't actually in the set but which the Bloom filter says are in the set.
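To make the compressed-set picture concrete, here is a minimal Bloom filter sketch in Python; the bit-array size, hash count, and hash function are arbitrary illustrative choices, not a tuned configuration:

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=8192, k_hashes=7):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions by personalizing the hash per probe.
        for i in range(self.k):
            d = hashlib.blake2b(item, digest_size=8, person=bytes([i])).digest()
            yield int.from_bytes(d, "little") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means "definitely not added"; True means "probably added"
        # (this is where the false positives come from).
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```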
A Bloomier filter can be thought of as a way of storing a compressed version of a map (or associative array, or dictionary, depending on what terminology you're using). That is, it stores a compressed representation of an association from keys to values. The advantage of the Bloomier filter is that it uses significantly less space than what would normally be required to store the map. The drawback is that the Bloomier filter can have false positives - if you look something up that isn't in the map, the Bloomier filter might accidentally report that it's there and give back a nonsense associated value. (This is a bit different from what you described in your question. In your question, you characterized a Bloomier filter as a Bloom filter that can also hand back the items stored in the filter.)
So to directly answer your question, yes, you absolutely could use a hash table instead of a Bloomier filter. However, doing so would use considerably more space. In most cases, we don't care about that space, but there are applications where space is at a premium and reducing storage would be valuable. As an example, check out the paper Weightless: Lossy Weight Encoding For Deep Neural Network Compression, which uses Bloomier filters to reduce the storage space needed to store large neural networks without too much of a decrease in the quality of the network.
As for how a Bloomier filter works, it's based on a technique that's closely related to that of the XOR filter, and the linked question gives a rough overview of the key techniques.
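The real construction is more involved than a short snippet can show, but the following deliberately simplified fingerprint-table toy (my own illustration, not the Chazelle et al. construction) demonstrates how a compressed map can hand back a nonsense value for a key that was never inserted:

```python
import hashlib

class TinyFingerprintMap:
    """Toy compressed map: store only (fingerprint, value) per slot, not the key.
    Absent keys can collide on both slot and fingerprint and then get a
    nonsense value back, which is the false-positive behavior described above."""

    def __init__(self, n_slots=1024, fp_bits=8):
        self.n_slots = n_slots
        self.fp_mask = (1 << fp_bits) - 1
        self.slots = [None] * n_slots  # each slot: (fingerprint, value) or None

    def _slot_and_fp(self, key: bytes):
        h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "little")
        return h % self.n_slots, (h >> 32) & self.fp_mask

    def put(self, key: bytes, value):
        slot, fp = self._slot_and_fp(key)
        # Real constructions resolve collisions between inserted keys; this toy just overwrites.
        self.slots[slot] = (fp, value)

    def get(self, key: bytes):
        slot, fp = self._slot_and_fp(key)
        entry = self.slots[slot]
        if entry is not None and entry[0] == fp:
            return entry[1]  # may be a false positive for a key that was never inserted
        return None
```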
One example of the use of a Bloomier filter is in an LSM tree. The Bloomier filter can store a map from each key to the run it is a member of, and this can fit in memory much more easily than the full LSM tree, assuming the values are large. This reduces lookup time substantially, and industry LSM trees like LevelDB and RocksDB do use Bloom-filter-like structures to help reduce lookup time.

What are probabilistic data structures?

I have read about "probabilistic" data structures like bloom filters and skip lists.
What are the common characteristics of probabilistic data structures and what are they used for?
There are probably a lot of different (and good) answers, but in my humble opinion, the common characteristic of probabilistic data structures is that they provide you with an approximate, not precise, answer.
How many items are here?
About 1523425, with 99% probability.
Update:
A quick search produced a link to a decent article on the topic:
https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
If you are interested in probabilistic data structures, you might want to read my recently published book "Probabilistic Data Structures and Algorithms for Big Data Applications" (ISBN: 9783748190486, available at Amazon), where I explain many such space-efficient data structures and fast algorithms that are extremely useful in modern Big Data applications.
In this book, you can find state-of-the-art algorithms and data structures that help to handle such common problems in Big Data processing as:
Membership querying (Bloom filter, Counting Bloom filter, Quotient filter, Cuckoo filter).
Cardinality (Linear counting, probabilistic counting, LogLog, HyperLogLog, HyperLogLog++).
Frequency (Majority algorithm, Frequent, Count Sketch, Count-Min Sketch).
Rank (Random sampling, q-digest, t-digest).
Similarity (LSH, MinHash, SimHash).
You can get a free preview and all related information about the book at https://pdsa.gakhov.com
Probabilistic data structures can't give you a definite answer; instead, they provide a reasonable approximation of the answer and a way to estimate how accurate that approximation is. They are extremely useful for big data and streaming applications because they can dramatically decrease the amount of memory needed (in comparison to data structures that give you exact answers).
In the majority of cases these data structures use hash functions to randomize the items. Because they ignore collisions, they keep their size constant, but this is also the reason why they can't give you exact values. The advantages they bring:
they use a small amount of memory (you can control how much)
they can easily be parallelized (hashes are independent)
they have constant query time (not even amortized constant, as in a dictionary)
Frequently used probabilistic data structures are:
Bloom filters
Count-Min sketch (sketched below)
HyperLogLog
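As one concrete example from the list above, here is a minimal Count-Min sketch in Python; width, depth, and the hash function are illustrative assumptions:

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        # Fixed memory: width * depth integer counters, regardless of how many items are added.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: bytes, row: int) -> int:
        # One independent hash per row, obtained by salting the same hash function.
        digest = hashlib.blake2b(item, digest_size=8, salt=bytes([row])).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: bytes, count: int = 1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: bytes) -> int:
        # True count <= estimate: collisions only inflate counters, never reduce them.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))
```

Updates and queries each touch exactly `depth` counters, so both are constant time, and estimates can only over-count, never under-count.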
There is a list of probabilistic data structures on Wikipedia for reference:
https://en.wikipedia.org/wiki/Category:Probabilistic_data_structures
There are different definitions of what a "probabilistic data structure" is. IMHO, a probabilistic data structure is one that uses some randomized algorithm or takes advantage of some probabilistic characteristics internally, but it doesn't have to behave probabilistically or non-deterministically from the user's perspective.
There are many "probabilistic data structures" with probabilistically
behavior such as the bloom filter and HyperLogLog mentioned
by the other answers.
At the same time, there are other "probabilistic data structures"
with determined behavior (from a user's perspective) such as skip
list. For skip list, users can use it similarly as a balanced binary search tree but is implemented with some probability related idea internally. And according to skip list's author William Pugh:
Skip lists are a probabilistic data structure that seem likely to supplant balanced trees as the implementation method of choice for many applications. Skip list algorithms have the same asymptotic expected time bounds as balanced trees and are simpler, faster and use less space.
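To make the "probabilistic internally, deterministic for the user" point concrete, here is a minimal skip list sketch for integer keys; the maximum level and the promotion probability are arbitrary illustrative choices:

```python
import random

class SkipListNode:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)

class SkipList:
    MAX_LEVEL = 16
    P = 0.5  # probability of promoting a node one level up

    def __init__(self):
        self.header = SkipListNode(None, self.MAX_LEVEL)
        self.level = 0

    def _random_level(self):
        # This coin flipping is the only probabilistic part; lookups are exact.
        lvl = 0
        while random.random() < self.P and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [None] * (self.MAX_LEVEL + 1)
        node = self.header
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        if lvl > self.level:
            for i in range(self.level + 1, lvl + 1):
                update[i] = self.header
            self.level = lvl
        new_node = SkipListNode(key, lvl)
        for i in range(lvl + 1):
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node

    def contains(self, key):
        node = self.header
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

# Usage: sl = SkipList(); sl.insert(3); sl.insert(7); sl.contains(7)  -> True
```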
Probabilistic data structures allow for constant memory usage and extremely fast processing while still maintaining a low error rate with a specified degree of uncertainty.
Some use cases are:
Checking the presence of a value in a data set
Frequency of events
Estimating the approximate size of a data set
Ranking and grouping

Are different salted hashes equivalent to different hashing algorithms for a Bloom filter?

As your data set gets larger, you need more hash functions to keep the false positive rate low, say at 1%.
If I want my Bloom filter to grow dynamically at run time, I don't know in advance how many hash functions I will need. If I use the same hasher (say MD5), but with randomly generated salts appended to the value before hashing, will this have the same effect as using different hashers (say MD5, SHA1, etc.)?
I use .NET C# for reference, but the language is almost irrelevant for this question.
MD5 is a pretty expensive way to generate hashes for a Bloom filter. You probably want to use something that executes a bit faster such as a Jenkins hash or one of its variants, or something along these lines.
As you've noted, the Bloom filter requires a lot of hash functions. Coming up with 17 unique hash functions is difficult at best. Fortunately, there's a way to avoid having to do that. I used the technique described in the paper Less Hashing, Same Performance: Building a Better Bloom Filter. This turned out to be very easy in C#, and the performance was very good.
The math in the paper can be a bit hard to follow, but you can get the gist of it fairly easily. And the paper describes a couple of different ways to generate multiple hash code values simply and quickly.
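The answer above does this in C#; here is the same double-hashing idea sketched in Python instead, i.e. combining two base hashes as g_i(x) = h1(x) + i*h2(x) mod m, as described in the Kirsch and Mitzenmacher paper (MD5 appears here only because the question mentions it; any 128-bit hash would do):

```python
import hashlib

def bloom_indices(item: bytes, k: int, m: int):
    """Derive k bit positions from a single digest via g_i(x) = h1(x) + i*h2(x) mod m."""
    digest = hashlib.md5(item).digest()
    h1 = int.from_bytes(digest[:8], "little")
    h2 = int.from_bytes(digest[8:], "little") | 1  # odd h2 stays co-prime with a power-of-two m
    return [(h1 + i * h2) % m for i in range(k)]
```

Only one underlying hash computation is needed per item, and each of the k derived values can be used as a bit index into the filter.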
Also, Bloom filters aren't in general easy to size dynamically. If you want the Bloom filter to grow, you have to specifically build a scalable Bloom filter that supports it. A Google search on [scalable bloom filter] will provide a number of references and some code samples.
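As a rough sketch of that idea (class names and parameters here are my own illustrative assumptions; see the Almeida et al. "Scalable Bloom Filters" paper for how to choose them properly), a scalable filter can be a chain of fixed-size filters, each new one larger and with a tighter per-filter error rate:

```python
import hashlib
import math

class SimpleBloom:
    def __init__(self, n_items: int, fp_rate: float):
        # Standard sizing: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hashes.
        self.m = max(8, int(-n_items * math.log(fp_rate) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)
        self.capacity = n_items
        self.count = 0

    def _positions(self, item: bytes):
        for i in range(self.k):
            d = hashlib.blake2b(item, digest_size=8, person=bytes([i])).digest()
            yield int.from_bytes(d, "little") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
        self.count += 1

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

class ScalableBloom:
    def __init__(self, initial_capacity=1000, fp_rate=0.01, growth=2, tightening=0.5):
        self.fp_rate = fp_rate
        self.growth = growth
        self.tightening = tightening
        self.filters = [SimpleBloom(initial_capacity, fp_rate)]

    def add(self, item: bytes):
        current = self.filters[-1]
        if current.count >= current.capacity:
            # Start a new, larger filter with a tighter per-filter error rate.
            current = SimpleBloom(current.capacity * self.growth,
                                  self.fp_rate * self.tightening ** len(self.filters))
            self.filters.append(current)
        current.add(item)

    def might_contain(self, item: bytes) -> bool:
        return any(f.might_contain(item) for f in self.filters)
```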

Any good nearest-neighbors algorithm for similar images?

I am looking for an algorithm that can search for similar images in a large collection.
I'm currently using a SURF implementation in OpenCL.
At first I used the KNN search algorithm to compare every image's interest points against the rest of the collection, but tests revealed that it doesn't scale well. I've also tried a Hadoop implementation of KNN-Join, which takes a lot of temporary space in HDFS, way too much compared to the amount of input data. In fact, the pairwise-distance approach isn't really appropriate because of the dimensionality of my input vectors (64).
I heard of Locality-Sensitive Hashing and wondered if there was any free implementation, or if it's worth implementing it myself; maybe there's another algorithm I am not aware of?
IIRC the FLANN library is a good compromise:
http://people.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
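FLANN itself selects among algorithms such as randomized k-d trees and hierarchical k-means, so as a separate illustration of the locality-sensitive hashing idea from the question, here is a minimal random-hyperplane (cosine) LSH sketch for 64-dimensional descriptors; all names and parameters are illustrative assumptions:

```python
import numpy as np

def make_hyperplane_hasher(dim=64, n_bits=16, seed=0):
    # Each bit is the sign of a dot product with a random direction;
    # vectors that are close in cosine distance tend to share most bits.
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, dim))
    def signature(v):
        return tuple(bool(b) for b in (planes @ v) > 0)
    return signature

# Usage: bucket descriptors by signature, then only compare within buckets
# instead of doing a full pairwise search.
# hasher = make_hyperplane_hasher()
# buckets = {}
# for image_id, desc in descriptors:          # desc: length-64 numpy array
#     buckets.setdefault(hasher(desc), []).append(image_id)
```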
