What is the purpose of a Bloomier filter? - data-structures

This question is about the Bloomier filter, which is not the same as a standard Bloom filter.
I'm learning about the Bloomier filter and I don't see the advantage of using one. As far as I understand, a Bloomier filter is a generalization of a Bloom filter: it can return the specific items themselves.
However, you could accomplish this by simply using hash tables, and they seem faster and more space-efficient.
Given this, what's the purpose of a Bloomier filter?

There is a talk by Michael Mitzenmacher available here. On slide 41 he mentions the following about the Bloomier filter:
Bloomier filter [Chazelle, Kilian, Rubinfeld, Tal]:
Extend to handle approximate functions.
Each element of set has associated function value.
Non-set elements should return null.
Want to always return correct function value for set elements.
A false positive returns a function value for a non-set element.

You might find the following analogy helpful:
Bloom filter is to set as Bloomier filter is to map.
A Bloom filter can be thought of as a way of storing a compressed version of a set of items in significantly less space than the original set would normally take. The tradeoff involved is that you might have some false positives - items that aren't actually in the set but which the Bloom filter says are in the set.
A Bloomier filter can be thought of as a way of storing a compressed version of a map (or associative array, or dictionary, depending on what terminology you're using). That is, it stores a compressed representation of an association from keys to values. The advantage of the Bloomier filter is that it uses significantly less space than what would normally be required to store the map. The drawback is that the Bloomier filter can have false positives - if you look something up that isn't in the map, the Bloomier filter might accidentally report that it's there and give back a nonsense associated value. (This is a bit different from what you described in your question. In your question, you characterized a Bloomier filter as a Bloom filter that can also hand back the items stored in the filter.)
So to directly answer your question, yes, you absolutely could use a hash table instead of a Bloomier filter. However, doing so would use considerably more space. In most cases, we don't care about that space, but there are applications where space is at a premium and reducing storage would be valuable. As an example, check out the paper Weightless: Lossy Weight Encoding For Deep Neural Network Compression, which uses Bloomier filters to reduce the storage space needed to store large neural networks without too much of a decrease in the quality of the network.
As for how a Bloomier filter works, it's based on a technique that's closely related to that of the XOR filter, and the linked question gives a rough overview of the key techniques.
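To make the mechanics a bit more concrete, here is a minimal sketch of that peeling/XOR-style construction. Every name (build, lookup, _key_slots), the 1.25x table sizing, and the 8-bit fingerprint are illustrative choices for this sketch, not the API of any particular library, and the original Bloomier filter paper differs in some details:
```python
import random

def _key_slots(key, seed, m):
    """Three distinct table slots plus an 8-bit fingerprint for a key."""
    rng = random.Random(hash((key, seed)))
    return rng.sample(range(m), 3), rng.randrange(256)

def build(mapping, seed=0):
    """Build the slot table for a static map of keys to non-negative ints."""
    m = int(1.25 * len(mapping)) + 8
    key_info = {k: _key_slots(k, seed, m) for k in mapping}
    slot_keys = [[] for _ in range(m)]
    for k, (slots, _) in key_info.items():
        for s in slots:
            slot_keys[s].append(k)
    # Peel: repeatedly find a slot touched by exactly one remaining key.
    stack, removed, queue = [], set(), list(range(m))
    while queue:
        s = queue.pop()
        live = [k for k in slot_keys[s] if k not in removed]
        if len(live) != 1:
            continue
        key = live[0]
        stack.append((key, s))            # slot s is "owned" by this key
        removed.add(key)
        queue.extend(key_info[key][0])    # its other slots may now be peelable
    if len(removed) != len(mapping):
        return build(mapping, seed + 1)   # peeling failed; retry with fresh hashes
    # Assign in reverse peel order so no later write disturbs an earlier equation.
    table = [0] * m
    for key, s in reversed(stack):
        slots, fp = key_info[key]
        a, b = (t for t in slots if t != s)
        table[s] = ((mapping[key] << 8) | fp) ^ table[a] ^ table[b]
    return table, seed, m

def lookup(table, seed, m, key):
    slots, fp = _key_slots(key, seed, m)
    word = table[slots[0]] ^ table[slots[1]] ^ table[slots[2]]
    if word & 0xFF == fp:   # fingerprint matches: key was (probably) in the map
        return word >> 8
    return None             # fingerprint mismatch: definitely not a stored key
```
Built over a map such as {"apple": 3, "pear": 7}, lookup returns the stored value for every key that was in the map and None for most other keys; roughly 1 in 256 unknown keys will match the fingerprint by chance and get back a meaningless value, which is exactly the false-positive behaviour described above.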

One example of the use of a Bloomier filter is in an LSM tree. The Bloomier filter can store a map from each key to the run it is a member of, and this fits in memory much more easily than the full LSM tree, assuming the values are large. This reduces lookup time substantially, and production LSM trees like LevelDB and RocksDB do use Bloom-filter-like structures to help reduce lookup time.
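As a rough sketch of that read path (hypothetical names throughout: a plain dict stands in for the Bloomier filter, and runs stands in for the on-disk sorted runs):
```python
def lsm_get(key, key_to_run, runs):
    """Consult a small in-memory key -> run-index map before touching disk."""
    run_id = key_to_run.get(key)
    if run_id is None:
        return None                    # definitely not stored in any run
    return runs[run_id].get(key)       # one probe; None here means a false positive

# Tiny demo: two "runs" and an in-memory map saying where each key lives.
runs = [{"a": "apple"}, {"b": "banana", "c": "cherry"}]
key_to_run = {"a": 0, "b": 1, "c": 1}
print(lsm_get("b", key_to_run, runs))  # -> banana, after probing only run 1
```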

Related

any possible ways to implement "reliable bloom filter"?

I know Bloom filters can help with checking whether some element is in a set while saving considerable storage space compared to keeping every element in a container, like std::set, to search against.
I also understand that Bloom filters are probabilistic data structures, where the false-positive probability converges to a known mathematical expression. I was wondering whether it is possible to find some kind of data structure that is as efficient as a Bloom filter in terms of storage space, probably with some inevitable compromise in search time complexity, but that at the same time delivers 100% certain judgments, excluding any chance of false positives.
I checked out cuckoo filters and xor filters and came to realize that it seems impossible to find the answer among these structures because of their probabilistic nature.
Is there any kind of data structure that satisfies my requirements? Or is this kind of data structure impossible to implement? Are there any directions I could research further, and if so, could you please name some keywords?
Sincere thanks for your patience!

Checking a given hash is in a very, very long list

I have a list of hashes. A long list. A very long list. I need to check whether a given hash is in that list.
The easiest way is to store the hashes in memory (in a map or a simple array) and check against that, but that would require a lot of RAM/SSD/HDD space, more than a server (or several) can handle.
I'm wondering whether there's a trick to do this with reasonable memory usage. Maybe there's an algorithm I'm not familiar with, or a special collection?
Three thoughts:
Depending on the structure of these hashes, you may be able to borrow some ideas from the concept of a Rainbow Table to implicitly store some of them.
You could use a trie to compress storage for shared prefixes if you have enough hashes; however, given their length and (likely) uniformity, you won't see terrific savings.
You could split the hash into multiple smaller hashes and then use these to implement a Bloom filter (see the sketch below). This is a probabilistic test, so you'll still need the hashes stored somewhere else (or able to be calculated/derived) whenever there's a perceived "hit", but it may enable you to filter out enough "misses" that a less performant (speed-wise) data structure becomes feasible.
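Here is one way that third idea might look, sketched under the assumption that the k bit positions of a Bloom filter are carved out of a single SHA-256 digest. The class name and parameters are made up for illustration; since your items are already hashes, you could slice those bytes directly instead of re-hashing:
```python
import hashlib

class SlicedHashBloom:
    """Bloom filter whose k indices are 32-bit slices of one SHA-256 digest."""
    def __init__(self, num_bits=1 << 24, num_slices=7):
        assert num_slices * 4 <= 32       # 4 digest bytes per slice, 32 available
        self.bits = bytearray(num_bits // 8)
        self.num_bits = num_bits
        self.num_slices = num_slices

    def _indices(self, item: bytes):
        digest = hashlib.sha256(item).digest()
        for i in range(self.num_slices):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item: bytes):
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, item: bytes):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indices(item))
```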

Best filter for MANET

I know about the Bloom filter. It is very useful where storage is limited and where we only need to check whether an element "definitely does not exist" or "may exist", e.g. on mobile devices or in browser memory.
As in the best example given by Tarun.
I'd like to know of at least two or three filters that are better and faster than a Bloom filter where little storage is required.
I need a filter, or any technique better than a Bloom filter, that can be useful in a mobile ad hoc network (MANET) for storing device IP addresses and identifying address collisions.
Not that much better than a Bloom filter, but you can take a look at Cuckoo filters. However, it will be harder for you to find an open source implementation; here is one in Go.
Citing from the original Cuckoo Filter paper:
Cuckoo filters improve upon Bloom filters in three ways: (1) support for deleting items dynamically; (2) better lookup performance; and (3) better space efficiency for applications requiring low false positive rates (< 3%).
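To give a feel for the partial-key cuckoo hashing that paper describes, here is a toy sketch; the bucket count, 8-bit fingerprint, and kick limit are arbitrary illustrative values, and Python's built-in hash stands in for real hash functions:
```python
import random

class CuckooFilter:
    """Toy cuckoo filter: 8-bit fingerprints, 4 slots per bucket."""
    def __init__(self, num_buckets=1 << 16, bucket_size=4, max_kicks=500):
        self.buckets = [[] for _ in range(num_buckets)]  # num_buckets must be a power of two
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks

    def _fingerprint(self, item):
        return hash(("fp", item)) & 0xFF

    def _index(self, item):
        return hash(("idx", item)) % self.num_buckets

    def _alt_index(self, index, fp):
        # Partial-key cuckoo hashing: the alternate bucket depends only on the
        # current bucket and the fingerprint, so entries can be relocated later
        # without knowing the original item.
        return index ^ (hash(("alt", fp)) & (self.num_buckets - 1))

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: evict fingerprints back and forth for a while.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False                                     # filter is considered full

    def contains(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def delete(self, item):
        # Only safe for items that really were inserted, as in the paper.
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```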

Are different salted hashes, equivalent to different hashing algorithms for a bloom filter?

As your data set gets larger, you need a bigger bit array and more hashing algorithms to keep the false positive rate low, say at 1%.
If I want my bloom filter to grow dynamically at run time, it's unknown how many hashing algorithms I will need. If I use the same hasher (say MD5), but with randomly generated salts that are appended to the value before hashing it, will this have the same effect as using different hashers (say MD5, SHA1, etc.)?
I use .NET C# for reference, but the language is almost irrelevant for this question.
MD5 is a pretty expensive way to generate hashes for a Bloom filter. You probably want to use something that executes a bit faster such as a Jenkins hash or one of its variants, or something along these lines.
As you've noted, the Bloom filter requires a lot of hash functions. Coming up with 17 unique hash functions is difficult at best. Fortunately, there's a way to avoid having to do that. I used the technique described in the paper Less Hashing, Same Performance: Building a Better Bloom Filter. This turned out to be very easy in C#, and the performance was very good.
The math in the paper can be a bit hard to follow, but you can get the gist of it fairly easily. And the paper describes a couple of different ways to generate multiple hash code values simply and quickly.
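A minimal sketch of that trick, assuming the two 64-bit halves of a single SHA-256 digest stand in for the paper's two independent base hashes (the function name and parameters are illustrative):
```python
import hashlib

def bloom_indices(item: bytes, num_hashes: int, num_bits: int):
    """g_i(x) = h1(x) + i * h2(x) mod m, as in Kirsch & Mitzenmacher."""
    digest = hashlib.sha256(item).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1    # odd step avoids short cycles
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]
```
Each item then costs one real hash computation, however many index functions the filter needs.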
Also, Bloom filters aren't in general easy to size dynamically. If you want the Bloom filter to grow, you have to specifically build a scalable Bloom filter that supports it. A Google search on [scalable bloom filter] will turn up a number of references and some code samples.

Is there a hashing algorithm that is tolerant of minor differences?

I'm doing some web crawling type stuff where I'm looking for certain terms in webpages and finding their location on the page, and then caching it for later use. I'd like to be able to check the page periodically for any major changes. Something like md5 can be foiled by simply putting the current date and time on the page.
Are there any hashing algorithms that work for something like this?
A common way to do document similarity is shingling, which is somewhat more involved than hashing. Also look into content defined chunking for a way to split up the document.
I read a paper a few years back about using Bloom filters for similarity detection. Using Bloom Filters to Refine Web Search Results. It's an interesting idea, but I never got around to experimenting with it.
This might be a good place to use the Levenshtein distance metric, which quantifies the amount of editing required to transform one sequence into another.
The drawback of this approach is that you'd need to keep the full text of each page so that you could compare them later. With a hash-based approach, on the other hand, you simply store some sort of small computed value and don't require the previous full text for comparison.
You also might try some sort of hybrid approach--let a hashing algorithm tell you that any change has been made, and use it as a trigger to retrieve an archival copy of the document for more rigorous (Levenshtein) comparison.
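For reference, the classic dynamic-programming form of that metric is only a few lines (nothing here is specific to web pages):
```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]
```
Its cost is O(len(a) * len(b)) time, which is another reason to use a cheap hash as the change trigger and save this comparison for pages the hash flags.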
http://www.phash.org/ did something like this for images. The gist: take an image, blur it, convert it to greyscale, do a discrete cosine transform, and look at just the upper left quadrant of the result (where the important information is). Then record a 0 for each value less than the average and a 1 for each value more than the average. The result is pretty good for small changes.
Min-Hashing is another possibility. Find features in your text and record them as a value. Concatenate all those values to make a hash string.
For both of the above, use a vantage point tree so that you can search for near-hits.
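A hedged sketch of the min-hashing idea over word shingles (the shingle size, number of hash seeds, and use of SHA-1 are arbitrary illustrative choices):
```python
import hashlib

def shingles(text, k=3):
    """The set of k-word shingles in a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)} or {text}

def minhash_signature(text, num_hashes=64):
    """For each seed, keep the smallest hash value over all shingles."""
    return [min(int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
                for s in shingles(text))
            for seed in range(num_hashes)]

def similarity(sig_a, sig_b):
    """Fraction of matching positions estimates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```
Near-duplicate pages end up with signatures that agree in most positions, so a small edit such as an updated date stamp barely moves the score.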
I am sorry to say, but hash algorithms are precise by design. There is none capable of being tolerant of minor differences. You should take another approach.
