I know about Bloom filters. They are very useful where storage is limited and we only need to know whether an element "definitely does not exist" or "may exist", e.g. on mobile devices or browser in-memory caches, as in the best example given by Tarun.
I'd like to know of at least two or three filters that are better and faster than a Bloom filter while still requiring little storage.
Specifically, I need a filter, or any technique better than a Bloom filter, that can be used in a mobile ad hoc network for storing device IP addresses and detecting address collisions.
Not that much better than a Bloom filter, but you can take a look at cuckoo filters. It may be harder to find an open source implementation, though; here is one in Go.
Citing from the original Cuckoo Filter paper:
Cuckoo filters improve upon Bloom filters in three ways: (1) support for deleting items dynamically; (2) better lookup performance; and (3) better space efficiency for applications requiring low false positive rates (< 3%).
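To give a feel for how that works (and why deletion is possible), here is a very rough sketch of the partial-key cuckoo hashing idea from the paper, with single-slot buckets, byte-sized fingerprints and no resizing. It is only an illustration of the scheme, not the linked Go implementation:

```csharp
// A toy partial-key cuckoo filter: single-slot buckets, 8-bit fingerprints,
// table size must be a power of two, no resizing. Hash functions are ad hoc
// (string.GetHashCode is even per-process randomized in .NET Core) -- this is
// only meant to show the mechanics, not to be used as-is.
using System;

class CuckooFilterSketch
{
    private readonly byte[] slots;   // 0 means "empty slot"
    private readonly int mask;       // table size - 1
    private readonly Random rng = new Random();
    private const int MaxKicks = 500;

    public CuckooFilterSketch(int sizePowerOfTwo)
    {
        slots = new byte[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
    }

    private static byte Fingerprint(string item)
    {
        var f = (byte)(item.GetHashCode() & 0xFF);
        return f == 0 ? (byte)1 : f;             // reserve 0 for "empty"
    }

    private int IndexOf(string item) => (item.GetHashCode() * 0x45D9F3B) & mask;

    // The partner bucket depends only on the current bucket and the fingerprint,
    // so a stored fingerprint can be relocated without the original item.
    private int AltIndex(int index, byte fingerprint) =>
        (index ^ (fingerprint * 0x5BD1E995)) & mask;

    public bool Insert(string item)
    {
        byte f = Fingerprint(item);
        int i1 = IndexOf(item);
        int i2 = AltIndex(i1, f);
        if (slots[i1] == 0) { slots[i1] = f; return true; }
        if (slots[i2] == 0) { slots[i2] = f; return true; }

        // Both candidate buckets are occupied: evict and relocate ("kick").
        int i = rng.Next(2) == 0 ? i1 : i2;
        for (int n = 0; n < MaxKicks; n++)
        {
            (f, slots[i]) = (slots[i], f);       // swap with the occupant
            i = AltIndex(i, f);                  // send the evicted one to its partner
            if (slots[i] == 0) { slots[i] = f; return true; }
        }
        return false;                            // give up: filter is "full"
    }

    public bool MightContain(string item)
    {
        byte f = Fingerprint(item);
        int i1 = IndexOf(item);
        return slots[i1] == f || slots[AltIndex(i1, f)] == f;
    }

    // Deletion works because a matching fingerprint can only live in one of
    // the item's two buckets -- something a plain Bloom filter cannot offer.
    public bool Delete(string item)
    {
        byte f = Fingerprint(item);
        int i1 = IndexOf(item);
        int i2 = AltIndex(i1, f);
        if (slots[i1] == f) { slots[i1] = 0; return true; }
        if (slots[i2] == f) { slots[i2] = 0; return true; }
        return false;
    }
}
```

Because the alternate bucket is computed from the current bucket and the fingerprint alone (the XOR trick), an entry can be kicked to its partner bucket without knowing the original item, which is what makes both relocation and deletion work. As with the real data structure, Delete must only be called for items that were actually inserted.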
I know Bloom filters can help check whether an element is in a set while using considerably less storage than keeping every element in a container, like std::set, and searching it.
I also understand that a Bloom filter is a probabilistic data structure whose accuracy, i.e. its chance of generating false positives, converges to a known mathematical expression. I was wondering whether there is some data structure that is as efficient as a Bloom filter in terms of storage requirements, probably with some inevitable compromise in lookup time, but that delivers 100% certain positive judgments, excluding any chance of false positives.
I checked out cuckoo filters and XOR filters and came to realize that their probabilistic nature seems to rule them out as well.
Is there any kind of data structure that satisfies my requirements, or is such a data structure literally impossible to implement? Are there any directions I could research further, and if so, could you name some keywords?
Sincere thanks for your patience!
The previous Stack Overflow question comparing Bloom and cuckoo filters is 13 years old (here) and predates Redis modules by a decade. And I guess cuckoo filters must have matured quite a bit over the years in terms of adoption.
So keeping that in mind, which of the two is the better choice in terms of performance as far as Redis implementations are concerned? Is the cuckoo filter an obvious choice over Bloom given the extra features (like deletion and insertion counts)? Are there any trade-offs?
I want to implement one of these filters for "existing username" checks (rejecting usernames that are already taken). Are there any better techniques?
I guess cuckoo filters must have matured quite a bit over the years in terms of adoption.
Cuckoo filters are relatively simple, so no 'maturity process' was required.
That being said, since the introduction of cuckoo filters in 2014 many improvements have been suggested (and continue to be suggested), including:
Configurable bucket
Additive and subtractive cuckoo filter (ASCF)
Reducing Relocations in Cuckoo Filter
Tagged Cuckoo Filters
Optimized Cuckoo Filter (OCF)
Index-Independent Cuckoo filter (I2CF)
Leveraging the power of two choices to select the better candidate bucket during insertion
and even
CFBF: Reducing the Insertion Time of Cuckoo Filters With an Integrated Bloom Filter
Whether each of these methods guarantees better results (insert performance, query performance, memory consumption, etc.) for each use case requires a comparative analysis (I'm not aware of such unbiased research).
As for adoption:
There are many GitHub repositories implementing cuckoo filters in various languages.
There is a strong academic interest in both theoretical improvements (see above) and applications of cuckoo filters.
So keeping that in mind, which of the two is the better choice in terms of performance as far as Redis implementations are concerned? Is the cuckoo filter an obvious choice over Bloom given the extra features (like deletion and insertion counts)? Are there any trade-offs?
The question you referred to already has excellent answers concerning the performance and the tradeoffs between these two algorithms. It also discusses why performance is not just a single metric (insert performance vs. query performance; average time vs. worst time, etc.). Since Redis implements the data structure described in the original cuckoo filter paper (albeit in a highly optimized way), all issues discussed apply to the Redis implementation as well.
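For reference, in Redis both filters come from the RedisBloom module and are driven through their own commands (BF.* and CF.*). A rough sketch of the "existing username" check from .NET might look like the following; the key name and sizing are made up for the example, and the module commands are issued through StackExchange.Redis's generic Execute call:

```csharp
// Sketch only: assumes a Redis server with the RedisBloom module loaded.
// The key name "usernames" and the capacity/error-rate values are made up.
using StackExchange.Redis;

class UsernameFilter
{
    private readonly IDatabase db =
        ConnectionMultiplexer.Connect("localhost").GetDatabase();

    public void Setup()
    {
        // Bloom filter: ~1% false positive rate, sized for 1,000,000 usernames
        db.Execute("BF.RESERVE", "usernames", "0.01", "1000000");
        // The cuckoo equivalent would be:
        // db.Execute("CF.RESERVE", "usernames", "1000000");
        // which additionally supports CF.DEL and CF.COUNT.
    }

    public void Register(string username) =>
        db.Execute("BF.ADD", "usernames", username);

    // false => the username definitely does not exist yet
    // true  => it may exist; confirm against the primary user store
    public bool MightExist(string username) =>
        (bool)db.Execute("BF.EXISTS", "usernames", username);
}
```

Both variants answer the same "definitely not there / maybe there" question; whether the cuckoo filter's deletion and counting support outweighs its trade-offs is exactly the kind of analysis discussed above.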
Note that in addition to Bloom and cuckoo filters, several other approximate membership query data structures have been suggested, including XOR filters, ribbon filters, and binary fuse filters.
Determining which one is most suitable for a given use case requires a non-trivial analysis.
This question is about the Bloomier filter, which is not the same as a standard Bloom filter.
I'm learning about the Bloomier filter and I don't see the advantage of using one. As far as I understand, a Bloomier filter is a generalization of a Bloom filter: it can return the specific items themselves.
However, you could accomplish this by simply using a hash table, and hash tables seem faster and more space-efficient.
Given this, what's the purpose of a Bloomier filter?
There is a talk by Michael Mitzenmacher available here. On slide 41 he mentions the following about Bloomier Filter:
Bloomier filter [Chazelle, Kilian, Rubinfeld, Tal]:
Extend to handle approximate functions.
Each element of set has associated function value.
Non-set elements should return null.
Want to always return correct function value for set elements.
A false positive returns a function value for a non-set element.
You might find the following analogy helpful:
Bloom filter is to set as Bloomier filter is to map.
A Bloom filter can be thought of as a way of storing a compressed version of a set of items in significantly less space than the original set would normally take. The tradeoff involved is that you might have some false positives - items that aren't actually in the set but which the Bloom filter says are in the set.
A Bloomier filter can be thought of as a way of storing a compressed version of a map (or associative array, or dictionary, depending on what terminology you're using). That is, it stores a compressed representation of an association from keys to values. The advantage of the Bloomier filter is that it uses significantly less space than what would normally be required to store the map. The drawback is that the Bloomier filter can have false positives - if you look something up that isn't in the map, the Bloomier filter might accidentally report that it's there and give back a nonsense associated value. (This is a bit different from what you described in your question. In your question, you characterized a Bloomier filter as a Bloom filter that can also hand back the items stored in the filter.)
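If it helps, the same analogy can be written down as two purely hypothetical interfaces; neither is a real library API, they just state what each structure promises:

```csharp
// Hypothetical interfaces, only to restate the set-vs-map analogy above.
interface IBloomFilter<T>
{
    void Add(T item);
    // false => item is definitely not in the set
    // true  => item is probably in the set (may be a false positive)
    bool MightContain(T item);
}

interface IBloomierFilter<TKey, TValue>
{
    // Built once from a fixed key -> value map.
    // Key was in the map:     always returns true with the correct value.
    // Key was not in the map: usually returns false, but may occasionally
    //                         return true with an arbitrary, meaningless value
    //                         (that is the Bloomier filter's false positive).
    bool TryGetValue(TKey key, out TValue value);
}
```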
So to directly answer your question, yes, you absolutely could use a hash table instead of a Bloomier filter. However, doing so would use considerably more space. In most cases, we don't care about that space, but there are applications where space is at a premium and reducing storage would be valuable. As an example, check out the paper Weightless: Lossy Weight Encoding For Deep Neural Network Compression, which uses Bloomier filters to reduce the storage space needed to store large neural networks without too much of a decrease in the quality of the network.
As for how a Bloomier filter works, it's based on a technique that's closely related to that of the XOR filter, and the linked question gives a rough overview of the key techniques.
One example of the use of a Bloomier filter is in an LSM tree. The Bloomier filter can store a map from each key to the run it is a member of, and this can fit in memory much more easily than the full LSM tree, assuming the values are large. This reduces lookup time substantially, and production LSM trees like LevelDB and RocksDB do use Bloom-filter-like structures to help reduce lookup time.
As a hobby I'm writing a simple and primitive distributed web search engine, and it occurred to me that it currently has no protection against malicious peers trying to skew search results.
The current architecture of the project stores the inverted index and ranking factors in a Kademlia DHT, with peers updating this inverted index as they crawl the web.
I've used Google Scholar in an attempt to find some solution, but it seems most authors of proposed P2P web search designs ignore the above-mentioned problem.
I think I need some kind of reputation system or trust metric, but my knowledge in this domain is quite lacking and I would very much appreciate a few pointers.
One way you could avoid this is to only use reliable nodes for storing and retrieving values. The reliability of a node will have to be computed by known-good nodes, and it could be something like the similarity of a node's last few computed ranking factors compared to the same ranking factors computed by known-good nodes (i.e. compare the node's scores for google.com to known-good scores for google.com). Using this approach, you'll need to avoid the "rogue reliable node" problem (for example, by using random checks or reducing all reliability scores randomly).
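As a concrete (entirely made-up) starting point, a node's reliability could be scored by how closely its ranking factors for some probe site track the known-good values; the formula below is an assumption for illustration, not taken from any particular system:

```csharp
// Hypothetical reliability score: similarity of a node's ranking factors to
// known-good values, clamped to [0, 1]. The scoring formula is an assumption.
using System;
using System.Linq;

static class NodeReliability
{
    // Both arrays hold the same ranking factors (e.g. for google.com):
    // one set computed by the node under test, one by known-good nodes.
    public static double Score(double[] nodeScores, double[] trustedScores)
    {
        if (nodeScores.Length == 0 || nodeScores.Length != trustedScores.Length)
            throw new ArgumentException("Score vectors must be non-empty and match in length.");

        // mean relative error between the node's factors and the trusted factors
        double meanError = nodeScores.Zip(trustedScores, (a, b) =>
                Math.Abs(a - b) / Math.Max(Math.Abs(b), 1e-9))
            .Average();

        // 1.0 = identical to known-good, 0.0 = wildly off
        return Math.Max(0.0, 1.0 - meanError);
    }
}
```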
Another way you could approach this is to duplicate the computation of ranking factors across multiple nodes, fetch all of the values at search time, and rank them on the client side (using variance, for example). You could also limit results to sites that have at least 10 duplicate values computed, so that there is some delay before new sites are ranked. Additionally, any nodes with values outside of the normal range could be reported by the client in the background, and their reliability scores could be adjusted this way. This approach is time-consuming for the end user (unless you replicate known-good results to known-good nodes for faster lookups).
Also, take a look at this paper which describes a sybil-proof weak-trust system (which, as the authors explain, is more robust than the impossible sybil-proof strong-trust system): http://www.eecs.harvard.edu/econcs/pubs/Seuken_aamas14.pdf
The problem you are describing is the Byzantine Generals' Problem, also known as Byzantine fault tolerance. You can read more about it on Wikipedia, and there are plenty of papers written about it.
I don't remember the exact algorithm, but basically it's mathematically proven that to tolerate t traitors (malicious peers) you need at least 3t + 1 peers in total in order to reach agreement despite the traitors; for example, coping with a single traitor requires at least four peers.
My general thought is that this is a huge implementation overhead and resource cost on the indexing side, and while there is still plenty of research to be done in distributed indexing and distributed search, not many people are tackling it yet. Also, the problem has basically been solved by the Byzantine Generals' work; it "just" needs to be implemented on top of an existing (and working) distributed search engine.
If you don't mind having a time delay on index updates, you could opt for a blockchain approach similar to what Bitcoin uses to secure funds.
Changes to the index (deltas only!) can be represented in a text or binary file format and crunched by peers who accept a given block of deltas. A malicious peer would have to out-compute the rest of the network for a period of time in order to skew the index in their favor.
I believe the Bitcoin hashing algorithm (SHA-256) to be flawed in that custom hardware renders the common user's hardware useless. A blockchain using Litecoin's algorithm (scrypt) would work well, because CPUs and GPUs remain effective tools for the computation.
You would tune the difficulty so that new blocks are produced on a fairly regular schedule -- maybe every 2-5 minutes. A user of the search engine could choose to only use index state that is at least 30 minutes old, to guarantee that enough users in the network vouch for its contents.
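As a rough sketch of what such a block of deltas might look like (the block layout, the zero-byte difficulty check, and the use of SHA-256 here are all illustrative assumptions; as noted above, scrypt would be the preferable work function):

```csharp
// Hypothetical "block of index deltas" with a simple proof-of-work check.
using System;
using System.Security.Cryptography;
using System.Text;

class DeltaBlock
{
    public string PreviousBlockHash = "";
    public string[] IndexDeltas = Array.Empty<string>();  // e.g. "ADD example.com term=foo rank=0.42"
    public long Nonce;

    public byte[] Hash()
    {
        string payload = PreviousBlockHash + "\n" + string.Join("\n", IndexDeltas) + "\n" + Nonce;
        using (var sha = SHA256.Create())
            return sha.ComputeHash(Encoding.UTF8.GetBytes(payload));
    }

    // Peers "accept" a block of deltas by finding a nonce whose hash starts with
    // the required number of zero bytes; skewing the index would then require
    // out-computing the honest peers for a sustained period.
    public void Mine(int difficultyBytes)
    {
        while (!IsValid(difficultyBytes))
            Nonce++;
    }

    public bool IsValid(int difficultyBytes)
    {
        byte[] h = Hash();
        for (int i = 0; i < difficultyBytes; i++)
            if (h[i] != 0) return false;
        return true;
    }
}
```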
more info:
https://en.bitcoin.it/wiki/Block_chain
https://en.bitcoin.it/wiki/Block_hashing_algorithm
https://litecoin.info/block_hashing_algorithm
https://www.coinpursuit.com/pages/bitcoin-altcoin-SHA-256-scrypt-mining-algorithms/
As your data set gets larger, you need more hash functions to keep the false positive rate low, say at 1%.
If I want my Bloom filter to grow dynamically at run time, it's unknown how many hash functions I will need. If I use the same hasher (say MD5), but with randomly generated salts appended to the value before hashing it, will this have the same effect as using different hashers (say MD5, SHA1, etc.)?
I'm using .NET C# for reference, but the language is almost irrelevant for this question.
MD5 is a pretty expensive way to generate hashes for a Bloom filter. You probably want to use something that executes a bit faster such as a Jenkins hash or one of its variants, or something along these lines.
As you've noted, the Bloom filter requires a lot of hash functions. Coming up with 17 unique hash functions is difficult at best. Fortunately, there's a way to avoid having to do that. I used the technique described in the paper Less Hashing, Same Performance: Building a Better Bloom Filter. This turned out to be very easy in C#, and the performance was very good.
The math in the paper can be a bit hard to follow, but you can get the gist of it fairly easily. And the paper describes a couple of different ways to generate multiple hash code values simply and quickly.
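As a rough illustration of that idea (the base hashes and seeds below are my own choices, not the paper's): all k indexes are derived from just two base hash values via g_i(x) = h1(x) + i * h2(x) mod m.

```csharp
// Sketch of the "less hashing" technique: k Bloom filter indexes from two base hashes.
static int[] BloomIndexes(string item, int k, int m)
{
    uint h1 = Fnv1a(item, 2166136261u);      // standard FNV-1a offset basis
    uint h2 = Fnv1a(item, 0x5BD1E995u) | 1u; // different seed, forced odd so the step is never zero

    var indexes = new int[k];
    for (int i = 0; i < k; i++)
        indexes[i] = (int)((h1 + (uint)i * h2) % (uint)m);
    return indexes;
}

static uint Fnv1a(string s, uint seed)
{
    uint hash = seed;
    foreach (char c in s)
        hash = (hash ^ c) * 16777619u;
    return hash;
}
```

Each index is then used to set or test one bit in the m-bit array, exactly as if it had come from an independent hash function.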
Also, Bloom filters aren't in general easy to size dynamically. If you want the Bloom filter to grow, you have to specifically build a scalable Bloom filter that supports growth. A Google search on [scalable bloom filter] will provide a number of references and some code samples.