When is a Bloom filter useful? - data-structures

I understand what makes Bloom filters an attractive data structure; however, I'm finding it difficult to understand when you can actually use them, since you still have to perform the expensive operation you're trying to avoid in order to be certain that you haven't hit a false positive. Because of this, wouldn't they generally just add a lot of overhead? For example, the Wikipedia article on Bloom filters suggests they can be used for data synchronization. I see how that would be great the first time around, when the Bloom filter is empty, but say you haven't changed anything and you go to synchronize your data again. Now every lookup in the Bloom filter will report that the file has already been copied, but wouldn't we still have to perform the slower lookup we were trying to avoid to make sure that's correct?

Basically, you use Bloom filters to avoid the long and arduous task of proving an item doesn't exist in the data structure. It's almost always harder to determine if something is missing than if it exists, so the filter helps to shore up losses searching for things you won't find anyway. It doesn't always work, but when it does you reap a huge benefit.

Bloom filters are very efficient for membership queries, i.e., finding out whether an element belongs to a set. Query time depends on the number of hash functions used, not on the number of elements stored in the set.
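To make the "definitely absent" versus "possibly present" behaviour concrete, here is a minimal Bloom filter sketch in Ruby. The class name TinyBloom, the filter size, and the salted SHA-256 hashing are illustrative assumptions, not a tuned implementation.

```ruby
require "digest"

# Minimal Bloom filter sketch: m bits, k hash positions per item.
# TinyBloom is a hypothetical name; sizes and hashing are illustrative.
class TinyBloom
  def initialize(m_bits, k_hashes)
    @m = m_bits
    @k = k_hashes
    @bits = Array.new(m_bits, false)
  end

  # Set the k bit positions derived from the item.
  def add(item)
    positions(item).each { |i| @bits[i] = true }
  end

  # false => the item was definitely never added
  # true  => the item was possibly added (false positives can occur)
  def include?(item)
    positions(item).all? { |i| @bits[i] }
  end

  private

  # Derive k positions by salting a single digest with a seed.
  def positions(item)
    (0...@k).map do |seed|
      Digest::SHA256.hexdigest("#{seed}:#{item}").to_i(16) % @m
    end
  end
end

filter = TinyBloom.new(1024, 3)
filter.add("alice@example.com")
filter.include?("alice@example.com") # always true for an added item
```

Each query touches only k bit positions no matter how many items were added, which is why the cheap "definitely absent" answer is the useful one.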

A common example is adding an email address to a recipient list. Your application should check whether the address is already in the contacts list, and if it's not, a popup should appear asking whether you want to add the new recipient. To implement this, you would normally follow these steps in a front-end application:
Get the list of contacts from a server
Create a local copy for fast lookup
Allow looking up a contact
Provide the option to add a new contact if the lookup is unsuccessful
Sync with the server when a new contact is added or an existing email is updated.
A Bloom filter handles the lookup step quickly and with little memory. You could use a dictionary for fast lookup, but that would require storing the entire contact list as key-value pairs, and for a large contact list you might not have enough storage space in the browser.


Making sure my Go page view counter isn't abused

I believe I have found a very good and fast solution for efficiently counting page views:
Working example in the Go Playground here: https://play.golang.org/p/q_mYEYLa1h
My idea is to push this to the database every X minutes, and after pushing a key then delete it from the page map.
My question now is, what would be the optimal way to ensure that this isn't abused? Ideally, I would only want to increase page count from the same person if there was a time interval of 2 hours since last visiting the page.
As far as I know, it would be ideal to store and compare both IP and user agent (I don't want to rely on cookie/localstorage), but I'm not quite sure how to efficiently store and compare this information.
I'd likely get both the IP (req.Header.Get("x-forwarded-for")) and UserAgent (req.UserAgent()) from http.Request.
I was thinking making a visitor struct similar to my page struct that would look like this:
type visitor struct {
    mutex          sync.Mutex
    urlIPUAAndTime map[string]time.Time // key: URL + IP + user agent, value: last visit
}
This way should make it possible to do something similar to before. However, imagine if the website had so many requests that there would be hundreds of millions of unique visitor maps being stored, and each of these could only be deleted after 2 (or more) hours. I therefore think this is not a good solution.
I guess it would be ideal/necessary to write to and read from some file, but not sure how this should be done efficiently. Help would be greatly appreciated
One possible optimization is to put a Bloom filter in front of this map. A Bloom filter is a probabilistic structure that gives one of two answers:
this user is definitely new
this user may have been here before
This lets you cut off the computation at an early stage. If many of your users are new, you avoid a database request for each of them.
What if the filter says the user may have been here before? Then you go to the database and check.
One more optimization: if you don't need very accurate numbers and can tolerate an error of a few percent, you can rely on the Bloom filter alone. I guess many large sites use this technique for estimation.
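The flow described above (ask the filter first, fall back to the slower store only on "possibly seen") can be sketched as follows. This is written in Ruby for brevity rather than Go; VisitBloom, count_view?, the key format, and the plain Hash standing in for the database are all assumptions made for illustration. A Bloom filter cannot delete entries, so a real deployment would also need to rotate or rebuild the filter periodically, which this sketch leaves out.

```ruby
require "digest"

# Compact Bloom filter (hypothetical class; sizes and hashing are illustrative).
class VisitBloom
  def initialize(m = 8192, k = 3)
    @m = m
    @k = k
    @bits = Array.new(m, false)
  end

  def add(key)
    positions(key).each { |i| @bits[i] = true }
  end

  # false => definitely new, true => possibly seen before
  def maybe?(key)
    positions(key).all? { |i| @bits[i] }
  end

  private

  def positions(key)
    (0...@k).map { |s| Digest::SHA256.hexdigest("#{s}:#{key}").to_i(16) % @m }
  end
end

WINDOW = 2 * 60 * 60 # two hours, in seconds

# last_seen is a Hash standing in for the slower authoritative store.
# Returns true when the view should be counted.
def count_view?(bloom, last_seen, url, ip, user_agent, now)
  key = "#{url}|#{ip}|#{user_agent}"
  unless bloom.maybe?(key)
    # Definitely new: skip the slow lookup entirely.
    bloom.add(key)
    last_seen[key] = now
    return true
  end
  # Possibly seen: confirm against the store before deciding.
  previous = last_seen[key]
  return false if previous && now - previous < WINDOW
  last_seen[key] = now
  true
end
```

The saving comes from the first branch: a visitor the filter has never seen costs no store lookup at all.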

What is the most efficient Ruby data structure to track progress?

I am working on a small project which progressively grows a list of links and then processes them through a queue. There exists the likelihood that a link may be entered into the queue twice and I would like to track my progress so I can skip anything that has already been processed. I'm estimating around 10k unique links at most.
For larger projects I would use a database but that seems overkill for the amount of data I am working with and would prefer some form of in-memory solution that can potentially be serialized if I want to save progress across runs.
What data structure would best fit this need?
Update: I am already using a hash to track which links I have completed processing. Is this the most efficient way of doing it?
def process_link(link)
  # Skip links that have already been processed
  return if @processed_links[link]
  # ... processing logic
  @processed_links[link] = Time.now # or other state
end
If you aren't concerned about memory, then just use a Hash to check inclusion; insert and lookup times are O(1) average case. Serialization is straightforward (Ruby's Marshal class should take care of that for you, or you could use a format like JSON). Ruby's Set is an array-like object that is backed with a Hash, so you could just use that if you're so inclined.
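A minimal sketch of the Set-plus-Marshal approach, assuming links are tracked as plain strings (the URLs are placeholders):

```ruby
require "set"

processed = Set.new
processed << "http://example.com/a"

# add? returns nil when the element is already present,
# so it doubles as a "have I seen this?" check.
first_time = processed.add?("http://example.com/b")  # the set itself (truthy)
duplicate  = processed.add?("http://example.com/a")  # nil

# Round-trip through Marshal to save progress across runs.
data = Marshal.dump(processed)
restored = Marshal.load(data)
restored.include?("http://example.com/a") # => true
```

In a real script, the dumped string would be written to a file with File.binwrite and read back with File.binread at startup.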
However, if memory is a concern, then this is a great problem for a Bloom filter! You can test set inclusion in constant time, and the filter uses substantially less memory than a hash would. The tradeoff is that Bloom filters are probabilistic: you can get false positives on inclusion. You can drive the false positive rate very low with the right Bloom filter parameters, but if duplicates are the exception rather than the rule, you could implement something like:
Check for set inclusion in the Bloom filter [O(1)]
If the Bloom filter reports that the entry is found, perform an O(n) check of the input data to see whether this item has actually appeared in the array of input data before now.
That would get you very fast and memory-efficient lookups for the common case. You could either accept the possibility of false positives (to keep the whole thing small and fast), or verify set inclusion when a duplicate is reported (to only do expensive work when you absolutely have to).
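To give a feel for "the right Bloom filter parameters", the standard approximations for an optimally sized filter are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, for n items and a target false positive rate p. The helper name bloom_parameters below is hypothetical; it only evaluates those formulas.

```ruby
# Approximate optimal Bloom filter size for n items and target
# false positive rate p (standard approximations, not exact values).
def bloom_parameters(n, p)
  bits = (-n * Math.log(p) / (Math.log(2)**2)).ceil
  hashes = ((bits.to_f / n) * Math.log(2)).round
  { bits: bits, hashes: hashes }
end

# The question estimates about 10k unique links; assume a 1% target rate.
params = bloom_parameters(10_000, 0.01)
# Roughly 96,000 bits (about 12 KB) and 7 hash functions.
```

At this scale a plain Hash is also small, so the filter mainly pays off when the number of links grows well beyond the estimate.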
https://github.com/igrigorik/bloomfilter-rb is a Bloom filter implementation I've used in the past; it works nicely. There are also redis-backed Bloom filters, if you need something that can perform set membership tracking and testing across multiple app instances.
How about a Set, with your links converted to value objects (rather than reference objects) such as Structs? With value objects, the Set can detect duplicates. Alternatively, you could use a hash and store links by their PK.
The data structure could be a hash:
current_status = { links: [link3, link4, link5], processed: [link1, link2] }
To track your progress (in percent):
links_count = current_status[:links].length + current_status[:processed].length
progress = (current_status[:processed].length * 100) / links_count # Will give you percent of progress
To process your links:
push any new link you need to process to current_status[:links].
Use shift to take from current_status[:links] the next link to be processed.
After processing a link, push it to current_status[:processed]
As I see it (and understand your question), the logic to process your links would be:
# Assumes current_status is reachable here (e.g. via a method or attribute).
# Add a new link to the queue unless it has already been processed
def add_link_to_queue(link)
  current_status[:links].push(link) unless current_status[:processed].include?(link)
end
# Process the next link on the queue
def process_next_link
  # shift returns the first link and also removes it from the queue, which avoids duplicates
  link = current_status[:links].shift
  # ... logic to process the link
  current_status[:processed].push(link)
end

Mongo Db design (embed vs references)

I've read a lot of documents, Q&A, etc. on this topic (whether to embed or to use references).
I understand the reasons for using one approach or the other, but I haven't seen anyone discuss (or ask about) a case like this:
I have two entities, A and B, with a one-to-many relation between them (one A can belong to many B documents). I can embed A (the denormalized approach), and that's fine; I clearly understand it. But what if I later want to modify a field of an A document that is embedded in many B documents? By "modify" I don't mean replacing A with A'; I mean changing that exact A record. In the embedded case, that means I have to apply the change to every B document that already holds a copy of A.
Based on the description here: http://docs.mongodb.org/manual/tutorial/model-embedded-one-to-many-relationships-between-documents/#data-modeling-example-one-to-many
What if we later want to change an address:name field that is used in many documents?
What if we need the list of available addresses in the system?
How fast will those operations be in MongoDB?
It depends on which operations you perform most. If you mostly insert and select documents, and only occasionally (say, once a month) need to modify many nested sub-documents, I think storing A inside B is good practice; it's what MongoDB is designed for. You save a lot of time by selecting a single document without having to join others, and you can live with a slower update once in a while.
How fast the update operations are obviously depends on the volume of data.
Another consideration when choosing between embedded documents and references is whether the data in a single document would exceed 16 MB. That's a lot of data, mind.
In some cases however, it simply doesn't make sense to denormalise entire documents especially where they're used/referenced elsewhere.
Take a User document for example, you wouldn't usually denormalise all user attributes across each collection that needs to reference a user. Instead you reference the user [with maybe some denormalised user detail].
Obviously each additional denormalised value (unless it was an audit) would need to be updated when the referenced User changes, but you could queue the updates for a background process to deal with - rather than making the caller wait.
I'll throw in some more advice as to speed.
If you have a sub-document called A that is embedded in lots of documents - and you want to change instances of A ...
Careful that the documents don't grow too much with a change. That will hurt performance if A grows too big because it will force Mongo to move the document in memory.
It obviously depends on how many embedded instances you have. The more you have, the slower it will be.
It depends on how you match the sub-document. If you are finding A without an index, it's going to be slow. If you are using range operators to identify it, it will be slow.
Someone already mentioned the size of documents will most likely affect the speed.
The best advice I heard about whether to link or embed was this ... if the entity (A in this case) is mutable ... if it is going to mutate/change often ... then link it, don't embed it.

How do you RESTfully get a complicated subset of records?

I have a question about getting 'random' chunks of available content from a RESTful service, without duplicating what the client has already cached. How can I do this in a RESTful way?
I'm serving up a very large number of items (little articles with text and urls). Let's pretend it's:
My (software) clients want to get random chunks of what's available. There are too many items to load them all onto the client. They do not have a natural order, so it's not a situation where clients can just ask for the latest. Instead, there are around 6-10 attributes that the client may give to 'hint' at what type of articles they'd like to see (e.g. popular, recent, trending...).
Over time the clients get more and more content, but at the server I have no idea what they have already, and because they're sent randomly, I can't just pass in the 'most recent' one they have.
I could conceivably send up the GUIDs of what's stored locally. The clients only store 50-100 items locally. That's small enough to stuff into a POST variable, but not into the GET query string.
What's a clean way to design this?
Key points:
Data has no logical order
Clients must cache the content locally
Each item has a GUID
Want to avoid pulling down duplicates
You'll never be able to make this work satisfactorily if the data is truly kept in a random order (bear in mind the Dilbert RNG Effect); you need to fix the order for a particular client so that they can page through it properly. That's easy to do though; just make that particular ordering be a resource itself; at that point, you've got a natural (if possibly synthetic) ordering and can use normal paging techniques.
The main thing to watch out for is that you'll be creating a resource in response to a GET when you do the initial query: you probably should use a resource name that is a hash of the query parameters (including the client's identity if that matters) so that if someone does the same query twice in a row, they'll get the same resource (so preserving proper idempotency). You can always delete the resource after some timeout rather than requiring manual disposal…
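One way to derive such a resource name is to canonicalize the query parameters and hash them together with the client's identity. The sketch below is only an illustration: the /orderings/ path, the function name, and the truncation to 16 hex characters are assumptions, not part of any particular framework.

```ruby
require "digest"

# Build a stable resource path from a client id and the hint parameters.
# Sorting the parameters makes the result independent of their order.
def ordering_resource_path(client_id, hints)
  canonical = hints.sort_by { |name, _| name.to_s }
                   .map { |name, value| "#{name}=#{value}" }
                   .join("&")
  digest = Digest::SHA256.hexdigest("#{client_id}?#{canonical}")
  "/orderings/#{digest[0, 16]}"
end

a = ordering_resource_path("client-42", popular: 1, recent: 1)
b = ordering_resource_path("client-42", recent: 1, popular: 1)
# a == b: the same query from the same client maps to the same resource.
```

The client can then page through that resource with ordinary offset or cursor parameters, and the server can expire it after a timeout.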

Organizing memcache keys

I'm trying to find a good way to handle memcache keys for storing, retrieving and updating data to/from the cache layer in a more civilized way.
Found this pattern, which looks great, but how do I turn it into a functional part of a PHP application?
The Identity Map pattern: http://martinfowler.com/eaaCatalog/identityMap.html
Update: I have been told about the modified memcache (memcache-tag) that apparently does a lot of this, but I can't install Linux software on my Windows development box...
Well, memcache use IS an identity map pattern. You check your cache, then you hit your database (or whatever else you're using). You can go about finding information about the source by storing objects instead of just values, but you'll take a performance hit for that.
You effectively cannot ask the cache for a list of what it contains. To mass invalidate, you'll have to keep a list of what you put in and iterate over it, or you'll have to iterate over every possible key that could fit the pattern of concern. The resource you point out, memcache-tag, can simplify this, but it doesn't appear to be maintained in line with the memcache project.
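The "keep a list of what you put in" option can be sketched like this. It is written in Ruby rather than PHP, with a plain Hash standing in for the memcache client; TrackedCache and its method names are hypothetical. In a real setup the key index would itself have to live somewhere shared by all app instances.

```ruby
require "set"

# Wraps a key-value store and remembers which keys belong to which group,
# so a whole group can be invalidated by iterating its recorded keys.
class TrackedCache
  def initialize(store = {})
    @store = store # stand-in for a memcache client
    @keys_by_group = Hash.new { |index, group| index[group] = Set.new }
  end

  def set(group, key, value)
    @store[key] = value
    @keys_by_group[group] << key
  end

  def get(key)
    @store[key]
  end

  # Iterative delete: remove every key recorded for the group.
  def invalidate_group(group)
    keys = @keys_by_group.delete(group)
    keys&.each { |key| @store.delete(key) }
  end
end

cache = TrackedCache.new
cache.set("user:1", "user:1:profile", "profile data")
cache.set("user:1", "user:1:posts", "post data")
cache.set("user:2", "user:2:profile", "other profile")
cache.invalidate_group("user:1")
cache.get("user:2:profile") # still cached
```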
So your options now are iterative deletes or flushing everything that is cached. I think the real question here is one of design, so to give you a useful answer I have to ask: why do you want to do this?
