In this problem we have users being matched with other online users. However, it is not just a one-to-one match. A user is given a selection of 5 other users to choose from, which are then marked as seen and should not be shown again when the user requests another 5 users. More people can come online during the process.
The problem is, I want a way for each user to be shown in the selections of other users; Redis is the tool, but an algorithm is mostly what I'm looking for. I'm trying to implement this in the fastest way possible, using Redis if possible, but I can also make calls to the database if needed.
My current solution is as follows; hopefully someone will have some tips to improve on its O(N) calls.
Each user needs a seen set of user_ids. We keep a Redis list (queue) of online users. We keep popping users from the left until we find one that isn't in the requesting user's seen set, save it, add it to the user's seen set, and push it onto the right. Once we have 5 of those, we push back on the left the ones we popped off that were already seen.
This is the best I could think of, but it is O(N) each time we want to find 5 users for one user to select from. It's possible (though not likely) that the user has seen a huge number of users and ends up popping off the whole list. A rough sketch of this approach is below.
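For concreteness, here is a minimal sketch of that list-rotation idea using redis-py; the key names `online_users` and `seen:<user_id>` are just illustrative, not anything from the original post:

```python
import redis

r = redis.Redis(decode_responses=True)  # assumes a local Redis instance

def pick_five(user_id, k=5):
    """List-rotation approach: O(N) over the online list in the worst case."""
    seen_key = f"seen:{user_id}"
    picked, skipped = [], []
    for _ in range(r.llen("online_users")):      # at most one full pass
        candidate = r.lpop("online_users")
        if candidate is None:
            break
        if candidate != str(user_id) and not r.sismember(seen_key, candidate):
            picked.append(candidate)
            r.sadd(seen_key, candidate)
            r.rpush("online_users", candidate)   # stays online, moves to the tail
        else:
            skipped.append(candidate)            # already seen; set aside for now
        if len(picked) == k:
            break
    if skipped:
        r.lpush("online_users", *reversed(skipped))  # restore them at the head
    return picked
```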
To help understand this better: a naive approach is to have every single user hold a copy of all online users in the form of a set. Then we simply pop 5 random set members. But this can't work because there isn't enough space, and each time a user comes online they'd have to be added to every user's online set (or removed from each when they go offline), and those operations are O(N), since an O(1) operation is repeated for N users.
Does anyone have any tips to match users with other users?
It would be good to know what kind of data we are talking about. How many users exist? How many will be online on average? What is the ratio of "seen users" to all users (sparse vs. dense)?
Modification of your algorithm
Don't pop the first element; choose a random element from the set of online users. This should improve balancing and may help with amortized complexity, depending on the ratio of these two sets!
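A sketch of that modification, assuming the online users are kept in a Redis set rather than a list (key and parameter names are again illustrative); SRANDMEMBER supplies random candidates and the seen set filters them:

```python
import redis

r = redis.Redis(decode_responses=True)  # same assumed setup as above

def pick_five_random(user_id, k=5, batch=20, max_rounds=10):
    """Sample random online users and filter by the seen set.
    Gives up after max_rounds if almost everyone has already been seen."""
    seen_key = f"seen:{user_id}"
    picked = set()
    for _ in range(max_rounds):
        for candidate in r.srandmember("online_users", batch):
            if candidate == str(user_id) or candidate in picked:
                continue
            if not r.sismember(seen_key, candidate):
                picked.add(candidate)
                if len(picked) == k:
                    break
        if len(picked) == k:
            break
    if picked:
        r.sadd(seen_key, *picked)
    return list(picked)
```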
Alternative Algorithm (more structured; still bad worst-case; should be good if sparse seen)
Keep seen as a balanced tree (O(log n) insertion).
Keep online as a balanced tree.
While not enough users chosen:
    Search for the first gap in seen (e.g. [0,1,3,7] -> 2; O(log n) according to the SO link)
    Search for the first online user >= the gap value (O(log n))
    If user < next_gap_neighbor (in the example above: 3, the next value after the picked gap 2)
        -> pick it
    Else
        -> add the chosen gap value temporarily to seen (a model decision: how often to refresh online), OR limit the search somehow to > the chosen gap value (O(log n))
Depending on the data, this should work very well if the data set is huge and seen is sparse! A rough sketch of the idea follows.
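Here is a self-contained sketch of the gap-search idea using plain sorted lists and binary search in place of balanced trees. It assumes user ids are small non-negative integers (as in the [0,1,3,7] example) and that both inputs are sorted; all names are made up for illustration:

```python
import bisect

def first_gap(seen):
    """Smallest non-negative integer missing from the sorted list `seen`.
    Binary search: within a gap-free prefix, seen[i] == i holds."""
    lo, hi = 0, len(seen)
    while lo < hi:
        mid = (lo + hi) // 2
        if seen[mid] == mid:
            lo = mid + 1
        else:
            hi = mid
    return lo

def pick_unseen(online, seen, k=5):
    """Pick up to k ids from sorted list `online` that are not in sorted list `seen`."""
    picked = []
    seen = list(seen)                         # work on a temporary copy
    while len(picked) < k:
        gap = first_gap(seen)                 # e.g. [0,1,3,7] -> 2
        i = bisect.bisect_left(online, gap)   # first online user >= gap
        if i == len(online):
            break                             # nothing left to consider
        candidate = online[i]
        j = bisect.bisect_right(seen, gap)
        next_neighbor = seen[j] if j < len(seen) else float("inf")
        if candidate < next_neighbor:         # candidate falls inside the gap: unseen
            picked.append(candidate)
            bisect.insort(seen, candidate)
        else:                                 # can't decide yet: close this gap and retry
            bisect.insort(seen, gap)
    return picked

# Example: pick_unseen(online=[2, 4, 5, 9], seen=[0, 1, 3, 7], k=2) -> [2, 4]
```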
Related
I understand the basics of search engine ranking, including the ideas of "reverse index", "vector space model", "cosine similarity", "PageRank", etc.
However, when a user submits a popular query term, it is very likely that millions of pages contain this term. As a result, a search engine still needs to sort these millions of pages in real time. For example, I just tried searching "Barack Obama" in Google. It shows "About 937,000,000 results (0.49 seconds)". Ranking over 900M items within 0.5 seconds? That really blows my mind!
How does a search engine sort such a large number of items within 1 second? Can anyone give me some intuitive ideas or point out references?
Thanks!
UPDATE:
Most of the responses (including some older discussions) so far seem to attribute the credit to the "reverse index". However, as far as I know, the reverse index only helps find the "relevant pages". In other words, by the inverted index Google could obtain the 900M pages containing "Barack Obama" (out of several billion pages). However, it is still not clear how to "rank" these millions of "relevant pages" based on the threads I have read so far.
MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet our request.
The question would be really relevant if we were sure that the ranking was complete. It is quite possible that the ordering provided is approximate.
Given the fluidity of the ranking results, no answer that looks reasonable could be considered incorrect. For example, if an entire section of the web were excluded from the top results, you would not notice, provided they were included later.
This gives the developers a degree of latitude entirely unavailable in almost all other domains.
The real question to ask is - how precisely do the results match the actual rank assigned to each page?
There are two major factors that influence the time it takes for you to get a response from your search engine.
The first is whether you're storing your index on a hard disk. If you're using a database, it's very likely that you're using the hard disk at least a little. From a cold boot, your queries will be slow until the data necessary for those queries has been pulled into the database cache.
The other is having a cache for your popular queries. It takes a lot longer to search for a query than it does to return results from a cache. Now, the random access time for a disk is too slow, so they need to have it stored in RAM.
To solve both of these problems, Google uses memcached. It's an application that caches the output of the Google search engine and feeds slightly old results to users. This is fine because most of the time the web doesn't change fast enough for it to be a problem, and because of the significant overlap in searches. You can be almost guaranteed that Barack Obama has been searched for recently.
Another issue that affects search engine latency is network overhead.
Google has been using a custom variant of Linux (IIRC) that has been optimised for use as a web server. They've managed to reduce some of the time it takes to start turning around results to a query.
The moment a query hits their servers, the server immediately responds back to the user with the header for the HTTP response, even before Google has finished processing the query terms.
I'm sure they have a bunch of other tricks up their sleeves, too.
EDIT:
They also keep their inverted lists sorted already, from the indexing process (it's better to process once than for each query).
With these pre-sorted lists, the most expensive operation is list intersection. Although I'm fairly sure Google doesn't rely on a vector space model, so list intersection isn't so much a factor for them.
The models that pay off the best according to the literature are the probabilistic models. As an example, you may wish to look up Okapi BM25. It does fairly well in practice within my area of research (XML Retrieval). When working with probabilistic models, it tends to be much more efficient to process document at a time instead of term at a time. What this means is that instead of getting a list of all of the documents that contain a term, we look at each document and rank it based on the terms it contains from our query (skipping documents that have no terms).
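To make the document-at-a-time idea concrete, here is a toy BM25 scorer over a small in-memory corpus; real engines score from pre-built inverted lists, and the k1/b defaults below are just the usual textbook values, not anything specific to any particular engine:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against the query, document at a time."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()
    for d in docs:
        df.update(set(d))                      # document frequency per term
    idf = {t: math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5)) for t in query_terms}
    scores = []
    for d in docs:                             # one document at a time
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue                       # skip terms this document lacks
            score += idf[t] * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```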
But if we want to be smart, we can approach the problem in a different way (but only when it appears to be better). If there's a query term that is extremely rare, we can rank with that first, because it has the highest impact. Then we rank with the next best term, and we continue until we've determined if it's likely that this document will be within our top k results.
One possible strategy is to rank just the top k instead of the entire list.
For example, to find the top 100 results from 1 million hits, a selection algorithm (e.g. a size-k heap) has time complexity O(n log k). Since k = 100 and n = 1,000,000, in practice we can treat log(k) as a small constant.
Now, you only need O(n) to obtain the top 100 results out of 1 million hits.
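As a minimal illustration on made-up data, a bounded heap gives the top k without sorting all n scores:

```python
import heapq
import random

scores = [random.random() for _ in range(1_000_000)]   # stand-in for 1M hit scores
top_100 = heapq.nlargest(100, scores)                   # O(n log k) rather than O(n log n)
```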
Also, I guess the use of NoSQL databases instead of an RDBMS helps.
NoSQL databases scale horizontally better and don't create bottlenecks. Big players like Google, Facebook, or Twitter use them.
As other comments/answers suggested the data might be already sorted, and they are returning offsets of the data found instead of the whole batch.
The real question is not how they sort that many results that quickly, but how do they do it when tens or hundreds of millions of people around the world are querying google at the same time xD
As Xiao said, just rank the top-k instead of the entire list.
Google tells you there are 937,000,000 results, but it won't show them all to you. If you keep scrolling page after page, after a while it will truncate the results :)
Here you go, I looked it up for you and this is what I found! http://computer.howstuffworks.com/internet/basics/search-engine.htm
This is my theory... It's highly unlikely that you are the first person to search for a keyword. So for every keyword (or combination) searched on a search engine, it maintains a hash of links to relevant web pages. Every time you click a link in the search results, it gets a vote-up in the hash set for that keyword combination. Unfortunately, if you are the first person, it saves your search keyword (for suggesting future searches) and starts building the hash for that keyword. So you end up with few or no results at all.
The page ranking, as you might know, depends on many other factors too, like backlinks, the number of pages referring to a keyword in a search, etc.
Regarding your update:
MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet our request.
MapReduce is not just designed for batch tasks. There are quite a lot of MapReduce-style frameworks supporting real-time computing: Apache Spark, Storm, Infinispan Distributed Executor, Hazelcast Distributed Executor Service.
Back to your question: MapReduce is the key to distributing the query task to multiple nodes and then merging the results together.
There's no way you can expect to get an accurate answer to this question here ;) Anyway, here are a couple of things to consider: Google uses a unique infrastructure, in every part of it. We cannot even guess the order of complexity of their network equipment or their database storage. That is all I know about the hardware component of this problem.
Now, for the software implementation: as the name says, PageRank is a rank by itself. It doesn't rank the pages when you enter the search query. I assume it ranks them on a totally independent part of the infrastructure every hour. And we already know that Google crawler bots are roaming the Web 24/7, so I assume that new pages are added into an "unsorted" hash map and then ranked on the next run of the algorithm.
Next, when you type your query, thousands of CPUs independently scan thousands of different parts of the PageRank database with a gapping factor. For example, if the gapping factor is 10, one machine queries the part of the database that has PageRank values from 0-9.99, the next one queries the database from 10-19.99, etc. Since resources aren't an obstacle for Google, they can set the gapping factor very low (for example 1) so that each machine queries fewer than 100k pages, which isn't too much for their hardware. Then, when they need to compile the results of your query, since they know which machine ranks exactly which part of the database, they can use the 'fill the pool' principle. Let n be the number of links on each Google results page. The algorithm that combines the pages returned from the queries run on all those machines against all the different parts of the database only needs to fill the first n results. So they take the results from the machine querying the highest-ranked part of the database; if there are more than n, they're done, and if not, they move to the next machine. This takes only O(s*g/r), where s is the number of pages Google serves, g is the gapping factor, and r is the highest value of PageRank. This assumption is supported by the fact that when you go to the second page, your query is run again (notice the different time taken to generate it).
This is just my two cents, but I think I'm pretty accurate with this hypothesis.
EDIT: You might want to check this out for complexity of high-order queries.
I don't know what Google really does, but surely they use approximation. For example, if the search query is 'search engine', then the number of results will be = (number of documents with one or more occurrences of the word 'search') + (number of documents with one or more occurrences of the word 'engine'). This can be done in O(1) time complexity. For details, read about the basic structure of Google: http://infolab.stanford.edu/~backrub/google.html
I have a six-degrees-of-Kevin-Bacon type problem. Let's say I have 2 Twitter users and I want to figure out their relationship to each other through friends (I use "friends" to denote when you follow someone, versus them following you) and followers on Twitter. I have all the IDs in my database.
So for example:
Joel and Sally
Joel follows Fred who is friends with Steve who follows Sally.
There could be multiple ways to get there, but I want the shortest.
This seems like a well-known computer science problem (a shortest path algorithm).
Today I have a table called "influencers" where all my Twitter IDs are stored; then I have a followings table that is a self-referential table (IDs of followers on one side and friends on the other).
So is this graph theory? If so, can someone point me to any utilities/libraries/approaches that would be helpful? I'm using Ruby, but can parse most languages.
As you have said, it is a well known problem, as you can see in Wikipedia.
Just notice that in your case the weights on all edges are equal to 1, so I don't think that Dijkstra's algorithm would be very useful to you.
In order to find the minimum distance, I would suggest a breadth-first search. The problem is that the Twitter network may be extremely connected, and hence you may have a combinatorial explosion (imagine that each person is connected to 20 other people: at the first level you would visit 20 profiles, at the next 400, and at the next 8,000; if you don't find Sally fast, you will quickly run out of memory).
There is also a linear programming formulation, with which I am not 100% familiar. These notes are good on linear programming, but not on the shortest path problem, while these seem more focused on the applications.
There is a video lecture on this problem available online that seems quite complete.
I hope these references help.
This sounds like you need BFS http://en.wikipedia.org/wiki/Breadth-first_search
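A minimal BFS sketch for the Joel-to-Sally case; `neighbors` is an assumed callback that would be backed by the followings table (or an API), not anything given in the question:

```python
from collections import deque

def shortest_path(start, goal, neighbors):
    """Unweighted shortest path by breadth-first search; returns a list of ids or None."""
    if start == goal:
        return [start]
    parent = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors(current):        # ids this user follows / is followed by
            if nxt in parent:
                continue                      # already visited
            parent[nxt] = current
            if nxt == goal:
                path = [nxt]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return list(reversed(path))
            queue.append(nxt)
    return None                               # no connection between the two users
```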
Online approach:
I think it can be expensive, depending on how you want to use it.
In the worst case you would iterate over all the data in the database: runtime cost O(n) (assuming you have a lookup function that finds a user in the graph in O(1)).
Offline approach
You could do an offline, scheduled pre-calculation and store the distances behind a lookup function, but it requires some additional memory, O(n*n), where n is the number of users. The cost of the lookup is then only O(1) or O(log n), depending on how you implement it
(disregarding the offline runtime, which I would expect to be somewhere between O(n) and O(n*n)).
Strategy
The strategy you want to follow depends on the number of users you can expect as an upper limit and how well the users are connected to each other. If you have few users, the online approach might be fine; if you have millions of users, then you probably need an offline approach, but it will cost you some memory.
Other considerations
Mix online and offline approach
Use caching strategies
Whenever a new reference is updated for a user, update the distance lookup function
Updated answer: there are 17 million users, so we will need the offline approach.
I would follow the offline version. You should avoid O(n*n) runtime which I think is possible.
DB model
You should think about how you would model the DB, as this will be the most expensive part of this implementation.
Maybe something like:
Create a table for every user (the table name could be the userId), and every table has an entry for every other user (the record key is a userId).
This will result in 17 million tables with 17 million entries each (this is the O(n*n) cost).
Offline, you run BFS once while keeping track of which users you have visited and at which level of the BFS iteration you are, and save the distances to the DB. I haven't thought this part through, but I think this strategy is feasible. Remember to run BFS on every node, i.e. until you have visited all the users.
If this strategy is not feasible, then you could run BFS from every node, which is O(n*n) runtime. This means it could take something like a month to run in the worst case, i.e. your distance data could be old. How fast this runs depends on how connected your users are.
Or, if possible, you could take the approach "whenever a new reference is updated for a user, update the distance lookup function". This runs BFS once, which is O(n), i.e. a few seconds. Invoke BFS(userId) on the first-time event and afterwards on every reference update.
Online, you fetch the table by table name using one userId and fetch the entry by the other userId to get the distance.
I am building an application that is supposed to extract a mission for the user from a finite mission pool. The thing is that I want:
that the user won't get the same mission twice,
that the user won't get the same missions as his friends (in the application) until some time has passed.
To summarize my problem, I need to extract the least common mission out of the pool.
Can someone please refer me to known algorithms for finding the least common something (least frequently used, LFU)?
I also need the theoretical aspect, so if someone knows some articles or research papers about this (from known magazines like Scientific American) that would be great.
For getting the least frequently used mission, simply give every mission a counter that counts how many times it was used. Then search for the mission with the lowest counter value.
For getting the mission that was least frequently used by a group of friends, you can store for every user the missions he/she has done (and the number of times). This information is probably useful anyway. Then, when a new mission needs to be chosen for a user, a (temporary) combined list of used missions and their frequencies across the user and all his friends can easily be created and sorted by frequency. This is not very expensive.
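A tiny sketch of that idea; the data layout (a per-user Counter of mission usage) is assumed for illustration:

```python
from collections import Counter

def least_used_mission(missions, usage_by_user, user, friends):
    """Return the mission used least often by `user` and their `friends` combined."""
    combined = Counter()
    for u in [user, *friends]:
        combined += usage_by_user.get(u, Counter())
    return min(missions, key=lambda m: combined[m])   # never-used missions count as 0
```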
Based on your two requirements, I don't see what the "least" used mission has to do with this. You said you want non-repeating missions.
OPTION 1:
What container do you use to hold all missions? Assume it's a list: when you or a friend chooses a mission, move that mission to the end of the list (swap it with the mission there). Now you have split your initial list into two sublists. The first part holds unused missions, and the second part holds used missions. Keep track of the pivot/index which separates the two lists.
Now, every time you or your friends choose a new mission, it is chosen from the first sublist, then moved into the second sublist and the pivot updated.
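A small sketch of Option 1 with made-up names: one list is split by a pivot into an unused prefix and a used suffix, and choosing a mission swaps it across the pivot.

```python
import random

class MissionPool:
    def __init__(self, missions):
        self.missions = list(missions)
        self.pivot = len(self.missions)       # everything before pivot is unused

    def choose(self):
        if self.pivot == 0:
            return None                       # every mission has been used
        i = random.randrange(self.pivot)      # pick any unused mission
        self.pivot -= 1
        self.missions[i], self.missions[self.pivot] = (
            self.missions[self.pivot], self.missions[i])
        return self.missions[self.pivot]      # now sits in the used suffix
```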
OPTION 2:
If you repeat missions eventually, but choose first the ones which have been chosen the fewest times, then you can make your container a min-heap. Add a usage counter to each mission and order the heap by that counter. Extract a mission, increment its usage counter, then put it back into the heap. This is a good solution, but depending on how simple your program is, you could even use a circular buffer.
It would be nice to know more about what you're building :)
I think the structure you need is a min-heap. It allows extraction of the minimum in O(log n), and it allows you to increase the value of an item in O(log n) too.
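A minimal sketch of the min-heap idea from both answers: each entry is (usage count, mission); extracting the minimum and pushing it back with an incremented count stands in for the "increase value" operation, and both steps are O(log n). Names are illustrative:

```python
import heapq

class LeastUsedPicker:
    def __init__(self, missions):
        self.heap = [(0, m) for m in missions]   # (times used, mission)
        heapq.heapify(self.heap)

    def pick(self):
        count, mission = heapq.heappop(self.heap)     # least-used mission
        heapq.heappush(self.heap, (count + 1, mission))
        return mission
```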
A good start is the Edmonds Blossom V algorithm for a perfect minimum matching in a general graph. If you have a bipartite graph, you can look at the Floyd-Warshall algorithm to find the shortest paths. Maybe you can also use a topological search, but I don't know, because these algorithms are really hard to learn.
I have an algorithm that chooses a list of items that should fit the user's likings.
I'll skip the algorithm's details because of confidentiality issues...
Now, I'm trying to think of a way to check it statistically, with a group of people.
The way I'm checking it now is:
The algorithm picks the best results per user.
Shuffle the top 5 results with the lowest 5 results.
Have the person rank the results in order (0 = liked best, 9 = liked least).
Compare the user's ranking to the algorithm's ranking.
I'm doing this because I figured that to show the algorithm chooses good results, I need to put in some bad results and show that the algorithm knows they are bad results as well.
So, what I'm asking is:
Is shuffling top results with low results a good idea?
And if not, do you have an idea of how to get good statistics on how well an algorithm matches user preferences (we have users that can choose stuff)?
First ask yourself:
What am I trying to measure?
Not to rag on the other submissions here, but while mjv's and Sjoerd's answers offer some plausible heuristic reasons why what you are trying to do may not work as you expect, they are not constructive in the sense that they do not explain why your experiment is flawed and what you can do to improve it. Before either of these issues can be addressed, you need to define what you hope to measure, and only then should you go about devising an experiment.
Now, I can't say for certain what would constitute a good metric for your purposes, but I can offer you some suggestions. As a starting point, you could try using a precision vs. recall graph:
http://en.wikipedia.org/wiki/Precision_and_recall
This is a standard technique for assessing the performance of ranking and classification algorithms in machine learning and information retrieval (ie web searching). If you have an engineering background, it could be helpful to understand that precision/recall generalizes the notion of precision/accuracy:
http://en.wikipedia.org/wiki/Accuracy_and_precision
Now let us suppose that your algorithm does something like this: it takes as input some prior data about a user, then returns a ranked list of other items that user might like. For example, your algorithm is a web search engine and the items are pages, or you have a movie recommender and the items are movies. This sounds pretty close to what you are trying to do now, so let us continue with this analogy.
Then the precision of your algorithm's results at the first n is the number of items that the user actually liked out of your top n recommendations:
precision = #(items user actually liked out of top n) / n
And the recall is the number of items that you actually got right out of the total number of items the user actually likes:
recall = #(items correctly marked as liked) / #(items user actually likes)
Ideally, one would want to maximize both of these quantities, but they are in a certain sense competing objectives. To illustrate this, consider a few extremal situations: For example, you could have a recommender that returns everything, which would have perfect recall, but very low precision. A second possibility is to have a recommender that returns nothing or only one sure-fire hit, which would have (in a limiting sense) perfect precision, but almost no recall.
As a result, to understand the performance of a ranking algorithm, people typically look at its precision vs. recall graph. These are just plots of the precision vs. the recall as the number of items returned is varied:
Image taken from the following tutorial (which is worth reading):
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
Now, to approximate a precision vs. recall curve for your algorithm, here is what you can do. First, return a large set of, say, n results as ranked by your algorithm. Next, get the user to mark which items they actually liked out of those n results. This trivially gives us enough information to compute the precision at every cutoff below n (since we know the counts). We can also compute the recall (as restricted to this set of documents) by taking the total number of items liked by the user in the entire set. Thus, we can plot a precision-recall curve for this data. There are fancier statistical techniques for estimating this using less work, but I have already written enough. For more information, please check out the links in the body of my answer.
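A short sketch of that computation; `ranked_items` is the algorithm's ranking and `liked` is the set the user marked, both assumed names:

```python
def precision_recall_points(ranked_items, liked):
    """Precision and recall at every cutoff of the ranked list,
    with recall restricted to the items in this result set."""
    total_liked = sum(1 for item in ranked_items if item in liked)
    points, hits = [], 0
    for i, item in enumerate(ranked_items, start=1):
        if item in liked:
            hits += 1
        recall = hits / total_liked if total_liked else 0.0
        points.append((hits / i, recall))      # (precision@i, recall@i)
    return points
```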
Your method is biased. If you use the top 5 and bottom 5 results, it is very likely that the user will order them according to your algorithm. Let's say we have an algorithm which rates music, and I present the top 1 and bottom 1 to the user:
Queen
The Cheeky Girls
Of course the user will rank them exactly like your algorithm, because the difference between the top and the bottom is so big. You need to make the user rate randomly selected items.
Independently of the question of mixing top and bottom guesses, an implicit drawback of the experimental process, as described, is that the data related to the user's choice can only be exploited in the context of one particular version of the algorithm:
When / if the algorithm or its parameters are ever slightly tuned, the record of past user's choices cannot be reused to validate the changes to the algorithm.
On mixing high and low results:
The main drawback of producing sets of items by mixing the algorithm's top and bottom guesses is that it may further complicate the choice of the error/distance function used to measure how well the algorithm performed. Unless the two subsets of items (topmost choices, bottommost choices) are kept separate for the purpose of computing distinct measurements, typical statistical measures of the error (say RMSE) will not be a good measurement of the effective algorithm's quality.
For example, an algorithm which frequently suggests low guesses that end up being picked as top choices by the user may have the same average error rate as an algorithm which never confuses highs with lows, but where the user tends to reorder the items more within each subset.
A second drawback is that this evaluation method may merely qualify the algorithm's ability to order the relative like/dislike of users for the items it [the algorithm] chooses, rather than its ability to produce the user's actual top choices.
In other words, the user's actual top choices may never be offered to him; so yes, the algorithm does a good job at guessing that the user will like, say, Rock-and-Roll before Rap, but it never guesses that in fact the user prefers Baroque classical music above all.
Recently I found a nice blog post presenting two approaches for tracking the online users of a web site with the help of Redis.
1) Smart-keys and setting their expiration
http://techno-weenie.net/2010/2/3/where-s-waldo-track-user-locations-with-node-js-and-redis
2) Sets and intersections
http://www.lukemelia.com/blog/archives/2010/01/17/redis-in-practice-whos-online/
Can you judge which one should be faster and why?
For knowing whether or not a particular user is online, the first method will be a lot faster - nothing is faster than reading a single key.
Finding users on a particular page is not as clear (I haven't seen hard numbers on the performance of either intersection or wildcard keys), but if the set is big enough to cause performance problems in either implementation it isn't practical to display them all anyway.
For matching users to a friends list I would probably go with the first approach also - even a few hundred get operations (checking the status of everyone in the list) should outperform intersection on multiple sets if those sets have a large number of records and are difficult to maintain.
Redis sets are more appropriate for things that can't be done with keys, particularly where getting all items in the set is more important than checking if a particular item is in the set.
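For reference, a minimal sketch of the first approach (one expiring key per user), with assumed key names and TTL; friend checks are pipelined so a few hundred EXISTS calls cost a single round trip:

```python
import redis

r = redis.Redis(decode_responses=True)
ONLINE_TTL = 60   # seconds; a periodic heartbeat from the client refreshes the key

def mark_online(user_id):
    r.setex(f"online:{user_id}", ONLINE_TTL, 1)

def is_online(user_id):
    return r.exists(f"online:{user_id}") == 1

def online_friends(friend_ids):
    with r.pipeline() as pipe:
        for fid in friend_ids:
            pipe.exists(f"online:{fid}")
        flags = pipe.execute()
    return [fid for fid, flag in zip(friend_ids, flags) if flag]
```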