The algorithm used to generate recommendations in Google News? - algorithm

I'm study recommendation engines, and I went through the paper that defines how Google News generates recommendations to users for news items which might be of their interest, based on collaborative filtering.
One interesting technique that they mention is Minhashing. I went through what it does, but I'm pretty sure that what I have is a fuzzy idea and there is a strong chance that I'm wrong. The following is what I could make out of it :-
Collect a set of all news items.
Define a hash function for a user. This hash function returns the index of the first item from the news items which this user viewed, in the list of all news items.
Collect, say "n" number of such values, and represent a user with this list of values.
Based on the similarity count between these lists, we can calculate the similarity between users as the number of common items. This reduces the number of comparisons a lot.
Based on these similarity measures, group users into different clusters.
This is just what I think it might be. In Step 2, instead of defining a constant hash function, it might be possible that we vary the hash function in a way that it returns the index of a different element. So one hash function could return the index of the first element from the user's list, another hash function could return the index of the second element from the user's list, and so on. So the nature of the hash function satisfying the minwise independent permutations condition, this does sound like a possible approach.
Could anyone please confirm if what I think is correct? Or the minhashing portion of Google News Recommendations, functions in some other way? I'm new to internal implementations of recommendations. Any help is appreciated a lot.
Thanks!

I think you're close.
First of all, the hash function first randomly permutes all the news items, and then for any given person looks at the first item. Since everyone had the same permutation, two people have a decent chance of having the same first item.
Then, to get a new hash function, rather than choosing the second element (which would have some confusing dependencies on the first element), they choose a whole new permutation and take the first element again.
People who happen to have the same hash value 2-4 times (that is, the same first element in 2-4 permutations) are put together in a cluster. This algorithm is repeated 10-20 times, so that each person gets put into 10-20 clusters. Finally, recommendations are given based (the small number of) other people in the 10-20 clusters. Since all this work is done by hashing, people are put directly into buckets for their clusters, and large numbers of comparisons aren't needed.

Related

Is there a specific scenario of a hash table that isn't full yet an insertion can't occur?

What I mean to ask is for a hash-table following the standard size of a prime number, is it possible to have some scenario (of inserted keys) where no further insertion of a given element is possible even though there's some empty slots? What kind of hash-function would achieve that?
So, most hash functions allow for collisions ("Hash Collisions" is the phrase you should google to understand this better, by the way.) Collisions are handled by having a secondary data structure, like a list, to store all of the values inserted at keys with the same hash.
Because these data structures can generally store arbitrarily many elements, you will always be able to insert into the hash table, but the performance will get worse and worse, approaching the performance of the backing data structure.
If you do not have a backing data structure, then you can be unable to insert as soon as two things get added to the same position. Since a good hash function distributes things evenly and effectively randomly, this would happen pretty quickly (see "The Birthday Problem").
There are failure-to-insert scenarios for some but not all hash table implementations.
For example, closed hashing aka open addressing implementations use some logic to create a sequence of buckets in which they'll "probe" for values not found at the hashed-to bucket due to collisions. In the real world, sometimes the sequence-creation is pretty basic, for example:
the programmer might have hard-coded N prime numbers, thinking the odds of adding in each of those in turn and still not finding an empty bucket are low (but a malicious user who knows the hash table design may be able to calculate values to make the table fail, or it may simply be so full that the odds are no longer good, or - while emptier - a statistical freak event)
the programmer might have done something like picked a prime number they liked - say 13903 - to add to the last-probed bucket each time until a free one is found, but if the table size happens to be 13903 too it'll keep checking the same bucket.
Still, there are probing approaches such as linear probing that guarantee to try all buckets (unless the implementation goes out of its way to put a limit on retries). It has some other "issues" though, and won't always be the best choice.
If a hash table is implemented using open addressing instead of separate chaining, then it is a good idea to leave at least 1 slot empty to simplify the algorithm.
In open addressing when we are trying to find an element, we first compute the hash index i, then check the table at indexes {i, i + 1, i + 2, ... N - 1, (wrapping around) 0, 1, 2, ...}, until we either find the element we want or hit an empty slot. You can see that in this algorithm, if no slot is empty but the element can't be found, then the search would loop forever.
However, I should emphasize that enforcing merely simplifies the search algorithm. Because alternatively, the search algorithm can remember the starting index i, and halt the search if the entire table has been scanned and it lands back at index i.

Sorting a list based on multiple indices and weights

Sort of a very long winded explanation of what I'm looking at so I apologize in advance.
Let's consider a Recipe:
Take the bacon and weave it ...blahblahblah...
This recipe has 3 Tags
author (most important) - Chandler Bing
category (medium importance) - Meat recipe (out of meat/vegan/raw/etc categories)
subcategory (lowest importance) - Fast food (our of fast food / haute cuisine etc)
I am a new user that sees a list of randomly sorted recipes (my palate/profile isn't formed yet). I start interacting with different recipes (reading them, saving them, sharing them) and each interaction adds to my profile (each time I read a recipe a point gets added to the respective category/author/subcategory). After a while my profile starts to look something like this :
Chandler Bing - 100 points
Gordon Ramsey - 49 points
Haute cuisine - 12 points
Fast food - 35 points
... and so on
Now, the point of all this exercise is to actually sort the recipe list based on the individual user's preferences. For example in this case I will always see Chandler Bing's recipes on the top (regardless of category), then Ramsey's recipes. At the same time, Bing's recipes will be sorted based on my preferred categories and subcategories, seeing his fast food recipes higher than his haute cuisine ones.
What am I looking at here in terms of a sorting algorithm?
I hope that my question has enough information but if there's anything unclear please let me know and I'll try to add to it.
I would allow the "Tags" with the most importance to have the greatest capacity in point difference. Example: Give author a starting value of 50 points, with a range of 0-100 points. Give Category a starting value of 25 points, with a possible range of 0-50 points, give subcategory a starting value of 12.5 points, with a possible range of 0-25 points. That way, if the user's palate changes over time, s/he will only have to work down from the maximum, or work up from the minimum.
From there, you can simply add up the points for each "Tag", and use one of many languages' sort() methods to compare each recipe.
You can write a comparison function that is used in your sort(). The point is when you're comparing two recipes just add up the points respectively based on their tags and do a simple comparison. That and whatever sorting algorithm you choose should do just fine.
You can use a recursively subdividing MSD (sort of radix sort algorithm). Works as follows:
Take the most significant category of each recipe.
Sort the list of elements based on that category, grouping elements with the same category into one bucket (Ramsay bucket, Bing bucket etc).
Recursively sort each bucket, starting with the next category of importance (Meat bucket etc).
Concatenate the buckets together in order.
Complexity: O(kn) where k is the number of category types and N is the number of recipes.
I think what you're looking for is not a sorting algorithm, but a rating scheme.
You say, you want to sort by preferences. Let's assume, these preferences have different “dimensions”, like level of complexity, type of cuisine, etc.
These dimensions have different levels of measurement. These can be e.g. numeric or simple categories/tags. It would be your job to:
Create a scheme of dimensions and scales that can represent a user's preferences.
Operationalize real-world data to fit into this scheme.
Create a profile for the users which reflects their preferences. Same for the chefs; treat them just like normal users here.
To actually match a user to a chef (or, even to another user), create a sorting callback that matches all your dimensions against each other and makes sure that in each of the dimension the compared users have a similar value (on a numeric scale), or an overlapping set of properties (on a nominal scale, like tags). Then you sort the result by the best match.

n! combinations, how to find best one without killing computer?

I'll get straight to it. I'm working on an web or phone app that is responsible for scheduling. I want students to input courses they took, and I give them possible combinations of courses they should take that fits their requirements.
However, let's say there's 150 courses that fits their requirements and they're looking for 3 courses. That would be 150C3 combinations, right?.
Would it be feasible to run something like this in browser or a mobile device?
First of all you need a smarter algorithm which can prune the search tree. Also, if you are doing this for the same set of courses over and over again, doing the computation on the server would be better, and perhaps precomputing a feasible data structure can reduce the execution time of the queries. For example, you can create a tree where each sub-tree under a node contains nodes that are 'compatible'.
Sounds to me like you're viewing this completely wrong. At most institutions there are 1) curriculum requirements for graduation, and 2) prerequisites for many requirements and electives. This isn't a pure combinatorial problem, it's a dependency tree. For instance, if Course 201, Course 301, and Course 401 are all required for the student's major, higher numbers have the lower numbered ones as prereqs, and the student is a Junior, you should be strongly recommending that Course 201 be taken ASAP.
Yay, mathematics I think I can handle!
If there are 150 courses, and you have to choose 3, then the amount of possibilities are (150*149*148)/(3*2) (correction per jerry), which is certainly better than 150 factorial which is a whole lot more zeros ;)
Now, you really don't want to build an array that size, and you don't have to! All web languages have the idea of randomly choosing an element in an array, so you get an element in an array and request 3 random unique entries from it.
While the potential course combinations is very large, based on your post I see no reason to even attempt to calculate them. This task of random selection of k items from n-sized list is delightfully trivial even for old, slow devices!
Is there any particular reason you'd need to calculate all the potential course combinations, instead of just grab-bagging one random selection as a suggestion? If not, problem solved!
Option 1 (Time\Space costly): let the user on mobile phone browse the list of (150*149*148) possible choices, page by page, the processing is done at the server-side.
Option 2 (Simple): instead of the (150*149*148)-item decision tree, provide a 150-item bag, if he choose one item from the bag, remove it from the bag.
Option 3 (Complex): expand your decision tree (possible choices) using a dependency tree (parent course requires child courses) and the list of course already taken by the student, and his track\level.
As far as I know, most educational systems use the third option, which requires having a profile for the student.

Algorithm for grouping RESTful routes

Given a list of URLs known to be somewhat "RESTful", what would be a decent algorithm for grouping them so that URLs mapping to the same "controller/action/view" are likely to be grouped together?
For example, given the following list:
http://www.example.com/foo
http://www.example.com/foo/1
http://www.example.com/foo/2
http://www.example.com/foo/3
http://www.example.com/foo/1/edit
http://www.example.com/foo/2/edit
http://www.example.com/foo/3/edit
It would group them as follows:
http://www.example.com/foo
http://www.example.com/foo/1
http://www.example.com/foo/2
http://www.example.com/foo/3
http://www.example.com/foo/1/edit
http://www.example.com/foo/2/edit
http://www.example.com/foo/3/edit
Nothing is known about the order or structure of the URLs ahead of time. In my example, it would be somewhat easy since the IDs are obviously numeric. Ideally, I'd like an algorithm that does a good job even if IDs are non-numeric (as in http://www.example.com/products/rocket and http://www.example.com/products/ufo).
It's really just an effort to say, "Given these URLs, I've grouped them by removing what I think it he 'variable' ID part of the URL."
Aliza has the right idea, you want to look for the 'articulation points' (in REST, basically where a parameter is being passed). Looking only for a single point of change gets tricky
Example
http://www.example.com/foo/1/new
http://www.example.com/foo/1/edit
http://www.example.com/foo/2/edit
http://www.example.com/bar/1/new
These can be grouped several equally good ways since we have no idea of the URL semantics. This really boils down to the question of this - is this piece of the URL part of the REST descriptor or a parameter. If we know what all the descriptors are, the rest are parameters and we are done.
Give a sufficiently large dataset, we'd want to look at the statistics of all URLs at each depth. e.g., /x/y/z/t/. We would count the number of occurrences in each slot and generate a large joint probability distribution table.
We can now look at the distribution of symbols. A high count in a slot means it's likely a parameter. We would start from the bottom, look for conditional probability events, ie., What is the probability of x being foo, then what is the probability y being something given x, etc. etc. I'd have to think more to determine a systematic way to extracting these, but it seems like a promisign start
split each url to an array of strings with the delimiter being '/'
e.g. http://www.example.com/foo/1/edit will give the array [http:,www.example.com,foo,1,edit]
if two arrays (urls) share the same value in all indecies except for one, they will be in the same group.
e.g. http://www.example.com/foo/1/edit = [http:,www.example.com,foo,1,edit] and
http://www.example.com/foo/2/edit = [http:,www.example.com,foo,2,edit]. The arrays match in all indices except for #3 which is 1 in the first array and 2 in the second array. Therefore, the urls belong to the same group.
It is easy to see that urls like http://www.example.com/foo/3 and http://www.example.com/foo/1/edit will not belong to the same group according to this algorithm.

I was asked this in a recent interview

I was asked to stay away from HashMap or any sort of Hashing.
The question went something like this -
Lets say you have PRODUCT IDs of up to 20 decimals, along with Product Descriptions. Without using Maps or any sort of hashing function, what's the best/most efficient way to store/retrieve these product IDs along with their descriptions?
Why is using Maps a bad idea for such a scenario?
What changes would you make to sell your solution to Amazon?
A map is good to use when insert/remove/lookup operations are interleaved. Every operations are amortized in O(log n).
In your exemple you are only doing search operation. You may consider that any database update (inserting/removing a product) won't happen so much time. Therefore probably the interviewer want you to get the best data structure for lookup operations.
In this case I can see only some as already proposed in other answers:
Sorted array (doing a binary search)
Hasmap
trie
With a trie , if product ids do not share a common prefix, there is good chance to find the product description only looking at the first character of the prefix (or only the very first characters). For instance, let's take that product id list , with 125 products:
"1"
"2"
"3"
...
"123"
"124"
"1234567"
Let's assume you are looking for the product id titled "1234567" in your trie, only looking to the first letters: "1" then "2" then "3" then "4" will lead to the good product description. No need to read the remaining of the product id as there is no other possibilities.
Considering the product id length as n , your lookup will be in O(n). But as in the exemple explained it above it could be even faster to retreive the product description. As the procduct ID is limited in size (20 characters) the trie height will be limited to 20 levels. That actually means you can consider the look up operations will never goes beyond a constant time, as your search will never goes beyong the trie height => O(1). While any BST lookups are at best amortized O(log N), N being the number of items in your tree .
While an hashmap could lead you to slower lookup as you'll need to compute an index with an hash function that is probably implemented reading the whole product id length. Plus browsing a list in case of collision with other product ids.
Doing a binary search on a sorted array, and performance in lookup operations will depends on the number of items in your database.
A B-Tree in my opinion. Does that still count as a Map?
Mostly because you can have many items loaded at once in memory. Searching these items in memory is very fast.
Consecutive integer numbers give perfect choice for the hash map but it only has one problem, as it does not have multithreaded access by default. Also since Amazon was mentioned in your question I may think that you need to take into account concurency and RAM limitation issues.
What you might do in the response to such question is to explain that since
you are dissallowed to use any built-in data storage schemes, all you can do is to "emulate" one.
So, let's say you have M = 10^20 products with their numbers and descriptions.
You can partition this set to the groups of N subsets.
Then you can organize M/N containers which have sugnificantly reduced number of elements. Using this idea recursively will give you a way to store the whole set in containers with such property that access to them would have accepted performance rate.
To illustrate this idea, consider a smaller example of only 20 elements.
I would like you to imagive the file system with directories "1", "2", "3", "4".
In each directory you store the product descriptions as files in the following way:
folder 1: files 1 to 5
folder 2: files 6 to 10
...
folder 4: files 16 to 20
Then your search would only need two steps to find the file.
First, you search for a correct folder by dividing 20 / 5 (your M/N).
Then, you use the given ID to read the product description stored in a file.
This is just a very rough description, however, the idea is very intuitive.
So, perhaps this is what your interviewer wanted to hear.
As for myself, when I face such questions on interview, even if I fail to get the question correctly (which is the worst case :)) I always try to get the correct answer from the interviewer.
Best/efficient for what? Would have been my answer.
E.g. for storing them, probably the fast thing to do are two arrays with 20 elements each. One for the ids, on for the description. Iterating over those is pretty fast to. And it is efficient memory wise.
Of course the solution is pretty useless for any real application, but so is the question.
There is an interesting alternative to B-Tree: Radix Tree
I think what he wanted you to do, and I'm not saying it's a good idea, is to use the computer memory space.
If you use a 64-bit (virtual) memory address, and assuming you have all the address space for your data (which is never the case) you can store a one-byte value.
You could use the ProductID as an address, casting it to a pointer, and then get that byte, which might be an offset in another memory for actual data.
I wouldn't do it this way, but perhaps that is the answer they were looking for.
Asaf
I wonder if they wanted you to note that in an ecommerce application (such as Amazon's), a common use case is "reverse lookup": retrieve the product ID using the description. For this, an inverted index is used, where each keyword in a description is an index key, which is associated with a list of relevant product identifiers. Binary trees or skip lists are good ways to index these key words.
Regarding the product identifier index: In practice, B-Trees (which are not binary search trees) would be used for a large, disk-based index of 20-digit identifiers. However, they may have been looking for a toy solution that could be implemented in RAM. Since the "alphabet" of decimal numbers is so small, it lends itself very nicely to a trie.
The hashmaps work really well if the hashing function gives you a very uniform distribution of the hashvalues of the existing keys. With really bad hash function it can happen so that hash values of your 20 values will be the same, which will push the retrieval time to O(n). The binary search on the other hand guaranties you O(log n), but inserting data is more expensive.
All of this is very incremental, the bigger your dataset is the less are the chances of a bad key distribution (if you are using a good, proven hash algorithm), and on smaller data sets the difference between O(n) and O(log n) is not much to worry about.
If the size is limited sometimes it's faster to use a sorted list.
When you use Hash-anything, you first have to calculate a hash, then locate the hash bucket, then use equals on all elements in the bucket. So it all adds up.
On the other hand you could use just a simple ArrayList ( or any other List flavor that is suitable for the application), sort it with java.util.Collections.sort and use java.util.Collections.binarySearch to find an element.
But as Artyom has pointed out maybe a simple linear search would be much faster in this case.
On the other hand, from maintainability point of view, I would normally use HashMap ( or LinkedHashMap ) here, and would only do something special here when profiler would tell me to do it. Also collections of 20 have a tendency to become collections of 20000 over time and all this optimization would be wasted.
There's nothing wrong with hashing or B-trees for this kind of situation - your interviewer probably just wanted you to think a little, instead of coming out with the expected answer. It's a good sign, when interviewers want candidates to think. It shows that the organization values thought, as opposed to merely parroting out something from the lecture notes from CS0210.
Incidentally, I'm assuming that "20 decimal product ids" means "a large collection of product ids, whose format is 20 decimal characters".... because if there's only 20 of them, there's no value in considering the algorithm. If you can't use hashing or Btrees code a linear search and move on. If you like, sort your array, and use a binary search.
But if my assumption is right, then what the interviewer is asking seems to revolve around the time/space tradeoff of hashmaps. It's possible to improve on the time/space curve of hashmaps - hashmaps do have collisions. So you might be able to get some improvement by converting the 20 decimal digits to a number, and using that as an index to a sparsely populated array... a really big array. :)
Selling it to Amazon? Good luck with that. Whatever you come up with would have to be patentable, and nothing in this discussion seems to rise to that level.
20 decimal PRODUCT IDs, along with Product Description
Simple linear search would be very good...
I would create one simple array with ids. And other array with data.
Linear search for small amount of keys (20!) is much more efficient then any binary-tree or hash.
I have a feeling based on their answer about product ids and two digits the answer they were looking for is to convert the numeric product ids into a different base system or packed form.
They made a point to indicate the product description was with the product ids to tell you that a higher base system could be used within the current fields datatype.
Your interviewer might be looking for a trie. If you have a [small] constant upper bound on your key, then you have O(1) insert and lookup.
I think what he wanted you to do, and
I'm not saying it's a good idea, is to
use the computer memory space.
If you use a 64-bit (virtual) memory
address, and assuming you have all the
address space for your data (which is
never the case) you can store a
one-byte value.
Unfortunately 2^64 =approx= 1.8 * 10^19. Just slightly below 10^20. Coincidence?
log2(10^20) = 66.43.
Here's a slightly evil proposal.
OK, 2^64 bits can fit inside a memory space.
Assume a bound of N bytes for the description, say N=200. (who wants to download Anna Karenina when they're looking for toasters?)
Commandeer 8*N 64-bit machines with heavy RAM. Amazon can swing this.
Every machine loads in their (very sparse) bitmap one bit of the description text for all descriptions. Let the MMU/virtual memory handle the sparsity.
Broadcast the product tag as a 59-bit number and the bit mask for one byte. (59 = ceil(log2(10^20)) - 8)
Every machine returns one bit from the product description. Lookups are a virtual memory dereference. You can even insert and delete.
Of course paging will start to be a bitch at some point!
Oddly enough, it will work the best if product-id's are as clumpy and ungood a hash as possible.

Resources