Please be patient with my writing, as my English is not proficient.
As a programmer, I wanna learn about the algorithm, or the machine learning intelligence, that are implemented underneath recommendation systems or related-based systems. For instance, the most obvious example would be from Amazon. They have a really good recommendation system. They get to know: if you like this, you might also like that, or something else like: What percentage of people like this and that together.
Of course I know Amazon is a big website and they invested a lot of brain and money into these systems. But, on the very basic core, how can we implement something like that within our database? How can we identify how one object relates to other? How can we build a statistic unit that handles this kind of thing?
I'd appreciate if someone can point out some algorithms. Or, basically, point out some good direct references/ books that we can all learn from. Thank you all!
The are 2 different types of recommendation engines.
The simplest is item-based ie "customers that bought product A also bought product B". This is easy to implement. Store a sparse symmetrical matrix nxn (where n is the number of items). Each element (m[a][b]) is the number of times anyone has bought item 'a' along with item 'b'.
The other is user-based. That is "people like you often like things like this". A possible solution to this problem is k-means clustering. ie construct a set of clusters where users of similar taste are placed in the same cluster and make suggestions based on users in the same cluster.
A better solution, but an even more complicated one is a technique called Restricted Boltzmann Machines. There's an introduction to them here
A first attempt could look like this:
//First Calculate how often any product pair was bought together
//The time/memory should be about Sum over all Customers of Customer.BoughtProducts^2
Dictionary<Pair<ProductID,ProductID>> boughtTogether=new Dictionary<Pair<ProductID,ProductID>>();
foreach(Customer in Customers)
{
foreach(product1 in Customer.BoughtProducts)
foreach(product2 in Customer.BoughtProducts)
{
int counter=boughtTogether[Pair(product1,product2)] or 0 if missing;
counter++;
boughtTogether[Pair(product1,product2)]=counter;
}
}
boughtTogether.GroupBy(entry.Key.First).Select(group.OrderByDescending(entry=>entry.Value).Take(10).Select(new{key.Second as ProductID,Value as Count}));
First I calculate how often each pair of products was bought together, and then I group them by the product and select the top 20 other products bought with it. The result should be put into some kind of dictionary keyed by product ID.
This might get too slow or cost too much memory for large databases.
I think, you talk about knowledge base systems. I don't remember the programming language (maybe LISP), but there is implementations. Also, look at OWL.
There's also prediction.io if you're looking for an open source solution or SaaS solutions like mag3llan.com.
Related
I'm looking to write a basic recommender system in Objective-C and I'm looking for a basic algorithm for the job. Unfortunately off-the-shelf systems are off the table since none seem to be for Objective-C.
I'm going to have a database of items, each with tags (think movies with tags like "horror", "action", etc). Each item would have ~5 or so of these tags. When a user first uses the app their profile will be primed based on their input to a series of questions, associating some tags with their profile.
As the user continues to use the system and rate various items (on a hate/like/love basis) I'd like to adjust the weighting of the recommended tags based on that feedback. I'd also like to take in a few other properties of their ratings as their profile grows, like for example "the 80s" if this dealt with movies. Or maybe director, sticking with the movie theme.
I'm opting to avoid the normal (or at least popular) recommender systems where it looks for similar users to generate recommendations. This is going to have a large item database and minimal users to start.
Can anyone recommend a good starting point for an algorithm like this, I'd hate to reinvent the wheel, and there's a lot out there?
could you please refer the python-recsys: https://github.com/ocelma/python-recsys, this software use SVD algorithm, I think it is a base algorithm but effective enough. The required library is numpy and scipy, which are written in C, and wrapped by Python. I think it is easy to compile and port to objective-c
You can use item based recommendation which sounds perfect for your needs. You can later start to incorporate tags into the weighting but for now I'd suggest considering only the items.
You can learn a little more at http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/itembased.html There are a good few implementations of it on the net.
What you mentioned in your post would be called user based collaborative filtering.
I am building a Rails app that recommends tutors to students and vise versa. I need to match them based on multiple dimensions, such as their majors (Math, Biology etc.), experience (junior etc.), class (Math 201 etc.), preference (self-described keywords) and ratings.
I checked out some Rails collaborative recommendation engines (recommendable, recommendify) and Mahout. It seems that collaborative recommendation is not the best choice in my case, since I have much more structured data, which allows a more structured query. For example, I can have a recommendation logic for a student like:
if student looks for a Math tutor in Math 201:
if there's a tutor in Math major offering tutoring in Math 201 then return
else if there's a tutor in Math major then sort by experience then return
else if there's a tutor in quantitative major then sort by experience then return
...
My questions are:
What are the benefits of a collaborative recommendation algorithm given that my recommendation system will be preference-based?
If it does provide significant benefits, how I can combine it with a preference-based recommendation as mentioned above?
Since my approach will involve querying multiple tables, it might not be efficient. What should I do about this?
Thanks a lot.
It sounds like your measurement of compatibility could be profitably reformulated as a metric. What you should do is try to interpret your `columns' as being different components of the dimension of your data. The idea is that you ultimately should produce a binary function which returns a measurement of compatibility between students and tutors (and also students/students and tutors/tutors). The motivation for extending this metric to all types of data is that you can then use this idea to reformulate your matching criteria as a nearest-neighbor search:
http://en.wikipedia.org/wiki/Nearest_neighbor_search
There are plenty of data structures and solutions to this problem as it has been very well studied. For example, you could try out the following library which is often used with point cloud data:
http://www.cs.umd.edu/~mount/ANN/
To optimize things a bit, you could also try prefiltering your data by running principal component analysis on your data set. This would let you reduce the dimension of the space in which you do nearest neighbor searches, and usually has the added benefit of reducing some amount of noise.
http://en.wikipedia.org/wiki/Principal_component_analysis
Good luck!
Personally, I think collaborative filtering (cf) would work well for you. Note that a central idea of cf is serendipity. In other words, adding too many constraints might potentially result in a lukewarm recommendations to users. The whole point of cf is to provide exciting and relevant recommendations based on similar users. You need not impose such tight constraints.
If you might decide on implementing a custom cf algorithm, I would recommend reading this article published by Amazon [pdf], which discusses Amazon's recommendation system. Briefly, the algorithm they use is as follows:
for each item I1
for each customer C who bought I1
for each I2 bought by a customer
record purchase C{I1, I2}
for each item I2
calculate sim(I1, I2)
//this could use your own similarity measure, e.g., cosine based
//similarity, sim(A, B) = cos(A, B) = (A . B) / (|A| |B|) where A
//and B are vectors(items, or courses in your case) and the dimensions
//are customers
return table
Note that the creation of this table would be done offline. The online algorithm would be quick to return recommendations. Apparently, the recommendation quality is excellent.
In any case, if you want to get a better idea about cf in general (e.g., various cf strategies) and why it might be suited to you, go through that article (don't worry, it is very readable). Implementing a simple cf recommender is not difficult. Optimizations can be made later.
Recently, I found several web site have something like : "Recommended for You", for example youtube, or facebook, the web site can study my using behavior, and recommend some content for me... ...I would like to know how they analysis this information? Is there any Algorithm to do so? Thank you.
Amazon and Netflix (among others) use a technique called Collaborative filtering to suggest things you might like based on the likes/dislikes of others who have made purchases and selections similar to yours.
Is there any Algorithm to do so?
Yes
Yes. One fairly common one is to look at things you've selected in the past, find other people who've made those selections, then find the other selections most common among those other people, and guess that you're likely to be interested in those as well.
Yup there are lots of algorithms. Things such as k-nearest neighbor: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm.
Here is a pretty good book on the subject that covers making these sorts of systems along with others: http://www.amazon.com/gp/product/0596529325?ie=UTF8&tag=ianburriscom-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0596529325.
It's generally done by matching you with other users who have similar usage history / profile and then recommending other things that they've purhased/watched/whatever.
Searching for "recommendation algorithm" yields lots of papers. Most algorithms incorporate "machine learning" algorithms to determine groups of things (comedy movies, books on gardening, orchestral music, etc.). Your matching with those groups yields recommendations. Some companies use humans to classify things, too.
Such an algorithm is going to vary wildly from company to company. In many cases, it analyzes some combination of your search history, purchase history, physical location, and other factors. It probably will also compare purchases/searches amongst other people to find what those people have purchased/searched for, and recommend some of those products to you.
There are probably hundreds of these algorithms out there, but I doubt you can use any of them (that are actually good). Probably you are better off figuring it out yourself.
If you can categorize your contents (i.e. by tagging or content analysis), you can also categorize your users and their preferences.
For example: you have a video portal with 5 million videos .. 1 mio of them are tagged mostly red. If 80% of all videos watched by a user (who is defined by an IP, a persistent user account, ...) are tagged mostly red, you might want to recommend even more red videos to him. You might want to refine your recommendations by looking at his further actions: does he like your recommendations -- if so, why not give him even more, if not, try the second-best guess, maybe he's not looking for color, but for the background music ...
There's no absolute algorithm to do it, but all implementations will go into a similar direction. It's always basing on observing users, which scares me from time to time :-)
There's whole lot of algorithms tackling the issue: Wiki article. It's a Machine Learning domain problem. Computer's can be learned using two main techniques: classification and clustering. They require some datasets as input. If the dataset is informative (really holds some useful patterns) than those ML techniques can dig most of it.
Clustering could be best to use for this kind of problem. It's main usage is to find similarities among points in provided dataset. If the points are, e.g. your search history, they can be grouped together to form certain clusters. If Your search history closely relates to another, a hint can be given - picking links that are most similar to Your's.
The same comes with book recommendations - it's obvious what dataset they use: "Other people who bought this product also bought Product A, Product B,...". The key here is to match your profile to other's and use the most similar to recommend.
The computer retrieves information from the human brain with complex memory scan process, sorts it accordingly and outputs results based on what you have experienced in your life so far.
What technology goes in behind the screens of Amazon recommendation technology? I believe that Amazon recommendation is currently the best in the market, but how do they provide us with such relevant recommendations?
Recently, we have been involved with similar recommendation kind of project, but would surely like to know about the in and outs of the Amazon recommendation technology from a technical standpoint.
Any inputs would be highly appreciated.
Update:
This patent explains how personalized recommendations are done but it is not very technical, and so it would be really nice if some insights could be provided.
From the comments of Dave, Affinity Analysis forms the basis for such kind of Recommendation Engines. Also here are some good reads on the Topic
Demystifying Market Basket Analysis
Market Basket Analysis
Affinity Analysis
Suggested Reading:
Data Mining: Concepts and Technique
It is both an art and a science. Typical fields of study revolve around market basket analysis (also called affinity analysis) which is a subset of the field of data mining. Typical components in such a system include identification of primary driver items and the identification of affinity items (accessory upsell, cross sell).
Keep in mind the data sources they have to mine...
Purchased shopping carts = real money from real people spent on real items = powerful data and a lot of it.
Items added to carts but abandoned.
Pricing experiments online (A/B testing, etc.) where they offer the same products at different prices and see the results
Packaging experiments (A/B testing, etc.) where they offer different products in different "bundles" or discount various pairings of items
Wishlists - what's on them specifically for you - and in aggregate it can be treated similarly to another stream of basket analysis data
Referral sites (identification of where you came in from can hint other items of interest)
Dwell times (how long before you click back and pick a different item)
Ratings by you or those in your social network/buying circles - if you rate things you like you get more of what you like and if you confirm with the "i already own it" button they create a very complete profile of you
Demographic information (your shipping address, etc.) - they know what is popular in your general area for your kids, yourself, your spouse, etc.
user segmentation = did you buy 3 books in separate months for a toddler? likely have a kid or more.. etc.
Direct marketing click through data - did you get an email from them and click through? They know which email it was and what you clicked through on and whether you bought it as a result.
Click paths in session - what did you view regardless of whether it went in your cart
Number of times viewed an item before final purchase
If you're dealing with a brick and mortar store they might have your physical purchase history to go off of as well (i.e. toys r us or something that is online and also a physical store)
etc. etc. etc.
Luckily people behave similarly in aggregate so the more they know about the buying population at large the better they know what will and won't sell and with every transaction and every rating/wishlist add/browse they know how to more personally tailor recommendations. Keep in mind this is likely only a small sample of the full set of influences of what ends up in recommendations, etc.
Now I have no inside knowledge of how Amazon does business (never worked there) and all I'm doing is talking about classical approaches to the problem of online commerce - I used to be the PM who worked on data mining and analytics for the Microsoft product called Commerce Server. We shipped in Commerce Server the tools that allowed people to build sites with similar capabilities.... but the bigger the sales volume the better the data the better the model - and Amazon is BIG. I can only imagine how fun it is to play with models with that much data in a commerce driven site. Now many of those algorithms (like the predictor that started out in commerce server) have moved on to live directly within Microsoft SQL.
The four big take-a-ways you should have are:
Amazon (or any retailer) is looking at aggregate data for tons of transactions and tons of people... this allows them to even recommend pretty well for anonymous users on their site.
Amazon (or any sophisticated retailer) is keeping track of behavior and purchases of anyone that is logged in and using that to further refine on top of the mass aggregate data.
Often there is a means of over riding the accumulated data and taking "editorial" control of suggestions for product managers of specific lines (like some person who owns the 'digital cameras' vertical or the 'romance novels' vertical or similar) where they truly are experts
There are often promotional deals (i.e. sony or panasonic or nikon or canon or sprint or verizon pays additional money to the retailer, or gives a better discount at larger quantities or other things in those lines) that will cause certain "suggestions" to rise to the top more often than others - there is always some reasonable business logic and business reason behind this targeted at making more on each transaction or reducing wholesale costs, etc.
In terms of actual implementation? Just about all large online systems boil down to some set of pipelines (or a filter pattern implementation or a workflow, etc. you call it what you will) that allow for a context to be evaluated by a series of modules that apply some form of business logic.
Typically a different pipeline would be associated with each separate task on the page - you might have one that does recommended "packages/upsells" (i.e. buy this with the item you're looking at) and one that does "alternatives" (i.e. buy this instead of the thing you're looking at) and another that pulls items most closely related from your wish list (by product category or similar).
The results of these pipelines are able to be placed on various parts of the page (above the scroll bar, below the scroll, on the left, on the right, different fonts, different size images, etc.) and tested to see which perform best. Since you're using nice easy to plug and play modules that define the business logic for these pipelines you end up with the moral equivalent of lego blocks that make it easy to pick and choose from the business logic you want applied when you build another pipeline which allows faster innovation, more experimentation, and in the end higher profits.
Did that help at all? Hope that give you a little bit of insight how this works in general for just about any ecommerce site - not just Amazon. Amazon (from talking to friends that have worked there) is very data driven and continually measures the effectiveness of it's user experience and the pricing, promotion, packaging, etc. - they are a very sophisticated retailer online and are likely at the leading edge of a lot of the algorithms they use to optimize profit - and those are likely proprietary secrets (you know like the formula to KFC's secret spices) and guaarded as such.
This isn't directly related to Amazon's recommendation system, but it might be helpful to study the methods used by people who competed in the Netflix Prize, a contest to develop a better recommendation system using Netflix user data. A lot of good information exists in their community about data mining techniques in general.
The team that won used a blend of the recommendations generated by a lot of different models/techniques. I know that some of the main methods used were principal component analysis, nearest neighbor methods, and neural networks. Here are some papers by the winning team:
R. Bell, Y. Koren, C. Volinsky, "The BellKor 2008 Solution to the Netflix Prize", (2008).
A. Töscher, M. Jahrer, “The BigChaos Solution to the Netflix Prize 2008", (2008).
A. Töscher, M. Jahrer, R. Legenstein, "Improved Neighborhood-Based Algorithms for Large-Scale Recommender Systems", SIGKDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition (KDD’08) , ACM Press (2008).
Y. Koren, "The BellKor Solution to the Netflix Grand Prize", (2009).
A. Töscher, M. Jahrer, R. Bell, "The BigChaos Solution to the Netflix Grand Prize", (2009).
M. Piotte, M. Chabbert, "The Pragmatic Theory solution to the Netflix Grand Prize", (2009).
The 2008 papers are from the first year's Progress Prize. I recommend reading the earlier ones first because the later ones build upon the previous work.
I bumped on this paper today:
Amazon.com Recommendations: Item-to-Item Collaborative Filtering
Maybe it provides additional information.
(Disclamer: I used to work at Amazon, though I didn't work on the recommendations team.)
ewernli's answer should be the correct one -- the paper links to Amazon's original recommendation system, and from what I can tell (both from personal experience as an Amazon shopper and having worked on similar systems at other companies), very little has changed: at its core, Amazon's recommendation feature is still very heavily based on item-to-item collaborative filtering.
Just look at what form the recommendations take: on my front page, they're all either of the form "You viewed X...Customers who also viewed this also viewed...", or else a melange of items similar to things I've bought or viewed before. If I specifically go to my "Recommended for You" page, every item describes why it's recommended for me: "Recommended because you purchased...", "Recommended because you added X to your wishlist...", etc. This is a classic sign of item-to-item collaborative filtering.
So how does item-to-item collaborative filtering work? Basically, for each item, you build a "neighborhood" of related items (e.g., by looking at what items people have viewed together or what items people have bought together -- to determine similarity, you can use metrics like the Jaccard index; correlation is another possibility, though I suspect Amazon doesn't use ratings data very heavily). Then, whenever I view an item X or make a purchase Y, Amazon suggests me things in the same neighborhood as X or Y.
Some other approaches that Amazon could potentially use, but likely doesn't, are described here: http://blog.echen.me/2011/02/15/an-overview-of-item-to-item-collaborative-filtering-with-amazons-recommendation-system/
A lot of what Dave describes is almost certainly not done at Amazon. (Ratings by those in my social network? Nope, Amazon doesn't have any of my social data. This would be a massive privacy issue in any case, so it'd be tricky for Amazon to do even if they had that data: people don't want their friends to know what books or movies they're buying. Demographic information? Nope, nothing in the recommendations suggests they're looking at this. [Unlike Netflix, who does surface what other people in my area are watching.])
I don't have any knowledge of Amazon's algorithm specifically, but one component of such an algorithm would probably involve tracking groups of items frequently ordered together, and then using that data to recommend other items in the group when a customer purchases some subset of the group.
Another possibility would be to track the frequency of item B being ordered within N days after ordering item A, which could suggest a correlation.
As far I know, it's use Case-Based Reasoning as an engine for it.
You can see in this sources: here, here and here.
There are many sources in google searching for amazon and case-based reasoning.
If you want a hands-on tutorial (using open-source R) then you could do worse than going through this:
https://gist.github.com/yoshiki146/31d4a46c3d8e906c3cd24f425568d34e
It is a run-time optimised version of another piece of work:
http://www.salemmarafi.com/code/collaborative-filtering-r/
However, the variation of the code on the first link runs MUCH faster so I recommend using that (I found the only slow part of yoshiki146's code is the final routine which generates the recommendation at user level - it took about an hour with my data on my machine).
I adapted this code to work as a recommendation engine for the retailer I work for.
The algorithm used is - as others have said above - collaborative filtering. This method of CF calculates a cosine similarity matrix and then sorts by that similarity to find the 'nearest neighbour' for each element (music band in the example given, retail product in my application).
The resulting table can recommend a band/product based on another chosen band/product.
The next section of the code goes a step further with USER (or customer) based collaborative filtering.
The output of this is a large table with the top 100 bands/products recommended for a given user/customer
Someone did a presentation at our University on something similar last week, and referenced the Amazon recommendation system. I believe that it uses a form of K-Means Clustering to cluster people into their different buying habits. Hope this helps :)
Check this out too: Link and as HTML.
What's a good algorithm for suggesting things that someone might like based on their previous choices? (e.g. as popularised by Amazon to suggest books, and used in services like iRate Radio or YAPE where you get suggestions by rating items)
Simple and straightforward (order cart):
Keep a list of transactions in terms of what items were ordered together. For instance when someone buys a camcorder on Amazon, they also buy media for recording at the same time.
When deciding what is "suggested" on a given product page, look at all the orders where that product was ordered, count all the other items purchased at the same time, and then display the top 5 items that were most frequently purchased at the same time.
You can expand it from there based not only on orders, but what people searched for in sequence on the website, etc.
In terms of a rating system (ie, movie ratings):
It becomes more difficult when you throw in ratings. Rather than a discrete basket of items one has purchased, you have a customer history of item ratings.
At that point you're looking at data mining, and the complexity is tremendous.
A simple algorithm, though, isn't far from the above, but it takes a different form. Take the customer's highest rated items, and the lowest rated items, and find other customers with similar highest rated and lowest rated lists. You want to match them with others that have similar extreme likes and dislikes - if you focus on likes only, then when you suggest something they hate, you'll have given them a bad experience. In suggestions systems you always want to err on the side of "lukewarm" experience rather than "hate" because one bad experience will sour them from using the suggestions.
Suggest items in other's highest lists to the customer.
Consider looking at "What is a Good Recommendation Algorithm?" and its discussion on Hacker News.
There isn't a definitive answer and it's highly unlikely there is a standard algorithm for that.
How you do that heavily depends on the kind of data you want to relate and how it is organized. It depends on how you define "related" in the scope of your application.
Often the simplest thought produces good results. In the case of books, if you have a database with several attributes per book entry (say author, date, genre etc.) you can simply chose to suggest a random set of books from the same author, the same genre, similar titles and others like that.
However, you can always try more complicated stuff. Keeping a record of other users that required this "product" and suggest other "products" those users required in the past (product can be anything from a book, to a song to anything you can imagine). Something that most major sites that have a suggest function do (although they probably take in a lot of information, from product attributes to demographics, to best serve the client).
Or you can even resort to so called AI; neural networks can be constructed that take in all those are attributes of the product and try (based on previous observations) to relate it to others and update themselves.
A mix of any of those cases might work for you.
I would personally recommend thinking about how you want the algorithm to work and how to suggest related "products". Then, you can explore all the options: from simple to complicated and balance your needs.
Recommended products algorithms are huge business now a days. NetFlix for one is offering 100,000 for only minor increases in the accuracy of their algorithm.
As you have deduced by the answers so far, and indeed as you suggest, this is a large and complex topic. I can't give you an answer, at least nothing that hasn't already been said, but I an point you to a couple of excellent books on the topic:
Programming CI:
http://oreilly.com/catalog/9780596529321/
is a fairly gentle introduction with
samples in Python.
CI In Action:
http://www.manning.com/alag looks a
bit more in depth (but I've only just
read the first chapter or 2) and has
examples in Java.
I think doing a Google on Least Mean Square Regression (or something like that) might give you something to chew on.
I think most of the useful advice has already been suggested but I thought I'll just put in how I would go about it, just thinking though, since I haven't done anything like this.
First I Would find where in the application I will sample the data to be used, so If I have a store it will probably in the check out. Then I would save a relation between each item in the checkout cart.
now if a user goes to an items page I can count the number of relations from other items and pick for example the 5 items with the highest number of relation to the selected item.
I know its simple, and there are probably better ways.
But I hope it helps
Market basket analysis is the field of study you're looking for:
Microsoft offers two suitable algorithms with their Analysis server:
Microsoft Association Algorithm Microsoft Decision Trees Algorithm
Check out this msdn article for suggestions on how best to use Analysis Services to solve this problem.
link text
there is a recommendation platform created by amazon called Certona, you may find this useful, it is used by companies such as B&Q and Screwfix find more information at www.certona.com/