How to implement this recommendation algorithm? - algorithm

Most recommendation algorithm articles I've read are focused on the Netflix model where users rate items. What I want to do is slightly different (I think).
Let's say that instead, I want to create a site where a user is presented with two pictures of cars. The user can then select which car they like better. The user can repeat this process as many times as s/he likes, but hopefully as they continue, the pictures become more and more refined towards what the user likes.
How would you implement this algorithm? It seems like one possible way would simply be to implement an ELO ranking algorithm and use the order of those results as a "rating", but that has serious flaws in that multiple items can't be given a maximum rating (which the user may have done if given the ability to rate the items themselves).
Another method, which seems more promising to me, would be to predetermine the general properties of each vehicle (e.g. color, body type, 2 door vs 4 door, etc.) and use those to get a general idea of the properties each user likes and base recommendations off of that.

I'll take a stab at this.
Suppose that each car is given a set of properties. If this set of properties were coded as a vector, one potential method of recommendation would be to use Self Organizing Maps (SOM). The basic gist of a SOM is that is a categorizer of input vectors. If you train a SOM with input vectors representing distinct classes of input, a SOM will start to cluster its storage vectors to be more like each class of input. Note that the original input vector is not retained. To train a SOM with an input vector, the best vector currently in the SOM is picked and then the area around that vector becomes more like the input. Of course, see Wikipedia http://en.wikipedia.org/wiki/Self-organizing_map.
So how does this apply to this situation? Well, one SOM could be used to train on images that the person does like and one could be trained on the ones they do like. Even if there are is no single style that they like, clusters should form around cars they like/don't like. Then seeing if they like a car that has not been picked by them is a matter of finding how well it matches to groups from their likes and dislikes. Note that in this case, it would be best to match up cars that are dissimilar to each other or more likely to not both be liked.
When the person first joins the site, it may be advantageous to allow them to pick a few likes and dislikes right off the bat to seed the SOMs.
Good luck!

Maby it's a bit too late to answer, but you might want to check this out. It's about an MIT professor that argues that 5-star rating, like-rating, etc. don't work, he proposes an algorithm that works with input by pairs, just as you suggest (Car A or Car B).
The Algorithm is quite complex but can be found on the link.

Related

A simple explanation of Random Forest

I'm trying to understand how random forest works in plain English instead of mathematics. Can anybody give me a really simple explanation of how this algorithm works?
As far as I understand, we feed the features and labels without telling the algorithm which feature should be classified as which label? As I used to do Naive Bayes which is based on probability we need to tell which feature should be which label. Am I completely far off?
If I can get any very simple explanation I'd be really appreciated.
RandomForest uses a so-called bagging approach. The idea is based on the classic bias-variance trade off. Suppose that we have a set (say N) of overfitted estimators that have low bias but high cross-sample-variance. So low bias is good and we want to keep it, high variance is bad and we want to reduce it. RandomForest tries to achieve this by doing a so-called bootstraps/sub-sampling (as #Alexander mentioned, this is a combination of bootstrap sampling on both observations and features). The prediction is the average of individual estimators so the low-bias property is successfully preserved. And further by Central Limit Theorem, the variance of this sample average has a variance equal to variance of individual estimator divided by square root of N. So now, it has both low-bias and low-variance properties, and this is why RandomForest often outperforms stand-alone estimator.
Adding on to the above two answers, Since you mentioned a simple explanation. Here is a write up that I feel is the most simple way you can explain random forests.
Credits go to Edwin Chen for the simple explanation here in layman terms for random forests. Posting the same below.
Suppose you’re very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you’ll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labeled training set). Then, when you ask her if she thinks you’ll like movie X or not, she plays a 20 questions-like game with IMDB, asking questions like “Is X a romantic movie?”, “Does Johnny Depp star in X?”, and so on. She asks more informative questions first (i.e., she maximizes the information gain of each question), and gives you a yes/no answer at the end.
Thus, Willow is a decision tree for your movie preferences.
But Willow is only human, so she doesn’t always generalize your preferences very well (i.e., she overfits). In order to get more accurate recommendations, you’d like to ask a bunch of your friends and watch movie X if most of them say they think you’ll like it. That is, instead of asking only Willow, you want to ask Woody, Apple, and Cartman as well, and they vote on whether you’ll like a movie (i.e., you build an ensemble classifier, aka a forest in this case).
Now you don’t want each of your friends to do the same thing and give you the same answer, so you first give each of them slightly different data. After all, you’re not absolutely sure of your preferences yourself – you told Willow you loved Titanic, but maybe you were just happy that day because it was your birthday, so maybe some of your friends shouldn’t use the fact that you liked Titanic in making their recommendations. Or maybe you told her you loved Cinderella, but actually you really really loved it, so some of your friends should give Cinderella more weight. So instead of giving your friends the same data you gave Willow, you give them slightly perturbed versions. You don’t change your love/hate decisions, you just say you love/hate some movies a little more or less (formally, you give each of your friends a bootstrapped version of your original training data). For example, whereas you told Willow that you liked Black Swan and Harry Potter and disliked Avatar, you tell Woody that you liked Black Swan so much you watched it twice, you disliked Avatar, and don’t mention Harry Potter at all.
By using this ensemble, you hope that while each of your friends gives somewhat idiosyncratic recommendations (Willow thinks you like vampire movies more than you do, Woody thinks you like Pixar movies, and Cartman thinks you just hate everything), the errors get canceled out in the majority. Thus, your friends now form a bagged (bootstrap aggregated) forest of your movie preferences.
There’s still one problem with your data, however. While you loved both Titanic and Inception, it wasn’t because you like movies that star Leonardo DiCaprio. Maybe you liked both movies for other reasons. Thus, you don’t want your friends to all base their recommendations on whether Leo is in a movie or not. So when each friend asks IMDB a question, only a random subset of the possible questions is allowed (i.e., when you’re building a decision tree, at each node you use some randomness in selecting the attribute to split on, say by randomly selecting an attribute or by selecting an attribute from a random subset). This means your friends aren’t allowed to ask whether Leonardo DiCaprio is in the movie whenever they want. So whereas previously you injected randomness at the data level, by perturbing your movie preferences slightly, now you’re injecting randomness at the model level, by making your friends ask different questions at different times.
And so your friends now form a random forest.
I will try to give another complementary explanation with simple words.
A random forest is a collection of random decision trees (of number n_estimators in sklearn).
What you need to understand is how to build one random decision tree.
Roughly speaking, to build a random decision tree you start from a subset of your training samples. At each node you will draw randomly a subset of features (number determined by max_features in sklearn). For each of these features you will test different thresholds and see how they split your samples according to a given criterion (generally entropy or gini, criterion parameter in sklearn). Then you will keep the feature and its threshold that best split your data and record it in the node.
When the construction of the tree ends (it can be for different reasons: maximum depth is reached (max_depth in sklearn), minimum sample number is reached (min_samples_leaf in sklearn) etc.) you look at the samples in each leaf and keep the frequency of the labels.
As a result, it is like the tree gives you a partition of your training samples according to meaningful features.
As each node is built from features taken randomly, you understand that each tree built in this way will be different. This contributes to the good compromise between bias and variance, as explained by #Jianxun Li.
Then in testing mode, a test sample will go through each tree, giving you label frequencies for each tree. The most represented label is generally the final classification result.

Asserting similarity based off likes

I need a way to basically check if one person is similar to another, based on what they like.
So if I log onto this website, then just jot down a few things I like, the service will go out and check what other people are similar to you. Now this won't be a 'I like music and so do you" type of search, as it will need to take into the fact that "I like nature and you don't", with different levels of importance.
I've looked through Latent Semantic Analysis as well as Naive Bayes Classification, however these seem to be solutions to calculate possibilities instead of classification.
Categorizing Words and Category Values This question seems like the right course I need to take, where likes may be classified into larger categories, however the answer does not seem correct.

How to count votes for qualitative survey answers

We are creating a website for a client that wants a website based around a survey of peoples' '10 favourite things'. There are 10 questions that each user must answer, e.g. 'What is your favourite colour', 'Who is your favourite celebrity', etc., and then the results are collated into a global Top 10 list on the home page.
The conundrum lies in both allowing the user to input anything they want, e.g. their favourite holiday destination might be 'Grandma's house', and being able to accurately count the votes accurately, e.g. User A might say their favourite celebrity is 'The Queen' and User B might says it's 'Queen of England' - we need those two answers to be counted as two votes for the same 'thing'.
If we force the user to choose from a large but predetermined list for each question, it restricts users' ability to define literally anything as their 'favourite thing'. Whereas, if we have a plain text input field and try to interpret answers after they have been submitted, it's going to be much more difficult to count votes where there are variations in names or spelling for the same answer.
Is it possible to automatically moderate their answers in real-time through some form of search phrase suggestion engine? How can we make sure that, if a plain text field is the input method, we make allowances for variations in spelling?
If anyone has any ideas as to possible solutions to this functionality, perhaps a piece of software, a plugin, an API, anything, then please do let us know.
Thank you and please just ask for any clarification.
If you want to automate counting "The Queen" and "The Queen of England", you're in for work that might be more complex than it's worth for a "fun little survey". If the volume is light enough, consider just manually counting the results. Just to give you a feeling, what if someone enters "The Queen of Sweden" or "Queen Letifah Concerts"?
If you really want to go down that route, look into Natural Language Processing (NLP). Specifically, the field of categorization.
For a general introduction to NLP, I recommend the relevant Wikipedia article
http://en.wikipedia.org/wiki/Natural_language_processing
RapidMiner is an open source NLP solution that would be worth looking into.
As Eric J said, this is getting into cutting edge NLP applications. These are fields of study that are very important for AI/automation researchers and computer science in general, but are still very fledgeling. There are a number of programs and algorithms you can use, the drawbacks and benefits of which very widely. RapidMiner is good, WordNet is widely used in medical applications and should be relatively easy to adjust to your own corpus, and there are more advanced methods like latent Dirichlet allocation. Here are a few resources you should start with (in addition to the Wikipedia article provided above)
http://www.semanticsearchart.com/index.html
http://www.mitpressjournals.org/loi/coli
http://marimba.d.umn.edu/ (try the SenseClusters calculator)
http://wordnet.princeton.edu/
The best to classify short answers is k-means clustering. You need to apply stemming. Then you need to convert words into indexes using elementary dictionary. You can use EverGroingDictionary.cs from sematicsearchart.com. After throwing phrase to a dictionary it will be converted to sequence of numbers or vector. Introduce measure of proximity as number of coincidences in words and apply k-means, which is lightning fast algorithm. k-means will organize all answers into groups. Most frequent words in each group will be a signature of the group. Your whole program in C++ or C# or Java must be less than 1000 lines.

Discover user behind multiple different user accounts according to words he uses

I would like to create algorithm to distinguish the persons writing on forum under different nicknames.
The goal is to discover people registring new account to flame forum anonymously, not under their main account.
Basicaly I was thinking about stemming words they use and compare users according to similarities or these words.
As shown on the picture there is user3 and user4 who uses same words. It means there is probably one person behind the computer.
Its clear that there are lot of common words which are being used by all users. So I should focus on "user specific" words.
Input is (related to the image above):
<word1, user1>
<word2, user1>
<word2, user2>
<word3, user2>
<word4, user2>
<word5, user3>
<word5, user4>
... etc. The order doesnt matter
Output should be:
user1
user2
user3 = user4
I am doing this in Java but I want this question to be language independent.
Any ideas how to do it?
1) how to store words/users? What data structures?
2) how to get rid of common words everybody use? I have to somehow ignore them among user specific words. Maybe I could just ignore them because they get lost. I am afraid that they will hide significant difference of "user specific words"
3) how to recognize same users? - somehow count same words between each user?
I am very thankful for every advice in advance.
In general this is task of author identification, and there are several good papers like this that may give you a lot of information. Here are my own suggestions on this topic.
1. User recognition/author identification itself
The most simple kind of text classification is classification by topic, and there you take meaningful words first of all. That is, if you want to distinguish text about Apple the company and apple the fruit, you count words like "eat", "oranges", "iPhone", etc., but you commonly ignore things like articles, forms of words, part-of-speech (POS) information and so on. However many people may talk about same topics, but use different styles of speech, that is articles, forms of words and all the things you ignore when classifying by topic. So the first and the main thing you should consider is collecting the most useful features for your algorithm. Author's style may be expressed by frequency of words like "a" and "the", POS-information (e.g. some people tend to use present time, others - future), common phrases ("I would like" vs. "I'd like" vs. "I want") and so on. Note that topic words should not be discarded completely - they still show themes the user is interested in. However you should treat them somehow specially, e.g. you can pre-classify texts by topic and then discriminate users not interested in it.
When you are done with feature collection, you may use one of machine learning algorithm to find best guess for an author of the text. As for me, 2 best suggestions here are probability and cosine similarity between text vector and user's common vector.
2. Discriminating common words
Or, in latest context, common features. The best way I can think of to get rid of the words that are used by all people more or less equally is to compute entropy for each such feature:
entropy(x) = -sum(P(Ui|x) * log(P(Ui|x)))
where x is a feature, U - user, P(Ui|x) - conditional probability of i-th user given feature x, and sum is the sum over all users.
High value of entropy indicates that distribution for this feature is close to uniform and thus is almost useless.
3. Data representation
Common approach here is to have user-feature matrix. That is, you just build table where rows are user ids and columns are features. E.g. cell [3][12] shows normalized how many times user #3 used feature #12 (don't forget to normalize these frequencies by total number of features user ever used!).
Depending on features your are going to use and size of the matrix, you may want to use sparse matrix implementation instead of dense. E.g. if you use 1000 features and for every particular user around 90% of cells are 0, it doesn't make sense to keep all these zeros in memory and sparse implementation is better option.
I recommend a language modelling approach. You can train a language model (unigram, bigram, parsimonious, ...) on each of your user accounts' words. That gives you a mapping from words to probabilities, i.e. numbers between 0 and 1 (inclusive) expressing how likely it is that a user uses each of the words you encountered in the complete training set. Language models can be stored as arrays of pairs, hash tables or sparse vectors. There are plenty of libraries on the web for fitting LMs.
Such a mapping can be considered a high-dimensional vector, in the same way documents are considered as vector in the vector space model of information retrieval. You can then compare these vectors by using KL-divergence or any of the popular distance metrics: Euclidean distance, cosine distance, etc. A strong similarity/small distance between two users' vectors might then indicate that they belong to one and the same user.
how to store words/users? What data structures?
You probably have some kind of representation for the users and the posts that they have made. I think you should have a list of words, and a list corresponding to each word containing the users who use it. Something like:
<word: <user#1, user#4, user#5, ...> >
how to get rid of common words everybody use?
Hopefully, you have a set of stopwords. Why not extend it to include commonly used words from your forum? For example, for stackoverflow, some of the most frequently used tags' names should qualify for it.
how to recognize same users?
In addition to using similarity or word-frequency based measures, you can also try using interactions between users. For example, user3 likes/upvotes/comments each and every post by user8, or a new user doing similar things for some other (older) user in this way.

Algorithms to find stuff a user would like based on other users likes

I'm thinking of writing an app to classify movies in an HTPC based on what the family members like.
I don't know statistics or AI, but the stuff here looks very juicy. I wouldn't know where to start do.
Here's what I want to accomplish:
Compose a set of samples from each users likes, rating each sample attribute separately. For example, maybe a user likes western movies a lot, so the western genre would carry a bit more weight for that user (and so on for other attributes, like actors, director, etc).
A user can get suggestions based on the likes of the other users. For example, if both user A and B like Spielberg (connection between the users), and user B loves Batman Begins, but user A loathes Katie Holmes, weigh the movie for user A accordingly (again, each attribute separately, for example, maybe user A doesn't like action movies so much, so bring the rating down a bit, and since Katie Holmes isn't the main star, don't take that into account as much as the other attributes).
Basically, comparing sets from user A similar to sets from user B, and come up with a rating for user A.
I have a crude idea about how to implement this, but I'm certain some bright minds have already thought of a far better solution already, so... any suggestions?
Actually, after a quick research, it seems a Bayesian filter would work. If so, would this be the better approach? Would it be as simple as just "normalizing" movie data, training a classifier for each user, and then just classify each movie?
If your suggestion includes some brain melting concepts (I'm not experienced in these subjects, specially in AI), I'd appreciate it if you also included a list of some basics for me to research before diving into the meaty stuff.
Thanks!
Matthew Podwysocki had some interesting articles on this stuff
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/03/30/functional-programming-and-collective-intelligence.aspx
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/04/01/functional-programming-and-collective-intelligence-ii.aspx
http://weblogs.asp.net/podwysocki/archive/2009/04/07/functional-programming-and-collective-intelligence-iii.aspx
This is similar to this question where the OP wanted to build a recommendation system. In a nutshell, we are given a set of training data consisting of users ratings to movies (1-5 star rating for example) and a set of attributes for each movie (year, genre, actors, ..). We want to build a recommender so that it will output for unseen movies a possible rating. So the inpt data looks like:
user movie year genre ... | rating
---------------------------------------------
1 1 2006 action | 5
3 2 2008 drama | 3.5
...
and for an unrated movie X:
10 20 2009 drama ?
we want to predict a rating. Doing this for all unseen movies then sorting by predicted movie rating and outputting the top 10 gives you a recommendation system.
The simplest approach is to use a k-nearest neighbor algorithm. Among the rated movies, search for the "closest" ones to movie X, and combine their ratings to produce a prediction.
This approach has the advantage of being very simple to easy implement from scratch.
Other more sophisticated approaches exist. For example you can build a decision tree, fit a set of rules on the training data. You can also use Bayesian networks, artificial neural networks, support vector machines, among many others... Going through each of these wont be easy for someone without the proper background.
Still I expect you would be using an external tool/library. Now you seem to be familiar with Bayesian Networks, so a simple naive bayes net, could in fact be very powerful. One advantage is that it allow for prediction under missing data.
The main idea would be somewhat the same; take the input data you have, train a model, then use it to predict the class of new instances.
If you want to play around with different algorithms in simple intuitive package which requires no programming, I suggest you take a look at Weka (my 1st choice), Orange, or RapidMiner. The most difficult part would be to prepare the dataset to the required format. The rest is as easy as choosing what algorithm and applying it (all in a few clicks!)
I guess for someone not looking to go into too much details, I would recommend going with the nearest neighbor method as it is intuitive and easy to implement.. Still the option of using Weka (or one of the other tools) is worth looking into.
There are a few algorithms that are good for this:
ARTMAP: groups via probability against each other (this isn't fast but its the best thing for your problem IMO)
ARTMAP holds a group of common attributes and determines likelyhood of simliarity via a percentages.
ARTMAP
KMeans: This seperates out the vectors by the distance that they are from each other
KMeans: Wikipedia
PCA: will seperate the average of all the values from the varing bits. This is what you would use to do face detection, and background subtraction in Computer Vision.
PCA
The K-nearest neighbor algorithm may be right up your alley.
Check out some of the work of the top teams for the netflix prize.

Resources