Asserting similarity based off likes - algorithm

I need a way to basically check if one person is similar to another, based on what they like.
So if I log onto this website, then just jot down a few things I like, the service will go out and check what other people are similar to you. Now this won't be a 'I like music and so do you" type of search, as it will need to take into the fact that "I like nature and you don't", with different levels of importance.
I've looked through Latent Semantic Analysis as well as Naive Bayes Classification, however these seem to be solutions to calculate possibilities instead of classification.
Categorizing Words and Category Values This question seems like the right course I need to take, where likes may be classified into larger categories, however the answer does not seem correct.

Related

How are keyword clouds constructed?

How are keyword clouds constructed?
I know there are a lot of nlp methods, but I'm not sure how they solve the following problem:
You can have several items that each have a list of keywords relating to them.
(In my own program, these items are articles where I can use nlp methods to detect proper nouns, people, places, and (?) possibly subjects. This will be a very large list given a sufficiently sized article, but I will assume that I can winnow the list down using some method by comparing articles. How to do this properly is what I am confused about).
Each item can have a list of keywords, but how do they pick keywords such that the keywords aren't overly specific or overly general between each item?
For example, trivially "the" can be a keyword that is a lot of items.
While "supercalifragilistic" could only be in one.
I suppose that I could create a heuristic where if a word exists in n% of the items where n is sufficiently small, but will return a nice sublist (say 5% of 1000 articles is 50, which seems reasonable) then I could just use that. However, the issue that I take with this approach is that given two different sets of entirely different items, there is most likely some difference in interrelatedness between the items, and I'm throwing away that information.
This is very unsatisfying.
I feel that given the popularity of keyword clouds there must have been a solution created already. I don't want to use a library however as I want to understand and manipulate the assumptions in the math.
If anyone has any ideas please let me know.
Thanks!
EDIT:
freenode/programming/guardianx has suggested https://en.wikipedia.org/wiki/Tf%E2%80%93idf
tf-idf is ok btw, but the issue is that the weighting needs to be determined apriori. Given that two distinct collections of documents will have a different inherent similarity between documents, assuming an apriori weighting does not feel correct
freenode/programming/anon suggested https://en.wikipedia.org/wiki/Word2vec
I'm not sure I want something that uses a neural net (a little complicated for this problem?), but still considering.
Tf-idf is still a pretty standard method for extracting keywords. You can try a demo of a tf-idf-based keyword extractor (which has the idf vector, as you say apriori determined, estimated from Wikipedia). A popular alternative is the TextRank algorithm based on PageRank that has an off-the-shelf implementation in Gensim.
If you decide for your own implementation, note that all algorithms typically need plenty of tuning and text preprocessing to work correctly.
The minimum you need to do is removing stopwords that you know that they never can be a keyword (prepositions, articles, pronouns, etc.). If you want something fancier, you can use for instance Spacy to keep only desired parts of speech (nouns, verbs, adjectives). You can also include frequent multiword expressions (gensim has good function for automatic collocation detection), named entities (spacy can do it). You can get better results if you run coreference resolution and substitute pronouns with what they refer to... There are endless options for improvements.

A simple explanation of Random Forest

I'm trying to understand how random forest works in plain English instead of mathematics. Can anybody give me a really simple explanation of how this algorithm works?
As far as I understand, we feed the features and labels without telling the algorithm which feature should be classified as which label? As I used to do Naive Bayes which is based on probability we need to tell which feature should be which label. Am I completely far off?
If I can get any very simple explanation I'd be really appreciated.
RandomForest uses a so-called bagging approach. The idea is based on the classic bias-variance trade off. Suppose that we have a set (say N) of overfitted estimators that have low bias but high cross-sample-variance. So low bias is good and we want to keep it, high variance is bad and we want to reduce it. RandomForest tries to achieve this by doing a so-called bootstraps/sub-sampling (as #Alexander mentioned, this is a combination of bootstrap sampling on both observations and features). The prediction is the average of individual estimators so the low-bias property is successfully preserved. And further by Central Limit Theorem, the variance of this sample average has a variance equal to variance of individual estimator divided by square root of N. So now, it has both low-bias and low-variance properties, and this is why RandomForest often outperforms stand-alone estimator.
Adding on to the above two answers, Since you mentioned a simple explanation. Here is a write up that I feel is the most simple way you can explain random forests.
Credits go to Edwin Chen for the simple explanation here in layman terms for random forests. Posting the same below.
Suppose you’re very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you’ll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labeled training set). Then, when you ask her if she thinks you’ll like movie X or not, she plays a 20 questions-like game with IMDB, asking questions like “Is X a romantic movie?”, “Does Johnny Depp star in X?”, and so on. She asks more informative questions first (i.e., she maximizes the information gain of each question), and gives you a yes/no answer at the end.
Thus, Willow is a decision tree for your movie preferences.
But Willow is only human, so she doesn’t always generalize your preferences very well (i.e., she overfits). In order to get more accurate recommendations, you’d like to ask a bunch of your friends and watch movie X if most of them say they think you’ll like it. That is, instead of asking only Willow, you want to ask Woody, Apple, and Cartman as well, and they vote on whether you’ll like a movie (i.e., you build an ensemble classifier, aka a forest in this case).
Now you don’t want each of your friends to do the same thing and give you the same answer, so you first give each of them slightly different data. After all, you’re not absolutely sure of your preferences yourself – you told Willow you loved Titanic, but maybe you were just happy that day because it was your birthday, so maybe some of your friends shouldn’t use the fact that you liked Titanic in making their recommendations. Or maybe you told her you loved Cinderella, but actually you really really loved it, so some of your friends should give Cinderella more weight. So instead of giving your friends the same data you gave Willow, you give them slightly perturbed versions. You don’t change your love/hate decisions, you just say you love/hate some movies a little more or less (formally, you give each of your friends a bootstrapped version of your original training data). For example, whereas you told Willow that you liked Black Swan and Harry Potter and disliked Avatar, you tell Woody that you liked Black Swan so much you watched it twice, you disliked Avatar, and don’t mention Harry Potter at all.
By using this ensemble, you hope that while each of your friends gives somewhat idiosyncratic recommendations (Willow thinks you like vampire movies more than you do, Woody thinks you like Pixar movies, and Cartman thinks you just hate everything), the errors get canceled out in the majority. Thus, your friends now form a bagged (bootstrap aggregated) forest of your movie preferences.
There’s still one problem with your data, however. While you loved both Titanic and Inception, it wasn’t because you like movies that star Leonardo DiCaprio. Maybe you liked both movies for other reasons. Thus, you don’t want your friends to all base their recommendations on whether Leo is in a movie or not. So when each friend asks IMDB a question, only a random subset of the possible questions is allowed (i.e., when you’re building a decision tree, at each node you use some randomness in selecting the attribute to split on, say by randomly selecting an attribute or by selecting an attribute from a random subset). This means your friends aren’t allowed to ask whether Leonardo DiCaprio is in the movie whenever they want. So whereas previously you injected randomness at the data level, by perturbing your movie preferences slightly, now you’re injecting randomness at the model level, by making your friends ask different questions at different times.
And so your friends now form a random forest.
I will try to give another complementary explanation with simple words.
A random forest is a collection of random decision trees (of number n_estimators in sklearn).
What you need to understand is how to build one random decision tree.
Roughly speaking, to build a random decision tree you start from a subset of your training samples. At each node you will draw randomly a subset of features (number determined by max_features in sklearn). For each of these features you will test different thresholds and see how they split your samples according to a given criterion (generally entropy or gini, criterion parameter in sklearn). Then you will keep the feature and its threshold that best split your data and record it in the node.
When the construction of the tree ends (it can be for different reasons: maximum depth is reached (max_depth in sklearn), minimum sample number is reached (min_samples_leaf in sklearn) etc.) you look at the samples in each leaf and keep the frequency of the labels.
As a result, it is like the tree gives you a partition of your training samples according to meaningful features.
As each node is built from features taken randomly, you understand that each tree built in this way will be different. This contributes to the good compromise between bias and variance, as explained by #Jianxun Li.
Then in testing mode, a test sample will go through each tree, giving you label frequencies for each tree. The most represented label is generally the final classification result.

Keyword association learning algorithm

To model my problem, I'll use a dating site as an example (although this is not the actual case). My problem is I have a set of keywords that a user can input that they like. Say "Tall, dark hair, blue eyes", etc. and I want to map them to other users that fit that criteria. More than that however, I need to be able to learn from data I get back to make better predictions that are not-so-exact matches.
For example, if other users that are looking for people with 'dark hair' like users with 'black hair', or have a height of 6'4 but don't mention they are tall. I want to be able to make an association between those similar keywords and be able to also suggest those as well so it best returns what the user wants, even if it wasn't exactly what they asked for.
My question is what algorithm/approach is best suited for this? I've been looking into areas like:
decision trees, but those seem to break down when no keywords match.
naive bayes, which seem a bit more tolerant to missing connections, but require some prior knowledge about the connections, and since keywords can be anything, this seems like a road bloack
ANN, but these don't seem to do well with text input
KNN, but I'm not sure how to handle the possibly infinite user classifications?
Some sort of A* map search, where each time a user1 likes a user2, I make a map connection between user1's likes and user2's traits, if that connection already exists, I just shorten it, then find the closest N users. I'm just not sure how scalable this is.
Any input is appreciated,
Thanks!
This sounds like a rather classic application of association rule learning: basically, if people looking for partners with 'dark hair' like a lot of 'black hair' accounts, then you have an association rule between the two. There are algorithms that can detect this.
As for your suggestions, have you tried an ANN? ANNs don't work at all with text input, but for most machine learning + text tasks, you turn the text into numeric data (see the bag of words model, for example). Once you have your numeric features, they shouldn't do too badly.
For example, you'd want your network trained to return adequate recommendations based on profile settings, right? You would feed it the profile settings, and if you have training data that shows users looking for people with 'dark hair' liking users with 'black hair', the ANN should learn that relationship.
Association rules sound like the way to go though.

How to count votes for qualitative survey answers

We are creating a website for a client that wants a website based around a survey of peoples' '10 favourite things'. There are 10 questions that each user must answer, e.g. 'What is your favourite colour', 'Who is your favourite celebrity', etc., and then the results are collated into a global Top 10 list on the home page.
The conundrum lies in both allowing the user to input anything they want, e.g. their favourite holiday destination might be 'Grandma's house', and being able to accurately count the votes accurately, e.g. User A might say their favourite celebrity is 'The Queen' and User B might says it's 'Queen of England' - we need those two answers to be counted as two votes for the same 'thing'.
If we force the user to choose from a large but predetermined list for each question, it restricts users' ability to define literally anything as their 'favourite thing'. Whereas, if we have a plain text input field and try to interpret answers after they have been submitted, it's going to be much more difficult to count votes where there are variations in names or spelling for the same answer.
Is it possible to automatically moderate their answers in real-time through some form of search phrase suggestion engine? How can we make sure that, if a plain text field is the input method, we make allowances for variations in spelling?
If anyone has any ideas as to possible solutions to this functionality, perhaps a piece of software, a plugin, an API, anything, then please do let us know.
Thank you and please just ask for any clarification.
If you want to automate counting "The Queen" and "The Queen of England", you're in for work that might be more complex than it's worth for a "fun little survey". If the volume is light enough, consider just manually counting the results. Just to give you a feeling, what if someone enters "The Queen of Sweden" or "Queen Letifah Concerts"?
If you really want to go down that route, look into Natural Language Processing (NLP). Specifically, the field of categorization.
For a general introduction to NLP, I recommend the relevant Wikipedia article
http://en.wikipedia.org/wiki/Natural_language_processing
RapidMiner is an open source NLP solution that would be worth looking into.
As Eric J said, this is getting into cutting edge NLP applications. These are fields of study that are very important for AI/automation researchers and computer science in general, but are still very fledgeling. There are a number of programs and algorithms you can use, the drawbacks and benefits of which very widely. RapidMiner is good, WordNet is widely used in medical applications and should be relatively easy to adjust to your own corpus, and there are more advanced methods like latent Dirichlet allocation. Here are a few resources you should start with (in addition to the Wikipedia article provided above)
http://www.semanticsearchart.com/index.html
http://www.mitpressjournals.org/loi/coli
http://marimba.d.umn.edu/ (try the SenseClusters calculator)
http://wordnet.princeton.edu/
The best to classify short answers is k-means clustering. You need to apply stemming. Then you need to convert words into indexes using elementary dictionary. You can use EverGroingDictionary.cs from sematicsearchart.com. After throwing phrase to a dictionary it will be converted to sequence of numbers or vector. Introduce measure of proximity as number of coincidences in words and apply k-means, which is lightning fast algorithm. k-means will organize all answers into groups. Most frequent words in each group will be a signature of the group. Your whole program in C++ or C# or Java must be less than 1000 lines.

Beyond item-to-item recommendations

Simple item-to-item recommendation systems are well-known and frequently implemented. An example is the Slope One algorithm. This is fine if the user hasn't rated many items yet, but once they have, I want to offer more finely-grained recommendations. Let's take a music recommendation system as an example, since they are quite popular. If a user is viewing a piece by Mozart, a suggestion for another Mozart piece or Beethoven might be given. But if the user has made many ratings on classical music, we might be able to make a correlation between the items and see that the user dislikes vocals or certain instruments. I'm assuming this would be a two-part process, first part is to find correlations between each users' ratings, the second would be to build the recommendation matrix from these extra data. So the question is, are they any open-source implementations or papers that can be used for each of these steps?
Taste may have something useful. It's moved to the Mahout project:
http://taste.sourceforge.net/
In general, the idea is that given a user's past preferences, you want to predict what they'll select next and recommend it. You build a machine-learning model in which the inputs are what a user has picked in the past and the attributes of each pick. The output is the item(s) they'll pick. You create training data by holding back some of their choices, and using their history to predict the data you held back.
Lots of different machine learning models you can use. Decision trees are common.
One answer is that any recommender system ought to have some of the properties you describe. Initially, recommendations aren't so good and are all over the place. As it learns tastes, the recommendations will come from the area the user likes.
But, the collaborative filtering process you describe is fundamentally not trying to solve the problem you are trying to solve. It is based on user ratings, and two songs aren't rated similarly because they are similar songs -- they're rated similarly just because similar people like them.
What you really need is to define your notion of song-song similarity. Is it based on how the song sounds? the composer? Because it sounds like the notion is not based on ratings, actually. That is 80% of the problem you are trying to solve.
I think the question you are really answering is, what items are most similar to a given item? Given your item similarity, that's an easier problem than recommendation.
Mahout can help with all of these things, except song-song similarity based on its audio -- or at least provide a start and framework for your solution.
There are two techniques that I can think of:
Train a feed-forward artificial neural net using Backpropagation or one of it's successors (e.g. Resilient Propagation).
Use version space learning. This starts with the most general and the most specific hypotheses about what the user likes and narrows them down when new examples are integrated. You can use a hierarchy of terms to describe concepts.
Common characteristics of these methods are:
You need a different function for
each user. This pretty much rules
out efficient database queries when
searching for recommendations.
The function can be updated on the fly
when the user votes for an item.
The dimensions along which you classify
the input data (e.g. has vocals, beats
per minute, musical scales,
whatever) are very critical to the
quality of the classification.
Please note that these suggestions come from university courses in knowledge based systems and artificial neural nets, not from practical experience.

Resources