Keyword association learning algorithm - algorithm

To model my problem, I'll use a dating site as an example (although this is not the actual case). My problem is I have a set of keywords that a user can input that they like. Say "Tall, dark hair, blue eyes", etc. and I want to map them to other users that fit that criteria. More than that however, I need to be able to learn from data I get back to make better predictions that are not-so-exact matches.
For example, suppose other users that are looking for people with 'dark hair' like users with 'black hair', or like users who have a height of 6'4 but don't mention they are tall. I want to be able to make an association between those similar keywords and suggest those profiles as well, so that the system best returns what the user wants, even if it wasn't exactly what they asked for.
My question is what algorithm/approach is best suited for this? I've been looking into areas like:
decision trees, but those seem to break down when no keywords match.
naive Bayes, which seems a bit more tolerant of missing connections, but requires some prior knowledge about the connections, and since keywords can be anything, this seems like a roadblock
ANN, but these don't seem to do well with text input
KNN, but I'm not sure how to handle the possibly infinite user classifications?
Some sort of A* map search, where each time a user1 likes a user2, I make a map connection between user1's likes and user2's traits. If that connection already exists, I just shorten it. Then I find the closest N users. I'm just not sure how scalable this is.
Any input is appreciated,
Thanks!

This sounds like a rather classic application of association rule learning: basically, if people looking for partners with 'dark hair' like a lot of 'black hair' accounts, then you have an association rule between the two. There are algorithms that can detect this, such as Apriori.
As for your suggestions, have you tried an ANN? ANNs don't work directly on raw text input, but for most machine learning + text tasks, you turn the text into numeric data (see the bag of words model, for example). Once you have your numeric features, they shouldn't do too badly.
For example, you'd want your network trained to return adequate recommendations based on profile settings, right? You would feed it the profile settings, and if you have training data that shows users looking for people with 'dark hair' liking users with 'black hair', the ANN should learn that relationship.
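As a rough illustration of the bag-of-words step mentioned above, here is a minimal sketch in plain Python. The sample profiles and the helper names `build_vocabulary` and `to_vector` are made up for this example; a real system would more likely use a library vectorizer and a much larger vocabulary.

```python
def build_vocabulary(profiles):
    """Map each distinct keyword to a column index (sorted for a stable order)."""
    keywords = sorted({kw for profile in profiles for kw in profile})
    return {kw: i for i, kw in enumerate(keywords)}


def to_vector(profile, vocabulary):
    """Turn a list of keywords into a 0/1 vector over the vocabulary."""
    vector = [0] * len(vocabulary)
    for kw in profile:
        if kw in vocabulary:          # keywords outside the vocabulary are ignored
            vector[vocabulary[kw]] = 1
    return vector


# Hypothetical keyword lists, one per profile.
profiles = [
    ["tall", "dark hair", "blue eyes"],
    ["black hair", "blue eyes"],
]
vocabulary = build_vocabulary(profiles)
vectors = [to_vector(p, vocabulary) for p in profiles]
```

Vectors like these are what you would feed to the network as numeric features, in place of the raw keyword strings.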
Association rules sound like the way to go though.
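To make the association-rule idea a bit more concrete, here is a minimal sketch that only mines pairwise rules of the form "searched for X, liked a profile with trait Y". It is a simplification, not a full Apriori implementation. The event format, the thresholds and the function name `mine_rules` are assumptions for illustration.

```python
from collections import Counter


def mine_rules(events, min_support=2, min_confidence=0.5):
    """Return {(wanted, trait): confidence} for frequent keyword pairs.

    Each event is (wanted_keywords, liked_traits): what a user searched for
    and the traits of a profile that user went on to like.
    """
    wanted_counts = Counter()
    pair_counts = Counter()
    for wanted, traits in events:
        for w in set(wanted):
            wanted_counts[w] += 1
            for t in set(traits):
                if t != w:
                    pair_counts[(w, t)] += 1

    rules = {}
    for (w, t), n in pair_counts.items():
        # Confidence: share of searches for w that led to liking trait t.
        confidence = n / wanted_counts[w]
        if n >= min_support and confidence >= min_confidence:
            rules[(w, t)] = confidence
    return rules


# Hypothetical feedback data.
events = [
    ({"dark hair"}, {"black hair", "tall"}),
    ({"dark hair"}, {"black hair"}),
    ({"dark hair", "tall"}, {"brown hair", "6'4"}),
    ({"tall"}, {"6'4"}),
]
rules = mine_rules(events)
```

With this toy data, the surviving rules link 'dark hair' to 'black hair' and 'tall' to "6'4", which is the kind of association the question asks for. The rules can then be used to widen a search to associated keywords.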

Related

What matching algorithm could I use?

I would need some help because I don't know what algorithm I could use for the following (I use Python):
Steve is 25 and he buys orange juice every day
Maria is 23 and she likes to buy smoothies
Steve's and Maria's tastes are pretty much the same.
Juan is 16 and he only drinks sodas
Juan's tastes are not the same as Steve's and Maria's.
====================================================
I would like to use a matching algorithm that will detect the users who have the same drink preference and a close age. To continue with the example, Steve and Maria would be matched together but not Juan. Which one should I use?
I agree with @klutt that your task is pretty vague. Two approaches come to mind, but not knowing more about your problem limits the detail I can provide. I am interpreting the question as if you are taking in raw text and might want to process more sentences that have very similar semantic and syntactic structure.
An algorithmic approach:
Assuming that your word choices are static in their semantic meaning (Maria is 23 ... Steve is 25), we can parse each sentence and identify tokens like is or and or same and essentially perform lexical analysis on the text... from here, you could continue thinking about how you would go about matching and so forth... but this is rather complicated...
Neural Network approach:
If you are taking in raw text in the form of sentences, it's a problem that's not straightforward to solve using a top-down algorithmic approach.
You could take an approach with neural networks that trains a model to solve your problem, but then again what you seem to be asking is quite complex since there are multiple "facts" within each sentence that are not semantically related. For example, your second sentence identifies that Maria is 23 but at the end of that sentence there is a comparison between Steve and Maria. And your first sentence only identifies Steve as 25.
Even if you chunk raw text into sentences, you would have to have a very fine tuned neural network architecture and a lot of training data to get remotely close to your goal.
Now, both of those solutions are very complex... but if you wanted to create an application that collects this data (via a form or prompt) and puts it into a structured format (like a JSON or XML object) to organize and store the data in memory (perhaps writing out to a database or file for persistent storage), that might be a good route to go down.
This can serve as a good lesson in how to think about data as well. It is one thing if you have a pool of thousands of sentences, just raw data that you need to organize for quantitative purposes (classic qualitative -> quantitative problems). It is another thing if you are going to be collecting this data. If you are going to be collecting data, having a program that collects and organizes names, ages, and drink preferences (and then organizes that data within certain data structures), then we can talk about matching algorithms.
I will also add here that if you do have structured data, collaborative filtering (mentioned by Shridhar) is a great starting place.
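Once the data is structured, the matching itself can be quite small. The sketch below assumes each person has already been given a drink category (grouping orange juice and smoothies under "fruit drinks" is my assumption, not something in the question) and that "close age" means a gap of at most five years. Both choices are placeholders to tune.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Person:
    name: str
    age: int
    drink_category: str   # assumed label, e.g. "fruit drinks" or "soda"


def find_matches(people, max_age_gap=5):
    """Pair people who share a drink category and are close in age."""
    pairs = []
    for a, b in combinations(people, 2):
        same_taste = a.drink_category == b.drink_category
        close_age = abs(a.age - b.age) <= max_age_gap
        if same_taste and close_age:
            pairs.append((a.name, b.name))
    return pairs


people = [
    Person("Steve", 25, "fruit drinks"),
    Person("Maria", 23, "fruit drinks"),
    Person("Juan", 16, "soda"),
]
matches = find_matches(people)
```

Here `matches` contains only the Steve and Maria pair. Comparing every pair is fine for small data; for many users you would group by category first.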
Collaborative filtering best suits your needs.
In the newer, narrower sense, collaborative filtering is a method of
making automatic predictions (filtering) about the interests of a user
by collecting preferences or taste information from many users
(collaborating). The underlying assumption of the collaborative
filtering approach is that if a person A has the same opinion as a
person B on an issue, A is more likely to have B's opinion on a
different issue than that of a randomly chosen person. For example, a
collaborative filtering recommendation system for television tastes
could make predictions about which television show a user should like
given a partial list of that user's tastes (likes or dislikes).[3]
Note that these predictions are specific to the user, but use
information gleaned from many users. This differs from the simpler
approach of giving an average (non-specific) score for each item of
interest, for example based on its number of votes.
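A minimal sketch of user-based collaborative filtering, in the sense of the quote above, might look like the following. The ratings, the cosine similarity measure and the names `cosine` and `recommend` are assumptions chosen for illustration; they are one common way to do this, not the only one.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse rating dicts {item: rating}."""
    shared = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=3):
    """Score items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue                   # ignore users with no overlap in taste
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "Steve": {"orange juice": 5, "smoothie": 4},
    "Maria": {"smoothie": 5, "orange juice": 4, "iced tea": 4},
    "Juan": {"soda": 5, "energy drink": 4},
}
suggestions = recommend("Steve", ratings)
```

Steve's suggestions come from Maria, whose ratings overlap with his, and not from Juan, who shares no rated items with him.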

Asserting similarity based off likes

I need a way to basically check if one person is similar to another, based on what they like.
So if I log onto this website and jot down a few things I like, the service will go out and check which other people are similar to me. This won't be an "I like music and so do you" type of search, as it will need to take into account facts like "I like nature and you don't", with different levels of importance.
I've looked through Latent Semantic Analysis as well as Naive Bayes Classification, however these seem to be solutions for calculating probabilities rather than for classification.
The question "Categorizing Words and Category Values" seems like the right course for what I need, where likes may be classified into larger categories, however the answer there does not seem correct.

"Who to follow" algorithm

I want to give users the ability to view some personalized suggestions of users they might find interesting and might want to follow...
I was thinking of it like that:
- Get all users he is currently following
- Get all followers that they follow
- rank them by total posts they made (DESC) and by how many personal information fields they filled in
- show 5 of them on each page load
in case user has followers then an information message will appear...
Can this kind of feature be done with this algorithm or is there a better or even easier way to do it?
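The steps listed in the question can be sketched roughly as follows. The in-memory dictionaries stand in for whatever database the site uses, the function name `who_to_follow` is hypothetical, and only the post-count ranking is shown (the "filled in profile fields" criterion is left out).

```python
def who_to_follow(user, following, post_counts, limit=5):
    """Suggest accounts followed by the people `user` follows.

    following:   {user: set of users they follow}
    post_counts: {user: number of posts}, used as the ranking key
    """
    already = following.get(user, set())
    candidates = set()
    for followed in already:
        candidates |= following.get(followed, set())
    candidates -= already          # skip accounts the user already follows
    candidates.discard(user)       # never suggest the user themselves
    ranked = sorted(candidates, key=lambda u: (-post_counts.get(u, 0), u))
    return ranked[:limit]


# Hypothetical follow graph and post counts.
following = {
    "ann": {"bob", "cat"},
    "bob": {"cat", "dan", "eve"},
    "cat": {"ann", "eve", "fay"},
}
post_counts = {"dan": 12, "eve": 40, "fay": 3}
suggested = who_to_follow("ann", following, post_counts)
```

An empty result is the case where the site would show an information message instead of suggestions.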
In your algorithm, I'm wondering why you need to sort users based on number of posts, maybe it has something to do with reputation?
Recommendation is indeed a very large, open topic, and is also a hot academic research field. If we are working on a practical project, I think it is best to stay simple and focused.
I have seen the following two kinds of recommendations on a very popular social website. From my experience, the recommendation output is of high quality. Here I'm brainstorming the algorithms behind them. Hope it helps.
Discover people you might know: Recommend people whose 'following set' intersects with your 'following set'. This is based on the "clustering effect" of social networks: the friend of your friend is more likely to be your friend.
Recommend people based on interests: If the users could be celebrities, companies, institutions, press media, etc., then recommendations like the following might be useful: "People following @Linus also follow @Stallman, @LinuxDeveloper, ...". Suppose you've just followed @Linus. To recommend @Stallman and @LinuxDeveloper, we first find all users following @Linus, then work out the accounts they commonly follow, possibly ranked by number of followers. The idea is to recommend users based on interest correlations: we compute and surface highly correlated users, assuming that users' following lists are grouped by their interests.
(I'm also thinking that algorithm 1 will discover people who share common interests with you if users can be celebrities, etc. This might be preferred in some scenarios.)
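The "people following X also follow ..." idea can be sketched as a simple co-occurrence count. This is my guess at one way such a feature could work, not a description of any real site's implementation. The follow graph is invented, and candidates are ranked by how many of X's followers follow them rather than by their total follower count.

```python
from collections import Counter


def also_follow(target, following, top_n=3):
    """Rank accounts by how many followers of `target` also follow them."""
    counts = Counter()
    for user, followed in following.items():
        if target in followed:
            for other in followed:
                if other != target:
                    counts[other] += 1
    ranked = sorted(counts, key=lambda acct: (-counts[acct], acct))
    return ranked[:top_n]


# Hypothetical follow graph: {user: set of accounts they follow}.
following = {
    "u1": {"Linus", "Stallman", "LinuxDeveloper"},
    "u2": {"Linus", "Stallman"},
    "u3": {"Linus", "LinuxDeveloper", "Stallman"},
    "u4": {"Stallman", "SomeoneElse"},
}
related = also_follow("Linus", following)
```

Note that "SomeoneElse" is never suggested, because the only user following that account does not follow the target.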
You're asking a very open-ended question here - how to pick a small number of recommendations out of a large set. So the answer is - you can make it as simple or as complicated as you want it to be! The simplest would be to pick a few at random (and any more complex algorithm had better prove that it produces better results than that.) Your solution of gathering all users who are two hops away, and then ranking by number of posts, is just a bit more complex, and then at the other extreme are the sophisticated algorithms used by the Amazons and Googles of the world. Companies put a lot of effort into building this sort of thing - have you heard of the Netflix Prize?
As I understand it, you want to follow users who can offer high-quality information about your topic. We need an algorithm that returns such users, but how can we find them?
Users that have many followers are a good choice, but not always: many Twitter users follow others only out of respect or etiquette.
Users whose tweets are retweeted many times by other users are a good choice,
and so are users who are mentioned many times by other users.
I think that to find these users we should use link-based analysis, such as the HITS or PageRank algorithms.
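As a sketch of the link-analysis idea, here is a small power-iteration PageRank over a "who mentions or retweets whom" graph. The graph, the damping factor of 0.85 and the fixed iteration count are assumptions for illustration; HITS would need a different (hub/authority) update.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {node: set of nodes it points to}."""
    nodes = set(links)
    for targets in links.values():
        nodes |= set(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, set())
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: an edge u -> v means u mentioned or retweeted v.
mentions = {
    "a": {"c"},
    "b": {"c"},
    "c": {"a"},
    "d": {"c"},
}
scores = pagerank(mentions)
best = max(scores, key=scores.get)
```

Users who are mentioned by many others, and especially by other highly ranked users, end up with the highest scores.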
You may want to consider not including people that are following the given user. I imagine the user might not be so interested in them, and this could potentially be problematic. However, the user may be very interested in finding out more about the people they are following.
Are you considering showing the user the reason why these people were recommended to them? For example, saying something like "you may be interested in what little Billy is saying because of his connection to your wife". If so, to avoid angering users, it may be worth allowing them to opt out.
Other than that, it seems like a pretty good way of recommending users that someone would be interested in. The only other thing I can think of that might also help find people with similar interests is allowing users to tag posts, which would let you find users by similar interests or by what they are posting about.
One other, more problematic, thing you could look into is finding users by similar interest. For example, if person A is following person C, and person B is following person C, then maybe recommend person A to person B. This could make for some very lengthy queries if you are not careful, though.

Beyond item-to-item recommendations

Simple item-to-item recommendation systems are well-known and frequently implemented. An example is the Slope One algorithm. This is fine if the user hasn't rated many items yet, but once they have, I want to offer more finely-grained recommendations. Let's take a music recommendation system as an example, since they are quite popular. If a user is viewing a piece by Mozart, a suggestion for another Mozart piece or Beethoven might be given. But if the user has made many ratings on classical music, we might be able to make a correlation between the items and see that the user dislikes vocals or certain instruments. I'm assuming this would be a two-part process: the first part is to find correlations between each user's ratings, the second would be to build the recommendation matrix from these extra data. So the question is, are there any open-source implementations or papers that can be used for each of these steps?
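For reference, the Slope One scheme mentioned above can be sketched in a few lines. This is the weighted variant, written from the usual textbook description. The ratings and the function name `slope_one_predict` are invented for the example.

```python
def slope_one_predict(ratings, user, target):
    """Predict `user`'s rating of `target` with weighted Slope One.

    ratings: {user: {item: rating}}
    Returns None when no other user co-rated `target` with one of the
    user's items.
    """
    numerator = 0.0
    denominator = 0
    for item, user_rating in ratings[user].items():
        if item == target:
            continue
        # Rating differences (target - item) over users who rated both.
        diffs = [
            r[target] - r[item]
            for r in ratings.values()
            if target in r and item in r
        ]
        if diffs:
            deviation = sum(diffs) / len(diffs)
            # Weight each item's vote by the number of co-ratings behind it.
            numerator += (deviation + user_rating) * len(diffs)
            denominator += len(diffs)
    return numerator / denominator if denominator else None


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"mozart": 5, "beethoven": 3, "bach": 2},
    "bob": {"mozart": 3, "beethoven": 4},
    "carol": {"beethoven": 2, "bach": 5},
}
prediction = slope_one_predict(ratings, "bob", "bach")
```

The prediction only uses average rating differences between item pairs, which is why it cannot capture attribute-level tastes such as "dislikes vocals".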
Taste may have something useful. It's moved to the Mahout project:
http://taste.sourceforge.net/
In general, the idea is that given a user's past preferences, you want to predict what they'll select next and recommend it. You build a machine-learning model in which the inputs are what a user has picked in the past and the attributes of each pick. The output is the item(s) they'll pick. You create training data by holding back some of their choices, and using their history to predict the data you held back.
There are lots of different machine learning models you can use. Decision trees are common.
One answer is that any recommender system ought to have some of the properties you describe. Initially, recommendations aren't so good and are all over the place. As it learns tastes, the recommendations will come from the area the user likes.
But, the collaborative filtering process you describe is fundamentally not trying to solve the problem you are trying to solve. It is based on user ratings, and two songs aren't rated similarly because they are similar songs -- they're rated similarly just because similar people like them.
What you really need is to define your notion of song-song similarity. Is it based on how the song sounds? the composer? Because it sounds like the notion is not based on ratings, actually. That is 80% of the problem you are trying to solve.
I think the question you are really asking is, what items are most similar to a given item? Given your item similarity, that's an easier problem than recommendation.
Mahout can help with all of these things, except song-song similarity based on its audio -- or at least provide a start and framework for your solution.
There are two techniques that I can think of:
Train a feed-forward artificial neural net using Backpropagation or one of its successors (e.g. Resilient Propagation).
Use version space learning. This starts with the most general and the most specific hypotheses about what the user likes and narrows them down when new examples are integrated. You can use a hierarchy of terms to describe concepts.
Common characteristics of these methods are:
You need a different function for each user. This pretty much rules out efficient database queries when searching for recommendations.
The function can be updated on the fly when the user votes for an item.
The dimensions along which you classify the input data (e.g. has vocals, beats per minute, musical scales, whatever) are very critical to the quality of the classification.
Please note that these suggestions come from university courses in knowledge based systems and artificial neural nets, not from practical experience.

How the computer knows "Recommended for You"?

Recently, I found that several web sites have something like "Recommended for You", for example YouTube or Facebook. The web site can study my usage behavior and recommend some content for me... I would like to know how they analyze this information. Is there any Algorithm to do so? Thank you.
Amazon and Netflix (among others) use a technique called Collaborative filtering to suggest things you might like based on the likes/dislikes of others who have made purchases and selections similar to yours.
Is there any Algorithm to do so?
Yes
Yes. One fairly common one is to look at things you've selected in the past, find other people who've made those selections, then find the other selections most common among those other people, and guess that you're likely to be interested in those as well.
Yup there are lots of algorithms. Things such as k-nearest neighbor: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm.
Here is a pretty good book on the subject that covers making these sorts of systems along with others: http://www.amazon.com/gp/product/0596529325?ie=UTF8&tag=ianburriscom-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0596529325.
It's generally done by matching you with other users who have similar usage history / profile and then recommending other things that they've purchased/watched/whatever.
Searching for "recommendation algorithm" yields lots of papers. Most algorithms incorporate "machine learning" algorithms to determine groups of things (comedy movies, books on gardening, orchestral music, etc.). Your matching with those groups yields recommendations. Some companies use humans to classify things, too.
Such an algorithm is going to vary wildly from company to company. In many cases, it analyzes some combination of your search history, purchase history, physical location, and other factors. It probably will also compare purchases/searches amongst other people to find what those people have purchased/searched for, and recommend some of those products to you.
There are probably hundreds of these algorithms out there, but I doubt you can use any of them (that are actually good). Probably you are better off figuring it out yourself.
If you can categorize your contents (i.e. by tagging or content analysis), you can also categorize your users and their preferences.
For example: you have a video portal with 5 million videos, 1 million of which are tagged mostly red. If 80% of all videos watched by a user (who is defined by an IP, a persistent user account, ...) are tagged mostly red, you might want to recommend even more red videos to him. You might want to refine your recommendations by looking at his further actions: does he like your recommendations -- if so, why not give him even more; if not, try the second-best guess, maybe he's not looking for color, but for the background music ...
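The tagging example above could be sketched like this. It assumes one tag per video, which is a deliberate simplification, and uses the 80% figure from the example as the threshold. The catalog and the function names are hypothetical.

```python
from collections import Counter


def dominant_tag(watched_tags, threshold=0.8):
    """Return the tag on at least `threshold` of watched videos, else None.

    watched_tags: one tag per watched video (a simplifying assumption;
    real videos would carry several tags).
    """
    if not watched_tags:
        return None
    tag, count = Counter(watched_tags).most_common(1)[0]
    return tag if count / len(watched_tags) >= threshold else None


def recommend_by_tag(watched, catalog, threshold=0.8, limit=3):
    """Suggest unwatched videos carrying the user's dominant tag."""
    tag = dominant_tag([catalog[v] for v in watched], threshold)
    if tag is None:
        return []
    unseen = [v for v in sorted(catalog) if catalog[v] == tag and v not in watched]
    return unseen[:limit]


# Hypothetical catalog: {video id: tag}.
catalog = {
    "v1": "red", "v2": "red", "v3": "red", "v4": "red",
    "v5": "blue", "v6": "red", "v7": "red", "v8": "blue",
}
watched = ["v1", "v2", "v3", "v4", "v5"]
picks = recommend_by_tag(watched, catalog)
```

The refinement step described above (falling back to a second-best guess when the user ignores the suggestions) would sit on top of this, for example by tracking which picks were actually watched.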
There's no absolute algorithm to do it, but all implementations will go in a similar direction. It's always based on observing users, which scares me from time to time :-)
There's a whole lot of algorithms tackling the issue: Wiki article. It's a machine learning domain problem. Computers can be taught using two main techniques: classification and clustering. They require some datasets as input. If the dataset is informative (really holds some useful patterns), then those ML techniques can dig most of it out.
Clustering could be the best fit for this kind of problem. Its main use is to find similarities among points in the provided dataset. If the points are, e.g., your search history, they can be grouped together to form certain clusters. If your search history closely relates to another's, a hint can be given by picking links that are most similar to yours.
The same goes for book recommendations - it's obvious what dataset they use: "Other people who bought this product also bought Product A, Product B,...". The key here is to match your profile to others' and use the most similar ones to recommend.
The computer retrieves information from the human brain with complex memory scan process, sorts it accordingly and outputs results based on what you have experienced in your life so far.

Resources