I need some help because I don't know which algorithm I could use for the following (I use Python):
Steve is 25 and he buys orange juice every day.
Maria is 23 and she likes to buy smoothies.
Steve and Maria's tastes are pretty much the same.
Juan is 16 and he only drinks sodas.
Juan's tastes are not the same as Steve's and Maria's.
I would like to use a matching algorithm that detects users who have the same drink preference and a similar age. To continue with the example, Steve and Maria would be matched together, but not Juan. Which one should I use?

I agree with @klutt that your task is pretty vague. Two approaches come to mind, but without more details about your problem there is a limit to how specific I can be. I am interpreting the question as if you are taking in raw text and might want to process more sentences with a very similar semantic and syntactic structure.
An algorithmic approach:
Assuming that your word choices are static in their semantic meaning (Maria is 23 ... Steve is 25), we can parse each sentence, identify tokens like "is", "and" or "same", and essentially perform lexical analysis on the text. From here, you could continue thinking about how you would go about matching and so forth, but this is rather complicated.
Neural Network approach:
If you are taking in raw text in the form of sentences, it's a problem that's not straightforward to solve using a top-down algorithmic approach.
You could take an approach with neural networks that trains a model to solve your problem, but then again what you seem to be asking is quite complex since there are multiple "facts" within each sentence that are not semantically related. For example, your second sentence identifies that Maria is 23 but at the end of that sentence there is a comparison between Steve and Maria. And your first sentence only identifies Steve as 25.
Even if you chunk raw text into sentences, you would have to have a very fine tuned neural network architecture and a lot of training data to get remotely close to your goal.
Now, both of those solutions are very complex. But if you wanted to create an application that collects this data (via a form or prompt) and puts it into a structured format (like a JSON or XML object) to organize and store the data in memory (perhaps writing out to a database or file for persistent storage), that might be a good route to go down.
This can serve as a good lesson in how to think about data as well. It is one thing if you have a pool of thousands of sentences, just raw data that you need to organize for quantitative purposes (classic qualitative -> quantitative problems). It is another thing if you are going to be collecting this data. If you are going to be collecting data, having a program that collects and organizes names, ages, and drink preferences (and then organizes that data within certain data structures), then we can talk about matching algorithms.
I will also add here that if you do have structured data, Collaborative filtering (mentioned by Shridhar) is a great starting place.
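Once the data is structured, the matching itself can be quite small. Here is a minimal sketch, not a definitive solution: the record layout, the `DRINK_CATEGORY` grouping and the `MAX_AGE_GAP` threshold are all assumptions made up for illustration.

```python
from itertools import combinations

# Hypothetical structured records, e.g. collected through a form.
users = [
    {"name": "Steve", "age": 25, "drink": "orange juice"},
    {"name": "Maria", "age": 23, "drink": "smoothie"},
    {"name": "Juan", "age": 16, "drink": "soda"},
]

# Assumed grouping of drinks into taste categories; adjust to your data.
DRINK_CATEGORY = {
    "orange juice": "fruit",
    "smoothie": "fruit",
    "soda": "soft drink",
}

MAX_AGE_GAP = 5  # assumed threshold for a "close" age


def is_match(a, b):
    """Two users match if their drinks share a category and their ages are close."""
    same_taste = DRINK_CATEGORY[a["drink"]] == DRINK_CATEGORY[b["drink"]]
    close_age = abs(a["age"] - b["age"]) <= MAX_AGE_GAP
    return same_taste and close_age


def find_matches(people):
    """Return the names of every pair of users that match."""
    return [(a["name"], b["name"]) for a, b in combinations(people, 2) if is_match(a, b)]


print(find_matches(users))  # [('Steve', 'Maria')]
```

Comparing every pair is fine for a handful of users; with many users you would probably want to bucket them by category first.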

Collaborative filtering best suits your needs.
In the newer, narrower sense, collaborative filtering is a method of
making automatic predictions (filtering) about the interests of a user
by collecting preferences or taste information from many users
(collaborating). The underlying assumption of the collaborative
filtering approach is that if a person A has the same opinion as a
person B on an issue, A is more likely to have B's opinion on a
different issue than that of a randomly chosen person. For example, a
collaborative filtering recommendation system for television tastes
could make predictions about which television show a user should like
given a partial list of that user's tastes (likes or dislikes).[3]
Note that these predictions are specific to the user, but use
information gleaned from many users. This differs from the simpler
approach of giving an average (non-specific) score for each item of
interest, for example based on its number of votes.
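To make the idea concrete, here is a rough sketch of user-based collaborative filtering in plain Python. The `ratings` data is invented, and cosine similarity over shared items is just one possible choice of similarity measure, not the only one.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: user -> {item: rating}.
ratings = {
    "Steve": {"orange juice": 5, "smoothie": 4, "soda": 1},
    "Maria": {"orange juice": 4, "smoothie": 5, "soda": 2, "iced tea": 5},
    "Juan": {"orange juice": 1, "smoothie": 2, "soda": 5, "iced tea": 1},
}


def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict(user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    weighted, total = 0.0, 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted += sim * other_ratings[item]
        total += sim
    return weighted / total if total else None


# Steve has not rated iced tea; the prediction is pulled toward Maria's
# rating because Maria is the more similar user.
print(predict("Steve", "iced tea"))
```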


How to count votes for qualitative survey answers

We are creating a website for a client that wants a website based around a survey of peoples' '10 favourite things'. There are 10 questions that each user must answer, e.g. 'What is your favourite colour', 'Who is your favourite celebrity', etc., and then the results are collated into a global Top 10 list on the home page.
The conundrum lies in both allowing the user to input anything they want, e.g. their favourite holiday destination might be 'Grandma's house', and being able to count the votes accurately, e.g. User A might say their favourite celebrity is 'The Queen' and User B might say it's 'Queen of England' - we need those two answers to be counted as two votes for the same 'thing'.
If we force the user to choose from a large but predetermined list for each question, it restricts users' ability to define literally anything as their 'favourite thing'. Whereas, if we have a plain text input field and try to interpret answers after they have been submitted, it's going to be much more difficult to count votes where there are variations in names or spelling for the same answer.
Is it possible to automatically moderate their answers in real-time through some form of search phrase suggestion engine? How can we make sure that, if a plain text field is the input method, we make allowances for variations in spelling?
If anyone has any ideas as to possible solutions to this functionality, perhaps a piece of software, a plugin, an API, anything, then please do let us know.
Thank you and please just ask for any clarification.
If you want to automate counting "The Queen" and "The Queen of England", you're in for work that might be more complex than it's worth for a "fun little survey". If the volume is light enough, consider just manually counting the results. Just to give you a feeling, what if someone enters "The Queen of Sweden" or "Queen Latifah Concerts"?
If you really want to go down that route, look into Natural Language Processing (NLP). Specifically, the field of categorization.
For a general introduction to NLP, I recommend the relevant Wikipedia article.
RapidMiner is an open source NLP solution that would be worth looking into.
As Eric J said, this is getting into cutting-edge NLP applications. These are fields of study that are very important for AI/automation researchers and computer science in general, but are still very fledgling. There are a number of programs and algorithms you can use, the drawbacks and benefits of which vary widely. RapidMiner is good, WordNet is widely used in medical applications and should be relatively easy to adjust to your own corpus, and there are more advanced methods like latent Dirichlet allocation. Here are a few resources you should start with (in addition to the Wikipedia article provided above):
http://marimba.d.umn.edu/ (try the SenseClusters calculator)
The best way to classify short answers is k-means clustering. First apply stemming. Then convert the words into indexes using an elementary dictionary. You can use EverGroingDictionary.cs from sematicsearchart.com. After a phrase is passed through the dictionary, it is converted to a sequence of numbers, i.e. a vector. Define the measure of proximity as the number of words two answers have in common and apply k-means, which is a lightning-fast algorithm. k-means will organize all answers into groups. The most frequent words in each group will be the signature of the group. Your whole program in C++, C# or Java should be less than 1000 lines.
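As a much simpler starting point than k-means, here is a sketch of greedy grouping by word overlap (Jaccard similarity on token sets). This is a swapped-in technique, not the k-means approach described above; the stop-word list and the `threshold` value are assumptions you would need to tune.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "of", "a", "an"}  # assumed minimal list


def tokens(answer):
    """Lowercase the answer and keep its non-stop-word tokens as a set."""
    words = re.findall(r"[a-z0-9']+", answer.lower())
    return frozenset(w for w in words if w not in STOP_WORDS)


def overlap(a, b):
    """Jaccard similarity between two token sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def count_votes(answers, threshold=0.5):
    """Greedily assign each answer to the first group whose label overlaps enough."""
    groups = []  # list of (label, token set of the label)
    votes = Counter()
    for answer in answers:
        toks = tokens(answer)
        for label, label_toks in groups:
            if overlap(toks, label_toks) >= threshold:
                votes[label] += 1
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append((answer, toks))
            votes[answer] += 1
    return votes


answers = ["The Queen", "Queen of England", "the queen", "David Beckham", "Beckham, David"]
print(count_votes(answers).most_common())
# [('The Queen', 3), ('David Beckham', 2)]
```

Note that with this threshold "Queen of Sweden" would also be counted as a vote for "The Queen", which is exactly the ambiguity raised earlier in the thread, so some manual moderation is probably still needed.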

Beyond item-to-item recommendations

Simple item-to-item recommendation systems are well-known and frequently implemented. An example is the Slope One algorithm. This is fine if the user hasn't rated many items yet, but once they have, I want to offer more finely-grained recommendations. Let's take a music recommendation system as an example, since they are quite popular. If a user is viewing a piece by Mozart, a suggestion for another Mozart piece or Beethoven might be given. But if the user has made many ratings on classical music, we might be able to make a correlation between the items and see that the user dislikes vocals or certain instruments. I'm assuming this would be a two-part process: the first part is to find correlations between each user's ratings, the second would be to build the recommendation matrix from this extra data. So the question is, are there any open-source implementations or papers that can be used for each of these steps?
Taste may have something useful. It has moved to the Mahout project.
In general, the idea is that given a user's past preferences, you want to predict what they'll select next and recommend it. You build a machine-learning model in which the inputs are what a user has picked in the past and the attributes of each pick. The output is the item(s) they'll pick. You create training data by holding back some of their choices, and using their history to predict the data you held back.
Lots of different machine learning models you can use. Decision trees are common.
One answer is that any recommender system ought to have some of the properties you describe. Initially, recommendations aren't so good and are all over the place. As it learns tastes, the recommendations will come from the area the user likes.
But, the collaborative filtering process you describe is fundamentally not trying to solve the problem you are trying to solve. It is based on user ratings, and two songs aren't rated similarly because they are similar songs -- they're rated similarly just because similar people like them.
What you really need is to define your notion of song-song similarity. Is it based on how the song sounds? the composer? Because it sounds like the notion is not based on ratings, actually. That is 80% of the problem you are trying to solve.
I think the question you are really answering is, what items are most similar to a given item? Given your item similarity, that's an easier problem than recommendation.
Mahout can help with all of these things, except song-song similarity based on its audio -- or at least provide a start and framework for your solution.
There are two techniques that I can think of:
Train a feed-forward artificial neural net using Backpropagation or one of its successors (e.g. Resilient Propagation).
Use version space learning. This starts with the most general and the most specific hypotheses about what the user likes and narrows them down when new examples are integrated. You can use a hierarchy of terms to describe concepts.
Common characteristics of these methods are:
You need a different function for each user. This pretty much rules out efficient database queries when searching for recommendations.
The function can be updated on the fly when the user votes for an item.
The dimensions along which you classify the input data (e.g. has vocals, beats per minute, musical scales, whatever) are very critical to the quality of the classification.
Please note that these suggestions come from university courses in knowledge based systems and artificial neural nets, not from practical experience.

Media recommendation engine - Single user system - How to start

I want to implement a media recommendation engine. I saw similar posts on this, but I think my requirements are a bit different from those, so I am posting here.
Here is the deal.
I want to implement a recommendation engine for media players like VLC, which would be an engine that has to care for only a single user. That is, it would be embedded in a media player on a PC which is typically used by a single user. It will start learning the likes and dislikes of the user and gradually learn what the user likes. It will not be able to find similar users and use their data for recommendations, as it's a single-user system. So how to go about this?
Or you can consider it as a recommendation engine that has to be put in say iPods, which has to learn about a single user and recommend music/Movies from the collections it has.
I thought of collecting the genre of the music/movies (maybe even the artist name) that the user watches and recommending movies from the most watched genre, but that looks very crude, doesn't it?
So are there any algorithms I can use or any resources I can refer to?
MicroKernel :)
What you're trying to do is quite challenging, particularly because it's still in the research stage and a lot of PhDs from reputable universities across the world are trying to find a good solution for it.
So here are some things that you might need:
Data that you can analyze:
Lots, and lots, and lots of data!
It could be meta data about the media (name, duration, title, author, style, etc.)
Or you can try to do some crazy feature extraction from the media itself.
References to correlate the data to.
Since you can't get other users, you always need the user feedback.
If you don't want to annoy your user to death with feedback questions, then make your application connect to a central server so you can compare users.
An algorithm that can model your data sufficiently well.
If you have no experience at all, then try k-nearest neighbor (the simplest one).
Collaborative filtering
Pearson Correlation
Matrix Factorization/Decomposition
Singular value decomposition (SVD)
Ensemble learning <-- Allows you to combine multiple algorithms and take advantage of their strengths.
The winners of the Netflix prize said this:
Predictive accuracy is substantially
improved when blending multiple
predictors. Our experience is that
most efforts should be concentrated in
deriving substantially different
approaches, rather than refining a
single technique. Consequently, our
solution is an ensemble of many
There is no silver bullet for recommendation engines and it takes years of exploration to find a good combination of algorithms that produce sufficient results. :)
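For the single-user case, one crude way to get started with metadata plus user feedback is sketched below. Everything here is hypothetical: the `library` metadata, the +1/-1 feedback convention, and scoring by a plain sum of attribute weights. It is only meant to show the shape of a content-based profile, not a tuned recommender.

```python
from collections import defaultdict

# Hypothetical library metadata: title -> attributes.
library = {
    "Track A": {"genre": "jazz", "artist": "Artist 1"},
    "Track B": {"genre": "jazz", "artist": "Artist 2"},
    "Track C": {"genre": "metal", "artist": "Artist 3"},
    "Track D": {"genre": "metal", "artist": "Artist 1"},
}

# Assumed implicit feedback: +1 when played to the end, -1 when skipped.
history = [("Track A", +1), ("Track C", -1)]


def build_profile(events):
    """Accumulate a weight for every (attribute, value) pair the user reacted to."""
    profile = defaultdict(float)
    for title, feedback in events:
        for pair in library[title].items():
            profile[pair] += feedback
    return profile


def recommend(events, top_n=2):
    """Rank items the user has not played yet by the sum of their attribute weights."""
    profile = build_profile(events)
    seen = {title for title, _ in events}
    scores = {
        title: sum(profile[pair] for pair in attrs.items())
        for title, attrs in library.items()
        if title not in seen
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend(history))  # ['Track B', 'Track D']
```

Because the profile can be rebuilt or updated after every play or skip, the engine keeps learning without needing data from any other user.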

Is it possible to guess a user's mood based on the structure of text?

I assume a natural language processor would need to be used to parse the text itself, but what suggestions do you have for an algorithm to detect a user's mood based on text that they have written? I doubt it would be very accurate, but I'm still interested nonetheless.
EDIT: I am by no means an expert on linguistics or natural language processing, so I apologize if this question is too general or stupid.
This is the basis of an area of natural language processing called sentiment analysis. Although your question is general, it's certainly not stupid - this sort of research is done by Amazon on the text in product reviews for example.
If you are serious about this, then a simple version could be achieved by -
Acquire a corpus of positive/negative sentiment. If this was a professional project you may take some time and manually annotate a corpus yourself, but if you were in a hurry or just wanted to experiment this at first then I'd suggest looking at the sentiment polarity corpus from Bo Pang and Lillian Lee's research. The issue with using that corpus is it is not tailored to your domain (specifically, the corpus uses movie reviews), but it should still be applicable.
Split your dataset into sentences labelled either Positive or Negative. For the sentiment polarity corpus you could split each review into its constituent sentences and then apply the overall sentiment polarity tag (positive or negative) to all of those sentences. Split this corpus into two parts - 90% should be for training, 10% should be for testing. If you're using Weka then it can handle the splitting of the corpus for you.
Apply a machine learning algorithm (such as SVM, Naive Bayes, Maximum Entropy) to the training corpus at a word level. This model is called a bag of words model, which is just representing the sentence as the words that it's composed of. This is the same model which many spam filters run on. For a nice introduction to machine learning algorithms there is an application called Weka that implements a range of these algorithms and gives you a GUI to play with them. You can then test the performance of the machine learned model from the errors made when attempting to classify your test corpus with this model.
Apply this machine learning algorithm to your user posts. For each user post, separate the post into sentences and then classify them using your machine learned model.
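The steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the four training sentences are made up (a real project would use an annotated corpus such as the one mentioned above), and the classifier is a hand-rolled multinomial Naive Bayes with add-one smoothing rather than Weka or any other library.

```python
import math
import re
from collections import Counter

# Tiny made-up training set; a real project would use an annotated corpus.
train = [
    ("i love this it is great", "pos"),
    ("what a wonderful happy day", "pos"),
    ("i hate this it is awful", "neg"),
    ("what a terrible sad day", "neg"),
]


def tokenize(text):
    """Lowercase the text and split it into word tokens (bag of words)."""
    return re.findall(r"[a-z']+", text.lower())


def fit(examples):
    """Count words per label for a multinomial Naive Bayes model."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = fit(train)
print(classify("I HATE you", model))        # neg
print(classify("such a happy day", model))  # pos
```

To evaluate it the way the steps describe, you would hold back roughly 10% of the labelled sentences and count how many of them `classify` gets wrong.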
So yes, if you are serious about this then it is achievable - even without past experience in computational linguistics. It would be a fair amount of work, but even with word based models good results can be achieved.
If you need more help feel free to contact me - I'm always happy to help others interested in NLP =]
Small Notes -
Merely splitting a segment of text into sentences is a field of NLP - called sentence boundary detection. There are a number of tools, OSS or free, available to do this, but for your task a simple split on whitespaces and punctuation should be fine.
SVMlight is another machine learner to consider, and in fact their inductive SVM example does a similar task to what we're looking at - trying to classify which Reuters articles are about "corporate acquisitions" with 1000 positive and 1000 negative examples.
Turning the sentences into features to classify over may take some work. In this model each word is a feature - this requires tokenizing the sentence, which means separating words and punctuation from each other. Another tip is to lowercase all the separate word tokens so that "I HATE you" and "I hate YOU" both end up being considered the same. With more data you could try and also include whether capitalization helps in classifying whether someone is angry, but I believe words should be sufficient at least for an initial effort.
I just discovered LingPipe that in fact has a tutorial on sentiment analysis using the Bo Pang and Lillian Lee Sentiment Polarity corpus I was talking about. If you use Java that may be an excellent tool to use, and even if not it goes through all of the steps I discussed above.
No doubt it is possible to judge a user's mood based on the text they type but it would be no trivial thing. Things that I can think of:
Capitals tend to signify agitation, annoyance or frustration and are certainly an emotional response, but then again some newbies do that because they don't realize the significance, so you couldn't assume that without looking at what else they've written (to make sure it's not all in caps);
Capitals are really just one form of emphasis. Others are use of certain aggressive colours (eg red) or use of bold or larger fonts;
Some people make more spelling and grammar mistakes and typos when they're highly emotional;
Scanning for emoticons could give you a very clear picture of what the user is feeling but again something like :) could be interpreted as happy, "I told you so" or even have a sarcastic meaning;
Use of expletives tends to have a clear meaning, but again it's not clear-cut. Colloquial speech by many people will routinely contain certain four-letter words. Some other people might not even say "hell", saying "heck" instead, so any expletive (even "sucks") is significant;
Groups of punctuation marks (like ##$#$#) tend to be substituted for expletives in a context where expletives aren't necessarily appropriate, so that's less likely to be colloquial;
Exclamation marks can indicate surprise, shock or exasperation.
You might want to look at Advances in written text analysis or even Determining Mood for a Blog by Combining Multiple Sources of Evidence.
Lastly it's worth noting that written text is usually perceived to be more negative than it actually is. This is a common problem with email communication in companies, just as one example.
I can't believe I'm taking this seriously... assuming a one-dimensional mood space:
If the text contains a curse word,
-10 mood.
I think exclamations would tend to be negative, so -2 mood.
When I get frustrated, I type in
Very. Short. Sentences. -5 mood.
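For what it's worth, those three rules could be written roughly like this. The curse-word list is a placeholder, and the reading of "very short sentences" as an average of two words or fewer per sentence is my own assumption, as is applying each penalty at most once.

```python
import re

CURSE_WORDS = {"damn", "hell"}  # assumed placeholder list


def mood_score(text):
    """Apply the three penalties above; 0 is neutral, lower is a worse mood."""
    score = 0
    words = re.findall(r"[a-z']+", text.lower())
    if any(word in CURSE_WORDS for word in words):
        score -= 10
    if "!" in text:
        score -= 2
    # Assumption: "very short sentences" means two words or fewer on average.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if sentences and len(words) / len(sentences) <= 2:
        score -= 5
    return score


print(mood_score("Very. Short. Sentences."))             # -5
print(mood_score("Damn, this is broken!"))               # -12
print(mood_score("Thanks, that worked nicely for me."))  # 0
```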
The more I think about this, the more it's clear that a lot of these signifiers indicate extreme mood in general, but it's not always clear what kind of mood.
If you support fonts, bold red text is probably an angry user. Green regular sized texts with butterfly clip art a happy one.
My memory isn't good on this subject, but I believe I saw some research about the grammar structure of the text and the overall tone. That could be also as simple as shorter words and emotion expression words (well, expletives are pretty obvious).
Edit: I noticed that the first person to answer had a substantially similar post. There could indeed be a serious idea behind shorter sentences.
Analysis of mood and behavior is very serious science. Despite the other answers mocking the question, law enforcement agencies have been investigating categorization of mood for years. The uses in computers that I have heard of generally had more context (timing information, voice pattern, speed in changing channels). I think that you could, with some success, determine if a user is in a particular mood by training a neural network with samples from two known groups: angry and not angry. Good luck with your efforts.
My algorithm is rather straightforward: why not count the smileys in the text, :) vs :(
Obviously, the text ":) :) :) :)" resolves to a happy user, while ":( :( :(" will surely resolve to a sad one. Enjoy!
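A minimal sketch of that counting idea (the function name and the "unknown" fallback for ties are my own choices):

```python
def emoticon_mood(text):
    """Compare the number of happy and sad smileys in the text."""
    happy = text.count(":)")
    sad = text.count(":(")
    if happy > sad:
        return "happy"
    if sad > happy:
        return "sad"
    return "unknown"


print(emoticon_mood(":) :) :) :)"))  # happy
print(emoticon_mood(":( :( :("))     # sad
```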
I agree with ojblass that this is a serious question.
Mood categorization is currently a hot topic in the speech recognition area. If you think about it, an interactive voice response (IVR) application needs to handle angry customers far differently than calm ones: angry people should be routed quickly to human operators with the right experience and training. Vocal tone is a pretty reliable indicator of emotion, practical enough so that companies are eager to get this to work. Google "speech emotion recognition", or read this article to find out more.
The situation should be no different in web-based GUIs. Referring back to cletus's comments, the analogies between text and speech emotion detection are interesting. If a person types CAPITALS they are said to be 'shouting', just as if his voice rose in volume and pitch using a voice interface. Detecting typed profanities is analogous to "keyword spotting" of profanity in speech systems. If a person is upset, they'll make more errors using either a GUI or a voice user interface (VUI) and can be routed to a human.
There's a "multimodal" emotion detection research area here. Imagine a web interface that you can also speak to (along the lines of the IBM/Motorola/Opera XHTML + Voice Profile prototype implementation). Emotion detection could be based on a combination of cues from the speech and visual input modality.
Whether or not you can do it is another story. The problem seems at first to be AI complete.
Now then, if you had keystroke timings you should be able to figure it out.
Fuzzy logic will do I guess.
Anyway, it will be quite easy to start with several rules for determining the user's mood and then extend and combine the "engine" with more accurate and sophisticated ones.

Algorithms to find stuff a user would like based on other users likes

I'm thinking of writing an app to classify movies in an HTPC based on what the family members like.
I don't know statistics or AI, but the stuff here looks very juicy. I wouldn't know where to start, though.
Here's what I want to accomplish:
Compose a set of samples from each users likes, rating each sample attribute separately. For example, maybe a user likes western movies a lot, so the western genre would carry a bit more weight for that user (and so on for other attributes, like actors, director, etc).
A user can get suggestions based on the likes of the other users. For example, if both user A and B like Spielberg (connection between the users), and user B loves Batman Begins, but user A loathes Katie Holmes, weigh the movie for user A accordingly (again, each attribute separately, for example, maybe user A doesn't like action movies so much, so bring the rating down a bit, and since Katie Holmes isn't the main star, don't take that into account as much as the other attributes).
Basically, comparing sets from user A similar to sets from user B, and come up with a rating for user A.
I have a crude idea about how to implement this, but I'm certain some bright minds have already thought of a far better solution, so... any suggestions?
Actually, after a quick research, it seems a Bayesian filter would work. If so, would this be the better approach? Would it be as simple as just "normalizing" movie data, training a classifier for each user, and then just classify each movie?
If your suggestion includes some brain melting concepts (I'm not experienced in these subjects, specially in AI), I'd appreciate it if you also included a list of some basics for me to research before diving into the meaty stuff.
Matthew Podwysocki had some interesting articles on this stuff
This is similar to this question where the OP wanted to build a recommendation system. In a nutshell, we are given a set of training data consisting of users' ratings of movies (a 1-5 star rating, for example) and a set of attributes for each movie (year, genre, actors, ...). We want to build a recommender that outputs a predicted rating for unseen movies. So the input data looks like:
user movie year genre ... | rating
1 1 2006 action | 5
3 2 2008 drama | 3.5
and for an unrated movie X:
10 20 2009 drama ?
we want to predict a rating. Doing this for all unseen movies then sorting by predicted movie rating and outputting the top 10 gives you a recommendation system.
The simplest approach is to use a k-nearest neighbor algorithm. Among the rated movies, search for the "closest" ones to movie X, and combine their ratings to produce a prediction.
This approach has the advantage of being very simple and easy to implement from scratch.
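As a rough sketch of the nearest-neighbor idea, assuming each movie is described only by year and genre: the `rated` data and the `distance` function below are invented for illustration, and choosing a good distance is really the hard part.

```python
# Hypothetical rated movies: (year, genre, rating given by the user).
rated = [
    (2006, "action", 5.0),
    (2008, "drama", 3.5),
    (2009, "drama", 4.0),
    (1995, "comedy", 2.0),
]


def distance(a, b):
    """Assumed distance: scaled year gap plus 1 if the genres differ."""
    year_gap = abs(a[0] - b[0]) / 10.0
    genre_gap = 0.0 if a[1] == b[1] else 1.0
    return year_gap + genre_gap


def predict_rating(movie, k=2):
    """Average the ratings of the k rated movies closest to `movie`."""
    nearest = sorted(rated, key=lambda m: distance(movie, m))[:k]
    return sum(m[2] for m in nearest) / len(nearest)


# An unrated 2009 drama: its two nearest neighbors are the two rated dramas.
print(predict_rating((2009, "drama")))  # 3.75
```

Running `predict_rating` over every unseen movie and sorting by the result gives the top-10 list described above.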
Other, more sophisticated approaches exist. For example, you can build a decision tree, which fits a set of rules to the training data. You can also use Bayesian networks, artificial neural networks, support vector machines, among many others. Going through each of these won't be easy for someone without the proper background.
Still, I expect you would be using an external tool/library. Now, you seem to be familiar with Bayesian networks, so a simple naive Bayes net could in fact be very powerful. One advantage is that it allows for prediction under missing data.
The main idea would be somewhat the same; take the input data you have, train a model, then use it to predict the class of new instances.
If you want to play around with different algorithms in simple intuitive package which requires no programming, I suggest you take a look at Weka (my 1st choice), Orange, or RapidMiner. The most difficult part would be to prepare the dataset to the required format. The rest is as easy as choosing what algorithm and applying it (all in a few clicks!)
I guess for someone not looking to go into too much detail, I would recommend going with the nearest neighbor method, as it is intuitive and easy to implement. Still, the option of using Weka (or one of the other tools) is worth looking into.
There are a few algorithms that are good for this:
ARTMAP: groups via probability against each other (this isn't fast, but it's the best thing for your problem IMO)
ARTMAP holds a group of common attributes and determines likelihood of similarity via percentages.
KMeans: This separates the vectors by their distance from each other
KMeans: Wikipedia
PCA: will separate the average of all the values from the varying bits. This is what you would use for face detection and background subtraction in computer vision.
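To give a feel for the KMeans option, here is a bare-bones sketch in plain Python. The example points (per-user ratings for two genres) are invented, and seeding the centroids with the first k points is a simplification; real implementations usually pick starting centroids more carefully.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on tuples of numbers, seeded with the first k points."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters


# Hypothetical (western rating, action rating) vectors for four family members.
tastes = [(5, 1), (1, 5), (4, 2), (2, 4)]
print(kmeans(tastes, k=2))
# [[(5, 1), (4, 2)], [(1, 5), (2, 4)]]
```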
The K-nearest neighbor algorithm may be right up your alley.
Check out some of the work of the top teams for the Netflix prize.
