How can I recommend content (say movies) based on multiple parameters - apache-spark-mllib

I am new to Spark ML. I want to recommend movies to users using Apache Spark ML. I learned here that we can recommend movies based on rating by a user.
My question is: can we include other features in the recommendation, say the user's age, country, the movie's genre, likes, etc.?
For example, we have users 'U1', 'U2', 'U3' and 'U4', all of whom watched movie 'M'. U1, aged 22 and living in America, also watched movies M1, M2 and M3, while U2, aged 50 and living in Australia, watched movies M1, M4 and M5.
Now I would like to recommend movies M1, M2 and M3 to U3, aged 24 and living in America, and recommend movies M1, M4 and M5 to U4, aged 21 and living in Australia.
Basically, I want to provide some weights to age and country.
How can we achieve this with Spark ML (say, using ALS)?

This problem of introducing context information can be addressed in a few different ways:
If you have to stick to Spark and a 2D (user x item) recommender, go for contextual pre/post filtering. Since you already have matrix factorization with the ALS solver there, you can use it as the model-based recommender in two ways for what you are trying to achieve: (1, pre-filtering) group your data by your context (e.g. age ranges, country) and build a 2D MF-ALS recommender for each group, or (2, post-filtering) build one single 2D user-item recommender and then post-filter the recommendations it returns based on your contextual information. In the second option you can go beyond boolean filtering and also boost the score value based on the contextual match, e.g. boost based on the average age of the watchers. A sketch of the post-filtering variant follows below.
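As a rough illustration of option (2), here is a hedged PySpark sketch: train a plain ALS model, then boost recommendation scores where the user's country matches the movie's dominant watcher country. The file names, column names (userId, movieId, rating, country, movie_country) and the 1.1 boost factor are illustrative assumptions, not anything prescribed by Spark or the question.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("context-postfiltering").getOrCreate()

# Assumed schemas: ratings(userId, movieId, rating), users(userId, age, country),
# movie_context(movieId, movie_country) precomputed as each movie's dominant watcher country.
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
users = spark.read.csv("users.csv", header=True, inferSchema=True)
movie_context = spark.read.csv("movie_context.csv", header=True, inferSchema=True)

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop", nonnegative=True)
model = als.fit(ratings)

# Plain 2D recommendations first, flattened into (userId, movieId, score) rows
recs = (model.recommendForAllUsers(50)
             .select("userId", F.explode("recommendations").alias("rec"))
             .select("userId", F.col("rec.movieId").alias("movieId"),
                     F.col("rec.rating").alias("score")))

# Contextual post-filtering: boost the score when the user's country matches
# the movie's dominant watcher country (1.1 is an arbitrary example factor)
boosted = (recs.join(users.select("userId", "country"), "userId")
               .join(movie_context, "movieId")
               .withColumn("score",
                           F.when(F.col("country") == F.col("movie_country"),
                                  F.col("score") * 1.1)
                            .otherwise(F.col("score"))))

# Keep the top 10 boosted recommendations per user
top10 = (boosted.withColumn("rank", F.row_number().over(
                    Window.partitionBy("userId").orderBy(F.desc("score"))))
                .filter("rank <= 10"))
top10.show()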
Use contextual information directly in a model-based recommender. This cannot be done with the core Spark MLlib matrix factorization method, which is built for 2D user-item actions, but you can overcome that by using a different model-based method. I would advise going for Factorization Machines, as they are a successful model for context-aware recommendations and very intuitive, especially in data preparation. You can see an introduction in my slides here: Intro to FM slides. There is a package for Apache Spark - the Spark libFM Package - but I cannot say whether it really works or is still maintained. I can certainly recommend the main libfm.org implementation, the Python fastFM package, and the very similar and fast lightfm implementation from Lyst, which could be useful for your problem (a sketch with it follows below).
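To show how context enters the data preparation for an FM-style model, here is a hedged lightfm sketch built around the toy users and movies from the question; the age/country feature names, the WARP loss and the epoch count are assumptions for illustration, and the exact API may vary between lightfm versions.

import numpy as np
from lightfm import LightFM
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(
    users=["U1", "U2", "U3", "U4"],
    items=["M", "M1", "M2", "M3", "M4", "M5"],
    user_features=["age:20s", "age:50s", "country:America", "country:Australia"],
)

# Implicit "watched" interactions from the question
(interactions, _) = dataset.build_interactions([
    ("U1", "M"), ("U1", "M1"), ("U1", "M2"), ("U1", "M3"),
    ("U2", "M"), ("U2", "M1"), ("U2", "M4"), ("U2", "M5"),
])

# Context (age bucket, country) becomes extra user features
user_features = dataset.build_user_features([
    ("U1", ["age:20s", "country:America"]),
    ("U2", ["age:50s", "country:Australia"]),
    ("U3", ["age:20s", "country:America"]),
    ("U4", ["age:20s", "country:Australia"]),
])

model = LightFM(loss="warp")
model.fit(interactions, user_features=user_features, epochs=30)

# Score every movie for U3; even though U3 has no watch history, the shared
# age/country features pull the predictions toward what similar users watched.
user_id_map, _, item_id_map, _ = dataset.mapping()
scores = model.predict(user_id_map["U3"], np.arange(len(item_id_map)),
                       user_features=user_features)
print(sorted(zip(item_id_map, scores), key=lambda t: -t[1]))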
Do a kNN recommender where you introduce your own similarity measure when comparing user to user. It's heuristic-based: you can boost the similarity score where the contextual information about both users matches, e.g. using something like abs(u1.age - u2.age) / (max(age) - min(age)), or the country, city or region distance between u1 and u2. A tiny sketch follows below.
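A minimal sketch of such a heuristic user-to-user similarity, mixing item overlap with age and country closeness; the 0.7/0.2/0.1 weights and the assumed age range are arbitrary choices for illustration.

def context_similarity(u1, u2, age_range=(18, 80)):
    # Jaccard overlap of watched movies
    overlap = len(u1["movies"] & u2["movies"]) / max(1, len(u1["movies"] | u2["movies"]))
    # Age closeness, normalized to [0, 1]
    age_sim = 1.0 - abs(u1["age"] - u2["age"]) / (age_range[1] - age_range[0])
    # Simple country-match boost
    country_sim = 1.0 if u1["country"] == u2["country"] else 0.0
    return 0.7 * overlap + 0.2 * age_sim + 0.1 * country_sim

u1 = {"movies": {"M", "M1", "M2", "M3"}, "age": 22, "country": "America"}
u3 = {"movies": {"M"}, "age": 24, "country": "America"}
print(context_similarity(u1, u3))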
I would also advise reading the chapter from the Recommender Systems Handbook [6] about introducing contextual information to recommender systems. And of course the whole book!
[6]: Kantor, Paul B. Recommender systems handbook. Eds. Francesco Ricci, Lior Rokach, and Bracha Shapira. Berlin, Germany: Springer, 2015.

I made a project on a movie recommendation system using Apache Spark MLlib, with help from Codementor. You can find the tutorial here.

Related

If I am using SIMILARITY_LOGLIKELIHOOD (LLR) are item ratings really ignored?

I used the MovieLens data file (ml-100k.zip) u.data unchanged, so it had the columns: userID, movieID and user rating.
I used LLR:
hadoop jar C:\hdp\mahout-0.9.0.2.1.3.0-1981\core\target\mahout-core-0.9.0.2.1.3.0-1981-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_LOGLIKELIHOOD --input u.data --output udata_output
When I look at the udata_output file I see recommended movie ID's followed by recommendation scores like:
1226:5.0
and
896:4.798878
The recommendation scores seemed to vary from 5.0 to 4.x
However, when I deleted the user rating column from the u.data file and re-ran the same command line above I received results like:
615:1.0
where ALL recommendation scores were 1.0.
2 questions:
1) If LLR ignores the user ratings, and the only input I change is whether to provide the user rating, why do the recommendation scores change?
2) Overall, I am trying to determine recommendation ranking so I'm using LLR. In addition should I ignore the recommendation scores and only focus on the order of the items recommended (e.g.: the first item is ranked higher than the 2nd)?
Thanks in advance.
LLR does not use the strengths. The theory is that if a user actually interacted with an item, that is all the indication needed. LLR will correlate that interaction with other users' interactions and score it based on a probabilistic calculation called the Log Likelihood Ratio. It does produce strengths, but it uses only the counts of interactions.
Answers
This could be a bug, or it could be because you are using a boolean recommender in one case and a non-boolean one in the other. It could be that the recommender is trying to provide ratings by taking account of the values. But none of this really matters if you are trying to optimize ranking.
You really never need to look at the recommendation weights unless you are trying to predict ratings, which seldom happens these days. Trust the ranking of recs.
BTW Mahout now has a completely new generation recommender based on using a search engine to serve recommendations and Mahout to calculate the model. It has many benefits over the older Hadoop version including:
Multimodal: it can ingest many different user actions on many different item sets. This allows you to use much of the user's clickstream to recommend.
Realtime results: it has a very fast, scalable server in Solr or Elasticsearch.
Due to the realtime nature it can recommend to new users or users with very recent history. The older Hadoop Mahout recommenders only recommend to users and items in the training data--they cannot react to history that was not used in training. The new recommender can use realtime gathered data, even on new users.
The new Multimodal Recommender in Mahout 1.0-snapshot or greater is described here:
Mahout site
A free ebook, which talks about the general idea: Practical Machine Learning
A slide deck, which talks about mixing actions or other indicators: Creating a Unified Multimodal Recommender
Two blog posts: What's New in Recommenders: part #1 and What's New in Recommenders: part #2
A post describing the log likelihood ratio: Surprise and Coincidence. LLR is used to reduce noise in the data while keeping the calculations at O(n) complexity.
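To make the count-only nature concrete, here is a small Python transcription of the LLR calculation (following the logic of Mahout's LogLikelihood utility); the 2x2 counts describe how often two items co-occur across users, and ratings never enter.

import math

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """k11: users who interacted with both items; k12/k21: with only one;
    k22: with neither. Only these counts matter, not rating values."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 0.0 if row + col < mat else 2.0 * (row + col - mat)

print(llr(10, 5, 5, 1000))   # strong co-occurrence -> large score
print(llr(1, 50, 50, 1000))  # weak co-occurrence  -> small score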

Collaborative or structured recommendation?

I am building a Rails app that recommends tutors to students and vice versa. I need to match them based on multiple dimensions, such as their majors (Math, Biology etc.), experience (junior etc.), class (Math 201 etc.), preference (self-described keywords) and ratings.
I checked out some Rails collaborative recommendation engines (recommendable, recommendify) and Mahout. It seems that collaborative recommendation is not the best choice in my case, since I have much more structured data, which allows a more structured query. For example, I can have a recommendation logic for a student like:
if student looks for a Math tutor in Math 201:
    if there's a tutor in Math major offering tutoring in Math 201 then return
    else if there's a tutor in Math major then sort by experience then return
    else if there's a tutor in quantitative major then sort by experience then return
    ...
My questions are:
What are the benefits of a collaborative recommendation algorithm given that my recommendation system will be preference-based?
If it does provide significant benefits, how can I combine it with a preference-based recommendation as mentioned above?
Since my approach will involve querying multiple tables, it might not be efficient. What should I do about this?
Thanks a lot.
It sounds like your measurement of compatibility could be profitably reformulated as a metric. What you should do is try to interpret your 'columns' as different components of the dimensions of your data. The idea is that you ultimately should produce a binary function which returns a measurement of compatibility between students and tutors (and also students/students and tutors/tutors). The motivation for extending this metric to all types of data is that you can then reformulate your matching criteria as a nearest-neighbor search:
http://en.wikipedia.org/wiki/Nearest_neighbor_search
There are plenty of data structures and solutions to this problem as it has been very well studied. For example, you could try out the following library which is often used with point cloud data:
http://www.cs.umd.edu/~mount/ANN/
To optimize things a bit, you could also try prefiltering your data by running principal component analysis on your data set. This would let you reduce the dimension of the space in which you do nearest neighbor searches, and usually has the added benefit of reducing some amount of noise.
http://en.wikipedia.org/wiki/Principal_component_analysis
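As a rough, hedged sketch of that pipeline with scikit-learn (rather than the ANN library above): encode each tutor and student in the same feature space, optionally reduce dimensionality with PCA, then run a nearest-neighbor query. The feature encoding below is invented for illustration.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# Each row: one tutor encoded as [major_math, major_bio, experience_years, rating]
tutors = np.array([
    [1, 0, 3, 4.5],
    [1, 0, 1, 4.9],
    [0, 1, 5, 4.2],
])

# Reduce dimension / noise with PCA before the nearest-neighbor search
pca = PCA(n_components=2).fit(tutors)
nn = NearestNeighbors(n_neighbors=2).fit(pca.transform(tutors))

# A student's preferences encoded in the same feature space, then projected
student = np.array([[1, 0, 2, 5.0]])
dist, idx = nn.kneighbors(pca.transform(student))
print(idx)  # indices of the best-matching tutors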
Good luck!
Personally, I think collaborative filtering (CF) would work well for you. Note that a central idea of CF is serendipity. In other words, adding too many constraints might result in lukewarm recommendations for users. The whole point of CF is to provide exciting and relevant recommendations based on similar users. You need not impose such tight constraints.
If you decide to implement a custom CF algorithm, I would recommend reading this article published by Amazon [pdf], which discusses Amazon's recommendation system. Briefly, the algorithm they use is as follows:
for each item I1
    for each customer C who bought I1
        for each item I2 bought by customer C
            record purchase pair {I1, I2} for customer C
    for each item I2
        calculate sim(I1, I2)
        // this could use your own similarity measure, e.g., cosine-based
        // similarity: sim(A, B) = cos(A, B) = (A . B) / (|A| |B|), where A
        // and B are vectors (items, or courses in your case) and the dimensions
        // are customers
return table
Note that the creation of this table would be done offline. The online algorithm would be quick to return recommendations. Apparently, the recommendation quality is excellent.
In any case, if you want to get a better idea about cf in general (e.g., various cf strategies) and why it might be suited to you, go through that article (don't worry, it is very readable). Implementing a simple cf recommender is not difficult. Optimizations can be made later.
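For concreteness, here is a tiny offline Python sketch of building that item-to-item table with a cosine-style similarity over binary customer vectors; the purchase data is made up.

from collections import defaultdict
from itertools import combinations
import math

purchases = {             # customer -> set of items bought
    "C1": {"I1", "I2"},
    "C2": {"I1", "I2", "I3"},
    "C3": {"I2", "I3"},
}

# Count co-purchases and per-item purchase totals
co = defaultdict(int)
totals = defaultdict(int)
for items in purchases.values():
    for item in items:
        totals[item] += 1
    for a, b in combinations(sorted(items), 2):
        co[(a, b)] += 1

# Cosine over binary customer vectors: co(a, b) / sqrt(|a| * |b|)
table = {pair: count / math.sqrt(totals[pair[0]] * totals[pair[1]])
         for pair, count in co.items()}
print(sorted(table.items(), key=lambda kv: -kv[1]))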

Recommendation algorithm (and implementation) for finding similar items and users

I have a database of about 700k users along with items they have watched/listened to/read/bought/etc.
I would like to build a recommendation engine that recommends new items based on what users with similar taste in things have enjoyed, as well as actually finding people the user might want to be friends with on a social network I'm building (similar to last.fm).
My requirements are as follows:
Majority of the "users" in my database aren't actually users of my website. They have been data mined from third-party sources. However, when recommending users, I would like to limit the search to people who are members of my website (while still taking advantage of the bigger data set).
I need to take multiple items into consideration. Not "people who like this one item you enjoyed...", but "people who like most of the items you enjoyed...".
I need to compute similarities between users and show them when viewing their profiles (taste-o-meter).
Some items are rated, others are not. Ratings are from 1-10, not boolean values. In most cases it would be possible to deduce a rating value from other stats if it's not present (e.g. if the user has favourited an item but hasn't rated it, I could just assume a rating of 9).
It has to interact with Python code in one way or another. Preferably, it should use a separate (possibly NoSQL) database and expose an API to use in my web back-end. The project I'm making uses Pyramid and SQLAlchemy.
I would like to take item genres into account.
I would like to display similar items on item pages based on both its genre (possibly tags) and what users who enjoyed the item liked (like Amazon's "people who bought this item" and Last.fm artist pages). Items from different genres should still be shown, but have a lower similarity value.
I would prefer a well-documented implementation of an algorithm with some examples.
Please don't give an answer like "use pysuggest or mahout", since those implement a plethora of algorithms and I'm looking for one that's most suitable for my data/use. I've been interested in Neo4j and how it all could be expressed as a graph of connections between users and items.
To determine similarity between users you can run cosine or Pearson similarity (found in Mahout, and everywhere on the net really!) across the user vectors. So your data representation should look something like:
u1 [1,2,3,4,5,6]
u2 [35,24,3,4,5,6]
u3 [35,3,9,2,1,11]
At the point where you want to take multiple items into consideration, you can use the above to determine how similar two profiles are. The higher the correlation score, the more likely it is that they have very similar items. You can set a threshold so that someone with a 0.75 similarity has a similar set of items in their profile.
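A minimal numpy sketch of that comparison, using the example vectors above and the 0.75 threshold just mentioned:

import numpy as np

u1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)
u2 = np.array([35, 24, 3, 4, 5, 6], dtype=float)

cosine = u1 @ u2 / (np.linalg.norm(u1) * np.linalg.norm(u2))
pearson = np.corrcoef(u1, u2)[0, 1]

similar_enough = cosine >= 0.75   # example threshold from above
print(cosine, pearson, similar_enough)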
Where you are missing values you can of course make up your own values. I'd just keep them binary and try to blend the various different algorithms. That's called an ensemble.
Overall you are looking for something called item based collaborative filtering as the recommendation aspect of your set up and also used to identify similar items. It's a standard recommendation algorithm that does pretty much everything you've asked for.
When trying to find similar users you can perform some type of similarity metric across your user vectors.
Regarding Python, the book Programming Collective Intelligence gives all its samples in Python, so go pick up a copy and read chapter 1.
Representing all of this as a graph will be somewhat problematic, as your underlying representation is a bipartite graph. There are lots of recommendation approaches out there that use a graph-based approach, but it's generally not the best-performing approach.
Actually that is one of the sweetspots of a graph database like Neo4j.
So if your data model looks like this:
user -[:LIKE|:BOUGHT]-> item
You can easily get recommendations for a user with a Cypher statement like this:
start user = node:users(id="doctorkohaku")
match user -[r:LIKE]->item<-[r2:LIKE]-other-[r3:LIKE]->rec_item
where r.stars > 2 and r2.stars > 2 and r3.stars > 2
return rec_item.name, count(*) as cnt, avg(r3.stars) as rating
order by rating desc, cnt desc limit 10
This can also be done using the Neo4j Core-API or the Traversal-API.
Neo4j has an Python API that is also able to run cypher queries.
Disclaimer: I work for Neo4j
There are also some interesting articles by Marko Rodriguez about collaborative filtering.
I can suggest having a look at my open source project Reco4j. It is a graph-based recommendation engine that can be used on a graph database like yours in a very straightforward way. We support Neo4j as the graph database. It is at an early stage, but a more complete version will be available very soon. In the meantime we are looking for use cases for our project, so please contact me so that we can see how we can collaborate.

How does the Amazon Recommendation feature work?

What technology goes on behind the scenes of Amazon's recommendation technology? I believe that Amazon's recommendation is currently the best in the market, but how do they provide us with such relevant recommendations?
Recently, we have been involved with a similar kind of recommendation project, and we would surely like to know the ins and outs of the Amazon recommendation technology from a technical standpoint.
Any inputs would be highly appreciated.
Update:
This patent explains how personalized recommendations are done but it is not very technical, and so it would be really nice if some insights could be provided.
From Dave's comments, affinity analysis forms the basis for this kind of recommendation engine. Also, here are some good reads on the topic:
Demystifying Market Basket Analysis
Market Basket Analysis
Affinity Analysis
Suggested Reading:
Data Mining: Concepts and Techniques
It is both an art and a science. Typical fields of study revolve around market basket analysis (also called affinity analysis) which is a subset of the field of data mining. Typical components in such a system include identification of primary driver items and the identification of affinity items (accessory upsell, cross sell).
Keep in mind the data sources they have to mine...
Purchased shopping carts = real money from real people spent on real items = powerful data and a lot of it.
Items added to carts but abandoned.
Pricing experiments online (A/B testing, etc.) where they offer the same products at different prices and see the results
Packaging experiments (A/B testing, etc.) where they offer different products in different "bundles" or discount various pairings of items
Wishlists - what's on them specifically for you - and in aggregate it can be treated similarly to another stream of basket analysis data
Referral sites (identification of where you came in from can hint other items of interest)
Dwell times (how long before you click back and pick a different item)
Ratings by you or those in your social network/buying circles - if you rate things you like you get more of what you like and if you confirm with the "i already own it" button they create a very complete profile of you
Demographic information (your shipping address, etc.) - they know what is popular in your general area for your kids, yourself, your spouse, etc.
User segmentation = did you buy 3 books in separate months for a toddler? You likely have one or more kids, etc.
Direct marketing click through data - did you get an email from them and click through? They know which email it was and what you clicked through on and whether you bought it as a result.
Click paths in session - what did you view regardless of whether it went in your cart
Number of times viewed an item before final purchase
If you're dealing with a brick and mortar store they might have your physical purchase history to go off of as well (i.e. toys r us or something that is online and also a physical store)
etc. etc. etc.
Luckily, people behave similarly in aggregate, so the more they know about the buying population at large, the better they know what will and won't sell; and with every transaction and every rating/wishlist add/browse, they learn how to tailor recommendations more personally. Keep in mind this is likely only a small sample of the full set of influences on what ends up in recommendations.
Now I have no inside knowledge of how Amazon does business (never worked there) and all I'm doing is talking about classical approaches to the problem of online commerce - I used to be the PM who worked on data mining and analytics for the Microsoft product called Commerce Server. We shipped in Commerce Server the tools that allowed people to build sites with similar capabilities.... but the bigger the sales volume the better the data the better the model - and Amazon is BIG. I can only imagine how fun it is to play with models with that much data in a commerce driven site. Now many of those algorithms (like the predictor that started out in commerce server) have moved on to live directly within Microsoft SQL.
The four big take-a-ways you should have are:
Amazon (or any retailer) is looking at aggregate data for tons of transactions and tons of people... this allows them to even recommend pretty well for anonymous users on their site.
Amazon (or any sophisticated retailer) is keeping track of behavior and purchases of anyone that is logged in and using that to further refine on top of the mass aggregate data.
Often there is a means of overriding the accumulated data and taking "editorial" control of suggestions for product managers of specific lines (like someone who owns the 'digital cameras' vertical or the 'romance novels' vertical or similar) where they truly are experts.
There are often promotional deals (i.e. sony or panasonic or nikon or canon or sprint or verizon pays additional money to the retailer, or gives a better discount at larger quantities or other things in those lines) that will cause certain "suggestions" to rise to the top more often than others - there is always some reasonable business logic and business reason behind this targeted at making more on each transaction or reducing wholesale costs, etc.
In terms of actual implementation? Just about all large online systems boil down to some set of pipelines (or a filter-pattern implementation, or a workflow, etc.; call it what you will) that allow a context to be evaluated by a series of modules that apply some form of business logic.
Typically a different pipeline would be associated with each separate task on the page - you might have one that does recommended "packages/upsells" (i.e. buy this with the item you're looking at) and one that does "alternatives" (i.e. buy this instead of the thing you're looking at) and another that pulls items most closely related from your wish list (by product category or similar).
The results of these pipelines are able to be placed on various parts of the page (above the scroll bar, below the scroll, on the left, on the right, different fonts, different size images, etc.) and tested to see which perform best. Since you're using nice, easy plug-and-play modules that define the business logic for these pipelines, you end up with the moral equivalent of Lego blocks: they make it easy to pick and choose the business logic you want applied when you build another pipeline, which allows faster innovation, more experimentation, and in the end higher profits.
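To make the filter-pattern idea concrete, here is a toy Python sketch of such a pipeline, where a context dictionary flows through small modules that each apply one piece of business logic; all names and data are invented for illustration.

# Toy co-purchase lookup table (invented data)
RELATED = {"camera": [{"id": "sd-card", "score": 0.9, "in_stock": True, "promo_boost": 1.2},
                      {"id": "tripod", "score": 0.7, "in_stock": False}]}

def seed_from_current_item(ctx):
    # Copy candidate dicts so later modules can mutate them safely
    ctx["candidates"] = [dict(c) for c in RELATED.get(ctx["item_id"], [])]
    return ctx

def drop_out_of_stock(ctx):
    ctx["candidates"] = [c for c in ctx["candidates"] if c["in_stock"]]
    return ctx

def apply_promotions(ctx):
    for c in ctx["candidates"]:
        c["score"] *= c.get("promo_boost", 1.0)
    return ctx

def run_pipeline(ctx, modules):
    for module in modules:
        ctx = module(ctx)
    return ctx

# An "upsell" pipeline; a different page slot would simply plug in different modules
upsell_pipeline = [seed_from_current_item, drop_out_of_stock, apply_promotions]
print(run_pipeline({"item_id": "camera"}, upsell_pipeline)["candidates"])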
Did that help at all? I hope that gives you a little bit of insight into how this works in general for just about any e-commerce site - not just Amazon. Amazon (from talking to friends who have worked there) is very data driven and continually measures the effectiveness of its user experience, pricing, promotion, packaging, etc. They are a very sophisticated online retailer and are likely at the leading edge of a lot of the algorithms they use to optimize profit - and those are likely proprietary secrets (you know, like the formula for KFC's secret spices) and guarded as such.
This isn't directly related to Amazon's recommendation system, but it might be helpful to study the methods used by people who competed in the Netflix Prize, a contest to develop a better recommendation system using Netflix user data. A lot of good information exists in their community about data mining techniques in general.
The team that won used a blend of the recommendations generated by a lot of different models/techniques. I know that some of the main methods used were principal component analysis, nearest neighbor methods, and neural networks. Here are some papers by the winning team:
R. Bell, Y. Koren, C. Volinsky, "The BellKor 2008 Solution to the Netflix Prize", (2008).
A. Töscher, M. Jahrer, "The BigChaos Solution to the Netflix Prize 2008", (2008).
A. Töscher, M. Jahrer, R. Legenstein, "Improved Neighborhood-Based Algorithms for Large-Scale Recommender Systems", SIGKDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition (KDD’08) , ACM Press (2008).
Y. Koren, "The BellKor Solution to the Netflix Grand Prize", (2009).
A. Töscher, M. Jahrer, R. Bell, "The BigChaos Solution to the Netflix Grand Prize", (2009).
M. Piotte, M. Chabbert, "The Pragmatic Theory solution to the Netflix Grand Prize", (2009).
The 2008 papers are from the first year's Progress Prize. I recommend reading the earlier ones first because the later ones build upon the previous work.
I bumped into this paper today:
Amazon.com Recommendations: Item-to-Item Collaborative Filtering
Maybe it provides additional information.
(Disclaimer: I used to work at Amazon, though I didn't work on the recommendations team.)
ewernli's answer should be the correct one -- the paper links to Amazon's original recommendation system, and from what I can tell (both from personal experience as an Amazon shopper and having worked on similar systems at other companies), very little has changed: at its core, Amazon's recommendation feature is still very heavily based on item-to-item collaborative filtering.
Just look at what form the recommendations take: on my front page, they're all either of the form "You viewed X...Customers who also viewed this also viewed...", or else a melange of items similar to things I've bought or viewed before. If I specifically go to my "Recommended for You" page, every item describes why it's recommended for me: "Recommended because you purchased...", "Recommended because you added X to your wishlist...", etc. This is a classic sign of item-to-item collaborative filtering.
So how does item-to-item collaborative filtering work? Basically, for each item, you build a "neighborhood" of related items (e.g., by looking at what items people have viewed together or what items people have bought together -- to determine similarity, you can use metrics like the Jaccard index; correlation is another possibility, though I suspect Amazon doesn't use ratings data very heavily). Then, whenever I view an item X or make a purchase Y, Amazon suggests things in the same neighborhood as X or Y.
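A tiny sketch of that neighborhood construction with the Jaccard index, using invented view data:

viewers = {                     # item -> set of users who viewed it
    "X": {"u1", "u2", "u3"},
    "A": {"u2", "u3", "u4"},
    "B": {"u9"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

# Neighborhood of X: other items ranked by overlap of their viewer sets
neighbors = sorted(((other, jaccard(viewers["X"], users))
                    for other, users in viewers.items() if other != "X"),
                   key=lambda t: -t[1])
print(neighbors)   # e.g. [('A', 0.5), ('B', 0.0)]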
Some other approaches that Amazon could potentially use, but likely doesn't, are described here: http://blog.echen.me/2011/02/15/an-overview-of-item-to-item-collaborative-filtering-with-amazons-recommendation-system/
A lot of what Dave describes is almost certainly not done at Amazon. (Ratings by those in my social network? Nope, Amazon doesn't have any of my social data. This would be a massive privacy issue in any case, so it'd be tricky for Amazon to do even if they had that data: people don't want their friends to know what books or movies they're buying. Demographic information? Nope, nothing in the recommendations suggests they're looking at this. [Unlike Netflix, who does surface what other people in my area are watching.])
I don't have any knowledge of Amazon's algorithm specifically, but one component of such an algorithm would probably involve tracking groups of items frequently ordered together, and then using that data to recommend other items in the group when a customer purchases some subset of the group.
Another possibility would be to track the frequency of item B being ordered within N days after ordering item A, which could suggest a correlation.
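A toy sketch of that "B ordered within N days after A" counting; the order records and the 7-day window are invented for illustration.

from datetime import date, timedelta
from collections import Counter

orders = [                       # (customer, item, order_date)
    ("c1", "A", date(2023, 1, 1)),
    ("c1", "B", date(2023, 1, 4)),
    ("c2", "A", date(2023, 2, 1)),
    ("c2", "B", date(2023, 2, 20)),
]

N = timedelta(days=7)
follow_ups = Counter()
for cust_a, item_a, t_a in orders:
    for cust_b, item_b, t_b in orders:
        # Same customer, different item, ordered within the window after item_a
        if cust_a == cust_b and item_a != item_b and timedelta(0) < t_b - t_a <= N:
            follow_ups[(item_a, item_b)] += 1

print(follow_ups)   # {('A', 'B'): 1} -> only c1 reordered within the window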
As far as I know, it uses Case-Based Reasoning as its engine.
You can see these sources: here, here and here.
There are many sources on Google if you search for Amazon and case-based reasoning.
If you want a hands-on tutorial (using open-source R) then you could do worse than going through this:
https://gist.github.com/yoshiki146/31d4a46c3d8e906c3cd24f425568d34e
It is a run-time optimised version of another piece of work:
http://www.salemmarafi.com/code/collaborative-filtering-r/
However, the variation of the code on the first link runs MUCH faster so I recommend using that (I found the only slow part of yoshiki146's code is the final routine which generates the recommendation at user level - it took about an hour with my data on my machine).
I adapted this code to work as a recommendation engine for the retailer I work for.
The algorithm used is - as others have said above - collaborative filtering. This method of CF calculates a cosine similarity matrix and then sorts by that similarity to find the 'nearest neighbour' for each element (music band in the example given, retail product in my application).
The resulting table can recommend a band/product based on another chosen band/product.
The next section of the code goes a step further with USER (or customer) based collaborative filtering.
The output of this is a large table with the top 100 bands/products recommended for a given user/customer
Someone did a presentation at our University on something similar last week, and referenced the Amazon recommendation system. I believe that it uses a form of K-Means Clustering to cluster people into their different buying habits. Hope this helps :)
Check this out too: Link and as HTML.

Algorithms to find stuff a user would like based on other users likes

I'm thinking of writing an app to classify movies in an HTPC based on what the family members like.
I don't know statistics or AI, but the stuff here looks very juicy. I wouldn't know where to start, though.
Here's what I want to accomplish:
Compose a set of samples from each user's likes, rating each sample attribute separately. For example, maybe a user likes western movies a lot, so the western genre would carry a bit more weight for that user (and so on for other attributes, like actors, director, etc.).
A user can get suggestions based on the likes of the other users. For example, if both user A and B like Spielberg (connection between the users), and user B loves Batman Begins, but user A loathes Katie Holmes, weigh the movie for user A accordingly (again, each attribute separately, for example, maybe user A doesn't like action movies so much, so bring the rating down a bit, and since Katie Holmes isn't the main star, don't take that into account as much as the other attributes).
Basically, compare sets from user A to similar sets from user B, and come up with a rating for user A.
I have a crude idea about how to implement this, but I'm certain some bright minds have already thought of a far better solution already, so... any suggestions?
Actually, after some quick research, it seems a Bayesian filter would work. If so, would this be the better approach? Would it be as simple as just "normalizing" movie data, training a classifier for each user, and then just classifying each movie?
If your suggestion includes some brain-melting concepts (I'm not experienced in these subjects, especially in AI), I'd appreciate it if you also included a list of some basics for me to research before diving into the meaty stuff.
Thanks!
Matthew Podwysocki had some interesting articles on this stuff
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/03/30/functional-programming-and-collective-intelligence.aspx
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/04/01/functional-programming-and-collective-intelligence-ii.aspx
http://weblogs.asp.net/podwysocki/archive/2009/04/07/functional-programming-and-collective-intelligence-iii.aspx
This is similar to this question where the OP wanted to build a recommendation system. In a nutshell, we are given a set of training data consisting of users' ratings of movies (a 1-5 star rating, for example) and a set of attributes for each movie (year, genre, actors, ...). We want to build a recommender that will output a possible rating for unseen movies. So the input data looks like:
user  movie  year  genre  ... | rating
--------------------------------------
1     1      2006  action     | 5
3     2      2008  drama      | 3.5
...
and for an unrated movie X:
10    20     2009  drama      | ?
we want to predict a rating. Doing this for all unseen movies then sorting by predicted movie rating and outputting the top 10 gives you a recommendation system.
The simplest approach is to use a k-nearest neighbor algorithm. Among the rated movies, search for the "closest" ones to movie X, and combine their ratings to produce a prediction.
This approach has the advantage of being very simple and easy to implement from scratch.
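A toy sketch of that k-nearest-neighbor prediction with scikit-learn; the feature encoding and ratings below are invented for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Rated movies for one user, encoded as [year_scaled, is_action, is_drama]
X_train = np.array([[0.6, 1, 0],
                    [0.8, 0, 1],
                    [0.9, 0, 1]])
y_train = np.array([5.0, 3.5, 4.0])   # that user's ratings

knn = KNeighborsRegressor(n_neighbors=2).fit(X_train, y_train)

# Unseen movie X: predict its rating by averaging the 2 closest rated movies
movie_x = np.array([[0.95, 0, 1]])
print(knn.predict(movie_x))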
Other, more sophisticated approaches exist. For example, you can build a decision tree that fits a set of rules on the training data. You can also use Bayesian networks, artificial neural networks, or support vector machines, among many others... Going through each of these won't be easy for someone without the proper background.
Still, I expect you would be using an external tool/library. You seem to be familiar with Bayesian networks, so a simple naive Bayes net could in fact be very powerful. One advantage is that it allows for prediction under missing data.
The main idea would be somewhat the same; take the input data you have, train a model, then use it to predict the class of new instances.
If you want to play around with different algorithms in a simple, intuitive package that requires no programming, I suggest you take a look at Weka (my first choice), Orange, or RapidMiner. The most difficult part would be preparing the dataset in the required format. The rest is as easy as choosing an algorithm and applying it (all in a few clicks!).
For someone not looking to go into too much detail, I would recommend going with the nearest-neighbor method, as it is intuitive and easy to implement. Still, the option of using Weka (or one of the other tools) is worth looking into.
There are a few algorithms that are good for this:
ARTMAP: groups items via probability against each other (this isn't fast, but it's the best thing for your problem IMO).
ARTMAP holds a group of common attributes and determines the likelihood of similarity via percentages.
ARTMAP
KMeans: this separates out the vectors by the distance they are from each other.
KMeans: Wikipedia
PCA: separates the average of all the values from the varying bits. This is what you would use to do face detection and background subtraction in computer vision.
PCA
The K-nearest neighbor algorithm may be right up your alley.
Check out some of the work of the top teams for the Netflix Prize.
