Algorithm for clustering people with similar interests

I want to cluster people into groups based on their interests. For example, people who like machine learning and graphs may be placed in one group, and people who are interested in mathematics and economics may be placed in a different group.
The algorithm should decide which people have the most closely matching interests and create clusters accordingly. It should also be able to list the other people in the group in which a particular person is placed.

This does not sound like a particularly difficult clustering problem, and any of the off-the-shelf clustering algorithms will probably work well. If you know how many clusters you want, then try k-means or k-medoids clustering. If you don't know how many clusters you want, then try agglomerative clustering.
The difficult part of the problem will be the features. You mentioned that 'interests' could be used as the features upon which to cluster, but feature engineering and selection will always involve some trial and error.

Without more context about your problem, I can't really give a definite answer. Most clustering algorithms will work, though; the real question is how "good" your results are. I'm quoting the word "good" because you'll need some sort of metric to measure that (generally inter-cluster and intra-cluster distance).
Here's the advice I was given on how to choose an algorithm for data mining: try the simplest algorithms first. These are quite often overlooked but perform quite well (Naive Bayes for supervised learning is a classic example).
To start you off, try something like k-means, which is a simple and popular method. You can find more info here: http://en.wikipedia.org/wiki/K-means_clustering (the Software section also has a list of implementations that you could try).
The second requirement is to output the other people in a target person's group. This is doable with any clustering algorithm: since you'll have X subsets of people, you simply find the subset that contains the target person, then iterate over that subset and print everyone in it.
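As a rough illustration of both steps (clustering, then listing a person's groupmates), here is a minimal pure-Python k-means sketch. The sample people, the one-hot encoding of interests, and the choice of seeding the centroids from named people are all assumptions made for the example; a real project would more likely use an existing implementation.

```python
# Hypothetical sample data: person -> set of interests.
PEOPLE = {
    "alice": {"machine learning", "graphs"},
    "bob": {"machine learning", "graphs", "statistics"},
    "carol": {"mathematics", "economics"},
    "dave": {"economics", "mathematics", "finance"},
}

def to_vectors(people):
    """One-hot encode each person's interests over the sorted vocabulary."""
    vocab = sorted(set().union(*people.values()))
    return {name: [1.0 if term in interests else 0.0 for term in vocab]
            for name, interests in people.items()}

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, seeds, max_iter=100):
    """Plain k-means; `seeds` names the people used as initial centroids."""
    centroids = [list(vectors[s]) for s in seeds]
    assignment = {}
    for _ in range(max_iter):
        # Assignment step: each person goes to the nearest centroid.
        new_assignment = {
            name: min(range(len(centroids)),
                      key=lambda c: squared_distance(vec, centroids[c]))
            for name, vec in vectors.items()
        }
        if new_assignment == assignment:
            break  # converged: nobody changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [vectors[n] for n, a in assignment.items() if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def groupmates(assignment, person):
    """Everyone else in the same cluster as `person`."""
    cluster = assignment[person]
    return sorted(n for n, a in assignment.items() if a == cluster and n != person)

assignment = kmeans(to_vectors(PEOPLE), seeds=["alice", "carol"])
print(groupmates(assignment, "alice"))  # ['bob']
```

Seeding from fixed people keeps the example deterministic; k-means is normally started from random or k-means++ centroids and run several times.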

I think the right approach will be k-means clustering. The most important part of your problem is feature selection.
Start with the features that you think are most important and simply apply k-means in a statistical programming language like R. Inspect the result and improve it by modifying the features or selecting more appropriate ones.
Trial and error can give you insight if you are not sure about feature selection.
If you can provide some sample data, it will be easier to give specific solutions to your problem.

This is coming a bit late, but there's actually an app in the Windows Store that does exactly that: it finds profiles with similar characteristics.
It's called k-modo.

Related

Ordered Sequence of learning algorithm design techniques

So, I am taking an algorithms course this semester.
I have a basic understanding of design techniques and know that divide and conquer should be the first technique to learn.
But when it comes to backtracking, dynamic programming and greedy techniques, I am not sure which order to choose.
My course is structured in the order I described in the paragraph above.
Any suggestions?
I doubt you will lose anything by simply following the order that your course suggests. I notice that TutorialsPoint presents in a different order, and adds several other techniques. I think you will gain a lot by simply working your way through such material.
It's likely that your learning style is different from mine, but I find it very beneficial to do a breadth-first survey of a topic, happily accepting that I won't understand every detail the first time round, then drill down as my understanding increases. So I don't feel the need for a detailed order of study.

Which open-source recommendation system should I choose to deal with big dataset

I want to build a recommendation system, and the target is to deal with a really big data set, around 1 TB of data.
Each user has a really huge number of items, but the number of users is small, in the thousands or tens of thousands.
I searched Google and found some open-source recommendation engines based on Hadoop, like Mahout. I guess it may be able to deal with data this big, but I'm not sure.
I also found some engines written in C++, Python, and even PHP. I don't think scripting languages can deal with data this big, because the whole dataset won't fit in memory.
Or am I wrong? Could someone give me a recommendation?
Your question title is:
"Which open-source recommendation system should I choose to deal with big dataset?"
and in the first line you say:
"I want to build a recommendation system, and the target is to deal with really big data set, like 1 TB data."
And you are asking for a recommendation as an answer.
To answer your second question first: in my experience of building recommender systems, I would advise you not to "build" a recommender system from the ground up if you can avoid it. Recommender systems are complex and can use a wide range of techniques to provide a user with a recommendation. So unless you are really committed and have a team of people with a range of experience and knowledge in recommender systems, statistics, and software engineering, my recommendation is to implement an existing recommender system rather than build your own.
In terms of which open source recommender system you should choose, this is actually pretty difficult to answer with great accuracy. Let me try to answer this by breaking it down.
Consider the open source license, its restrictions and your requirements.
Consider which algorithm you want to use to make recommendations.
Consider the environment you will be running your recommender system on.
I recommend you look more into the algorithm side, as it will be the determining factor as to which tool you can use, or whether you will need to roll your own. Start reading here http://www.ibm.com/developerworks/library/os-recommender1/ for a very brief insight into the different approaches that recommender systems use. In summary, the different approaches are:
Content based
Neighbourhood / Collaborative filtering based
Constraint based
Graph-based
In your case to keep things relatively straightforward it sounds like you should consider a user-user collaborative filtering algorithm for this. The reasons being:
Neighbourhood collaborative filtering is quite intuitive to understand and can be relatively easy to implement.
With this method you can also justify your recommendations to your users in a basic way.
There is no requirement to build a model for training, and the processing of neighbours can be done "offline" to provide quick recommendations to the end user.
Storing neighbours is actually quite memory efficient, which means better scalability, something it sounds like you will need lots of.
The user-based part of my suggestion is because it sounds like you have fewer users than items. In user-based nearest-neighbour collaborative filtering, a predicted rating of a new item I for user U is calculated by looking at the other users who have also rated item I and are most similar to user U. Because you have fewer users than items in your system, it will be faster to compute user-based collaborative filtering than item-based collaborative filtering.
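To make the prediction step concrete, here is a minimal sketch of user-based nearest-neighbour prediction in plain Python. The ratings are made up, and the specific choices (Pearson correlation over co-rated items, keeping only positively correlated neighbours, mean-centred weighted average) are assumptions for illustration rather than the way Mahout or any other library does it.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
RATINGS = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 2, "c": 3, "d": 4},
    "u3": {"a": 1, "b": 5, "d": 2},
}

def mean(ratings):
    return sum(ratings.values()) / len(ratings)

def similarity(r_u, r_v):
    """Pearson correlation over co-rated items (0.0 when undefined)."""
    common = sorted(set(r_u) & set(r_v))
    if len(common) < 2:
        return 0.0
    mu_u = sum(r_u[i] for i in common) / len(common)
    mu_v = sum(r_v[i] for i in common) / len(common)
    num = sum((r_u[i] - mu_u) * (r_v[i] - mu_v) for i in common)
    den = sqrt(sum((r_u[i] - mu_u) ** 2 for i in common)
               * sum((r_v[i] - mu_v) ** 2 for i in common))
    return num / den if den else 0.0

def predict(ratings, user, item, k=2):
    """Predict `user`'s rating of `item` from the k most similar raters."""
    neighbours = []
    for other, r_v in ratings.items():
        if other == user or item not in r_v:
            continue
        sim = similarity(ratings[user], r_v)
        if sim > 0:  # assumption: ignore dissimilar users entirely
            neighbours.append((sim, r_v[item] - mean(r_v)))
    top = sorted(neighbours, reverse=True)[:k]
    base = mean(ratings[user])
    if not top:
        return base  # no usable neighbours: fall back to the user's mean
    # Weighted average of the neighbours' mean-centred ratings.
    return base + sum(s * d for s, d in top) / sum(s for s, _ in top)

print(predict(RATINGS, "u1", "d"))
```

Here only u2 contributes to the prediction for u1, because u3's ratings are negatively correlated with u1's. The pairwise similarities are the part that can be computed offline and stored.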
Within user-based collaborative filtering you need to consider which rating normalisation you want to use (mean-centering vs z-score), the similarity weight computation method (e.g. cosine vs Pearson correlation vs other similarity measures), the neighbourhood selection criteria (pre-filtering of neighbours, number of neighbours involved in the prediction), and any dimensionality reduction methods (SVD, SVD++) you want to implement (with a large dataset like yours you will want to seriously consider dimensionality reduction).
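For reference, the two rating normalisations mentioned above differ only in whether the centred rating is also divided by the user's standard deviation. A small sketch, assuming each user's ratings are held in a dict and using the population standard deviation (the function names are illustrative):

```python
from statistics import mean, pstdev

def mean_center(ratings):
    """Subtract the user's average rating from every rating."""
    mu = mean(list(ratings.values()))
    return {item: r - mu for item, r in ratings.items()}

def z_score(ratings):
    """Mean-centre, then divide by the user's rating spread."""
    values = list(ratings.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A user who gives every item the same rating carries no signal.
        return {item: 0.0 for item in ratings}
    return {item: (r - mu) / sigma for item, r in ratings.items()}

user_ratings = {"a": 5, "b": 3, "c": 4}
print(mean_center(user_ratings))
print(z_score(user_ratings))
```

Mean-centering corrects for users who rate everything high or low; z-scores additionally correct for users who use a narrower or wider part of the scale.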
So really, instead of looking for an open-source tool that will be able to process your data set, you should consider your algorithm choice first, then find a tool that has an implementation of this algorithm, and then assess whether it can process the volume involved in your dataset.
In saying all of that, if you do choose to go down the user-based collaborative filtering route then I am confident that Apache Mahout will be able to solve your problem, and if not it will certainly help you understand the complexity involved in building your own (just look at their source code).
Please note that the advice is really to consider the algorithm choice. "Good" recommender systems are so much more than just being able to process a large dataset. You need to think about accuracy, coverage, confidence, novelty, serendipity, diversity, robustness, privacy, risk, user trust, and finally scalability. You should also consider how you are going to perform experiments and evaluate your recommendations. Remember, if the recommendations you are churning out are rubbish and are turning your users off, then there is no point in having a recommender system!
It is such a big area with lots to think about that there is probably no single tool that is going to help you with everything, so be prepared to do a lot of reading and research, as well as trying out lots of different open-source tools.
In saying that, start looking at Apache Mahout. Going back to the break-down of the 3 areas I said you should think about.
It has a commercial-friendly open-source license,
it has really great implementation of the algorithms you are likely going to need to use, and
it can work on distributed environments (read scalable).
Hope that helps, and good luck.

Beyond item-to-item recommendations

Simple item-to-item recommendation systems are well-known and frequently implemented. An example is the Slope One algorithm. This is fine if the user hasn't rated many items yet, but once they have, I want to offer more fine-grained recommendations. Let's take a music recommendation system as an example, since they are quite popular. If a user is viewing a piece by Mozart, a suggestion for another Mozart piece or for Beethoven might be given. But if the user has rated many classical pieces, we might be able to find a correlation between the items and see that the user dislikes vocals or certain instruments. I'm assuming this would be a two-part process: the first part is to find correlations between each user's ratings, and the second is to build the recommendation matrix from this extra data. So the question is, are there any open-source implementations or papers that can be used for each of these steps?
Taste may have something useful. It's moved to the Mahout project:
http://taste.sourceforge.net/
In general, the idea is that given a user's past preferences, you want to predict what they'll select next and recommend it. You build a machine-learning model in which the inputs are what a user has picked in the past and the attributes of each pick. The output is the item(s) they'll pick. You create training data by holding back some of their choices, and using their history to predict the data you held back.
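The hold-back idea can be sketched as follows. The listening histories are invented, and `most_popular` is only a stand-in baseline (not one of the models discussed here) so that the evaluation loop has something to score.

```python
from collections import Counter

# Hypothetical pick histories, oldest first.
HISTORY = {
    "ann": ["mozart", "beethoven", "bach"],
    "ben": ["mozart", "bach", "beethoven"],
    "cat": ["bach", "mozart", "vivaldi"],
}

def holdout_split(history, n_held_back=1):
    """Split each user's picks into (training, held-back) parts."""
    train, held = {}, {}
    for user, picks in history.items():
        if len(picks) <= n_held_back:
            train[user], held[user] = list(picks), []
        else:
            train[user] = list(picks[:-n_held_back])
            held[user] = list(picks[-n_held_back:])
    return train, held

def most_popular(train, user):
    """Baseline: most-picked items overall that the user has not picked yet."""
    counts = Counter(item for picks in train.values() for item in picks)
    seen = set(train[user])
    return [item for item, _ in counts.most_common() if item not in seen]

def hit_rate(recommend, train, held, top_n=3):
    """Fraction of held-back picks that appear in the top-N suggestions."""
    hits = total = 0
    for user, hidden in held.items():
        suggested = set(recommend(train, user)[:top_n])
        hits += sum(1 for item in hidden if item in suggested)
        total += len(hidden)
    return hits / total if total else 0.0

train, held = holdout_split(HISTORY)
print(hit_rate(most_popular, train, held))
```

Any model with the same `recommend(train, user)` shape can be dropped in and compared against the baseline on the same held-back picks.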
There are lots of different machine learning models you can use. Decision trees are common.
One answer is that any recommender system ought to have some of the properties you describe. Initially, recommendations aren't so good and are all over the place. As it learns tastes, the recommendations will come from the area the user likes.
But, the collaborative filtering process you describe is fundamentally not trying to solve the problem you are trying to solve. It is based on user ratings, and two songs aren't rated similarly because they are similar songs -- they're rated similarly just because similar people like them.
What you really need is to define your notion of song-song similarity. Is it based on how the song sounds? the composer? Because it sounds like the notion is not based on ratings, actually. That is 80% of the problem you are trying to solve.
I think the question you are really answering is, what items are most similar to a given item? Given your item similarity, that's an easier problem than recommendation.
Mahout can help with all of these things, except song-song similarity based on its audio -- or at least provide a start and framework for your solution.
There are two techniques that I can think of:
Train a feed-forward artificial neural net using backpropagation or one of its successors (e.g. Resilient Propagation).
Use version space learning. This starts with the most general and the most specific hypotheses about what the user likes and narrows them down when new examples are integrated. You can use a hierarchy of terms to describe concepts.
Common characteristics of these methods are:
You need a different function for each user. This pretty much rules out efficient database queries when searching for recommendations.
The function can be updated on the fly when the user votes for an item.
The dimensions along which you classify the input data (e.g. has vocals, beats per minute, musical scales, whatever) are very critical to the quality of the classification.
Please note that these suggestions come from university courses in knowledge based systems and artificial neural nets, not from practical experience.

Automatic Tagging Algorithm

Does anyone know how to build an automatic tagging algorithm for blog posts or documents? Any example will be appreciated.
I agree with what Wooble is saying. However, the naïve solution is to simply write an algorithm that calculates the lexical similarities and differences between the given blog post and a corpus of text. This lexical difference will give you the words that occur in the blog post more frequently than they do in the corpus. From those words, you can infer a tag.
But I strongly recommend against it. Automatic tagging doesn't seem to work in practice. Just outsource the tagging work to your users or to services like Mechanical Turk.
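For anyone who wants to try the naïve lexical comparison anyway, a minimal sketch might look like this. The tokenizer, the add-one smoothing, and the `min_count` threshold are assumptions chosen to keep the example short, and the texts are made up.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def suggest_tags(post, corpus, top_n=3, min_count=2):
    """Rank the post's words by how over-represented they are vs the corpus."""
    post_counts = Counter(tokenize(post))
    corpus_counts = Counter(tokenize(corpus))
    post_total = sum(post_counts.values())
    corpus_total = sum(corpus_counts.values())
    vocab_size = len(set(post_counts) | set(corpus_counts))
    scores = {}
    for word, count in post_counts.items():
        if count < min_count:
            continue  # a word seen once is too weak a signal
        post_freq = count / post_total
        # Add-one smoothing so words absent from the corpus do not divide by zero.
        corpus_freq = (corpus_counts[word] + 1) / (corpus_total + vocab_size)
        scores[word] = post_freq / corpus_freq
    # Highest ratio first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

post = ("Clustering users with k-means: the clustering needs good features, "
        "and the features matter.")
corpus = "the cat sat on the mat and the dog ate the food with the cat"
print(suggest_tags(post, corpus, top_n=2))  # ['clustering', 'features']
```

Common words such as "the" are frequent in both texts, so their ratio stays low and they drop out of the top suggestions.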
Late response but also had this task for a course - so in case someone else is looking to explore this, here is a starting point:
If you are looking for simple solutions, or perhaps a machine learning exercise, you might view automatic tagging as a text categorization/classification task. Naive Bayes classifiers are simple tools to figure out, and there is plenty of pseudocode and material to help you understand them. The TF-IDF (term frequency-inverse document frequency) metric is something else you can look into: although commonly associated with information retrieval, it can be applied to this problem when combined with other machine learning techniques.
However, instead of assigning the new sample a single label, as the definition of the NB classifier would have it, you will have to determine multiple labels. You can probably use the tag co-occurrence information from the training set to help you with this.
This is a simplistic and naive solution, and a lot of details on feature selection are left out (stemming to reduce independent parameters, information gain, etc.). There are plenty of easily accessible papers on this research topic to help you try it out!
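As a starting point for the TF-IDF part, here is a small sketch that scores the terms of one document against a tiny collection. It uses the basic tf * log(N / df) weighting with no stemming or stop-word handling, and the documents are invented. The top-scoring terms are only tag candidates; the multi-label classification step described above is not shown.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf(docs, index):
    """TF-IDF score of every term in docs[index]."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    target = tokenized[index]
    tf = Counter(target)
    n_docs = len(docs)
    return {term: (count / len(target)) * math.log(n_docs / df[term])
            for term, count in tf.items()}

def top_terms(docs, index, top_n=3):
    scores = tfidf(docs, index)
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

DOCS = [
    "clustering people by interests with k-means clustering",
    "recommendation systems for people with big data",
    "tagging blog posts with naive bayes",
]
print(top_terms(DOCS, 0, top_n=1))  # ['clustering']
```

A term that appears in every document ("with" in this collection) gets a score of zero, which is the behaviour that makes TF-IDF useful for filtering out uninformative words.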

Algorithm for finding potential matches

I need to find an algorithm to find the best matches in a social network. The system is a college student social network, and the main idea is to find a study partner for a class. The idea is to suggest to the user the best potential partners based on different criteria, such as common class, GPA, rating, common schedule, etc. I wonder what would be the best algorithm to use.
Such a problem is called collaborative filtering. Collaborative filtering systems can produce personal recommendations by computing the similarity between your preferences and those of other people.
There is a lot of information about such techniques. You might start with a good presentation.
Maybe some sort of clustering algorithm could help. Those whose vectors (Common class, GPA etc...) are similar would be clustered together.
You might want to start off by looking at recommendation systems and nearest neighbor search.
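A brute-force nearest-neighbour search over student profiles could be sketched like this. The `Student` fields, the 0.0 to 4.0 GPA scale, and the weights are all hypothetical; choosing and tuning them is the real work.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Student:
    name: str
    classes: frozenset
    gpa: float             # assumed to be on a 0.0-4.0 scale
    free_slots: frozenset  # e.g. {"mon-10", "wed-14"}

def jaccard(a, b):
    """Overlap of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_score(a, b, weights=(0.5, 0.2, 0.3)):
    """Weighted similarity; the weights are illustrative, not tuned."""
    w_class, w_gpa, w_sched = weights
    gpa_closeness = 1.0 - abs(a.gpa - b.gpa) / 4.0
    return (w_class * jaccard(a.classes, b.classes)
            + w_gpa * gpa_closeness
            + w_sched * jaccard(a.free_slots, b.free_slots))

def best_partners(target, students, k=2):
    """The k highest-scoring students other than the target."""
    others = [s for s in students if s.name != target.name]
    ranked = sorted(others, key=lambda s: match_score(target, s), reverse=True)
    return [s.name for s in ranked[:k]]

ana = Student("ana", frozenset({"cs101", "math201"}), 3.6, frozenset({"mon-10", "wed-14"}))
ben = Student("ben", frozenset({"cs101", "math201"}), 3.4, frozenset({"mon-10", "fri-9"}))
cy = Student("cy", frozenset({"hist110"}), 3.6, frozenset({"wed-14"}))
dee = Student("dee", frozenset({"cs101"}), 2.0, frozenset({"tue-8"}))

print(best_partners(ana, [ana, ben, cy, dee]))  # ['ben', 'dee']
```

Scanning every student is fine for a single college; an index-based nearest-neighbour structure only becomes necessary at much larger scale.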
