Social network functionality finding connections you might know - algorithm

I want to create a functionality for suggesting connections in a social network.
In the network you can have connections and connect to other users.
I want to implement a connection suggestion functionality on the network.
I think the most basic approach is to look through my connections' connections, find the user who occurs most often among them and to whom I am not yet connected, and suggest that user to me.
My questions are:
Is this a good basic approach for an easy connection finder?
Is there a good algorithm I can use to find the user who occurs most often among my connections' connections?
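For reference, the counting approach described above can be sketched in a few lines of Python. This is only a minimal sketch: the adjacency-dict representation, the `suggest_connections` name and the sample network are assumptions made for illustration, not part of any existing system.

```python
from collections import Counter

def suggest_connections(graph, user, top_n=3):
    """Rank users not yet connected to `user` by the number of mutual connections."""
    direct = graph.get(user, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the user themselves and anyone already connected.
            if candidate != user and candidate not in direct:
                counts[candidate] += 1
    # Highest count first; ties are broken by name so the order is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical undirected network stored as an adjacency dict.
network = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}
print(suggest_connections(network, "ann"))  # [('dan', 2), ('eve', 1)]
```

The work done is proportional to the number of second-degree connections of the user, which is usually small compared to the whole network.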

I'd try a machine learning approach for this problem.
I'll suggest two common machine learning concepts in order to solve this problem. In order for both of them to work - you need to extract features from the data (for example look at a subgraph, and friendship with each member in the subgraph is a binary feature).
The two approaches are:
Classification. Here you are trying to find a classifier C: User x User -> Boolean (a classifier that, given two users, gives a boolean answer: should they be friends?). The classification approach requires you to first label manually, or otherwise extract, some classified information (a big enough set of pairs, each with a classification). The algorithm will learn this pattern and use it to predict future inputs.
Clustering (a form of unsupervised learning). You can try to find clusters in your graph and suggest that users befriend the other members of their cluster.
I have to admit I have never used any of these methods for friendship suggestion, so I have no idea how accurate they will be. You can use cross-validation to estimate the accuracy of the algorithm.
If you are interested in learning more about it, a free online machine learning course from Stanford started two weeks ago: https://class.coursera.org/ml-2012-002

Related

Algorithm for clustering people with similar interests

I want to cluster people into groups based on their interests. For example, people who like machine learning and graphs may be placed in one group, and people who are interested in mathematics and economics may be placed in a different group.
The algorithm should be able to decide which people have the most closely matching interests and create clusters from them. It should also be able to output the other persons in the group in which a particular person is placed.
This does not sound like a particularly difficult clustering problem, and any of the off-the-shelf clustering algorithms will probably work well. If you know how many clusters you want, then try k-means or k-medoid clustering. If you don't know how many clusters, then try agglomerative clustering.
The difficult part of the problem will be the features. You mentioned that 'interests' could be used as the features upon which to cluster, but feature engineering and selection will always involve some trial and error.
Without more context about your problem, I can't really give a definite answer. Most clustering algorithms will work, though; the question is how "good" your results are. I'm quoting the word "good" because you'll need some sort of metric to measure that (generally inter-cluster and intra-cluster distance).
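As a rough illustration of those two quantities, here is a minimal Python sketch. The function names and the toy 2-D points are made up for the example, and using mean pairwise distance within clusters and mean centroid distance between clusters is just one of several possible definitions.

```python
from itertools import combinations
from math import dist

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def mean_intra_cluster_distance(clusters):
    """Average distance between pairs of points that share a cluster (lower is tighter)."""
    pairs = [dist(a, b) for pts in clusters for a, b in combinations(pts, 2)]
    return sum(pairs) / len(pairs) if pairs else 0.0

def mean_inter_cluster_distance(clusters):
    """Average distance between cluster centroids (higher is better separated)."""
    pairs = [dist(a, b) for a, b in combinations(map(centroid, clusters), 2)]
    return sum(pairs) / len(pairs) if pairs else 0.0

# Two hypothetical clusters of 2-D feature vectors.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 2.0)],
]
intra = mean_intra_cluster_distance(clusters)  # 2.0
inter = mean_inter_cluster_distance(clusters)  # 10.0
print(intra, inter)
```

A clustering with small intra-cluster and large inter-cluster distance is usually the one to prefer when comparing feature sets or cluster counts.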
Here's the advice given to me when I was taught on how to decide on an algorithm for data mining: Try the simplest algorithms first - quite often these are overlooked but perform quite well (Naive Bayes for supervised learning is a classic example).
To start you off, try something like k-means, which is a simple and popular method. You can find more info here: http://en.wikipedia.org/wiki/K-means_clustering (if you look at the Software section you can also find a list of implementations that you could try).
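To give a feel for what k-means does, here is a bare-bones sketch of the usual iterate-assign-and-average loop (Lloyd's algorithm) in plain Python. The binary interest vectors, the people and the choice of initial centroids are all assumptions for illustration; in practice you would use one of the existing implementations.

```python
from math import dist

def k_means(points, initial_centroids, iterations=20):
    """Assign each point to its nearest centroid, then move centroids to the cluster means."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points
        ]
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(
                tuple(sum(col) / len(members) for col in zip(*members)) if members else old
            )
        if updated == centroids:  # converged
            break
        centroids = updated
    return labels, centroids

# Hypothetical interest vectors: (machine learning, graphs, mathematics, economics).
names = ["ann", "bob", "cy", "di"]
points = [(1, 1, 0, 0), (1, 1, 1, 0), (0, 0, 1, 1), (0, 1, 1, 1)]
labels, _ = k_means(points, initial_centroids=[points[0], points[2]])
print(dict(zip(names, labels)))  # {'ann': 0, 'bob': 0, 'cy': 1, 'di': 1}
```

Passing the initial centroids explicitly keeps the example deterministic; real implementations usually pick them at random or with a seeding heuristic.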
The second requirement is to be able to output the other people in the group of a target person. This is doable with any clustering algorithm: since you'll have X subsets of people, you simply need to find the subset that contains the target person, then iterate over that subset and print everyone else in it.
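That lookup is only a few lines. A minimal sketch in Python, assuming the clustering result is available as a list of sets of names (the `cluster_mates` helper and the sample data are hypothetical):

```python
def cluster_mates(clusters, person):
    """Return the other members of the cluster that contains `person`."""
    for members in clusters:
        if person in members:
            return sorted(m for m in members if m != person)
    return []  # person was not clustered at all

# Hypothetical output of some clustering step.
clusters = [{"alice", "bob", "carol"}, {"dave", "erin"}]
print(cluster_mates(clusters, "bob"))  # ['alice', 'carol']
```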
I think the right approach will be k-means clustering. The most important part of your problem is feature selection.
Try some features that you think are most important and simply apply k-means in a statistical programming language like R, inspect the result, and improve it by modifying the features or selecting more appropriate ones.
Trial and error can give you insight if you are not sure about feature selection.
If you can provide some sample data, it will help to give some specific solutions to your problem.
It's coming a bit late, but there's actually an app in the Windows Store that does exactly that, finding profiles with similar characteristics. It's called k-modo.

Neo4j and Cluster Analysis

I'm developing a web application that will depend heavily on its ability to suggest items based on users with similar preferences. A friend of mine told me that what I'm looking for, mathematically, is some cluster analysis algorithm. On the other hand, here on SO, I was told that Neo4j (or some other graph DB) was the kind of DB I should use for this task (the preferences one).
I started studying both of these tools, and I'm having some doubts.
For cluster analysis purposes it looks to me as if a standard SQL DB would still be the perfect choice, while Neo4j would be better suited to a neural-network kind of approach (although still perfectly fit for the task).
Am I missing something? Am I trying to use the wrong tools combination?
I would love to hear some ideas on the subject.
Thanks for sharing
This depends on your data. Neo4j is capable of providing even complex recommendations in real time for one particular node. Say you want to recommend a product to a user: this can be handled within a graph DB in real time.
Using a clustering system, on the other hand, is the best way to compute recommendations for all users at once (and then maybe save them somewhere so you don't need to calculate them again).
The computational difference:
Neo4j has no initialization cost and can give you a single recommendation in an acceptable time.
Clustering needs more time for initialization (not seconds but more likely minutes or hours) and is better suited to calculating recommendations for the whole dataset. In fact, counting strictly the time for one calculation for a specific user, clustering can be faster than Neo4j, but the big restriction is the initialization cost, which makes it a poor fit for real-time applications.
The practical difference:
If you have mostly static data and it is OK for you to compute recommendations once in a while, then do clustering with SQL.
If you have dynamic data that is updated with each interaction and you always need to provide the newest recommendation, then use Neo4j.
I am currently working on various topics related to recommendation and clustering with neo4j.
I'm not exactly sure what you're looking for, but depending on how you model your data in the graph, you can easily work out clustering algorithms based on counting links to various types of nodes.
If you plan your nodes and relationships correctly, you can then identify groups of nodes that share the most links to a set of categories.
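The link-counting idea can be illustrated outside the database as well. Below is a minimal Python sketch, not Neo4j code: the user and category names are invented, and a dict of sets stands in for user-to-category relationships in the graph.

```python
from collections import Counter
from itertools import combinations

def shared_link_counts(links):
    """Count, for every pair of users, how many category nodes both of them link to."""
    counts = Counter()
    for a, b in combinations(sorted(links), 2):
        shared = len(links[a] & links[b])
        if shared:
            counts[(a, b)] = shared
    return counts

# Hypothetical user -> category relationships.
links = {
    "u1": {"jazz", "hiking", "chess"},
    "u2": {"jazz", "chess"},
    "u3": {"hiking"},
    "u4": {"cooking"},
}
counts = shared_link_counts(links)
print(counts.most_common(1))  # [(('u1', 'u2'), 2)]
```

Pairs with high counts are candidates for the same group. Comparing every pair is quadratic in the number of users, which is one reason to let the graph traversal restrict the candidates first.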
Let me introduce Reco4J (http://www.reco4j.org), an open source framework that provides recommendations based on a graph database source. It uses Neo4j as its graph database management system.
Have a look at it and contact us if you are interested in support.
It is in a really early release but we are working hard to provide extended documentation and new interesting features.
Cheers,
Alessandro

User recognition algorithm

Let's say you have a big IRC channel log, and you want to find out which users are using multiple accounts. As input you have the time each user connects to the server and some sort of text analysis (word frequency and so on), and as output you want the likelihood that two users "match".
Is it possible to do it using an ANN? Are there better algorithms to accomplish that task?
PS: using IP addresses is not an accepted solution :)
The problem with using neural networks is that you need a robust set of training data--that is, you need to have lots of examples of people using multiple accounts where you already know that's what they're doing. Furthermore, if the people you're trying to identify have ever played a role-playing game, they'll probably be able to make themselves seem quite a bit different if they want to.
So, if people are acting just like themselves and you have a pretty good training data set, then you stand a chance. You should probably start with methods used by forensic linguistics.
But I suspect that what you'll probably end up doing is identifying people who are sort of similar to each other. Good for a matchmaking site, perhaps; not so cool for most other things. (For example, I would think this would be a perfectly dreadful way to try to find members of Anonymous in other guises.)
This problem is known as "authorship detection" (or sometimes, in a particular domain, "plagiarism detection"). It can be done using a variety of statistical algorithms, of which neural networks aren't the easiest.
Check out the Cavnar & Trenkle algorithm for text classification. That may be made into a useful baseline algorithm for this task. Implementations in various languages are available on the web. You may want to turn it into a clustering algorithm instead of a classifier.
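A simplified sketch of that n-gram idea in Python is shown below: each text is reduced to a ranked list of its most frequent character n-grams, and two profiles are compared by summing rank differences (the "out-of-place" measure). The padding character, the n-gram lengths, the profile size and the sample nicknames are illustrative assumptions, not a faithful reproduction of the published algorithm.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top_k=300):
    """Map the top_k most frequent character n-grams (length 1..max_n) to their rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(profile_a, profile_b):
    """Sum of rank differences; n-grams missing from profile_b get a fixed maximum penalty."""
    penalty = len(profile_b)
    return sum(
        abs(rank - profile_b[gram]) if gram in profile_b else penalty
        for gram, rank in profile_a.items()
    )

# Hypothetical chat lines grouped per nickname.
logs = {
    "nick_a": "lol that is sooo true lol",
    "nick_b": "sooo true lol that is",
    "nick_c": "I respectfully disagree with the premise",
}
profiles = {nick: ngram_profile(text) for nick, text in logs.items()}
for nick in ("nick_b", "nick_c"):
    print(nick, out_of_place(profiles["nick_a"], profiles[nick]))
```

A lower distance means more similar writing habits. Note that the measure as written is not symmetric, and with logs this short the numbers mean very little; you would need a lot of text per account.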

Algorithm for finding potential matches

I need to find an algorithm to find the best matches in a social network. The system is a college student social network, and the main idea is to find a study partner for a class. The idea is to suggest to the user the best potential partners based on different criteria, such as common classes, GPA, rating, common schedule, etc. I wonder what would be the best algorithm to use.
This kind of problem is called collaborative filtering. Collaborative filtering systems can produce personal recommendations by computing the similarity between your preferences and those of other people.
There is a lot of information about such techniques. You might start with a good presentation.
Maybe some sort of clustering algorithm could help. Those whose vectors (Common class, GPA etc...) are similar would be clustered together.
You might want to start off by looking at recommendation systems and nearest neighbor search.
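As a starting point, a brute-force nearest neighbor search over per-student feature vectors might look like the Python sketch below. The feature encoding (GPA and rating scaled to 0..1, plus one flag per class), the student names and the equal weighting of features are all assumptions made up for this example.

```python
from math import dist

def nearest_partners(students, target, k=2):
    """Return the k students whose feature vectors are closest to the target's."""
    me = students[target]
    scored = [(dist(me, vec), name) for name, vec in students.items() if name != target]
    return [name for _, name in sorted(scored)[:k]]

# Hypothetical vectors: (GPA / 4, rating / 5, takes class A, takes class B).
students = {
    "amy": (0.90, 0.8, 1, 0),
    "ben": (0.85, 0.9, 1, 0),
    "cal": (0.50, 0.4, 0, 1),
    "dee": (0.90, 0.7, 1, 1),
}
print(nearest_partners(students, "amy"))  # ['ben', 'dee']
```

Scaling matters here: without it, whichever feature has the largest numeric range would dominate the distance. A linear scan is fine for one college; for larger data sets a spatial index would be the next step.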

Is the stackoverflow community a scale-free or a small world network?

I am a graph/network enthusiast and this is just for my curiosity :)
I am trying to model the StackOverflow community as a graph/network. Assume that the people in the SO community are nodes and that an answer given to any question establishes a relationship between these nodes. The relationship can be assumed to be directed (link from answer -> question) or undirected. The graph could be weighted, and the weights of the nodes could represent the number of up/down votes (normalized to a scale of 0 to 1).
What kind of graph/network does one end up with at any given snapshot in time? Is it scale-free? Is it a small world? The graph is continuously evolving over time, and I would like to understand its structure and dynamics.
Is there a way I can retrieve this relationship data, maybe through SO APIs, or can someone from SO help me out with (sample) data?
Clarification edit:
Scale-free network: A network whose degree distribution asymptotically follows a power law.
Small-world network: A network that has sub-networks characterized by presence of connections between almost any two nodes within them and most pairs of nodes are connected by at least one short path.
To the second part of your question:
Is there a way I can retrieve this relationship data, maybe through SO APIs, or can someone from SO help me out with (sample) data?
Try these questions instead. There are a lot of plans to implement an API for accessing SO data. Some things are still changing, but it is possible to screen-scrape the data or access it via JSON (afaik).
Is there a guide to accessing StackOverflow data programmatically?
What would you want to see in a StackOverflow API?
Are there plans for a StackOverflow API?
Try it out. Good luck!
What kind of graph/network does one end up with at any given snapshot in time? Is it scale-free? Is it a small world? The graph is continuously evolving over time, and I would like to understand its structure and dynamics.
It takes only a few links between remote clusters to turn a random network into a small world one, so it's quite likely to be small world.
As to whether it's scale-free, that would require there to be a few posters with lots of answers and many with only one or two. I seem to recall Jeff saying in one of the podcasts that there were lots of users with only one question; you might be better off asking the question there rather than here, as he will have the data.
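Once you have the data, a first look at the degree distribution is easy to compute. A minimal Python sketch, assuming the answer relationships are available as undirected (user, user) pairs; the edge list here is invented:

```python
from collections import Counter

def degree_distribution(edges):
    """Map each degree to the number of nodes that have it (undirected edges)."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())

# Hypothetical "answered a question of" pairs.
edges = [("hub", "u1"), ("hub", "u2"), ("hub", "u3"), ("hub", "u4"), ("u1", "u2")]
for deg, n_nodes in sorted(degree_distribution(edges).items()):
    print(deg, n_nodes)
# prints:
# 1 2
# 2 2
# 4 1
```

Plotting these counts on log-log axes and seeing a roughly straight line is a hint of a heavy-tailed distribution, but it is not a rigorous test for a power law.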

Resources