pre-calculate users' interests - algorithm

I need an approach or an algorithm to pre-calculate a user's interests based on his tweets.
The user connects his account with his Twitter account, and after retrieving his tweets for the first time I will have to pre-calculate his tastes and interests.
As this user continues to use my system, I will have to make those predictions more accurate.
Is there an algorithm or a mathematical model that will help with this requirement?
Please provide existing research links, open source code, or examples which will help me get started.

You can use machine learning for this task.
One possible machine learning algorithm is Bag of Words with k-nearest neighbors:
Create a training set [users whose interests you already know], and use Bag of Words [preferably with n-grams] to "learn" the training set.
When a new user arrives - extract the words/n-grams as features - and find the k nearest neighbors to determine what the interests are.
To get improvement over time - you can add some explicit feedback - users can click agreement/disagreement with what the algorithm said. You can later use this information to extend the size of your training set - which will probably result in more accurate decisions.
This is a standard algorithm for learning "features" from sets of sentences/words, so you should at least use it as a guideline.
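A rough sketch of that pipeline (here in Python with scikit-learn; the tweets and interest labels are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Training set: users whose interests you already know.
# Each "document" is the concatenation of one user's tweets (toy examples).
train_tweets = [
    "great goal last night, what a match",
    "new phone review, battery life and camera",
    "guitar practice and a new album drop",
]
train_interests = ["sports", "technology", "music"]

# Bag of Words over unigrams and bigrams (the n-grams mentioned above).
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_tweets)

# k nearest neighbors over the word-count vectors (cosine works well for text).
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(X_train, train_interests)

# A new user arrives: vectorize their tweets and look at the nearest neighbors.
new_user = ["what a great match last night"]
print(knn.predict(vectorizer.transform(new_user)))   # -> ['sports']

The explicit agreement/disagreement clicks can simply be appended to train_tweets/train_interests and the model re-fit, which is how the training set grows over time.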
There is also an open source project that might help you: Apache Mahout.

Related

evaluating the performance of item-based collaborative filtering for binary (yes/no) product recommendations

I'm attempting to write some code for item-based collaborative filtering for product recommendations. The input has buyers as rows and products as columns, with a simple 0/1 flag to indicate whether or not a buyer has bought an item. The output is a list of items similar to a given purchased item, ranked by cosine similarity.
I am attempting to measure the accuracy of a few different implementations, but I am not sure of the best approach. Most of the literature I find mentions using some form of mean square error, but this really seems more applicable when your collaborative filtering algorithm predicts a rating (e.g. 4 out of 5 stars) instead of recommending which items a user will purchase.
One approach I was considering was as follows...
Split the data into training/holdout sets and train on the training data.
For each item (A) in the set, select data from the holdout set where users bought A.
Determine what percentage of those A-buyers bought one of the top 3 recommendations for A.
The above seems kind of arbitrary, but I think it could be useful for comparing two different algorithms when trained on the same data.
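In code, the check described above might look roughly like this (toy holdout data; recommend_top3 is a stand-in for whichever trained implementation is being evaluated):

# holdout purchases: user -> set of items bought (toy data)
holdout = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "D"},
    "u3": {"A", "B"},
}

def top3_hit_rate(item, recommend_top3):
    # share of holdout buyers of `item` who also bought one of its top-3 recommendations
    buyers = [u for u, bought in holdout.items() if item in bought]
    if not buyers:
        return None
    recs = set(recommend_top3(item))                      # top 3 from the trained model
    hits = sum(1 for u in buyers if (holdout[u] - {item}) & recs)
    return hits / len(buyers)

print(top3_hit_rate("A", lambda item: ["B", "D", "E"]))   # -> 1.0 for this toy data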
Actually your approach is quite similar to the literature, but I think you should consider using recall and precision, as most of the papers do.
http://en.wikipedia.org/wiki/Precision_and_recall
Moreover, if you use Apache Mahout, there is an implementation of recall and precision in the class GenericRecommenderIRStatsEvaluator.
The best way to test a recommender is always to manually verify the results. However, some kind of automatic verification is also good.
In the spirit of a recommendation system, you should split your data by time and see if your algorithm can predict the user's future purchases. This should be done for all users.
Don't expect it to predict everything; 100% correctness is usually a sign of over-fitting.
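For reference, precision and recall at k against such a time-based split come down to something like this (toy values):

def precision_recall_at_k(recommended, bought_later, k=3):
    # compare the top-k recommendations with what the user actually bought after the split date
    top_k = set(recommended[:k])
    relevant = set(bought_later)
    hits = len(top_k & relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall_at_k(["B", "D", "E"], ["B", "C"]))   # -> (0.333..., 0.5)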

trouble with recurrent neural network algorithm for structured data classification

TL;DR
I need help understanding some parts of a specific algorithm for structured data classification. I'm also open to suggestions for different algorithms for this purpose.
Hi all!
I'm currently working on a system involving classification of structured data (I'd prefer not to reveal anything more about it) for which I'm using a simple backpropagation through structure (BPTS) algorithm. I'm planning on modifying the code to make use of a GPU for an additional speed boost later, but at the moment I'm looking for better algorithms than BPTS that I could use.
I recently stumbled on this paper -> [1] and I was amazed by the results. I decided to give it a try, but I have some trouble understanding some parts of the algorithm, as its description is not very clear. I've already emailed some of the authors requesting clarification, but haven't heard from them yet, so, I'd really appreciate any insight you guys may have to offer.
The high-level description of the algorithm can be found on page 787. There, in Step 1, the authors randomize the network weights and also "Propagate the input attributes of each node through the data structure from frontier nodes to root forwardly and, hence, obtain the output of root node". My understanding is that Step 1 is never repeated, since it's the initialization step. The part I quote indicates that a one-time activation also takes place here. But what item in the training dataset is used for this activation of the network? And is this activation really supposed to happen only once? For example, in the BPTS algorithm I'm using, for each item in the training dataset a new neural network - whose topology depends on the current item (data structure) - is created on the fly and activated. Then the error backpropagates, the weights are updated and saved, and the temporary neural network is destroyed.
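For concreteness, this is roughly what I mean by that per-item loop - a toy tanh recursive net with shared weights, forward from the leaves to the root and backprop back down (my own minimal sketch, not the paper's model):

import numpy as np

rng = np.random.default_rng(0)
IN, HID, OUT = 4, 8, 3                       # input, hidden, output sizes (arbitrary)
W = rng.normal(0, 0.1, (HID, IN))            # input -> hidden, shared across all nodes
U = rng.normal(0, 0.1, (HID, HID))           # children hidden -> hidden, shared
b = np.zeros(HID)
V = rng.normal(0, 0.1, (OUT, HID))           # root hidden -> class scores
c = np.zeros(OUT)

class Node:
    def __init__(self, x, children=()):
        self.x = np.asarray(x, float)
        self.children = list(children)

def forward(node):
    # frontier-to-root pass: children first, then the node itself
    child_sum = np.zeros(HID)
    for ch in node.children:
        child_sum += forward(ch)
    node.child_sum = child_sum
    node.h = np.tanh(W @ node.x + U @ child_sum + b)
    return node.h

def backward(node, dh, grads):
    # root-to-frontier pass: push dLoss/dh down through the structure
    dpre = dh * (1.0 - node.h ** 2)           # derivative of tanh
    grads["W"] += np.outer(dpre, node.x)
    grads["U"] += np.outer(dpre, node.child_sum)
    grads["b"] += dpre
    dchild = U.T @ dpre                       # the same term reaches every child
    for ch in node.children:
        backward(ch, dchild, grads)

def train_step(root, label, lr=0.05):
    # the "temporary network" is just this tree; the shared weights persist between items
    global W, U, b, V, c
    h_root = forward(root)                    # one activation for THIS training item
    scores = V @ h_root + c
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    dscores = probs.copy()
    dscores[label] -= 1.0                     # softmax + cross-entropy gradient
    grads = {"W": np.zeros_like(W), "U": np.zeros_like(U), "b": np.zeros_like(b)}
    backward(root, V.T @ dscores, grads)
    V -= lr * np.outer(dscores, h_root)
    c -= lr * dscores
    W -= lr * grads["W"]
    U -= lr * grads["U"]
    b -= lr * grads["b"]
    return -np.log(probs[label])

# one tiny training item: a root with two leaves, labelled class 1
tree = Node(rng.normal(size=IN), [Node(rng.normal(size=IN)), Node(rng.normal(size=IN))])
for epoch in range(50):
    loss = train_step(tree, label=1)
print("final loss:", loss)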
Another thing that troubles me is Step 3b. There, the authors mention that they update the parameters {A, B, C, D} NT times, using equations (17), (30) and (34). My understanding is that NT denotes the number of items in the training dataset. But equations (17), (30) and (34) already involve ALL items in the training dataset, so, what's the point of solving them (specifically) NT times?
Yet another thing I failed to get is how exactly their algorithm takes into account the (possibly) different structure of each item in the training dataset. I know how this works in BPTS (I described it above), but it's very unclear to me how it works with their algorithm.
Okay, that's all for now. If anyone has any idea of what might be going on with this algorithm, I'd be very interested in hearing it (or rather, reading it). Also, if you are aware of other promising algorithms and / or network architectures (could long short term memory (LSTM) be of use here?) for structured data classification, please don't hesitate to post them.
Thanks in advance for any useful input!
[1] http://www.eie.polyu.edu.hk/~wcsiu/paper_store/Journal/2003/2003_J4-IEEETrans-ChoChiSiu&Tsoi.pdf

Matching two Social Media Profiles

How can you check if two profiles from two different Social Media sites are the same?
What algorithms exist to accomplish this and thereby assign a weight measure to the match?
Let's say that I have a profile from LinkedIn and another profile from Facebook. I know the properties of these two profiles. What algorithm can I implement to find the matching distance between these two profiles?
Thanks
You can try machine learning algorithms, specifically classification.
For simplicity, let's assume you want a binary answer: yes or no (this can be improved later).
What you have to do:
Extract the features you have from the two profiles and create a single instance for the combined pair. This is the instance that needs to be classified.
Create a training set. A training set is a set of "instances" which you know the classification for (from manually labeling them usually).
Run a classification algorithm, given the training set - that will "guess" the classification for the unclassified instances you will later get.
Some algorithms you might want to use are:
SVM - considered by many to be one of the best classification algorithms available today.
Decision Trees - especially C4.5 - a very intuitive classifier (human readable!) that is simple to use and has very short classification time.
K Nearest Neighbors - intuitive and simple to use, but behaves badly when the number of features is large.
You can also use cross validation to evaluate how good your results are.
For Java there is an open source project called Weka that implements these classification algorithms and more.
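A rough sketch of these steps (here in Python with scikit-learn rather than Weka, and with invented pair features, just to show the shape of the pipeline):

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def pair_features(linkedin, facebook):
    # one instance per profile pair, describing how well the two profiles agree
    # (the features below are made up -- use whatever properties you actually have)
    return [
        float(linkedin["name"].lower() == facebook["name"].lower()),
        float(linkedin["location"].lower() == facebook["location"].lower()),
        abs(len(linkedin["name"]) - len(facebook["name"])),
    ]

# training set: manually labelled pairs (1 = same person, 0 = different people)
pairs = [
    ({"name": "John Smith", "location": "Boston"}, {"name": "John Smith", "location": "Boston"}, 1),
    ({"name": "John Smith", "location": "Boston"}, {"name": "Jane Roe", "location": "Paris"}, 0),
    ({"name": "Jane Roe", "location": "Paris"}, {"name": "Jane Roe", "location": "Paris"}, 1),
    ({"name": "Jane Roe", "location": "Paris"}, {"name": "J. Brown", "location": "Oslo"}, 0),
]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

clf = SVC(kernel="linear").fit(X, y)
print(cross_val_score(SVC(kernel="linear"), X, y, cv=2))     # rough accuracy estimate

# a new, unlabelled pair: predicted class plus a signed score usable as a match weight
new_pair = ({"name": "John Smith", "location": "Boston"}, {"name": "John Smith", "location": "Seattle"})
x_new = [pair_features(*new_pair)]
print(clf.predict(x_new), clf.decision_function(x_new))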

extract Similar users from logs using hadoop/pig

As part of our start-up product we need to compute a "similar users" feature, and we've decided to go with Pig for it.
I've been learning Pig for a few days now and understand how it works.
So to start, here is what the log file looks like:
user url time
user1 http://someurl.com 1235416
user1 http://anotherlik.com 1255330
user2 http://someurl.com 1705012
user3 http://something.com 1705042
user3 http://someurl.com 1705042
As the number of users and URLs can be huge, we can't use a brute-force approach here, so first we need to find the users that have accessed at least one common URL.
The algorithm could be split up as below (a plain-Python sketch of the whole computation follows this list):
Find all users that have accessed some common URLs.
Generate the pair-wise combinations of those users for each resource accessed.
For each pair and URL, compute the similarity of those users; the similarity depends on the time interval between the accesses (so we need to keep track of the time).
Sum up the similarity for each pair over all URLs.
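To pin down the semantics before worrying about Pig, this is the whole computation in plain Python on a toy log (the similarity function is just a placeholder for whatever time-based measure we settle on):

from collections import defaultdict
from itertools import combinations

# toy log: (user, url, timestamp)
logs = [
    ("user1", "http://someurl.com", 1235416),
    ("user1", "http://anotherlik.com", 1255330),
    ("user2", "http://someurl.com", 1705012),
    ("user3", "http://something.com", 1705042),
    ("user3", "http://someurl.com", 1705042),
]

# 1. group accesses by URL so that we only ever compare users who share a URL
by_url = defaultdict(list)
for user, url, ts in logs:
    by_url[url].append((user, ts))

# placeholder kernel: closer accesses in time => higher similarity
def similarity(t1, t2):
    return 1.0 / (1.0 + abs(t1 - t2))

# 2-4. for each URL, form user pairs and accumulate the time-based similarity per pair
pair_similarity = defaultdict(float)
for url, accesses in by_url.items():
    for (u1, t1), (u2, t2) in combinations(accesses, 2):
        if u1 != u2:
            pair = tuple(sorted((u1, u2)))
            pair_similarity[pair] += similarity(t1, t2)

print(sorted(pair_similarity.items(), key=lambda kv: -kv[1]))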
Here is what I've written so far in Pig:
-- load the raw access log and group it by URL
A = LOAD 'logs.txt' USING PigStorage('\t') AS (uid:chararray, url:chararray, time:long);
grouped_by_url = GROUP A BY url;
I know it is not much yet, but now I don't know how to generate the pairs or move further.
So any help would be appreciated.
Thanks.
There's a nice, detailed paper from IBM on doing co-clustering with MapReduce that may be useful for you.
The Google News Personalization paper describes a fairly straightforward implementation of Locality Sensitive Hashing for solving the same problem.
For algorithms, look at papers on query/URL bipartite graphs. Here are a couple of links:
Query suggestion using hitting time - Qiaozhu Mei, Dengyong Zhou, Kenneth Church - http://www-personal.umich.edu/~qmei/pub/cikm08-sugg.ppt
Random walks on the click graph - Nick Craswell and Martin Szummer, July 2007 - http://research.microsoft.com/apps/pubs/default.aspx?id=65235

Beyond item-to-item recommendations

Simple item-to-item recommendation systems are well known and frequently implemented. An example is the Slope One algorithm. This is fine if the user hasn't rated many items yet, but once they have, I want to offer more finely-grained recommendations.
Let's take a music recommendation system as an example, since they are quite popular. If a user is viewing a piece by Mozart, a suggestion for another Mozart piece or for Beethoven might be given. But if the user has made many ratings on classical music, we might be able to find correlations between the items and see that the user dislikes vocals or certain instruments.
I'm assuming this would be a two-part process: the first part is to find correlations between each user's ratings, and the second is to build the recommendation matrix from these extra data. So the question is, are there any open-source implementations or papers that can be used for each of these steps?
Taste may have something useful. It's moved to the Mahout project:
http://taste.sourceforge.net/
In general, the idea is that given a user's past preferences, you want to predict what they'll select next and recommend it. You build a machine-learning model in which the inputs are what a user has picked in the past and the attributes of each pick. The output is the item(s) they'll pick. You create training data by holding back some of their choices, and using their history to predict the data you held back.
Lots of different machine learning models you can use. Decision trees are common.
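As a rough illustration of that shape - one model per user, trained on the attributes of the items they have rated (the attributes and ratings below are made up):

from sklearn.tree import DecisionTreeClassifier

# item attributes: [has_vocals, beats_per_minute, is_classical] (invented features)
items = {
    "mozart_requiem":   [1, 60, 1],
    "mozart_sonata_11": [0, 120, 1],
    "beethoven_5th":    [0, 108, 1],
    "opera_aria":       [1, 72, 1],
    "pop_song":         [1, 128, 0],
}

# one user's past ratings (1 = liked, 0 = disliked); this is the training data
ratings = {"mozart_sonata_11": 1, "beethoven_5th": 1, "opera_aria": 0, "pop_song": 0}

X = [items[name] for name in ratings]
y = list(ratings.values())
model = DecisionTreeClassifier().fit(X, y)        # one model per user

# score the items the user has not rated yet
for name in (n for n in items if n not in ratings):
    print(name, model.predict([items[name]])[0])

Holding back some of the user's ratings, as described above, gives you a test set for checking how often the model predicts the held-back choices.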
One answer is that any recommender system ought to have some of the properties you describe. Initially, recommendations aren't so good and are all over the place. As it learns tastes, the recommendations will come from the area the user likes.
But, the collaborative filtering process you describe is fundamentally not trying to solve the problem you are trying to solve. It is based on user ratings, and two songs aren't rated similarly because they are similar songs -- they're rated similarly just because similar people like them.
What you really need is to define your notion of song-song similarity. Is it based on how the song sounds? the composer? Because it sounds like the notion is not based on ratings, actually. That is 80% of the problem you are trying to solve.
I think the question you are really answering is, what items are most similar to a given item? Given your item similarity, that's an easier problem than recommendation.
Mahout can help with all of these things, except song-song similarity based on its audio -- or at least provide a start and framework for your solution.
There are two techniques that I can think of:
Train a feed-forward artificial neural net using backpropagation or one of its successors (e.g. Resilient Propagation); a rough sketch is given at the end of this answer.
Use version space learning. This starts with the most general and the most specific hypotheses about what the user likes and narrows them down when new examples are integrated. You can use a hierarchy of terms to describe concepts.
Common characteristics of these methods are:
You need a different function for each user. This pretty much rules out efficient database queries when searching for recommendations.
The function can be updated on the fly when the user votes for an item.
The dimensions along which you classify the input data (e.g. has vocals, beats per minute, musical scales, whatever) are very critical to the quality of the classification.
Please note that these suggestions come from university courses in knowledge based systems and artificial neural nets, not from practical experience.
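As a very rough sketch of the first technique - a per-user feed-forward net that is updated on the fly with each vote (scikit-learn is used here for brevity, and the classification dimensions are invented):

from sklearn.neural_network import MLPClassifier

# classification dimensions for a song: [has_vocals, beats_per_minute / 200, is_minor_scale]
def features(song):
    return [song["vocals"], song["bpm"] / 200.0, song["minor"]]

# one network per user; partial_fit updates it every time the user votes
user_net = MLPClassifier(hidden_layer_sizes=(8,), random_state=0)

votes = [
    ({"vocals": 0, "bpm": 110, "minor": 0}, 1),   # liked
    ({"vocals": 1, "bpm": 128, "minor": 0}, 0),   # disliked
    ({"vocals": 0, "bpm": 95,  "minor": 1}, 1),   # liked
]
for song, liked in votes:
    user_net.partial_fit([features(song)], [liked], classes=[0, 1])

# score an unseen candidate for this user
candidate = {"vocals": 0, "bpm": 100, "minor": 1}
print(user_net.predict_proba([features(candidate)]))   # [P(dislike), P(like)]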
