I am working on a recommendation system. It will be an Android application in which the user inputs their preferences and, on the basis of those preferences, matching profiles are shown to that user. I am getting this data from the user and storing it in Firebase.
The preferences are numerical values, and in order to show matched profiles to the user I am using two algorithms to calculate the similarity between users: cosine similarity and Pearson correlation.
I am fetching the name of the algorithm from the application and then running that algorithm to show similar profiles to the user.
if (request.query.algo === "cosine") {
    // compute cosine similarity
} else if (request.query.algo === "pearson-correlation") {
    // compute Pearson correlation coefficients
}
As this will be a real-time application, this approach is not right; I want to implement the Strategy design pattern so that the algorithm is decided at run time rather than at compile time.
So now the problem is: in the Strategy design pattern, how would I decide when to use which algorithm?
For example, when you buy something with a credit card, the type of credit card doesn't matter. All credit cards have a magnetic strip with encoded information in it. The strip, and what it contains, represents the 'interface', and the type of card is the 'implementation'. Each credit card can be replaced by any other, and all are fully independent of each other.
Similarly, on what basis should I choose between cosine and Pearson at run time with the Strategy design pattern?
From my understanding, Pearson performs worse in cases where two user profiles have very different sets of items (in this case, preferences).
Perhaps that could be your criterion: in cases where the number of matching preferences is above a certain threshold, use Pearson, and in other cases use cosine.
You could perhaps show your user a "close match" list, which uses cosine to show users whose profiles have a lot in common.
Then you could show a second list, titled "You might also be interested in", which uses Pearson to show matching profiles that don't have a lot of preferences in common.
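To make the threshold idea concrete, here is a minimal sketch of the Strategy pattern in Python; the class names and the threshold value are just illustrative, and in your app this logic would sit wherever the similarity is computed:

import math

class SimilarityStrategy:
    def score(self, a, b):
        # a, b: dicts mapping preference name -> numeric value
        raise NotImplementedError

class CosineSimilarity(SimilarityStrategy):
    def score(self, a, b):
        keys = set(a) | set(b)
        dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

class PearsonCorrelation(SimilarityStrategy):
    def score(self, a, b):
        common = sorted(set(a) & set(b))
        if len(common) < 2:
            return 0.0
        xs = [a[k] for k in common]
        ys = [b[k] for k in common]
        n = len(common)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy) if sx and sy else 0.0

OVERLAP_THRESHOLD = 5  # hypothetical cut-off; tune it on your own data

def pick_strategy(a, b):
    # Pearson when the two profiles share enough preferences, cosine otherwise
    if len(set(a) & set(b)) >= OVERLAP_THRESHOLD:
        return PearsonCorrelation()
    return CosineSimilarity()

# usage: similarity = pick_strategy(u1, u2).score(u1, u2)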
Related
I have created a two-stage ranking system based on textual similarity (cosine similarity) between query–document pairs. Now I need to validate the ranking system, i.e. check whether the retrieved, ranked items are correct with respect to the user; which approach should I opt for? I have read about pointwise/pairwise/listwise approaches to validating rankings, but for manual evaluation of a ranking system, which would be more helpful? If somebody can suggest a better strategy for evaluating the ranking, it would be very helpful. Thanks.
If I understand the question correctly, you are looking for an evaluation methodology to figure out whether your two-stage retrieval system works well or not. If so, you can use one of the following evaluation methodologies:
Relevance judgements: you can use TREC-like collections with a few hundred queries and explicit relevance judgements, and use IR evaluation metrics (like MAP, P@10, NDCG, etc.) to evaluate your model (a short sketch of two of these metrics follows this list).
A/B testing: you can show the initial results and the results re-ranked by the second stage of your retrieval system, and ask users to judge whether the re-ranked list is better or not.
Click data: if you have access to search engine logs, you can use users' click information to evaluate your model. To do so, you should be aware of several bias problems, e.g. position bias.
Among the aforementioned strategies, the first one should be the easiest and cheapest to do. You just need access to TREC data, which is not private (but you do need to pay a few hundred dollars to get access to most of the collections).
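If you go the relevance-judgements route, here is a rough sketch of two of the metrics mentioned above, P@k and NDCG@k (the function names are mine, not from any TREC toolkit):

import math

def precision_at_k(ranked_ids, relevant_ids, k=10):
    # fraction of the top-k retrieved documents that are relevant
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / k

def ndcg_at_k(ranked_ids, relevance, k=10):
    # relevance: dict mapping doc id -> graded relevance (e.g. 0, 1, 2)
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg else 0.0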
I am trying to fully understand Amazon's item-to-item algorithm in order to apply it to my system to recommend items the user might like, based on the items the user previously liked.
So far I have read these: the Amazon paper, an item-to-item presentation and item-based algorithms. I also found this question, but after that I just got more confused.
From what I can tell, I need to follow these steps to get the list of recommended items:
Have my data set with the items the users liked (I have set liked = 1 and not liked = 0).
Use the Pearson correlation score (how is this done? I found the formula, but is there any example?).
Then what should I do?
So I came up with these questions:
What are the differences between item-to-item and item-based filtering? Are they the same algorithm?
Is it right to replace the rating score with a simple liked/not-liked flag?
Is it right to use the item-to-item algorithm, or is there another that is more suitable for my case?
Any information about this topic will be appreciated.
Great questions.
Think about your data. You might have unary (consumed or null), binary (liked and not liked), ternary (liked, not liked, unknown/null), continuous (null and some numeric scale), or even ordinal (null and some ordinal scale) data. Different algorithms work better with different data types.
Item-item collaborative filtering (also called item-based) works best with numeric or ordinal scales. If you just have unary, binary, or ternary data, you might be better off with data mining algorithms like association rule mining.
Given a matrix of users and their ratings of items, you can calculate the similarity of every item to every other item. Matrix manipulation and calculation are built into many libraries: try out scipy and numpy in Python, for example. You can just iterate over items and use the built-in matrix calculations to do much of the work of computing cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity). Or download a framework like Mahout or Lenskit, which does this for you.
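For example, a minimal numpy sketch of the item-item similarity step, assuming a small ratings matrix with users as rows and items as columns (the numbers are made up):

import numpy as np

# rows = users, columns = items; 0 means "no rating / not liked"
ratings = np.array([
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
], dtype=float)

# cosine similarity between every pair of item columns
norms = np.linalg.norm(ratings, axis=0)
norms[norms == 0] = 1.0                  # avoid division by zero for items nobody rated
normalized = ratings / norms
item_sim = normalized.T @ normalized     # item_sim[i, j] = cosine similarity of items i and j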
Now that you have a matrix of every item's similarity to every other item, you might want to suggest items for user U. So look at her history of items. For each history item I and for each item ID in your dataset, add the similarity of I to ID to a list of candidate item scores. When you've gone through all history items, sort the list of candidate items by score, descending, and recommend the top ones.
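Continuing from the item_sim matrix above, a sketch of that scoring loop (top_n and the variable names are just illustrative):

import numpy as np

def recommend(history, item_sim, top_n=5):
    # history: indices of the items the user already has
    scores = item_sim[history].sum(axis=0)   # add each history item's similarity to every candidate
    scores[history] = -np.inf                # never re-recommend items already in the history
    return np.argsort(scores)[::-1][:top_n]  # highest-scoring candidates first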
To answer the remaining questions: a continuous or ordinal scale will give you the best collaborative filtering results. Don't use a "liked" versus "unliked" scale if you have better data.
Matrix factorization algorithms perform well, and if you don't have many users and you don't have lots of updates to your rating matrix, you can also use user-user collaborative filtering. Try item-item first, though: it's a good all-purpose recommender algorithm.
I'm attempting to write some code for item-based collaborative filtering for product recommendations. The input has buyers as rows and products as columns, with a simple 0/1 flag to indicate whether or not a buyer has bought an item. The output is a list of similar items for a given purchased item, ranked by cosine similarity.
I am attempting to measure the accuracy of a few different implementations, but I am not sure of the best approach. Most of the literature I find mentions using some form of mean square error, but this really seems more applicable when your collaborative filtering algorithm predicts a rating (e.g. 4 out of 5 stars) instead of recommending which items a user will purchase.
One approach I was considering was as follows...
split data into training/holdout sets, train on training data
For each item (A) in the set, select data from the holdout set where users bought A
Determine what percentage of those A-buyers bought one of the top 3 recommendations for A
The above seems kind of arbitrary, but I think it could be useful for comparing two different algorithms when trained on the same data.
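A rough sketch of that procedure, assuming an item_sim matrix built from the training data and a holdout set of (user, items-bought) pairs (all names here are hypothetical):

import numpy as np

def hit_rate_top3(item_sim, holdout):
    # holdout: list of (user_id, set_of_item_indices_bought) from the holdout split
    hits, total = 0, 0
    for item_a in range(item_sim.shape[0]):
        ranked = np.argsort(item_sim[item_a])[::-1]
        top3 = [i for i in ranked if i != item_a][:3]   # top 3 recommendations for item A
        for _, bought in holdout:
            if item_a in bought:                        # this holdout user bought A
                total += 1
                if bought & set(top3):                  # ...and also bought one of A's top 3
                    hits += 1
    return hits / total if total else 0.0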
Actually your approach is quite similar to the literature, but I think you should consider using recall and precision, as most of the papers do.
http://en.wikipedia.org/wiki/Precision_and_recall
Moreover, if you use Apache Mahout, there is an implementation of recall and precision in the class GenericRecommenderIRStatsEvaluator.
The best way to test a recommender is always to manually verify the results. However, some kind of automatic verification is also good.
In the spirit of a recommendation system, you should split your data in time and see if your algorithm can predict the user's future purchases. This should be done for all users.
Don't expect it to predict everything; 100% correctness is usually a sign of over-fitting.
I am trying to develop a method to identify the browsing pattern of a user on the basis of page requests.
In a simple example, I have created 8 pages, and for each page request from the user I have stored that page's request frequency in the database.
Now, my hypothesis is to identify differences in the page request pattern; my assumption is that if the pattern differs from the pre-existing one, then it is a different (fraudulent) user. I am trying to develop this method as part of a multi-factor authentication system.
Now when a user logs in and browses with a different pattern from the ones observed previously, the system should be able to identify it as a change in pattern.
The question is how to use these data values to check whether the current pattern relates to the pre-existing patterns or not.
OK, here's a pretty simple idea (and basically, what you're looking to do is generate a set of features, then identify if the current session behaviour is different to the previously observed behaviour). I like to think of these one-class problems (only normal behaviour to train on, want to detect significant departure) as density estimation problems, so here's a simple probability model which will allow you to get the probability of a current request pattern. Basically, when this gets too low (and how low that is will be something you need to tune for the desired behaviour), something is going on.
Our observations consist of counts for each of the pages. Let their sum, the total number of requests, be equal to c_total, and counts for each page i be p_i. Then I'd propose:
c_total ~ Poisson(\lambda)
p|c_total ~ Multinomial(\theta, c_total)
This allows you to assign probability to a new observation given learned user-specific parameters \lambda (uni-variate) and \theta (vector of same dimension as p). To do this, calculate the probability of seeing that many requests from the pmf of the Poisson distribution, then calculate the probability of seeing the page counts from the multinomial, and multiply them together. You probably then want to normalise by c_total so that you can compare sessions with different numbers of requests (since the more requests, the more numbers < 1 you're multiplying together).
So, all that's left is to get the parameters from previous, "good" sessions from that user. The simplest thing is maximum likelihood, where \lambda is the mean total number of requests in previous sessions, and \theta_i is the proportion of all page views which were p_i (for that particular user). This may work for you: however, given that you want to be learning from very small numbers of observations, I'd be tempted to go with a full Bayesian model. This will also let you neatly update parameters after each non-suspicious observation. Inference in these distributions is very easy, with conjugate priors for \lambda and \theta and analytic predictive distributions, so it won't be difficult if you're familiar with these kinds of model at all.
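A small sketch of the maximum-likelihood version of this model with scipy (the Bayesian variant would replace fit_parameters with posterior updates; the names here are just illustrative):

import numpy as np
from scipy.stats import poisson, multinomial

def fit_parameters(good_sessions):
    # good_sessions: list of per-page request-count vectors from known-good sessions
    sessions = np.asarray(good_sessions, dtype=float)
    lam = sessions.sum(axis=1).mean()               # mean total number of requests per session
    theta = sessions.sum(axis=0) / sessions.sum()   # proportion of requests going to each page
    return lam, theta                               # in practice, smooth theta so unseen pages don't give -inf

def session_score(counts, lam, theta):
    counts = np.asarray(counts)
    c_total = int(counts.sum())
    log_p = poisson.logpmf(c_total, lam) + multinomial.logpmf(counts, n=c_total, p=theta)
    return log_p / max(c_total, 1)                  # normalise by the number of requests

# flag the session when session_score(...) falls below a threshold tuned on good sessions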
One approach would be to use an unsupervised learning method such as a Self-Organizing Map (SOM, http://en.wikipedia.org/wiki/Self-organizing_map). Train the SOM on data representing expected/normal user behavior and then see how well the candidate data set fits the trained map. Keywords to search for in conjunction with "Self-organizing maps" might be "novelty/anomaly/intrusion detection" (turns up e.g. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.2616&rep=rep1&type=pdf)
You should think about whether fraudulent use-cases can be modeled in advance (in which case you can train detectors specifically for them) or whether only deviations from normal behavior are of interest.
If you want to start simple, implement a cosine similarity measure. This would allow you to define a set of "good" vectors. The current user's activity could be compared to the good vectors, and if it is not close enough to any of them, the activity is flagged.
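For instance, a minimal sketch of that check (the 0.8 threshold is made up and would need tuning on real sessions):

import numpy as np

def cosine(u, v):
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu and nv else 0.0

def is_flagged(current_counts, good_vectors, threshold=0.8):
    # flag the activity if it is not close enough to any known-good request vector
    current = np.asarray(current_counts, dtype=float)
    best = max(cosine(current, np.asarray(g, dtype=float)) for g in good_vectors)
    return best < threshold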
I would like to create an algorithm to distinguish people writing on a forum under different nicknames.
The goal is to discover people registering a new account to flame the forum anonymously, rather than under their main account.
Basically, I was thinking about stemming the words they use and comparing users according to the similarity of these words.
As shown in the picture, user3 and user4 use the same words, which means there is probably one person behind both accounts.
It's clear that there are a lot of common words used by all users, so I should focus on "user-specific" words.
Input is (related to the image above):
<word1, user1>
<word2, user1>
<word2, user2>
<word3, user2>
<word4, user2>
<word5, user3>
<word5, user4>
... etc. The order doesn't matter.
Output should be:
user1
user2
user3 = user4
I am doing this in Java but I want this question to be language independent.
Any ideas how to do it?
1) how to store words/users? What data structures?
2) how to get rid of common words that everybody uses? I have to somehow ignore them among the user-specific words. Maybe I could just ignore them because they get lost, but I am afraid they will hide the significant differences in the "user-specific" words.
3) how to recognize the same user? Somehow count the shared words between each pair of users?
I am very thankful in advance for any advice.
In general, this is the task of author identification, and there are several good papers, like this one, that may give you a lot of information. Here are my own suggestions on this topic.
1. User recognition/author identification itself
The simplest kind of text classification is classification by topic, and there you take meaningful words first of all. That is, if you want to distinguish text about Apple the company from text about apple the fruit, you count words like "eat", "oranges", "iPhone", etc., but you commonly ignore things like articles, word forms, part-of-speech (POS) information and so on. However, many people may talk about the same topics but use different styles of speech, that is, articles, word forms and all the things you ignore when classifying by topic. So the first and main thing you should consider is collecting the most useful features for your algorithm. An author's style may be expressed by the frequency of words like "a" and "the", by POS information (e.g. some people tend to use the present tense, others the future), by common phrases ("I would like" vs. "I'd like" vs. "I want") and so on. Note that topic words should not be discarded completely; they still show the themes the user is interested in. However, you should treat them somewhat specially; e.g. you can pre-classify texts by topic and then rule out users who are not interested in it.
When you are done with feature collection, you may use one of the machine learning algorithms to find the best guess for the author of a text. For me, the two best suggestions here are a probabilistic model and the cosine similarity between the text's vector and the user's overall vector.
2. Discriminating common words
Or, in the present context, common features. The best way I can think of to get rid of words that are used by all people more or less equally is to compute the entropy of each such feature:
entropy(x) = -sum(P(Ui|x) * log(P(Ui|x)))
where x is a feature, Ui is the i-th user, P(Ui|x) is the conditional probability of the i-th user given feature x, and the sum is taken over all users.
A high entropy value indicates that the distribution of this feature across users is close to uniform, and thus the feature is almost useless for telling users apart.
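A small sketch of that entropy computation, estimating P(Ui|x) from usage counts (the names are illustrative):

import math

def feature_entropy(user_counts):
    # user_counts: how often each user used this feature, e.g. [12, 3, 0, 7]
    total = sum(user_counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in user_counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

# features whose entropy is close to log(number_of_users) are near-uniform and can be dropped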
3. Data representation
The common approach here is to have a user-feature matrix. That is, you just build a table where rows are user IDs and columns are features. E.g. cell [3][12] shows how many times user #3 used feature #12 (don't forget to normalize these frequencies by the total number of features the user ever used!).
Depending on the features you are going to use and the size of the matrix, you may want to use a sparse matrix implementation instead of a dense one. E.g. if you use 1000 features and, for any particular user, around 90% of the cells are 0, it doesn't make sense to keep all those zeros in memory, and a sparse implementation is the better option.
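For example, with scipy (the sizes and values here are just placeholders):

from scipy.sparse import lil_matrix

n_users, n_features = 10_000, 1000
m = lil_matrix((n_users, n_features))   # efficient format for incremental filling
m[3, 12] = 5 / 40                       # user #3 used feature #12 five times out of 40 feature uses
m = m.tocsr()                           # convert to CSR for fast arithmetic once filled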
I recommend a language modelling approach. You can train a language model (unigram, bigram, parsimonious, ...) on each of your user accounts' words. That gives you a mapping from words to probabilities, i.e. numbers between 0 and 1 (inclusive) expressing how likely it is that a user uses each of the words you encountered in the complete training set. Language models can be stored as arrays of pairs, hash tables or sparse vectors. There are plenty of libraries on the web for fitting LMs.
Such a mapping can be considered a high-dimensional vector, in the same way documents are considered vectors in the vector space model of information retrieval. You can then compare these vectors using KL-divergence or any of the popular distance metrics: Euclidean distance, cosine distance, etc. A strong similarity/small distance between two users' vectors might then indicate that they belong to one and the same person.
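A small sketch of that idea with add-alpha smoothed unigram models and KL divergence (alpha and the function names are my own choices):

import math
from collections import Counter

def unigram_lm(words, vocab, alpha=0.1):
    # add-alpha smoothed unigram model: word -> probability over the shared vocabulary
    counts = Counter(words)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    # KL(p || q); both models cover the same vocabulary and q is smoothed, so q[w] > 0
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

# a small kl_divergence(lm_account_a, lm_account_b) suggests the two accounts write similarly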
how to store words/users? What data structures?
You probably have some kind of representation for the users and the posts that they have made. I think you should have a list of words, and a list corresponding to each word containing the users who use it. Something like:
<word: <user#1, user#4, user#5, ...> >
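In Python that could be as simple as an inverted index built with a defaultdict, using the <word, user> pairs from the question as input:

from collections import defaultdict

pairs = [("word1", "user1"), ("word2", "user1"), ("word2", "user2"),
         ("word3", "user2"), ("word4", "user2"), ("word5", "user3"),
         ("word5", "user4")]          # the <word, user> input from the question

word_to_users = defaultdict(set)
for word, user in pairs:
    word_to_users[word].add(user)
# word_to_users["word5"] == {"user3", "user4"}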
how to get rid of common words everybody uses?
Hopefully, you have a set of stopwords. Why not extend it to include commonly used words from your forum? For example, for stackoverflow, some of the most frequently used tags' names should qualify for it.
how to recognize the same users?
In addition to using similarity or word-frequency based measures, you can also try using interactions between users. For example, user3 likes/upvotes/comments on each and every post by user8, or a new user does similar things for some other (older) user in the same way.