Extract similar users from logs using Hadoop/Pig

As part of our start-up product, we need to compute a "similar users" feature, and we've decided to go with Pig for it.
I've been learning Pig for a few days now and understand how it works.
To start, here is what the log file looks like:
user url time
user1 http://someurl.com 1235416
user1 http://anotherlik.com 1255330
user2 http://someurl.com 1705012
user3 http://something.com 1705042
user3 http://someurl.com 1705042
As the number of users and URLs can be huge, we can't use a brute-force approach here, so first we need to find the users that have accessed at least one common URL.
The algorithm could be split as below:
Find all users that have accessed some common URLs.
Generate pair-wise combinations of all users for each resource accessed.
For each pair and URL, compute the similarity of those users: the similarity depends on the time interval between the accesses (so we need to keep track of the time).
Sum up the per-URL similarities for each pair.
Here is what I've written so far:
A = LOAD 'logs.txt' USING PigStorage('\t') AS (uid:bytearray, url:bytearray, time:long);
grouped_pos = GROUP A BY ($1);
I know it is not much yet, but I don't know how to generate the pairs or how to move further.
Any help would be appreciated.
Thanks.
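The four steps listed in the question can be sketched in plain Python to make the data flow concrete before translating it to Pig. This is only a minimal sketch: the function name `pair_similarities`, the time-decay formula, and the `scale` parameter are assumptions, since the question does not define how the time interval maps to a similarity.

```python
from collections import defaultdict
from itertools import combinations


def pair_similarities(records, scale=100_000.0):
    """records: iterable of (user, url, time) tuples.

    Returns {(user_a, user_b): score} for users sharing at least one URL.
    """
    # Step 1: group accesses by URL so only users sharing a URL are compared.
    by_url = defaultdict(list)
    for user, url, time in records:
        by_url[url].append((user, time))

    scores = defaultdict(float)
    for visits in by_url.values():
        # Step 2: pair-wise combinations of the visits to this URL.
        for (u1, t1), (u2, t2) in combinations(visits, 2):
            if u1 == u2:
                continue  # the same user visited this URL twice
            pair = tuple(sorted((u1, u2)))
            # Step 3: hypothetical similarity that decays with the time gap.
            # Step 4: accumulate it over every URL the pair shares.
            scores[pair] += 1.0 / (1.0 + abs(t1 - t2) / scale)
    return dict(scores)


logs = [
    ("user1", "http://someurl.com", 1235416),
    ("user1", "http://anotherlik.com", 1255330),
    ("user2", "http://someurl.com", 1705012),
    ("user3", "http://something.com", 1705042),
    ("user3", "http://someurl.com", 1705042),
]
result = pair_similarities(logs)
```

With the sample log, only the three users who visited http://someurl.com are paired, and user2/user3 score highest because their accesses are closest in time. In Pig, the pairing step is usually written as a join of the relation with a copy of itself on url, followed by a GROUP on the user pair.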

There's a nice, detailed paper from IBM on doing co-clustering with MapReduce that may be useful for you.
The Google News Personalization paper describes a fairly straightforward implementation of Locality Sensitive Hashing for solving the same problem.

For algorithms, look at papers on query/URL bipartite graphs. Here are a couple of links:
Query suggestion using hitting time
by Qiaozhu Mei, Dengyong Zhou, Kenneth Church
http://www-personal.umich.edu/~qmei/pub/cikm08-sugg.ppt
Random walks on the click graph
Nick Craswell and Martin Szummer
July 2007
http://research.microsoft.com/apps/pubs/default.aspx?id=65235

Related

What matching algorithm could I use?

I would need some help because I don't know which algorithm I could use for the following (I use Python):
Steve is 25 and he buys orange juice every day
Maria is 23 and she likes to buy smoothies
Steve's and Maria's tastes are pretty much the same.
Juan is 16 and he only drinks sodas
Juan's tastes are not the same as Steve's and Maria's.
====================================================
I would like to use a matching algorithm that will detect the users who have the same drink preference and a close age. To continue with the example, Steve and Maria would be matched together but not Juan. Which one should I use?

I agree with @klutt that your task is pretty vague. Two approaches come to mind, but not knowing more about your problem limits the detail I can provide in my answer. I am interpreting the question as if you are taking in raw text and might want to process more sentences that have very similar semantic and syntactic structure.
An algorithmic approach:
Assuming that your word choices are static in their semantic meaning (Maria is 23 ... Steve is 25), we can parse each sentence and identify tokens like is or and or same and essentially perform lexical analysis on the text... from here, you could continue thinking about how you would go about matching and so forth... but this is rather complicated...
Neural Network approach:
If you are taking in raw text in the form of sentences, it's a problem that's not straightforward to solve using a top-down algorithmic approach.
You could take an approach with neural networks that trains a model to solve your problem, but then again what you seem to be asking is quite complex since there are multiple "facts" within each sentence that are not semantically related. For example, your second sentence identifies that Maria is 23 but at the end of that sentence there is a comparison between Steve and Maria. And your first sentence only identifies Steve as 25.
Even if you chunk raw text into sentences, you would need a very finely tuned neural network architecture and a lot of training data to get remotely close to your goal.
Now, both of those solutions are very complex... but if you wanted to create an application that collects this data (via a form or prompt) and puts it into a structured format (like a JSON or XML object) to organize and store the data in memory (perhaps writing out to a database or file for persistent storage), that might be a good route to go down.
This can serve as a good lesson in how to think about data as well. It is one thing if you have a pool of thousands of sentences, just raw data that you need to organize for quantitative purposes (classic qualitative -> quantitative problems). It is another thing if you are going to be collecting this data. If you are going to be collecting data, having a program that collects and organizes names, ages, and drink preferences (and then organizes that data within certain data structures), then we can talk about matching algorithms.
I will also add here that if you do have structured data, collaborative filtering (mentioned by Shridhar) is a great starting place.
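Once the data is structured, the matching asked for in the question (same drink preference and a close age) can be sketched in a few lines of Python. The `DRINK_GROUP` mapping, the `max_age_gap` threshold, and the function name `match_users` are assumptions made for illustration; the question does not say how "the same taste" or "a close age" should be defined.

```python
from itertools import combinations

# Hypothetical grouping of drinks into taste categories.
DRINK_GROUP = {
    "orange juice": "fruit",
    "smoothie": "fruit",
    "soda": "soft drink",
}


def match_users(users, max_age_gap=3):
    """Return name pairs whose drinks fall in the same group and whose ages are close."""
    matches = []
    for a, b in combinations(users, 2):
        same_taste = DRINK_GROUP[a["drink"]] == DRINK_GROUP[b["drink"]]
        close_age = abs(a["age"] - b["age"]) <= max_age_gap
        if same_taste and close_age:
            matches.append((a["name"], b["name"]))
    return matches


users = [
    {"name": "Steve", "age": 25, "drink": "orange juice"},
    {"name": "Maria", "age": 23, "drink": "smoothie"},
    {"name": "Juan", "age": 16, "drink": "soda"},
]
matches = match_users(users)
```

Comparing every pair is fine for a handful of users; for a larger data set, a collaborative filtering library would be the more realistic choice.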

Collaborative filtering best suits your needs.
In the newer, narrower sense, collaborative filtering is a method of
making automatic predictions (filtering) about the interests of a user
by collecting preferences or taste information from many users
(collaborating). The underlying assumption of the collaborative
filtering approach is that if a person A has the same opinion as a
person B on an issue, A is more likely to have B's opinion on a
different issue than that of a randomly chosen person. For example, a
collaborative filtering recommendation system for television tastes
could make predictions about which television show a user should like
given a partial list of that user's tastes (likes or dislikes).
Note that these predictions are specific to the user, but use
information gleaned from many users. This differs from the simpler
approach of giving an average (non-specific) score for each item of
interest, for example based on its number of votes.

Pre-calculate users' interests

I need an approach or an algorithm to pre-calculate a user's interests based on his tweets.
The user connects his account with his Twitter account, and after retrieving his tweets for the first time I will have to pre-calculate his tastes and interests.
As this user continues to use my system, I will have to make those predictions more accurate.
Is there an algorithm or a mathematical model which will help with this requirement?
Please provide existing research links, open source code, or examples which will help me get started.

You can use Machine-Learning for this task.
One possible machine learning algorithm is Bag Of Words with k-nearest neighbors:
Create a training set [users whose interests you already know], and use the Bag Of Words [preferably with n-grams] to "learn" the training set.
When a new user arrives - have the words/n-grams extracted as features - and find the k nearest neighbors to determine what the interests are.
To get improvement over time - you can have some additional explicit feedback - users can click on agreement/disagreement for what the algorithm said. You can later use this information to extend the size of your training set - which will probably result in more accurate decisions.
This is a standard algorithm for learning "features" between sets of sentences/words, so you should at least use it as a guideline.
There is also an open source project that might help you: Apache Mahout.
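The Bag Of Words plus k-nearest-neighbors idea described above can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions: the training texts and labels are invented, single words are used instead of n-grams, and cosine similarity is chosen here as the distance measure (the answer does not name one).

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Count lower-cased, whitespace-separated words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count bags."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def predict_interest(training, tweets, k=3):
    """training: list of (text, interest_label); returns the majority label of the k nearest."""
    query = bag_of_words(tweets)
    ranked = sorted(
        training,
        key=lambda item: cosine(query, bag_of_words(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: users whose interests are already known.
training = [
    ("great goal in the football match tonight", "sports"),
    ("the team won the football cup", "sports"),
    ("new python release improves the compiler", "tech"),
    ("debugging python code all night", "tech"),
]
guess = predict_interest(training, "watching the football match", k=3)
```

Explicit feedback from users, as suggested above, would simply append new `(text, label)` pairs to `training`.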

Recommendation algorithm (and implementation) for finding similar items and users

I have a database of about 700k users along with items they have watched/listened to/read/bought/etc.
I would like to build a recommendation engine that recommends new items based on what users with similar taste in things have enjoyed, as well as actually finding people the user might want to be friends with on a social network I'm building (similar to last.fm).
My requirements are as follows:
The majority of the "users" in my database aren't actually users of my website. They have been data mined from third-party sources. However, when recommending users, I would like to limit the search to people who are members of my website (while still taking advantage of the bigger data set).
I need to take multiple items into consideration. Not "people who like this one item you enjoyed...", but "people who like most of the items you enjoyed...".
I need to compute similarities between users and show them when viewing their profiles (taste-o-meter).
Some items are rated, others are not. Ratings are from 1-10, not boolean values. In most cases it would be possible to deduce a rating value from other stats if it's not present (e.g. if the user has favourited an item, but hasn't rated it, I could just assume a rating of 9).
It has to interact with Python code in one way or another. Preferably, it should use a separate (possibly NoSQL) database and expose an API to use in my web back-end. The project I'm making uses Pyramid and SQLAlchemy.
I would like to take item genres into account.
I would like to display similar items on item pages based on both its genre (possibly tags) and what users who enjoyed the item liked (like Amazon's "people who bought this item" and Last.fm artist pages). Items from different genres should still be shown, but have a lower similarity value.
I would prefer a well-documented implementation of an algorithm with some examples.
Please don't give an answer like "use pysuggest or mahout", since those implement a plethora of algorithms and I'm looking for one that's most suitable for my data/use. I've been interested in Neo4j and how it all could be expressed as a graph of connections between users and items.

To determine similarity between users you can run cosine or pearson similarity (Found in Mahout and everywhere on the net really!) across the user vector. So your data representation should look something like
u1 [1,2,3,4,5,6]
u2 [35,24,3,4,5,6]
u3 [35,3,9,2,1,11]
For the point where you want to take multiple items into consideration, you can use the above to determine how similar two users' profiles are. The higher the correlation score, the higher the likelihood that they have very similar items. You can set a threshold so that someone with .75 similarity is considered to have a similar set of items in their profile.
Where you are missing values you can of course make up your own values. I'd just keep them binary and try to blend the various different algorithms. That's called an ensemble.
Overall you are looking for something called item based collaborative filtering as the recommendation aspect of your set up and also used to identify similar items. It's a standard recommendation algorithm that does pretty much everything you've asked for.
When trying to find similar users you can perform some type of similarity metric across your user vectors.
Regarding Python, the book Programming Collective Intelligence gives all its samples in Python, so go pick up a copy and read chapter 1.
Representing all of this as a graph will be somewhat problematic, as your underlying representation is a bipartite graph. There are lots of recommendation approaches out there that use a graph-based approach, but it's generally not the best-performing approach.
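The cosine and Pearson similarities mentioned at the start of this answer can be sketched as follows, using the three example vectors given there. The function names are illustrative, and the vectors are assumed to be equal-length rating lists over the same items.

```python
from math import sqrt


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def pearson_similarity(x, y):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y) if dev_x and dev_y else 0.0


u1 = [1, 2, 3, 4, 5, 6]
u2 = [35, 24, 3, 4, 5, 6]
u3 = [35, 3, 9, 2, 1, 11]

# Compare every pair of users; a threshold such as 0.75 could then be applied.
pairs = {
    ("u1", "u2"): pearson_similarity(u1, u2),
    ("u1", "u3"): pearson_similarity(u1, u3),
    ("u2", "u3"): pearson_similarity(u2, u3),
}
```

Pearson is often preferred when users rate on different personal scales, because subtracting each user's mean removes that offset.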

Actually that is one of the sweet spots of a graph database like Neo4j.
So if your data model looks like this:
user -[:LIKE|:BOUGHT]-> item
You can easily get recommendations for a user with a Cypher statement like this:
start user = node:users(id="doctorkohaku")
match user -[r:LIKE]->item<-[r2:LIKE]-other-[r3:LIKE]->rec_item
where r.stars > 2 and r2.stars > 2 and r3.stars > 2
return rec_item.name, count(*) as cnt, avg(r3.stars) as rating
order by rating desc, cnt desc limit 10
This can also be done using the Neo4j Core-API or the Traversal-API.
Neo4j has a Python API that is also able to run Cypher queries.
Disclaimer: I work for Neo4j
There are also some interesting articles by Marko Rodriguez about collaborative filtering.

I suggest having a look at my open source project Reco4j. It is a graph-based recommendation engine that can be used on a graph database like yours in a very straightforward way. We support Neo4j as the graph database. It is in an early version, but a more complete version will be available very soon. In the meantime we are looking for use cases for our project, so please contact me so that we can see how we can collaborate.

Intelligent web features, algorithms (people you may follow, similar to you ...)

I have 3 main questions about the algorithms in intelligent web (web 2.0)
Here is the book I'm reading: http://www.amazon.com/Algorithms-Intelligent-Web-Haralambos-Marmanis/dp/1933988665 and I want to learn the algorithms in more depth
1. People You may follow (Twitter)
How can one determine the nearest result to my requests ? Data mining? which algorithms?
2. How you’re connected feature (Linkedin)
Simply put, the algorithm works like this: it draws the path between two nodes, say between me and another person C: Me -> A, B -> A connections -> C. It is not a brute-force algorithm or any other graph-like algorithm :)
3. Similar to you (Twitter, Facebook)
This algorithm is similar to 1. Does it simply take the max(count) of friends in common (Facebook) or the max(count) of followers (Twitter)? Or do they implement some other algorithm? I think the second part is true, because running the loop
counts = {}
for person in contacts:
    counts[person] = count(common(person))
return max(counts, key=counts.get)
on every page refresh is a silly thing to do.
4. Did you mean (Google)
I know that they may implement it with a phonetic algorithm http://en.wikipedia.org/wiki/Phonetic_algorithm, simply Soundex http://en.wikipedia.org/wiki/Soundex, and here is Google VP of Engineering and CIO Douglas Merrill speaking about it: http://www.youtube.com/watch?v=syKY8CrHkck#t=22m03s
What about first 3 questions? Any ideas are welcome !
Thanks

People who you may follow
You can use the factors based calculations:
factorA = getFactorA(); // say double(0.3)
factorB = getFactorB(); // say double(0.6)
factorC = getFactorC(); // say double(0.8)
result = (factorA+factorB+factorC) / 3 // double(0.5666666666666667)
// if result is more than 0.5, you show this person
So, in the case of Twitter, "People who you may follow" can be based on the following factors (User A is the user viewing this "People who you may follow" feature; there may be more or fewer factors):
Similarity between frequent keywords found in User A's and User B's tweets
Similarity between the profile descriptions of both users
Closeness between the locations of User A and User B
Do the people User A is following also follow User B?
So where does the candidate list for "People who you may follow" come from? The list probably comes from a combination of people with a high number of followers (they are probably celebrities, alpha geeks, famous products/services, etc.) and the people followed by [people whom User A is following].
Basically there's a certain level of data mining to be done here, reading the tweets and bios, calculations. This can be done on a daily or weekly cron job when the server load is least for the day (or maybe done 24/7 on a separate server).
How are you connected
This is probably smart work designed to make you feel that loads of brute force went into determining the path. However, after some surface research, I find that it is simple:
Say you are User A; User B is your connection; and User C is a connection of User B.
In order for you to visit User C, you need to visit User B's profile first. By visiting User B's profile, the website already saves the info indicating that User A is at User B's profile. So when you visit User C from User B, the website immediately tells you that 'User A -> User B -> User C', ignoring all other possible paths.
This is the max level, as at User C, User A cannot go on to look at his connections until User C is User A's connection.
Source: observing LinkedIn
Similar to you
It's the exact same thing as #1 (People you may follow), except that the algorithm reads in a different list of people. The list of people that the algorithm reads in is the people whom you follow.
Did you mean
Well you got it right there, except that Google probably used more than just soundex. There's language translation, word replacement, and many other algorithms used for the case of Google. I can't comment much on this because it will probably get very complex and I am not an expert to handle languages.
If we research a little more into Google's infrastructure, we can find that Google has servers dedicated to Spelling and Translation services. You can get more information on Google platform at http://en.wikipedia.org/wiki/Google_platform.
Conclusion
The key to computationally intensive algorithms is caching. Once you cache the result, you don't have to recompute it on every page load. Google does it, Stack Overflow does it (on most of the pages with lists of questions) and, not surprisingly, Twitter does too!
Basically, algorithms are defined by developers. You may use others' algorithms, but ultimately, you can also create your own.

People you may follow
Could be one of many types of recommendation algorithms, maybe collaborative filtering?
How you are connected
This is just a shortest path algorithm on the social graph. Assuming there is no weight to the connections, it will simply use breadth-first.
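A breadth-first search for the shortest connection path on an unweighted social graph can be sketched as below. The adjacency-dictionary representation and the sample graph are assumptions for illustration; nothing here reflects how LinkedIn actually stores or traverses its data.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return the shortest list of people linking start to goal, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for contact in graph.get(person, ()):
            if contact in previous:
                continue
            previous[contact] = person
            if contact == goal:
                # Walk back through the predecessors to rebuild the path.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(contact)
    return None


# Hypothetical connection graph.
graph = {
    "Me": ["A", "B"],
    "A": ["Me", "C"],
    "B": ["Me"],
    "C": ["A"],
}
path = shortest_path(graph, "Me", "C")
```

Because breadth-first search explores people in order of distance, the first time the goal is reached the path found is one with the fewest hops.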
Similar to you
Simply a re-arrangement of the data set using the same algorithm as People you may follow.
Check out the book Programming Collective Intelligence for a good introduction to the type of algorithms that are used for People you may follow and Similar to you, it has great python code available too.

People You may follow
From Twitter blog - "suggestions are based on several factors, including people you follow and the people they follow" http://blog.twitter.com/2010/07/discovering-who-to-follow.html
So if you follow A and B and they both follow C, then Twitter will suggest C to you...
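The "you follow A and B, they both follow C" rule can be sketched by counting how many of your followees follow each candidate. The `follows` dictionary and the function name `suggest_follows` are hypothetical; Twitter's real system uses more factors, as the quoted blog post says.

```python
from collections import Counter


def suggest_follows(follows, user, limit=10):
    """Rank accounts followed by the people `user` follows, most shared first."""
    already = follows.get(user, set())
    counts = Counter()
    for followee in already:
        for candidate in follows.get(followee, set()):
            # Skip the user and anyone they already follow.
            if candidate != user and candidate not in already:
                counts[candidate] += 1
    return [name for name, _ in counts.most_common(limit)]


# Hypothetical follow graph: who follows whom.
follows = {
    "you": {"A", "B"},
    "A": {"C", "D"},
    "B": {"C", "you"},
}
suggestions = suggest_follows(follows, "you")
```

Here C is ranked first because both A and B follow C, while D is followed by only one of them.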
How you’re connected feature
I think you have answered this one.
Similar to you
As above and as you say, although the results are probably cached, so it's only done once per session or maybe even less frequently...
Hope that helps,
Chris

I don't use twitter; but with that in mind:
1). On the surface, this isn't that difficult: For each person I follow, see who they follow. Then for each of the people they follow, see who they follow, etc. The deeper you go, of course, the more number crunching it takes.
You can take this a bit further, if you can also efficiently extract the reverse: For those I follow, who also follows them?
For both ways, what's unsaid is a way to weight the tweeters to see if they're someone I'd really want to follow: a liberal follower may also follow a conservative tweeter, but that doesn't mean I'd want to follow the conservative (see #3).
2). Not sure, thinking about it...
3). Assuming the bio and tweets are the only thing to go on, the hard parts are:
Deciding what attributes should exist (political affiliation, topic types, etc.)
Cleaning each 140 characters to data-mine.
Once you have the right set of attributes, then two different algorithms come to mind:
K-means clustering, to decide which attributes I tend to discriminate on.
N-Nearest neighbor, to find the N most similar tweeters to you given the attributes I tend to give weight to.
EDIT: Actually, a decision tree is probably a FAR better way to do all of this...
This is all speculative, but it sounds fun if one were getting paid to do this.

Algorithms to find stuff a user would like based on other users likes

I'm thinking of writing an app to classify movies in an HTPC based on what the family members like.
I don't know statistics or AI, but the stuff here looks very juicy. I wouldn't know where to start.
Here's what I want to accomplish:
Compose a set of samples from each user's likes, rating each sample attribute separately. For example, maybe a user likes western movies a lot, so the western genre would carry a bit more weight for that user (and so on for other attributes, like actors, director, etc).
A user can get suggestions based on the likes of the other users. For example, if both user A and B like Spielberg (connection between the users), and user B loves Batman Begins, but user A loathes Katie Holmes, weigh the movie for user A accordingly (again, each attribute separately, for example, maybe user A doesn't like action movies so much, so bring the rating down a bit, and since Katie Holmes isn't the main star, don't take that into account as much as the other attributes).
Basically, comparing sets from user A similar to sets from user B, and come up with a rating for user A.
I have a crude idea about how to implement this, but I'm certain some bright minds have already thought of a far better solution already, so... any suggestions?
Actually, after a quick research, it seems a Bayesian filter would work. If so, would this be the better approach? Would it be as simple as just "normalizing" movie data, training a classifier for each user, and then just classify each movie?
If your suggestion includes some brain-melting concepts (I'm not experienced in these subjects, especially in AI), I'd appreciate it if you also included a list of some basics for me to research before diving into the meaty stuff.
Thanks!

Matthew Podwysocki had some interesting articles on this stuff
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/03/30/functional-programming-and-collective-intelligence.aspx
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/04/01/functional-programming-and-collective-intelligence-ii.aspx
http://weblogs.asp.net/podwysocki/archive/2009/04/07/functional-programming-and-collective-intelligence-iii.aspx

This is similar to this question where the OP wanted to build a recommendation system. In a nutshell, we are given a set of training data consisting of users' ratings of movies (1-5 star rating for example) and a set of attributes for each movie (year, genre, actors, ..). We want to build a recommender so that it will output a possible rating for unseen movies. So the input data looks like:
user  movie  year  genre  ... | rating
---------------------------------------------
1     1      2006  action     | 5
3     2      2008  drama      | 3.5
...
and for an unrated movie X:
10    20     2009  drama      | ?
we want to predict a rating. Doing this for all unseen movies then sorting by predicted movie rating and outputting the top 10 gives you a recommendation system.
The simplest approach is to use a k-nearest neighbor algorithm. Among the rated movies, search for the "closest" ones to movie X, and combine their ratings to produce a prediction.
This approach has the advantage of being very simple and easy to implement from scratch.
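A minimal from-scratch sketch of the nearest-neighbor prediction described above is shown here. The distance function (a genre mismatch penalty plus the year gap in decades), the field names, and the sample ratings are all assumptions; a real system would need a distance suited to its own attributes.

```python
def distance(a, b):
    """Hypothetical movie distance: genre mismatch plus year gap in decades."""
    genre_penalty = 0.0 if a["genre"] == b["genre"] else 1.0
    year_gap = abs(a["year"] - b["year"]) / 10.0
    return genre_penalty + year_gap


def predict_rating(rated, target, k=2):
    """Average the ratings of the k rated movies closest to the target."""
    if not rated:
        raise ValueError("need at least one rated movie")
    nearest = sorted(rated, key=lambda movie: distance(movie, target))[:k]
    return sum(movie["rating"] for movie in nearest) / len(nearest)


# Hypothetical movies already rated by one user.
rated = [
    {"year": 2006, "genre": "action", "rating": 5.0},
    {"year": 2008, "genre": "drama", "rating": 3.5},
    {"year": 2009, "genre": "drama", "rating": 4.0},
    {"year": 1995, "genre": "comedy", "rating": 2.0},
]
movie_x = {"year": 2009, "genre": "drama"}
prediction = predict_rating(rated, movie_x, k=2)
```

Running this for every unseen movie and sorting by the predicted rating gives the top-10 list mentioned above.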
Other more sophisticated approaches exist. For example you can build a decision tree, fitting a set of rules on the training data. You can also use Bayesian networks, artificial neural networks, support vector machines, among many others... Going through each of these won't be easy for someone without the proper background.
Still, I expect you would be using an external tool/library. Now you seem to be familiar with Bayesian networks, so a simple naive Bayes net could in fact be very powerful. One advantage is that it allows for prediction under missing data.
The main idea would be somewhat the same; take the input data you have, train a model, then use it to predict the class of new instances.
If you want to play around with different algorithms in a simple, intuitive package which requires no programming, I suggest you take a look at Weka (my 1st choice), Orange, or RapidMiner. The most difficult part would be to prepare the dataset in the required format. The rest is as easy as choosing an algorithm and applying it (all in a few clicks!)
I guess for someone not looking to go into too much detail, I would recommend going with the nearest neighbor method as it is intuitive and easy to implement. Still, the option of using Weka (or one of the other tools) is worth looking into.

There are a few algorithms that are good for this:
ARTMAP: groups via probability against each other (this isn't fast but it's the best thing for your problem IMO)
ARTMAP holds a group of common attributes and determines the likelihood of similarity via percentages.
ARTMAP
KMeans: This separates out the vectors by the distance that they are from each other
KMeans: Wikipedia
PCA: will separate the average of all the values from the varying bits. This is what you would use to do face detection and background subtraction in Computer Vision.
PCA

The K-nearest neighbor algorithm may be right up your alley.

Check out some of the work of the top teams for the Netflix Prize.
