Recommender system using Pig or Mahout - Hadoop

I am building a recommendation system on Hadoop in a simple way. Can you give me an opinion on what to use to build it?
I would like to use Apache Pig or Apache Mahout.
My data set contains:
book_id,name,publisher
user_id,username
book_id,user_id,rating
My data is in CSV format.
Can you please suggest which technology to use to produce item-based and user-based recommendations?

Apache Mahout will provide you with an off-the-shelf recommendation engine based on collaborative filtering algorithms.
With Pig you would have to implement those algorithms yourself in Pig Latin, which may be a rather complex task.
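If you go the Mahout route, here is a minimal sketch of both flavours using the (pre-1.0) Taste API. The file name ratings.csv, the neighborhood size of 10, and user ID 42 are assumptions, and your ratings CSV would need its columns reordered to user_id,book_id,rating, which is the layout FileDataModel expects:

```java
import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class BookRecommenderSketch {
  public static void main(String[] args) throws Exception {
    // ratings.csv lines: user_id,book_id,rating (hypothetical file name)
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // User-based: recommend what the 10 most similar users rated highly
    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
    Recommender userBased = new GenericUserBasedRecommender(
        model, new NearestNUserNeighborhood(10, userSimilarity, model), userSimilarity);

    // Item-based: recommend books similar to the ones this user already rated
    Recommender itemBased = new GenericItemBasedRecommender(
        model, new LogLikelihoodSimilarity(model));

    for (RecommendedItem rec : userBased.recommend(42L, 5)) {   // top 5 books for user 42
      System.out.println("user-based : " + rec.getItemID() + " : " + rec.getValue());
    }
    for (RecommendedItem rec : itemBased.recommend(42L, 5)) {
      System.out.println("item-based : " + rec.getItemID() + " : " + rec.getValue());
    }
  }
}
```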

I know it's not one of your preferred methods, but another product you can use on Hadoop to create a recommendation engine is Oryx.
Oryx was created by Sean Owen (co-author of the book Mahout in Action and a major contributor to the Mahout code base). It only has three algorithms at the moment (Alternating Least Squares, k-means clustering, and random decision forests), but the ALS algorithm provides a fairly easy-to-use collaborative filtering engine, sitting on top of the Hadoop infrastructure.
From the brief description of your dataset, it sounds like it would work perfectly. It has a model generation engine (the computational layer), and it can generate a new model based on one of 3 criteria:
1) Age (time between model generations)
2) Number of records added
3) Amount of data added
Once a generation of the model has been built, another Java daemon (the serving layer) serves out the recommendations (user to item, item to item, blind recommendations, etc.) via a RESTful API. When a new generation of the model is created, it will automatically pick that generation up and serve it out.
There are some nice features in the model generation as well, such as aging historic data, which can help get around issues like seasonality (probably not a big deal if you're talking about books, though).
The computational layer (model generation) uses HDFS to store/lookup data, and uses MapReduce or YARN for job control. The serving layer is a daemon that can run on each data node, and it accesses the HDFS filesystem for the computed model data to present out over the API.
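To give a feel for how the serving layer is consumed, here is a hedged sketch of calling its REST API from Java. The host, port, and the /recommend/{userID} path shown here are assumptions, so check your deployment's API documentation for the actual endpoints and response format:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class OryxClientSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical serving-layer endpoint: host, port, and path depend on your Oryx config
    URL url = new URL("http://oryx-serving-host:8080/recommend/42");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        // Each line is expected to hold an item ID and a score for user 42
        System.out.println(line);
      }
    }
  }
}
```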

Related

If I am using SIMILARITY_LOGLIKELIHOOD (LLR) are item ratings really ignored?

I used the MovieLens data file (ml-100k.zip) u.data unchanged, so it had the columns: userID, movieID, and user rating.
I used LLR:
hadoop jar C:\hdp\mahout-0.9.0.2.1.3.0-1981\core\target\mahout-core-0.9.0.2.1.3.0-1981-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_LOGLIKELIHOOD --input u.data --output udata_output
When I look at the udata_output file I see recommended movie ID's followed by recommendation scores like:
1226:5.0
and
896:4.798878
The recommendation scores seemed to vary from 5.0 to 4.x
However, when I deleted the user rating column from the u.data file and re-ran the same command line above I received results like:
615:1.0
where ALL recommendation scores were 1.0.
2 questions:
1) If LLR ignores the user ratings, and the only input I change is whether or not the user rating is provided, why do the recommendation scores change?
2) Overall, I am trying to determine recommendation ranking, which is why I'm using LLR. Should I therefore ignore the recommendation scores and only focus on the order of the items recommended (e.g., the first item is ranked higher than the second)?
Thanks in advance.
Answers
LLR does not use the rating strengths. The theory is that if a user actually interacted with an item, that is all the indication needed. LLR correlates that interaction with other users' interactions and scores it using a probabilistic calculation called the log-likelihood ratio. It does produce scores, but they are computed only from the counts of interactions.
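To make "counts only" concrete, here is a minimal sketch of the 2x2 contingency-table form of the log-likelihood ratio (the G-test statistic described in the Surprise and Coincidence post linked below). It is not Mahout's exact implementation, and the counts in main are made up:

```java
public class LlrSketch {

  // k11: users who interacted with both items
  // k12: users who interacted with item A only
  // k21: users who interacted with item B only
  // k22: users who interacted with neither item
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    long n = k11 + k12 + k21 + k22;
    long[] observed = {k11, k12, k21, k22};
    double[] expected = {
        (double) (k11 + k12) * (k11 + k21) / n,
        (double) (k11 + k12) * (k12 + k22) / n,
        (double) (k21 + k22) * (k11 + k21) / n,
        (double) (k21 + k22) * (k12 + k22) / n
    };
    double sum = 0.0;
    for (int i = 0; i < 4; i++) {
      if (observed[i] > 0) {
        sum += observed[i] * Math.log(observed[i] / expected[i]);  // G = 2 * sum O * ln(O/E)
      }
    }
    return 2.0 * sum;
  }

  public static void main(String[] args) {
    // Made-up counts: 20 of 1000 users interacted with both items
    System.out.println(logLikelihoodRatio(20, 80, 100, 800));
  }
}
```

Note that no rating values appear anywhere; only how many users fall into each cell of the table matters.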
This could be a bug, or it could be because you are using a boolean recommender in one case and a non-boolean one in the other. It could be that the recommender is trying to provide ratings by taking the preference values into account. But none of this really matters if you are trying to optimize ranking.
You really never need to look at the recommendation weights unless you are trying to predict ratings, which seldom happens these days. Trust the ranking of recs.
BTW Mahout now has a completely new generation of recommender based on using a search engine to serve recommendations and Mahout to calculate the model. It has many benefits over the older Hadoop version, including:
Multimodal: it can ingest many different user actions on many different item sets. This allows you to use much of the user's clickstream to recommend.
Realtime results: it has a very fast, scalable server in Solr or Elasticsearch.
Due to the realtime nature it can recommend to new users or users with very recent history. The older Hadoop Mahout recommenders only recommend to users and items in the training data--they cannot react to history that was not used in training. The new recommender can use data gathered in real time, even for new users.
The new Multimodal Recommender in Mahout 1.0-snapshot or greater is described here:
Mahout site
A free ebook, which talks about the general idea: Practical Machine Learning
A slide deck, which talks about mixing actions or other indicators: Creating a Unified Multimodal Recommender
Two blog posts: What's New in Recommenders: part #1 and What's New in Recommenders: part #2
A post describing the log-likelihood ratio: Surprise and Coincidence. LLR is used to reduce noise in the data while keeping the calculations at O(n) complexity.
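As a rough sketch of the search-engine serving idea (not the official Mahout tooling; the collection name, the indicators field, and the item IDs below are assumptions): each item is indexed with a field holding the co-occurring items that the model computation found significant, and you query that field with a user's recent history, so items whose indicators overlap the history rank highest:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchRecommenderSketch {
  public static void main(String[] args) throws Exception {
    // Assumed Solr collection "items", each doc carrying an "indicators" field
    // produced offline from the co-occurrence model.
    SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/items").build();

    // The query is simply the user's recent interaction history.
    SolrQuery query = new SolrQuery("indicators:(book123 book456 book789)");
    query.setRows(10);

    QueryResponse response = solr.query(query);
    for (SolrDocument doc : response.getResults()) {
      System.out.println(doc.getFieldValue("id"));  // recommended item IDs, best first
    }
    solr.close();
  }
}
```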

Cluster analysis in a Hadoop, MapReduce environment

We are currently trying to create some very basic personas based on our user database (a few million profiles). The goal at this stage is to find out what the characteristics of our users are, for example what they look like and what they are looking for, and to create several "typical" user profiles.
I believe the best way to achieve this would be to run a cluster analysis in order to find similarities among users.
The big roadblock however is how to get there. We are tracking our data in a Hadoop environment and I am being told that this could be potentially achieved with our tools.
I have familiarised myself with the theory of the topic and know that it can be done for example in SPSS (quite hard to use and limited to samples of large data sets).
The big question: Is it possible to perform one or several types of cluster analysis in a Hadoop environment and then visualise the results as in SPSS? It is my understanding that we would need to run several types of analysis in order to find the best way to cluster the data, also when it comes to the distance measures between the clusters.
I have not found any information on the internet with regards to this, so I wonder if this is possible at all, without a major programming effort (meaning literally implementing for example all the standard tools available in SPSS: Dendrograms, the different result tables and cluster graphs etc.).
Any input would be much appreciated. Thanks.

Where does map-reduce/hadoop come in in machine learning training?

MapReduce/Hadoop is great for gathering insights from piles of data from various sources and organizing them the way we want.
But when it comes to training, my impression is that we have to dump all the training data into the algorithm (be it SVM, logistic regression, or random forest) all at once so that the algorithm can come up with a model that covers it all. Can MapReduce/Hadoop help in the training part? If yes, how, in general?
Yes. There are many MapReduce implementations such as Hadoop Streaming, and even some easy tools like Pig, which can be used for learning. In addition, there are distributed learning toolkits built on top of Map/Reduce, such as Vowpal Wabbit (https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial). The big idea behind this kind of method is to train on small portions of the data (the splits stored in HDFS), then average the sub-models and communicate between the nodes, so the final model is assembled directly from sub-models built on parts of the data.
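As a rough illustration of that averaging idea (plain Java, not tied to Hadoop Streaming or Vowpal Wabbit): each "node" runs a few epochs of logistic-regression gradient descent on its own data split, and the driver then averages the resulting weight vectors. The toy data, learning rate, and epoch count are made up.

```java
import java.util.Arrays;
import java.util.List;

public class ModelAveragingSketch {

  // Gradient-descent training of a tiny logistic regression on one data split.
  // Each row is {x1, x2, label}; returns the learned weights {w1, w2, bias}.
  static double[] trainOnSplit(double[][] split, double learningRate, int epochs) {
    double[] w = new double[3];
    for (int e = 0; e < epochs; e++) {
      for (double[] row : split) {
        double z = w[0] * row[0] + w[1] * row[1] + w[2];
        double p = 1.0 / (1.0 + Math.exp(-z));   // sigmoid prediction
        double err = row[2] - p;                  // label minus prediction
        w[0] += learningRate * err * row[0];
        w[1] += learningRate * err * row[1];
        w[2] += learningRate * err;               // bias term
      }
    }
    return w;
  }

  public static void main(String[] args) {
    // Two made-up "HDFS splits"; in a real job each mapper would see one split.
    List<double[][]> splits = Arrays.asList(
        new double[][] {{0.1, 1.2, 0}, {0.3, 0.9, 0}, {2.1, 0.2, 1}},
        new double[][] {{1.9, 0.1, 1}, {0.2, 1.5, 0}, {2.4, 0.3, 1}});

    // "Map" phase: train one sub-model per split, accumulating the weights.
    double[] averaged = new double[3];
    for (double[][] split : splits) {
      double[] w = trainOnSplit(split, 0.1, 50);
      for (int i = 0; i < w.length; i++) {
        averaged[i] += w[i];
      }
    }
    // "Reduce" phase: average the sub-models into one final model.
    for (int i = 0; i < averaged.length; i++) {
      averaged[i] /= splits.size();
    }
    System.out.println("Averaged weights: " + Arrays.toString(averaged));
  }
}
```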

MapReduce project with data mining

I am planning to do a MapReduce project involving the Hadoop libraries and testing it on big data uploaded to AWS. I have not finalized an idea yet, but I am sure it will involve some kind of data processing, MapReduce design patterns, and possibly graph algorithms, Hive, and Pig Latin. I would really appreciate it if someone could give me some ideas; I have a few of my own in mind.
In the end I have to work on some large data set, extract some information, and derive some conclusions. For this I have used Weka before for data mining (using trees).
But I am not sure if that is the only thing I can work with right now (using Weka). Are there any other ways by which I can work on large data and derive conclusions from a large data set?
Also, how can I involve graphs in this?
Basically I want to make a research project, but I am not sure what exactly I should be working on and what it should look like. Any thoughts? Suggested links/ideas? Knowledge sharing?
I suggest you check Apache Mahout; it is a scalable machine learning and data mining framework that integrates nicely with Hadoop.
Hive gives you a SQL-like language to query big data; essentially, it translates your high-level query into MapReduce jobs and runs them on the cluster.
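To make that concrete, here is a hedged sketch of running a HiveQL aggregation from Java over JDBC (Hive compiles the query into MapReduce jobs under the hood). The HiveServer2 address and the page_views table are assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Assumed HiveServer2 host/port and an existing table called page_views
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = conn.createStatement();
         // Hive turns this aggregation into MapReduce jobs on the cluster
         ResultSet rs = stmt.executeQuery(
             "SELECT user_id, COUNT(*) AS views FROM page_views GROUP BY user_id")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
```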
Another suggestion is to consider doing your data processing algorithm using R; it is statistical software (similar to MATLAB), and instead of the standard R environment I would recommend using R Revolution, which is an environment for developing in R but with much more powerful tools for big data and clustering.
Edit: If you are a student, R Revolution has a free academic edition.
Edit: A third suggestion, is to look at GridGain which is another Map/Reduce implementation in Java that is relatively easy to run on a cluster.
As you are already working with MapReduce and Hadoop, you can extract some knowledge from your data using Mahout, or you can get some ideas from this very good book:
http://infolab.stanford.edu/~ullman/mmds.html
This book provides ideas for mining social-network graphs, and it works with graphs in a couple of other ways too.
Hope it helps!

Practical usage of Hadoop, MapReduce, Hive, Pig, HBase

Hello,
I am learning Hadoop, and after reading the material found on the net (tutorials, MapReduce concepts, Hive, Pig, and so on) and developing some small applications with them, I would like to learn the real-world usages of these technologies.
What everyday software that we use is based on the Hadoop stack?
If you use the internet, there is a good chance that you are indirectly impacted by Hadoop/MapReduce, from Google Search to Facebook to LinkedIn, etc. Here are some interesting links to find out how widespread Hadoop/MR usage is:
Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)
10 ways big data changes everything
One thing to note is that Hadoop/MR is not an efficient solution for every problem; also consider other distributed programming models, such as those based on BSP (Bulk Synchronous Parallel).
Happy Hadooping !!!
Here are some sample MapReduce examples that will be helpful for beginners:
1. Word Count (a sketch follows below)
2. SQL aggregation using MapReduce
3. SQL aggregation on multiple fields using MapReduce
URL - http://hadoopdeveloperguide.blogspot.in/
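For the first item on that list, here is a hedged sketch of the classic word-count job against the org.apache.hadoop.mapreduce API; the input and output paths come from the two command-line arguments:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();              // add up the counts for this word
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```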
