Apriori and association rules with Hadoop

Is it feasible to implement the Apriori algorithm with MapReduce? I'm just starting out, and it isn't clear how to generate the next candidate sets from the results of a previous run. Does anyone have experience with this?

It could be useful to have a look at Apache Mahout. It is a machine learning and data mining framework in Java that abstracts away submitting MapReduce jobs for clustering, recommendation, and classification tasks.
It seems the Apriori algorithm is not implemented (there is one JIRA issue marked as Won't Fix: https://issues.apache.org/jira/browse/MAHOUT-108), but perhaps another algorithm could be useful for you.
Even if you only need the Apriori algorithm, it could be worthwhile to look at their source code to get some ideas.
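As for deriving the next candidate sets from a previous pass: in iterative MapReduce implementations of Apriori, this is typically done in the driver between jobs, by joining the frequent k-itemsets with themselves and pruning any candidate that has an infrequent k-subset. A minimal sketch of that join-and-prune step (toy data, plain Python rather than Hadoop code):

```python
from itertools import combinations

def generate_candidates(frequent_itemsets, k):
    """Classic Apriori join + prune: build candidate (k+1)-itemsets
    from the frequent k-itemsets found in the previous pass."""
    frequent = set(frequent_itemsets)
    candidates = set()
    for a in frequent:
        for b in frequent:
            union = tuple(sorted(set(a) | set(b)))
            if len(union) == k + 1:
                # prune: every k-subset of the candidate must itself be frequent
                if all(sub in frequent for sub in combinations(union, k)):
                    candidates.add(union)
    return candidates

# frequent 2-itemsets from a hypothetical previous MapReduce pass
L2 = [("beer", "chips"), ("beer", "diapers"), ("chips", "diapers")]
C3 = generate_candidates(L2, 2)
print(C3)  # {('beer', 'chips', 'diapers')}
```

Each MapReduce pass would then count the support of these candidates over the transactions and emit the frequent ones for the next round.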


Any new developments happening around SVM (Support Vector Machines) in Mahout (machine learning with Hadoop)?

I read that an SVM implementation was recently added to Mahout, and I am planning to use SVM. Has anyone tried it yet? Very little information is available on the internet.
Any help or guidance is appreciated.
No, there is no SVM implementation in Mahout.
There are three JIRA issues about it: MAHOUT-14 and MAHOUT-334 were closed as Won't Fix. MAHOUT-232 was assigned later because some code was contributed early on (2009), but it did not work, so it was not incorporated into Mahout. Mahout has changed since then, so porting the code would be difficult, and if you look at the issue there is some disagreement about whether it approaches the problem in the best way.
There is some code for a CascadeSVM implementation, but the training part, which is the hard part, was never published.
There is a parallel SVM implementation that runs on MPI rather than Hadoop.
This San Francisco Meetup abstract has some discussion of current state of the art and issues for parallel SVM.

Can I use user-based recommendation on hadoop?

I'm reading 'Mahout in Action'. From the book, I understand that I can set up item-based recommendations, so I want to know whether anybody has set up a user-based recommendation.
Also, as far as I know, FileDataModel supports update files, but that is on a single PC. What about on Hadoop?
There is no user-based recommender algorithm available on Hadoop, though it would be possible to write one. In general, I would steer you toward item-item similarity-based approaches at that scale. And no, there is no notion of update files on Hadoop, as it isn't needed there.
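To illustrate why item-item approaches are attractive at scale: the core computation is pairwise similarity between item columns of the user-item matrix, which parallelizes naturally. A toy sketch using cosine similarity (the user names and ratings here are invented for illustration; Mahout's distributed jobs do the equivalent over data in HDFS):

```python
import math

# toy user -> item rating vectors
ratings = {
    "alice": {"item1": 5.0, "item2": 3.0},
    "bob":   {"item1": 4.0, "item2": 2.0, "item3": 5.0},
    "carol": {"item2": 4.0, "item3": 4.0},
}

def item_vector(item):
    """Column of the user-item matrix for one item."""
    return {u: prefs[item] for u, prefs in ratings.items() if item in prefs}

def cosine(a, b):
    """Cosine similarity between two sparse vectors keyed by user."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

sim = cosine(item_vector("item1"), item_vector("item2"))
```

The item-item similarity matrix changes slowly, so it can be recomputed periodically in batch, which fits Hadoop's model far better than per-user neighborhoods do.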

Sentiment Analysis of given text

This topic has many threads, but I am posting another one. Every post describes a possible way to do sentiment analysis, yet I have not found a concrete path.
I want to implement sentiment analysis, so I would like to be shown a way to do it. During my research, I found one commonly used approach: a Bayesian algorithm counts positive and negative words and, using a bag-of-words model, computes the probability of a sentence being positive or negative.
That only covers individual words; I suppose we need to do some language processing too. So, does anyone with more knowledge have algorithms, with links for reference, that I could implement? Anything in particular that might help my analysis is welcome.
Also, can you recommend a language to work with? Some say Java is comparatively time consuming, so they don't recommend working in Java.
Any type of help is much appreciated.
First of all, sentiment analysis is done on various levels, such as document, sentence, phrase, and feature level. Which one are you working on? There are many different approaches to each of them. You can find a very good intro to this topic here. For machine-learning approaches, the most important element is feature engineering and it's not limited to bag of words. You can find many other useful features in different applications from the tutorial I linked. What language processing you need to do depends on what features you want to use. You may need POS-tagging if POS information is needed for your features for example.
For classifiers, you can try Support Vector Machines, Maximum Entropy, and Naive Bayes (probably as a baseline) and these are frequently used in the literature, about which you can also find a pretty comprehensive list in the link. The Mallet toolkit contains ME and NB, and if you use SVMlight, you can easily convert the feature formats to the Mallet format with a function. Of course there are many other implementations of these classifiers.
For rule-based methods, Pointwise Mutual Information is frequently used, along with various scoring-based methods.
Hope this helps.
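To make the Naive Bayes baseline mentioned above concrete, here is a minimal bag-of-words sketch with add-one smoothing. The training sentences and labels are invented for illustration; a real system would train a toolkit implementation (e.g. Mallet or NLTK) on a proper labeled corpus:

```python
import math
from collections import Counter

# tiny made-up training set: (tokens, label)
train = [
    ("good great fun".split(), "pos"),
    ("loved it great".split(), "pos"),
    ("bad boring awful".split(), "neg"),
    ("hated it bad".split(), "neg"),
]

vocab = {w for doc, _ in train for w in doc}
counts = {"pos": Counter(), "neg": Counter()}
doc_counts = Counter()
for doc, label in train:
    counts[label].update(doc)
    doc_counts[label] += 1

def log_prob(doc, label):
    """log P(label) + sum over words of log P(word | label), smoothed."""
    total = sum(counts[label].values())
    lp = math.log(doc_counts[label] / len(train))
    for w in doc:
        lp += math.log((counts[label][w] + 1) / (total + len(vocab)))
    return lp

def classify(text):
    doc = text.split()
    return max(("pos", "neg"), key=lambda lab: log_prob(doc, lab))

print(classify("great fun"))   # pos
print(classify("boring bad"))  # neg
```

This is the word-level model only; as noted above, features beyond bag of words (POS tags, negation handling, etc.) are where most of the accuracy gains come from.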
For text analysis there is no language stronger than SNOBOL. In SNOBOL4, for example, a Fortran interpreter takes only 60 lines.
NLTK offers really good algorithms for sentiment analysis. It is open source, so you can look at the source code and check out the algorithms used. You can even download the NLTK book, which is free and has some good material on sentiment analysis.
Coming to your second point, I don't think Java is that slow. I have been coding in C++ for years but lately also started with Java; a lot of very popular open-source software, such as Lucene, Solr, Hadoop, and Neo4j, is written in Java.

What are some good resources for studying Hadoop's source code?

Are there any good resources that would help me study Hadoop's source code? I'm particularly looking for university courses or research papers.
Studying Hadoop or MapReduce can be a daunting task if you get your hands dirty right at the start. I followed this schedule:
Start with very basics of MR with
code.google.com/edu/parallel/dsd-tutorial.html
code.google.com/edu/parallel/mapreduce-tutorial.html
Then go for the first two lectures in
www.cs.washington.edu/education/courses/cse490h/08au/lectures.htm
A very good course intro to MapReduce and Hadoop.
Read the seminal paper
http://research.google.com/archive/mapreduce.html and its improvements in the updated version
http://www.cs.washington.edu/education/courses/cse490h/08au/readings/communications200801-dl.pdf
Then go for all the other videos in the U.Washington link given above.
Search YouTube for the terms 'MapReduce' and 'Hadoop' to find videos by O'Reilly and the Google RoundTable for a good overview of the future of Hadoop and MapReduce.
Then off to the most important videos -
Cloudera Videos
www.cloudera.com/resources/?media=Video
and
Google MiniLecture Series
code.google.com/edu/submissions/mapreduce-minilecture/listing.html
Along with all the multimedia above, we need good written material. Documents:
Architecture diagrams at hadooper.blogspot.com are good to have on your wall
Hadoop: The Definitive Guide goes more into the nuts and bolts of the whole system, whereas
Hadoop in Action is a good read with lots of teaching examples for learning the concepts of Hadoop.
Pro Hadoop is not for beginners
pdfs of the documentation from Apache Foundation
hadoop.apache.org/common/docs/current/ and
hadoop.apache.org/common/docs/stable/
will help you learn how to model your problem as a MapReduce solution in order to gain the full advantages of Hadoop.
The HDFS paper by Yahoo! Research is also a good read for gaining in-depth knowledge of Hadoop.
Subscribe to the user mailing lists of Common, MapReduce, and HDFS to keep up with problems, solutions, and future plans.
Try the http://developer.yahoo.com/hadoop/tutorial/module1.html link for a beginner-to-expert path through Hadoop.
For Any Queries ... Contact Apache, Google, Bing, Yahoo!
Your question seems overly broad. To find a resource to use while looking at source code, you should narrow the focus of what you want to study; that will make it easier for you (and anyone on SO) to find papers and topics covering it.
I've dug into the Hadoop source a few times, normally with a very specific class I needed to learn about. In those cases an external resource wasn't really needed: since I had the class name, I just googled for it and found resources.
If I were to start trying to understand the Hadoop source at a higher level, I'd get the source code and my copy of Hadoop: The Definitive Guide and use it as a reference to understand the higher-level connections in the source code.
I won't claim that this would be a perfect solution. H:TDG is at a more technical level than the other hadoop books I have and I find it to be very informative.
H:TDG is what I'd start with and as I found areas I wanted to dig into more, I would start searching for those specifically.

Where do I start with distributed computing?

I'm interested in learning techniques for distributed computing. As a Java developer, I'm inclined to start with Hadoop. Could you please recommend some books/tutorials/articles to begin with?
Maybe you can read some papers related to MapReduce and distributed computing first, to gain a better understanding of the field. Here are some I would like to recommend:
MapReduce: Simplified Data Processing on Large Clusters, http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean_html/
Bigtable: A Distributed Storage System for Structured Data, http://www.usenix.org/events/osdi06/tech/chang/chang_html/
Dryad: Distributed data-parallel programs from sequential building blocks, http://pdos.csail.mit.edu/6.824-2007/papers/isard-dryad.pdf
The Landscape of Parallel Computing Research: A View from Berkeley, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.8705&rep=rep1&type=pdf
On the other hand, if you want to understand Hadoop itself better, you could start reading the Hadoop MapReduce framework source code.
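To get a feel for the model those papers describe before diving into the framework source, it can help to trace the canonical word-count example through the map, shuffle, and reduce phases. This toy Python simulation (not Hadoop code) mirrors what the framework does for you:

```python
from collections import defaultdict

def map_phase(doc):
    """Mapper: emit a (word, 1) pair for every word in the input split."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """The framework's shuffle/sort: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(values) for word, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["the"], counts["fox"])  # 3 2
```

In real Hadoop, the mapper and reducer run on different machines and the shuffle moves data over the network, but the programming model is exactly this.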
Currently, book-wise, I would check out Hadoop: The Definitive Guide. It's written by Tom White, who has worked on Hadoop for a good while now and works at Cloudera with Doug Cutting (Hadoop's creator).
Also on the free side, Jimmy Lin from UMD has written a book called Data-Intensive Text Processing with MapReduce. Here's a link to the final pre-production version (link provided by the author on his website).
Hadoop is not necessarily the best tool for all distributed computing problems. Despite its power, it also has a pretty steep learning curve and cost of ownership.
You might want to clarify your requirements and look for suitable alternatives in the Java world, such as HTCondor, JPPF or GridGain (my apologies to those I do not mention).
Here are some resources from Yahoo! Developer Network
a tutorial:
http://developer.yahoo.com/hadoop/tutorial/
an introductory course (requires Silverlight, sigh):
http://yahoo.hosted.panopto.com/CourseCast/Viewer/Default.aspx?id=281cbf37-eed1-4715-b158-0474520014e6
The All Things Hadoop Podcast http://allthingshadoop.com/podcast has some good content and good guests. A lot of it is geared toward getting started with distributed computing.
MIT 6.824 is the best material. Reading only the Google papers related to Hadoop is not enough; systematic coursework is required if you want to go deeper.
If you are looking to learn a distributed computing platform that is less complicated than Hadoop you can try Zillabyte. You only need to know some Ruby or Python to build apps on the platform.
As LoLo said, Hadoop is a powerful solution, but can be rough to start with.
For materials to learn about distributed computing try http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-824-distributed-computer-systems-engineering-spring-2006/syllabus/. There are several resources recommended by the course as well.

Resources