Best way to implement MapReduce - Hadoop

We have a problem that is an ideal case for applying the MapReduce programming technique. The initial code for this is written in Python. Now we have the following options:
1. Use Hadoop and Java to implement the MapReduce part.
2. Use mincemeat and Python to implement the MapReduce part.
3. Use Hadoop and Python (a Hadoop MapReduce program in Python) to implement the MapReduce part.
I'm not sure which would be the best option. Can anyone please help?

Since your initial code is in Python, and it doesn't make much of a difference whether you write the MR part in Python or Java, option (3) is the best one to pursue for your scenario. You might also like to explore libraries such as https://github.com/Yelp/mrjob, which make it easier to write MR jobs in Python.
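To give a sense of how compact such a job can be, here is a minimal word-count sketch using mrjob (the file name and input paths below are illustrative, not from your code):

    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        # Classic word count: one mapper step and one reducer step.

        def mapper(self, _, line):
            # Called once per input line; emit a (word, 1) pair for each word.
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # counts iterates over all the 1s emitted for this word.
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()

You can test it locally (e.g. python word_count.py input.txt) and then run the same script against a cluster with mrjob's -r hadoop runner, without changing the job code.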

Related

Other paradigms beyond MapReduce

MapReduce is the most popular paradigm for processing files in a distributed system. However, the YARN architecture makes it possible for programmers to build applications using other paradigms.
Of course, some paradigms are better suited than others for certain applications.
For example:
I'm using MapReduce to process a video file but this _________ paradigm is better.
I'm using MapReduce to process sql query files but this ________ paradigm is better.
I want to know the best way to choose the most efficient paradigm for a given problem.
I hope someone can understand the question.

Are any real-time Pig use cases available?

Please provide some real-time Pig use cases; banking and healthcare examples would be of great help. I'm also curious whether Pig can be used as an ETL tool in the Hadoop world.
Pig is typically a batch-processing tool, so I'm not sure what you are referring to when you ask for "real-time Pig use cases".
ETL - basically anything that can extract, transform, and load data can be used for ETL, and Pig can do that. We're using it in batch workflows for ETL.
You can find a few POCs for understanding the usage of Pig at the link below:
http://ybhavesh.blogspot.in/
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
I can recommend the book "Pig Design Patterns" by Pradeep Pasupuleti for some useful examples (with source code included).

How to convert existing MapReduce applications to Crunch?

I have several (about a dozen) MapReduce tasks implemented, each of which functions as part of a workflow executed by a simple bash script. For a variety of reasons, I would like to move the workflow to Apache Crunch.
However, it's not clear to me how to run my MapReduce tasks as Crunch functions without re-implementing them. Is there a straightforward way to use Map and Reduce implementations as Crunch functions? I would like to maintain the Tool implementations as well so the MapReduce tasks can be run both standalone and as part of the Crunch workflow; is there any way to do this?
Thanks for any insight.
For anyone who might stumble across this, there is a minimally documented API in the Crunch libs; however, it is fairly straightforward to use.
See here: https://crunch.apache.org/apidocs/0.10.0/org/apache/crunch/lib/Mapreduce.html

Apriori and association rules with Hadoop

Is it doable to create an Apriori app using MapReduce? I am just starting out, and it's not clear to me how to create the next candidate sets based on a previous run. Does anyone have any experience with this?
It could be useful to have a look at Apache Mahout. It is a machine-learning and data-mining framework in Java that abstracts away writing MapReduce jobs for clustering, recommendation, and classification tasks.
It seems the Apriori algorithm is not implemented (there is a JIRA issue marked as won't-fix: https://issues.apache.org/jira/browse/MAHOUT-108), but maybe another algorithm could be useful for you.
Even if you only need the Apriori algorithm, it could be useful to have a look at their source code to get some ideas.
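For what it's worth, here is a minimal, framework-free Python sketch (purely illustrative; the function names and the tiny in-memory driver are assumptions, not Mahout or Hadoop code) of how one Apriori pass maps onto a map/reduce step, with the frequent itemsets from pass k-1 used to generate the candidates counted in pass k:

    from collections import Counter, defaultdict
    from itertools import combinations

    def generate_candidates(frequent_prev, k):
        # Join frequent (k-1)-itemsets into candidate k-itemsets, pruning any
        # candidate that has an infrequent (k-1)-subset.
        items = sorted({item for itemset in frequent_prev for item in itemset})
        return [c for c in combinations(items, k)
                if all(sub in frequent_prev for sub in combinations(c, k - 1))]

    def map_phase(transaction, candidates):
        # Mapper: emit (candidate, 1) for every candidate found in this transaction.
        basket = set(transaction)
        for cand in candidates:
            if basket.issuperset(cand):
                yield cand, 1

    def reduce_phase(candidate, counts, min_support):
        # Reducer: keep candidates whose summed count meets the support threshold.
        total = sum(counts)
        if total >= min_support:
            yield candidate, total

    transactions = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c", "a"]]
    min_support = 2

    # Pass 1: frequent single items.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {(item,) for item, n in item_counts.items() if n >= min_support}

    k = 2
    while frequent:
        print(k - 1, sorted(frequent))
        candidates = generate_candidates(frequent, k)
        grouped = defaultdict(list)
        for t in transactions:                        # map phase, one record at a time
            for cand, one in map_phase(t, candidates):
                grouped[cand].append(one)             # shuffle: group values by key
        frequent = set()
        for cand, ones in grouped.items():            # reduce phase
            for itemset, total in reduce_phase(cand, ones, min_support):
                frequent.add(itemset)
        k += 1

On a real cluster, each pass would be a separate MapReduce job, with the previous pass's frequent itemsets shipped to every mapper (for example via the distributed cache) so that candidate generation and counting can happen locally.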

Where do I start with distributed computing?

I'm interested in learning techniques for distributed computing. As a Java developer, I'm probably willing to start with Hadoop. Could you please recommend some books/tutorials/articles to begin with?
Maybe you can read some papers related to MapReduce and distributed computing first, to gain a better understanding of the field. Here are some I would like to recommend:
MapReduce: Simplified Data Processing on Large Clusters, http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean_html/
Bigtable: A Distributed Storage System for Structured Data, http://www.usenix.org/events/osdi06/tech/chang/chang_html/
Dryad: Distributed data-parallel programs from sequential building blocks, http://pdos.csail.mit.edu/6.824-2007/papers/isard-dryad.pdf
The landscape of parallel computing research: A view from Berkeley, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.8705&rep=rep1&type=pdf
On the other hand, if you want to get to know Hadoop better, you could start by reading the Hadoop MapReduce framework source code.
Book-wise, I would currently check out Hadoop: The Definitive Guide. It's written by Tom White, who has worked on Hadoop for a good while now and works at Cloudera with Doug Cutting (Hadoop's creator).
Also, on the free side, Jimmy Lin from UMD has written a book called Data-Intensive Text Processing with MapReduce; the final pre-production version is linked from the author's website.
Hadoop is not necessarily the best tool for all distributed computing problems. Despite its power, it also has a pretty steep learning curve and cost of ownership.
You might want to clarify your requirements and look for suitable alternatives in the Java world, such as HTCondor, JPPF or GridGain (my apologies to those I do not mention).
Here are some resources from Yahoo! Developer Network
a tutorial:
http://developer.yahoo.com/hadoop/tutorial/
an introductory course (requires Silverlight, sigh):
http://yahoo.hosted.panopto.com/CourseCast/Viewer/Default.aspx?id=281cbf37-eed1-4715-b158-0474520014e6
The All Things Hadoop Podcast http://allthingshadoop.com/podcast has some good content and good guests. A lot of it is geared to getting started with Distributed Computing.
MIT 6.824 is the best material. Reading only the Google papers related to Hadoop is not enough; working through a systematic course is required if you want to go deeper.
If you are looking to learn a distributed computing platform that is less complicated than Hadoop, you can try Zillabyte. You only need to know some Ruby or Python to build apps on the platform.
As LoLo said, Hadoop is a powerful solution, but can be rough to start with.
For materials to learn about distributed computing, try http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-824-distributed-computer-systems-engineering-spring-2006/syllabus/. There are several resources recommended by the course as well.
