A good example in Hadoop that needs iteration

I am currently implementing a parallel-for on Hadoop that iterates the mapper a number of times specified by the user. Can someone suggest a useful example I can use to test my implementation, i.e. some application in Hadoop that needs the Mapper function to be iterated?
Thank you

The simplest one is implementing the Apriori algorithm, which is used to find frequent itemsets.

What exactly do you mean by "iteration of the mapper"? I have an example of starting a job recursively (each run takes the previous job's output as its input).
Have a look here; it explains a simple graph mindist-search / graph exploration algorithm: http://codingwiththomas.blogspot.com/2011/04/graph-exploration-with-hadoop-mapreduce.html
A slightly more generic version is here:
http://codingwiththomas.blogspot.com/2011/04/controlling-hadoop-job-recursion.html
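In case it helps with testing, here is a rough driver-loop sketch (my own, not taken from those posts) of what such recursive/iterative job submission usually looks like: the same map/reduce pass is submitted in a loop for a user-specified number of iterations, and each pass reads the previous pass's output. IterativeMapper and IterativeReducer are placeholder class names for whatever mapper and reducer you plug in.

    // Hypothetical driver: re-runs one map/reduce pass 'iterations' times,
    // feeding each pass the output directory of the previous one.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IterativeDriver {
        public static void main(String[] args) throws Exception {
            int iterations = Integer.parseInt(args[2]);  // user-specified loop count
            Path input = new Path(args[0]);

            for (int i = 0; i < iterations; i++) {
                Configuration conf = new Configuration();
                Job job = Job.getInstance(conf, "iteration-" + i);
                job.setJarByClass(IterativeDriver.class);
                job.setMapperClass(IterativeMapper.class);    // placeholder mapper
                job.setReducerClass(IterativeReducer.class);  // placeholder reducer
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(Text.class);

                Path output = new Path(args[1] + "/iter-" + i);
                FileInputFormat.addInputPath(job, input);
                FileOutputFormat.setOutputPath(job, output);

                if (!job.waitForCompletion(true)) {
                    System.exit(1);            // stop if a pass fails
                }
                input = output;                // next pass consumes this pass's output
            }
        }
    }

A convergence check (e.g. a job counter inspected after waitForCompletion) can replace the fixed iteration count if the algorithm should stop early.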

There are plenty of examples in data mining. You could try one of the clustering algorithms, for example.

Related

Implementing a MapReduce skeleton in Erlang

I am fairly new to both parallel programming and the Erlang language and I'm struggling a bit.
I'm having a hard time implementing a mapreduce skeleton. I spawn M mappers (their task is to map the power function over a list of floats) and R reducers (they sum the elements of the input list sent by the mappers).
What I then want to do is send the intermediate results of each mapper to a random reducer. How do I go about linking one mapper to a reducer?
I have looked around the internet for examples. The closest thing to what I want that I could find is this word-counter example; the author seems to have found a clever way to link a mapper to a reducer, and the logic makes sense, but I have not been able to tweak it to fit my particular needs. Maybe the key-value implementation is not suitable for finding the sum of a list of powers?
Any help, please?
Just to give an update: apparently there were problems with the algorithm linked in the OP. It looks like there is something wrong with the synchronization protocol, which is hinted at by the presence of the call to the sleep() function (i.e. it's not supposed to be there).
For a good working implementation of the map/reduce framework, please refer to Joe Armstrong's version in the Programming Erlang book (2nd ed).
Armstrong's version only uses one reducer, but it can easily be modified to use more reducers in order to eliminate the bottleneck.
I have also added a function to split the input list into chunks. Each mapper will get a chunk of data.
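For what it's worth, the chunking step itself is language-agnostic; here is a small sketch of the idea in Java (just to illustrate the splitting, since the original code is Erlang): the input list is divided into M consecutive chunks whose sizes differ by at most one, one chunk per mapper.

    // Illustration only: split 'input' into 'parts' consecutive chunks,
    // so that each of the M mappers receives one roughly equal-sized chunk.
    import java.util.ArrayList;
    import java.util.List;

    public class Chunker {
        static <T> List<List<T>> chunk(List<T> input, int parts) {
            List<List<T>> chunks = new ArrayList<>();
            int n = input.size();
            for (int i = 0; i < parts; i++) {
                int from = (int) ((long) n * i / parts);       // inclusive start
                int to   = (int) ((long) n * (i + 1) / parts); // exclusive end
                chunks.add(new ArrayList<>(input.subList(from, to)));
            }
            return chunks;
        }
    }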

Depth First Search using Map Reduce

I have successfully implemented the Shortest Path algorithm (Breadth First Search) in Hadoop MapReduce. However, I have a question:
Is it possible to do the "Depth First Search" graph traversal using Hadoop MapReduce?
Any links?
The nature of depth-first search makes it a poor fit for MapReduce jobs: you follow one strict path to its end before backtracking to explore another one, so you can't properly exploit the scalability Hadoop provides. I'm not aware of a working implementation, and I'm pretty sure you won't find one that uses the MapReduce paradigm in a good way.
If you want to implement graph algorithms on Hadoop yourself, you might want to have a look at some useful frameworks like Apache Giraph, xrime or Pegasus. xrime also contains a shortest path implementation, which might be interesting for you.
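To make the contrast concrete, here is a rough sketch (mine, not from the linked projects) of the map step in the usual breadth-first / shortest-path pattern: every node in the current frontier can be expanded independently, which is exactly what parallelizes well and what a strict depth-first order forbids. Node is a hypothetical Writable holding a distance and an adjacency list, not a real Hadoop class.

    // Sketch of a BFS-style map step: each node offers distance+1 to its
    // neighbours; a reducer would keep the minimum distance per node.
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class BfsMapper extends Mapper<Text, Node, Text, Node> {
        @Override
        protected void map(Text nodeId, Node node, Context ctx)
                throws IOException, InterruptedException {
            // Re-emit the node itself so its adjacency list survives the pass.
            ctx.write(nodeId, node);
            // Offer distance + 1 to every neighbour (hypothetical Node methods).
            for (String neighbour : node.getNeighbours()) {
                ctx.write(new Text(neighbour), Node.distanceOnly(node.getDistance() + 1));
            }
        }
    }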

Is there any Hadoop implementation of the Louvain method?

This is the Louvain method for finding communities in a social graph.
https://sites.google.com/site/findcommunities/
I want to run it on a big graph.
If you are not stuck on Hadoop, I saw this implementation for Apache Spark.
https://github.com/Sotera/spark-distributed-louvain-modularity
I don't know of an implementation of this clustering method, which looks to be based on modularity. The main source of clustering algorithms in the Hadoop ecosystem is in Mahout.
Take a look here: https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Perhaps one of the clustering algorithms listed would work or provide the basis for your own implementation.

Sort and shuffle optimization in Hadoop MapReduce

I'm looking for a research/implementation-based project on Hadoop, and I came across the list posted on the wiki page: http://wiki.apache.org/hadoop/ProjectSuggestions. But this page was last updated in September 2009, so I'm not sure whether some of these ideas have already been implemented. I was particularly interested in "Sort and Shuffle optimization in the MR framework", which talks about "combining the results of several maps on rack or node before the shuffle. This can reduce seek work and intermediate storage".
Has anyone tried this before? Is this implemented in the current version of Hadoop?
There is the combiner functionality (as described under the "Combine" section of http://wiki.apache.org/hadoop/HadoopMapReduce), which is more or less an in-memory shuffle. But I believe the combiner only aggregates key-value pairs for a single map task, not all the pairs for a given node or rack.
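For reference, here is a minimal sketch of how a combiner is wired in, in the standard word-count style (TokenizerMapper and IntSumReducer stand for the usual word-count mapper and summing reducer, not classes defined here): the reducer class is reused as a per-map-task pre-aggregator, so duplicate keys emitted by one map task are collapsed locally before the shuffle.

    // Word-count job with a combiner: setCombinerClass makes Hadoop run the
    // summing reducer over each map task's output before data is shuffled.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountWithCombiner {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
            job.setJarByClass(WordCountWithCombiner.class);
            job.setMapperClass(TokenizerMapper.class);   // word-count mapper (placeholder name)
            job.setCombinerClass(IntSumReducer.class);   // pre-aggregates each map task's output
            job.setReducerClass(IntSumReducer.class);    // final aggregation across all map tasks
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }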
The project description is aimed at "optimization".
This feature is already present in the current Hadoop MapReduce, and it can probably run in a lot less time.
Sounds like a valuable enhancement to me.
I think it is a very challenging task. As I understand it, the idea is to build a computation tree instead of a "flat" map-reduce. A good example of this is Google's Dremel engine (now called BigQuery). I would suggest reading this paper: http://sergey.melnix.com/pub/melnik_VLDB10.pdf
If you are interested in this kind of architecture, you can also take a look at the open-source clone of this technology, Open Dremel.
http://code.google.com/p/dremel/

What type of problems can mapreduce solve?

Is there a theoretical analysis available which describes what kind of problems mapreduce can solve?
In Map-Reduce for Machine Learning on Multicore, Chu et al. describe how "algorithms that fit the Statistical Query model can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers." They implement ten algorithms, including weighted linear regression, k-Means, Naive Bayes, and SVM, using a map-reduce framework.
The Apache Mahout project has released a recent Hadoop (Java) implementation of some methods based on the ideas from this paper.
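As a rough illustration of the "summation form" idea (my paraphrase, not a quote from the paper): any statistic that decomposes into a sum over data points can be computed by having each mapper sum over its partition and letting a reducer add the partial sums. Least-squares regression via the normal equations is the standard example:

    % Each mapper computes the partial sums over its partition p; the reducer
    % adds them and solves the small linear system.
    \[
      A \;=\; \sum_{i=1}^{n} x_i x_i^{\top}
        \;=\; \sum_{p} \Big( \sum_{i \in p} x_i x_i^{\top} \Big),
      \qquad
      b \;=\; \sum_{i=1}^{n} x_i y_i
        \;=\; \sum_{p} \Big( \sum_{i \in p} x_i y_i \Big),
      \qquad
      \theta^{*} \;=\; A^{-1} b .
    \]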
For problems that require processing and generating large data sets. Say, running an interest-generation query over all the accounts a bank holds, or processing audit data for all the transactions that happened in the bank in the past year. The best-known use case is from Google: generating the search index for the Google search engine.
Many problems that are "Embarrassingly Parallel" (great phrase!) can use MapReduce. http://en.wikipedia.org/wiki/Embarrassingly_parallel
From this article:
http://www.businessweek.com/magazine/content/07_52/b4064048925836.htm
Doug Cutting, founder of Hadoop (an open-source implementation of MapReduce), says:
“Facebook uses Hadoop to analyze user behavior and the effectiveness of ads on the site”
and: “the tech team at The New York Times rented computing power on Amazon’s cloud and used Hadoop to convert 11 million archived articles, dating back to 1851, to digital and searchable documents. They turned around in a single day a job that otherwise would have taken months.”
Anything that involves doing operations on a large set of data, where the problem can be broken down into smaller independent sub-problems whose results can then be aggregated to produce the answer to the larger problem.
A trivial example would be calculating the sum of a huge set of numbers. You split the set into smaller sets, calculate the sums of those smaller sets in parallel (which can involve splitting those into yet even smaller sets), then sum those results to reach the final answer.
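A toy version of that pattern, not Hadoop but just to make the split / partial-sum / combine idea concrete (here the JDK's parallel streams do the splitting and combining for you):

    // Sum 1..10^9 by splitting the range into chunks, summing the chunks in
    // parallel, and combining the partial sums into the final answer.
    import java.util.stream.LongStream;

    public class ParallelSum {
        public static void main(String[] args) {
            long total = LongStream.rangeClosed(1, 1_000_000_000L)
                                   .parallel()   // split into independent chunks
                                   .sum();       // combine the partial sums
            System.out.println(total);           // 500000000500000000
        }
    }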
The answer really lies in the name of the algorithm. MapReduce is not a general-purpose parallel programming or batch execution framework, as some of the answers suggest. MapReduce is really useful when you have large data sets that need to be processed to derive certain attributes (the Map phase), and the results then need to be summarized over those derived attributes (the Reduce phase).
You can also watch the videos from Google; I'm watching them myself and find them very educational.
Sort of a hello world introduction to MapReduce
http://blog.diskodev.com/parallel-processing-using-the-map-reduce-prog
This question was asked before its time. Since 2009 there has actually been a theoretical analysis of MapReduce computations. This 2010 paper by Howard Karloff et al. formalizes MapReduce as a complexity class in the same way theoreticians study P and NP. They prove some relationships between MapReduce and a class called NC (which can be thought of either as shared-memory parallel machines or as a certain class of restricted circuits). But the main piece of work is their formal definitions.
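Roughly, and from memory (so double-check against the paper), the resource bounds in their MRC model are:

    % For input size n and a fixed epsilon > 0 (approximate summary):
    \begin{align*}
      \text{memory per machine} &= O(n^{1-\varepsilon}), \\
      \text{number of machines} &= O(n^{1-\varepsilon}), \\
      \text{rounds allowed in } \mathcal{MRC}^{i} &= O(\log^{i} n).
    \end{align*}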
