Sort and shuffle optimization in Hadoop MapReduce

I'm looking for a research/implementation-based project on Hadoop, and I came across the list posted on the wiki page - http://wiki.apache.org/hadoop/ProjectSuggestions. But this page was last updated in September 2009, so I'm not sure whether some of these ideas have already been implemented. I was particularly interested in "Sort and Shuffle optimization in the MR framework", which talks about "combining the results of several maps on rack or node before the shuffle. This can reduce seek work and intermediate storage".
Has anyone tried this before? Is this implemented in the current version of Hadoop?

There is the combiner functionality (as described under the "Combine" section of http://wiki.apache.org/hadoop/HadoopMapReduce), which is more or less an in-memory shuffle. But I believe the combiner only aggregates the key-value pairs emitted by a single map task, not all the pairs produced on a given node or rack.
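For reference, here is a minimal word-count sketch showing where the combiner plugs in; the class and variable names are illustrative. The point is that setCombinerClass only pre-aggregates the output of each individual map task, before the shuffle:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String tok : value.toString().split("\\s+")) {
                    if (tok.isEmpty()) continue;
                    word.set(tok);
                    ctx.write(word, ONE);
                }
            }
        }
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : vals) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            // The combiner runs on each map task's local output, before the shuffle;
            // it does NOT see pairs from other tasks on the same node or rack.
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }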

The project description is aimed at "optimization": this feature is already present in current Hadoop MapReduce, but it could probably be made to run in a lot less time.
Sounds like a valuable enhancement to me.

I think it is a very challenging task. In my understanding, the idea is to build a computation tree instead of a "flat" map-reduce. A good example of this is Google's Dremel engine (now called BigQuery). I would suggest reading this paper: http://sergey.melnix.com/pub/melnik_VLDB10.pdf
If you are interested in this kind of architecture, you can also take a look at the open-source clone of this technology, Open Dremel:
http://code.google.com/p/dremel/

Related

Hadoop for password cracking

Hi, I came across this article, and it made me wonder: how easy would it be for a hacker to crack passwords this way? What do you think?
If you want to try out several permutations in a brute-force manner, I don't think using Hadoop would give you any benefit. Hadoop is not something that fits all use cases, and it does not perform well every time.
Computing permutations can be done in batch: just set different start and end parameters for each machine. The overhead involved in setting up a job, moving data across nodes, and cleaning up the job can surely be saved. I have seen that running separate processes over 5 nodes, by pre-dividing the load equally, performed pretty well compared to map-reduce. Of course, I don't mean that map-reduce is bad; it's just that this scenario wasn't the right fit for the job.
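A minimal sketch of that pre-division idea, assuming a numeric keyspace; tryKey() is a hypothetical stand-in for the actual password check, and each machine would be launched with a different worker index:

    // Split a brute-force keyspace into contiguous [start, end) ranges,
    // one per machine, instead of paying MapReduce job-setup overhead.
    public class RangeWorker {
        // Hypothetical check of one candidate against the hash being cracked.
        static boolean tryKey(long candidate) {
            return false;
        }

        public static void main(String[] args) {
            long keyspace = Long.parseLong(args[0]);   // total number of candidates
            int workers   = Integer.parseInt(args[1]); // total number of machines
            int me        = Integer.parseInt(args[2]); // this machine's index, 0-based

            long chunk = (keyspace + workers - 1) / workers;
            long start = me * chunk;
            long end   = Math.min(start + chunk, keyspace);

            // Each machine scans only its own slice of the keyspace.
            for (long c = start; c < end; c++) {
                if (tryKey(c)) {
                    System.out.println("found: " + c);
                    return;
                }
            }
        }
    }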
I found this post, Recursive Algorithm on Distributed Systems, an interesting way to run recursive algorithms on a distributed system. Permutation and combination algorithms can then be used to do some interesting stuff.

Mapreduce for dummies

Ok, I am attempting to learn Hadoop and MapReduce. I really want to start with MapReduce, and what I find are many, many simplified examples of mappers and reducers, etc. However, I seem to be missing something.
While an example showing how many occurrences of a word are in a document is simple to understand, it does not really help me solve any "real world" problems. Does anybody know of a good tutorial on implementing MapReduce in a pseudo-realistic situation? Say, for instance, I want to use Hadoop and MapReduce on top of a data store similar to AdventureWorks. Now I want to get orders for a given product in the month of May. How would that look from a Hadoop/MapReduce perspective? (I realize this may not be the type of problem MapReduce is intended to solve, but it just came to mind quickly.)
Any direction would help.
The book Hadoop: The Definitive Guide is a good place to start. The introductory chapters should be really useful to you to figure out where MapReduce is useful and when you should use it. The more advanced chapters have plenty of more realistic examples than word count.
If you want to dive deeper, you may want to check out Data-Intensive Text Processing with MapReduce. This definitely has plenty of "real-world" use cases, but it doesn't sound like you are interested in doing text processing.
For your particular example, the main things to realize are:
The map phase is mostly for parsing, transforming data, and filtering out data. Think record-by-record, shared-nothing approach to record processing. In word count, this is parsing the line and splitting out the words.
The reduce phase is all about aggregation: counting, averaging, min/max, etc. In word count, this is counting up the instances of the word.
So, if you want all the records for a given product in the month of May, you could use a map-only job to filter through all the data and keep only the records you want; see the sketch below. However, you really should read about what Hadoop is useful for. A question that would fit Hadoop better would be: give me a count of how many times every item was purchased in every month (to build a matrix, perhaps). Very rarely are you looking for specific records like you suggest.
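Here is a sketch of that map-only filter. It assumes one CSV order record per line, shaped like "orderId,productId,yyyy-MM-dd,qty"; the column layout and class names are hypothetical, not from AdventureWorks:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MayOrdersFilter {
        public static class FilterMapper extends Mapper<Object, Text, Text, NullWritable> {
            public void map(Object key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                // Assumed layout: orderId,productId,yyyy-MM-dd,qty
                String[] f = line.toString().split(",");
                String productId = f[1];
                String month = f[2].split("-")[1];
                if (productId.equals(ctx.getConfiguration().get("target.product"))
                        && month.equals("05")) {
                    // Keep the matching record unchanged; drop everything else.
                    ctx.write(line, NullWritable.get());
                }
            }
        }
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("target.product", args[2]);
            Job job = Job.getInstance(conf, "may orders filter");
            job.setJarByClass(MayOrdersFilter.class);
            job.setMapperClass(FilterMapper.class);
            job.setNumReduceTasks(0); // map-only: no shuffle, no reduce phase
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }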
If you are looking for a more real-time access platform, you should check out HBase once you are done learning about Hadoop.
Hadoop can be used for a wide variety of problems. Check this blog entry from atbrox. Also, there is a lot of information on the internet about Hadoop and MapReduce and it's easy to get lost. So, here is the consolidated list of resources on Hadoop.
BTW, Hadoop: The Definitive Guide, 3rd edition, is due in May. It looks like it also covers MRv2 (NextGen MapReduce) and includes more case studies. The 2nd edition is worth reading, as mentioned by orangeoctopus.
MapReduce can be a complex topic, so I found it easier to understand by applying its approach to a simple problem and then describing how MapReduce makes it straightforward to solve the same problem on a cluster. You can take a look at my article here: Intro to Parallel Processing with MapReduce.
Let me know if you think this article makes it easier to understand MapReduce and Hadoop.

A good example in hadoop that needs iteration

I am currently implementing a parallel-for on Hadoop that iterates the mapper a number of times, as specified by the user. Can someone help me with a useful example that I can use to test my implementation, i.e., an application in Hadoop that needs iteration of the mapper function?
Thank you
The simplest one is implementing the Apriori algorithm, which is used to find frequent itemsets.
What exactly do you mean by "iteration of the mapper"? I have an example of starting a job recursively (on the output of the previous job).
Have a look here; it explains a simple graph mindist-search / graph exploration algorithm: http://codingwiththomas.blogspot.com/2011/04/graph-exploration-with-hadoop-mapreduce.html
A slightly more generic version is here:
http://codingwiththomas.blogspot.com/2011/04/controlling-hadoop-job-recursion.html
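The driver loop for that kind of recursion usually has the shape sketched below: it chains jobs, feeding each iteration the previous iteration's output. The identity Mapper/Reducer classes are stand-ins for your real per-iteration logic, and the "updated" counter is a hypothetical convergence signal your tasks would increment:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IterativeDriver {
        public static void main(String[] args) throws Exception {
            Path input = new Path(args[0]);
            int maxIterations = Integer.parseInt(args[2]);

            for (int i = 0; i < maxIterations; i++) {
                Path output = new Path(args[1] + "/depth_" + i);
                Job job = Job.getInstance(new Configuration(), "iteration " + i);
                job.setJarByClass(IterativeDriver.class);
                // Identity map/reduce so the sketch compiles; your real
                // per-iteration logic would go in these two classes.
                job.setMapperClass(Mapper.class);
                job.setReducerClass(Reducer.class);
                FileInputFormat.addInputPath(job, input);
                FileOutputFormat.setOutputPath(job, output);
                job.waitForCompletion(true);

                // Hypothetical convergence check: real tasks would increment this
                // counter whenever they still change data. With the identity
                // classes above it stays 0, so this loop stops after one pass.
                long updated = job.getCounters()
                        .findCounter("iteration", "updated").getValue();
                if (updated == 0) break;

                input = output; // the next job consumes this job's output
            }
        }
    }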
There are plenty of examples in data mining. You could try one of the clustering algorithms, for example.

Are there other algorithms like LSM tree?

There are many strategies for disk space (and memory) management in databases.
I try to keep track of the best ones, like the log-structured merge tree as used in BigTable (and HBase, Hypertable, Cassandra) or the fractal tree used in TokuDB. As you can guess from what I have mentioned, I mean algorithms that use resources wisely (for example, avoiding I/O) and scale well.
Are there other algorithms like the LSM tree? Just point me to them.
Google recently released LevelDB (you can search for it in Google). People say it is the memtable/SSTable implementation from Google's BigTable. Having read some of the source code, I think it is a simplified version of it. Hope it can give some help.
And nessDB, which uses a simple LSM-tree: https://github.com/shuttler/nessDB
H2 Database's MVStore uses log-structured storage, which is slightly similar to an LSM-tree.
The fragmented LSM-tree, implemented in PebblesDB.
WiscKey, implemented in this contest project
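To make the memtable/SSTable idea behind all of these concrete, here is a toy sketch (my illustration, not code from any of the engines above): writes go into a sorted in-memory map, which is flushed to an immutable sorted file when it grows too large; reads check the memtable first, then the files from newest to oldest.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Toy LSM-tree: a sorted memtable flushed to immutable sorted files
    // ("SSTables"). No write-ahead log, bloom filters, or compaction -
    // real engines like LevelDB add all of those.
    public class ToyLsm {
        private final NavigableMap<String, String> memtable = new TreeMap<>();
        private final List<Path> sstables = new ArrayList<>(); // newest last
        private final int flushThreshold = 4;

        public void put(String key, String value) throws IOException {
            memtable.put(key, value); // writes are in-memory until a flush
            if (memtable.size() >= flushThreshold) flush();
        }

        public String get(String key) throws IOException {
            String v = memtable.get(key);
            if (v != null) return v;
            // The newest SSTable wins, so scan from the end of the list.
            // (Real SSTables are indexed and binary-searched, not scanned.)
            for (int i = sstables.size() - 1; i >= 0; i--) {
                for (String line : Files.readAllLines(sstables.get(i))) {
                    String[] kv = line.split("\t", 2);
                    if (kv[0].equals(key)) return kv[1];
                }
            }
            return null;
        }

        private void flush() throws IOException {
            Path file = Files.createTempFile("sstable_", ".txt");
            try (BufferedWriter w = Files.newBufferedWriter(file)) {
                // TreeMap iterates in key order, so the file comes out sorted.
                for (Map.Entry<String, String> e : memtable.entrySet()) {
                    w.write(e.getKey() + "\t" + e.getValue());
                    w.newLine();
                }
            }
            sstables.add(file);
            memtable.clear(); // the file is now the immutable copy of this data
        }

        public static void main(String[] args) throws IOException {
            ToyLsm db = new ToyLsm();
            for (int i = 0; i < 10; i++) db.put("k" + i, "v" + i);
            System.out.println(db.get("k3")); // v3
        }
    }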

What type of problems can mapreduce solve?

Is there a theoretical analysis available which describes what kind of problems mapreduce can solve?
In Map-Reduce for Machine Learning on Multicore, Chu et al. describe "algorithms that fit the Statistical Query model can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers." They specifically implement 10 algorithms, including weighted linear regression, k-means, Naive Bayes, and SVM, using a map-reduce framework.
The Apache Mahout project has released a Hadoop (Java) implementation of some methods based on the ideas from this paper.
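To illustrate the "summation form" idea (my sketch, not code from the paper or Mahout; the data is made up): for 1-D linear regression y = w * x, the statistics sum(x*x) and sum(x*y) decompose into per-partition partial sums, so each partition can be processed independently (the "map") and the partials simply added (the "reduce"):

    import java.util.Arrays;

    public class SummationForm {
        // Partial statistics for fitting y = w * x.
        static class Partial {
            final double xx, xy; // sum of x*x and sum of x*y
            Partial(double xx, double xy) { this.xx = xx; this.xy = xy; }
            Partial add(Partial o) { return new Partial(xx + o.xx, xy + o.xy); }
        }

        // "Map": one partition's independent contribution to the global sums.
        static Partial mapPartition(double[][] partition) {
            double xx = 0, xy = 0;
            for (double[] p : partition) { xx += p[0] * p[0]; xy += p[0] * p[1]; }
            return new Partial(xx, xy);
        }

        public static void main(String[] args) {
            double[][][] partitions = {
                {{1, 2.1}, {2, 3.9}}, {{3, 6.2}}, {{4, 7.8}, {5, 10.1}}
            };
            // "Reduce": add up the per-partition partials, then solve for w.
            Partial total = Arrays.stream(partitions).parallel()
                    .map(SummationForm::mapPartition)
                    .reduce(new Partial(0, 0), Partial::add);
            System.out.println("w = " + total.xy / total.xx); // roughly 2.0
        }
    }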
For problems requiring processing and generating large data sets. Say, running an interest-generation query over all accounts a bank holds, or processing audit data for all the transactions that happened in the past year in a bank. The best-known use case is from Google: generating the search index for the Google search engine.
Many problems that are "Embarrassingly Parallel" (great phrase!) can use MapReduce. http://en.wikipedia.org/wiki/Embarrassingly_parallel
From this article:
http://www.businessweek.com/magazine/content/07_52/b4064048925836.htm
Doug Cutting, founder of Hadoop (an open-source implementation of MapReduce), says...
“Facebook uses Hadoop to analyze user behavior and the effectiveness of ads on the site"
and... “the tech team at The New York Times rented computing power on Amazon’s cloud and used Hadoop to convert 11 million archived articles, dating back to 1851, to digital and searchable documents. They turned around in a single day a job that otherwise would have taken months.”
Anything that involves doing operations on a large set of data, where the problem can be broken down into smaller independent sub-problems whose results can then be aggregated to produce the answer to the larger problem.
A trivial example would be calculating the sum of a huge set of numbers. You split the set into smaller sets, calculate the sums of those smaller sets in parallel (which can involve splitting those into yet even smaller sets), then sum those results to reach the final answer.
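Here is a sketch of that recursive splitting in a single JVM, with Java's fork/join pool standing in for the cluster; the threshold and the data are arbitrary:

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;
    import java.util.stream.LongStream;

    // Divide-and-conquer sum: split the array until chunks are small, sum the
    // chunks in parallel ("map"), then add the partial sums ("reduce").
    public class ParallelSum extends RecursiveTask<Long> {
        private static final int THRESHOLD = 1_000;
        private final long[] nums;
        private final int lo, hi;

        ParallelSum(long[] nums, int lo, int hi) {
            this.nums = nums; this.lo = lo; this.hi = hi;
        }

        @Override
        protected Long compute() {
            if (hi - lo <= THRESHOLD) {          // small enough: sum directly
                long s = 0;
                for (int i = lo; i < hi; i++) s += nums[i];
                return s;
            }
            int mid = (lo + hi) / 2;             // otherwise split in two
            ParallelSum left = new ParallelSum(nums, lo, mid);
            ParallelSum right = new ParallelSum(nums, mid, hi);
            left.fork();                          // run the left half in parallel
            return right.compute() + left.join(); // combine the partial sums
        }

        public static void main(String[] args) {
            long[] nums = LongStream.rangeClosed(1, 1_000_000).toArray();
            long sum = ForkJoinPool.commonPool()
                    .invoke(new ParallelSum(nums, 0, nums.length));
            System.out.println(sum); // 500000500000
        }
    }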
The answer really lies in the name of the algorithm. MapReduce is not a general-purpose parallel programming or batch execution framework, as some of the answers suggest. MapReduce is really useful when large data sets need to be processed to derive certain attributes (the map phase), and the results then need to be summarized over those derived attributes (the reduce phase).
You can also watch the MapReduce lecture videos at Google; I'm watching them myself and I find them very educational.
Sort of a hello world introduction to MapReduce
http://blog.diskodev.com/parallel-processing-using-the-map-reduce-prog
This question was asked before its time. Since 2009 there has actually been a theoretical analysis of MapReduce computations. This 2010 paper by Howard Karloff et al. formalizes MapReduce as a complexity class in the same way that theoreticians study P and NP. They prove some relationships between MapReduce and a class called NC (which can be thought of either as shared-memory parallel machines or as a certain class of restricted circuits). But the main piece of work is their formal definitions.
