Hadoop MapReduce Sorting levels - sorting

How many levels of sorting happen in a MapReduce program? Is the sorting part of shuffling? I am totally confused by this.
Can you please help with a flow diagram for the MR workflow including all the steps? Thanks a lot, guys.

A good article with a diagram is A Beginner's Guide to Hadoop; there is also a good blog post on Cloudera.
Sort and Merge use SortComparator and grouping uses GroupingComparator.
Secondary sort is the best example for understanding the intricacies of sorting in Hadoop.
UPDATE:
At a very high level, Hadoop MR can be seen as a distributed sort-merge. While a Mapper is working it produces sorted spills for each Reducer (ordered using the SortComparator); if a Mapper has produced several spills for a Reducer, it can sort-merge them on the Mapper side into one larger spill, again using the SortComparator. Before all Mappers finish, Reducers are launched and start pulling their spills from each Mapper and sort-merging them using the SortComparator. When all Mappers have finished and the number of spills on each Reducer side is under a threshold, the data is fully sorted and the Reducers use the GroupingComparator to identify which sorted key/value pairs are covered by the Iterable of values passed to each reduce() call.
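To make that concrete, here is a minimal sketch of how those pieces are usually wired together for a secondary sort. CompositeKey, NaturalKeyPartitioner, CompositeKeyComparator and NaturalKeyGroupingComparator are hypothetical classes you would write yourself; only the Job methods are the real API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SecondarySortDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "secondary sort");
        job.setJarByClass(SecondarySortDriver.class);
        // Hypothetical classes, for illustration only:
        // Partitioner decides which reducer gets a composite key (by its "natural" part).
        job.setPartitionerClass(NaturalKeyPartitioner.class);
        // SortComparator orders keys during the map-side and reduce-side sort/merge.
        job.setSortComparatorClass(CompositeKeyComparator.class);
        // GroupingComparator decides which sorted keys share a single reduce() call.
        job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}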

Related

Shuffle and sort for mapreduce

I read through the Definitive Guide and some other links on the web, including the one here.
My question is
where exactly does shuffling and sorting happen?
As per my understanding, they happen on both mappers and reducers. But some links mention that shuffling happens on mappers and sorting on reducers.
Can someone confirm if my understanding is correct; if not can they provide additional documentation I can go through?
Shuffle:
MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers map outputs to the reducers as inputs is known as the shuffle.
Sort:
Sorting happens at various stages of a MapReduce program, so it exists in both the Map and Reduce phases.
Please have a look at this diagram
Adding more description to the above image for the Map and Reduce phases.
The Map Side:
When the map function starts producing output, it is not simply written to disk. Before the map output is written to disk, a thread first divides the data into partitions corresponding to the reducers that it will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key.
The Reduce Side:
When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This will be done in rounds.
Source: Hadoop Definitive Guide.
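Both the map-side spill buffer and the merge factor that governs those merge rounds are configurable. A minimal sketch, assuming Hadoop 2.x/3.x property names (the values are examples only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SortAndMergeTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Map side: size (MB) of the buffer in which map output is partitioned and sorted,
        // and the fill fraction at which the background thread starts a sorted spill to disk.
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        // Both sides: how many sorted streams are merged per round; with 50 segments
        // and a factor of 10, the merge finishes in roughly two rounds.
        conf.setInt("mapreduce.task.io.sort.factor", 10);
        Job job = Job.getInstance(conf, "sort and merge tuning");
        // ... set mapper, reducer, input/output paths as usual, then submit the job.
    }
}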

Split input to a reducer in hadoop

This question is kind of related to my other question Hadoop handling data skew in reducer.
However, I would like to ask if there are configuration settings available so that, say, if the maximum reducer memory is reached, a new reducer is spawned on another datanode to handle the remaining data in the context?
Or maybe even on the same datanode, so that, say, some x records from the context are read in the reduce method up to some limit, and then the remaining ones are read by a new reducer?
You could try out a combiner, which would reduce the workload of a single reducer handling many key/value pairs by doing a possible aggregation before the data goes through to the reducer. If you are doing a join, then you could try out a skewed join in Pig. It involves 2 MR jobs. In the first MR it does sampling on one input, and if it finds a key that is so skewed that it cannot fit into memory, it splits that key across more than one reducer. For the records other than the ones identified in the sample it does a default join. For the skewed keys it duplicates the matching records of the other input and sends them to each of those reducers.
It is not possible to spawn a new auxiliary reducer to balance the load on the job run.
Rather, you could think of picking another key element from your records which will help in shuffling the data evenly across the reducers.
Alternatively, you could expand the existing reducers' memory settings to accommodate more shuffled records and to get the sorting/merging done more quickly. Please refer to the properties below:
mapreduce.reduce.memory.mb
mapreduce.reduce.java.opts
mapreduce.reduce.merge.inmem.threshold
mapreduce.reduce.shuffle.input.buffer.percent
mapreduce.reduce.shuffle.merge.percent
mapreduce.reduce.input.buffer.percent
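For example, a minimal sketch of raising these settings from the driver; the values below are illustrative only, and the same properties can also be passed on the command line as -D property=value when the driver uses ToolRunner.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceMemoryTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.reduce.memory.mb", 4096);                        // container size for each reduce task
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");                    // JVM heap inside that container
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);  // heap fraction for buffering shuffled map output
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);         // buffer usage that triggers an in-memory merge
        conf.setInt("mapreduce.reduce.merge.inmem.threshold", 1000);            // map-output count that triggers an in-memory merge
        conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.0f);           // heap fraction allowed to hold map output during the reduce
        Job job = Job.getInstance(conf, "reduce-side memory tuning");
        // ... set mapper, reducer, input/output paths as usual, then submit the job.
    }
}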
As far as I remember, there was an extended MapReduce library, SkewTune, written to load-balance data skew during the course of a job run. But I have never experimented with it; kindly check whether it is helpful.
That is not possible. The number of reducers is fixed in the Driver configuration.

Please help: what is the necessity of shuffle and sort in Hadoop?

In a normal wordcount program in mapreduce, do we need to set any method for shuffle and sort, or the framework will take care of this?
The framework will take care of this. Shuffling is the process of transferring data from the mappers to the reducers, which reduce the data in ascending (lexicographical) order of their intermediate keys (words).
You can change the default settings, but there is no need to do it in a wordcount program.
You just need to set a mapper and a reducer and optionally (but really helps in speed) a combiner.
Even implementing a mapper and a reducer of your own is not necessary, as Hadoop comes with such implementations for word count: a mapper (TokenCounterMapper) and a reducer (IntSumReducer, which can also be used as a combiner).
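So a complete word-count driver can be as short as the sketch below, using only those stock classes; the input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenCounterMapper.class);   // splits each line into (word, 1) pairs
        job.setCombinerClass(IntSumReducer.class);      // optional local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);       // sums the counts per word
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}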

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

In Map Reduce programming the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair.
What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?
First of all, shuffling is the process of transferring data from the mappers to the reducers, so I think it is obvious that it is necessary for the reducers, since otherwise they wouldn't be able to have any input (or input from every mapper). Shuffling can start even before the map phase has finished, to save some time. That's why you can see a reduce status greater than 0% (but less than 33%) when the map status is not yet 100%.
Sorting saves time for the reducer, helping it easily distinguish when a new reduce call should start. It simply starts a new reduce() call when the next key in the sorted input data is different from the previous one, to put it simply. Each reduce task takes a list of key-value pairs, but it has to call the reduce() method, which takes a key and a list of values as input, so it has to group values by key. This is easy to do if the input data is pre-sorted (locally) in the map phase and simply merge-sorted in the reduce phase (since the reducers get data from many mappers).
Partitioning, which you mentioned in one of the answers, is a different process. It determines to which reducer a (key, value) pair, output of the map phase, will be sent. The default Partitioner uses hashing on the keys to distribute them to the reduce tasks, but you can override it and use your own custom Partitioner.
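As an illustration, a custom Partitioner that routes keys to reducers by their first character instead of their hash code might look like the sketch below; it is purely hypothetical and would be registered with job.setPartitionerClass(FirstLetterPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 0 || key.getLength() == 0) {
            return 0;
        }
        // Same first character -> same reducer; the mask keeps the result non-negative.
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}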
A great source of information for these steps is this Yahoo tutorial (archived).
A nice graphical representation of this is the following (shuffle is called "copy" in this figure):
Note that shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)). Then, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster).
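A minimal sketch of such a map-only job; MyMapper stands in for whatever mapper implementation you use.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only job");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(MyMapper.class); // hypothetical mapper class
        job.setNumReduceTasks(0);           // skip shuffle, sort, and reduce entirely
        // ... set input/output formats and paths as usual; mappers write part-m-xxxxx files directly.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}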
UPDATE: Since you are looking for something more official, you can also read Tom White's book "Hadoop: The Definitive Guide". Here is the interesting part for your question.
Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation, so I guess it is pretty credible and official...
Let's revisit the key phases of a MapReduce program.
The map phase is done by mappers. Mappers run on unsorted input key/value pairs. Each mapper emits zero, one, or multiple output key/value pairs for each input key/value pair.
The combine phase is done by combiners. The combiner should combine key/value pairs with the same key. Each combiner may run zero times, once, or multiple times.
The shuffle and sort phase is done by the framework. Data from all mappers are grouped by the key, split among reducers and sorted by the key. Each reducer obtains all values associated with the same key. The programmer may supply custom compare functions for sorting and a partitioner for data split.
The partitioner decides which reducer will get a particular key value pair.
The reducer obtains sorted key/[values list] pairs, sorted by the key. The value list contains all values with the same key produced by mappers. Each reducer emits zero, one or multiple output key/value pairs for each input key/value pair.
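In the Java API those sorted key/[values list] pairs arrive as one reduce() call per key, with an Iterable of the key's values. The classic summing reducer sketched below shows the shape of it:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) { // all values that share this key, from all mappers
            sum += value.get();
        }
        context.write(key, new IntWritable(sum)); // zero, one, or more emits are allowed
    }
}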
Have a look at this javacodegeeks article by Maria Jurcovicova and this mssqltips article by Datta for a better understanding.
Below is the image from safaribooksonline article
I thought of just adding some points missing from the above answers. This diagram, taken from here, clearly shows what is really going on.
If I restate the real purpose of each step:
Split: Improves the parallel processing by distributing the processing load across different nodes (Mappers), which would save the overall processing time.
Combine: Shrinks the output of each Mapper. It saves the time spent moving the data from one node to another.
Sort (Shuffle & Sort): Makes it easy for the run-time to schedule (spawn/start) new reducers, where while going through the sorted item list, whenever the current key is different from the previous, it can spawn a new reducer.
Some data processing requirements don't need sorting at all. Syncsort has made the sorting in Hadoop pluggable. Here is a nice blog post from them on sorting. The process of moving the data from the mappers to the reducers is called shuffling; check this article for more information on it.
I've always assumed this was necessary as the output from the mapper is the input for the reducer, so it was sorted based on the keyspace and then split into buckets for each reducer input. You want to ensure all the same values of a Key end up in the same bucket going to the reducer so they are reduced together. There is no point sending K1,V2 and K1,V4 to different reducers as they need to be together in order to be reduced.
Tried explaining it as simply as possible
Shuffling is the process by which intermediate data from the mappers is transferred to 0, 1, or more reducers. Each reducer receives one or more keys and their associated values, depending on the number of reducers (for a balanced load). Further, the key/value pairs each reducer receives are sorted locally by key.
Because of its size, a distributed dataset is usually stored in partitions, with each partition holding a group of rows. This also improves parallelism for operations like a map or filter. A shuffle is any operation over a dataset that requires redistributing data across its partitions. Examples include sorting and grouping by key.
A common method for shuffling a large dataset is to split the execution into a map and a reduce phase. The data is then shuffled between the map and reduce tasks. For example, suppose we want to sort a dataset with 4 partitions, where each partition is a group of 4 blocks. The goal is to produce another dataset with 4 partitions, but this time sorted by key.
In a sort operation, for example, each square is a sorted subpartition with keys in a distinct range. Each reduce task then merge-sorts subpartitions of the same shade.
The above diagram shows this process. Initially, the unsorted dataset is grouped by color (blue, purple, green, orange). The goal of the shuffle is to regroup the blocks by shade (light to dark). This regrouping requires an all-to-all communication: each map task (a colored circle) produces one intermediate output (a square) for each shade, and these intermediate outputs are shuffled to their respective reduce task (a gray circle).
The text and image was largely taken from here.
There are only two things that MapReduce does natively: sort and (implemented on top of sort) a scalable group-by.
Most applications and design patterns over MapReduce are built on these two operations, which are provided by shuffle and sort.
This is a good read; hope it helps. In terms of the sorting you are asking about, I think it is for the merge operation in the last step of the map. When the map operation is done and needs to write the result to local disk, a multi-way merge is performed on the spills generated from the buffer. And for a merge operation, having each partition sorted in advance is helpful.
Well,
In MapReduce there are two important phases called Mapper and Reducer. Both are important, but only the Mapper is mandatory; in some programs reducers are optional. Now, coming to your question.
Shuffling and sorting are two important operations in MapReduce. First the Hadoop framework takes structured/unstructured data and separates the data into key/value pairs.
Now the Mapper program separates and arranges the data into keys and values to be processed, generating key2/value2 pairs. These values must be processed and rearranged in the proper order to get the desired solution. This shuffle and sort is done on the local system (the framework takes care of it), and after processing, the framework cleans up the intermediate data on the local system.
Here we also use a combiner and a partitioner to optimize this shuffle and sort process. After proper arrangement, those key/value pairs are passed to the Reducer to get the desired client output. Finally, the Reducer gets the desired output.
(K1, V1) -> (K2, V2) (we write the Mapper program) -> (K2, list(V2)) (here the framework shuffles and sorts the data) -> (K3, V3) (the Reducer generates the output).
Please note that all these steps are logical operations only; they do not change the original data.
Your question: What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?
Short answer: to process the data to get the desired output. Shuffling aggregates the data, and reduce produces the expected output.

Mapper and Reducer in Hadoop

I have a confusion about the implementation of Hadoop.
I notice that when I run my Hadoop MapReduce job with multiple mappers and reducers, I would get many part-xxxxx files. Meanwhile, it is true that a key only appears in one of them.
Thus, I am wondering how MapReduce works such that a key only goes to one output file?
Thanks in advance.
The shuffle step in the MapReduce process is responsible for ensuring that all records with the same key end up in the same reduce task. See this Yahoo tutorial for a description of the MapReduce data flow. The section called Partition & Shuffle states that
Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper is its origin.
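That guarantee comes from the partitioning step: the default HashPartitioner computes the partition purely from the key, so every occurrence of a key, whichever mapper emitted it, lands in the same reduce task and therefore the same output file. The small sketch below mirrors its logic:

import org.apache.hadoop.io.Text;

public class HashPartitionDemo {
    // Same formula the default HashPartitioner uses: it depends only on the key.
    static int partitionFor(Text key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor(new Text("hadoop"), 4));
        System.out.println(partitionFor(new Text("hadoop"), 4)); // always the same partition for the same key
    }
}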
Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
Sort
The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.
I got this from here:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Have a look at it; I hope this will be helpful.
