What's the difference between shuffle phase and combiner phase? - hadoop

I'm pretty confused about the MapReduce framework. Reading different sources about it has only added to the confusion. Anyway, this is my idea of a MapReduce job:
1. Map() --> emit <key, value>
2. Partitioner (OPTIONAL) --> divides the intermediate output from the mapper and assigns it to different reducers
3. Shuffle phase, used to make: <key, list of values>
4. Combiner, a component used like a mini-reducer which performs some operations on the data and then passes it on to the reducer. The combiner works locally, not on HDFS, saving space and time.
5. Reducer, which gets the data from the combiner, performs further operations (probably the same as the combiner) and then releases the output.
6. We will have n output parts, where n is the number of reducers.
Is that basically right? I ask because I found some sources stating that the combiner IS the shuffle phase, and that it basically groups each record by key...

Combiner is NOT at all similar to the shuffling phase. What you describe as shuffling is wrong, which is the root of your confusion.
Shuffling is just copying keys from the map side to the reduce side; it has nothing to do with key generation. It is the first phase of a Reducer, with the other two being sorting and then reducing.
Combining is like executing a reducer locally, for the output of each mapper. It basically acts like a reducer (it also extends the Reducer class), which means that, like a reducer, it groups the local values that the mapper has emitted for the same key.
Partitioning is, indeed, assigning the map output keys to specific reduce tasks, but it is not optional. Overriding the default HashPartitioner with an implementation of your own is optional.
I tried to keep this answer minimal, but you can find more information in the book Hadoop: The Definitive Guide by Tom White, as Azim suggests, and some related things in this post.
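To make that last point concrete, here is a minimal sketch of a custom Partitioner; the class name and the vowel-based rule are just assumptions for illustration, not anything the answers above prescribe:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical rule: keys starting with a vowel go to one reduce task,
// everything else goes to another (modulo the number of reduce tasks).
public class VowelPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String word = key.toString();
        boolean vowel = !word.isEmpty()
                && "aeiou".indexOf(Character.toLowerCase(word.charAt(0))) >= 0;
        return (vowel ? 0 : 1) % numReduceTasks;
    }
}

It would be wired in with job.setPartitionerClass(VowelPartitioner.class); leaving that call out keeps the default HashPartitioner.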

Think of combiner as a mini-reducer phase that only works on the output of map task within each node before it emits it to the actual reducer.
Taking the classical WordCount example, the map phase output would be (word, 1) for each word the map task processes. Let's assume the input to be processed is
"She lived in a big house with a big garage on the outskirts of a big
city in India"
Without a combiner, the map phase would emit (big,1) three times, (a,1) three times and (in,1) two times. But when a combiner is used, the map phase would emit (big,3), (a,3) and (in,2). Note that the individual occurrences of each of these words are aggregated locally within the map phase before it emits its output to the reduce phase. In use cases where a combiner is used, it optimises the job by minimising the network traffic from map to reduce, thanks to this local aggregation.
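For WordCount this local aggregation is usually obtained simply by registering the reducer class as the combiner in the driver. A minimal sketch, assuming hypothetical WordCountMapper and WordCountReducer classes like the ones in the standard WordCount example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // emits (word, 1)
        job.setCombinerClass(WordCountReducer.class);  // local, per-node aggregation
        job.setReducerClass(WordCountReducer.class);   // global aggregation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}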
During the shuffle phase, the output from the various map tasks is redirected to the correct reducer. This is handled internally by the framework. If a partitioner is used, it decides which reducer each key is shuffled to.

I don't think that the combiner is a part of the Shuffle and Sort phase.
The Combiner is itself one of the (optional) phases of the job lifecycle.
The pipelining of these phases could be like:
Map --> Partition --> Combiner(optional) --> Shuffle and Sort --> Reduce
Out of these phases, Map, Partition and Combiner operate on the same node.
Hadoop dynamically selects the nodes that run the Reduce phase, depending upon the availability and accessibility of resources, in the best possible way.
Shuffle and Sort, an important intermediate phase, works across the Map and Reduce nodes.
When a client submits a job, the Map phase starts working on the input file, which is stored across nodes in the form of blocks.
Mappers process each line of the file one by one and put the generated results into a memory buffer of 100 MB (local memory to each mapper). When this buffer fills up to a certain threshold, 80% by default, it is sorted and then spilled to disk (as a file). Each Mapper can generate multiple such intermediate sorted spill files. When the Mapper is done with all the lines of its block, all such spills are merged together (to form a single file) and sorted on the basis of the key, and then the Combiner phase starts working on this single file. Note that if there is no Partition phase, only one intermediate file will be produced; but in the case of partitioning, multiple files get generated, depending upon the developer's logic. The corresponding figure in O'Reilly's Hadoop: The Definitive Guide may help you understand this concept in more detail.
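If you ever need to adjust that spill behaviour, the buffer size and threshold can be set on the job configuration. A minimal sketch, assuming the usual Hadoop 2.x property names (older releases used io.sort.mb and io.sort.spill.percent) and showing the default values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuningExample {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Map-side sort buffer described above (the values shown are the defaults).
        conf.setInt("mapreduce.task.io.sort.mb", 100);             // in-memory buffer size in MB
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);  // spill threshold (80%)
        return Job.getInstance(conf, "spill tuning example");
    }
}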
Later, Hadoop copies the merged file from each of the Mapper nodes to the Reducer nodes, depending upon the key value. That is, all the records with the same key will be copied to the same Reducer node.
I think you already know in depth how the Shuffle/Sort and Reduce phases work, so I am not going into more detail on those topics.
Also, for more information, I would suggest you read O'Reilly's Hadoop: The Definitive Guide. It's an awesome book on Hadoop.

Related

When does the Partitioner run in MapReduce?

As per my understanding, the mapper runs first, followed by the partitioner (if any), followed by the reducer. But if we use a Partitioner class, I am not sure when the sorting and shuffling phases run.
A CLOSER LOOK
The diagram below explains the complete flow. From it, you can see where the mapper and reducer components of the Word Count application fit in, and how the job achieves its objective. We will now examine this system in a bit more detail.
[Figure: mapreduce-flow — the end-to-end MapReduce data flow for Word Count]
The Shuffle and Sort phase will always execute (across the mapper and reducer nodes).
The order of the different phases in MapReduce is as below:
Map --> Partition --> Combiner(optional) --> Shuffle and Sort --> Reduce.
The short answer is: data sorting runs on the reducers; shuffling/sorting always runs before the reduce function, and after the map/combiner (if any)/partitioner (if any).
The long answer is that in a MapReduce job there are 4 main players:
Mapper, Combiner, Partitioner, Reducer. All of these are classes you can actually implement on your own.
Let's take the famous word count program, and let's assume the split where we are working contains:
pippo, pluto, pippo, pippo, paperino, pluto, paperone, paperino, paperino
and each word is a record.
Mapper
Each mapper runs over a subset (split) of your file; its task is to read each record from the split and assign a key to each record it outputs.
The mapper will store its intermediate results on disk (local disk).
The intermediate output from this stage will be:
pippo,1
pluto,1
pippo,1
pippo,1
paperino,1
pluto,1
paperone,1
paperino,1
paperino,1
All of this will be stored on the local disk of the node which runs the mapper.
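A minimal mapper producing that kind of output could look like the following sketch (the class name is just an assumption):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word in the split it is assigned.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("[\\s,]+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate (key, value) pair
            }
        }
    }
}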
Combiner
It's a mini-reducer and can aggregate data. It can also perform joins, so-called map-side joins. This object helps save bandwidth in the cluster because it aggregates data on the local node.
The output from the combiner, which is still part of the mapper phase, will be:
pippo,3
pluto,2
paperino,3
paperone,1
Of course, this is the data from ONE node. Now we have to send the data to the reducers in order to get the global result. Which reducer will process each record depends on the partitioner.
Partitioner
Its task is to spread the data across all the available reducers. This object reads the output from the combiner and selects the reducer which will process each key.
In this example we have two reducers and we use the following rule:
all the pippo goes to reducer 1
all the pluto goes to reducer 2
all the paperino goes to reducer 2
all the paperone goes to reducer 1
so all the nodes will send records which have the key pippo to the same reducer (1), all the nodes will send records which have the key pluto to the same reducer (2), and so on...
Here is where the data get shuffled/sorted and, since the combiner has already reduced the data locally, this node has to send only 4 records instead of 9.
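A custom partitioner implementing exactly that rule could look like this sketch (the class name is assumed; in a real word count the default HashPartitioner would normally be used instead):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends pippo and paperone to reducer 1 and pluto and paperino to reducer 2
// (partitions are 0-based, so reducer 1 is partition 0, reducer 2 is partition 1).
public class DuckPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String word = key.toString();
        if (word.equals("pippo") || word.equals("paperone")) {
            return 0 % numReduceTasks;   // reducer 1
        }
        return 1 % numReduceTasks;       // reducer 2
    }
}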
Reducer
This object aggregates the data coming from each node; its input arrives already grouped and sorted by key.
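A matching reducer for this word count, which just sums the values it receives for each key, might look like this sketch (class name assumed; the same class could also be registered as the combiner):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts that the shuffle delivers for each word, e.g. producing (pippo, 3).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        result.set(sum);
        context.write(word, result);
    }
}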

Shuffle and sort for mapreduce

I read through the definitive guide and some other links on the web including the one here
My question is
where exactly does shuffling and sorting happen?
As per my understanding, they happen on both mappers and reducers. But some links mention that shuffling happens on mappers and sorting on reducers.
Can someone confirm if my understanding is correct; if not can they provide additional documentation I can go through?
Shuffle:
MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers map outputs to the reducers as inputs is known as the shuffle.
Sort:
Sorting happens at various stages of a MapReduce program, so it can exist in both the Map and Reduce phases.
[Figure: the shuffle and sort diagram from Hadoop: The Definitive Guide]
Adding more description to that picture, for the Map and Reduce sides:
The Map Side:
When the map function starts producing output, it is not simply written to disk. Before the map output is written to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key.
The Reduce Side:
When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This will be done in rounds.
Source : Hadoop Definitive Guide.
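On the reduce side, the number of parallel copy threads doing that fetching and the number of streams merged per round can be tuned. A minimal sketch, assuming the usual Hadoop 2.x property names and showing their default values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningExample {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 5); // threads copying map output to the reducer
        conf.setInt("mapreduce.task.io.sort.factor", 10);          // spill files merged at once per merge round
        return Job.getInstance(conf, "shuffle tuning example");
    }
}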

Is the output of the map phase of a MapReduce job always sorted?

I am a bit confused with the output I get from Mapper.
For example, when I run a simple wordcount program, with this input text:
hello world
Hadoop programming
mapreduce wordcount
lets see if this works
12345678
hello world
mapreduce wordcount
this is the output that I get:
12345678 1
Hadoop 1
hello 1
hello 1
if 1
lets 1
mapreduce 1
mapreduce 1
programming 1
see 1
this 1
wordcount 1
wordcount 1
works 1
world 1
world 1
As you can see, the output from the mapper is already sorted. I did not run a Reducer at all.
But I find in a different project that the output from the mapper is not sorted.
So I am not totally clear about this..
My questions are:
Is the mapper's output always sorted?
Is the sort phase integrated into the mapper phase already, so that the output of map phase is already sorted in the intermediate data?
Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and a list of iterables. Is there a way I could persist this data?
Is the mapper's output always sorted?
No. It is not sorted if you use no reducer. If you use a reducer, there is a pre-sorting process before the mapper's output is written to disk, and the data then gets fully merge-sorted in the Reduce phase. What is happening here (just a guess) is that you are not specifying a Reducer class, which, in the new API, is translated into using the Identity Reducer (see this answer and comment). The Identity Reducer just outputs its input. To verify that, see the default Reducer counters (there should be some reduce tasks, reduce input records & groups, reduce output records...)
Is the sort phase integrated into the mapper phase already, so that the output of map phase is already sorted in the intermediate data?
As I explained in the previous question, if you use no reducers, mapper does not sort the data. If you do use reducers, the data start getting sorted from the map phase and then get merge-sorted in the reduce phase.
Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and a list of iterables. Is there a way I could persist this data?
Again, shuffling and sorting are parts of the Reduce phase. An Identity Reducer will do what you want. If you want to output one key-value pair per reducer, with the values being a concatenation of the iterables, just store the iterables in memory (e.g. in a StringBuffer) and then output this concatenation as a value. If you want the map output to go straight to the program's output, without going through a reduce phase, then set in the driver class the number of reduce tasks to zero, like that:
job.setNumReduceTasks(0);
This will not get your output sorted, though. It will skip the pre-sorting process of the mapper and write the output directly to HDFS.
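As a sketch of the "concatenate the iterable and persist it" idea mentioned above (the class name is an assumption, and a StringBuilder is used in place of the StringBuffer mentioned), such a reducer could look like this:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Writes one record per key, whose value is the comma-separated concatenation
// of all the values the shuffle delivered for that key.
public class ConcatReducer extends Reducer<Text, Text, Text, Text> {
    private final Text output = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();
        for (Text value : values) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(value.toString());
        }
        output.set(sb.toString());
        context.write(key, output);
    }
}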
Point 1: the output from the mapper is always sorted, but based on the key.
I.e., if the map method does context.write(outKey, outValue); then the result will be sorted based on outKey.
Following are some explanations for your questions.
Is the output from the mapper always sorted?
Already answered by @SurJanSR.
Is the sort phase integrated into the mapper phase already, so that the output of the map phase is already sorted in the intermediate data?
In a MapReduce job, as you know, the Mapper runs on individual splits of data, across the nodes where the data resides. The result of the Mapper is written TEMPORARILY before it is passed to the next phase.
In the case of a reduce operation, the TEMPORARILY stored Mapper output is sorted and shuffled, based on the partitioner, before it is moved to the reduce operation.
In the case of a Map-only job, as in your case, the temporarily stored Mapper output is sorted based on the key and written to the final output folder (as specified in your arguments for the Job).
Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and a list of iterables. Is there a way I could persist this data?
I'm not sure what your requirement is. Using an IdentityReducer would just persist the output. I'm not sure if this answers your question.
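For reference, in the new API the base Reducer class already behaves as an identity reducer, so a hand-written equivalent is just a pass-through. A minimal sketch (the class name and the Text types are assumptions):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Pass-through reducer: writes every (key, value) pair it receives unchanged,
// which persists the shuffled and sorted map output as the job's final output.
public class PassThroughReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(key, value);
        }
    }
}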
I support the answer of vefthym.
Usually the Mapper output is sorted before being stored locally on the node. But when you explicitly set numReduceTasks to 0 in the job configuration, the mapper output will not be sorted and is written directly to HDFS.
So we cannot say that Mapper output is always sorted!
1. Is the mapper's output always sorted?
2. Is the sort phase integrated into the mapper phase already, so that the output of the map phase is already sorted in the intermediate data?
From Apache MapReduceTutorial:
( Under Mapper Section )
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.
The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job
( Under Reducer Section )
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
3. Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and a list of iterables. Is there a way I could persist this data?
I don't think so. From the Apache documentation on Reducer:
Reducer has 3 primary phases:
Shuffle:
The Reducer copies the sorted output from each Mapper using HTTP across the network.
Sort:
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).
The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
Reduce:
The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
As per the documentation, the shuffle and sort phase is driven by the framework.
If you want to persist the data, set the number of reducers to zero, which causes the map output to be persisted to HDFS, but it won't sort the data.
Have a look at related SE question:
hadoop: difference between 0 reducer and identity reducer?
I did not find IdentityReducer in Hadoop 2.x version:
identityreducer in the new Hadoop API

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

In Map Reduce programming the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair.
What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?
First of all, shuffling is the process of transferring data from the mappers to the reducers, so I think it is obvious that it is necessary for the reducers, since otherwise they wouldn't be able to have any input (or input from every mapper). Shuffling can start even before the map phase has finished, to save some time. That's why you can see a reduce status greater than 0% (but less than 33%) when the map status is not yet 100%.
Sorting saves time for the reducer, helping it easily distinguish when a new reduce task should start. To put it simply, it starts a new reduce task when the next key in the sorted input data is different from the previous one. Each reduce task takes a list of key-value pairs, but it has to call the reduce() method, which takes a key-list(value) input, so it has to group values by key. It's easy to do so if the input data is pre-sorted (locally) in the map phase and simply merge-sorted in the reduce phase (since the reducers get data from many mappers).
Partitioning, that you mentioned in one of the answers, is a different process. It determines in which reducer a (key, value) pair, output of the map phase, will be sent. The default Partitioner uses a hashing on the keys to distribute them to the reduce tasks, but you can override it and use your own custom Partitioner.
A great source of information for these steps is this Yahoo tutorial (archived).
[Figure: a graphical representation of the MapReduce data flow; the shuffle is labelled "copy" in that figure.]
Note that shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)). Then, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster).
UPDATE: Since you are looking for something more official, you can also read Tom White's book "Hadoop: The Definitive Guide". Here is the interesting part for your question.
Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation, so I guess it is pretty credible and official...
Let's revisit the key phases of a MapReduce program.
The map phase is done by mappers. Mappers run on unsorted input key/value pairs. Each mapper emits zero, one, or multiple output key/value pairs for each input key/value pair.
The combine phase is done by combiners. The combiner should combine key/value pairs with the same key. Each combiner may run zero, one, or multiple times.
The shuffle and sort phase is done by the framework. Data from all mappers are grouped by the key, split among reducers and sorted by the key. Each reducer obtains all values associated with the same key. The programmer may supply custom compare functions for sorting and a partitioner for data split.
The partitioner decides which reducer will get a particular key value pair.
The reducer obtains sorted key/[values list] pairs, sorted by the key. The value list contains all values with the same key produced by mappers. Each reducer emits zero, one or multiple output key/value pairs for each input key/value pair.
Have a look at this javacodegeeks article by Maria Jurcovicova and the mssqltips article by Datta for a better understanding.
[Image: the MapReduce flow figure from the safaribooksonline article]
I thought of just adding some points missing in the above answers. This diagram, taken from here, clearly shows what's really going on.
If I restate the real purpose of each step:
Split: Improves the parallel processing by distributing the processing load across different nodes (Mappers), which saves the overall processing time.
Combine: Shrinks the output of each Mapper. It saves the time spent moving the data from one node to another.
Sort (Shuffle & Sort): Makes it easy for the runtime to schedule (spawn/start) new reducers: while going through the sorted item list, whenever the current key is different from the previous one, it can spawn a new reducer.
Some data processing requirements don't need sorting at all. Syncsort has made the sorting in Hadoop pluggable. Here is a nice blog from them on sorting. The process of moving the data from the mappers to the reducers is called shuffling; check this article for more information on it.
I've always assumed this was necessary as the output from the mapper is the input for the reducer, so it was sorted based on the keyspace and then split into buckets for each reducer input. You want to ensure all the same values of a Key end up in the same bucket going to the reducer so they are reduced together. There is no point sending K1,V2 and K1,V4 to different reducers as they need to be together in order to be reduced.
I tried to explain it as simply as possible.
Shuffling is the process by which intermediate data from the mappers is transferred to 0, 1 or more reducers. Each reducer receives one or more keys and their associated values, depending on the number of reducers (for a balanced load). Furthermore, the values associated with each key are locally sorted.
Because of its size, a distributed dataset is usually stored in partitions, with each partition holding a group of rows. This also improves parallelism for operations like a map or filter. A shuffle is any operation over a dataset that requires redistributing data across its partitions. Examples include sorting and grouping by key.
A common method for shuffling a large dataset is to split the execution into a map and a reduce phase. The data is then shuffled between the map and reduce tasks. For example, suppose we want to sort a dataset with 4 partitions, where each partition is a group of 4 blocks. The goal is to produce another dataset with 4 partitions, but this time sorted by key.
In a sort operation, for example, each square is a sorted subpartition with keys in a distinct range. Each reduce task then merge-sorts subpartitions of the same shade.
The above diagram shows this process. Initially, the unsorted dataset is grouped by color (blue, purple, green, orange). The goal of the shuffle is to regroup the blocks by shade (light to dark). This regrouping requires an all-to-all communication: each map task (a colored circle) produces one intermediate output (a square) for each shade, and these intermediate outputs are shuffled to their respective reduce task (a gray circle).
The text and image were largely taken from here.
There are only two things that MapReduce does NATIVELY: sort and (implemented by sort) scalable GroupBy.
Most applications and design patterns over MapReduce are built on these two operations, which are provided by shuffle and sort.
This is a good read. Hope it helps. In terms of the sorting you are concerned about, I think it is used for the merge operation in the last step of the map. When the map operation is done and the result needs to be written to local disk, a multi-way merge is performed on the spills generated from the buffer. And for a merge operation, having each partition sorted in advance is helpful.
Well,
In MapReduce there are two important phases, called Mapper and Reducer; both are important, but in some programs the Reducer is optional. Now to your question.
Shuffling and sorting are two important operations in MapReduce. First, the Hadoop framework takes structured/unstructured data and separates it into keys and values.
Now the Mapper program separates and arranges the data into keys and values to be processed, generating key2 and value2 pairs. These values have to be processed and rearranged in the proper order to get the desired solution. This shuffle and sort is done on the local system (the framework takes care of it), and after processing, the framework cleans up the local intermediate data.
Ok
Here we also use a combiner and a partitioner to optimize this shuffle and sort process. After proper arrangement, those key/value pairs are passed to the Reducer to get the desired client output. Finally, the Reducer produces the desired output.
K1,V1 -> Mapper (the program we write) -> K2,V2 -> shuffle and sort -> K2,list(V2) -> Reducer -> K3,V3 (the generated output).
Please note that all these steps are logical operations only; they do not change the original data.
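That key/value type flow maps directly onto the generic parameters of the Hadoop classes. A small sketch, with concrete types and class names chosen only for illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<K1, V1, K2, V2>: consumes (K1, V1) pairs and emits (K2, V2) pairs.
class FlowMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

// The framework then shuffles and sorts, turning the (K2, V2) pairs into (K2, list(V2)).

// Reducer<K2, V2, K3, V3>: consumes (K2, list(V2)) and emits (K3, V3) pairs.
class FlowReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }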
Your question: What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?
Short answer: to process the data to get the desired output. Shuffling aggregates the data; reduce produces the expected output.

Mapper and Reducer in Hadoop

I am confused about part of the implementation of Hadoop.
I notice that when I run my Hadoop MapReduce job with multiple mappers and reducers, I get many part-xxxxx files. Meanwhile, it is true that a key only appears in one of them.
Thus, I am wondering how MapReduce works such that a key only goes to one output file?
Thanks in advance.
The shuffle step in the MapReduce process is responsible for ensuring that all records with the same key end up in the same reduce task. See this Yahoo tutorial for a description of the MapReduce data flow. The section called Partition & Shuffle states that
Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper is its origin.
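The reason a given key ends up in exactly one part-xxxxx file is that the partitioner maps each key deterministically to a single reduce task. The default HashPartitioner does essentially the following (shown here as a stand-alone sketch with an assumed class name):

import org.apache.hadoop.mapreduce.Partitioner;

// Essentially what the default HashPartitioner does: every occurrence of the
// same key hashes to the same partition, hence the same reduce task and the
// same part-xxxxx output file.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}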
Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
Sort
The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.
I got this from here
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Have a look at it; I hope this will be helpful.
