I'm getting confused after reading the passage below from Hadoop: The Definitive Guide, 4th edition (page 204):
Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.
Here are my doubts:
1) Which runs first, the combiner or the partitioner?
2) When there are both a custom combiner and a custom partitioner, what is the hierarchy of execution steps?
3) Can we feed compressed data (Avro, SequenceFile, etc.) to a custom combiner? If yes, then how?
Looking for a brief but in-depth explanation!
Thanks in advance.
1/ The answer is already spelled out in the part you quoted: "Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort."
So the partitions are created in memory first; if there is a custom combiner, it is executed in memory on the sorted output of each partition, and the result is spilled to disk at the end.
2/ A custom combiner and a custom partitioner take effect only when they are specified in the driver class:
job.setCombinerClass(MyCombiner.class);
job.setPartitionerClass(MyPartitioner.class);
If no custom combiner is specified, no combiner is executed.
If no custom partitioner is specified, the default partitioner, "HashPartitioner", is used (see page 221 for that).
3/ Yes, it is possible. Don't forget that the mechanism of the combiner is the same as that of the reducer, and the reducer can consume compressed data.
If the job consumes compressed data, that means the input file format is compressed.
For that, you can specify the input format in the driver class:
Sequence File case: job.setInputFormatClass(SequenceFileInputFormat.class);
Avro File case: job.setInputFormatClass(AvroKeyInputFormat.class);
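To make the wiring concrete, here is a minimal driver sketch pulling these settings together (shown for the SequenceFile case). MyMapper, MyCombiner, MyPartitioner and MyReducer are hypothetical placeholders for your own classes, not Hadoop built-ins:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combiner-partitioner-demo");
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(MyMapper.class);
        job.setCombinerClass(MyCombiner.class);       // optional local aggregation
        job.setPartitionerClass(MyPartitioner.class); // custom routing of keys to reducers
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // compressed/container input; swap in AvroKeyInputFormat for the Avro case
        job.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}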
The direct answer to your question is => COMBINER (with the caveat that the partition each record belongs to is assigned before the combiner runs; the combiner runs on the sorted output of each partition, and the shuffle to the reducers happens after it).
Details: Combiners can be viewed as mini-reducers in the map phase. They perform a local reduce on the mapper results before those results are distributed further. Once the combiner has executed, its output is passed on to the reducers for further work,
whereas
the partitioner comes into the picture when we are working with more than one reducer. The partitioner decides which reducer is responsible for a particular key. It takes the mapper result (or, if a combiner is used, the combiner result) and sends it to the responsible reducer based on the key.
For a better understanding you can refer to the following image, which I have taken from the Yahoo Developer Tutorial on Hadoop. Figure 4.6: Combiner step inserted into the MapReduce data flow
Here is the tutorial.
This is the complete MR job flow. Your 1) and 2) are answered here.
The mapper reads the data and processes it. Its output goes to an intermediate output file.
Once the mapper has finished emitting all its key-value pairs, the intermediate output is partitioned into 'R' partitions using either the default partitioner, 'HashPartitioner', or a custom partitioner.
Each partition is sorted by key.
The optional combiner code is executed on the sorted 'R' partitions. The combiner step runs only if a combiner is specified.
Reducers reach out to the mappers and pull their appropriate partitioned files.
After all the mapper tasks have completed and all the intermediate data has been copied to the reducers, the reducers perform one more sort on the data.
Then the reducers work through their individual key-value pairs one by one.
Answer 3: Yes, a combiner can process compressed data. The combiner function runs on the output of the map phase and is used as a filtering or aggregating step to lessen the number of intermediate keys that are passed to the reducer. In most cases the reducer class is also set as the combiner class. The difference lies in the output of these classes: the output of the combiner class is intermediate data that is passed to the reducer, whereas the output of the reducer is written to the output file on disk. The combiner for a job can be set like this:
job.setCombinerClass(CustomCombiner.class);
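As an illustration, a minimal sketch of such a combiner for a word-count-style job is below. CustomCombiner is the hypothetical class name used in the line above, and Text/IntWritable map output types are assumed:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CustomCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // locally sum the counts for this key
        }
        result.set(sum);
        context.write(key, result); // a combiner's input and output types must match
    }
}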
Partitioning runs before the combiner.
a) The mapper processes the input data into key-value pairs.
b) A partitioner (either the default or a custom one) then partitions the data by key, as per the requirement.
c) This is followed by sorting on keys, which is taken care of by background threads.
d) If a combiner exists, it then runs on the sorted output of each partition.
e) Finally the reducer sorts its input data one more time and then runs the reduce process.
I would like to summarize the entire flow:
The mapper reads the data and processes it. Its output goes to an intermediate output file.
Once the mapper has finished emitting all its key-value pairs:
The output of the mapper is first written to a memory buffer.
When the buffer is about to overflow, its contents are spilled to a local directory. Before the spill, the data is partitioned in memory into 'R' partitions, using either the default partitioner 'HashPartitioner' or a custom partitioner ("Within each partition, the background thread performs an in-memory sort by key").
So the spilled data is divided according to the partitioner, and within each partition the result is sorted by key;
if there is a custom combiner, it is executed in memory, and the result is spilled to disk at the end.
Reducers reach out to the mappers and pull their appropriate partitioned files.
After all the mapper tasks have completed and all the intermediate data has been copied to the reducers, the reducers perform one more sort on the data.
Then the reducers work through their individual key-value pairs one by one.
Please point out any gaps in my understanding.
Related
As per my understanding, the mapper runs first, followed by the partitioner (if any), followed by the reducer. But if we use a Partitioner class, I am not sure when the sorting and shuffling phases run.
A CLOSER LOOK
The diagram below explains the complete details.
From this diagram, you can see where the mapper and reducer components of the Word Count application fit in, and how it achieves its objective. We will now examine this system in a bit more detail.
Figure: mapreduce-flow
The shuffle and sort phase always executes (across the mapper and reducer nodes).
The hierarchy of the different phases in MapReduce is as below:
Map --> Partition --> Combiner(optional) --> Shuffle and Sort --> Reduce.
The short answer is: the data is sorted for the reducers; shuffling/sorting runs before the reducer (always) and after the map / combiner (if any) / partitioner (if any).
The long answer is that in a MapReduce job there are 4 main players:
Mapper, Combiner, Partitioner, Reducer. All of these are classes you can implement on your own.
Let's take the famous word count program, and let's assume the split we are working on contains:
pippo, pluto, pippo, pippo, paperino, pluto, paperone, paperino, paperino
and each word is a record.
Mapper
Each mapper runs over a subset of your file; its task is to read each record from the split, assign a key to it, and output a key-value pair.
The mapper will store the intermediate result on disk (local disk).
The intermediate output from this stage will be
pippo,1
pluto,1
pippo,1
pippo,1
paperino,1
pluto,1
paperone,1
paperino,1
paperino,1
This will be stored on the local disk of the node which runs the mapper.
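As a sketch, a mapper producing exactly this intermediate output could look like the following. WordCountMapper is an illustrative name, and the input is assumed to arrive as the comma-separated text above:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // split "pippo, pluto, pippo, ..." into words and emit (word, 1) for each
        StringTokenizer tokens = new StringTokenizer(line.toString(), ", ");
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}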
Combiner
It's a mini-reducer and can aggregate data. It can also run joins, the so-called map-side join. This object helps save bandwidth in the cluster because it aggregates data on the local node.
The output from the combiner, which is still part of the map phase, will be:
pippo,3
pluto,2
paperino,3
paperone,1
Of course, this is the data from ONE node. Now we have to send the data to the reducers in order to get the global result. Which reducer will process each record depends on the partitioner.
Partitioner
Its task is to spread the data across all the available reducers. This object reads the output from the combiner and selects the reducer which will process each key.
In this example we have two reducers and we use the following rule:
all the pippo records go to reducer 1
all the pluto records go to reducer 2
all the paperino records go to reducer 2
all the paperone records go to reducer 1
So all the nodes will send records which have the key pippo to the same reducer (1), all the nodes will send records which have the key pluto to the same reducer (2), and so on...
Here is where the data gets shuffled/sorted, and since the combiner has already reduced the data locally, this node has to send only 4 records instead of 9.
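A sketch of a partitioner hard-coding the rule above might look like this. DuckPartitioner is an illustrative name, and unknown words fall back to plain hash partitioning:

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class DuckPartitioner extends Partitioner<Text, IntWritable> {
    private static final Map<String, Integer> RULE = new HashMap<>();
    static {
        RULE.put("pippo", 0);    // reducer 1
        RULE.put("pluto", 1);    // reducer 2
        RULE.put("paperino", 1); // reducer 2
        RULE.put("paperone", 0); // reducer 1
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        Integer fixed = RULE.get(key.toString());
        if (fixed != null && fixed < numPartitions) {
            return fixed;
        }
        // fallback for words not covered by the rule
        return (key.toString().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}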
Reducer
This object aggregates the data coming from each node, and it is also able to sort the data.
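To complete the walk-through, here is a sketch of a reducer summing the per-node (already combined) counts into the global totals:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get(); // e.g. pippo: 3 from this node + counts from other nodes
        }
        total.set(sum);
        context.write(word, total);
    }
}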
I am trying to determine whether there are hooks available in the Hadoop API (Hadoop 2.0.0 MRv1) to handle data skew for a reducer.
Scenario: I have a custom composite key and partitioner in place to route data to reducers. In order to deal with the odd but very likely case of a million keys with large values ending up on the same reducer, I need some sort of heuristic so that this data can be further partitioned to spawn off new reducers.
I am thinking of a two-step process:
1) Set mapred.max.reduce.failures.percent to, say, 10% and let the job complete.
2) Rerun the job on the failed data set, passing a configuration through the driver which will cause my partitioner to randomly partition the skewed data. The partitioner will implement the Configurable interface.
Is there a better way / another way?
A possible counter-solution may be to write out the output of the mappers and spin off another map job doing the work of the reducer, but I do not want to put pressure on the namenode.
This idea comes to my mind; I am not sure how good it is.
Let's say you are currently running the job with 10 mappers and it is failing because of the data skew. The idea is to set the number of reducers to 15 and also define the maximum number of (key, value) pairs that should go to one reducer from each mapper. You keep that information in a hash map in your custom partitioner class. Once a particular reducer reaches the limit, you start sending the next sets of (key, value) pairs to another reducer from the extra 5 reducers which we have kept for handling the skew; a sketch of this idea follows.
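A rough sketch of such a partitioner, assuming made-up configuration keys (skew.base.partitions, skew.max.per.reducer) and counting per map task only:

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SkewAwarePartitioner extends Partitioner<Text, IntWritable>
        implements Configurable {
    private Configuration conf;
    private int basePartitions;  // reducers used under normal conditions
    private long maxPerReducer;  // per-map-task limit before spilling over
    private final Map<Integer, Long> sent = new HashMap<>();

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        this.basePartitions = conf.getInt("skew.base.partitions", 10);
        this.maxPerReducer = conf.getLong("skew.max.per.reducer", 1000000L);
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int p = (key.hashCode() & Integer.MAX_VALUE)
                % Math.min(basePartitions, numPartitions);
        long count = sent.merge(p, 1L, Long::sum);
        if (count > maxPerReducer && numPartitions > basePartitions) {
            // this reducer is full for this map task: overflow to an extra reducer
            p = basePartitions + (int) (count % (numPartitions - basePartitions));
        }
        return p;
    }
}

Note that once a key spills over, its values end up split across reducers, so this only works if your job can tolerate partial aggregates plus a second aggregation pass.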
If your process allows it, the use of a combiner (a reduce-type function) could help you: if you pre-aggregate the data on the mapper side, then even if all your data ends up in the same reducer, the amount of data may be manageable.
An alternative could be to reimplement the partitioner to avoid the skewed case.
As we know, during the shuffle phase of Hadoop, each reducer reads data from all the mappers' output (intermediate data).
Now, we also know that by default hash partitioning is used for the reducers.
My question is: how do we implement a different algorithm, e.g. a locality-aware one?
In short, you should not do it.
First, you have no control over where the mappers and reducers are executed on the cluster, so even when the complete output of a single mapper goes to a single reducer, there is a huge probability that they will be on different hosts and the data will be transferred through the network.
Second, to make the reducer process the whole output of a mapper, you first have to make the mapper process the right part of the information, which means that you have to preprocess the data by partitioning it and then run a single mapper and a single reducer for each partition; this preprocessing itself would take many resources, so it is mostly meaningless.
And finally, why do you need it? The main concept of map-reduce is manipulating key-value pairs, and the reducer in general should aggregate the list of values output by the mappers for the same keys. That is why hash partitioning is used: to distribute N keys among K reducers. Using a different type of partitioner is a really rare case. If you need data locality, you might prefer to work with an MPP database rather than Hadoop, for example.
If you really need a custom partitioner, here's an example of how it can be implemented: http://hadooptutorial.wikispaces.com/Custom+partitioner. Nothing special: just return a reducer number based on the key and value passed in and the number of reducers. Using the hash code of the host name modulo (%) the number of reducers will make the whole output of a single mapper go to a single reducer. Alternatively you might use the process PID % the number of reducers. But before doing this you have to check whether you really need this behavior or not. A sketch is below.
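For instance, a minimal sketch of the host-name idea, purely illustrative: every record a given map task emits lands in one partition derived from the local host name.

import java.net.InetAddress;
import java.net.UnknownHostException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HostPartitioner extends Partitioner<Text, Text> {
    private static final int HOST_HASH;
    static {
        int h;
        try {
            // all records from this map task share the same host name hash
            h = InetAddress.getLocalHost().getHostName().hashCode();
        } catch (UnknownHostException e) {
            h = 0; // fall back to partition 0 if the host name cannot be resolved
        }
        HOST_HASH = h;
    }

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        return (HOST_HASH & Integer.MAX_VALUE) % numPartitions;
    }
}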
I am a bit confused by the output I get from the Mapper.
For example, when I run a simple wordcount program, with this input text:
hello world
Hadoop programming
mapreduce wordcount
lets see if this works
12345678
hello world
mapreduce wordcount
this is the output that I get:
12345678 1
Hadoop 1
hello 1
hello 1
if 1
lets 1
mapreduce 1
mapreduce 1
programming 1
see 1
this 1
wordcount 1
wordcount 1
works 1
world 1
world 1
As you can see, the output from the mapper is already sorted. I did not run a Reducer at all.
But I find in a different project that the output from the mapper is not sorted.
So I am totally unclear about this.
My questions are:
Is the mapper's output always sorted?
Is the sort phase integrated into the mapper phase already, so that the output of the map phase is already sorted in the intermediate data?
Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and an iterable of values. Is there a way I could persist this data?
Is the mapper's output always sorted?
No. It is not sorted if you use no reducer. If you use a reducer, there is a pre-sorting process before the mapper's output is written to disk, and the data then get fully sorted in the reduce phase. What is happening here (just a guess) is that you are not specifying a Reducer class, which, in the new API, is translated into using the identity Reducer (see this answer and comment). The identity Reducer just outputs its input. To verify that, check the default Reducer counters (there should be some reduce tasks, reduce input records & groups, reduce output records...).
Is the sort phase integrated into the mapper phase already, so that the output of the map phase is already sorted in the intermediate data?
As I explained in the previous question, if you use no reducers, the mapper does not sort the data. If you do use reducers, the data starts getting sorted from the map phase and then gets merge-sorted in the reduce phase.
Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and an iterable of values. Is there a way I could persist this data?
Again, shuffling and sorting are parts of the reduce phase. An identity Reducer will do what you want. If you want to output one key-value pair per reducer, with the values being a concatenation of the iterables, just store the iterables in memory (e.g. in a StringBuffer) and then output this concatenation as a value; a sketch of this follows below. If you want the map output to go straight to the program's output, without going through a reduce phase, then set the number of reduce tasks to zero in the driver class, like this:
job.setNumReduceTasks(0);
This will not get your output sorted, though. It will skip the pre-sorting process of the mapper and write the map output directly to HDFS.
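For the concatenation suggestion above, a minimal sketch (ConcatReducer is an illustrative name; Text values are assumed):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ConcatReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // join all values for this key into one comma-separated string
        StringBuilder joined = new StringBuilder();
        for (Text v : values) {
            if (joined.length() > 0) {
                joined.append(',');
            }
            joined.append(v.toString());
        }
        context.write(key, new Text(joined.toString()));
    }
}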
Point 1: the output from the mapper is always sorted, but based on the key.
I.e., if the map method does this: context.write(outKey, outValue); then the result will be sorted based on outKey.
Following are some explanations for your questions.
Is the mapper's output always sorted?
Already answered by @SurJanSR.
Is the sort phase integrated into the mapper phase already, so that the output of the map phase is already sorted in the intermediate data?
In a MapReduce job, as you know, the Mapper runs on individual splits of data, across the nodes where the data resides. The result of the Mapper is written TEMPORARILY before it is passed on to the next phase.
In the case of a reduce operation, the TEMPORARILY stored Mapper output is sorted and shuffled, based on what the partitioner needs, before being moved to the reduce operation.
In the case of a map-only job, as in your case, the temporarily stored Mapper output is sorted based on the key and written to the final output folder (as specified in your arguments for the job).
Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and an iterable of values. Is there a way I could persist this data?
Not sure what your requirement is. Using an IdentityReducer would just persist the output. I'm not sure if this answers your question.
I support the answer of vefthym.
Usually the Mapper output is sorted before being stored locally on the node. But when you explicitly set numReduceTasks to 0 in the job configuration, the mapper output will not be sorted and will be written directly to HDFS.
So we cannot say that the Mapper output is always sorted!
1. Is the mapper's output always sorted?
2. Is the sort phase integrated into the mapper phase already, so that the output of the map phase is already sorted in the intermediate data?
From the Apache MapReduce Tutorial:
(Under the Mapper section)
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.
The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job.
(Under the Reducer section)
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
3. Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and an iterable of values. Is there a way I could persist this data?
I don't think so. From the Apache documentation on Reducer:
Reducer has 3 primary phases:
Shuffle:
The Reducer copies the sorted output from each Mapper using HTTP across the network.
Sort:
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).
The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
Reduce:
The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
As per the documentation, the shuffle and sort phase is driven by the framework.
If you want to persist the data, set the number of reducers to zero, which causes the map output to be persisted to HDFS, but it won't sort the data.
Have a look at related SE question:
hadoop: difference between 0 reducer and identity reducer?
I did not find IdentityReducer in Hadoop 2.x version:
identityreducer in the new Hadoop API
I have a confusion about the implementation of Hadoop.
I notice that when I run my Hadoop MapReduce job with multiple mappers and reducers, I get many part-xxxxx files. Meanwhile, it is true that a key only appears in one of them.
Thus, I am wondering how MapReduce works such that a key goes to only one output file?
Thanks in advance.
The shuffle step in the MapReduce process is responsible for ensuring that all records with the same key end up in the same reduce task. See this Yahoo tutorial for a description of the MapReduce data flow. The section called Partition & Shuffle states that
Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper is its origin.
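This determinism comes from the partition being a pure function of the key. The sketch below mirrors the logic of Hadoop's default HashPartitioner: the same key always hashes to the same partition, hence to the same reduce task and the same part-xxxxx file.

import org.apache.hadoop.mapreduce.Partitioner;

public class LikeHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // same key => same hash => same reducer => same part-xxxxx output file
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}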
Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
Sort
The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.
I got this from here:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Have a look at it; I hope this will be helpful.