Hadoop handling data skew in reducer - hadoop

Am trying to determine if there are certain hooks available in the hadoop api (hadoop 2.0.0 mrv1) to handle data skew for a reducer.
Scenario : Have a custom Composite key and partitioner in place to route data to reducers. In order to deal with the odd case but very likely case of a million keys and large values ending up on the same reducer need some sort of heuristic so that this data can be further partitioned to spawn off new reducers.
Am thinking of a two step process
set mapred.max.reduce.failures.percent to say 10% and let the job
complete
rerun the job on the failed data set by passing a
configuration thru the driver which will cause my partitioner to
then randomly partition the skewed data. The partitioner will
implement the Configurable interface.
Is there a better way/another way ?
Possible counter-solution may be to write output of mappers and spin off another map job doing the work of the reducer, but do not want to pressurize the namenode.

This idea comes to my mind, I am not sure how good it is.
Lets say you are running the Job with 10 mappers currently, which is failing because of the data skewness. The idea is, you set the number of reducer to 15 and also define what the max number of (key,value) should go to one reducer from each mapper. You keep that information in a hash map in your custom partitioner class. Once a particular reducer reaches the limit, you start sending the next set of (key,value) pairs to another reducer from the extra 5 reducer which we have kept for handling the skewness.

If you process allow it, The use of a Combiner (reduce-type function) could help you. If you pre-aggregate the data in the Mapper side . Then, even all your data end in the same reducer the amount of data could be manageable.
An alternative could be reimplement the partitioner to avoid the skew case.

Related

When Partitioner runs in Map Reduce?

As per my understanding, mapper runs first followed by partitioner(if any) followed by Reducer. But if we use Partitioner class, I am not sure when Sorting and Shuffling phase runs?
A CLOSER LOOK
Below diagram explain the complete details.
From this diagram, you can see where the mapper and reducer components of the Word Count application fit in, and how it achieves its objective. We will now examine this system in a bit closer detail.
mapreduce-flow
Shuffle and Sort phase will always execute(across the mappers and reducers nodes).
The hierarchy of the different phase in MapReduce as below:
Map --> Partition --> Combiner(optional) --> Shuffle and Sort --> Reduce.
The short answer is: Data sorting runs on the reducers, shuffling/sorting runs before the reducer (always) and after the map/combiner(if any)/partitioner(if any).
The long answer is that into a MapReduce job there are 4 main players:
Mapper, Combiner, Partitioner, Reducer. All these are classes you can actually implement by your own.
Let's take the famous word count program, and let's assume the split where we are working contains:
pippo, pluto, pippo, pippo, paperino, pluto, paperone, paperino, paperino
and each word is record.
Mapper
Each mapper runs over a subset of your file, its task is to read each record from the split and assign a key to each record which will output.
Mapper will store intermediate result on disk ( local disk ).
The intermediate output from this stage will be
pippo,1
pluto,1
pippo,1
pippo,1
peperino,1
pluto,1
paperone,1
paperino,1
paperino,1
At this will be stored on the local disk of the node which runs the mapper.
Combiner
It's a mini-reducer and can aggregate data. It can also run joins, so called map-join. This object helps to save bandwidth into the cluster because it aggregates data on the local node.
The output from the combiner, which is still part of the mapper phase, will be:
pippo,3
pluto,2
paperino,3
paperone,1
Of course here are the data from ONE node. Now we have to send the data to the reducers in order to get the global result. Which reducer will process the record depends on the partitioner.
Partitioner
It's task is to spread the data across all the available reducers. This object will read the output from the combiner and will select the reducer which will process the key.
In this example we have two reducers and we use the following rule:
all the pippo goes to reducer 1
all the pluto goes to reducer 2
all the paperino goes to reducer 2
all the paperone goes to reducer 1
so all the nodes will send records which have the key pippo to the same reducer(1), all the nodes will send the records which have the key pluto to the same reducer (2) and so on...
Here is where the data get shuffled/sorted and, since the combiner already reduced the data locally, this node has to send only 4 record instead of 9.
Reducer
This object is able to aggregate the data from each node and it's also able to sort the data.

Split input to a reducer in hadoop

This question is kind of related to my other question Hadoop handling data skew in reducer.
However, I would like to ask if there are some configuration settings available so that if say the max reducer memory is reached then spawn off a new reducer on another datanode with the remaining data in context ?
Or maybe even on the same datanode so that say some x records off the context are read in the reduce method upto some limit and then the remaining are read off in a new reducer ?
You could try out a combiner that would reduce the work load of a single reducer handling more key,value pairs by doing a possible aggregation before it goes through to the reducer. If you are doing a join then you could try out skewed join in Pig. It involves 2 MR jobs.In first MR it does a sampling on one input and if it finds a key which is skewed so much so that it is able to fit into memory, it splits that key into more than one reducers. For the other records than the one identified in the sample it does a default join. For the skewed input it duplicates the input and sends it to both reducers.
It is not possible to spawn a new auxiliary reducer to balance the load on the job run.
Rather you could thing of picking another key element from your records which will help in shuffling the data even across the reducers.
Else as a option, you could expand the existing reducer's memory settings to accommodate more shuffled records and to get the sorting/merging done quicker. Please refer the below properties,
mapreduce.reduce.memory.mb
mapreduce.reduce.java.opts
mapreduce.reduce.merge.inmem.threshold
mapreduce.reduce.shuffle.input.buffer.percent
mapreduce.reduce.shuffle.merge.percent
mapreduce.reduce.input.buffer.percent
I could remember, there was a extended mapreduce library, skewtune, written to load balance the data skew during the course of job run. But I never experimented this, kindly check if it is helpful.
That is not possible. The number of reducers is fixed in the Driver configuration.

Hadoop-2.4.1 custom partitioner to balance reducers

As we know, that during the shuffle phase of hadoop, each of the reducer read data from all the mapper's output (intermedia data).
Now, we also know that by default Hash-Partitioning is used for reducers.
My question is: How do we implement an algorithm, e.g. Locality-aware?
In short, you should not do it.
First, you have no control over where the mappers and reducers are executed on the cluster, so even when the complete output of a single mapper will go to a single reducer there is a huge probability that they would be on different hosts and the data would be transferred through the network
Second, to make the reducer process the whole output of the mapper, you first have to make mapper process the right part of the information, which means that you have to preprocess data by partitioning it and then run a single mapper and a single reducer for each partition, but this preprocessing itself would take much resources so it is mostly meaningless
And finally, why do you need it? The main concept of map-reduce is manipulation with key-value pairs, and reducer in general should aggregate list of values outputted by the mappers for the same keys. Here's why hash partitioning is used: distribute N keys between K reducers. Using different type of partitioner is a really seldom case. If you need data locality you might prefer to work with MPP database rather than Hadoop, for example.
If you really need a custom partitioner, here's an example of how it can be implemented: http://hadooptutorial.wikispaces.com/Custom+partitioner. Nothing special, just return reducer number based on the key and value passed and the number of reducers. Using hash code of the host name divided (%) by the number of reducers will make the whole output of a single mapper go to a single reducer. Also you might use process PID % number of reducers. But before doing this you have to check, whether you really need this behavior or not.

Mapper and Reducer in Hadoop

I have a confusion about the implementation of Hadoop.
I notice that when I run my Hadoop MapReduce job with multiple mappers and reducers, I would get many part-xxxxx files. Meanwhile, it is true that a key only appears in one of them.
Thus, I am wondering how MapReduce works such that a key only goes to one output file?
Thanks in advance.
The shuffle step in the MapReduce process is responsible for ensuring that all records with the same key end up in the same reduce task. See this Yahoo tutorial for a description of the MapReduce data flow. The section called Partition & Shuffle states that
Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper is its origin.
Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
Sort
The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.
I got this from here
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Have a look on it i hope this will helpful

Same-machine-as-data processing on reduce side of map reduce

One of the big benefits of Hadoop MapReduce is the fact that Map processes take place on the same machine that the data they operate upon resides (to the extent possible). But can this be or is this perhaps already true of the Reduce side? For example, in the extreme case of a Map-only job, all of the output data ends up on the same machine as the corresponding input data (right?). But in an intermediate case in which the output is somewhat correlated with the output, it seems reasonable to partition the output and to the extent possible keep it on same machine at it started on.
Is this possible? Does this already happen?
Inputs to the Reducers can reside on any node(local or remote) and not necessarily on the same machine where they are running. As Mappers complete their output gets written onto the local FS of the machine where they are running. Once this is done the intermediate output is needed by the machines that are about to run the reduce task. One thing to note here is that all the values corresponding to a particular key go the same reducer. So, it's not always possible that the input to Reducers is local, since different sets of key/value pairs are processed by different Mappers running on different machines.
Now, before the Mapper output is sent to Reducers for further processing, the data is partitioned based on keys and each partition goes to a Reducer and all the key/value pairs in that partition get processed by that Reducer. During the process a lot of data shuffling takes place. So it's not possible to maintain the data locality in case of Reducers.
Hope this answers the question.
If you know that the data for a particular reducer is already on the right node after the map phase, and the algorithm allows for it (see this blog post about it) you should insert your reducer as a combiner. Combiners are like miniature reducers that only get to see co-located data. Often you can dramatically improve performance because the combiner output can be orders of magnitude smaller than the map output, so what's left to shuffle is trivial.
Of course, if indeed the map phase leaves your data already correctly partitioned, why use a reducer at all? Why not create a second map job that simulates a reducer?

Resources