Is the variety of keys a factor in performance?

Working with Hadoop and the MapReduce framework, I was thinking that the reduce tasks must be fine-grained so that the different nodes that process them can work independently.
I think the number of keys influences the granularity of the tasks. So, is the number of keys, or their variety, a significant factor in efficiency?
For example, if I had only one or two keys, would that be a problem?

All records with the same key end up in the same reducer, so if you have only one key you will really use only one reducer, no matter whether you have configured 10 reducers. The remaining reducers won't produce any output (but they will still be instantiated).
This is a big issue known as "data skew", and you need to redefine (and redistribute) your keys to be able to run the process in parallel.
Ideally the data should be distributed in sets with roughly the same number of records, so that all reducers have the same workload.
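To make the effect concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that counts how many records each of 10 reducers would receive. The class and method names (SkewDemo, partitionFor, recordsPerReducer, busyReducers) and the key strings are made up for illustration; the formula in partitionFor is the one the default HashPartitioner uses.

```java
// Illustration only: mimics hash partitioning in plain Java, without Hadoop.
final class SkewDemo {

    // Same formula as the default HashPartitioner: clear the sign bit, then take the modulo.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Counts how many records each reducer would receive for the given keys.
    static int[] recordsPerReducer(String[] keys, int numReducers) {
        int[] load = new int[numReducers];
        for (String key : keys) {
            load[partitionFor(key, numReducers)]++;
        }
        return load;
    }

    // Number of reducers that receive at least one record.
    static int busyReducers(int[] load) {
        int busy = 0;
        for (int n : load) {
            if (n > 0) {
                busy++;
            }
        }
        return busy;
    }

    public static void main(String[] args) {
        // 1000 records that all share one (made-up) key.
        String[] singleKey = new String[1000];
        java.util.Arrays.fill(singleKey, "the-only-key");

        // 1000 records with 1000 distinct (made-up) keys.
        String[] manyKeys = new String[1000];
        for (int i = 0; i < manyKeys.length; i++) {
            manyKeys[i] = "key-" + i;
        }

        System.out.println("single key: " + java.util.Arrays.toString(recordsPerReducer(singleKey, 10)));
        System.out.println("many keys:  " + java.util.Arrays.toString(recordsPerReducer(manyKeys, 10)));
    }
}
```

With a single key, all 1000 records land in one slot of the array and the other nine stay at zero, whatever the reducer count is.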

Related

In Hadoop, is it possible to balance load between reducers based on the number of records rather than on the number of keys?

If I understand correctly, Hadoop tries to achieve load balancing between reducers by hashing keys to reducers. This may work well if approximately the same number of values is associated with each key. When some keys have many more values than others, how (if at all) is it possible to balance load between reducers based on the number of records rather than on the number of keys?

What is the benefit of the reducers in Hadoop?

I don't see the value of reducers in Hadoop in the following scenarios:
The map tasks generate unique keys (because then we can merge the map and reduce functionality together)
The output of the map tasks is too big (this will exhaust the memory if we wait for the reducers to begin their work)
We have functionality that doesn't need grouping and sorting of the keys
Please correct me if I am wrong.
And if someone could give me a real example of the benefits of the reducers and when it should be used, I will appreciate it.
A reducer is beneficial (or required) when you need operations such as aggregation or grouping.
FYI: a reducer is meant for grouping the different values of a key that come from different mappers. So for a use case that does not require grouping or aggregation, there is no point in using a reducer (you can set the number of reducers to zero, which gives a map-only job).
One quick use case I can think of: you want to randomly split a big file into multiple part files. In this case you supply the big file (let's say 100 GB) to a map-only job. Each map reads a chunk of the file and writes it out as a part file.

Hadoop-2.4.1 custom partitioner to balance reducers

As we know, during the shuffle phase of Hadoop, each reducer reads data from the output of all the mappers (intermediate data).
We also know that hash partitioning is used by default to assign keys to reducers.
My question is: how do we implement a different algorithm, e.g. a locality-aware one?
In short, you should not do it.
First, you have no control over where the mappers and reducers are executed on the cluster, so even if the complete output of a single mapper goes to a single reducer, there is a high probability that they will be on different hosts and the data will be transferred over the network.
Second, to make the reducer process the whole output of one mapper, you first have to make the mapper process the right part of the input. That means you have to preprocess the data by partitioning it and then run a single mapper and a single reducer for each partition, but this preprocessing itself would take a lot of resources, so it is mostly pointless.
And finally, why do you need it? The main concept of MapReduce is the manipulation of key-value pairs, and a reducer in general should aggregate the list of values output by the mappers for the same key. That is why hash partitioning is used: it distributes N keys among K reducers. Using a different type of partitioner is really rare. If you need data locality, you might prefer to work with an MPP database rather than Hadoop, for example.
If you really need a custom partitioner, here's an example of how it can be implemented: http://hadooptutorial.wikispaces.com/Custom+partitioner. Nothing special, just return a reducer number based on the key and value passed in and the number of reducers. Using the hash code of the host name modulo (%) the number of reducers will make the whole output of a single mapper go to a single reducer. You might also use the process PID % the number of reducers. But before doing this, you have to check whether you really need this behavior.
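As a rough sketch of the host-name idea, the core computation could look like the following plain Java. The class name, method name and host names are hypothetical; in a real job this logic would sit inside the getPartition method of a class extending Hadoop's Partitioner, and the mapper's host name would have to be obtained at runtime (for example via java.net.InetAddress.getLocalHost().getHostName()).

```java
// Illustration only: the partition arithmetic, without any Hadoop classes.
final class HostPartitionSketch {

    // Hypothetical helper: maps a mapper's host name to a reducer index.
    static int partitionForHost(String hostName, int numReduceTasks) {
        // hashCode() may be negative, so clear the sign bit before the modulo.
        return (hostName.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Made-up host names standing in for the machines that run the mappers.
        String[] hosts = {"worker-01", "worker-02", "worker-03"};
        int numReduceTasks = 4;
        for (String host : hosts) {
            System.out.println(host + " -> reducer " + partitionForHost(host, numReduceTasks));
        }
    }
}
```

Note that nothing prevents two hosts from mapping to the same reducer index, and that all values of a given key no longer meet in one reducer, so this only makes sense for jobs that do not rely on grouping by key.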

Same-machine-as-data processing on reduce side of map reduce

One of the big benefits of Hadoop MapReduce is that map processes run on the same machine on which the data they operate on resides (to the extent possible). But can this be true, or is it perhaps already true, of the reduce side? For example, in the extreme case of a map-only job, all of the output data ends up on the same machine as the corresponding input data (right?). But in an intermediate case in which the output is somewhat correlated with the input, it seems reasonable to partition the output and, to the extent possible, keep it on the same machine it started on.
Is this possible? Does this already happen?
Inputs to the reducers can reside on any node (local or remote), not necessarily on the machine where the reducers are running. As mappers complete, their output gets written to the local file system of the machine where they ran. Once this is done, the intermediate output is needed by the machines that are about to run the reduce tasks. One thing to note here is that all the values corresponding to a particular key go to the same reducer. So the input to a reducer cannot always be local, since different sets of key/value pairs are processed by different mappers running on different machines.
Before the mapper output is sent to the reducers for further processing, the data is partitioned based on keys; each partition goes to one reducer, and all the key/value pairs in that partition get processed by that reducer. A lot of data shuffling takes place during this process. So in general it is not possible to maintain data locality for reducers.
Hope this answers the question.
If you know that the data for a particular reducer is already on the right node after the map phase, and the algorithm allows for it (see this blog post about it) you should insert your reducer as a combiner. Combiners are like miniature reducers that only get to see co-located data. Often you can dramatically improve performance because the combiner output can be orders of magnitude smaller than the map output, so what's left to shuffle is trivial.
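A minimal sketch of what a combiner buys you, assuming a word-count style job where the reduce operation is a sum (sums can be applied to partial results, which is what "the algorithm allows for it" means here). This is plain Java, not the Hadoop combiner API; the class name, method name and sample words are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: local pre-aggregation of one mapper's output before the shuffle.
final class CombinerSketch {

    // Sums the counts of co-located map output, as a word-count combiner would.
    // Each element of mapOutputKeys stands for one (word, 1) record.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> partialSums = new LinkedHashMap<>();
        for (String word : mapOutputKeys) {
            partialSums.merge(word, 1, Integer::sum);
        }
        return partialSums;
    }

    public static void main(String[] args) {
        // Made-up output of a single map task.
        List<String> mapOutput = Arrays.asList("a", "b", "a", "a", "b", "c");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println(mapOutput.size() + " records before combining, "
                + combined.size() + " after: " + combined);
    }
}
```

Only one record per distinct word on that node has to be shuffled, instead of one record per occurrence.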
Of course, if indeed the map phase leaves your data already correctly partitioned, why use a reducer at all? Why not create a second map job that simulates a reducer?

How does partitioning in MapReduce exactly work?

I think I have a fair understanding of the MapReduce programming model in general, but even after reading the original paper and some other sources many details are unclear to me, especially regarding the partitioning of the intermediate results.
I will quickly summarize my understanding of MapReduce so far: We have a potentially very large input data set, which is automatically split up into M different pieces by the MR-Framework. For each piece, the framework schedules one map task which is executed by one of the available processors/machines in my cluster. Each of the M map tasks outputs a set of Key-Value-Pairs, which is stored locally on the same machine that executed this map task. Each machine divides its disk into R partitions and distributes its computed intermediate key value pairs based on the intermediate keys among the partitions. Then, the framework starts for each distinct intermediate key one reduce task which is again executed by any of the available machines.
Now my questions are:
1. In some tutorials it sounds like there could be map and reduce tasks executed in parallel. Is this right? How could that be, assuming that for each distinct intermediate key only one reduce task is started? Do we not have to wait until the last map task is finished before we can start the first reduce task?
2. As we have one reduce task per distinct intermediate key, is it right that each reduce task requires the executing machine to load the corresponding partition from every other machine? Potentially, every machine can have a key-value pair with the desired intermediate key, so for each reduce task we potentially have to query all other machines. Is that really efficient?
3. The original paper says that the number of partitions (R) is specified by the user. But isn't a partition the input for a reduce task? Or more exactly: isn't the union of all partitions with the same number among all machines the input of one reduce task? That would mean that R depends on the number of distinct intermediate keys, which the user usually doesn't know.
Conceptually it is clear what the input and outputs of the map and reduce functions/tasks are. But I think I haven’t yet understood MapReduce on the technical level. Could somebody please help me understanding?
1. You can start the reduce tasks while the map tasks are still running (using a feature known as slowstart), but the reducers can only run the copy phase (acquiring the completed results from the completed map tasks). They need to wait for all the mappers to complete before they can actually perform the final sort and reduce.
2. A reduce task actually processes zero, one or more keys (rather than there being a discrete task for each key). Each reducer needs to acquire, from every map task, the map output that relates to its partition; these intermediate outputs are then sorted and reduced one key set at a time.
3. Back to the note in 2: a reduce task (one for each partition) runs on zero, one or more keys, rather than there being a single task for each discrete key.
It's also important to understand the spread and variation of your intermediate keys, because each key is hashed and then taken modulo the number of reducers (if you use the default HashPartitioner) to determine which reduce partition should process it. Say you had an even number of reduce tasks (10) and output keys that always hashed to an even number: the modulo of these hash values and 10 will then always be an even number, meaning that the odd-numbered reducers would never process any data.
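The even-hash case can be checked with a few lines of plain Java. This is only a sketch: the class and method names are made up, and the key hashes are simulated as 0, 2, 4, ... rather than taken from real keys.

```java
// Illustration only: which reducers receive data when every key hash is even.
final class EvenHashDemo {

    // Hash-then-modulo, as in the default HashPartitioner.
    static int partitionFor(int keyHash, int numReducers) {
        return (keyHash & Integer.MAX_VALUE) % numReducers;
    }

    // Marks the reducers that receive at least one key when all hashes are even.
    static boolean[] reducersHit(int numKeys, int numReducers) {
        boolean[] hit = new boolean[numReducers];
        for (int i = 0; i < numKeys; i++) {
            int evenHash = 2 * i; // simulated key hash: 0, 2, 4, ...
            hit[partitionFor(evenHash, numReducers)] = true;
        }
        return hit;
    }

    public static void main(String[] args) {
        System.out.println("10 reducers: " + java.util.Arrays.toString(reducersHit(100, 10)));
        System.out.println(" 7 reducers: " + java.util.Arrays.toString(reducersHit(100, 7)));
    }
}
```

With 10 reducers only partitions 0, 2, 4, 6 and 8 are ever hit. With 7 reducers every partition is hit, because 2 and 7 share no common factor.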
Addendum to what Chris said,
Basically, a partitioner class in Hadoop (e.g. the default HashPartitioner) has to implement this function:
int getPartition(K key, V value, int numReduceTasks)
This function is responsible for returning the partition number. The numReduceTasks argument gives you the number of reducers you fixed when starting the job, as can be seen in HashPartitioner.
Based on the integer that this function returns, Hadoop decides which reduce task (partition) receives a particular key. It does not directly choose the node on which that reduce task runs.
Hope this helps.
