Does the Hadoop reducer process keys sequentially or in parallel on each node?

Just got started with Hadoop, and I have several questions about the execution of the reducer.
When key-value pairs are distributed to one reducer task, are they processed sequentially or in parallel?
For example, given (A,5) (A,3) (B,10) for the reducer task, do A and B get into the reducer in parallel?

When one reducer is used, the KV pairs are not processed in parallel; they are processed in sorted key order. In your example above, the pairs will be sent from one or more mapper tasks (in parallel if there are multiple mappers) to the single reduce task. Before these values are passed to your reducer class, they are grouped by key ((A,5) and (A,3) are turned into (A,{5,3})) and sorted; only then does the reduce task run your user code to 'reduce' each input set.
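To make this concrete, here is a minimal sketch of such a reducer, assuming the classic Text/IntWritable word-count-style types (the class name is illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// reduce() is called once per key, in sorted key order:
// first with (A, {5, 3}), then with (B, {10}).
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();  // (A,5) and (A,3) arrive together in one Iterable
        }
        context.write(key, new IntWritable(sum));  // emits (A,8), then (B,10)
    }
}
```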

Related

Output of using 2 reducers in a word count program

Let's say the key-value pairs with the keys “the”, “sound”, “is” are processed by reducer 1 and the key-value pairs with the keys “it”, “right”, “sounds” are processed by reducer 2.
What would be the outputs of the two reducers?
Would the output file of each reducer be sorted then combined then sorted again?
When the reducers receive them, are the keys already sorted alphabetically, so that reducer 1 receives “is”, “it”, “right” and reducer 2 receives “the”, “sound”, “sounds”?
To answer your queries:
The output of each reducer is the word and the count of its occurrences.
The outputs of reducers working on different keys are never combined. There is no such phase in MapReduce.
The output of the mappers is sorted and fed into the reducers, but each reducer emits its output independently, and the combined output of all the reducers is not sorted again. There is no such phase in MapReduce.
Even though the reducers receive their keys in sorted order, think of each reducer as running in a separate JVM and a separate process. Each one writes its output without "knowing" that other reducers are running.
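As for how keys are assigned to reducers in the first place: by default it is done by hashing the key, not by alphabetical ranges. A minimal sketch of the logic inside Hadoop's default HashPartitioner:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the default HashPartitioner logic: keys are routed to reducers
// by hash code, so which words land on reducer 1 versus reducer 2 is a
// property of their hashes, not of alphabetical order.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // mask off the sign bit so the modulo result is non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```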

When Partitioner runs in Map Reduce?

As per my understanding, the mapper runs first, followed by the partitioner (if any), followed by the reducer. But if we use a Partitioner class, when do the sorting and shuffling phases run?
A CLOSER LOOK
The diagram below explains the complete flow. From this diagram, you can see where the mapper and reducer components of the Word Count application fit in, and how the job achieves its objective. We will now examine this system in a bit more detail.
[figure: mapreduce-flow]
The Shuffle and Sort phase will always execute (across the mapper and reducer nodes).
The hierarchy of the different phases in MapReduce is as follows:
Map --> Partition --> Combiner (optional) --> Shuffle and Sort --> Reduce.
The short answer is: data sorting runs on the reducers; shuffling/sorting always runs before the reducer and after the map, the combiner (if any), and the partitioner (if any).
The long answer is that in a MapReduce job there are 4 main players:
Mapper, Combiner, Partitioner, Reducer. All of these are classes you can implement yourself.
Let's take the famous word count program, and let's assume the split we are working on contains:
pippo, pluto, pippo, pippo, paperino, pluto, paperone, paperino, paperino
where each word is a record.
Mapper
Each mapper runs over a subset of your file; its task is to read each record from the split, assign a key to it, and emit the resulting key-value pair.
The mapper stores its intermediate results on disk (the local disk of its node).
The intermediate output from this stage will be
pippo,1
pluto,1
pippo,1
pippo,1
paperino,1
pluto,1
paperone,1
paperino,1
paperino,1
This intermediate output will be stored on the local disk of the node that runs the mapper.
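For reference, a minimal mapper along these lines might look as follows (assuming, as stated above, that each input record is a single word; the class name is illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for each record in the split; the framework spills these
// intermediate pairs to the mapper node's local disk.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        word.set(record.toString().trim());  // each record is a single word here
        context.write(word, ONE);            // e.g. (pippo, 1)
    }
}
```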
Combiner
It's a mini-reducer that can aggregate data. It can also perform joins (so-called map-joins). This object helps save bandwidth in the cluster because it aggregates data on the local node.
The output of the combiner, which is still part of the mapper phase, will be:
pippo,3
pluto,2
paperino,3
paperone,1
Of course, this is the data from ONE node. Now we have to send the data to the reducers in order to get the global result. Which reducer will process each record depends on the partitioner.
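In Hadoop, a combiner is simply a Reducer subclass registered with job.setCombinerClass(...). A minimal sketch of the combiner used in this walkthrough (the class name is illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Runs on the mapper node, collapsing local duplicates: three (pippo,1)
// records become one (pippo,3) before anything crosses the network.
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```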
Partitioner
Its task is to spread the data across all the available reducers. This object reads the output of the combiner and selects the reducer that will process each key.
In this example we have two reducers and we use the following rule:
all the pippo goes to reducer 1
all the pluto goes to reducer 2
all the paperino goes to reducer 2
all the paperone goes to reducer 1
so all the nodes will send records whose key is pippo to the same reducer (1), all the nodes will send records whose key is pluto to the same reducer (2), and so on...
This is where the data gets shuffled/sorted, and, since the combiner has already reduced the data locally, this node only has to send 4 records instead of 9.
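A hypothetical partitioner implementing exactly the rule above (note that Hadoop partition indices are zero-based, so "reducer 1" is partition 0 here; the class name is illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes pippo and paperone to reducer 1 (partition 0), everything else
// (pluto, paperino) to reducer 2 (partition 1).
public class DuckPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String word = key.toString();
        if (word.equals("pippo") || word.equals("paperone")) {
            return 0;  // reducer 1
        }
        return 1;      // reducer 2
    }
}
```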
Reducer
This object aggregates the data coming from each node; the framework delivers its input already grouped and sorted by key, so the reducer can produce the final result for every key assigned to it.
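To see how the four players are wired together, here is a hypothetical driver using the illustrative classes sketched above (WordCountMapper, WordCountCombiner, DuckPartitioner, SumReducer):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver wiring the four players together for this example.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountCombiner.class);   // optional mini-reducer
        job.setPartitionerClass(DuckPartitioner.class);  // routes keys to reducers
        job.setReducerClass(SumReducer.class);
        job.setNumReduceTasks(2);                        // reducers 1 and 2 above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```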

n-Records to reducer after Shuffle and Sort

I would like to pass only the first 10 records of the output after shuffle/sort to the reducer. Is this possible?
The reason is this: I need to find the least 10 items with the largest count in a file. However, I know that the results of the mapping phase arrive at the reducer already sorted. Hence, instead of sorting in the mappers, I'd like to pass just the first 10 lines after 'shuffle and sort' to the reducer. This would allow the reducer to sort only a subset of the original records.
Is there any way to do this?
You can achieve this by writing a custom Combiner for the job.
The different stages in the MapReduce job are:
Mapper -> Partitioner -> Sorting -> Combiner -> Reducer.
Now the Combiner logic reads only the first 10 (n) records and discards all the others. The Reducer will then receive only 10 records from each Mapper/Combiner.
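A hypothetical sketch of such a combiner, assuming word-count-style Text/IntWritable pairs (note that Hadoop is free to run a combiner zero or more times, so this cap is best-effort rather than guaranteed):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical combiner that forwards at most N key groups per combiner task
// and discards the rest.
public class TopNCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private static final int N = 10;
    private int emitted = 0;  // key groups passed through by this task so far

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        if (emitted >= N) {
            return;  // discard everything after the first N sorted keys
        }
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
        emitted++;
    }
}
```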
Comment provided by #K246:
From Hadoop: The Definitive Guide (4th ed.): Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.
When you say the least 10 in the file... is it for each mapper, or for the entire input?
If it's for each mapper, then you have to aggregate them again at the reducer from all the mappers. Then, as #YoungHobbit pointed out, a Combiner will do the work.
If you need the least 10 from the entire input file, then I think you need to handle it with a single reducer and output accordingly.
Also, you said in the last line that the reducer will sort only a subset. Do you mean you are sorting again in the Reducer, or that some logic is performed in the reducer for only a subset of the input?

Hadoop Reducer execution re-occurrence

So the Mapper gets executed only once on a given slave node containing a given data block, correct?
But the Reducer may execute multiple times because the same key may originate from many Mapper nodes, correct?
Also is it correct that the Shuffle and Sort will occur on each Mapper for a single MapReduce job?
Generally, I don't think it's proper to ask how many times the Mapper/Reducer are executed, because they are widely distributed across different nodes and scheduled by the JobTracker in MRv1 or the ResourceManager in MRv2. But hopefully my answers below will help you get a better understanding.
Q: "So the Mapper gets executed only once on a given slave node containing a given data block, correct?"
A: Correct in most cases. Normally, Hadoop launches one mapper for each input split (which by default is the same size as a data block), but it will start a new one if a mapper fails.
Q: "But the Reducer may execute multiple times because the same key may originate from many Mapper nodes, correct?"
A: Not correct. The Shuffle and Sort process merges all the mappers' outputs into a single sorted input and feeds it to the reducer. The number of reducers is defined by the user.
Q: "Also is it correct that the Shuffle and Sort will occur on each Mapper for a single MapReduce job?"
A: Inaccurate. The Shuffle phase is the process Hadoop performs to sort the mappers' outputs and transfer them to the reducers as input. Once all the mappers' outputs have been copied, the Sort phase merges them while maintaining their sort order. So, technically, the first part of Shuffle and Sort happens for each mapper.
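To make the second point concrete: the number of reduce tasks is fixed by the user when the job is configured; it does not grow because a key appears in many mappers' outputs. A minimal sketch (class name illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// The reducer count is a user-set job parameter. All occurrences of a key,
// no matter how many mappers emitted them, are merged into one sorted run
// and handed to exactly one of these tasks.
public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer count example");
        job.setNumReduceTasks(3);  // exactly three reduce tasks for this job
        System.out.println("Reducers: " + job.getNumReduceTasks());
    }
}
```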
Thanks

Mapper and Reducer in Hadoop

I have a confusion about the implementation of Hadoop.
I notice that when I run my Hadoop MapReduce job with multiple mappers and reducers, I get many part-xxxxx files. Meanwhile, it is true that a key only appears in one of them.
Thus, I am wondering how MapReduce works such that a key only goes to one output file?
Thanks in advance.
The shuffle step in the MapReduce process is responsible for ensuring that all records with the same key end up in the same reduce task. See this Yahoo tutorial for a description of the MapReduce data flow. The section called Partition & Shuffle states that
Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper is its origin.
Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
Sort
The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.
I got this from the Hadoop MapReduce tutorial:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Have a look at it; I hope this will be helpful.
