Map phase implementation in map reduce framework - hadoop

I have searched a lot and learned that in every map task, when the content of the in-memory buffer reaches a threshold, a thread partitions the data according to the number of reducers. What is the role of the number of reducers here? Why does partitioning happen in the map task, and how does it help the map phase? After sorting, the thread spills the content to disk.
How does that happen? I can't understand what "spilling" means here.
Thanks.

The map task needs to partition its output because each reducer polls and pulls, from every mapper, only the data that is relevant to that reducer.
If you imagine it the other way around, with each reducer pulling all the output from each map, then you'd be sending every mapper's entire output to every reducer, which is hugely inefficient.
So by partitioning in the mapper, each reducer is able to query and pull back only the data it needs from each mapper.
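To make the role of the number of reducers concrete, here is a minimal sketch in the spirit of Hadoop's default HashPartitioner: the map side assigns every output record to one of numReduceTasks buckets, so each reducer can later fetch just its own bucket from every mapper. The class name and key/value types are just an example.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Roughly what Hadoop's default HashPartitioner does: bucket each map output record
    // by key hash modulo the number of reduce tasks.
    public class SimpleHashPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }

As for spilling: when the in-memory map output buffer reaches its threshold, the buffered records are sorted by partition (and by key within each partition) and written as a "spill" file to local disk; when the map task finishes, its spill files are merged into a single partitioned, sorted file that the reducers then fetch.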

Related

how hadoop reduce tasks deal with map grouped data

The reduce method deals with grouped data from the map. But I wonder how the reduce tasks take the grouped data. If the maps output many groups of data, does each reduce task just read the same number of groups? What is the mechanism?
how do reduce tasks take the grouped data?
It is handled in the Shuffle and Sort phase.
During this phase, the data sent by the mappers is grouped by key (like a GROUP BY key), so the framework ends up with a key, List<values> result. That result is sent to the reducers. If results need to be sent to different reducers, that is taken care of by the partition phase, which is a different phase from Shuffle and Sort.
This phase is done by the Hadoop framework, and as far as I know you have nothing to do or change about it.
I also suggest taking a look at this question: What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?
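To picture that key, List<values> result from the reducer's side, here is a minimal sketch (the class name and types are just an illustration): the framework calls reduce() once per key, with all of that key's values from every mapper grouped into a single Iterable.

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Called once per key after Shuffle and Sort; `values` holds every value
    // any mapper emitted for that key.
    public class GroupDumpReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        StringBuilder grouped = new StringBuilder();
        for (Text v : values) {
          grouped.append(v.toString()).append(',');
        }
        // Emits e.g. (someKey, "v1,v2,v3,") just to make the grouping visible.
        context.write(key, new Text(grouped.toString()));
      }
    }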

Hadoop handling data skew in reducer

I am trying to determine whether there are hooks available in the Hadoop API (Hadoop 2.0.0, MRv1) to handle data skew for a reducer.
Scenario: I have a custom composite key and partitioner in place to route data to the reducers. To deal with the odd, but very likely, case of a million keys and large values ending up on the same reducer, I need some sort of heuristic so that this data can be further partitioned to spawn off new reducers.
I am thinking of a two-step process:
1. Set mapred.max.reduce.failures.percent to, say, 10% and let the job complete.
2. Rerun the job on the failed data set, passing a configuration through the driver which will cause my partitioner to randomly partition the skewed data. The partitioner will implement the Configurable interface.
Is there a better way / another way?
A possible counter-solution may be to write out the mappers' output and spin off another map job doing the work of the reducer, but I do not want to put extra pressure on the NameNode.
This idea comes to mind; I am not sure how good it is.
Let's say you are currently running the job with 10 reducers and it is failing because of the data skew. The idea is to set the number of reducers to 15 and also define the maximum number of (key, value) pairs that should go to one reducer from each mapper. You keep that count in a hash map in your custom partitioner class. Once a particular reducer reaches the limit, you start sending the next set of (key, value) pairs to one of the extra 5 reducers which we have kept for handling the skew.
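A rough sketch of that idea follows, using the org.apache.hadoop.mapreduce API. The class name, the property names, the per-mapper limit and the 10-primary / 5-extra split are all illustrative assumptions, not a tested recipe; note also that once a key's records are split across reducers, their outputs are only partial aggregates and need a follow-up pass.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Per-mapper partitioner: counts how many records this map task has routed to each
    // "primary" reducer and, once a (hypothetical) limit is exceeded, spills further
    // records onto the extra reducers kept aside for skew handling.
    public class OverflowPartitioner extends Partitioner<Text, IntWritable> implements Configurable {
      private Configuration conf;
      private int primaryReducers;    // e.g. the 10 "normal" reducers
      private long perReducerLimit;   // max records this mapper sends to one primary reducer
      private final Map<Integer, Long> sent = new HashMap<>();

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
        // Hypothetical job-specific settings passed in through the driver.
        this.primaryReducers = conf.getInt("skew.primary.reducers", 10);
        this.perReducerLimit = conf.getLong("skew.per.reducer.limit", 1_000_000L);
      }

      @Override
      public Configuration getConf() {
        return conf;
      }

      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        int primary = (key.hashCode() & Integer.MAX_VALUE) % primaryReducers;
        long count = sent.merge(primary, 1L, Long::sum);
        if (count <= perReducerLimit || numPartitions <= primaryReducers) {
          return primary;
        }
        // Overflow: spread the surplus across the extra reducers
        // (partitions primaryReducers .. numPartitions - 1).
        int extras = numPartitions - primaryReducers;
        return primaryReducers + (int) (count % extras);
      }
    }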
If your processing allows it, the use of a Combiner (a reduce-type function) could help you, since it pre-aggregates the data on the mapper side. Then, even if all your data ends up in the same reducer, the amount of data could be manageable.
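Wiring a combiner in is a one-line change in the driver, assuming your reduce logic is associative. The mapper/reducer class names below are the stock WordCount ones, used just for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombinerDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pre-aggregate");
        job.setJarByClass(CombinerDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregates map output before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }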
An alternative could be to reimplement the partitioner to avoid the skewed case.

Split input to a reducer in hadoop

This question is kind of related to my other question Hadoop handling data skew in reducer.
However, I would like to ask whether there are configuration settings available so that if, say, the maximum reducer memory is reached, a new reducer is spawned on another datanode with the remaining data in the context?
Or maybe even on the same datanode, so that, say, some x records from the context are read in the reduce method up to some limit and then the remaining ones are read in a new reducer?
You could try out a combiner, which would reduce the workload of a single reducer handling many key/value pairs by doing a partial aggregation before the data goes through to the reducer. If you are doing a join, then you could try out the skewed join in Pig. It involves 2 MR jobs. The first MR job does a sampling on one input, and if it finds a key so skewed that it cannot fit into memory, it splits that key across more than one reducer. For records other than the ones identified in the sample, it does a default join. For the skewed keys, it duplicates the matching input and sends it to both reducers.
It is not possible to spawn a new auxiliary reducer to balance the load while the job is running.
Rather, you could think of picking another key element from your records which would help shuffle the data evenly across the reducers.
As another option, you could expand the existing reducers' memory settings to accommodate more shuffled records and get the sorting/merging done more quickly. Please refer to the properties below:
mapreduce.reduce.memory.mb
mapreduce.reduce.java.opts
mapreduce.reduce.merge.inmem.threshold
mapreduce.reduce.shuffle.input.buffer.percent
mapreduce.reduce.shuffle.merge.percent
mapreduce.reduce.input.buffer.percent
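If you go this route, these properties can be set from the driver before submitting the job; the values below are illustrative only and should be tuned to your cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerMemoryDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative values: 4 GB reducer container, ~3.2 GB heap,
        // and a larger share of that heap kept for shuffled map output.
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);
        conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.50f);
        Job job = Job.getInstance(conf, "skewed-reduce");
        // ... set mapper/reducer/input/output classes as usual, then submit the job.
      }
    }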
As I recall, there was an extended MapReduce library, SkewTune, written to load-balance data skew during the course of a job run. But I have never experimented with it; kindly check whether it is helpful.
That is not possible. The number of reducers is fixed in the Driver configuration.

When do reduce tasks start in Hadoop?

In Hadoop when do reduce tasks start? Do they start after a certain percentage (threshold) of mappers complete? If so, is this threshold fixed? What kind of threshold is typically used?
The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by the reducer from each mapper. This can happen while mappers are generating data since it is only a data transfer. On the other hand, sort and reduce can only start once all the mappers are done. You can tell which one MapReduce is doing by looking at the reducer completion percentage: 0-33% means it's doing shuffle, 34-66% is sort, 67%-100% is reduce. This is why your reducers will sometimes seem "stuck" at 33%: they're waiting for the mappers to finish.
Reducers start shuffling based on a threshold of percentage of mappers that have finished. You can change the parameter to get reducers to start sooner or later.
Why is starting the reducers early a good thing? Because it spreads out the data transfer from the mappers to the reducers over time, which is a good thing if your network is the bottleneck.
Why is starting the reducers early a bad thing? Because they "hog up" reduce slots while only copying data and waiting for mappers to finish. Another job that starts later that will actually use the reduce slots now can't use them.
You can customize when the reducers start up by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis. In new versions of Hadoop (at least 2.4.1) the parameter is called mapreduce.job.reduce.slowstart.completedmaps (thanks, user yegor256).
Typically, I like to keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way the job doesn't hog up reducers when they aren't doing anything but copying data. If you only ever have one job running at a time, doing 0.1 would probably be appropriate.
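For reference, a per-job way to set this from the driver, using the Hadoop 2.x property name mentioned above (older clusters use mapred.reduce.slowstart.completed.maps instead); the class name and 0.90 value are just an example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SlowstartDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Don't start the reducers' shuffle until 90% of the mappers have finished.
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.90f);
        Job job = Job.getInstance(conf, "late-shuffle");
        // ... configure mapper/reducer/paths as usual, then submit the job.
      }
    }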
The reduce phase can start long before a reducer is called. As soon as "a" mapper finishes its job, the generated data undergoes some sorting and shuffling (which includes calls to the combiner and partitioner). The reducer "phase" kicks in the moment this post-mapper data processing starts. As this processing is done, you will see progress in the reducers' percentage. However, none of the reducers have been called yet. Depending on the number of processors available/used, the nature of the data and the number of expected reducers, you may want to change the parameter as described by #Donald-miner above.
As far as I understand, the reduce phase starts with the map phase and keeps consuming records from the maps. However, since there is a sort and shuffle phase after the map phase, all the outputs have to be sorted and sent to the reducer. So logically you can imagine that the reduce phase starts only after the map phase, but for performance reasons the reducers are actually initialized alongside the mappers.
The percentage shown for the reduce phase is actually the amount of data copied from the maps' output to the reducers' input directories.
When does this copying start? It is a configuration you can set, as Donald showed above. Once all the data is copied to the reducers (i.e. 100% reduce), that's when the reducers start working, and hence they might seem to freeze at "100% reduce" if your reducer code is I/O- or CPU-intensive.
Reduce starts only after all the mappers have finished their tasks; the reducer has to communicate with all the mappers, so it has to wait until the last mapper has finished. However, each mapper starts transferring data the moment it has completed its task.
Consider a WordCount example in order to understand better how a MapReduce job works. Suppose we have a large file, say a novel, and our task is to find the number of times each word occurs in the file. Since the file is large, it might be divided into different blocks and replicated on different worker nodes. The word-count job is composed of map and reduce tasks. The map task takes each block as input and produces intermediate key-value pairs. In this example, since we are counting the number of occurrences of words, the mapper, while processing a block, would produce intermediate results of the form (word1,count1), (word2,count2), etc. The intermediate results of all the mappers are passed through a shuffle phase which will reorder the intermediate results.
Assume that our map output from different mappers is of the following form:
Map 1:-
(is,24)
(was,32)
(and,12)
Map2 :-
(my,12)
(is,23)
(was,30)
The map outputs are sorted in such a manner that the same key values are given to the same reducer. Here it would mean that the keys corresponding to is, was, etc. go to the same reducer. It is the reducer which produces the final output, which in this case would be:
(and,12)(is,47)(my,12)(was,62)
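For completeness, here is a minimal WordCount sketch using the org.apache.hadoop.mapreduce API, matching the description above; the class names are the conventional ones from the standard example.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      // Map task: runs once per input split and emits (word, 1) for every word it sees.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce task: after shuffle/sort the counts for each key arrive grouped
      // (e.g. 24 and 23 for "is" in the example above) and are summed to the final total (47).
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
    }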
Reducer tasks start only after the completion of all the mappers.
But the data transfer happens after each map finishes.
Actually it is a pull operation.
That means each reducer keeps asking every map task whether it has some data to retrieve. If it finds that any mapper has completed its task, the reducer pulls that intermediate data.
The intermediate data from a mapper is stored on disk.
And the data transfer from mapper to reducer happens over the network (data locality is not preserved in the reduce phase).
When the mappers finish their tasks, the reducers start their job of reducing the data; that is the MapReduce job.

Difference and relationship between slots, map tasks, data splits, Mapper

I have gone through a few Hadoop books and papers.
A slot is a map/reduce computation unit on a node; it may be a map or a reduce slot.
As far as I know, a split is a group of blocks of files in HDFS which has some length and the locations of the nodes where they are stored.
Mapper is a class, but when the code is instantiated it is called a map task.
Am I right?
I am not clear about the difference and relationship between map tasks, data splits and Mapper.
Regarding scheduling, I understand that when a map slot of a node is free, a map task is chosen from the non-running map tasks and launched if the data to be processed by that map task is on the node.
Can anyone explain it clearly in terms of the above concepts: slots, Mapper, map tasks, etc.?
Thanks,
Arun
As far as I know, a split is a group of blocks of files in HDFS which has some length and the locations of the nodes where they are stored.
An InputSplit is a unit of data which a particular mapper will process. It need not be just a group of HDFS blocks. It can be a single line, 100 rows from a DB, a 50 MB file, etc.
I am not clear about the difference and relationship between map tasks, data splits and Mapper.
An InputSplit is processed by a map task, and each map task runs its own instance of the Mapper class.
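One way to see the split / map-task relationship in code: the number of map tasks follows the number of InputSplits the InputFormat produces, and with FileInputFormat you can influence that by capping the split size. The class name, path and the 64 MB cap below are purely illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-demo");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Cap each InputSplit at ~64 MB; a 1 GB input would then yield ~16 splits,
        // and the framework launches one map task (one Mapper instance) per split.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        // ... set mapper/reducer classes as usual, then submit the job.
      }
    }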
As I understand it:
- First, the data is split in HDFS across the data nodes.
- Then, when there is a new job, the JobTracker divides the job into map and reduce tasks.
- The JobTracker then assigns each map task to a node which already has the split of data related to that map task, so the data is local to the node, there is no cost for moving data, and the execution time is as low as possible.
- But sometimes a task has to be assigned to a node which does not have the data on it; the node then has to get the data over the network and process it.
An input split is not the data itself; it is a reference to the particular amount of data that a map task processes. Usually it is the same as the block size, because if the two sizes differ, some of the data may sit on a different node and would need to be transferred.
MAPPER: Mapper is a class.
MAPPER PHASE: the mapper phase is the input/output code that converts the input into (key, value) pairs.
MAPPER SLOT: the computation unit in which the mapper (or reducer) code executes.
