MapReduce - reduce running while map is not finished - hadoop

I've implemented a simple WordCount application in Hadoop. On my cluster, I have one namenode and 4 datanodes. The replication factor is set to 4.
In the filesystem I have put many lorem-ipsum files.
While running the wordcount application I see the reducer working even though the mappers aren't finished yet.
2021-10-29 14:53:31,044 INFO mapreduce.Job: map 70% reduce 23%
How does this work?
Many tutorial pages state (one page, for example):
"A reducer cannot start while a mapper is still in progress"
https://www.talend.com/resources/what-is-mapreduce/
How can the reducers work if the result set of mapping isn't completed?

Once data is emitted by a mapper, it undergoes two steps:
It is shuffled - this is the process of sending data to the correct reducer depending on its key and the partitioner logic.
It is sorted - this happens on the reducer itself.
So even though data is still being emitted by the mapper, reducer tasks are being created and are sorting data as it arrives. You're correct in that they won't actually start processing values until all mapping has finished.
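The routing step can be sketched with a simplified Python analogue of Hadoop's default HashPartitioner (the real Java version computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; this stand-in is illustrative only):

```python
# Simplified analogue of Hadoop's default HashPartitioner: every record
# with a given key maps to the same reducer index, so a pair can be
# routed the moment a mapper emits it, before the rest of the map
# output even exists.
def partition(key, num_reducers):
    return hash(key) % num_reducers

# All ("is", ...) pairs, from every mapper, land on the same reducer:
targets = {partition("is", 4) for _ in range(1000)}
```

This is exactly why shuffle can overlap with the map phase: the destination of a pair depends only on its key, not on the complete map output.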

Related

Reducer doesn't start but still progresses on MapReduce job

If reducers do not start before all mappers finish, then why does the progress of a MapReduce job show something like Map(50%) Reduce(10%)? Why is reducer progress displayed when the mappers are not finished yet?
This is because of the mapreduce.job.reduce.slowstart.completedmaps property, whose default value is 0.05.
It means that the reduce phase is started as soon as at least 5% of the total mappers have completed execution.
The dispatched reducers will then stay in the copy phase until all mappers are completed.
If you wish to start reducers only after all mappers have completed, set this property to 1.0 in the job configuration.
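The slowstart rule can be sketched as a simple threshold check (a hypothetical helper; the real scheduling decision lives in the MRAppMaster):

```python
# Hypothetical helper mirroring the slowstart rule: reducers are
# dispatched once the fraction of finished map tasks reaches
# mapreduce.job.reduce.slowstart.completedmaps (default 0.05).
def reducers_can_launch(completed_maps, total_maps, slowstart=0.05):
    return completed_maps / total_maps >= slowstart

# With the default, 1 finished mapper out of 20 (5%) is enough:
# reducers_can_launch(1, 20) -> True
# With slowstart=1.0, reducers wait for every mapper:
# reducers_can_launch(19, 20, slowstart=1.0) -> False
```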
Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account the data transfer done by the reduce process, so reduce progress starts showing as soon as any intermediate key-value pair from a mapper is available to be transferred to a reducer. Although the reducer progress is updated, the programmer-defined reduce method is called only after all the mappers have finished.

When will the number of reducers and their nodes be allocated in the MapReduce job execution?

When reading about MapReduce, I read the below interesting lines:
"But how do the Reducer’s know which nodes to query to get their
partitions? This happens through the Application Master. As each
Mapper instance completes, it notifies the Application Master about
the partitions it produced during its run. Each Reducer
periodically queries the Application Master for Mapper hosts until it
has received a final list of nodes hosting its partitions."
I have a doubt here. When they say "each Reducer", what does it mean exactly? Will the reducers be allocated before the start of the map phase, and how are the reducer nodes chosen?
Reducers can start before the maps are done processing the data. Once started, they can pull data from the mapper machines, but they will start processing only after all the mappers are done.
mapred.reduce.slowstart.completed.maps is the property to configure this behaviour. More information on the property here.
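A minimal, hypothetical sketch of the pull model described in the quote: mapper hosts are registered with the Application Master as map tasks complete, and reducers poll until the list is final (class and method names are illustrative, not Hadoop's actual API):

```python
# Hypothetical sketch of the pull model: the Application Master tracks
# which map tasks have finished and on which host, and each reducer
# polls it until it has the complete list of hosts that store its
# partition.
class AppMaster:
    def __init__(self, total_maps):
        self.total_maps = total_maps
        self.completed = {}          # map_id -> host that ran it

    def map_finished(self, map_id, host):
        self.completed[map_id] = host

    def poll(self):
        # Returns the hosts known so far, and whether the list is final.
        done = len(self.completed) == self.total_maps
        return list(self.completed.values()), done

am = AppMaster(total_maps=2)
am.map_finished(0, "node-1")
hosts, final = am.poll()        # partial list, final is False
am.map_finished(1, "node-3")
hosts, final = am.poll()        # full list, final is True
```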

How does Hadoop distribute jobs to map and reduce

Can anyone explain how Hadoop decides to pass jobs to map and reduce? Hadoop jobs are split between map and reduce, but I am not able to figure out the way in which it's done.
Thanks in advance.
Please refer to Hadoop: The Definitive Guide, Chapter 6, the "Anatomy of a MapReduce Job Run" topic. Happy learning.
From the Apache MapReduce tutorial:
Job Configuration:
Job represents a MapReduce job configuration.
Job is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution. The framework tries to faithfully execute the job as described by Job
Task Execution & Environment
The MRAppMaster executes the Mapper/Reducer task as a child process in a separate jvm.
Mapper
Mapper maps input key/value pairs to a set of intermediate key/value pairs.
How Many Maps?
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
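As a rough sketch (assuming the default 128 MiB block size and ignoring split-size tuning such as mapreduce.input.fileinputformat.split.minsize), the map-task count follows from the input size:

```python
import math

# Roughly one map task per HDFS block of input; split tuning is
# deliberately ignored in this sketch.
def num_map_tasks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# A 1 GiB input file with the default 128 MiB block size -> 8 map tasks.
```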
Reducer
Reducer reduces a set of intermediate values which share a key to a smaller set of values.
The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
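That rule of thumb can be written out as a small helper (illustrative only):

```python
# Rule of thumb from the Apache tutorial: 0.95 (all reduces can launch
# at once and start transferring as maps finish) or 1.75 (faster nodes
# run a second wave of reduces) times the cluster's reduce capacity.
def suggested_reduces(nodes, max_containers_per_node, factor=0.95):
    return round(factor * nodes * max_containers_per_node)

# e.g. 4 nodes with 10 containers each:
# factor 0.95 -> 38 reduces, factor 1.75 -> 70 reduces
```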
Job Submission and Monitoring
The job submission process involves:
Checking the input and output specifications of the job.
Computing the InputSplit values for the job.
Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
Copying the job’s jar and configuration to the MapReduce system directory on the FileSystem.
Submitting the job to the ResourceManager and optionally monitoring its status.
Job Input
InputFormat describes the input-specification for a MapReduce job. InputSplit represents the data to be processed by an individual Mapper.
Job Output
OutputFormat describes the output-specification for a MapReduce job.
Go through that tutorial for further understanding of complete workflow.
From the "Anatomy of a MapReduce Job" article at http://ercoppa.github.io/ :
The full workflow is pictured in a diagram in that article.

How can map and reduce run in parallel

I am a beginner to Hadoop, and when running a Hadoop job I noticed the progress log showing map 80% reduce 25%. My understanding of MapReduce is that mappers produce a bunch of intermediate values. After the mappers produce output, the intermediate pairs are shuffled/sorted and sent to the reduce job. Can someone please explain how map and reduce can work in parallel?
The outputs from the mappers have to be copied to the appropriate reducer nodes. This is called the shuffle process. That can start even before all the mappers have finished, since the decision of which key goes to which reducer is dependent only on the output key from the mapper. So the 25% progress you see is due to the shuffle phase.
After shuffle, there is a sort phase and then the reduce phase. Sort and reduce cannot happen unless all mappers have completed. Since shuffle can happen before the mappers finish, you can see a maximum of 33.33% reduce completion before the mappers have finished. This is because the default apache implementation considers shuffle, sort and reduce each to take an equal 33.33% of the time.
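Under that equal-weighting assumption, the reported reduce percentage can be sketched as:

```python
# The reported reduce percentage weights the three sub-phases equally,
# matching the default implementation described above: shuffle, sort
# and reduce each count for one third of the total.
def reduce_progress(shuffle, sort, reduce):
    # Each argument is the completed fraction (0.0 to 1.0) of that phase.
    return (shuffle + sort + reduce) / 3 * 100

# Shuffle fully done while mappers still run, sort/reduce not started:
# reduce_progress(1.0, 0.0, 0.0) is roughly 33.33%
```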

When do reduce tasks start in Hadoop?

In Hadoop when do reduce tasks start? Do they start after a certain percentage (threshold) of mappers complete? If so, is this threshold fixed? What kind of threshold is typically used?
The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by the reducer from each mapper. This can happen while mappers are generating data since it is only a data transfer. On the other hand, sort and reduce can only start once all the mappers are done. You can tell which one MapReduce is doing by looking at the reducer completion percentage: 0–33% means it's doing shuffle, 34–66% is sort, 67–100% is reduce. This is why your reducers will sometimes seem "stuck" at 33%: they're waiting for mappers to finish.
Reducers start shuffling based on a threshold of percentage of mappers that have finished. You can change the parameter to get reducers to start sooner or later.
Why is starting the reducers early a good thing? Because it spreads out the data transfer from the mappers to the reducers over time, which is a good thing if your network is the bottleneck.
Why is starting the reducers early a bad thing? Because they "hog up" reduce slots while only copying data and waiting for mappers to finish. Another job that starts later that will actually use the reduce slots now can't use them.
You can customize when the reducers start up by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis. In new versions of Hadoop (at least 2.4.1) the parameter is called mapreduce.job.reduce.slowstart.completedmaps (thanks user yegor256).
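In mapred-site.xml the modern property would be set like this (the 0.90 value is just an example):

```xml
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.90</value>
  <description>Fraction of map tasks that must complete before
  reduce tasks are scheduled for the job.</description>
</property>
```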
Typically, I like to keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way the job doesn't hog up reducers when they aren't doing anything but copying data. If you only ever have one job running at a time, doing 0.1 would probably be appropriate.
The reduce phase can start long before a reducer is called. As soon as "a" mapper finishes its job, the generated data undergoes some sorting and shuffling (which includes calls to the combiner and partitioner). The reducer "phase" kicks in the moment post-mapper data processing starts. As this processing is done, you will see progress in the reducer percentage. However, none of the reducers have been called yet. Depending on the number of processors available/used, the nature of the data, and the number of expected reducers, you may want to change the parameter as described by @Donald-miner above.
As far as I understand, the reduce phase starts along with the map phase and keeps consuming records from the maps. However, since there are sort and shuffle phases after the map phase, all outputs have to be sorted and sent to the reducer. So logically you can imagine that the reduce phase starts only after the map phase, but for performance reasons the reducers are also initialized along with the mappers.
The percentage shown for the reduce phase is actually the amount of data copied from the map outputs to the reducers' input directories.
When does this copying start? It is a configuration you can set, as Donald showed above. Once all the data is copied to the reducers, the reducers start working on it, and hence the job might seem to freeze in the reduce phase if your reducer code is I/O or CPU intensive.
Reduce starts only after all the mappers have finished their tasks. A reducer has to communicate with all the mappers, so it has to wait until the last mapper finishes its task. However, a mapper starts transferring data the moment it has completed its task.
Consider a WordCount example in order to understand better how a MapReduce task works. Suppose we have a large file, say a novel, and our task is to find the number of times each word occurs in the file. Since the file is large, it might be divided into different blocks and replicated on different worker nodes. The word count job is composed of map and reduce tasks. The map task takes each block as input and produces intermediate key-value pairs. In this example, since we are counting the number of occurrences of words, the mapper, while processing a block, would produce intermediate results of the form (word1,count1), (word2,count2), etc. The intermediate results of all the mappers are passed through a shuffle phase which will reorder the intermediate results.
Assume that our map output from different mappers is of the following form:
Map 1:-
(is,24)
(was,32)
(and,12)
Map2 :-
(my,12)
(is,23)
(was,30)
The map outputs are sorted in such a manner that the same key values are given to the same reducer. Here it would mean that the keys corresponding to is, was, etc. go to the same reducer. It is the reducer which produces the final output, which in this case would be:
(and,12)(is,47)(my,12)(was,62)
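The merge above can be reproduced with a minimal Python sketch of the shuffle-and-reduce step:

```python
from collections import defaultdict

# Intermediate output from the two mappers in the example above.
map1 = [("is", 24), ("was", 32), ("and", 12)]
map2 = [("my", 12), ("is", 23), ("was", 30)]

# Shuffle groups the pairs by key; reduce sums the counts per key.
counts = defaultdict(int)
for word, n in map1 + map2:
    counts[word] += n

final = sorted(counts.items())
# final == [('and', 12), ('is', 47), ('my', 12), ('was', 62)]
```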
Reducer tasks start their actual reduce work only after the completion of all the mappers,
but the data transfer happens after each map finishes.
It is actually a pull operation.
That means each reducer keeps asking every map task whether it has some data to retrieve; if it finds any mapper that has completed its task, the reducer pulls that intermediate data.
The intermediate data from a mapper is stored on local disk.
The data transfer from mapper to reducer happens over the network (data locality is not preserved in the reduce phase).
Once all the mappers finish, the reducers run the reduce step on the collected data; together this makes up the MapReduce job.
