Query regarding shuffling in map reduce - hadoop

How does a node processing running the mapper knows that it has to send some key-value output to node A (running the reducer) & some to node B (running another reducer)?
Is there somewhere a reducer node list is maintained by the the JobTracker?
If yes, how does it chooses a node to run the reducer?

A Mapper doesn't really know where to send the data, it focuses on 2 things:
Writes the data to disk. Initially the map output is buffered in memory, and once it hits a certain threshold it gets flushed to disk. But right before going to disk, the data is partitioned by taking a hash of the output key which corresponds to which Reducer it will be sent to.
Once a map task is done it will notify the parent task tracker to say it's done, which will then notify the job tracker itself. So the job tracker has the complete mapping between map outputs and task trackers.
From there, when a Reducer starts, it will keep asking the job tracker for the map outputs corresponding to his partition until it has retrieved them all. Whenever a map output is available, the reduce task will start copying it, and gradually merge as it copies.
If this is still unclear, I will advise looking at the reference book on Hadoop which has a whole chapter describing this part, here is a schema extracted from it that could help you visualize what happens in the shuffle step:

The mappers do not send the data to the reducers, rather the reducers pull the data from the task trackers where successful map tasks ran.
The Job Tracker, when allocating a reducer task to a task tracker, knows where the successful map tasks ran, and can compile a list of task tracker and map attempt task results to pull.

Related

If Map slots started across racks then how Job Tracker process the data?

1. When assigning a task to Task Tracker for processing, the Job Tracker first tries to locate a Task Tracker with a free slots on the same server that has the data node containing the data (to ensure the data locality)
2. If it does not find this Task Tracker, it looks for a Task Tracker on another node in the same rack before it goes across the racks to locate a Task Tracker.
Thumb rule: Processing logic only will reach to data for processing.
Assuming that the Task tracker started on across the racks where corresponding processing data not available, So in this scenario, how processing logic (program) reaches to data, instead of data reaches to processing logic (program)?
When the data are not available locally, they need to be transferred through the network. Data locality is not a rule (remote nodes cannot run the program), but a goal (always prefer the local nodes, containing the data, to run the process related with this data chunk), since transferring the data (many GBs) is more costly than traferring the code (a few KBs).

Difference between task counter and job counter

Can anyone please help me understand what is the difference between task counter and job counter in map reduce?
Hadoop,The Definitive guide says that task counters are those which are updated as the task progresses and and job counter are those which are updated as the job progresses.
Is this the only difference or they have any other difference too?
Task Counters
Task counters gather information about tasks over the course of their execution, and the results are aggregated over all the tasks in a job. Task counters are sent in full every time, rather than sending the counts since the last transmission, since this guards against errors due to lost messages. Furthermore, during a job run, counters may go down if a task fails for example you dont want to add up the bad_records in a split of a failed tasks. So As the task progesses and completes successfully the overall count of the task statistics is sent over to task tracker which is passed over to job tracker.
Job counters
Job counters are maintained by the jobtracker (or application master in YARN), so they don’t need to be sent across the network, unlike all other counters, They measure job-level statistics, not values that change while a task is running For example, TOTAL_LAUNCHED_MAPS counts the number of total map task launched which is just a statistics about the overall job

When do the results from a mapper task get deleted from disk?

When do the outputs for a mapper task get deleted from the local filesystem? Do they persist until the entire job completes or do they get deleted at an earlier time than that?
In addition to the map and reduce tasks, two further tasks are created: a job setup task
and a job cleanup task. These are run by tasktrackers and are used to run code to setup
the job before any map tasks run, and to cleanup after all the reduce tasks are complete.
The OutputCommitter that is configured for the job determines the code to be run, and
by default this is a FileOutputCommitter. For the job setup task it will create the final
output directory for the job and the temporary working space for the task output, and
for the job cleanup task it will delete the temporary working space for the task output.
Have a look at OutputCommitter.
If your hadoop.tmp.dir is set to a default setting (say, /tmp/), it will most likely be subject to tmpwatch and any default settings in your OS. I would suggest poking around in /etc/cron.d/, /etc/cron.daily, etc/cron.weekly/, etc., to see exactly what your OS default is like.
One thing to keep in mind about tmpwatch is that, by default, it will key on access time, not modification time (i.e., files that have not been 'touched' since X will be considered 'stale' and subject to removal). However, it's a common practice with Hadoop to mount filesystems with the noatime and nodiratime flags, meaning that access times will not get updated and thus skewing your tmpwatch behaviors.
Otherwise, Hadoop will purge task attempt logs older than 24 hours (after task completion), by default. While a few years old, this writeup has some great info on the default behaviors. Take a look in particular at the sections that refer to mapreduce.job.userlog.retain.hours.
EDIT: responding to OP's comment, which clears up my misunderstanding of the question:
As far as the intermediate output of map tasks which is spilled to disk, used by any combiners, and copied to any reducers, the Hadoop Definitive Guide has this to say:
Tasktrackers do not delete map outputs from disk as soon as the first
reducer has retrieved them, as the reducer may fail. Instead, they
wait until they are told to delete them by the jobtracker, which is
after the job has completed.
Source
I've also +1'd #mgs answer below, as they have linked the source code that controls this and described the Job cleanup task.
So, yes, the map output data is deleted immediately after the job completes, successfully or not, and no sooner.
"Tasktrackers do not delete map outputs from disk as soon as the first reducer has retrieved them, as the reducer may fail. Instead, they wait until they are told to delete them by the jobtracker, which is after the job has completed"
Hadoop: The Definitive Guide ( Section 6.4)

Mapreduce dataflow Internals

I tried to understand map reduce anatomy from various books/blogs.. But I am not getting a clear idea.
What happens when I submit a job to the cluster using this command:
..Loaded the files into hdfs already
bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
Can anyone explain the sequence of opreations that happens right from the client and inside the cluster?
The processes goes like this :
1- The client configures and sets up the job via Job and submits it to the JobTracker.
2- Once the job has been submitted the JobTracker assigns a job ID to this job.
3- Then the output specification of the job is verified. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
4- Once this is done, InputSplits for the job are created(based on the InputFormat you are using). If the splits cannot be computed, because the input paths don’t exist, for example, then the job is not submitted and an error is thrown to the MapReduce program.
5- Based on the number of InputSplits, map tasks are created and each InputSplits gets processed by one map task.
6- Then the resources which are required to run the job are copied across the cluster like the the job JAR file, the configuration file etc. The job JAR is copied with a high replication factor (which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job.
7- Then based on the location of the data blocks, that are going to get processed, JobTracker directs TaskTrackers to run map tasks on that very same DataNode where that particular data block is present. If there are no free CPU slots on that DataNode, the data is moved to a nearby DataNode with free slots and the processes is continued without having to wait.
8- Once the map phase starts individual records(key-value pairs) from each InputSplit start getting processed by the Mapper one by one completing the entire InputSplit.
9- Once the map phase gets over, the output undergoes shuffle, sort and combine. After this the reduce phase starts giving you the final output.
Below is the pictorial representation of the entire process :
Also, I would suggest you to go through this link.
HTH

Chain of events when running a MapReduce job

I'm looking for some specific information regarding the chain of events when running a MapReduce job on a Hadoop cluster.
Let's assume that my Reduce tasks are on the verge of completion. After my last reducer has written its output to the output file, how many replicas of the output file are there?
What exactly happens after the last reducer has finished writing to the output file. When does the NameNode request the respective Data Nodes to replicate the output file? And how is the Name Node informed that the output file is ready? Who conveys that information to the NameNode?
Thank you!
The Reduce tasks write output to HDFS. They do this by first communicating with the name node to request a block. The name node then tells the reducer which data nodes to write to, and then the reducer actually sends the data directly to the first data node, which then sends it to the second data node, which sends it to the third node. Typically the name node will keep things local, so the first data node is probably the same machine that is running the reduce task.
Once the reducer has finished writing outputs, and the data nodes have confirmed this, the reducer itself will tell the job tracker that it has finished via periodic heartbeat communication.
To understand the basics of HDFS replication, have a read over replica placement in the HDFS architecture document. In a nutshell, the NameNode will try to use the same rack to minimize latency.

Resources