How are output files handled in Hadoop at job and task level ? - hadoop

AS per The Definitive Guide, setUpJob() of OutPutCommitter will create the mapreduce output directory and also setup the temporary workspace for tasks. mapred.output.dir/_temporary
Then the book says temporary directories at task level are created when task outputs are written.
The above two statements are kind of confusing.

So basically a map reduce job consist of many task i.e map tasks and reduce tasks. Now mapreduce output directory is the directory where the final output of map-reduce job is written. Now when the map reduce job runs each of the map task and reduce tasks generate intermediate files which is local to the node where the task run. This local output per task which are intermediate are written to temporary workspace. Finally after shuffling and other phases this intermediate outputs are finally written to hdfs as final output based on the logic you apply for the map-reduce job. I hope that answers your question

Related

Chaining Map Reduce Program

I have a situation, during one POC I want to create a nested MapReduce within one Job. Like a Map M1 O/P to Reducer R1 O/P then that R1 output goes to M2 and final output will come with either M2 or we can run R2 with M2 O/P.
Single Job ID - M1->R1->M2->R2...Final output will be in a single O/P file.
Can we do it without Oozie?
You can chain multiple jobs in your Driver class. First, create a job for first MapReduce, by defining all the required configuration. Then start the job as usual by calling:
job1.waitForCompletion(true);
This is wait until the job is finished. Now check the final status of the first job, whether failed or succeeded for appropriate next action.
If the first job is completed successfully, then launch the next MapReduce in the same way. First define the required parameters and launch the job with:
job2.waitForCompletion(true);
The important thing will be output path of the first will be input for the second job. This is serial (sequential) job chaining, because both the jobs will be running one after another.
You can also make use of job control where in you can execute a number of map reduce jobs in a sequence. In your case there are two mappers and two or one reducers. You can have two map reduce jobs and for the second job you can use set the number of reducers to zero if you don't require reducers.

How to schedule post processing task after a mapreduce job

I'm looking for a simple method to chain post processing code after a map reduce job
specifically, in involves renaming\moving the out files create by org.apache.hadoop.mapred.lib.MultipleOutputs (the class has limitations on the output file names, so I ca't produce the files directly in the mapreduce job)
The options I know (or think of) are:
add it in the job creation code - this is what I do now, but I prefer the task will be scheduled by the jobtracker (to reduce the chances of the process being aborted)
using a workflow engine (luigi, oozie) - but this seems like an overkill for this issue
using job chaining - this allows chaining mapreduce jobs - it it possible to chain a "simple" task?
Your "simple" task should be a Mapper-only job. Your Map() receives as key the file name and renames the file. For this you have to write your own InputFormat and RecordReader, like in the links, but your RecordReader should not actually read the file, just return the file name in getCurrentKey():
https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3
https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileRecordReader.java?r=3

When do the results from a mapper task get deleted from disk?

When do the outputs for a mapper task get deleted from the local filesystem? Do they persist until the entire job completes or do they get deleted at an earlier time than that?
In addition to the map and reduce tasks, two further tasks are created: a job setup task
and a job cleanup task. These are run by tasktrackers and are used to run code to setup
the job before any map tasks run, and to cleanup after all the reduce tasks are complete.
The OutputCommitter that is configured for the job determines the code to be run, and
by default this is a FileOutputCommitter. For the job setup task it will create the final
output directory for the job and the temporary working space for the task output, and
for the job cleanup task it will delete the temporary working space for the task output.
Have a look at OutputCommitter.
If your hadoop.tmp.dir is set to a default setting (say, /tmp/), it will most likely be subject to tmpwatch and any default settings in your OS. I would suggest poking around in /etc/cron.d/, /etc/cron.daily, etc/cron.weekly/, etc., to see exactly what your OS default is like.
One thing to keep in mind about tmpwatch is that, by default, it will key on access time, not modification time (i.e., files that have not been 'touched' since X will be considered 'stale' and subject to removal). However, it's a common practice with Hadoop to mount filesystems with the noatime and nodiratime flags, meaning that access times will not get updated and thus skewing your tmpwatch behaviors.
Otherwise, Hadoop will purge task attempt logs older than 24 hours (after task completion), by default. While a few years old, this writeup has some great info on the default behaviors. Take a look in particular at the sections that refer to mapreduce.job.userlog.retain.hours.
EDIT: responding to OP's comment, which clears up my misunderstanding of the question:
As far as the intermediate output of map tasks which is spilled to disk, used by any combiners, and copied to any reducers, the Hadoop Definitive Guide has this to say:
Tasktrackers do not delete map outputs from disk as soon as the first
reducer has retrieved them, as the reducer may fail. Instead, they
wait until they are told to delete them by the jobtracker, which is
after the job has completed.
Source
I've also +1'd #mgs answer below, as they have linked the source code that controls this and described the Job cleanup task.
So, yes, the map output data is deleted immediately after the job completes, successfully or not, and no sooner.
"Tasktrackers do not delete map outputs from disk as soon as the first reducer has retrieved them, as the reducer may fail. Instead, they wait until they are told to delete them by the jobtracker, which is after the job has completed"
Hadoop: The Definitive Guide ( Section 6.4)

Mapreduce dataflow Internals

I tried to understand map reduce anatomy from various books/blogs.. But I am not getting a clear idea.
What happens when I submit a job to the cluster using this command:
..Loaded the files into hdfs already
bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
Can anyone explain the sequence of opreations that happens right from the client and inside the cluster?
The processes goes like this :
1- The client configures and sets up the job via Job and submits it to the JobTracker.
2- Once the job has been submitted the JobTracker assigns a job ID to this job.
3- Then the output specification of the job is verified. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
4- Once this is done, InputSplits for the job are created(based on the InputFormat you are using). If the splits cannot be computed, because the input paths don’t exist, for example, then the job is not submitted and an error is thrown to the MapReduce program.
5- Based on the number of InputSplits, map tasks are created and each InputSplits gets processed by one map task.
6- Then the resources which are required to run the job are copied across the cluster like the the job JAR file, the configuration file etc. The job JAR is copied with a high replication factor (which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job.
7- Then based on the location of the data blocks, that are going to get processed, JobTracker directs TaskTrackers to run map tasks on that very same DataNode where that particular data block is present. If there are no free CPU slots on that DataNode, the data is moved to a nearby DataNode with free slots and the processes is continued without having to wait.
8- Once the map phase starts individual records(key-value pairs) from each InputSplit start getting processed by the Mapper one by one completing the entire InputSplit.
9- Once the map phase gets over, the output undergoes shuffle, sort and combine. After this the reduce phase starts giving you the final output.
Below is the pictorial representation of the entire process :
Also, I would suggest you to go through this link.
HTH

Query regarding shuffling in map reduce

How does a node processing running the mapper knows that it has to send some key-value output to node A (running the reducer) & some to node B (running another reducer)?
Is there somewhere a reducer node list is maintained by the the JobTracker?
If yes, how does it chooses a node to run the reducer?
A Mapper doesn't really know where to send the data, it focuses on 2 things:
Writes the data to disk. Initially the map output is buffered in memory, and once it hits a certain threshold it gets flushed to disk. But right before going to disk, the data is partitioned by taking a hash of the output key which corresponds to which Reducer it will be sent to.
Once a map task is done it will notify the parent task tracker to say it's done, which will then notify the job tracker itself. So the job tracker has the complete mapping between map outputs and task trackers.
From there, when a Reducer starts, it will keep asking the job tracker for the map outputs corresponding to his partition until it has retrieved them all. Whenever a map output is available, the reduce task will start copying it, and gradually merge as it copies.
If this is still unclear, I will advise looking at the reference book on Hadoop which has a whole chapter describing this part, here is a schema extracted from it that could help you visualize what happens in the shuffle step:
The mappers do not send the data to the reducers, rather the reducers pull the data from the task trackers where successful map tasks ran.
The Job Tracker, when allocating a reducer task to a task tracker, knows where the successful map tasks ran, and can compile a list of task tracker and map attempt task results to pull.

Resources