When do the results from a mapper task get deleted from disk? - hadoop

When do the outputs for a mapper task get deleted from the local filesystem? Do they persist until the entire job completes or do they get deleted at an earlier time than that?

In addition to the map and reduce tasks, two further tasks are created: a job setup task
and a job cleanup task. These are run by tasktrackers and are used to run code to setup
the job before any map tasks run, and to cleanup after all the reduce tasks are complete.
The OutputCommitter that is configured for the job determines the code to be run, and
by default this is a FileOutputCommitter. For the job setup task it will create the final
output directory for the job and the temporary working space for the task output, and
for the job cleanup task it will delete the temporary working space for the task output.
Have a look at OutputCommitter.

If your hadoop.tmp.dir is set to a default setting (say, /tmp/), it will most likely be subject to tmpwatch and any default settings in your OS. I would suggest poking around in /etc/cron.d/, /etc/cron.daily, etc/cron.weekly/, etc., to see exactly what your OS default is like.
One thing to keep in mind about tmpwatch is that, by default, it will key on access time, not modification time (i.e., files that have not been 'touched' since X will be considered 'stale' and subject to removal). However, it's a common practice with Hadoop to mount filesystems with the noatime and nodiratime flags, meaning that access times will not get updated and thus skewing your tmpwatch behaviors.
Otherwise, Hadoop will purge task attempt logs older than 24 hours (after task completion), by default. While a few years old, this writeup has some great info on the default behaviors. Take a look in particular at the sections that refer to mapreduce.job.userlog.retain.hours.
EDIT: responding to OP's comment, which clears up my misunderstanding of the question:
As far as the intermediate output of map tasks which is spilled to disk, used by any combiners, and copied to any reducers, the Hadoop Definitive Guide has this to say:
Tasktrackers do not delete map outputs from disk as soon as the first
reducer has retrieved them, as the reducer may fail. Instead, they
wait until they are told to delete them by the jobtracker, which is
after the job has completed.
Source
I've also +1'd #mgs answer below, as they have linked the source code that controls this and described the Job cleanup task.
So, yes, the map output data is deleted immediately after the job completes, successfully or not, and no sooner.

"Tasktrackers do not delete map outputs from disk as soon as the first reducer has retrieved them, as the reducer may fail. Instead, they wait until they are told to delete them by the jobtracker, which is after the job has completed"
Hadoop: The Definitive Guide ( Section 6.4)

Related

How to delete input files after successful mapreduce

We have a system that receives archives on a specified directory and on a regular basis it launches a mapreduce job that opens the archives and processes the files within them. To avoid re-processing the same archives the next time, we're hooked into the close() method on our RecordReader to have it deleted after the last entry is read.
The problem with this approach (we think) is that if a particular mapping fails, the next mapper that makes another attempt at it finds that the original file has been deleted by the record reader from the first one and it bombs out. We think the way to go is to hold off until all the mapping and reducing is complete and then delete the input archives.
Is this the best way to do this?
If so, how can we obtain a listing of all the input files found by the system from the main program? (we can't just scrub the whole input dir, new files may be present)
i.e.:
. . .
job.waitForCompletion(true);
(we're done, delete input files, how?)
return 0;
}
Couple comments.
I think this design is heartache-prone. What happens when you discover that someone deployed a messed up algorithm to your MR cluster and you have to backfill a month's worth of archives? They're gone now. What happens when processing takes longer than expected and a new job needs to start before the old one is completely done? Too many files are present and some get reprocessed. What about when the job starts while an archive is still in flight? Etc.
One way out of this trap is to have the archives go to a rotating location based on time, and either purge the records yourself or (in the case of something like S3) establish a retention policy that allows a certain window for operations. Also whatever the back end map reduce processing is doing could be idempotent: processing the same record twice should not be any different than processing it once. Something tells me that if you're reducing your dataset, that property will be difficult to guarantee.
At the very least you could rename the files you processed instead of deleting them right away and use a glob expression to define your input that does not include the renamed files. There are still race conditions as I mentioned above.
You could use a queue such as Amazon SQS to record the delivery of an archive, and your InputFormat could pull these entries rather than listing the archive folder when determining the input splits. But reprocessing or backfilling becomes problematic without additional infrastructure.
All that being said, the list of splits is generated by the InputFormat. Write a decorator around that and you can stash the split list wherever you want for use by the master after the job is done.
The simplest way would probably be do a multiple input job, read the directory for the files before you run the job and pass those instead of a directory to the job (then delete the files in the list after the job is done).
Based on the situation you are explaining I can suggest the following solution:-
1.The process of data monitoring I.e monitoring the directory into which the archives are landing should be done by a separate process. That separate process can use some metadata table like in mysql to put status entries based on monitoring the directories. The metadata entries can also check for duplicacy.
2. Now based on the metadata entry a separate process can handle the map reduce job triggering part. Some status could be checked in metadata for triggering the jobs.
I think you should use Apache Oozie to manage your workflow. From Oozie's website (bolding is mine):
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
...
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availabilty.

Re-use files in Hadoop Distributed cache

I am wondering if someone can explain how the distributed cache works in Hadoop. I am running a job many times, and after each run I notice that the local distributed cache folder on each node is growing in size.
Is there a way for multiple jobs to re-use the same file in the distributed cache? Or is the distributed cache only valid for the lifetime of any individual job?
The reason I am confused is that the Hadoop documentation mentions that "DistributedCache tracks modification timestamps of the cache files", so this leads me to believe that if the time stamp hasn't changed, then it should not need to re-cache or re-copy the files to the nodes.
I am adding files successfully to the distributed cache using:
DistributedCache.addFileToClassPath(hdfsPath, conf);
DistributedCache uses reference counting to manage the caches. org.apache.hadoop.filecache.TrackerDistributedCacheManager.CleanupThread is in charge of cleaning up the CacheDirs whose reference count is 0. It will check every minute (default period is 1 minute, you can set it by "mapreduce.tasktracker.distributedcache.checkperiod").
When a Job finishes or fails, JobTracker will send a org.apache.hadoop.mapred.KillJobAction to the TaskTrackers. Then if a TaskTracker receives a KillJobAction, it puts the action to tasksToCleanup. In the TaskTracker, there is a background Thread called taskCleanupThread which takes the action from tasksToCleanup and do the cleanup work. For a KillJobAction, it will invoke purgeJob to clean up the Job. In this method, it will decrease the reference count used by this Job (rjob.distCacheMgr.release();).
The above analysis bases on hadoop-core-2.0.0-mr1-cdh4.2.1-sources.jar. I also checked the hadoop-core-0.20.2-cdh3u1-sources.jar and found there was a litte difference between this two versions. For example, there was not a org.apache.hadoop.filecache.TrackerDistributedCacheManager.CleanupThread in 0.20.2-cdh3u1. When initializing a Job, TrackerDistributedCacheManager will check if there is enough space to put the new caches files for this Job. If not, it will delete the caches which have 0 reference count.
If you are using cdh4.2.1, you can increase "mapreduce.tasktracker.distributedcache.checkperiod" to let the clean up work delay. Then the probability that multiple Jobs use the same distributed cache is increased.
If you are using cdh3u1, you can increase the limitation of the cache size("local.cache.size", default is 10G) and the max directories for caches("mapreduce.tasktracker.cache.local.numberdirectories", default is 10000). This can be also applied to cdh4.2.1.
If you look closely at what this book says, is that there is a limit of what can be stored in Distributed Cache. By default it's 10GB (configurable). There can be multiple different jobs running in the cluster concurrently. Furthermore, Hadoop kind of guarantees the files stay available in the cache for a single job as it is maintained by reference count done by the tasktracker for different tasks accessing the files in cache. In your case, for subsequent Jobs, the files may not be there as they are already marked for deletion.
Please correct me if you disagree anywhere. I'll be glad to discuss this further.
According to this: http://www.datasalt.com/2011/05/handling-dependencies-and-configuration-in-java-hadoop-projects-efficiently/
You should be able to do this via DistributedCache API instead of "-libjars"

Retaining logs from Hadoop job after it's executed

I'm wondering if there's an easy way to grab all the job logs / task attempt logs of a particular run, and persist them somewhere (HDFS, perhaps)?
I know that the logs are on the local filesystem at /var/log/hadoop-0.20-mapreduce/userlogs for any particular job's task attempts, and that I could write a script to SSH to each of the slave nodes and scoop them all up. However, I'm trying to avoid that if it makes sense to - perhaps there's some built-in function of Hadoop that I'm not aware of?
I did find this link, which is old, but contains some helpful information -- but did not include the answer I'm looking for.
mapreduce.job.userlog.retain.hours is set to 24 by default, so any job's logs will be automatically purged after 1 day. Is there anything I can do besides increasing the value of the retain.hours parameter to get these to persist?
I don't know of anything out of the box that exists, but I have done something similar manually.
We set up cron jobs that run every 20 minutes that look for new logs for task attempts, then pumps them all into HDFS into a specific directory. We modified the files names so that the hostname it is coming from is appended. Then, we had MapReduce jobs try to find issues, calculate stats like runtimes, etc. It was pretty neat. We did something similar with NameNode logs, too.

Mapreduce dataflow Internals

I tried to understand map reduce anatomy from various books/blogs.. But I am not getting a clear idea.
What happens when I submit a job to the cluster using this command:
..Loaded the files into hdfs already
bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
Can anyone explain the sequence of opreations that happens right from the client and inside the cluster?
The processes goes like this :
1- The client configures and sets up the job via Job and submits it to the JobTracker.
2- Once the job has been submitted the JobTracker assigns a job ID to this job.
3- Then the output specification of the job is verified. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
4- Once this is done, InputSplits for the job are created(based on the InputFormat you are using). If the splits cannot be computed, because the input paths don’t exist, for example, then the job is not submitted and an error is thrown to the MapReduce program.
5- Based on the number of InputSplits, map tasks are created and each InputSplits gets processed by one map task.
6- Then the resources which are required to run the job are copied across the cluster like the the job JAR file, the configuration file etc. The job JAR is copied with a high replication factor (which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job.
7- Then based on the location of the data blocks, that are going to get processed, JobTracker directs TaskTrackers to run map tasks on that very same DataNode where that particular data block is present. If there are no free CPU slots on that DataNode, the data is moved to a nearby DataNode with free slots and the processes is continued without having to wait.
8- Once the map phase starts individual records(key-value pairs) from each InputSplit start getting processed by the Mapper one by one completing the entire InputSplit.
9- Once the map phase gets over, the output undergoes shuffle, sort and combine. After this the reduce phase starts giving you the final output.
Below is the pictorial representation of the entire process :
Also, I would suggest you to go through this link.
HTH

Is it possible to restart a "killed" Hadoop job from where it left off?

I have a Hadoop job that processes log files and reports some statistics. This job died about halfway through the job because it ran out of file handles. I have fixed the issue with the file handles and am wondering if it is possible to restart a "killed" job.
As it turns out, there is not a good way to do this; once a job has been killed there is no way to re-instantiate that job and re-start processing immediately prior to the first failure. There are likely some really good reasons for this but I'm not qualified to speak to this issue.
In my own case, I was processing a large set of log files and loading these files into an index. Additionally I was creating a report on the contents of these files at the same time. In order to make the job more tolerant of failures on the indexing side (a side-effect, this isn't related to Hadoop at all) I altered my job to instead create many smaller jobs, each one of these jobs processing a chunk of these log files. When one of these jobs finishes, it renames the processed log files so that they are not processed again. Each job waits for the previous job to complete before running.
Chaining multiple MapReduce jobs in Hadoop
When one job fails, all of the subsequent jobs quickly fail afterward. Simply fixing whatever the issue was and the re-submitting my job will, roughly, pick up processing where it left off. In the worst-case scenario where a job was 99% complete at the time of it's failure, that one job will be erroneously and wastefully re-processed.

Resources