mapreduce: customize task outofmemory failure - hadoop

I have a map-only job which operates as one task per file. Sometimes a file causes out-of-memory exceptions in its task.
Imagine an input directory with 10 files, so the job will have 10 tasks. Now imagine 9 "good" files will succeed and 1 "bad" file causes an out-of-memory exception.
Ideally I want the one "bad" file to be moved to a quarantine directory, the 9 "good" files to write their output, and the job to succeed with warnings in the logs.
Partial success can come from the mapreduce.map.failures.maxpercent setting, which is good.
But how do I copy the "bad" file to quarantine when the container fails with an out-of-memory error?
I was thinking a custom FileOutputCommitter overriding the abortTask method would provide the proper hook.
Has anyone done this before?

I tried to find the answer in the Job History Server REST API, but unfortunately task attempts do not store information about their input paths.
If you don't find a better solution, you can do this:
Create a special directory on HDFS for your job.
In the mapper's setup method, get the input split name and record it in a marker file inside this directory.
When the mapper finishes successfully, delete this marker file in the cleanup method.
After the job finishes, check the directory and process the bad files whose names remain there (see the sketch below).
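A minimal sketch of that marker-file flow, assuming the new org.apache.hadoop.mapreduce API and a hypothetical configuration key "bad.file.marker.dir" that the driver sets to the marker directory (all names here are illustrative, not from the original post):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MarkerFileMapper extends Mapper<LongWritable, Text, Text, Text> {

    private FileSystem fs;
    private Path markerFile;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        fs = FileSystem.get(conf);
        // Name the marker after the input file this attempt is processing.
        // Assumes the driver has set "bad.file.marker.dir" in the job config.
        String inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
        markerFile = new Path(conf.get("bad.file.marker.dir"), inputFile);
        fs.create(markerFile, true).close();   // empty marker file
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // normal per-record processing goes here
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        // Only reached if the attempt survived (no out-of-memory kill),
        // so any marker still present after the job names a "bad" file.
        fs.delete(markerFile, false);
    }
}

After job.waitForCompletion(...) returns, the driver can list the marker directory and move the matching input files to the quarantine directory.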

Related

How to architect file processing in Laravel

I have a task to observe a folder where files arrive via SFTP. The files are big and processing one file is relatively time-consuming. I am looking for the best approach. Here are some ideas, but I am not sure which is the best way:
Run a scheduler every 5 minutes to check for new files.
For each new file, fire an event indicating that a new file has arrived.
Create a queued listener for this event. In the listener, copy the new file into a processing folder and process it. When processing starts, insert a record in the DB with status "processing". When processing is done, change the record's status and copy the file to a processed folder.
In this solution I have 2 copy operations for each file, because if a second scheduler run executes before all files are processed, some files could end up in 2 processing jobs.
What is the best way to do this? Should I use another approach that avoids the 2 copy operations, for example a database check during scheduler execution to see whether the file is already in the processing state?
You should use ->withoutOverlapping(), as stated in the Task Scheduler manual here.
This makes sure that only one instance of the task runs at any given time.

How to delete input files after successful mapreduce

We have a system that receives archives in a specified directory and, on a regular basis, launches a MapReduce job that opens the archives and processes the files within them. To avoid re-processing the same archives the next time, we hook into the close() method of our RecordReader to delete the archive after the last entry is read.
The problem with this approach (we think) is that if a particular mapping fails, the next mapper that makes another attempt at it finds that the original file has already been deleted by the first attempt's record reader, and it bombs out. We think the way to go is to hold off until all the mapping and reducing is complete and then delete the input archives.
Is this the best way to do this?
If so, how can we obtain a listing of all the input files found by the system from the main program? (we can't just scrub the whole input dir, new files may be present)
i.e.:
. . .
job.waitForCompletion(true);
// we're done, delete input files, how?
return 0;
}
Couple comments.
I think this design is heartache-prone. What happens when you discover that someone deployed a messed up algorithm to your MR cluster and you have to backfill a month's worth of archives? They're gone now. What happens when processing takes longer than expected and a new job needs to start before the old one is completely done? Too many files are present and some get reprocessed. What about when the job starts while an archive is still in flight? Etc.
One way out of this trap is to have the archives go to a rotating location based on time, and either purge the records yourself or (in the case of something like S3) establish a retention policy that allows a certain window for operations. Also, whatever the back-end MapReduce processing is doing should be idempotent: processing the same record twice should not be any different from processing it once. Something tells me that if you're reducing your dataset, that property will be difficult to guarantee.
At the very least you could rename the files you processed instead of deleting them right away and use a glob expression to define your input that does not include the renamed files. There are still race conditions as I mentioned above.
You could use a queue such as Amazon SQS to record the delivery of an archive, and your InputFormat could pull these entries rather than listing the archive folder when determining the input splits. But reprocessing or backfilling becomes problematic without additional infrastructure.
All that being said, the list of splits is generated by the InputFormat. Write a decorator around that and you can stash the split list wherever you want for use by the master after the job is done.
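As a rough sketch of that decorator idea, assuming TextInputFormat underneath; the manifest path "/tmp/input-manifest" is just an illustrative choice, not something from the original answer:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class RecordingTextInputFormat extends TextInputFormat {
    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = super.getSplits(job);
        FileSystem fs = FileSystem.get(job.getConfiguration());
        // Stash one line per split; a large file may appear more than once.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/input-manifest"), true)) {
            for (InputSplit split : splits) {
                out.writeBytes(((FileSplit) split).getPath().toString() + "\n");
            }
        }
        return splits;
    }
}

The driver can read the manifest back after waitForCompletion() returns and delete or rename exactly those paths.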
The simplest way would probably be to do a multiple-input job: read the directory for the files before you run the job and pass those, instead of a directory, to the job (then delete the files in the list after the job is done). A sketch of that is below.
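A hedged sketch of that approach, assuming a Tool-style driver; the input directory "/incoming/archives" and class name are illustrative, and the mapper/output configuration is omitted:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DeleteInputsAfterJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "process archives");
        job.setJarByClass(DeleteInputsAfterJob.class);
        // mapper, reducer and output configuration omitted for brevity

        FileSystem fs = FileSystem.get(getConf());
        List<Path> inputs = new ArrayList<Path>();
        for (FileStatus stat : fs.listStatus(new Path("/incoming/archives"))) {
            if (stat.isFile()) {
                inputs.add(stat.getPath());
                FileInputFormat.addInputPath(job, stat.getPath()); // pass files, not the directory
            }
        }

        if (job.waitForCompletion(true)) {
            for (Path p : inputs) {
                fs.delete(p, false); // delete only what this job actually read
            }
            return 0;
        }
        return 1; // leave the inputs in place so a rerun can pick them up
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new DeleteInputsAfterJob(), args));
    }
}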
Based on the situation you are describing, I can suggest the following solution:
1. The data-monitoring process, i.e. monitoring the directory into which the archives land, should be handled by a separate process. That process can use a metadata table, for example in MySQL, to record status entries based on what it observes in the directories. The metadata entries can also be used to check for duplicates.
2. Based on the metadata entries, another process can handle triggering the MapReduce jobs. A status field in the metadata can be checked to decide when to trigger the jobs.
I think you should use Apache Oozie to manage your workflow. From Oozie's website (bolding is mine):
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
...
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Retaining logs from Hadoop job after it's executed

I'm wondering if there's an easy way to grab all the job logs / task attempt logs of a particular run, and persist them somewhere (HDFS, perhaps)?
I know that the logs are on the local filesystem at /var/log/hadoop-0.20-mapreduce/userlogs for any particular job's task attempts, and that I could write a script to SSH to each of the slave nodes and scoop them all up. However, I'm trying to avoid that if it makes sense to - perhaps there's some built-in function of Hadoop that I'm not aware of?
I did find this link, which is old but contains some helpful information; it did not, however, include the answer I'm looking for.
mapreduce.job.userlog.retain.hours is set to 24 by default, so any job's logs will be automatically purged after 1 day. Is there anything I can do besides increasing the value of the retain.hours parameter to get these to persist?
I don't know of anything that exists out of the box, but I have done something similar manually.
We set up cron jobs that run every 20 minutes, look for new task attempt logs, and pump them all into a specific directory in HDFS. We modified the file names so that the hostname they came from is appended. Then we had MapReduce jobs try to find issues, calculate stats like runtimes, etc. It was pretty neat. We did something similar with NameNode logs, too.
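Nothing here is from the original answer's scripts, but a minimal Java sketch of what each node's periodic copy step might look like (the userlog location is the one from the question; the target directory and class name are illustrative):

import java.io.File;
import java.net.InetAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShipTaskLogs {
    public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        String host = InetAddress.getLocalHost().getHostName();
        File[] attemptDirs = new File("/var/log/hadoop-0.20-mapreduce/userlogs").listFiles();
        if (attemptDirs == null) {
            return; // nothing to ship, or the path does not exist on this node
        }
        for (File dir : attemptDirs) {
            // Append the hostname so logs from different slaves don't collide.
            Path target = new Path("/archive/userlogs/" + dir.getName() + "_" + host);
            if (!hdfs.exists(target)) {
                hdfs.copyFromLocalFile(new Path(dir.getAbsolutePath()), target);
            }
        }
    }
}

A cron entry on each slave could then run this (or an equivalent shell script) every 20 minutes, as described above.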

When do the results from a mapper task get deleted from disk?

When do the outputs for a mapper task get deleted from the local filesystem? Do they persist until the entire job completes or do they get deleted at an earlier time than that?
In addition to the map and reduce tasks, two further tasks are created: a job setup task and a job cleanup task. These are run by tasktrackers and are used to run code to setup the job before any map tasks run, and to cleanup after all the reduce tasks are complete. The OutputCommitter that is configured for the job determines the code to be run, and by default this is a FileOutputCommitter. For the job setup task it will create the final output directory for the job and the temporary working space for the task output, and for the job cleanup task it will delete the temporary working space for the task output.
Have a look at OutputCommitter.
If your hadoop.tmp.dir is set to a default setting (say, /tmp/), it will most likely be subject to tmpwatch and any default settings in your OS. I would suggest poking around in /etc/cron.d/, /etc/cron.daily/, /etc/cron.weekly/, etc., to see exactly what your OS defaults look like.
One thing to keep in mind about tmpwatch is that, by default, it will key on access time, not modification time (i.e., files that have not been 'touched' since X will be considered 'stale' and subject to removal). However, it's a common practice with Hadoop to mount filesystems with the noatime and nodiratime flags, meaning that access times will not get updated and thus skewing your tmpwatch behaviors.
Otherwise, Hadoop will purge task attempt logs older than 24 hours (after task completion), by default. While a few years old, this writeup has some great info on the default behaviors. Take a look in particular at the sections that refer to mapreduce.job.userlog.retain.hours.
EDIT: responding to OP's comment, which clears up my misunderstanding of the question:
As far as the intermediate output of map tasks which is spilled to disk, used by any combiners, and copied to any reducers, the Hadoop Definitive Guide has this to say:
Tasktrackers do not delete map outputs from disk as soon as the first reducer has retrieved them, as the reducer may fail. Instead, they wait until they are told to delete them by the jobtracker, which is after the job has completed.
Source
I've also +1'd mgs's answer below, as they have linked the source code that controls this and described the job cleanup task.
So, yes, the map output data is deleted immediately after the job completes, successfully or not, and no sooner.
"Tasktrackers do not delete map outputs from disk as soon as the first reducer has retrieved them, as the reducer may fail. Instead, they wait until they are told to delete them by the jobtracker, which is after the job has completed"
Hadoop: The Definitive Guide (Section 6.4)

Oozie/Hadoop: How do I define an input dataset when it's more complex than just a static file?

I'm trying to run an existing Hadoop job using Oozie (I'm migrating from AWS).
In AWS MapReduce I submit jobs programmatically, so before the job is submitted my code programmatically finds the input.
My input happens to be the last SUCCESSFUL run of another job. To find the last SUCCESSFUL run I need to scan an HDFS folder, sort by the timestamp embedded in the folder naming convention, and find the most recent folder with an _SUCCESS file in it.
How to do this is beyond my oozie-newbie comprehension.
Can someone simply describe for me what I need to configure in Oozie so I have some idea of what I'm attempting to reach for here?
Take a look at the following Oozie configuration: https://github.com/cloudera/cdh-twitter-example/blob/master/oozie-workflows/coord-app.xml
There is a tag called "done-flag" where you can put the _SUCCESS file in order to trigger a workflow, or in your case a MapReduce job. There are also parameters for scheduling the job:
${coord:current(1 + (coord:tzOffset() / 60))}
....
