How to delete input files after successful mapreduce - hadoop

We have a system that receives archives on a specified directory and on a regular basis it launches a mapreduce job that opens the archives and processes the files within them. To avoid re-processing the same archives the next time, we're hooked into the close() method on our RecordReader to have it deleted after the last entry is read.
The problem with this approach (we think) is that if a particular mapping fails, the next mapper that makes another attempt at it finds that the original file has been deleted by the record reader from the first one and it bombs out. We think the way to go is to hold off until all the mapping and reducing is complete and then delete the input archives.
Is this the best way to do this?
If so, how can we obtain a listing of all the input files found by the system from the main program? (we can't just scrub the whole input dir, new files may be present)
i.e.:
. . .
job.waitForCompletion(true);
(we're done, delete input files, how?)
return 0;
}

Couple comments.
I think this design is heartache-prone. What happens when you discover that someone deployed a messed up algorithm to your MR cluster and you have to backfill a month's worth of archives? They're gone now. What happens when processing takes longer than expected and a new job needs to start before the old one is completely done? Too many files are present and some get reprocessed. What about when the job starts while an archive is still in flight? Etc.
One way out of this trap is to have the archives go to a rotating location based on time, and either purge the records yourself or (in the case of something like S3) establish a retention policy that allows a certain window for operations. Also whatever the back end map reduce processing is doing could be idempotent: processing the same record twice should not be any different than processing it once. Something tells me that if you're reducing your dataset, that property will be difficult to guarantee.
At the very least you could rename the files you processed instead of deleting them right away and use a glob expression to define your input that does not include the renamed files. There are still race conditions as I mentioned above.
You could use a queue such as Amazon SQS to record the delivery of an archive, and your InputFormat could pull these entries rather than listing the archive folder when determining the input splits. But reprocessing or backfilling becomes problematic without additional infrastructure.
All that being said, the list of splits is generated by the InputFormat. Write a decorator around that and you can stash the split list wherever you want for use by the master after the job is done.

The simplest way would probably be do a multiple input job, read the directory for the files before you run the job and pass those instead of a directory to the job (then delete the files in the list after the job is done).

Based on the situation you are explaining I can suggest the following solution:-
1.The process of data monitoring I.e monitoring the directory into which the archives are landing should be done by a separate process. That separate process can use some metadata table like in mysql to put status entries based on monitoring the directories. The metadata entries can also check for duplicacy.
2. Now based on the metadata entry a separate process can handle the map reduce job triggering part. Some status could be checked in metadata for triggering the jobs.

I think you should use Apache Oozie to manage your workflow. From Oozie's website (bolding is mine):
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
...
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availabilty.

Related

Synchronize NiFi process groups or flows that don't/can't connect?

Like the question states, is there some way to synchronize NiFi process groups or pipelines that don't/can't connect in the UI?
Eg. I have a process where I want to getFTP->putHDFS->moveHDFS (which ends up actually being getFTP->putHDFS->listHDFS->moveHDFS, see https://stackoverflow.com/a/50166151/8236733). However, listHDFS does not seem to take any incoming connections. Trying to do something with process groups like P1{getFTP->putHDFS->outport}->P2{inport->listHDFS->moveHDFS} also runs into the same problem (listHDFS can't seem to take any incoming connections). We don't want to moveHDFS before we ever even get anything from getFTP, but given the above, I don't see how these actions can be synchronized to occur in the right order.
New to NiFi, but I imagine this is a common use case and there must be some NiFi-ish way of doing this that I am missing. Advice in this would be appreciated. Thanks.
I'm not sure what requirement is preventing you from writing the file retrieved from FTP directly to the desired HDFS location, or if this is a "write n files to HDFS with a . starting the filename and then rename all when some certain threshold is reached" scenario.
ListHDFS does not take any incoming relationships because it should not be triggered by an incoming event, but rather on a timer/CRON schedule. Every time it runs, it will produce n flowfiles, where each references an HDFS file that has been detected to be written to the filesystem since the last execution. To do this, the processor stores local state.
Your flow segments do not need to be connected in this case. You'll have "flow segment A" which performs the FTP -> HDFS writing (GetFTP -> PutHDFS) and you'll have an independent "flow segment B" which lists the HDFS directory, reads the file descriptors (but not the content of the file unless you use FetchHDFS as well) and moves them (ListHDFS -> MoveHDFS). The ListHDFS processor will run constantly, but if it does not detect any new files during a run, it will simply yield and perform a no-op. Once the PutHDFS processor completes the task of writing a file to the HDFS file system, on the next ListHDFS execution, it will detect that file and generate a flowfile describing it.
You can tune the scheduling to your liking, but in general this is a very common pattern in NiFi flows.

Re-use files in Hadoop Distributed cache

I am wondering if someone can explain how the distributed cache works in Hadoop. I am running a job many times, and after each run I notice that the local distributed cache folder on each node is growing in size.
Is there a way for multiple jobs to re-use the same file in the distributed cache? Or is the distributed cache only valid for the lifetime of any individual job?
The reason I am confused is that the Hadoop documentation mentions that "DistributedCache tracks modification timestamps of the cache files", so this leads me to believe that if the time stamp hasn't changed, then it should not need to re-cache or re-copy the files to the nodes.
I am adding files successfully to the distributed cache using:
DistributedCache.addFileToClassPath(hdfsPath, conf);
DistributedCache uses reference counting to manage the caches. org.apache.hadoop.filecache.TrackerDistributedCacheManager.CleanupThread is in charge of cleaning up the CacheDirs whose reference count is 0. It will check every minute (default period is 1 minute, you can set it by "mapreduce.tasktracker.distributedcache.checkperiod").
When a Job finishes or fails, JobTracker will send a org.apache.hadoop.mapred.KillJobAction to the TaskTrackers. Then if a TaskTracker receives a KillJobAction, it puts the action to tasksToCleanup. In the TaskTracker, there is a background Thread called taskCleanupThread which takes the action from tasksToCleanup and do the cleanup work. For a KillJobAction, it will invoke purgeJob to clean up the Job. In this method, it will decrease the reference count used by this Job (rjob.distCacheMgr.release();).
The above analysis bases on hadoop-core-2.0.0-mr1-cdh4.2.1-sources.jar. I also checked the hadoop-core-0.20.2-cdh3u1-sources.jar and found there was a litte difference between this two versions. For example, there was not a org.apache.hadoop.filecache.TrackerDistributedCacheManager.CleanupThread in 0.20.2-cdh3u1. When initializing a Job, TrackerDistributedCacheManager will check if there is enough space to put the new caches files for this Job. If not, it will delete the caches which have 0 reference count.
If you are using cdh4.2.1, you can increase "mapreduce.tasktracker.distributedcache.checkperiod" to let the clean up work delay. Then the probability that multiple Jobs use the same distributed cache is increased.
If you are using cdh3u1, you can increase the limitation of the cache size("local.cache.size", default is 10G) and the max directories for caches("mapreduce.tasktracker.cache.local.numberdirectories", default is 10000). This can be also applied to cdh4.2.1.
If you look closely at what this book says, is that there is a limit of what can be stored in Distributed Cache. By default it's 10GB (configurable). There can be multiple different jobs running in the cluster concurrently. Furthermore, Hadoop kind of guarantees the files stay available in the cache for a single job as it is maintained by reference count done by the tasktracker for different tasks accessing the files in cache. In your case, for subsequent Jobs, the files may not be there as they are already marked for deletion.
Please correct me if you disagree anywhere. I'll be glad to discuss this further.
According to this: http://www.datasalt.com/2011/05/handling-dependencies-and-configuration-in-java-hadoop-projects-efficiently/
You should be able to do this via DistributedCache API instead of "-libjars"

When do the results from a mapper task get deleted from disk?

When do the outputs for a mapper task get deleted from the local filesystem? Do they persist until the entire job completes or do they get deleted at an earlier time than that?
In addition to the map and reduce tasks, two further tasks are created: a job setup task
and a job cleanup task. These are run by tasktrackers and are used to run code to setup
the job before any map tasks run, and to cleanup after all the reduce tasks are complete.
The OutputCommitter that is configured for the job determines the code to be run, and
by default this is a FileOutputCommitter. For the job setup task it will create the final
output directory for the job and the temporary working space for the task output, and
for the job cleanup task it will delete the temporary working space for the task output.
Have a look at OutputCommitter.
If your hadoop.tmp.dir is set to a default setting (say, /tmp/), it will most likely be subject to tmpwatch and any default settings in your OS. I would suggest poking around in /etc/cron.d/, /etc/cron.daily, etc/cron.weekly/, etc., to see exactly what your OS default is like.
One thing to keep in mind about tmpwatch is that, by default, it will key on access time, not modification time (i.e., files that have not been 'touched' since X will be considered 'stale' and subject to removal). However, it's a common practice with Hadoop to mount filesystems with the noatime and nodiratime flags, meaning that access times will not get updated and thus skewing your tmpwatch behaviors.
Otherwise, Hadoop will purge task attempt logs older than 24 hours (after task completion), by default. While a few years old, this writeup has some great info on the default behaviors. Take a look in particular at the sections that refer to mapreduce.job.userlog.retain.hours.
EDIT: responding to OP's comment, which clears up my misunderstanding of the question:
As far as the intermediate output of map tasks which is spilled to disk, used by any combiners, and copied to any reducers, the Hadoop Definitive Guide has this to say:
Tasktrackers do not delete map outputs from disk as soon as the first
reducer has retrieved them, as the reducer may fail. Instead, they
wait until they are told to delete them by the jobtracker, which is
after the job has completed.
Source
I've also +1'd #mgs answer below, as they have linked the source code that controls this and described the Job cleanup task.
So, yes, the map output data is deleted immediately after the job completes, successfully or not, and no sooner.
"Tasktrackers do not delete map outputs from disk as soon as the first reducer has retrieved them, as the reducer may fail. Instead, they wait until they are told to delete them by the jobtracker, which is after the job has completed"
Hadoop: The Definitive Guide ( Section 6.4)

Oozie/Hadoop: How do I define an input dataset when it's more complex than just a static file?

I'm trying to run an existing Hadoop job using Oozie (I'm migrating from AWS).
In AWS Mapreduce I programmatically submit jobs, so before the job is submitted, my code programmatically find the input.
My input happens to be the last SUCCESSFUL run of another job. To find the last SUCCESSFUL run I need to scan an HDFS folder, sort by the timestamp embedded in the folder naming convention, and find the most recent folder with an _SUCCESS file in it.
How to do this is beyond my oozie-newbie comprehension.
Can someone simply describe for me what I need to configure in Oozie so I have some idea of what I'm attempting to reach for here?
Take a look to the following configuration for oozie: https://github.com/cloudera/cdh-twitter-example/blob/master/oozie-workflows/coord-app.xml
There is a tag called "done-flag" there you can put the _SUCCESS file in order to trigger a workflow or for your case a map reduce job. There are also parameter for scheduling the job
${coord:current(1 + (coord:tzOffset() / 60))}
....

Hadoop Operationalization

I have all the pieces of a hadoop implementation ready - I have a running cluster, and a client writer that is pushing activity data into HDFS. I have a question about what happens next. I understand that we run jobs against the data that has been dumped into HDFS, but my questions are:
1) First off, I am writing into the stream and flushing periodically - I am writing the files via a thread in the HDFS java client, and I don't see the files appear in HDFS until I kill my server. If I write enough data to fill a block, will that automatically appear in the file system? How do I get to a point where I have files that are ready to be processed by M/R jobs?
2) When do we run M/R jobs? Like I said, I am writing the files via a thread in the HDFS java client, and that thread has a lock on the file for write. At what point should I release that file? How does this interaction work? At what point is it 'safe' to run a job against that data, and what happens to the data in HDFS when its done?
I would try to avoid "hard" synchronization between data insertion into hadoop and processing results. I mean that in many cases it is most practical to have to asynchronious processes:
a) One process putting files into HDFS. In many cases -building directory structure by dates is usefull.
b) Run jobs for all but most recent data.
You can run job on most recent data, but application should not relay on up to the minute results. In any case job usually takes more then a few minutes in any case
Another point - append is not 100% mainstream but advanced thing built for HBase. If you build your app without usage of it - you will be able to work with other DFS's like amazon s3 which do not support append. We are collecting data in local file system, and then copy them to HDFS when file is big enough.
write the data to fill a block , you will see the file in the system
M/R is submitted to the scheduler , which takes care of running it against data, we need not worry abt

Resources