How to schedule a post-processing task after a MapReduce job - Hadoop

I'm looking for a simple method to chain post-processing code after a map reduce job.
Specifically, it involves renaming/moving the output files created by org.apache.hadoop.mapred.lib.MultipleOutputs (the class has limitations on the output file names, so I can't produce the files directly in the mapreduce job).
The options I know of (or can think of) are:
add it in the job creation code - this is what I do now, but I'd prefer the task to be scheduled by the jobtracker (to reduce the chances of the process being aborted)
using a workflow engine (Luigi, Oozie) - but this seems like overkill for this issue
using job chaining - this allows chaining mapreduce jobs - is it possible to chain a "simple" task?

Your "simple" task should be a Mapper-only job. Your Map() receives as key the file name and renames the file. For this you have to write your own InputFormat and RecordReader, like in the links, but your RecordReader should not actually read the file, just return the file name in getCurrentKey():
https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3
https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileRecordReader.java?r=3
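A minimal sketch of that idea, assuming the new mapreduce API (the class names FileNameInputFormat and RenameMapper, and the "renamed-" naming rule, are illustrative): the InputFormat hands each map() one record per file, with the file path as the key, and the RecordReader never opens the file.

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileNameInputFormat extends FileInputFormat<Text, NullWritable> {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;                                  // one split per file, so one record per file
  }

  @Override
  public RecordReader<Text, NullWritable> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
    return new RecordReader<Text, NullWritable>() {
      private Path file;
      private boolean consumed = false;

      @Override public void initialize(InputSplit s, TaskAttemptContext c) {
        file = ((FileSplit) s).getPath();          // remember the path, never read the bytes
      }
      @Override public boolean nextKeyValue() {
        if (consumed) return false;
        consumed = true;                           // emit exactly one record
        return true;
      }
      @Override public Text getCurrentKey()           { return new Text(file.toString()); }
      @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
      @Override public float getProgress()            { return consumed ? 1.0f : 0.0f; }
      @Override public void close()                   { }
    };
  }
}

The map-only job then just renames each file it is handed:

class RenameMapper extends Mapper<Text, NullWritable, NullWritable, NullWritable> {
  @Override
  protected void map(Text key, NullWritable value, Context context)
      throws IOException, InterruptedException {
    Path src = new Path(key.toString());
    Path dst = new Path(src.getParent(), "renamed-" + src.getName());   // illustrative naming rule
    FileSystem.get(context.getConfiguration()).rename(src, dst);
  }
}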

Related

Is there any Pig map task completion hook?

I have a piece of code that I want to run at the end of each of the map tasks spawned by Pig. In other words, I need to do some work just before each map task exits. Here is what my research yielded:
We could call PigProgressNotificationListener.jobFinishedNotification(), but this method is called on completion of the whole job, not on completion of every (internal) map task.
The finish() method in a UDF: called at the end of the UDF, so it doesn't meet my requirement.
I am a beginner in the MR world.
In Hadoop's implementation of MapReduce, there are setup and cleanup functions that are respectively called at the start and end of each of the mappers, and which the developer can override to get the desired functionality.
So, if your Pig script is not that complicated to express as a series of MapReduce programs, you can exploit these functions.
I'm sure that Pig is advanced enough to support such functionality as well. So, just look up the Pig equivalent of these functions.
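A minimal sketch of those hooks in a Hadoop (new API) mapper; HookedMapper and the key/value types are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HookedMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void setup(Context context) {
    // runs once per map task, before the first call to map()
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // per-record work goes here
  }

  @Override
  protected void cleanup(Context context) {
    // runs once per map task, just before the task exits; this is the hook for end-of-task work
  }
}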

How are output files handled in Hadoop at the job and task level?

As per The Definitive Guide, setupJob() of OutputCommitter will create the MapReduce output directory and also set up the temporary workspace for tasks: ${mapred.output.dir}/_temporary
Then the book says temporary directories at the task level are created when task outputs are written.
The above two statements are kind of confusing.
So basically a MapReduce job consists of many tasks, i.e. map tasks and reduce tasks. The mapreduce output directory is the directory where the final output of the MapReduce job is written. When the job runs, each map task and reduce task generates intermediate files that are local to the node where the task runs. This per-task local output is intermediate and is written to the temporary workspace. Finally, after shuffling and the other phases, this intermediate output is written to HDFS as the final output, based on the logic you apply in your MapReduce job. I hope that answers your question.
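As a small illustration of the two places involved (a hedged sketch, assuming the new mapreduce API; SideFileReducer is an illustrative name):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SideFileReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // the job-level output directory (the final destination)
    Path jobOutput = FileOutputFormat.getOutputPath(context);
    // the task-level scratch directory under _temporary; anything written here
    // is promoted into the job output directory only if the task commits
    Path taskWork = FileOutputFormat.getWorkOutputPath(context);
    System.out.println("job output: " + jobOutput + ", task work dir: " + taskWork);
  }
}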

Oozie/Hadoop: How do I define an input dataset when it's more complex than just a static file?

I'm trying to run an existing Hadoop job using Oozie (I'm migrating from AWS).
In AWS MapReduce I submit jobs programmatically, so before the job is submitted, my code programmatically finds the input.
My input happens to be the last SUCCESSFUL run of another job. To find the last SUCCESSFUL run I need to scan an HDFS folder, sort by the timestamp embedded in the folder naming convention, and find the most recent folder with a _SUCCESS file in it.
How to do this is beyond my oozie-newbie comprehension.
Can someone simply describe for me what I need to configure in Oozie so I have some idea of what I'm attempting to reach for here?
Take a look at the following configuration for Oozie: https://github.com/cloudera/cdh-twitter-example/blob/master/oozie-workflows/coord-app.xml
There is a tag called "done-flag" where you can put the _SUCCESS file in order to trigger a workflow or, in your case, a map reduce job. There are also parameters for scheduling the job:
${coord:current(1 + (coord:tzOffset() / 60))}
....
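For illustration, a hedged sketch of the coordinator pieces that example points at (the dataset name, paths, frequencies and property names below are made up; only the <done-flag> element and the dataset/input-events layout follow the linked coord-app.xml pattern):

<coordinator-app name="post-process-coord" frequency="${coord:days(1)}"
                 start="2013-01-01T00:00Z" end="2014-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <dataset name="upstream" frequency="${coord:days(1)}"
             initial-instance="2013-01-01T00:00Z" timezone="UTC">
      <!-- folder naming convention with the timestamp embedded in the path -->
      <uri-template>${nameNode}/data/upstream/${YEAR}${MONTH}${DAY}</uri-template>
      <!-- an instance only counts as "ready" once its _SUCCESS file exists -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="latestRun" dataset="upstream">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${workflowAppUri}</app-path>
      <configuration>
        <property>
          <name>inputDir</name>
          <value>${coord:dataIn('latestRun')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>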

A tool showing a breakdown of completion times and source machine names for each and every mapper and reducer?

I know the job tasks page (in the JobTracker UI) already shows the start time and end time of every map and reduce task, but I would like to see something more, like source machine names, number of spills and so on. I guess I could try to write such a tool using the JobTracker class? But before embarking on that, I would like to see if such a tool already exists.
Does the hadoop job -history all output-dir command give you enough information to parse / process?
http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html - Search for the above command

Map Reduce: ChainMapper and ChainReducer

I need to split my MapReduce jar file into two jobs in order to get two different output files, one from each reducer of the two jobs.
I mean that the first job has to produce an output file that will be the input for the second job in the chain.
I read something about ChainMapper and ChainReducer in Hadoop version 0.20 (currently I am using 0.18): could those be good for my needs?
Can anybody suggest some links where I can find examples of how to use those classes? Or maybe there is another way to achieve this?
Thank you,
Luca
There are many ways you can do it.
Cascading jobs
Create the JobConf object "job1" for the first job and set all the parameters with "input" as the input directory and "temp" as the output directory. Execute this job: JobClient.runJob(job1).
Immediately below it, create the JobConf object "job2" for the second job and set all the parameters with "temp" as the input directory and "output" as the output directory. Execute this job: JobClient.runJob(job2).
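A minimal sketch of that flow with the old mapred API (the mapper/reducer classes are omitted, and "input", "temp" and "output" are placeholder paths):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoStageDriver {
  public static void main(String[] args) throws Exception {
    JobConf job1 = new JobConf(TwoStageDriver.class);
    job1.setJobName("stage-1");
    job1.setOutputKeyClass(Text.class);
    job1.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job1, new Path("input"));
    FileOutputFormat.setOutputPath(job1, new Path("temp"));
    JobClient.runJob(job1);                                   // blocks until the first job finishes

    JobConf job2 = new JobConf(TwoStageDriver.class);
    job2.setJobName("stage-2");
    job2.setOutputKeyClass(Text.class);
    job2.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job2, new Path("temp"));    // output of job 1 is the input of job 2
    FileOutputFormat.setOutputPath(job2, new Path("output"));
    JobClient.runJob(job2);
  }
}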
Two JobConf objects
Create two JobConf objects and set all the parameters in them just as above, except that you don't call JobClient.runJob.
Then create two Job objects (org.apache.hadoop.mapred.jobcontrol.Job) with the JobConfs as parameters:
Job job1 = new Job(jobconf1);
Job job2 = new Job(jobconf2);
Using the JobControl object, you specify the job dependencies and then run the jobs:
JobControl jbcntrl=new JobControl("jbcntrl");
jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
job2.addDependingJob(job1);
jbcntrl.run();
ChainMapper and ChainReducer
If you need a structure somewhat like Map+ | Reduce | Map*, you can use the ChainMapper and ChainReducer classes that come with Hadoop version 0.19 and onwards. Note that in this case, you can use only one reducer but any number of mappers before or after it.
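A minimal sketch of that pattern with the old mapred API (AMap, BMap and MyReduce are illustrative stand-ins for your own classes):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainDriver {

  public static class AMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable k, Text v, OutputCollector<Text, Text> out, Reporter r)
        throws IOException {
      out.collect(new Text("key"), v);            // illustrative
    }
  }

  public static class BMap extends MapReduceBase implements Mapper<Text, Text, Text, Text> {
    public void map(Text k, Text v, OutputCollector<Text, Text> out, Reporter r)
        throws IOException {
      out.collect(k, v);                          // illustrative pass-through
    }
  }

  public static class MyReduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text k, Iterator<Text> vals, OutputCollector<Text, Text> out, Reporter r)
        throws IOException {
      while (vals.hasNext()) out.collect(k, vals.next());
    }
  }

  public static JobConf build() {
    JobConf job = new JobConf(ChainDriver.class);

    // two mappers chained inside the map phase (the Map+ part)
    ChainMapper.addMapper(job, AMap.class,
        LongWritable.class, Text.class, Text.class, Text.class, true, new JobConf(false));
    ChainMapper.addMapper(job, BMap.class,
        Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));

    // exactly one reducer, optionally followed by more mappers in the reduce phase (the Map* part)
    ChainReducer.setReducer(job, MyReduce.class,
        Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));
    ChainReducer.addMapper(job, BMap.class,
        Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));

    return job;
  }
}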
I think the job-chaining solutions above involve disk I/O for the intermediate output, and thus will slow down with large datasets. An alternative is to use Oozie or Cascading.