Is there any Pig map task completion hook? - hadoop

I have a piece of code that I want to run at the end of each map task spawned by Pig. In other words, I need to do some work just before each map task exits. Here is what my research yielded:
We could call PigProgressNotificationListener.jobFinishedNotification(), but this method is called on completion of the whole job, not on completion of every (internal) map task.
The finish() method in a UDF: this is called at the end of the UDF, so it doesn't meet my requirement.
I am a beginner in the MR world.

In Hadoop's implementation of MapReduce, there are setup and cleanup methods that are called at the start and end of each mapper respectively, and which the developer can override to add the desired functionality.
So, if your Pig script is not too complicated to express as a series of MapReduce programs, you can exploit these methods.
I'm sure that Pig is advanced enough to support such functionality as well, so just look up the Pig equivalent of these methods.
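For reference, the hooks look roughly like this in a plain Hadoop (new API) mapper. This is only a sketch: the class name, the key/value types, and the flushSideEffects() helper are made up for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal sketch of a plain Hadoop mapper using setup()/cleanup().
public class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Runs once before the first call to map() in this task,
        // e.g. open connections or load side data.
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, new LongWritable(1));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Runs once after the last call to map() in this task,
        // i.e. "just before the map task exits".
        flushSideEffects();
    }

    private void flushSideEffects() {
        // Placeholder for whatever per-task finalization you need.
    }
}
```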

Related

Oozie for multiple mapreduce jobs

I have a sequence of MapReduce jobs that need to be run. I was wondering if there is any advantage to using Oozie for that, instead of having "one big driver" that runs the sequence.
I know that Oozie can be used to run multiple actions of different types, e.g. a Pig script, a shell script, or an MR job, but I'm specifically interested in whether I should split my two jobs and run them using Oozie, or have a single jar do that.
Oozie is a scheduler - crude, poorly documented, but a scheduler.
If you don't need scheduling per se, or if CRON on an edge node is sufficient;
if you want to handle your workflow logic by yourself (e.g. conditional branching, parallel executions w/ waiting for stragglers, calling generic sub-workflows w/ ad hoc parameters, e-mail alerts on errors, <insert your pet feature here>) or don't need any fancy logic;
if you handle your execution logs and state history by yourself, or don't care about history;
... well, don't use a scheduler.
PS: you also have Luigi (Spotify) and Azkaban (LinkedIn) as alternative Hadoop schedulers.
[edit] An extra point to consider: if your "driver" crashes for whatever reason, you may not have a chance to send an alert; but if it is run from Oozie, the crash will eventually be detected (it may take as much as 30 min. in a corner case, e.g. AM job self-destruction due to a YARN RM failover).

Run non-blocking series of jobs

A certain number of jobs need to be executed in a sequence, such that the result of one job is the input to another. There is also a loop in one part of the job chain. Currently I run this sequence by waiting for each job to complete, but I'm going to start the sequence from a web service, so I don't want to get stuck waiting for a response. I want to start the sequence and return.
How can I do that, considering that the jobs depend on each other?
The typical approach I follow is to use an Oozie workflow to chain the sequence of jobs, passing the dependent inputs to them accordingly.
I used a shell script to invoke the Oozie job.
I am not sure about loops within an Oozie workflow, but the link below describes a way to implement loops using a sub-workflow. Hope it helps you.
http://zapone.org/bernadette/2015/01/05/how-to-loop-in-oozie-using-sub-workflow/
Apart from this, the JobControl class is also a good option if the jobs need to run in sequence, and it requires less effort to implement. Loops would also be easy, since everything is done in Java code; see the sketch after the links below.
http://gandhigeet.blogspot.com/2012/12/hadoop-mapreduce-chaining.html
https://cloudcelebrity.wordpress.com/2012/03/30/how-to-chain-multiple-mapreduce-jobs-in-hadoop/
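As a rough illustration of the JobControl approach, the sketch below chains two dependent jobs and runs the chain in a background thread so the caller (e.g. a web service handler) does not block. ChainDriver, jobA and jobB are hypothetical names; the jobs are assumed to be fully configured (mapper, reducer, input/output paths) elsewhere.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

// Sketch of chaining two dependent jobs with JobControl.
public class ChainDriver {
    public static void runChain(Job jobA, Job jobB) throws Exception {
        ControlledJob first = new ControlledJob(jobA.getConfiguration());
        first.setJob(jobA);

        ControlledJob second = new ControlledJob(jobB.getConfiguration());
        second.setJob(jobB);
        second.addDependingJob(first);   // jobB starts only after jobA succeeds

        JobControl control = new JobControl("my-chain");
        control.addJob(first);
        control.addJob(second);

        // JobControl is a Runnable, so the chain can run in a background
        // thread and the caller returns immediately instead of blocking.
        new Thread(control).start();
    }
}
```

If you later need the outcome, you can poll control.allFinished() and inspect control.getFailedJobList() rather than waiting synchronously.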

How to schedule post processing task after a mapreduce job

I'm looking for a simple method to chain post-processing code after a MapReduce job.
Specifically, it involves renaming/moving the output files created by org.apache.hadoop.mapred.lib.MultipleOutputs (the class has limitations on the output file names, so I can't produce the files directly in the MapReduce job).
The options I know of (or can think of) are:
add it in the job creation code - this is what I do now, but I would prefer that the task be scheduled by the JobTracker (to reduce the chances of the process being aborted)
using a workflow engine (Luigi, Oozie) - but this seems like overkill for this issue
using job chaining - this allows chaining MapReduce jobs - is it possible to chain a "simple" task?
Your "simple" task should be a Mapper-only job. Your Map() receives as key the file name and renames the file. For this you have to write your own InputFormat and RecordReader, like in the links, but your RecordReader should not actually read the file, just return the file name in getCurrentKey():
https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3
https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileRecordReader.java?r=3
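Below is a rough sketch of what that mapper-only rename step could look like, assuming an input format (along the lines of the WholeFileInputFormat in the links) that emits one record per file with the file path as the key. RenameMapper and the targetName() helper are hypothetical names for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only "rename" step: one record per file, file path as the key.
public class RenameMapper extends Mapper<Text, NullWritable, NullWritable, NullWritable> {

    @Override
    protected void map(Text filePath, NullWritable ignored, Context context)
            throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path source = new Path(filePath.toString());
        Path target = new Path(source.getParent(), targetName(source.getName()));
        // Rename/move the MultipleOutputs file; no file data is read or written here.
        fs.rename(source, target);
    }

    private String targetName(String original) {
        // Placeholder: derive the desired name from the original one.
        return original.replace("-m-", "-part-");
    }
}
```

Remember to set the number of reduce tasks to 0 (job.setNumReduceTasks(0)) so the job stays map-only.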

What is the difference between job.submit and job.waitForCompletion in Apache Hadoop?

I have read the documentation so I know the difference.
My question, however, is: is there any risk in using .submit instead of .waitForCompletion if I want to run several Hadoop jobs on a cluster in parallel?
I mostly use Elastic Map Reduce.
When I tried doing so, I noticed that only the first job was being executed.
If your aim is to run jobs in parallel, then there is certainly no risk in using job.submit(). The main reason job.waitForCompletion exists is that its call returns only when the job has finished, and it returns the job's success or failure status, which can be used to decide whether further steps should run.
Now, getting back to your seeing only the first job being executed: this is because by default Hadoop schedules jobs in FIFO order. You can certainly change this behaviour. Read more here.
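To make the contrast concrete, here is a minimal sketch of both styles in a driver. ParallelSubmit and the jobs list are made up for illustration; each Job is assumed to be fully configured and independent of the others.

```java
import java.util.List;
import org.apache.hadoop.mapreduce.Job;

public class ParallelSubmit {
    public static void submitAll(List<Job> jobs) throws Exception {
        // Non-blocking: every job is handed to the cluster immediately,
        // so they can run in parallel (scheduler permitting).
        for (Job job : jobs) {
            job.submit();
        }
        // Later, poll for status if the outcome matters:
        for (Job job : jobs) {
            while (!job.isComplete()) {
                Thread.sleep(5000);
            }
            System.out.println(job.getJobName() + " succeeded: " + job.isSuccessful());
        }
    }

    public static void runSequentially(List<Job> jobs) throws Exception {
        // Blocking: waitForCompletion() returns only when the job finishes,
        // so each job starts only after the previous one is done.
        for (Job job : jobs) {
            if (!job.waitForCompletion(true)) {
                throw new IllegalStateException(job.getJobName() + " failed");
            }
        }
    }
}
```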

Job step loop for LoadLeveler job scripts?

I'm using LoadLeveler to submit jobs on an IBM/BlueGene architecture. I read the documentation from IBM and also gave Google a try, but I cannot find how to do the following, which I expect should be possible:
One can use the queue keyword to tell LoadLeveler that a new job step is being described, so I could do something like
first_step
queue
second_step
queue
but what I fail to find is a way to do something like
loop job_id = 1,10
do_job_with_given_job_id
end
Do I have to write a "normal" shell script that in turn calls a LoadLeveler script a number of times, or is there some built-in loop mechanism? I know that other job managers can do this.
When this comes up, we normally just recommend writing a shell script which generates the job submission script or scripts; that's what I do for my own jobs. Do these steps have dependencies on each other?
Also, just out of curiosity, which schedulers/resource managers can queue multiple jobs within a loop in a submission script? Not the PBS-based ones...
