How to check the overall progress of PIG job - hadoop

A pig script can be translated into multiple MR jobs and I am wondering if there is an interface or a way to see the progress of the overall PIG script like how many jobs are scheduled, executed and so on.

We had the same problem at Twitter, as some of our Pig scripts spin up dozens of Map-Reduce jobs and it's sometimes hard to tell which of them is doing what, reason about efficiency of the plan, understand how many will run in parallel, etc.
So we created Twitter Ambrose: https://github.com/twitter/ambrose
It spins up a little jetty server which gives you a nice web ui that shows the job DAG, colors the nodes as the jobs complete, gives you stats about the jobs, and tells you which relations each job is trying to calculate.

There is a command illustrate but it throws an exception on my deployment. So I use another approach.
You can get the information on how many MR jobs are scheduled by using explain command and looking at the Physical Plan section, which is at the end of the explain report. To get the number of MR jobs for the script I do the following:
./pig -e 'explain -script ./script_name.pig' > ./explain.txt
grep MapReduce ./explain.txt | wc -l
Now we have the number of MR jobs planned. To monitor script execution, before you run it, you need to access Hadoop's jobtracker page (via "http://(IP_or_node_name):50030/jobtracker.jsp") and write down the name of last job (Completed Jobs section). Submit the script. Refresh the jobtracker page and count how many running jobs there are and how many are completed after the one you have noted. Now you can get an idea of how many jobs are left to be executed.
Click on each job and see its statistics and progress.
A much simpler approach would be to run the script on a small dataset, note down the number of jobs, it is displayed on the console output after the script execution. As pig does not change its execution plan, it will be the same with the big dataset. By looking into stats of each job on Hadoop's jobtracker page (via "http://(IP_or_node_name):50030/jobtracker.jsp") you can get the idea of the proportion of time each MR job takes. Than you can use it to approximately interpolate the execution time on large dataset. If you have skewed data and some Cartesian products, execution time prediction might become tricky.

Related

Oozie for multiple mapreduce jobs

I have a sequence of mapreduce jobs that need to be run. I was wondering if there is any advantage of using Oozie for that, instead of having "one big driver" that will run that sequence?
I know that Oozie can be used to run multiple actions of different type, e.g. pig script, shell script, mr job, but I'm concretely interested should I split my two jobs and run them using Oozie, or have a single jar to do that?
Oozie is a scheduler - crude, poorly documented, but a scheduler.
If you don't need scheduling per se, or if CRON on an edge node is sufficient
if you want to handle your workflow logic by yourself (e.g. conditional
branching, parallel executions w/ waiting for stragglers, calling
generic sub-workflows w/ ad hoc parameters, e-mail alerts on errors,
<insert your pet feature here>) or don't need any fancy logic
if you handle your executions logs and state history by yourself, or don't care about history
... well, don't use a scheduler.
PS: you also have Luigi (Spotify) and Azkaban (LinkedIn) as alternative Hadoop schedulers.
[edit] extra point to consider: if your "driver" crashes for whatever reason, you may not have a chance to send an alert; but if run from Oozie, the crash will be detected eventually (may take as much as 30 min. in a corner case e.g. AM job self-destruction due to YARN RM failover)

how to check that mapreduce is running parallelly?

I submitted a mapreduce job and checked the log.
In the log l see that there are many mappers, each mapper processes one split, and the processing details of each mapper is logged in the log file sequentially in time.
However, I would like to check if my job is running parallelly and I want to see how many mappers are running concurrently.
I don't know where to find these informations.
Please help me, thx!
Use following Jobtracker Web UI and drill down to executing MapReduce job
http://<Jobtracker-HostName>:50030/

Why are map tasks killed for no apparent reason?

I am running a Pig job that loads around 8 million rows from HBase (several columns) using HBaseStorage. The job finishes successfully and seems to produce the right results but when I look at the job details in the job tracker it says 50 map tasks were created of which 28 where successful and 22 were killed. The reduce ran fine. By looking at the logs of the killed map tasks there is nothing obvious to me as to why the tasks were killed. In fact the logs of successful and failed tasks are practically identical and both tasks are taking some reasonable time. Why are all these map tasks created and then killed? Is it normal or is it a sign of a problem?
This sounds like Speculative Execution in Hadoop. It runs the same task on several nodes and kills them when at least one completes. See the explanation this this book: https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/task-execution

What is the difference between job.submit and job.waitForComplete in Apache Hadoop?

I have read the documentation so I know the difference.
My question however is that, is there any risk in using .submit instead of .waitForComplete if I want to run several Hadoop jobs on a cluster in parallel ?
I mostly use Elastic Map Reduce.
When I tried doing so, I noticed that only the first job being executed.
If your aim is to run jobs in parallel then there is certainly no risk in using job.submit(). The main reason job.waitForCompletion exists is that it's method call returns only when the job gets finished, and it returns with it's success or failure status which can be used to determine that further steps are to be run or not.
Now, getting back at you seeing only the first job being executed, this is because by default Hadoop schedules the jobs in FIFO order. You certainly can change this behaviour. Read more here.

how to automatically run a bash script when my qsub jobs are finished on a server?

I would like to run a script when all of the jobs that I have sent to a server are done.
for example, I send
ssh server "for i in config*; do qsub ./run 1 $i; done"
And I get back a list of the jobs that were started. I would like to automatically start another script on the server to process the output from these jobs once all are completed.
I would appreciate any advice that would help me avoid the following inelegant solution:
If I save each of the 1000 job id's from the above call in a separate file, I could check the contents of each file against the current list of running jobs, i.e. output from a call to:
ssh qstat
I would only need to check every half hour, but I would imagine that there is a better way.
It depends a bit on what job scheduler you are using and what version, but there's another approach that can be taken too if your results-processing can also be done on the same queue as the job.
One very handy way of managing lots of related job in more recent versions of torque (and with grid engine, and others) is to launch the any individual jobs as a job array (cf. http://docs.adaptivecomputing.com/torque/4-1-4/Content/topics/commands/qsub.htm#-t). This requires mapping the individual runs to numbers somehow, which may or may not be convenient; but if you can do it for your jobs, it does greatly simplify managing the jobs; you can qsub them all in one line, you can qdel or qhold them all at once (while still having the capability to deal with jobs individually).
If you do this, then you could submit an analysis job which had a dependency on the array of jobs which would only run once all of the jobs in the array were complete: (cf. http://docs.adaptivecomputing.com/torque/4-1-4/Content/topics/commands/qsub.htm#dependencyExamples). Submitting the job would look like:
qsub analyze.sh -W depend=afterokarray:427[]
where analyze.sh had the script to do the analysis, and 427 would be the job id of the array of jobs you launched. (The [] means only run after all are completed). The syntax differs for other schedulers (eg, SGE/OGE) but the ideas are the same.
Getting this right can take some doing, and certainly Tristan's approach has the advantage of being simple, and working with any scheduler; but learning to use job arrays in this situation if you'll be doing alot of this may be worth your time.
Something you might consider is having each job script just touch a filename in a dedicated folder like $i.jobdone, and in your master script, you could simply use ls *.jobdone | wc -l to test for the right number of jobs done.
You can use wait to stop execution until all your jobs are done. You can even collect all the exit statuses and other running statistics (time it took, count of jobs done at the time, whatever) if you cycle around waiting for specific ids.
I'd write a small C program to do the waiting and collecting (if you have permissions to upload and run executables), but you can easily use the bash wait built-in for roughly the same purpose, albeit with less flexibility.
Edit: small example.
#!/bin/bash
...
waitfor=''
for i in tasks; do
task &
waitfor="$waitfor $!"
done
wait $waitfor
...
If you run this script in background, It won't bother you and whatever comes after the wait line will run when your jobs are over.

Resources