How to get the ID of a MapReduce job submitted by the `hadoop jar <example.jar> <main-class>` command? - hadoop

I want to write a shell script which submits a MapReduce job with the command hadoop jar <example.jar> <main-class>. How can I get the ID of the job submitted by that command, inside the shell script, right after the command is invoked?
I know that the command hadoop job -list displays all jobs' IDs, but in that case I can't tell which job was submitted by my shell script.
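One common approach (not given in the thread itself, and assuming the default MapReduce client, which prints a line such as "Running job: job_..." to its console output) is to capture the client output and extract the first job ID token. A minimal sketch, where example.jar and com.example.Main stand in for the question's placeholders:
# Run the job while capturing the client output to a log file
hadoop jar example.jar com.example.Main "$@" 2>&1 | tee job.log
# Pull the first job_<cluster-timestamp>_<sequence> token out of the log
job_id=$(grep -oE 'job_[0-9]+_[0-9]+' job.log | head -n 1)
echo "Submitted job: $job_id"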

Related

Why does Oozie submit a shell action to YARN?

I am currently learning Oozie and am a little curious about the shell action. I am executing a shell action which contains a shell command like
hadoop jar <jarPath> <FQCN>
While running this action there are two YARN jobs running:
one for the Hadoop job
one for the shell action
I don't understand why the shell action needs YARN for execution. I also tried the email action; it executes without YARN resources.
To answer this question, the difference is between
running a shell script independently (a .sh file or from the CLI)
running a shell script as part of an Oozie workflow (a shell script in an Oozie shell action)
The first case is straightforward.
In the second case, Oozie launches the shell script via YARN (the resource negotiator) so that it runs on the cluster where Oozie is installed, and internally uses an MR launcher job to start the shell action. So the shell script runs as a YARN application. The Oozie workflow logs show how the shell action is launched.

EMR kill PIG script

Is there a way of killing a running Pig script, not just the current Hadoop job?
As you know, a Pig script is translated into a DAG of Hadoop jobs. Assume everything runs smoothly up to some point in this graph but, for some reason, I want to stop the execution of this script/DAG. Is there an EMR command to do that?
I tried killing the current Hadoop job, and the execution of the Pig script appears to be CANCELLED, but the cluster/master node is left in a weird state which makes all subsequent Pig scripts fail instantly.

How to invoke an oozie workflow via shell script and block/wait till workflow completion

I have created a workflow using Oozie that is composed of multiple action nodes, and I have successfully been able to run it via a coordinator.
I want to invoke the Oozie workflow via a wrapper shell script.
The wrapper script should invoke the Oozie command, wait till the Oozie job completes (success or error), and return the Oozie success status code (0) or the error code of the failed Oozie action node (if any node of the workflow has failed).
From what I have seen so far, as soon as I invoke the Oozie command to run a workflow, the command exits with the job ID printed on the Linux console, while the Oozie job keeps running asynchronously in the background.
I want my wrapper script to block till the Oozie coordinator job completes and then return the success/error code.
Can you please let me know how/if I can achieve this using any of the Oozie features?
I am using Oozie version 3.3.2 and bash shell in Linux.
Note: In case anyone is curious about why I need such a feature - the requirement is that my wrapper shell script should know how long an Oozie job has been running and when it has completed, and return an exit code accordingly, so that the parent process calling the wrapper script knows whether the job completed successfully and, if it errored out, can raise an alert/ticket for the support team.
You can do that by capturing the job ID, then starting a loop that parses the output of oozie job -info. Below is the shell code for the same.
# Start the Oozie job
oozie_job_id=$(oozie job -oozie http://<oozie-server>/oozie -config job.properties -run);
echo $oozie_job_id;
sleep 30;
# Parse the job ID from the output; the output format is "job: <jobid>"
job_id=$(echo $oozie_job_id | sed -n 's/job: \(.*\)/\1/p');
echo $job_id;
# Check the job status at a regular interval to see whether it is still RUNNING
while true
do
    job_status=$(oozie job -oozie http://<oozie-server>/oozie -info $job_id | sed -n 's/Status\(.*\): \(.*\)/\2/p');
    if [ "$job_status" != "RUNNING" ];
    then
        echo "Job completed with status $job_status";
        break;
    fi
    # This sleep depends on your job; please change the value accordingly
    echo "sleeping for 5 minutes";
    sleep 5m
done
This is the basic way to do it; you can modify it as per your use case.
To upload the workflow definition to HDFS, use the following command:
hdfs dfs -copyFromLocal -f workflow.xml /user/hdfs/workflows/workflow.xml
To fire up the Oozie job you need the two commands below. Note that each must be written on a single line.
JOB_ID=$(oozie job -oozie http://<oozie-server>/oozie -config job.properties -submit)
oozie job -oozie http://<oozie-server>/oozie -start ${JOB_ID#*:} -config job.properties
You need to parse the result coming from the command below when its return code is 0; otherwise treat it as a failure. Simply loop, sleeping X amount of time after each attempt.
oozie job -oozie http://<oozie-server>/oozie -info ${JOB_ID#*:}
echo $?    # shows whether the command executed successfully or not
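For example, a minimal sketch of that polling loop (not part of the original answer) which blocks until the workflow ends and returns an exit code the calling process can inspect; SUCCEEDED/KILLED/FAILED are the standard Oozie workflow end states, and the 60-second sleep is an arbitrary choice:
while true; do
    # Extract the workflow status line from the -info output
    status=$(oozie job -oozie http://<oozie-server>/oozie -info ${JOB_ID#*:} | sed -n 's/Status\(.*\): \(.*\)/\2/p' | head -n 1)
    case "$status" in
        SUCCEEDED)      exit 0 ;;   # workflow finished cleanly
        KILLED|FAILED)  exit 1 ;;   # workflow ended in error
        *)              sleep 60 ;; # still RUNNING (or PREP); poll again
    esac
done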

How to interrupt PIG from DUMP -ing a huge file/variable in grunt mode?

How do we interrupt the Pig DUMP command (EDIT: when it has completed its MapReduce jobs and is now just displaying the result on the grunt shell) without exiting the grunt shell?
Sometimes, if we dump a HUGE file by mistake, it goes on forever!
I know we can use CTRL+C to stop it, but that also quits the grunt shell, and then we have to write all the commands again.
We can execute the following command in the grunt shell
kill jobid
We can find the job’s ID by looking at Hadoop’s JobTracker GUI, which lists all jobs currently running on the cluster. Note that this command kills a particular MapReduce job. If the Pig job contains other MapReduce jobs that do not depend on the killed MapReduce job, these jobs will still continue. If you want to kill all of the MapReduce jobs associated with a particular Pig job, it is best to terminate the process running Pig using CTRL+C, and then use this command to kill any MapReduce jobs that are still running.
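For reference, a hedged sketch of that cleanup step using the stock Hadoop job commands (the job ID below is hypothetical; Pig's MapReduce jobs usually carry a name starting with "PigLatin", but verify against your own JobTracker listing before killing anything):
# List the MapReduce jobs still running on the cluster
hadoop job -list
# Kill each remaining job that belongs to the interrupted Pig script
hadoop job -kill job_201601011234_0007   # hypothetical job ID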

How to check whether a file exists or not using hdfs shell commands

I am new to Hadoop and need a little help.
Suppose I run the job in the background using shell scripting; how do I know whether the job has completed or not? The reason I am asking is that once the job has completed, my script has to move the output file to some other location. How can I check whether the job has completed, or whether the output file exists, using HDFS?
Thanks
MRK
You need to be careful with how you detect that the job is done this way, because there might be output before your job has completely finished.
To answer your direct question: to test for existence I typically do hadoop fs -ls $output | wc -l and then make sure the number is greater than 0.
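As a hedged aside (not from the original answer), hdfs dfs -test returns an exit code directly, and a completed MapReduce job normally leaves a _SUCCESS marker in its output directory (assuming the default FileOutputCommitter behaviour), which makes for a stricter check:
# Exit code 0 if the path exists at all
hdfs dfs -test -e "$output" && echo "output path exists"
# Stricter: only move the output once the job has written its _SUCCESS marker
hdfs dfs -test -e "$output/_SUCCESS" && hadoop fs -mv "$output" "$new_output"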
My suggestion is that you use && to tack on the move:
hadoop ... myjob.jar ... && hadoop fs -mv $output $new_output &
This will complete the job, and then perform the move afterwards.
You can use JobConf.setJobEndNotificationURI() to get notified when the job gets completed.
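The same idea can also be driven from the command line through the mapreduce.job.end-notification.url property (assuming your driver parses generic options, e.g. via ToolRunner); Hadoop substitutes $jobId and $jobStatus into the URL when the job ends. A sketch with a hypothetical notification endpoint:
# The dollar signs must reach Hadoop unexpanded, hence the single quotes
hadoop jar example.jar com.example.Main \
    -D mapreduce.job.end-notification.url='http://monitor.example.com/notify?id=$jobId&status=$jobStatus'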
I think you can also check for the PID of the process that started the Hadoop job using the ps command.
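A hedged variation on that idea (not the answerer's exact method): rather than polling ps, the shell's built-in wait blocks on the background process directly. The jar and class names below are placeholders.
# Run the job in the background and remember its PID
hadoop jar myjob.jar com.example.Main &
job_pid=$!
# Block until the client process exits, then move the output only on success
wait $job_pid
if [ $? -eq 0 ]; then
    hadoop fs -mv "$output" "$new_output"
fi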
