I want to kill all my Hadoop jobs automatically when my code encounters an unhandled exception. What is the best practice for doing this?
Thanks
Depending on the version, do:
version <2.3.0
Kill a hadoop job:
hadoop job -kill $jobId
You can list all job IDs with:
hadoop job -list
version >=2.3.0
Kill a hadoop job:
yarn application -kill $ApplicationId
You can list all application IDs with:
yarn application -list
The following commands are deprecated:
hadoop job -list
hadoop job -kill $jobId
Consider using these instead:
mapred job -list
mapred job -kill $jobId
Run list to show all the jobs, then use the jobID/applicationID in the appropriate command.
Kill mapred jobs:
mapred job -list
mapred job -kill <jobId>
Kill yarn jobs:
yarn application -list
yarn application -kill <ApplicationId>
An unhandled exception will (assuming it's repeatable like bad data as opposed to read errors from a particular data node) eventually fail the job anyway.
You can configure the maximum number of times a particular map or reduce task can fail before the entire job fails through the following properties:
mapred.map.max.attempts - The maximum number of attempts per map task. In other words, the framework will try to execute a map task this many times before giving up on it.
mapred.reduce.max.attempts - Same as above, but for reduce tasks
If you want the job to fail at the first task failure, change these values from their default of 4 to 1.
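As a rough sketch of the fail-fast setting, the helper below builds `-D` options for the command line. It assumes Hadoop 2 or later, where these properties are spelled `mapreduce.map.maxattempts` and `mapreduce.reduce.maxattempts` (the `mapred.*` names above are the older spellings), and a driver that parses generic options through `ToolRunner`. The function name, jar, and class are hypothetical.

```shell
# Print generic options that cap map and reduce task attempts at N (default 1).
# Assumes the Hadoop 2+ property names; max_attempts_opts is a made-up helper.
max_attempts_opts() {
  n="${1:-1}"
  printf '%s\n' "-D mapreduce.map.maxattempts=$n -D mapreduce.reduce.maxattempts=$n"
}

# Hypothetical usage (my-job.jar and com.example.MyDriver are placeholders):
#   hadoop jar my-job.jar com.example.MyDriver $(max_attempts_opts 1) input output
```

The options go after the main class and before the job's own arguments, which is where generic options are expected.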
Simply force-kill the process by its ID; the Hadoop job will also be killed automatically. Use this command:
kill -9 <process_id>
For example, for process ID 4040 (namenode):
username#hostname:~$ kill -9 4040
Use the commands below to kill all jobs running on YARN.
For accepted jobs:
for x in $(yarn application -list -appStates ACCEPTED | awk 'NR > 2 { print $1 }'); do yarn application -kill $x; done
For running jobs:
for x in $(yarn application -list -appStates RUNNING | awk 'NR > 2 { print $1 }'); do yarn application -kill $x; done
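To tie these loops back to the original question, here is a minimal sketch of a wrapper that runs a command and kills YARN applications only if that command exits with a non-zero status. It assumes the driver exits non-zero on an unhandled exception, and it kills every application the `yarn` client lists in the given states, not just the ones the command started. The function names are hypothetical.

```shell
# Kill every YARN application in the given states
# (comma-separated, default RUNNING,ACCEPTED).
kill_yarn_apps() {
  states="${1:-RUNNING,ACCEPTED}"
  # Match IDs by their application_ prefix instead of skipping header lines.
  for app in $(yarn application -list -appStates "$states" 2>/dev/null |
               awk '$1 ~ /^application_/ { print $1 }'); do
    yarn application -kill "$app"
  done
}

# Run a command; if it exits non-zero, kill the applications and
# return the command's original exit status.
run_or_kill() {
  "$@"
  rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "command failed with status $rc, killing YARN applications" >&2
    kill_yarn_apps
  fi
  return "$rc"
}

# Hypothetical usage (jar and class are placeholders):
#   run_or_kill hadoop jar my-job.jar com.example.MyDriver input output
```

On a shared cluster you would likely want to narrow the list further, for example by user or queue, before killing anything.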
Related
I scheduled a coordinator which initiated many individual workflows. This was a backfill coordinator, with both startdate and enddate in the past.
A small percentage of these jobs failed due to temporary issues with the input datasets, and now I need to re-run those workflows (without re-running the successful workflows). These unsuccessful workflows have a variety of statuses: KILLED, FAILED, and SUSPENDED.
What is the best way to do this?
I don't think jobs with multiple statuses can all be handled in a single command, but with the oozie jobs sub-command you can attempt it using three separate commands, one for each status. If anyone else has a better approach, please post it.
oozie jobs filter -jobtype wf -filter status=<status> -resume
Ex:
# SUSPENDED
oozie jobs filter -jobtype wf -filter status=SUSPENDED -resume
There are a whole lot of other sub-commands offered for jobs which can be viewed by oozie help jobs. Hope that helps!
I ended up writing a bash script to do this. I won't copy the whole script here, but this was the general outline:
First, parse the output of oozie job -info to get a list of actions with a given status for a given coordinator:
actions=$(oozie job -info $oozie_coord -filter status=$status -len 1000 |
grep "\-C#" |
awk '{print $1}' |
sed -n "s/^.*#\([0-9]*\).*$/\1/p")
Then loop over these actions and issue rerun commands:
while read -r action; do
    oozie job -rerun "$oozie_coord" -action "$action"
done <<< "$actions"
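As a possible variation on the loop, the `-action` option of `oozie job -rerun` also accepts a comma-separated list of action numbers, so the parsed numbers could be joined and submitted in one call. This is only a sketch; `join_actions` is a hypothetical helper name.

```shell
# Join newline-separated action numbers from stdin into a comma-separated list.
join_actions() {
  paste -s -d, -
}

# Hypothetical usage, with $actions and $oozie_coord as in the outline above:
#   oozie job -rerun "$oozie_coord" -action "$(printf '%s\n' "$actions" | join_actions)"
```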
Is there any way to kill a Hive query without exiting the Hive shell? For example, I mistakenly ran a select statement on a table with millions of rows, and I just wanted to stop it without exiting the shell. When I pressed Ctrl+Z, it dropped out of the shell.
You have two options:
Press Ctrl+C and wait until the command terminates; this will not exit the Hive CLI. Pressing Ctrl+C a second time terminates the session immediately and returns you to the shell.
From another shell, run
yarn application -kill <Application ID> or
mapred job -kill <JOB_ID>
First, find the job ID with:
hadoop job -list
And then kill it by ID:
hadoop job -kill <JOB_ID>
Go with the second option:
yarn application -kill <Application ID>. Get the application ID from another session.
I think this is the only way to kill the current query. I use it via Beeline on the Hortonworks platform.
I have created a workflow using Oozie that is comprised of multiple action nodes and have been successfully able to run those via coordinator.
I want to invoke the Oozie workflow via a wrapper shell script.
The wrapper script should invoke the Oozie command, wait till the oozie job completes (success or error) and return back the Oozie success status code (0) or the error code of the failed oozie action node (if any node of the oozie workflow has failed).
From what I have seen so far, I know that as soon as I invoke the oozie command to run a workflow, the command exits with the job id getting printed on linux console, while the oozie job keeps running asynchronously in the backend.
I want my wrapper script to block till the oozie coordinator job completes and return back the success/error code.
Can you please let me know how/if I can achieve this using any of the oozie features?
I am using Oozie version 3.3.2 and bash shell in Linux.
Note: In case anyone is curious about why I need such a feature - the requirement is that my wrapper shell script should know how long an oozie job has been running, when an oozie job has completed, and accordingly return back the exit code so that the parent process that is calling the wrapper script knows whether the job completed successfully or not, and if errored out, raise an alert/ticket for the support team.
You can do that by capturing the job ID, then starting a loop that parses the output of oozie job -info. Below is the shell code for this.
Start oozie job
oozie_job_id=$(oozie job -oozie http://<oozie-server>/oozie -config job.properties -run );
echo $oozie_job_id;
sleep 30;
Parse job id from output. Here job_id format is "job: jobid"
job_id=$(echo $oozie_job_id | sed -n 's/job: \(.*\)/\1/p');
echo $job_id;
Check at regular intervals whether the job is still running:
while true
do
    job_status=$(oozie job -oozie http://<oozie-server>/oozie -info "$job_id" | sed -n 's/Status\(.*\): \(.*\)/\2/p');
    if [ "$job_status" != "RUNNING" ];
    then
        echo "Job is completed with status $job_status";
        break;
    fi
    # This sleep depends on your job; change the value accordingly.
    echo "sleeping for 5 minutes";
    sleep 5m
done
This is a basic way to do it; you can modify it to suit your use case.
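Since the question asks for an exit code, one way to extend the loop is to map the final workflow status to a return value. The sketch below assumes a workflow job (coordinator jobs have a different set of statuses), assumes the `Status : ...` line format of `oozie job -info`, and assumes the server URL is in `OOZIE_URL`. The function name and the choice of return codes are hypothetical.

```shell
# Poll an Oozie workflow until it reaches a final status.
# Returns 0 on SUCCEEDED, 1 on KILLED or FAILED, 2 if no status could be read.
# Arguments: job id, poll interval in seconds (default 60).
wait_for_oozie_job() {
  job_id="$1"
  interval="${2:-60}"
  while true; do
    job_status=$(oozie job -oozie "$OOZIE_URL" -info "$job_id" |
                 sed -n 's/^Status[[:space:]]*:[[:space:]]*\([A-Z_]*\).*$/\1/p' |
                 head -n 1)
    case "$job_status" in
      SUCCEEDED) return 0 ;;
      KILLED|FAILED)
        echo "job $job_id ended with status $job_status" >&2
        return 1 ;;
      "")
        echo "could not read a status for job $job_id" >&2
        return 2 ;;
      *) sleep "$interval" ;;  # PREP, RUNNING, SUSPENDED: keep waiting
    esac
  done
}

# Hypothetical usage in a wrapper script:
#   wait_for_oozie_job "$job_id" 300; exit $?
```

Returning the error code of the specific failed action node, as the question requests, would need further parsing of the action table in the `-info` output, which this sketch does not attempt.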
To upload workflow definition to HDFS use the following command :
hdfs dfs -copyFromLocal -f workflow.xml /user/hdfs/workflows/workflow.xml
To start the Oozie job you need the two commands below.
Note that each command must be written on a single line.
JOB_ID=$(oozie job -oozie http://<oozie-server>/oozie -config job.properties -submit)
oozie job -oozie http://<oozie-server>/oozie -start ${JOB_ID#*:} -config job.properties
Check the exit status of the command below: 0 means the command succeeded, anything else is a failure. This only tells you whether the CLI call itself worked; the state of the job is in the Status field of the output, which you need to parse. Simply loop, sleeping X amount of time after each attempt.
oozie job -oozie http://<oozie-server>/oozie -info ${JOB_ID#*:}
echo $? # shows whether the command executed successfully or not
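One detail worth illustrating: `${JOB_ID#*:}` strips everything up to the first colon, which leaves a leading space in front of the ID. That is harmless while the expansion is unquoted, but it breaks once the value is quoted. A small sketch of a stricter parse, assuming the `job: <id>` output format mentioned earlier (the function name is hypothetical):

```shell
# Extract the ID from the "job: <id>" line printed by oozie job -submit or -run.
parse_oozie_job_id() {
  # Strip everything up to and including "job: ".
  printf '%s\n' "${1#*job: }"
}

# Hypothetical usage:
#   id=$(parse_oozie_job_id "$JOB_ID")
#   oozie job -oozie "$OOZIE_URL" -info "$id"
```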
I have set up a cluster in pseudo-distributed mode. The FIFO scheduler somehow got stuck partway through, so a lot of jobs that I had scheduled through cron piled up. Now, when I restart the YARN ResourceManager, it gets stuck after a while and the jobs keep piling up.
Is there a way I can clear the whole queue? Or is my understanding of Hadoop scheduling flawed somewhere? Please help.
If you're trying to kill all the jobs in your queue, you can use this shell script:
$HADOOP_HOME/bin/hadoop job -list | awk '$1 ~ /^job_/ { system("$HADOOP_HOME/bin/hadoop job -kill " $1) }'
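A sketch of the same idea using the non-deprecated `mapred` command, matching only lines whose first field starts with `job_` so that the header lines of the listing are never passed to `-kill`. The function name is hypothetical, and it assumes `mapred` is on the PATH.

```shell
# Kill every MapReduce job reported by mapred job -list.
kill_all_mapred_jobs() {
  mapred job -list 2>/dev/null |
    awk '$1 ~ /^job_/ { print $1 }' |
    while read -r job; do
      # Redirect stdin so the kill command cannot consume the job list.
      mapred job -kill "$job" </dev/null
    done
}
```

Note that this clears jobs the cluster already knows about; it does not stop cron from submitting new ones.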
Scenario:
When I enter the query below in the Hive CLI, I get the following errors:
Query:
$ bin/hive -e "insert overwrite table pokes select a.* from invites a where a.ds='2008-08-15';"
Error is like this:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201111291547_0013, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201111291547_0013
Kill Command = C:\cygwin\home\Bhavesh.Shah\hadoop-0.20.2/bin/hadoop job -Dmapred.job.tracker=localhost:9101 -kill job_201111291547_0013
2011-12-01 14:00:52,380 Stage-1 map = 0%, reduce = 0%
2011-12-01 14:01:19,518 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201111291547_0013 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
Question:
So my question is: how do I stop a job? In this case the job is job_201111291547_0013.
Please help me so that I can clear these errors and try again.
Thanks.
You can stop a job by running hadoop job -kill <job_id>.
hadoop job -kill is deprecated now.
Use mapred job -kill instead.
The log output of the launched job also provides the command to kill it. You can use that to kill the job.
That, however, gives a warning that hadoop job -kill is deprecated. You can use this instead:
mapred job -kill
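Since the job ID appears in the Hive output (as in the log shown in the question), a small filter can pull it out for use with `mapred job -kill`. This is a sketch that assumes IDs of the form `job_<digits>_<digits>`; the function name and the log file name are hypothetical.

```shell
# Print the unique MapReduce job IDs mentioned in Hive output read from stdin.
hive_job_ids() {
  grep -o 'job_[0-9][0-9]*_[0-9][0-9]*' | sort -u
}

# Hypothetical usage, assuming the Hive output was saved to hive.log:
#   hive_job_ids < hive.log | while read -r id; do mapred job -kill "$id" </dev/null; done
```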
One more option is to use the WebHCat API from a browser or from the command line, using a utility such as curl. Here's the WebHCat API to delete a Hive job.
Also note that the linked documentation says:
The job is not immediately deleted, therefore the information returned may not reflect deletion, as in our example. Use GET jobs/:jobid to monitor the job and confirm that it is eventually deleted.
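A minimal sketch of the two WebHCat calls mentioned here, `DELETE jobs/:jobid` and `GET jobs/:jobid`, using curl. It assumes an unsecured WebHCat server that accepts the `user.name` query parameter; the base URL (WebHCat commonly listens on port 50111), the user name, and the function names are assumptions.

```shell
# Ask WebHCat to kill a job (DELETE jobs/:jobid).
# Arguments: WebHCat base URL, job ID, user name.
webhcat_kill_job() {
  curl -s -X DELETE "$1/templeton/v1/jobs/$2?user.name=$3"
}

# Fetch the job's status document (GET jobs/:jobid) to confirm
# that the job was eventually killed.
webhcat_job_status() {
  curl -s "$1/templeton/v1/jobs/$2?user.name=$3"
}

# Hypothetical usage:
#   webhcat_kill_job http://localhost:50111 job_201111291547_0013 hive
#   webhcat_job_status http://localhost:50111 job_201111291547_0013 hive
```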