Oozie: rerun all non-SUCCEEDED workflows in coordinator - hadoop

I scheduled a coordinator which initiated many individual workflows. This was a backfill coordinator, with both startdate and enddate in the past.
A small percentage of these jobs failed due to temporary issues with the input datasets, and now I need to re-run those workflows (without re-running the successful workflows). These unsuccessful workflows have a variety of statuses: KILLED, FAILED, and SUSPENDED.
What is the best way to do this?

I don't think the entire thing, i.e. jobs with multiple statuses, can be handled in a single command, but with the oozie jobs sub-command it can be attempted with three separate commands, one per status. If anyone else has a better approach, please post it.
oozie jobs -jobtype wf -filter status=<status> -resume
Ex:
# SUSPENDED
oozie jobs -jobtype wf -filter status=SUSPENDED -resume
There are a whole lot of other options offered for the jobs sub-command, which can be viewed with oozie help jobs. Hope that helps!

I ended up writing a bash script to do this. I won't copy the whole script here, but this was the general outline:
First, parse the output of oozie job -info to get a list of actions with a given status for a given coordinator:
actions=$(oozie job -info $oozie_coord -filter status=$status -len 1000 |
grep "\-C#" |
awk '{print $1}' |
sed -n "s/^.*#\([0-9]*\).*$/\1/p")
Then loop over these actions and issue rerun commands:
while read -r action; do
  oozie job -rerun $oozie_coord -action $action
done <<< "$actions"
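Putting the two pieces together, a minimal end-to-end sketch - assuming the oozie CLI is configured (e.g. OOZIE_URL is set), the coordinator job id is passed as the first argument, and using the three statuses from the question (KILLED, FAILED, SUSPENDED):
#!/bin/bash
# Sketch: rerun all non-SUCCEEDED actions of a coordinator.
# Assumes the coordinator job id is passed as the first argument.
oozie_coord="$1"

for status in KILLED FAILED SUSPENDED; do
  # Extract the action numbers (the part after "#") for this status
  actions=$(oozie job -info $oozie_coord -filter status=$status -len 1000 |
    grep "\-C#" |
    awk '{print $1}' |
    sed -n "s/^.*#\([0-9]*\).*$/\1/p")

  # Issue a rerun for each action number found
  while read -r action; do
    [ -n "$action" ] && oozie job -rerun $oozie_coord -action $action
  done <<< "$actions"
done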

Related

Snakemake does not recognise job failure due to timeout with error code -11

Has anyone had a problem with Snakemake recognizing a timed-out job? I submit jobs to a cluster using qsub with a timeout set per rule:
snakemake --jobs 29 -k -p --latency-wait 60 --use-envmodules \
--cluster "qsub -l walltime={resources.walltime},nodes=1:ppn={threads},mem={resources.mem_mb}mb"
If a job fails within a script, the next one in line is executed. When a job hits the timeout defined in a rule, however, the next job in line is not executed, reducing the total number of jobs run in parallel on the cluster over time. According to the MOAB scheduler (PBS server), a timed-out job raises a -11 exit status. As far as I understand, any non-zero exit status means failure - or does this only apply to positive integers?
Thanks in advance for any hint:)
If you don't provide a --cluster-status script, Snakemake internally checks job status by touching hidden files from within the submitted job script. When a job times out, Snakemake (on the node) doesn't get a chance to report the failure to the main Snakemake instance, because qsub kills it.
You can try a cluster profile, or just grab a suitable cluster status script (be sure to make it executable with chmod and to have qsub report a parsable job id).
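For reference, a minimal sketch of such a --cluster-status script for a Torque/PBS-style scheduler. Snakemake calls it with the job id and expects "running", "success" or "failed" on stdout; the qstat -f field names (job_state, exit_status) are assumptions to adjust for your site:
#!/usr/bin/env bash
# Minimal --cluster-status sketch for a Torque/PBS-style scheduler.
# Snakemake passes the job id as the first argument and expects one of
# "running", "success" or "failed" printed to stdout.
jobid="$1"

output=$(qstat -f "$jobid" 2>/dev/null)

if [ -z "$output" ]; then
  # Job no longer known to the scheduler; without keep_completed there is
  # no record left, so treat it as failed (adjust to your setup).
  echo failed
  exit 0
fi

state=$(echo "$output" | sed -n 's/ *job_state = //p')
exit_status=$(echo "$output" | sed -n 's/ *exit_status = //p')

case "$state" in
  C|F)
    # Completed: a non-zero exit_status (e.g. -11 on a walltime kill) is a failure
    if [ "$exit_status" = "0" ]; then echo success; else echo failed; fi ;;
  *)
    echo running ;;
esac
Make the script executable and pass it via --cluster-status alongside the existing --cluster submit command.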

How to invoke an oozie workflow via shell script and block/wait till workflow completion

I have created a workflow using Oozie that is comprised of multiple action nodes and have been successfully able to run those via coordinator.
I want to invoke the Oozie workflow via a wrapper shell script.
The wrapper script should invoke the Oozie command, wait till the oozie job completes (success or error) and return back the Oozie success status code (0) or the error code of the failed oozie action node (if any node of the oozie workflow has failed).
From what I have seen so far, I know that as soon as I invoke the oozie command to run a workflow, the command exits with the job id printed on the Linux console, while the Oozie job keeps running asynchronously in the background.
I want my wrapper script to block till the oozie coordinator job completes and return back the success/error code.
Can you please let me know how/if I can achieve this using any of the oozie features?
I am using Oozie version 3.3.2 and bash shell in Linux.
Note: In case anyone is curious about why I need such a feature - the requirement is that my wrapper shell script should know how long an Oozie job has been running and when it has completed, and accordingly return the exit code, so that the parent process calling the wrapper script knows whether the job completed successfully or not and, if it errored out, can raise an alert/ticket for the support team.
You can do that by capturing the job id, then starting a loop and parsing the output of oozie job -info. Below is the shell code for the same.
Start oozie job
oozie_job_id=$(oozie job -oozie http://<oozie-server>/oozie -config job.properties -run );
echo $oozie_job_id;
sleep 30;
Parse job id from output. Here job_id format is "job: jobid"
job_id=$(echo $oozie_job_id | sed -n 's/job: \(.*\)/\1/p');
echo $job_id;
Check the job status at a regular interval to see whether it is still RUNNING:
while true
do
job_status=$(oozie job -oozie http://<oozie-server>/oozie -info $job_id | sed -n 's/Status\(.*\): \(.*\)/\2/p');
if [ "$job_status" != "RUNNING" ];
then
echo "Job is completed with status $job_status";
break;
fi
#this sleep depends on your job, please change the value accordingly
echo "sleeping for 5 minutes";
sleep 5m
done
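Since the question also needs an exit code the calling process can act on, one hedged addition after this loop is to map the final workflow status onto the script's exit status (SUCCEEDED, KILLED and FAILED are standard Oozie workflow states):
# After the loop: translate the final Oozie status into an exit code
# so the parent process can tell success from failure.
if [ "$job_status" = "SUCCEEDED" ];
then
  exit 0;
else
  # KILLED, FAILED, etc. - treat anything other than SUCCEEDED as an error
  exit 1;
fi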
This is a basic way to do it; you can modify it as per your use case.
To upload the workflow definition to HDFS, use the following command:
hdfs dfs -copyFromLocal -f workflow.xml /user/hdfs/workflows/workflow.xml
To fire up the Oozie job you need the two commands below.
Please notice that each command should be written on a single line.
JOB_ID=$(oozie job -oozie http://<oozie-server>/oozie -config job.properties -submit)
oozie job -oozie http://<oozie-server>/oozie -start ${JOB_ID#*:} -config job.properties
You need to parse the result coming from the command below; a return code of 0 means it ran successfully, otherwise it's a failure. Simply loop, with a sleep of X amount of time after each try.
oozie job -oozie http://<oozie-server>/oozie -info ${JOB_ID#*:}
echo $?  # shows whether the command executed successfully or not

Running script on my local computer when jobs submitted by qsub on a server finish

I am submitting jobs via qsub to a server and then want to analyze the results on my local machine after the jobs are finished. I can find a way to submit the analysis job on the server, but I don't know how to run that script on my local machine.
jobID=$(qsub job.sh)
qsub -W depend=afterok:$jobID analyze.sh
But instead of the above, I want something like
if(qsub -W depend=afterok:$jobID) finished successfully
sh analyze.sh
else
some script
How can I accomplish the above task?
Thank you very much.
I've faced a similar issue and I'll try to sketch the solution that worked for me:
After submitting your actual job,
jobID=$(qsub job.sh)
I would create a loop in your script that checks if the job is still running using
qstat $jobID | grep $jobID | awk '{print $5}'
Although I'm not 100% sure the status is in the 5th column, so you'd better double check. While the job is idling the status will be I or Q, while it is running R, and afterwards C.
Once it's finished, I usually grep the output files for signs that the run was a success or not, and then run the appropriate post-processing script.
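A rough sketch of that approach, assuming a Torque/PBS-style qstat whose 5th column is the state and a hypothetical success marker ("JOB DONE") that job.sh writes to its output file:
#!/bin/bash
# Sketch: wait for the submitted job to finish, then branch on success.
jobID=$(qsub job.sh)

while true; do
  state=$(qstat "$jobID" 2>/dev/null | grep "$jobID" | awk '{print $5}')
  # An empty state or C means the job has left the queue / completed
  if [ -z "$state" ] || [ "$state" = "C" ]; then
    break
  fi
  sleep 60
done

# Hypothetical check: grep the job's output file for a success marker
# (replace "JOB DONE" and the job.sh.o* pattern with whatever your job produces)
if grep -q "JOB DONE" job.sh.o*; then
  sh analyze.sh
else
  sh some_other_script.sh   # hypothetical failure branch
fi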
One thing that works for me is to run qsub synchronously with the option
qsub -sync y job.sh
(either on the command line, or as
#$ -sync y
in the script (job.sh) itself).
qsub will then exit with code 0 only if the job (or all array jobs) have finished successfully.
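That maps directly onto the if/else from the question; a minimal sketch, assuming a Grid Engine-style qsub that supports -sync y and a hypothetical failure handler:
#!/bin/bash
# -sync y makes qsub block until the job finishes and return its status.
if qsub -sync y job.sh; then
  sh analyze.sh            # job finished successfully (exit code 0)
else
  sh handle_failure.sh     # hypothetical script for the failure branch
fi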

DATASTAGE: how to run more instance jobs in parallel using DSJOB

I have a question.
I want to run multiple instances of the same job in parallel from within a script: I have a loop in which I invoke the jobs with dsjob, without the options "-wait" and "-jobstatus".
I want the jobs to have completed before the script terminates, but I don't know how to verify whether a job instance has terminated.
I thought of using the wait command, but it is not appropriate.
Thanks in advance
First, you should make sure the job compile option "Allow Multiple Instance" is selected.
Second:
#!/bin/bash
. /home/dsadm/.bash_profile
INVOCATION=(1 2 3 4 5)
cd $DSHOME/bin
for id in ${INVOCATION[@]}
do
./dsjob -run -mode NORMAL -wait test demo.$id
done
project -- test
job -- demo
$id -- invocation id
The first two lines in the shell script guarantee that the environment path is set up correctly.
Run the jobs like you say without the -wait, and then loop around running dsjob -jobinfo and parse the output for a job status of 1 or 2. When all jobs return this status, they are all finished.
You might find, though, that you check the status of the job before it actually starts running and you might pick up an old status. You might be able to fix this by first resetting the job instance and waiting for a status of "Not running", prior to running the job.
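A rough sketch of that polling, reusing the project/job/invocation ids from the answer above and assuming dsjob -jobinfo prints a "Job Status" line with the numeric code in parentheses:
#!/bin/bash
. /home/dsadm/.bash_profile
cd $DSHOME/bin

# Poll each instance until it reports status 1 (OK) or 2 (finished with warnings)
for id in 1 2 3 4 5; do
  while true; do
    # Assumption: the "Job Status" line ends with the numeric code in (...)
    status=$(./dsjob -jobinfo test demo.$id | sed -n 's/.*Job Status.*(\([0-9]*\)).*/\1/p')
    if [ "$status" = "1" ] || [ "$status" = "2" ]; then
      echo "demo.$id finished with status $status"
      break
    fi
    sleep 60
  done
done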
Invoke the jobs in a loop without the -wait or -jobstatus option.
After your loop, check the job statuses with the dsjob command.
Example - dsjob -jobinfo projectname jobname.invocationid
You can code one more loop for this as well and use the sleep command inside it.
Write your further logic according to the status of the jobs.
It is better, though, to create a job sequence to invoke this multi-instance job simultaneously with the help of different invocation ids:
create a sequence job if these belong to the same process;
create different sequences, or directly create different scripts, to trigger these jobs simultaneously with invocation ids and schedule them at the same time.
The best option is a standard, generalized script where everything is created or receives its value from command line parameters.
Example - log files named on the basis of jobname + invocation-id.
Then schedule the same script for different parameters or invocations.

How to check whether a file exists or not using hdfs shell commands

I am new to Hadoop and need a little help.
Suppose I ran a job in the background using shell scripting - how do I know whether the job is completed or not? The reason I am asking is that once the job is completed, my script has to move the output file to some other location. How can I check whether the job completed or the output file exists using HDFS?
Thanks
MRK
You need to be careful with the way you detect that the job is done, because there might be output before your job has completely finished.
To answer your direct question, to test for existence I typically do hadoop fs -ls $output | wc -l and then make sure the number is greater than 0.
My suggestion is you use && to tack on the move:
hadoop ... myjob.jar ... && hadoop fs -mv $output $new_output &
This will complete the job, and then perform the move afterwards.
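If you need the explicit existence check from above in script form, a small sketch (the $output and $new_output paths are placeholders):
#!/bin/bash
# Sketch: once the job is done, move the HDFS output if it exists.
output=/user/hdfs/myjob/output        # placeholder output path
new_output=/user/hdfs/myjob/archive   # placeholder destination path

# hadoop fs -ls prints one line per entry; more than 0 lines means it exists
if [ "$(hadoop fs -ls "$output" 2>/dev/null | wc -l)" -gt 0 ]; then
  hadoop fs -mv "$output" "$new_output"
else
  echo "output not found (yet)"
fi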
You can use JobConf.setJobEndNotificationURI() to get notified when the job gets completed.
I think you can also check for the pid of the process that started the Hadoop job using the ps command.
