How to know when PBS batch jobs are complete - bash

I have a BASH script that submits multiple serial jobs to the PBS queueing system. Once the jobs are submitted the script ends. The jobs then run on a cluster and when they are all finished I can move on to the next step. A typical workflow might involve several of these steps.
My question:
Is there a way for my script not to exit upon completion of the submission, but rather to sleep until ALL jobs submitted by that script have completed on the cluster, only then exiting?

You are trying to establish a workflow, correct? The best way to accomplish this is with job dependencies: submit X jobs, then submit further jobs that depend on the first set. The qsub documentation describes several dependency types (-W depend=after, afterok, afterany, and so on), but here's an example of submitting 3 jobs and then submitting 3 more that won't execute until the first 3 have exited.
#first batch
jobid1=$(qsub ...)
jobid2=$(qsub ...)
jobid3=$(qsub ...)
#next batch: depend=afterany makes these jobs wait until the listed jobs have exited
depend_str="-W depend=afterany:${jobid1}:${jobid2}:${jobid3}"
qsub ... $depend_str
qsub ... $depend_str
qsub ... $depend_str

One way to do this would be using the GNU Parallel command 'sem'.
I learnt about this doing queue work as well. sem is a counting semaphore: it starts each command in the background, and sem --wait blocks until all of them have finished.
Edit: I know the example here is really basic, but there is a lot that can be achieved running tasks with sem (parallel --semaphore) or with parallel in general. Have a look at the GNU Parallel tutorial; I'm certain you will find a relevant example that will help.
An example from the tutorial:
sem 'sleep 1; echo The first finished' &&
echo The first is now running in the background &&
sem 'sleep 1; echo The second finished' &&
echo The second is now running in the background
sem --wait
Output:
The first is now running in the background
The first finished
The second is now running in the background
The second finished
See the sem man page for more details.

To actually check whether a job is done, we can use qstat with the job ID to get the job status, and then grep that status for the completed status code. As long as neither your username nor your job name is "C", the following should work:
#!/bin/bash
# SECTION 1: Launch all jobs and store their job IDs in a variable
myJobs="job1.qsub job2.qsub job3.qsub" # Your job names here
numJobs=$(echo "$myJobs" | wc -w) # Count the jobs
myJobIDs="" # Initialize an empty list of job IDs
for job in $myJobs; do
    jobID_full=$(qsub $job)
    # jobID_full will look like "12345.machinename", so use sed
    # to get just the numbers
    jobID=$(echo "$jobID_full" | sed -e 's|\([0-9]*\).*|\1|')
    myJobIDs="$myJobIDs $jobID" # Add this job ID to our list
done

# SECTION 2: Check the status of each job, and exit while loop only
# if they are all complete
numDone=0 # Initialize so that loop starts
while [ $numDone -lt $numJobs ]; do # Less-than operator
    numDone=0 # Zero since we will re-count each time
    for jobID in $myJobIDs; do # Loop through each job ID
        # The following if-statement ONLY works if qstat won't return
        # the string ' C ' (a C surrounded by two spaces) in any
        # situation besides a completed job. I.e. if your username
        # or jobname is 'C' then this won't work!
        # Could add a check for error (grep -q ' E ') too if desired
        if qstat $jobID | grep -q ' C '
        then
            (( numDone++ ))
        else
            echo $numDone jobs completed out of $numJobs
            sleep 1
        fi
    done
done
echo all jobs complete

Related

LSF - automatic job rerun using sasbatch script

I am trying to create an auto-rerun mechanism by adding some code to the sasbatch script that runs after the SAS command finishes. The general idea is to:
locate the log of the SAS process and the ID of the flow containing the current job,
check whether the log contains particular ORA-xxxxx errors for which we know the solution is simply to rerun the process,
if so, trigger the jrerun class from the LSF Platform Command Line Interface,
exit sasbatch, passing $rc to LSF
The idea was implemented as:
#define used paths
log_dir=/path/to/sas_logs_directory
out_log=/path/to/auto-rerun_log.txt
out_log2=/path/to/lsf_rerun_log.txt

if [ -n "${LSB_JOBNAME}" ]; then
    if [ ! -f "$out_log" ]; then
        touch $out_log
    fi
    #get flow runtime attributes
    IFS=: read -r flow_id username flow_name job_name <<< "${LSB_JOBNAME}"
    #find log of the current process
    log_path=$(ls -t $log_dir/*.log | xargs grep -li "job:\s*$job_name" | grep -i "/${flow_name}_" | head -1)
    #set path to txt file containing lines which represent the ORA errors we look for
    conf_path=/path/to/error_list
    #analyse the process' log line by line
    while read -r line; do
        #if error is found in log then try to rerun flow
        if grep -q "$line" $log_path; then
            (nohup /path/to/rerun_script.sh $flow_id >$out_log2 2>&1) &
            disown
            break
        fi
    done < $conf_path
fi
Here rerun_script.sh is the script that calls the jrerun class after a sleep command, in order to let the parent script exit with $rc in the meantime. It looks like:
sleep 10
/some/lsf/path/jrerun
The problem is that the job stays in the running state the whole time. In the LSF history I can see that jrerun was called before the job exited.
Furthermore, in $out_log2 I can see the message: <flow_id> has no starting or exit points.
Does anyone have an idea how I can pass the return code to LSF before jrerun is called? Or maybe there is a simpler way to perform an automatic rerun of SAS jobs in Platform LSF?
I am using SAS 9.4 and Platform Process Manager 9.1.
Or maybe there is a simpler way to perform an automatic rerun of SAS jobs in Platform LSF?
I'm not knowledgeable about the SAS part, but on the LSF side there are at least a couple of ways to requeue the job.
If you have control of the job script, you can use special process exit value to automatically requeue the job.
https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_admin/job_requeue_about.html
If you have control outside of the job script, you can use brequeue -r to requeue a running job.
https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_command_ref/brequeue.1.html
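A minimal sketch of both options; the exit value 99, the script name, and the job ID below are placeholders, not part of the original answer:
# Option 1: declare 99 as an automatic requeue exit value at submission time;
# the job script then exits 99 whenever it wants to be requeued.
bsub -Q "99" ./sasbatch_wrapper.sh   # sasbatch_wrapper.sh is a placeholder name

# Option 2: from outside the job script, requeue a job that is still running.
brequeue -r 12345                    # 12345 is a placeholder job ID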
Good Luck
I managed to get this working by using two additional configuration files. When my grep returns 1, I add the found flow_id to the flow_list.txt configuration file and modify a specially made trigger_file.txt.
I scheduled an additional flow, execute_rerun, in LSF which is triggered whenever trigger_file.txt is modified. The execute_rerun flow reads the flow_list.txt configuration file line by line and calls the jrerun method on each flow.
This way I achieved an automatic rerun of the flows that fail due to particular errors.
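A minimal sketch of the bookkeeping step described above; the file paths are assumptions, not taken from the actual setup:
echo "$flow_id" >> /path/to/flow_list.txt   # remember which flow needs a rerun
date > /path/to/trigger_file.txt            # modifying this file triggers the execute_rerun flow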

Stdout based on a PID (OSX)

I am processing many files using many separate scripts. To speed up the processing I placed them in the background using &, however with this I lost the ability to keep track of what they are doing (I can't see the output).
Is there a simple way of getting output based on a PID? I found some answers based on fg [job number], but I can't figure out the job number from the PID.
You might consider running your scripts under screen and returning to them whenever you want:
$ screen ./script.sh
To "detach" and keep the script running press ControlA followed by ControlD
$ screen -ls
Will list your screen sessions
$ screen -r <screen pid number>
Returns to a screen session
The few commands above barely touch on the abilities that screen has, so check out its man page; you might be surprised by all it can do.
A script that is backgrounded will normally just continue to write to standard output; if you run several, they will all be dumping their output intermingled with each other. Dump them to a file instead. For example, generate an output file name using $$ (current process ID) and write to that file.
outfile=process.$$.out
# ...
echo Output >$outfile
will write to, say, process.27422.out.
The answers by other users are right: exec &>$outfile, exec &>$outfifo, or exec &>$another_tty is what you need to do, and it is the correct way.
However, if you have already started the scripts, then there is a workaround that you can use. I had written this script to redirect the stdout/stderr of any running process to another file/terminal.
$ cat redirect_terminal
#!/bin/bash
PID=$1
stdout=$2
stderr=${3:-$2}

if [ -e "/proc/$PID" ]; then
    gdb -q -n -p $PID <<EOF >/dev/null
p dup2(open("$stdout",1),1)
p dup2(open("$stderr",1),2)
detach
quit
EOF
else
    echo No such PID : $PID
fi
Sample usage:
./redirect_terminal 1234 /dev/pts/16
Where,
1234 is the PID of the script process.
/dev/pts/16 is another terminal opened separately.
Note that the updated stdout/stderr will not be inherited by any already running children of that process.
Consider using GNU Parallel; it is easily installed on OSX with Homebrew. Not only will it tag your output lines, it will also keep your CPUs busy, scheduling another job as soon as the previous one finishes. You can make up your own tags with substitution parameters.
Let's say you have a batch of files called file{10..20}.txt to process:
parallel --tagstring "MyTag-{}" 'echo Start; echo Processing file {}; echo Done' ::: file*txt
MyTag-file15.txt Start
MyTag-file15.txt Processing file file15.txt
MyTag-file15.txt Done
MyTag-file16.txt Start
MyTag-file16.txt Processing file file16.txt
MyTag-file16.txt Done
MyTag-file17.txt Start
MyTag-file17.txt Processing file file17.txt
MyTag-file17.txt Done
MyTag-file18.txt Start
MyTag-file18.txt Processing file file18.txt
MyTag-file18.txt Done
MyTag-file14.txt Start
MyTag-file14.txt Processing file file14.txt
MyTag-file14.txt Done
MyTag-file13.txt Start
MyTag-file13.txt Processing file file13.txt
MyTag-file13.txt Done
MyTag-file12.txt Start
MyTag-file12.txt Processing file file12.txt
MyTag-file12.txt Done
MyTag-file19.txt Start
MyTag-file19.txt Processing file file19.txt
MyTag-file19.txt Done
MyTag-file20.txt Start
MyTag-file20.txt Processing file file20.txt
MyTag-file20.txt Done
MyTag-file11.txt Start
MyTag-file11.txt Processing file file11.txt
MyTag-file11.txt Done
MyTag-file10.txt Start
MyTag-file10.txt Processing file file10.txt
MyTag-file10.txt Done
If you want the output in order, use parallel -k to keep the output order
If you want a progress report, use parallel --progress
If you want a log of when jobs started/ended, use parallel --joblog log.txt
If you want to run 32 jobs in parallel, instead of the default 1 job per CPU core, use parallel -j 32
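These options can be combined in one invocation; here is a sketch (not from the original answer) that reuses the file set above:
parallel -k --progress --joblog log.txt -j 32 --tagstring "MyTag-{}" 'echo Processing file {}' ::: file*txt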
Example joblog:
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
6 : 1461141901.514 0.005 0 38 0 0 echo Start; echo Processing file file15.txt; echo Done
7 : 1461141901.517 0.006 0 38 0 0 echo Start; echo Processing file file16.txt; echo Done
8 : 1461141901.519 0.006 0 38 0 0 echo Start; echo Processing file file17.txt; echo Done

How to get a stdout message once a background process finishes?

I realize that there are several other questions on SE about notifications upon completion of background tasks, and how to queue up jobs to start after others end, and questions like these, but I am looking for a simpler answer to a simpler question.
I want to start a very simple background job, and get a simple stdout text notification of its completion.
For example:
cp My_Huge_File.txt New_directory &
...and when it's done, my bash shell would display a message. This message could just be the completed job's PID, but if I could program unique messages per background process, that would be cool too, so I could have numerous background jobs running without confusion.
Thanks for any suggestions!
EDIT: user000001's answer separates commands with ;. I separated commands with && in my original example. The only difference I notice is that you don't have to surround your base command with braces if you use &&. Semicolons are a bit more flexible, so I've updated my examples.
The first thing that comes to mind is
{ sleep 2; echo "Sleep done"; } &
You can also suppress the accompanying stderr output from the above line:
{ { sleep 2; echo "Sleep done"; } & } 2>/dev/null
If you want to save your program output (stdout) to a log file for later viewing, you can use:
{ { sleep 2; echo "Sleep done"; } & } 2>/dev/null 1>myfile.log
Here's even a generic form you might use (You can even make an alias so that you can run it at any time without having to type so much!):
# don't hesitate to add semicolons for multiple commands
CMD="cp My_Huge_File.txt New_directory"
{ eval $CMD & } 2>/dev/null 1>myfile.log
You might also pipe stdout into another process using | in case you wish to process output in real time with other scripts or software. tee is also a helpful tool in case you wish to use multiple pipes. For reference, the bash documentation has many more examples of I/O redirection.
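A small sketch of the tee idea, assuming you want to watch the output live and keep a copy in a log file at the same time:
{ { sleep 2; echo "Sleep done"; } | tee myfile.log & } 2>/dev/null   # output goes to the terminal and to myfile.log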
You could use command grouping:
{ slow_program; echo ok; } &
or the wait command
slow_program &
wait
echo ok
The most reliable way is to simply have the output from the background process go to a temporary file and then consume the temporary file.
When you have background processes running it can be difficult to capture the output in a useful form, because the output of multiple jobs will be interleaved with each other.
For example, if you have two processes that each print a string with a number, "this is my string1" and "this is my string2", it is possible to end up with output that looks like this:
"this is mthis is my string2y string1"
instead of:
this is my string1
this is my string2
By using temporary files you guarantee that the output will be correct.
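A minimal sketch of the temp-file approach, with two placeholder commands; wait blocks until both background jobs finish:
tmp1=$(mktemp); tmp2=$(mktemp)
long_job_1 > "$tmp1" 2>&1 &   # long_job_1 and long_job_2 are placeholders
long_job_2 > "$tmp2" 2>&1 &
wait                          # block until both background jobs have finished
cat "$tmp1" "$tmp2"           # each job's output stays intact and unmixed
rm -f "$tmp1" "$tmp2"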
As I mentioned in my comment above, bash already does this kind of notification by default, as far as I know. Here's an example I just made:
$ sleep 5 &
[1] 25301
$ sleep 10 &
[2] 25305
$ sleep 3 &
[3] 25309
$ jobs
[1] Done sleep 5
[2]- Running sleep 10 &
[3]+ Running sleep 3 &
$ :
[3]+ Done sleep 3
$ :
[2]+ Done sleep 10
$

Use shell output for error handling for condor

I need to submit multiple simulations to condor (a multi-client execution grid) using the shell, and since this may take a while, I decided to write a shell script to do it for me. I am very new to shell scripting, and this is the result of one day's work:
for H in {0..50}
do
    for S in {0..10}
    do
        ./p32 -data ../data.txt -out ../result -position $S -group $H
        echo "> Ready to submit"
        condor_submit profile.sub
        echo "> Waiting 15 minutes for group $H Pos $S"
        for W in {1..15}
        do
            echo "Starting minute $W"
            sleep 60
        done
    done
    echo "Deleting data_3 to free up space"
    mkdir -p /tmp/data_3
    if [ "$H" -lt 10 ]
    then
        tar cfvz /tmp/data_3/group_000$H.tar.gz ../result/data_3/group_000$H
        rm -r ../result/data_3/group_000$H
    else
        tar cfvz /tmp/data_3/group_00$H.tar.gz ../result/data_3/group_00$H
        rm -r ../result/data_3/group_00$H
    fi
done
This script loops over groups 0..50 and, for each group, over positions 0..10, passing the parameters to a program that generates a condor submission profile. Then I submit this profile and let it execute for 15 minutes (with a call being made every minute to ensure the SSH pipe doesn't break). Once the 15 minutes are up I compress the output to a volume with more space and erase the original files.
The reason I implemented it this way is that our condor system can only handle up to 10,000 submissions at once, and one submission (condor_submit profile.sub) executes 7000+ simulations.
Now my problem is with the condor_submit line. When I checked this morning I (luckily) spotted that calling condor_submit profile.sub can fail if the network is too busy. The error is:
ERROR: Failed to connect to local queue manager
CEDAR:6001:Failed to connect to <IP_NUMBER:PORT_NUMBER>
This means that from time to time a whole iteration gets lost! How can I work around this? The only way I see is to use the shell to read in the last line(s) of terminal output and evaluate whether they follow the expected response, i.e.:
7392 job(s) submitted to cluster CLUSTER_NUMBER.
But how would I read in the last line and go about checking for errors?
Any help is very much needed and appreciated.
Does condor_submit give a non-zero exit code when it fails? If so, you can try calling it like this:
while ! condor_submit profile.sub; do
    sleep 5
done
which will cause the current profile to be submitted every 5 seconds until it succeeds.
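If condor_submit does not return a useful exit code on your system, a fallback sketch (assuming the success message looks like the one quoted above) is to capture its output and grep for it:
# Retry until the output contains the expected success message.
until output=$(condor_submit profile.sub 2>&1) && grep -q 'submitted to cluster' <<< "$output"; do
    echo "condor_submit failed, retrying in 5 seconds: $output"
    sleep 5
done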

I want to make a conditional cronjob

I have a cron job that runs every hour. It accesses an XML feed. If the XML feed is unavailable (which seems to happen once a day or so) it creates a "failure" file. This "failure" file has some metadata in it and is erased at the next hour when the script runs again and the XML feed works again.
What I want is to make a 2nd cron job that runs a few minutes after the first one, looks into the directory for a "failure" file and, if it's there, retries the 1st cron job.
I know how to set up cron jobs, I just don't know how to make scripting conditional like that. Am I going about this in entirely the wrong way?
Possibly. Maybe what you'd be better off doing is having the original script sleep and retry a (limited) number of times.
sleep is a shell command and shells support looping, so it could look something like:
for ((retry=0;retry<12;retry++)); do
    try_the_thing                           # placeholder for the actual fetch command
    if [[ -e my_xml_file ]]; then break; fi
    sleep 300
    # five minutes later...
done
As the command to run, try:
/bin/bash -c 'test -e failurefile && retrycommand -someflag -etc'
It runs retrycommand if failurefile exists
Why not have your script touch a status file when it has successfully completed? Run it every 5 minutes, and make the script's first check whether the status file is less than 60 minutes old: if it is young, quit; if it is old, fetch.
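A minimal sketch of that check, with a hypothetical status-file path and a placeholder fetch command:
status=/var/tmp/feed.ok                            # hypothetical status file
if [ -n "$(find "$status" -mmin -60 2>/dev/null)" ]; then
    exit 0                                         # a fetch succeeded within the last hour
fi
fetch_feed && touch "$status"                      # fetch_feed is a placeholder for the real job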
I agree with MarkusQ that you should retry in the original job instead of creating another job to watch the first job.
Take a look at this tool to make retrying easier: https://github.com/kadwanev/retry
You can easily wrap the original cron command in a retry, and the continued existence of the failure file afterwards would indicate that it failed even after the retries.
If somebody needs a bash script that pings an endpoint (for example, to run scheduled API tasks via cron) and retries if the response status is bad, then:
#!/bin/bash

echo "Start pinch.sh script."

# run 5 times
for ((i=1;i<=5;i++))
do
    # run curl to make a request to https://www.google.com,
    # silently discard its output,
    # and capture the response status code in a bash variable
    http_response=$(curl -o /dev/null -s -w "%{response_code}" https://www.google.com)

    # check for the expected code
    if [ "$http_response" != "200" ]
    then
        # request failed
        echo "The pinch failed. Sleeping for 5 minutes."
        # wait for 300 seconds, then start another iteration
        sleep 300
    else
        # exit from the cycle
        echo "The pinch is OK. Finishing."
        break
    fi
done

exit 0
