Checking for status of qsub jobs running within shell script - bash

I have been given a C shell script that launches 800 individual qsub jobs for a sample. I need to run this script on more than 500 samples (listed in samples.txt). To automate the process, I thought about running the script (named SrchDriver) with the following bash shell script:
#!/bin/sh
for item in $(cat samples.txt)
do
    (cd dir_"$item"/MAPGAPS && SrchDriver "$item"_Out 3)
done
This script would launch SrchDriver for all samples one right after another, which would result in too many jobs on the server at one time. I would like to run only one sample at a time, waiting for all qsub jobs for that sample to finish before starting the next.
What is the best way to check for running/waiting jobs for a sample and hold the launch of the SrchDriver script for additional samples until all jobs are finished for the current sample?
My idea was to first wait 30 seconds, then check the status of the qsub jobs (the jobs are named mapgaps), and use a while loop to re-check the status every 30 seconds. Once the status is no longer 0 (no mapgaps jobs found), proceed to the next sample. Would this be correct?
sleep 30
qstat | grep mapgaps &> /dev/null
while [ $? -eq 0 ];
do
    sleep 30
    qstat | grep mapgaps &> /dev/null
done;
If correct, how would I combine it with my for-loop? Would the following code below be correct?
#!/bin/sh
for item in $(cat samples.txt)
do
    (cd dir_"$item"/MAPGAPS && SrchDriver "$item"_Out 3)
    sleep 30
    qstat | grep mapgaps &> /dev/null
    status=$?
    while [ $status = 0 ]
    do
        sleep 30
        qstat | grep mapgaps &> /dev/null
        status=$?
    done
done
Thanks in advance for help. Please let me know if more information is needed.

Your logic is sound and the script should work essentially as is. One caveat: &> is a bash extension, so with a #!/bin/sh shebang use the portable > /dev/null 2>&1 (or switch the shebang to #!/bin/bash).
A small improvement: the while statement can test the exit status of a command directly, without going through $?, so you could write your script like this:
#!/bin/sh
for item in $(cat samples.txt)
do
    (cd dir_"$item"/MAPGAPS && SrchDriver "$item"_Out 3)
    sleep 30
    while qstat | grep mapgaps > /dev/null 2>&1
    do
        sleep 30
    done
done
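A hedged refinement, in case other users' jobs could also match the name: most qstat implementations (PBS/Torque, SGE) accept a -u flag to restrict the listing to a single user's jobs, so the check can be narrowed like this (assuming your qstat supports -u):
# assumption: qstat -u is available; limit the check to your own jobs
while qstat -u "$USER" | grep mapgaps > /dev/null 2>&1
do
    sleep 30
done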

How to wait in bash till a shell script is finished?

right now I'm using this script for a program:
export FREESURFER_HOME=$HOME/freesurfer
source $FREESURFER_HOME/SetUpFreeSurfer.sh
cd /home/ubuntu/fastsurfer
datadir=/home/ubuntu/moya/data
fastsurferdir=/home/ubuntu/moya/output
mkdir -p $fastsurferdir/logs # create log dir for storing nohup output log (optional)
while read p ; do
    echo $p
    nohup ./run_fastsurfer.sh --t1 $datadir/$p/orig.nii \
        --parallel --threads 16 --sid $p --sd $fastsurferdir > $fastsurferdir/logs/out-${p}.log &
    sleep 3600s
done < /home/ubuntu/moya/data/subjects-list.txt
Instead of using sleep 3600s (the program needs around an hour), I'd like to wait until all processes (several PIDs) are finished.
If this is the right way, can you tell me how to do that?
BR Alex
wait will wait for all background processes to finish (see help wait). So all you need is to run wait after creating all of the background processes.
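Applied to the loop from the question, that looks like this (a minimal sketch; the sleep is dropped because wait does the blocking):
while read p ; do
    echo $p
    nohup ./run_fastsurfer.sh --t1 $datadir/$p/orig.nii \
        --parallel --threads 16 --sid $p --sd $fastsurferdir > $fastsurferdir/logs/out-${p}.log &
done < /home/ubuntu/moya/data/subjects-list.txt
wait   # blocks until every background run_fastsurfer.sh has exited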
This may be more than what you are asking for, but I figured I would provide some methods for controlling the number of threads you want to have running at once. I find that I always want to limit that number for various reasons.
Explanation
The following limits concurrent threads to max_threads running at one time. I am also using the main design pattern: a main function runs the script, and a run_jobs function handles launching and waiting. I read all of $p into an array, then traverse that array as threads are launched. Each pass either launches a new thread (when fewer than max_threads are running) or just sleeps 10 seconds; once the count drops below the limit it starts another thread. When every job has been launched it waits for the remaining ones to finish. If you want something more simplistic I can do that as well.
#!/usr/bin/env bash
export FREESURFER_HOME=$HOME/freesurfer
source $FREESURFER_HOME/SetUpFreeSurfer.sh
typeset max_threads=4
typeset subjects_list="/home/ubuntu/moya/data/subjects-list.txt"
typeset subjectsArray
run_jobs() {
    local parent="$$"
    local num_children=0
    local i=0
    while [ $i -lt ${#subjectsArray[@]} ]; do
        # count our child processes; subtract 1 for the ps pipeline itself
        num_children=$(ps --no-headers -o pid --ppid=$parent | wc -w); ((num_children-=1))
        echo "Children: $num_children"
        if [[ ${num_children} -lt ${max_threads} ]]; then
            # RUN COMMAND HERE, in the background so the loop keeps control
            ./run_fastsurfer.sh --t1 $datadir/${subjectsArray[$i]}/orig.nii \
                --parallel --threads 16 --sid ${subjectsArray[$i]} --sd $fastsurferdir &
            ((i+=1))
        fi
        sleep 10
    done
    wait
}
main() {
    cd /home/ubuntu/fastsurfer
    datadir=/home/ubuntu/moya/data
    fastsurferdir=/home/ubuntu/moya/output
    mkdir -p $fastsurferdir/logs # create log dir for storing nohup output log (optional)
    mapfile -t subjectsArray < ${subjects_list}
    run_jobs
}
main
Note: I did not run this code since you have not provided enough information to actually do so.
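For the record, a more simplistic variant is sketched below (assuming bash 4.3+, whose wait -n blocks until any one background job exits), which avoids the ps bookkeeping entirely:
# keep at most max_threads runs in flight; wait -n returns when one finishes
for subject in "${subjectsArray[@]}"; do
    while (( $(jobs -rp | wc -l) >= max_threads )); do
        wait -n
    done
    ./run_fastsurfer.sh --t1 $datadir/$subject/orig.nii \
        --parallel --threads 16 --sid $subject --sd $fastsurferdir &
done
wait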

question on using bwait to wait for multiple bsub jobs to finish

I am new to using LSF (been using PBS/Torque all along).
I need to write code/logic to make sure all bsub jobs finish before other commands/jobs can be fired.
Here is what I have done: I have a master shell script which calls multiple other shell scripts via bsub commands. I capture the job IDs from bsub in a log file, and I need to ensure that all jobs have finished before the downstream shell script executes its other commands.
Master shell script
#!/bin/bash
...Code not shown for brevity..
"Command 1 invoked with multiple bsubs" > log_cmd_1.txt
Need Code logic to use bwait before downstream Commands can be used
"Command 2 will be invoked with multiple bsubs" > log_cmd_2.txt
and so on
stdout captured from Command 1 within the Master Shell script is stored in log_cmd_1.txt which looks like this
Submitting Sample 101
Job <545> is submitted to .
Submitting Sample 102
Job <546> is submitted to .
Submitting Sample 103
Job <547> is submitted to .
Submitting Sample 104
Job <548> is submitted to .
I have used the code block shown below after Command 1 in the master shell script.
However, it does not seem to work for my situation; it looks like I have gotten the whole thing wrong.
while sleep 30m;
do
    # the below gets the JobIds from log_cmd_1.txt and tries bwait on each
    grep '^Job' <path_to>/log_cmd_1.txt | perl -pe 's!.*?<(\d+)>.*!$1!' | while read -r line; do res=$(bwait -w "done($line)"); echo $res; done 1> <path_to>/running.txt
    # the below sed command deletes empty/whitespace-only lines
    sed '/^\s*$/d' running.txt > running2.txt
    # -s file check operator means "file is not zero size"
    if [ -s $WORK_DIR/logs/running2.txt ]
    then
        echo "Jobs still running"
    else
        echo "Jobs complete"
        break
    fi
done
The question: What's the correct way to do this using bwait within the master shell script.
Thanks in advance.
bwait will block until the condition is satisfied, so the loops are probably not necessary. Note that since you're using done, if a job fails then bwait will exit and inform you that the condition can never be satisfied. Make sure to check for that case.
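For example, a minimal sketch of waiting on each captured job ID directly, with the failure case checked (this assumes bwait returns non-zero when the condition cannot be satisfied):
# wait for every job id captured in log_cmd_1.txt; no polling loop needed
grep '^Job' log_cmd_1.txt | perl -pe 's!.*?<(\d+)>.*!$1!' | while read -r jobid; do
    if ! bwait -w "done($jobid)"; then
        # the job exited abnormally, so done() can never be satisfied
        echo "Job $jobid failed" >&2
    fi
done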
What you have should work. At least the following test worked for me.
#!/bin/bash
# "Command 1 invoked with multiple bsubs" > log_cmd_1.txt
( bsub sleep 0; bsub sleep 0 ) > log_cmd_1.txt
# Need code logic to use bwait before downstream commands can be used
while sleep 1
do
    # the below gets the JobIds from log_cmd_1.txt and tries bwait on each
    grep '^Job' log_cmd_1.txt | perl -pe 's!.*?<(\d+)>.*!$1!' | while read -r line; do res=$(bwait -w "done($line)"); echo "$res"; done 1> running.txt
    # the below sed command deletes empty/whitespace-only lines
    sed '/^\s*$/d' running.txt > running2.txt
    # -s file check operator means "file is not zero size"
    if [ -s running2.txt ]
    then
        echo "Jobs still running"
    else
        echo "Jobs complete"
        break
    fi
done
Another way to do it, which may be a little cleaner, is to use job arrays and job dependencies. A job array combines several pieces of work that can be managed as a single job. So your
"Command 1 invoked with multiple bsubs" > log_cmd_1.txt
could be submitted as a single job array. You'll need a driver script that can launch the individual jobs. Here's an example driver script.
$ cat runbatch1.sh
#!/bin/bash
# $LSB_JOBINDEX goes from 1 to 10
if [ "$LSB_JOBINDEX" -eq 1 ]; then
    # do the work for job batch 1, job 1
    ...
elif [ "$LSB_JOBINDEX" -eq 2 ]; then
    # etc
    ...
fi
Then you can submit the job array like this.
bsub -J 'batch1[1-10]' sh runbatch1.sh
This command will run 10 job array elements. The driver script can inspect the variable LSB_JOBINDEX to find out which element it is running. Since the array has a name, batch1, it's easier to manage. You can submit a second job array that won't start until all elements of the first have completed successfully. The second array is submitted with this command:
bsub -w 'done(batch1)' -J 'batch2[1-10]' sh runbatch2.sh
I hope that this helps.

Is there a way to stop scripts that are running simultaneously if one of them send an echo?

I need to find out if a value (actually it's more complex than that) is on one of the 20 servers I have, and I need to do it as fast as possible. Right now I am sending the scripts simultaneously to all the servers. My main script is something like this (but with all the servers):
#!/bin/sh
#mainScript.sh
value=$1
c1=`cat serverList | sed -n '1p'`
c2=`cat serverList | sed -n '2p'`
sh find.sh $value $c1 & sh find.sh $value $c2
#!/bin/sh
#find.sh
#some code here .....
if [ $? -eq 0 ]; then
    rm $tempfile
else
    myValue=`sed -n '/VALUE/p' $tempfile | awk 'BEGIN{FS="="} {print substr($2, 8, length($2)-2)}'`
    echo "$myValue"
fi
So the script only returns a response if it finds the value in the server. I would like to know if there is a way to stop executing the other scripts if one of them already return a value.
I tried adding an "exit" on the find.sh script but it won't stop all the scripts. Can somebody please tell me if what I want to do is possible?
I would suggest that you use something that can handle this for you: GNU Parallel. From the linked tutorial:
If you are looking for success instead of failures, you can use success. This will finish as soon as the first job succeeds:
parallel -j2 --halt now,success=1 echo {}\; exit {} ::: 1 2 3 0 4 5 6
Output:
1
2
3
0
parallel: This job succeeded:
echo 0; exit 0
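Applied to your case, a hedged sketch (assuming serverList holds one server name per line; :::: reads the arguments from a file, and success requires find.sh's exit status to reflect success, as suggested below):
# search every server in parallel; kill the rest once one invocation succeeds
parallel --halt now,success=1 sh find.sh "$value" {} :::: serverList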
I suggest you start by modifying your find.sh so that its exit code reflects its success; that will let us identify a successful call more easily. Note that the exit status of the sed/awk pipeline is not useful here (it is 0 whether or not a value was matched), so test the extracted value itself; for instance:
myValue=`sed -n '/VALUE/p' $tempfile | awk 'BEGIN{FS="="} {print substr($2, 8, length($2)-2)}'`
if [ -n "$myValue" ]; then
    echo "$myValue"
    exit 0
else
    exit 1
fi
To terminate all the find.sh processes spawned by your script you can use pkill with a parent-process-ID criterion and a command-name criterion:
pkill -P $$ find.sh # $$ refers to the current process' PID
Note that this requires that you start the find.sh script directly rather than passing it as a parameter to sh. Normally that shouldn't be a problem, but if you have a good reason to call sh rather than your script, you can replace find.sh in the pkill command by sh (assuming you're not spawning other scripts you wouldn't want to kill).
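To check what would be matched before actually killing anything, pgrep accepts the same criteria (a quick sanity check, not part of the original answer):
pgrep -P $$ -l find.sh   # list the matching PIDs and process names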
Now that find.sh exits with success only when it finds the expected string, you can plug the two actions with && and run the whole thing in background :
{ find.sh $value $c1 && pkill -P $$ find.sh; } &
The first occurrence of find.sh that terminates with success will invoke the pkill command that will terminate all others (those killed processes will have non-zero exit codes and therefore won't run their associated pkill).
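Putting the pieces together, the main script could look something like this (a sketch, assuming serverList holds one server per line and find.sh is executable and exits 0 only on success):
#!/bin/sh
#mainScript.sh
value=$1
# launch one search per server; the first success kills the remaining ones
while read -r server; do
    { ./find.sh "$value" "$server" && pkill -P $$ find.sh; } &
done < serverList
wait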

How to get the correct number of background jobs running, f.ex. $(jobs | wc -l | xargs) returns 1 instead of 0

I have a Jenkins build job that starts processes in the background. I need to write a function that checks whether there are still background processes running. To test it I came up with this:
#!/bin/bash -e
function waitForUploadFinish() {
    runningJobs=$(jobs | wc -l | xargs)
    echo "Waiting for ${runningJobs} background upload jobs to finish"
    while [ "$(jobs | wc -l | xargs)" -ne 0 ]; do
        echo "$(jobs | wc -l | xargs)"
        echo -n "." # no trailing newline
        sleep 1
    done
    echo ""
}
for i in {1..3}
do
    sleep $i &
done
waitForUploadFinish
The problem is that it never comes down to 0: even when the last sleep is done, one job is still being counted.
mles-MBP:ionic mles$ ./jobs.sh
Waiting for 3 background upload jobs to finish
3
.2
.1
.1
.1
.1
.1
.1
Why I don't want to use wait here
In the Jenkins build script this snippet is for, I'm starting background upload processes for huge files. They don't run for 3 seconds like the sleep in this example; they can take up to 30 minutes. If I used wait here, the user would see something like this in the log:
upload huge_file1.ipa &
upload huge_file2.ipa &
upload huge_file3.ipa &
wait
They would wonder why nothing is going on.
Instead I want to implement something like this:
upload huge_file1.ipa &
upload huge_file2.ipa &
upload huge_file3.ipa &
Waiting for 3 background upload jobs to finish
............
Waiting for 2 background upload jobs to finish
.................
Waiting for 1 background upload jobs to finish
.........
Upload done
That's why I need the loop with the current running background jobs.
This fixes it:
function waitForUploadFinish() {
    runningJobs=$(jobs | wc -l | xargs)
    echo "Waiting for ${runningJobs} background upload jobs to finish"
    # jobs -r lists only running jobs, so finished-job notices aren't counted
    while [ `jobs -r | wc -l | tr -d " "` != 0 ]; do
        jobs -r | wc -l | tr -d " "
        echo -n "." # no trailing newline
        sleep 1
    done
    echo ""
}
Note: you will only count the background processes that were started by this bash script; you will not see background processes from the invoking shell.
As gniourf_gniourf commented: if you only need to wait and don't need the progress output, a simple wait after the sleeps is much simpler:
for i in {1..3}; do
    sleep $i &
done
wait
Please consider comments made by gniourf_gniourf, as your design is not good to start with.
However, despite a much simpler and more efficient solution being possible, there is the question of why you are seeing what you are seeing.
I modified the first line of your loop, like so :
while [ "$(jobs | tee >(cat >&2) | wc -l | xargs)" -ne 0 ];do
The tee command takes its input and sends it to both standard out and to the file passed as argument. >(cat >&2) is syntax that, to explain it simply, provides a file to the tee command, but that file really is a named FIFO and anything written to that file will be sent to standard error, bypassing the pipeline and allowing you to see what jobs is spitting out, all while allowing the rest of the pipeline to operate normally.
If you do that, you will notice that the "job" that jobs keeps on returning is not a job, but a message stating some other job has finished. I do not know why it keeps on repeating that, but this is the cause of the problem.
You could replace :
while [ "$(jobs | wc -l | xargs)" -ne 0 ];do
With :
while [ "$(jobs -p | grep "^[0-9]" | wc -l | xargs)" -ne 0 ];do
This will cause jobs to echo PIDs, and filter out any line that does not begin with a number, so messages will not be counted.

Output of background process output to Shell variable

I want to get the output of a command/script into a variable, but the process is triggered to run in the background. I tried as below; a few servers ran it correctly and I got the response, but on a few I am getting i_res as empty.
I am trying to run it in the background because the command has a chance of getting into a hung state, and I don't want to hang the parent script.
Hope I will get a response soon.
#!/bin/ksh
x_cmd="ls -l"
i_res=$(eval $x_cmd 2>&1 &)
k_pid=$(pgrep -P $$ | head -1)
sleep 5
c_errm="$(kill -0 $k_pid 2>&1)"; c_prs=$?
if [ $c_prs -eq 0 ]; then
    c_errm=$(kill -9 $k_pid)
fi
wait $k_pid
echo "Result : $i_res"
Try something like this:
#!/bin/ksh
pid=$$ # parent process
(sleep 5 && kill $pid) & # this will sleep and wake up after 5 seconds
# and kill off the parent.
termpid=$! # remember the timebomb pid
# put the command that can hang here
result=$( ls -l )
# if we got here in less than 5 five seconds:
kill $termpid # kill off the timebomb
echo "$result" # disply result
exit 0
Add whatever messages you need to the code. On average this will complete much faster than always having a sleep statement. You can see what it does by making the command sleep 6 instead of ls -l.
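As an aside (not part of the original answer): if GNU coreutils is available, its timeout utility implements the same timebomb idea in one line, exiting with status 124 when it had to kill the command:
#!/bin/ksh
# hedged alternative: let coreutils timeout kill the command after 5 seconds
result=$(timeout 5 ls -l 2>&1)
if [ $? -eq 124 ]; then
    echo "command timed out"
else
    echo "Result : $result"
fi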
