slurm: write exit code or job status into logfile when finished (completed or failed)

I need to make sure that all commands in my script finished successfully (returned 0 status). That's why my slurm script includes the following lines:
set -e
set -x
Now I would like the exit status of the whole script to be written to the logfile that Slurm creates automatically. I have tried echo $SLURM_JOB_EXIT_CODE (with no success) and echo $? (which I am not sure is what I need) as the last line of my script.
What's the proper way to do this? I need to differentiate between "failed" and "completed" jobs, preferably by checking logfiles only.

Catching the exit code of the script within the script is impossible so you should either
wrap your script in another script that would take proper action based on its return code, or
get the return code from Slurm's accounting with the sacct command.
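A minimal sketch of the first option, assuming the real work lives in a hypothetical payload.sh that the submitted wrapper calls; the status lines go to stdout, i.e. into the Slurm logfile:

#!/bin/bash
#SBATCH -o job.out
# wrapper submitted with sbatch; payload.sh is a hypothetical name for the
# script whose commands must all succeed (it can keep its set -e / set -x)
./payload.sh
rc=$?
if [ "$rc" -eq 0 ]; then
    echo "JOB COMPLETED"
else
    echo "JOB FAILED with exit code $rc"
fi
exit "$rc"    # propagate the result so Slurm's accounting agrees with the log

For the second option, once the job has ended you can query the accounting database, for example with sacct -j <jobid> -o JobID,State,ExitCode.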

I know this is an old question but here is my way of appending the final job status to the Slurm output.
res=$(sbatch job.sh)        # e.g. "Submitted batch job 123456"
echo "$res"
sleep 10s
ST="PENDING"
# Poll sacct until the job reaches a final state. Other terminal states exist
# (CANCELLED, TIMEOUT, ...), so add them here if they can occur in your case.
while [[ "$ST" != "COMPLETED" && "$ST" != "FAILED" ]] ; do
    ST=$(sacct -j ${res##* } -o State | awk 'FNR == 3 {print $1}')   # ${res##* } is the job ID
    sleep 10s
done
echo "$ST" >> job.out # assuming stdout writes to job.out

Related

question on using bwait to wait for multiple bsub jobs to finish

I am new to using LSF (been using PBS/Torque all along).
I need to write code/logic to make sure all bsub jobs finish before other commands/jobs can be fired.
Here is what I have done: I have a master shell script which calls multiple other shell scripts via bsub commands. I capture the job ids from bsub in a log file and I need to ensure that all jobs get finished before the downstream shell script should execute its other commands.
Master shell script
#!/bin/bash
...Code not shown for brevity..
"Command 1 invoked with multiple bsubs" > log_cmd_1.txt
Need Code logic to use bwait before downstream Commands can be used
"Command 2 will be invoked with multiple bsubs" > log_cmd_2.txt
and so on
stdout captured from Command 1 within the Master Shell script is stored in log_cmd_1.txt which looks like this
Submitting Sample 101
Job <545> is submitted to .
Submitting Sample 102
Job <546> is submitted to .
Submitting Sample 103
Job <547> is submitted to .
Submitting Sample 104
Job <548> is submitted to .
I have used the code block shown below after Command 1 in the master shell script.
However, it does not seem to work for my situation; it looks like I have gotten something wrong below.
while sleep 30m;
do
#the below gets the JobId from the log_cmd_1.txt and tries bwait
grep '^Job' <path_to>/log_cmd_1.txt | perl -pe 's!.*?<(\d+)>.*!$1!' | while read -r line; do res=$(bwait -w "done($line)");echo $res; done 1> <path_to>/running.txt;
# the below sed command deletes lines that start with Space
sed '/^\s*$/d' running.txt > running2.txt;
# -s file check operator means "file is not zero size"
if [ -s $WORK_DIR/logs/running2.txt ]
then
echo "Jobs still running";
else
echo "Jobs complete";
break;
fi
done
The question: what's the correct way to do this using bwait within the master shell script?
Thanks in advance.
bwait will block until the condition is satisfied, so the loops are probably not necessary. Note that since you're using done, if the job fails then bwait will exit and inform you that the condition can never be satisfied. Make sure to check that case.
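A small sketch of checking that case, assuming bwait exits non-zero (with a message) when the done() condition can never be satisfied:

for jobid in 545 546 547 548; do
    if ! bwait -w "done($jobid)"; then
        echo "job $jobid will never satisfy done(), it probably failed" >&2
    fi
done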
What you have should work. At least the following test worked for me.
#!/bin/bash
# "Command 1 invoked with multiple bsubs" > log_cmd_1.txt
( bsub sleep 0; bsub sleep 0 ) > log_cmd_1.txt
# Need Code logic to use bwait before downstream Commands can be used
while sleep 1
do
#the below gets the JobId from the log_cmd_1.txt and tries bwait
grep '^Job' log_cmd_1.txt | perl -pe 's!.*?<(\d+)>.*!$1!' | while read -r line; do res=$(bwait -w "done($line)");echo "$res"; done 1> running.txt;
# the below sed command deletes lines that start with Space
sed '/^\s*$/d' running.txt > running2.txt;
# -s file check operator means "file is not zero size"
if [ -s running2.txt ]
then
echo "Jobs still running";
else
echo "Jobs complete";
break;
fi
done
Another way to do it, which may be a little cleaner, is to use job arrays and job dependencies. Job arrays combine several pieces of work that can be managed as a single job. So your
"Command 1 invoked with multiple bsubs" > log_cmd_1.txt
could be submitted as a single job array. You'll need a driver script that can launch the individual jobs. Here's an example driver script.
$ cat runbatch1.sh
#!/bin/bash
# $LSB_JOBINDEX goes from 1 to 10
if [ "$LSB_JOBINDEX" -eq 1 ]; then
# do the work for job batch 1, job 1
...
elif [ "$LSB_JOBINDEX" -eq 2 ]; then
# etc
...
fi
Then you can submit the job array like this.
bsub -J 'batch1[1-10]' sh runbatch1.sh
This command will run 10 job array elements. The driver script's environment will contain the variable LSB_JOBINDEX to let you know which element the driver is running. Since the array has a name, batch1, it's easier to manage. You can submit a second job array that won't start until all elements of the first have completed successfully. The second array is submitted with this command.
bsub -w 'done(batch1)' -J 'batch2[1-10]' sh runbatch2.sh
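If the master script itself needs to block at that point instead of submitting a dependent job, the same kind of dependency expression should also work with a single bwait on the array name (a sketch, assuming bwait accepts job-name conditions the same way bsub -w does):

bwait -w 'done(batch1)'    # returns once every element of the batch1 array is done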
I hope that this helps.

Is there a way to stop scripts that are running simultaneously if one of them send an echo?

I need to find if a value (actually it's more complex than that) is in one of 20 servers I have. And I need to do it as fast as possible. Right now I am sending the scripts simultaneously to all the servers. My main script is something like this (but with all the servers):
#!/bin/sh
#mainScript.sh
value=$1
c1=`cat serverList | sed -n '1p'`
c2=`cat serverList | sed -n '2p'`
sh find.sh $value $c1 & sh find.sh $value $c2
#!/bin/sh
#find.sh
#some code here .....
if [ $? -eq 0 ]; then
rm $tempfile
else
myValue=`sed -n '/VALUE/p' $tempfile | awk 'BEGIN{FS="="} {print substr($2, 8, length($2)-2)}'`
echo "$myValue"
fi
So the script only returns a response if it finds the value on the server. I would like to know if there is a way to stop executing the other scripts once one of them has already returned a value.
I tried adding an "exit" in the find.sh script, but it doesn't stop all the scripts. Can somebody please tell me if what I want to do is possible?
I would suggest that you use something that can handle this for you: GNU Parallel. From the linked tutorial:
If you are looking for success instead of failures, you can use success. This will finish as soon as the first job succeeds:
parallel -j2 --halt now,success=1 echo {}\; exit {} ::: 1 2 3 0 4 5 6
Output:
1
2
3
0
parallel: This job succeeded:
echo 0; exit 0
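Applied to the original problem, something along these lines should stop all remaining searches as soon as one server reports the value (a sketch, assuming find.sh is adjusted to exit 0 only when it actually finds the value):

# one job per line of serverList; kill the rest as soon as one succeeds
parallel -j0 --halt now,success=1 sh find.sh "$value" {} :::: serverList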
I suggest you start by modifying your find.sh so that its exit code reflects whether it found the value; that will let us identify a successful call more easily. For instance:
myValue=`sed -n '/VALUE/p' $tempfile | awk 'BEGIN{FS="="} {print substr($2, 8, length($2)-2)}'`
if [ -n "$myValue" ]; then
    echo "$myValue"
    exit 0
else
    exit 1
fi
To terminate all the find.sh processes spawned by your script you can use pkill with a Parent Process ID criteria and a command name criteria :
pkill -P $$ find.sh # $$ refers to the current process' PID
Note that this requires that you start the find.sh script directly rather than passing it as a parameter to sh. Normally that shouldn't be a problem, but if you have a good reason to call sh rather than your script, you can replace find.sh in the pkill command by sh (assuming you're not spawning other scripts you wouldn't want to kill).
Now that find.sh exits with success only when it finds the expected string, you can plug the two actions with && and run the whole thing in background :
{ find.sh $value $c1 && pkill -P $$ find.sh; } &
The first occurrence of find.sh that terminates with success will invoke the pkill command that will terminate all others (those killed processes will have non-zero exit codes and therefore won't run their associated pkill).
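Put together, the main script could look roughly like this minimal sketch; it loops over serverList instead of hard-coding c1 and c2, and it matches the processes to kill by command name (pkill -x find.sh), which is slightly broader than the parent-PID filter above:

#!/bin/bash
#mainScript.sh (sketch)
value=$1

# start one search per server listed in serverList
while read -r server; do
    { ./find.sh "$value" "$server" && pkill -x find.sh; } &
done < serverList

wait    # returns once every search has finished or been killed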

bash run multiple files exit condition

I have a function like so
function generic_build_a_module(){
move_to_the_right_directory
echo 'copying the common packages'; ./build/build_sdist.sh;
echo 'installing the api common package'; ./build/cache_deps.sh;
}
I want to exit the function if ./build/build_sdist.sh doesn't finish successfully.
Here is the content of ./build/build_sdist.sh:
... multiple operations....
echo "installing all pip dependencies from $REQUIREMENTS_FILE_PATH and placing their tar.gz into $PACKAGES_DIR"
pip install --no-use-wheel -d $PACKAGES_DIR -f $PACKAGES_DIR -r $REQUIREMENTS_FILE_PATH $PACKAGES_DIR/*
In other words, how does the main function generic_build_a_module "know" whether ./build/build_sdist.sh finished successfully?
You can check the exit status of a command by surrounding it with an if. ! inverts the exit status. Use return 1 to exit your function with exit status 1.
generic_build_a_module() {
move_to_the_right_directory
echo 'copying the common packages'
if ! ./build/build_sdist.sh; then
echo "Aborted due to error while executing build."
return 1
fi
echo 'installing the api common package'
./build/cache_deps.sh;
}
If you don't want to print an error message, the same program can be written shorter using ||.
generic_build_a_module() {
move_to_the_right_directory
echo 'copying the common packages'
./build/build_sdist.sh || return 1
echo 'installing the api common package'
./build/cache_deps.sh;
}
Alternatively, you could use set -e. This will exit your script immediately when some command exits with a non-zero status.
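A sketch of that variant; note that set -e is suppressed for commands tested in if conditions or in && / || lists, so it works best when the steps are plain statements like these:

#!/bin/bash
set -e    # abort the whole script as soon as any command exits non-zero

move_to_the_right_directory
echo 'copying the common packages'
./build/build_sdist.sh
echo 'installing the api common package'
./build/cache_deps.sh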
You have to do the following:
Run both scripts in the background and store their respective process IDs in two variables.
Keep checking whether the scripts have completed, at an interval of, say, 1 to 2 seconds.
Kill any process that has not completed after a specific time, say 30 seconds.
Example:
sdist=$(ps -fu $USER|grep -v "grep"|grep "build_sdist.sh"| awk '{print $2}')
OR
sdist=$(ps -fu $USER|grep [b]uild_sdist.sh| awk '{print $2}')
deps=$(ps -fu $USER|grep -v "grep"|grep "cache_deps.sh"| awk '{print $2}')
Now use a while loop to check the status after a certain interval, or just check the status directly after 30 seconds like below.
sleep 30
if [ -n "$sdist" ] && ps -p "$sdist" > /dev/null; then
    kill -8 "$sdist"
fi
if [ -n "$deps" ] && ps -p "$deps" > /dev/null; then
    kill -8 "$deps"
fi
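An alternative sketch for the same idea that avoids parsing ps output: capture each PID with $! when backgrounding the scripts, then test the PIDs with kill -0 after the timeout:

./build/build_sdist.sh & sdist=$!
./build/cache_deps.sh  & deps=$!

sleep 30
for pid in "$sdist" "$deps"; do
    if kill -0 "$pid" 2>/dev/null; then    # still running after 30 seconds
        kill "$pid"
    fi
done
wait    # collect the exit statuses of both background jobs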
You can check the exit code status of the last executed command by checking the $? variable. Exit code 0 is a typical indication that the command completed successfully.
Exit codes can be set by using exit followed by the code number within a script.
Here's a previous question regarding the use of $? with more detail, but to simply check this value try:
echo "test";echo $?
# Example
echo 'copying the common packages'; ./build/build_sdist.sh;
if [ $? -ne 0 ]; then
echo "The last command exited with a non-zero code"
fi
[ $? -ne 0 ] checks whether the last executed command's exit code is not equal to 0. Note that shell exit codes are limited to the range 0-255, so a value such as -1 passed to exit shows up as 255; the -ne 0 test still catches it.
The caveat of the above approach is that we have only checked against the last command executed and not the ... multiple operations.... that you mentioned, so we may have missed an error generated by a command executed before pip install.
Depending on the situation you could set -e within a subsequent script, which instructs the shell to exit the script at the first instance a command exits with a non-zero status.
Another option would be to perform a similar check as in the example above within ./build/build_sdist.sh after each command. This would give you the most control over when and how the script finishes and allows the script to set its own exit code.
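A sketch of that per-command approach inside ./build/build_sdist.sh; the variable names come from the snippet above, and the exit code 2 is an arbitrary choice for illustration:

#!/bin/bash
# ... multiple operations, each followed by its own check ...

echo "installing all pip dependencies from $REQUIREMENTS_FILE_PATH and placing their tar.gz into $PACKAGES_DIR"
pip install --no-use-wheel -d "$PACKAGES_DIR" -f "$PACKAGES_DIR" -r "$REQUIREMENTS_FILE_PATH" "$PACKAGES_DIR"/*
if [ $? -ne 0 ]; then
    echo "pip install failed" >&2
    exit 2    # non-zero so the caller's check (or set -e) sees the failure
fi

exit 0    # everything above succeeded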

Is it logical to use the killall command to exit a script?

I am working on a pin generator and I have come across a small issue.
I know of a few different methods for exiting a script, but I have been playing around with calling the same script that is running as a child process. When the child process is not called, the script exits perfectly. When it is called, the parent script does not exit properly after the child has completed and exited; instead it loops back to the user input. I cannot think of anything other than possibly using the "wait" command, though I don't know whether that command is appropriate for this code. Any thoughts on using the "killall" command to exit the script? I have tested it, as you may see in the code below, but I am left with the message "Terminated". If I can use killall, how would I prevent that message from printing to standard out? Here is my code:
#!/bin/bash
clear
echo ""
echo "Now generating a random pin."
sleep 3
echo ""
echo "----------------------------------------------"
echo ""
# Generates a random 8-digit number
gen_num=$(tr -dc '0-9' </dev/urandom | head -c 8)
echo " Pin = $gen_num "
echo ""
echo "Pin has been generated!"
sleep 3
echo ""
clear
PS3="Would you like to generate another pin?: "
select CHOICE in "YES" "NO"
do
if [ "$CHOICE" == "YES" ]
then
bash "/home/yokai/Modules/Wps-options.sh"
elif [ "$CHOICE" == "NO" ]
then
clear
echo ""
echo "Okay bye!"
sleep 3
clear
killall "Wps-options.sh"
break
exit 0
fi
done
exit 0
You don't need to call the same script recursively (and then kill all its instances). The following script performs the task without forking:
#!/bin/bash
gen_pin () {
echo 'Now generating a random pin.'
# Generates a random 8-digit number
gen_num="$(tr -dc '0-9' </dev/urandom | head -c 8)"
echo "Pin = ${gen_num}"
PS3='Would you like to generate another pin?:'
select CHOICE in 'NO' 'YES'
do
case ${CHOICE} in
'NO')
echo 'OK'
exit 0;;
*)
break;;
esac
done
}
while true
do
gen_pin
done
You can find a lot of information about how to program in bash here.
First of all, when you execute
bash "/home/yokai/Modules/Wps-options.sh"
the script forks and creates a child process, then waits for the child to terminate; it does not continue with execution unless your script Wps-options.sh executes something else in the background (forking again) without reaping its child. But I cannot tell you more because I don't know what is in your script Wps-options.sh.
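In other words, a plain foreground call already waits for the child, so wait is only needed if you deliberately put the child in the background; a minimal sketch:

bash /home/yokai/Modules/Wps-options.sh &    # run the child in the background
child=$!
# ... the parent can do other work here ...
wait "$child"                                # reap the child and get its exit status
echo "child exited with $?"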
To prevent messages to be printed to stdout when you execute killall:
killall "Wps-options.sh" 1> /dev/null 2> /dev/null
1> stands for stdout redirection to file /dev/null and 2> stands for stderr redirection to file /dev/null

How can I wait for certain output from a process then continue in Bash?

I'm trying to write a bash script to do some stuff, start a process, wait for that process to say it's ready, and then do more stuff while that process continues to run. The issue I'm running into is finding a way to wait for that process to be ready before continuing, and allowing it to continue to run.
In my specific case I'm trying to setup a PPP connection. I need to wait until it has connected before I run the next command. I would also like to stop the script if PPP fails to connect. pppd prints to stdout.
In pseudo code what I want to do is:
[some stuff]
echo START
[set up the ppp connection]
pppd <options> /dev/ttyUSB0
while 1
if output of pppd contains "Script /etc/ppp/ipv6-up finished (pid ####), status = 0x0"
break
if output of pppd contains "Sending requests timed out"
exit 1
[more stuff, and pppd continues to run]
echo CONTINUING
Any ideas on how to do this?
I had to do something similar waiting for a line in /var/log/syslog to appear. This is what worked for me:
FILE_TO_WATCH=/var/log/syslog
SEARCH_PATTERN='file system mounted'
tail -f -n0 "${FILE_TO_WATCH}" | grep -qe "${SEARCH_PATTERN}"
if [ $? == 1 ]; then
echo "Search terminated without finding the pattern"
fi
It pipes all new lines appended to the watched file to grep and instructs grep to exit quietly as soon as the pattern is discovered. The following if statement detects if the 'wait' terminated without finding the pattern.
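The same idea can be adapted to the pppd question if its output is redirected to a file; a sketch, where the log path is a made-up location and "call myprovider" stands in for the real pppd options:

PPP_LOG=/tmp/pppd.log    # hypothetical log location
pppd call myprovider > "$PPP_LOG" 2>&1 &

# block until either the success or the failure message appears
line=$(tail -f -n0 "$PPP_LOG" | grep -m1 -e 'status = 0x0' -e 'Sending requests timed out')
case "$line" in
    *'status = 0x0'*) echo "CONTINUING" ;;
    *)                echo "pppd failed to connect" >&2; exit 1 ;;
esac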
The quickest solution I came up with was to run pppd with nohup in the background and check the nohup.out file for stdout. It ended up something like this:
sudo nohup pppd [options] 2> /dev/null &
# check to see if it started correctly
PPP_RESULT="unknown"
while true; do
if [[ $PPP_RESULT != "unknown" ]]; then
break
fi
sleep 1
# read in the file containing the std out of the pppd command
# and look for the lines that tell us what happened
while read line; do
if [[ $line == Script\ /etc/ppp/ipv6-up\ finished* ]]; then
echo "pppd has been successfully started"
PPP_RESULT="success"
break
elif [[ $line == LCP:\ timeout\ sending\ Config-Requests ]]; then
echo "pppd was unable to connect"
PPP_RESULT="failed"
break
elif [[ $line == *is\ locked\ by\ pid* ]]; then
echo "pppd is already running and has locked the serial port."
PPP_RESULT="running"
break;
fi
done < <( sudo cat ./nohup.out )
done
There's a tool called "Expect" that does almost exactly what you want. More info: http://en.wikipedia.org/wiki/Expect
You might also take a look at the man pages for "chat", which is a pppd feature that does some of the stuff that expect can do.
If you go with expect, as #sblom advised, please check autoexpect.
You run what you need via autoexpect command and it will create expect script.
Check man page for examples.
Sorry for the late response, but a simpler way would be to use wait.
wait is a bash built-in command which waits for a process to finish.
The following is an excerpt from the man page:
wait [n ...]
    Wait for each specified process and return its termination status. Each n may be a process ID or a job specification; if a job spec is given, all processes in that job's pipeline are waited for. If n is not given, all currently active child processes are waited for, and the return status is zero. If n specifies a non-existent process or job, the return status is 127. Otherwise, the return status is the exit status of the last process or job waited for.
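A minimal usage sketch (some_long_command is a placeholder):

some_long_command &    # start the work in the background
pid=$!

# ... do other things while it runs ...

wait "$pid"            # block until that specific process finishes
echo "process $pid exited with status $?"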
For further reference on usage, refer to the wiki page.
