Exit a bash script if an error occurs in it or any of the background jobs it creates [duplicate] - bash

This question already has answers here:
How to wait in bash for several subprocesses to finish, and return exit code !=0 when any subprocess ends with code !=0?
(35 answers)
Closed 6 years ago.
Background
I'm working on a bash script to automate the process of building half a dozen projects that live in the same directory. Each project has two scripts to run in order to build it:
npm install
npm run build
The first line will fetch all of the dependencies from npm. Since this step takes the longest, and since the projects can fetch their dependencies simultaneously, I'm using a background job to fetch everything in parallel. (ie: npm install &)
The second line will use those dependencies to build the project. Since this must happen after all the Step 1s finish, I'm running the wait command in between. See code snippet below.
The Question
I would like to have my script exit as soon as an error occurs in any of the background jobs, or the npm run build step that happens afterward.
I'm using set -e, however this does not apply to the background jobs, and thus if one project fails to install its dependencies, everything else keeps going.
Here is a simplified example of how my script looks right now.
build.sh
set -e
DIR=$PWD
for dir in ./projects/**/
do
  echo -e "\033[4;32mInstalling $dir\033[0m"
  cd $dir
  npm install & # takes a while, so do this in parallel
  cd $DIR
done
wait # continue once the background jobs are completed
for dir in ./projects/**/
do
  cd $dir
  echo -e "\033[4;32mBuilding $dir\033[0m"
  npm run build # Some projects use other projects, so build them in series
  cd $DIR
  echo -e "\n"
done
Again, I don't want to continue doing anything in the script if an error occurs at any point, this applies to both the parent and background jobs. Is this possible?

Collect the PIDs of the background jobs; then use wait to collect the exit status of each, exiting the first time any PID waited on in that loop returns a nonzero status.
install_pids=( )
for dir in ./projects/**/; do
  (cd "$dir" && exec npm install) & install_pids+=( $! )
done
for pid in "${install_pids[@]}"; do
  wait "$pid" || exit
done
The above, while simple, has a caveat: If an item late in the list exits nonzero prior to items earlier in the list, this won't be observed until that point in the list is polled. To work around this caveat, you can repeatedly iterate through the entire list:
install_pids=( )
for dir in ./projects/**/; do
  (cd "$dir" && exec npm install) & install_pids+=( $! )
done
while (( ${#install_pids[@]} )); do
  for pid_idx in "${!install_pids[@]}"; do
    pid=${install_pids[$pid_idx]}
    if ! kill -0 "$pid" 2>/dev/null; then # kill -0 checks for process existence
      # we know this pid has exited; retrieve its exit status
      wait "$pid" || exit
      unset "install_pids[$pid_idx]"
    fi
  done
  sleep 1 # in bash, consider a shorter non-integer interval, e.g. 0.2
done
However, because this polls, it incurs extra overhead. This can be avoided by trapping SIGCHLD and referring to jobs -n (to get a list of jobs whose status changed since prior poll) when the trap is triggered.
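As an alternative to both polling and a SIGCHLD trap, bash 4.3 and later provide wait -n, which returns as soon as any one background job exits. The following is only a minimal sketch of that idea, not the trap-based approach described above; the function name run_parallel_fail_fast and the choice to run each argument through bash -c are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: requires bash >= 4.3 for `wait -n`.

# run_parallel_fail_fast CMD_STRING...   (hypothetical helper)
# Runs every argument as a background command and returns the first
# nonzero exit status as soon as it is observed.
run_parallel_fail_fast() {
    local cmd status remaining=0
    for cmd in "$@"; do
        bash -c "$cmd" &
        remaining=$((remaining + 1))
    done
    while [ "$remaining" -gt 0 ]; do
        status=0
        wait -n || status=$?            # returns when any one job exits
        if [ "$status" -ne 0 ]; then
            kill $(jobs -p) 2>/dev/null # stop the jobs that are still running
            wait                        # reap them before returning
            return "$status"
        fi
        remaining=$((remaining - 1))
    done
    return 0
}
```

In the build script this might be called as run_parallel_fail_fast "cd projects/a && npm install" "cd projects/b && npm install" || exit, where the project paths are placeholders.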

Bash isn't made for parallel processing such as this. To accomplish what you want, I had to write a function library. I'd suggest seeking a language more readily suited to this if possible.
The problem with looping through the pids, such as this...
#!/bin/bash
pids=()
f() {
  sleep "$1"
  echo "no good"
  false
}
t() {
  sleep "$1"
  echo "good"
  true
}
t 3 &
pids+=("$!") # append as a new array element, not as a string
f 1 &
pids+=("$!")
t 2 &
pids+=("$!")
for p in "${pids[@]}"; do
  wait "$p" || echo failed
done
The problem is that "wait" blocks on the first pid: if the other pids finish before the first one does, their exit codes are not examined until the loop reaches them. In the code above, f fails after 1 second, but "failed" is printed only once the first job has finished after 3 seconds, so the failure is noticed late rather than when it happens.

Related

wait command not working on parent process [duplicate]

Context:
Users provide me with their custom scripts to run. These scripts can be of any sort, such as scripts that start multiple GUI programs or backend services. I have no control over how the scripts are written. These scripts can be of the blocking type, i.e. execution waits until all the child processes (programs that are run sequentially) exit
# example of blocking script
echo "START"
first_program
second_program
echo "DONE"
or of the non-blocking type, i.e. ones that fork child processes in the background and exit, something like
#example of non-blocking script
echo "START"
first_program &
second_program &
echo "DONE"
What am I trying to achieve?
User-provided scripts can be of either of the above two types, or a mix of both. My job is to run the script, wait until all the processes started by it exit, and then shut down the node. If it's of the blocking type, the case is plain and simple, i.e. get the PID of the script execution process and wait till ps -ef|grep -ef PID has no more entries. Non-blocking scripts are the ones giving me trouble.
Is there a way I can get a list of the PIDs of all the child processes spawned by the execution of a script? Any pointers or hints will be highly appreciated.
You can use wait to wait for all the background processes started by userscript to complete. Since wait only works on children of the current shell, you'll need to source their script instead of running it as a separate process.
( source userscript; wait )
Sourcing the script in an explicit subshell should simulate starting a new process closely enough. If not, you can also background the subshell, which forces a new process to be started, then wait for it to complete.
( source userscript; wait ) & wait
ps --ppid $PID will list all child processes of the process with $PID.
You can open a file descriptor that gets inherited by other processes, and then wait until it's no longer in use. This is a low overhead method that usually works fine, though it's possible for processes to work around it if they want:
foo=$(mktemp)
( flock -x 5000; theirscript; ) 5000> "$foo"
flock -x 0 < "$foo"
rm "$foo"
echo "The script and its subprocesses are done"
You can follow all invoked processes using ptrace, such as with strace. This is easier, but has some associated overhead and may not work when scripts invoke suid binaries:
strace -f -e none theirscript
You can use pgrep -P <parent_pid> to get a list of child processes. Example:
IFS=$'\n' read -ra CHILD_PROCS -d '' < <(exec pgrep -P "$1")
And to get the grand-children, simply do the same procedure on each child process.
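The recursion described above can be sketched as a small function. This is an illustrative sketch rather than the code from the answer; list_descendants is a hypothetical name, and it assumes pgrep (from procps or a compatible package) is installed.

```shell
#!/usr/bin/env bash
# list_descendants PID   (hypothetical helper)
# Prints every descendant of PID, one per line, by calling pgrep -P on the
# process and then recursively on each child it reports.
list_descendants() {
    local child
    for child in $(pgrep -P "$1"); do
        echo "$child"
        list_descendants "$child"
    done
}
```

As with any snapshot of the process table, the result can be stale by the time it is used, and children that have been reparented (orphaned) will not appear under the original parent.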
Check out my blog Bash functions to list and kill or send signals to process trees.
You can use one of those functions to properly list all processes spawned under one process. Each has its own method or order of sending signals to the processes.
The only limitation is that the processes still have to be connected and not orphaned. If you could somehow find a way to group your processes, then that might be your solution.
To simply answer the question that was asked: you could store the process ID of each script you're calling in the same variable:
echo "START"
first_program &
child_process_ids+="$! "
second_program &
child_process_ids+="$! "
echo $child_process_ids
echo "DONE"
$child_process_ids would just be a space delimited string of process Ids. Now, this answers the question asked, however, what I would do would be a bit different. I would call each script from a for loop, store its process ID, then wait on each one in another for loop to finish and inspect each exit code individually. Using the same example, here's what it would look like.
echo "START"
scripts="first_program second_program"
for script in $scripts; do
#Call script and send to background
./$script &
#Store the script's processID that was just sent to the background
child_process_ids+="$! "
done
for child_process_id in $child_process_ids; do
#Pass each processId into the wait command to retrieve its exit
#code and store it in $rc
wait $child_process_id
rc=$?
#Inspect each processes exit code
if [ $rc -ne 0 ]; then
echo "$child_process_id failed with an exit code of $rc"
else
echo "$child_process_id was successful"
fi
done

Prevent SIGINT from interrupting current task while still passing information about SIGINT (and preserve the exit code)

I have a quite long shell script and I'm trying to add signal handling to it.
The main task of the script is to run various programs and then clean up their temporary files.
I want to trap SIGINT.
When the signal is caught, the script should wait for the current program to finish execution, then do the cleanup and exit.
Here is an MCVE:
#!/bin/sh
stop_this=0
trap 'stop_this=1' 2
while true ; do
result="$(sleep 2 ; echo success)" # run some program
echo "result: '$result'"
echo "Cleaning up..." # clean up temporary files
if [ $stop_this -ne 0 ] ; then
echo 'OK, time to stop this.'
break
fi
done
exit 0
The expected result:
Cleaning up...
result: 'success'
Cleaning up...
^Cresult: 'success'
Cleaning up...
OK, time to stop this.
The actual result:
Cleaning up...
result: 'success'
Cleaning up...
^Cresult: ''
Cleaning up...
OK, time to stop this.
The problem is that the currently running instruction (result="$(sleep 2 ; echo success)" in this case) is interrupted.
What can I do so that it behaves more as if I had set trap '' 2?
I'm looking for either a POSIX solution or one that is supported by most shell interpreters (BusyBox, dash, Cygwin...)
I already saw the answers to Prevent SIGINT from closing child process in bash script, but they don't really work for me. All of these solutions require modifying each line that shouldn't be interrupted. My real script is quite long and much more complicated than the example. I would have to modify hundreds of lines.
You need to prevent the SIGINT from going to the echo in the first place (or rewrite the cmd that you are running in the variable assignment to ignore SIGINT). Also, you need to allow the variable assignment to happen, and it appears that the shell is aborting the assignment when it receives the SIGINT. If you're only worried about user generated SIGINT from the tty, you need to disassociate that command from the tty (eg, get it out of the foreground process group) and prevent the SIGINT from aborting the assignment. You can (almost) accomplish both of those with:
#!/bin/sh
stop_this=0
while true ; do
trap 'stop_this=1' INT
{ sleep 1; echo success > tmpfile; } & # run some program
while ! wait; do : ; done
trap : INT
result=$(cat tmpfile& wait)
echo "result: '$result'"
echo "Cleaning up..." # clean up temporary files
if [ $stop_this -ne 0 ] ; then
echo 'OK, time to stop this.'
break
fi
done
exit 0
If you're worried about SIGINT from another source, you'll have to re-implement sleep (or whatever command I presume sleep is a proxy for) to handle SIGINT the way you want. The key here is to run the command in the background and wait for it, which prevents the SIGINT from going to it and terminating it early. Note that we've opened at least two new cans of worms here. By waiting in a loop, we're effectively ignoring any errors that the subcommand might raise (we're doing this to try to emulate SA_RESTART-style restarting of the interrupted wait), so the script may potentially hang. Also, if the SIGINT arrives during the cat, we have attempted to prevent the cat from aborting by running it in the background, but now the variable assignment will be terminated and you'll get your original behavior. Signal handling is not clean in the shell! But this gets you closer to your desired goal.
Signal handling in shell scripts can get clumsy. It's pretty much impossible to do it "right" without the support of C.
The problem with:
result="$(sleep 2 ; echo success)" # run some program
is that $() creates a subshell, and in subshells, non-ignored signals (trap '' SIGNAL is how you ignore SIGNAL) are reset to their default dispositions, which for SIGINT is to terminate the process ($( ) gets its own process, though it will receive the signal too, because the terminal-generated SIGINT is process-group targeted).
To prevent this, you could do something like:
result="$(
trap '' INT #ignore; could get killed right before the trap command
sleep 2; echo success)"
or
result="$( trap : INT; #no-op handler; same problem
sleep 2; while ! echo success; do :; done)"
but as noted, there will be a small race-condition window between the start of the subshell and the registration of the signal handler, during which the subshell could get killed by the reset-to-default SIGINT signal.
Both answers from @PSkocik and @WilliamPursell have helped me to get on the right track.
I have a fully working solution. It ain't pretty, because it needs to use an external file to indicate that the signal didn't occur, but besides that it should work reliably.
#!/bin/sh
touch ./continue
trap 'rm -f ./continue' 2
( # the whole main body of the script is in a separate background process
trap '' 2 # ignore SIGINT
while true ; do
result="$(sleep 2 ; echo success)" # run some program
echo "result: '$result'"
echo "Cleaning up..." # clean up temporary files
if [ ! -e ./continue ] ; then # exit the loop if file "./continue" is deleted
echo 'OK, time to stop this.'
break
fi
done
) & # end of the main body of the script
while ! wait ; do : ; done # wait for the background process to end (ignore signals)
wait $! # wait again to get the exit code
result=$? # exit code of the background process
rm -f ./continue # clean up if the background process ended without a signal
exit $result
EDIT: There are some problems with this code in Cygwin.
The main functionality regarding signals works.
However, it seems that the finished background process doesn't stay in the system as a zombie. This makes wait $! fail, so the script's exit code has the incorrect value of 127.
A solution would be to remove the lines wait $! and result=$? so that the script always returns 0.
It should also be possible to keep the proper error code by using another layer of subshell and temporarily storing the exit code in a file.
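The "another layer of subshell plus a file" idea could look roughly like the sketch below. It is an assumption about how that suggestion might be realized, not code from the answer: run_and_keep_status is a hypothetical helper, mktemp is assumed to be available (it is not required by POSIX), and the signal traps from the script above are left out to keep the example short.

```shell
#!/bin/sh
# run_and_keep_status CMD [ARGS...]   (hypothetical helper)
# Runs CMD in a background subshell, waits for it even if trapped signals
# interrupt the wait, and leaves CMD's exit status in the variable `result`.
run_and_keep_status() {
    status_file=$(mktemp) || return 1
    (
        ( "$@" )                    # extra subshell: stands in for the main body
        echo "$?" > "$status_file"  # record its exit code before this layer ends
    ) &
    while ! wait; do :; done        # retry when a trapped signal interrupts wait
    result=$(cat "$status_file")    # read the status back from the file
    rm -f "$status_file"
}
```

Because the status travels through a file, it does not matter whether the finished background process can still be waited on by PID.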
For disallowing interrupting the program:
trap "" ERR HUP INT QUIT TERM TSTP TTIN TTOU
But if a sub-command handles traps by itself, and that command must really complete, you need to prevent passing signals to it.
For people on Linux that don't mind installing extra commands, you can just use:
waitFor [command]
Alternatively you can adapt the latest source code of waitFor into your program as needed, or use the code from Gilles' answer. Although that has the disadvantage of not benefiting from updates upstream.
Just mind that other terminals and the service manager can still terminate "command". If you want the service manager to be unable to close "command", it shall be run as a service with the appropriate kill mode and kill signal set.
You may want to adapt the following:
#!/bin/sh
tmpfile=".tmpfile"
rm -f $tmpfile
trap : INT
# put the action that should not be interrupted in the innermost brackets
# | |
( set -m; (sleep 10; echo success > $tmpfile) & wait ) &
wait # wait will be interrupted by Ctrl+c
while [ ! -r $tmpfile ]; do
echo "waiting for $tmpfile"
sleep 1
done
result=`cat $tmpfile`
echo "result: '$result'"
This seems also to work with programs that install their own SIGINT handler like mpirun and mpiexec and so on.

How to capture a process Id and also add a trigger when that process finishes in a bash script?

I am trying to make a bash script to start a jar file and do it in the background. For that reason I'm using nohup. Right now I can capture the pid of the java process but I also need to be able to execute a command when the process finishes.
This is how I started
nohup java -jar jarfile.jar & echo $! > conf/pid
I also know from this answer that using ; will make a command execute after the first one finishes.
nohup java -jar jarfile.jar; echo "done"
echo "done" is just an example. My problem now is that I don't know how to combine them both. If I run echo $! first then echo "done" executes immediately. While if echo "done" goes first then echo $! will capture the PID of echo "done" instead of the one of the jarfile.
I know that I could achieve the desire functionality by polling until I don't see the PID running anymore. But I would like to avoid that as much as possible.
You can use the bash util wait once you start the process using nohup
nohup java -jar jarfile.jar &
pid=$! # Getting the process id of the last command executed
wait $pid # Waits until the process mentioned by the pid is complete
echo "Done, execute the new command"
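To record the PID and still run a follow-up command without blocking the calling script, the start, the PID capture, and the wait can all be placed in one background subshell. This is a hedged sketch: start_and_notify and its arguments are hypothetical names, and the output redirection to /dev/null is an assumption made so that nohup does not create a nohup.out file.

```shell
#!/usr/bin/env bash
# start_and_notify PIDFILE CALLBACK CMD [ARGS...]   (hypothetical helper)
# Starts CMD under nohup, writes its PID to PIDFILE, and calls CALLBACK with
# CMD's exit status once CMD finishes. The caller is not blocked.
start_and_notify() {
    local pidfile=$1 callback=$2
    shift 2
    (
        nohup "$@" >/dev/null 2>&1 &
        echo "$!" > "$pidfile"   # PID of the command just started
        wait "$!"                # blocks only this background subshell
        "$callback" "$?"         # $? is the command's exit status from wait
    ) &
}
```

For the jar in the question this might be invoked as start_and_notify conf/pid on_done java -jar jarfile.jar, with on_done being a shell function defined by the caller.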
I don't think you're going to get around "polling until you don't see the pid running anymore." wait is a bash builtin; it's what you want and I'm certain that's exactly what it does behind the scenes. But since Inian beat me to it, here's a friendly function for you anyway (in case you want to get a few things running in parallel).
alert_when_finished () {
  declare cmd="${*}";
  ${cmd} &
  declare pid="${!}";
  while [[ -d "/proc/${pid}/" ]]; do :; done; #busy-polls /proc until the process exits (Linux only)
  echo "[${pid}] Finished running: ${cmd}";
}
Running a command like this will give the desired effect and suppress unneeded job output:
( alert_when_finished 'sleep 5' & )

Introduce timeout in a bash for-loop

I have a task that works very well inside a bash for loop. The problem, though, is that a few of the iterations seem not to terminate. What I'm looking for is a way to introduce a timeout, so that if an iteration of the command hasn't terminated after, e.g., two hours, it is terminated and the loop moves on to the next iteration.
Rough outline:
for somecondition; do
while time-run(command) < 2h do
continue command
done
done
One (tedious) way is to start the process in the background, then start another background process that attempts to kill the first one after a fixed timeout.
timeout=7200 # two hours, in seconds
for somecondition; do
command & command_pid=$!
( sleep $timeout & wait; kill $command_pid 2>/dev/null) & sleep_pid=$!
wait $command_pid
kill $sleep_pid 2>/dev/null # If command completes prior to the timeout
done
The wait command blocks until the original command completes, whether naturally or because it was killed after the sleep completes. The wait immediately after sleep is used in case the user tries to interrupt the process, since sleep ignores most signals, but wait is interruptible.
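Where GNU coreutils is installed, its timeout command packages the same pattern: it runs a command, kills it when the time limit is reached, and exits with status 124 in that case. The sketch below is an alternative to the hand-rolled version above, not part of that answer; run_with_limit is a hypothetical wrapper name, and timeout can only run external commands, not shell functions.

```shell
#!/usr/bin/env bash
# run_with_limit LIMIT CMD [ARGS...]   (hypothetical wrapper)
# Runs CMD under GNU coreutils `timeout`; LIMIT is a duration such as 7200
# or 2h. Reports a timeout on stderr and passes the exit status through.
run_with_limit() {
    local limit=$1 status=0
    shift
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after $limit: $*" >&2
    fi
    return "$status"
}
```

Inside the loop, something like run_with_limit 2h command || continue would then move on to the next iteration when the command fails or times out.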
If I'm understanding your requirement properly, you have a process that needs to run, but you want to make sure that if it gets stuck it moves on, right? I don't know if this will fully help you out, but here is something I wrote a while back to do something similar (I've since improved this a bit, but I only have access to a gist at present, I'll update with the better version later).
#!/bin/bash
######################################################
# Program: logGen.sh
# Date Created: 22 Aug 2012
# Description: parses logs in real time into daily error files
# Date Updated: N/A
# Developer: @DarrellFX
######################################################
#Prefix for pid file
pidPrefix="logGen"
#output directory
outDir="/opt/Redacted/logs/allerrors"
#Simple function to see if running on primary
checkPrime ()
{
if /sbin/ifconfig eth0:0|/bin/grep -wq inet;then isPrime=1;else isPrime=0;fi
}
#function to kill previous instances of this script
killScript ()
{
/usr/bin/find /var/run -name "${pidPrefix}.*.pid" |while read pidFile;do
if [[ "${pidFile}" != "/var/run/${pidPrefix}.${$}.pid" ]];then
/bin/kill -- -$(/bin/cat ${pidFile})
/bin/rm ${pidFile}
fi
done
}
#Check to see if primary
#If so, kill any previous instance and start log parsing
#If not, just kill leftover running processes
checkPrime
if [[ "${isPrime}" -eq 1 ]];then
echo "$$" > /var/run/${pidPrefix}.$$.pid
killScript
commands && commands && commands #Where the actual command to run goes.
else
killScript
exit 0
fi
I then set this script to run on cron every hour. Every time the script is run, it:
1. creates a lock file, named after a variable that describes the script, that contains the pid of that instance of the script;
2. calls the function killScript, which uses the find command to find all lock files for that version of the script (this lets more than one of these scripts be set to run in cron at once, for different tasks). For each file it finds, it kills the processes of that lock file and removes the lock file (it automatically checks that it's not killing itself);
3. starts doing whatever it is I need to run and not get stuck (I've omitted that as it's hideous bash string manipulation that I've since redone in python).
If this doesn't get you squared let me know.
A few notes:
- The checkPrime function is poorly done, and should either return a status or just exit the script itself.
- There are better ways to create lock files and be safe about it, but this has worked for me thus far (famous last words).

How to switch a sequence of tasks to background?

I'm running two tests on a remote server, here is the command I used several hours ago:
% ./test1.sh; ./test2.sh
The two tests are supposed to run one after the other. If the second runs before the first completes, everything will be ruined, and I'll have to restart the whole procedure.
The dilemma is that these two tasks take many hours to complete, and I want to log out of the server and wait for the result, but I don't know how to switch both of them to the background. If I use Ctrl+Z, only the first task will be suspended, while the second starts, doing nothing useful while wiping out the current data.
Is it possible to switch both of them to the background, preserving their order? Actually, I should have put these two tasks in the same process group, like (./test1.sh; ./test2.sh) &, but sadly the first test has already been running for several hours, and it would be quite a pity to restart the tests.
An option is to kill the second test before it starts, but is there any mechanism to cope with this?
First rename the ./test2.sh to ./test3.sh. Then do [CTRL+Z], followed by bg and disown -h. Then save this script (test4.sh):
while :; do
sleep 5;
pgrep -f test1.sh &> /dev/null
if [ $? -ne 0 ]; then
nohup ./test3.sh &
break
fi
done
Then run nohup ./test4.sh &, and you can log out.
First, screen or tmux are your friends here, if you don't already work with them (they make remote machine work an order of magnitude easier).
To use conditional consecutive execution you can write:
./test1.sh && ./test2.sh
which will only execute test2.sh if test1.sh returns with 0 (conventionally meaning: no error). Example:
$ true && echo "first command was successful"
first command was successful
$ ! true && echo "ain't gonna happen"
More on control operators: http://www.humbug.in/docs/the-linux-training-book/ch08s01.html
