Bash - kill ALL parallel processes if one fails

I have 2 processes that I run via:
init_thing & start_thing
init_thing polls the logs of start_thing for a particular line that it considers to show that start_thing has successfully begun, then executes a few commands against it (e.g. adding users).
The init_thing function could fail with a non-zero exit code if it considers start_thing to have timed out.
The start_thing function could fail, but if successful it runs forever.
What I want to do is kill start_thing if init_thing fails.
I've seen use of GNU parallel in a lot of answers, but it seems to rely on both processes completing (i.e. exiting with a zero exit-code), which in my case doesn't apply.
Is there a way to do this with bash? Perhaps using parallel in a way that I haven't seen/understood?

trap ERR may be useful here, where the pid variable contains the PID of the process to kill:
trap 'kill $pid' ERR
On reflection, it is clearer to write it explicitly:
init_thing || {
    echo "something went wrong, killing $pid"
    kill "$pid"
}
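Putting it together for this question, a minimal sketch (assuming start_thing and init_thing are functions or commands already defined in the script):

start_thing &    # run the long-lived process in the background
pid=$!           # remember its PID

if ! init_thing; then
    echo "init_thing failed, killing start_thing (pid $pid)" >&2
    kill "$pid"
    exit 1
fi

wait "$pid"      # init succeeded; stay attached to start_thing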

Related

Issues with `trap cleanup INT EXIT` in BASH

I wanted my BASH script to end in a defined way doing some cleanup before exiting.
It's easy to do if the script runs until end, but it's getting tricky if the user is impatient and sends a SIGINT (^C).
So I added a trap cleanup INT EXIT (cleanup is my function to clean things up), and I thought things were OK (as cleanup would be called when the script exits, cleanup itself does not use exit).
But then I started a test adding kill -INT $$; sleep 4 in the middle of the script, and I realized that cleanup is being called on SIGINT, but still the sleep 4 was executed and at the end of my script cleanup was called a second time, something I did not intend.
So I wanted to "reset" the handlers at the end of my cleanup using trap INT EXIT as the manual page said the syntax is "trap [-lp] [[arg] sigspec ...]" (also saying: "If arg is absent (and there is a single sigspec) or -, each specified signal is reset to its original disposition (the value it had upon entrance to the shell).").
Interestingly that did not work as intended, so I used trap '' INT EXIT instead (The manual says: "If arg is the null string the signal specified by each sigspec is ignored by the shell and by the commands it invokes.").
It would be a nice sub-question how to do it correctly, but let's ignore that right now.
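(For what it's worth, the reset form that matches the manual needs an explicit - when more than one sigspec is given; in trap INT EXIT, bash parses INT as the handler string for EXIT rather than as a sigspec:)

trap - INT EXIT    # reset both signals to their original disposition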
If I modify my trap to trap cleanup INT, then the cleanup is executed immediately when receiving the SIGINT, and not when the script exits after the sleep eventually (SIGINT does not cause the script to exit early).
If I modify my trap to trap cleanup EXIT, then the cleanup is executed immediately when receiving the SIGINT, and the script ends after cleanup returned.
So the question is: Does trap cleanup INT EXIT make any sense (for cleanup purposes)?
It seems to me that EXIT includes the exits caused by any signal, too (I'm unsure whether that has been the case always).
By contrast, trapping SIGINT performs the cleanup actions without actually causing the script to exit.
Is there a general agreed-on "cleanup trap pattern"?
(There is a similar question in bash robustness: what is a correct and portable way to trap for the purpose of an "on exit" cleanup routine?, but it has no good answer)
The shell does not exit when a signal for which a trap has been set is received. So, the answer is no; trap cleanup INT EXIT does not make any sense for cleanup purposes, as it prevents SIGINT from interrupting the execution of the program, and hooks the cleanup routine to an event that doesn't warrant a cleanup anymore.
Not sure how agreed-upon, but this is how I do an automatic cleanup on normal or signal-driven termination:
cleanup() {
    # do the cleanup
}
trap cleanup EXIT
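A slightly fuller sketch of the same pattern (the temporary file is purely illustrative):

tmpfile=$(mktemp)
cleanup() {
    rm -f "$tmpfile"    # undo whatever the script set up
}
trap cleanup EXIT
# ... script body; on a normal exit, or when an untrapped fatal
# signal such as SIGINT terminates the script, bash runs the
# EXIT trap exactly once.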

Trying to close all child processes when I interrupt my bash script

I have written a bash script to carry out some tests on my system. The tests run in the background and in parallel. The tests can take a long time and sometimes I may wish to abort the tests part way through.
If I press Control+C then it aborts the parent script, but leaves the various children running. I wish to make it so that I can hit Control+C (or otherwise quit) and then kill all child processes running in the background. I have a bit of code that does the job if I'm running the background jobs directly from the terminal, but it doesn't work in my script.
I have a minimal working example.
I have tried using trap in combination with pgrep -P $$.
#!/bin/bash
trap 'kill -n 2 $(pgrep -P $$)' 2
sleep 10 &
wait
I was hoping that hitting Control+C (SIGINT) would kill everything that the script started, but it actually says:
./breakTest.sh: line 1: kill: (3220) - No such process
This number changes, but doesn't seem to apply to any running processes, so I don't know where it is coming from.
I guess if the contents of the trap command get evaluated where the trap command occurs then it might explain the outcome. The 3220 pid might be for pgrep itself.
I'd appreciate some insight here
Thanks
I have found a solution using pkill. This example also deals with many child processes.
#!/bin/bash
trap 'pkill -P $$' SIGINT SIGTERM
for i in {1..10}; do
    sleep 10 &
done
wait
This appears to kill all the child processes elegantly, though I don't properly understand what the issue was with my original code, apart from now sending the correct signal. (The likely culprit: the $(pgrep -P $$) command substitution runs in a subshell that is itself a child of $$, so pgrep reports a PID that no longer exists by the time kill runs.)
In bash, whenever you use & after a command, it runs that command as a background job (these background jobs are referred to by a job_spec), with the job number incrementing by one until you exit that terminal session. You can use the jobs command to get the list of the background jobs running. To work with these jobs you refer to them with % followed by the job id. The jobs command also accepts other options, such as jobs -p to see the process IDs of all jobs, and jobs -p %JOB_SPEC to see the process ID of that particular job.
#!/usr/bin/env bash
trap 'kill -9 %1' 2
sleep 10 &
wait
or
#!/usr/bin/env bash
trap 'kill -9 $(jobs -p %1)' 2
sleep 10 &
wait
I implemented something like this a few years back; you can take a look at it: async bash
You can try something like the following:
pkill -TERM -P <your_parent_id_here>

Bash files: run process in parallel and stop when one is over

I would like to start two C programs in parallel from a bash script, and have the second one stop when the first one has finished.
The wait builtin waits for both processes to finish, which is not what I would like.
Thanks for any suggestion.
GNU parallel can do this kind of job. Check the termination section; it can shut down the remaining processes based on the exit code (either success or failure):
parallel -j2 --halt now,success=1 ::: 'cmd1 args' 'cmd2 args'
When one of the jobs finishes successfully, it will send a TERM signal to the other jobs (and if they do not terminate, it forces them to with KILL).
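For the failure-driven variant asked about at the top (kill everything as soon as one job fails), the analogous invocation would use --halt now,fail=1; note that shell functions would have to be exported (or run via env_parallel) for parallel to see them:

parallel -j2 --halt now,fail=1 ::: 'init_thing args' 'start_thing args'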
With $! you get the PID of the most recently started background command. See some nice examples here: Bash `wait` command, waiting for more than 1 PID to finish execution
For your particular problem I imagine something like:
#!/bin/bash
command_master() {
    echo "Command_master"
    sleep 1
}

command_tokill() {
    echo "Command_tokill"
    sleep 10
}

command_master & pid_master=$!
command_tokill & pid_tokill=$!
wait "$pid_master"
kill "$pid_tokill"
wait -n is what you are looking for. It waits for the next job to finish. You can then have a list of the PIDs of the remaining jobs with jobs -p if you want to kill them.
prog1 & pids=( $! )
prog2 & pids+=( $! )
wait -n
kill "${pids[#]}"
This requires bash.
The two programs are started as background jobs, and the shell waits for one of them to exit.
When this happens, kill is used to terminate both processes (this will cause an error since one of them is already dead).
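If the expected error bothers you, discard kill's stderr for the job that has already finished:

kill "${pids[@]}" 2>/dev/null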

Background process getting killed when its parent is terminated?

I have code that looks something like this
doTheThing() {
    # a potentially infinite while loop...
}
# other stuff...
doTheThing &
trap "kill $!" SIGINT SIGTERM
Strangely, when I Ctrl-C out of the parent process before the loop is done, I get a message that the process doesn't exist. Furthermore, if I get rid of the trap, I can't find the process with ps -aF. It looks like the background process is getting killed when its parent is terminated, but my understanding was that this wasn't supposed to happen. I just want to make sure that I can safely leave out the trap and not leave zombie processes everywhere.
The POSIX specification says that when you type the interrupt character (normally Control-C) the SIGINT is sent to the foreground process group. So as long as the background process is running in the same process group as the script that invoked it, it will receive the signal at the same time as the script process.
Shells generally use process groups to implement job control, and by default this is only enabled in interactive shells, not shells running scripts. There's no standard way to run a function in its own process group, but you could use setsid to run it in a new session, which is an even higher level of grouping than process groups. Then it wouldn't receive the interrupt.
You might still want to write a trap command that kills the function on EXIT, though.
doTheThing &
trap "kill $!" EXIT
since exiting the script doesn't automatically kill the rest of the process group.
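If you do want the setsid variant mentioned above, a sketch (assuming setsid from util-linux, and exporting the function so the new shell can see it):

export -f doTheThing          # make the function visible to the child shell
setsid bash -c doTheThing &   # run it in a new session and process group
pid=$!
trap 'kill -- "-$pid"' EXIT   # signal the whole new process group on exit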

How to kill all children of the current shell on interrupt?

My scripts cdist-deploy-to and cdist-mass-deploy (from cdist configuration management) run interactively (i.e. are called by a user).
These scripts call a lot of scripts, which again call some scripts:
cdist-mass-deploy ...
cdist-deploy-to ...
cdist-explorer-run-global ...
cdist-dir ....
What I want is to exit / kill all scripts, as soon as cdist-mass-deploy is either stopped by control C (SIGINT) or killed with SIGTERM.
cdist-deploy-to can also be called interactively and should exhibit the same behaviour.
Using ps -ef and co. to find all processes with a given PPID looks like it could be quite unportable. Using $! does not work, as at the deeper levels the children are not background processes.
I tried using the following code:
__cdist_kill_on_interrupt()
{
    __cdist_tmp_removal
    kill 0
    exit 1
}
trap __cdist_kill_on_interrupt INT TERM
But this leads to ugly Terminated messages as well as to a segfault in the shells (dash, bash, zsh) and seems not to stop everything instantly anyway:
# cdist-mass-deploy -p ikq04.ethz.ch ikq05.ethz.ch
core: Waiting for cdist-deploy-to jobs to finish
^CTerminated
Terminated
Terminated
Terminated
Segmentation fault
So the question is, how to cleanly exit including all (sub-)children in a portable manner (bourne shell, no csh support needed)?
You don't need to handle ^C; that will result in a signal being sent to the whole process group, which will kill all the processes that are not in the background. So you don't need to catch INT.
The only reason you get a Terminated when you kill them is that kill sends TERM by default, but that's reasonable if you are handling a TERM in the first place. You could use kill -INT 0 if you want to avoid the messages.
If the child processes are run in the background, you can get their process ids just after you start them, using the $! special shell variable. Gather these together in a variable and just kill them all when you need to terminate.
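A portable (POSIX sh) sketch of that idea, with child1 and child2 standing in for the real scripts:

pids=""
child1 & pids="$pids $!"    # collect each background PID
child2 & pids="$pids $!"
trap 'kill $pids 2>/dev/null; exit 1' INT TERM    # $pids left unquoted so it splits
wait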
