Is it possible in Bash to spawn multiple processes and after the last process finishes, report how many of the processes terminated correctly/didn't core dump?
Or would it be better to do this in Python?
(I'd ideally like to report which command failed, if any)
You can hopefully leverage GNU Parallel and its failure handling. General example:
parallel ::: ./processA ./processB ./processC
A more specific example: here I run 3 simple jobs, each surrounded by single quotes, and set it up to stop once all jobs have either completed or failed:
parallel --halt soon,fail=100% ::: 'echo 0 && exit 0' 'echo 1 && exit 1' 'echo 2 && exit 2'
Output
0
1
parallel: This job failed:
echo 1 && exit 1
2
parallel: This job failed:
echo 2 && exit 2
By default, it will run N jobs in parallel, where N is the number of CPU cores you have. If you just want the jobs to run sequentially, use:
parallel -j 1 ...
Obviously you could pipe the output through grep -c "This job failed" to count the failures.
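For example, a quick sketch along those lines (the "This job failed" messages go to stderr, so merge it into stdout before grepping):
parallel --halt soon,fail=100% ::: 'exit 0' 'exit 1' 'exit 2' 2>&1 |
  grep -c "This job failed"
# prints 2 for this example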
Assuming you have a file with the commands:
cmd1
cmd2
cmd3
Then parallel's exit status ($a below) will be the number of failed jobs, as long as at most 100 jobs failed; subtracting it from the total number of commands gives the number that terminated correctly:
cat file | parallel
a=$?; echo $((`wc -l <file`-$a))
To get exactly which jobs failed use --joblog.
cat file | parallel --joblog my.log
# Find lines where column 7 (Exitval) is non-zero
grep -v -P '(.*\t){6}0\t.*\t' my.log
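Since the joblog is tab-separated and column 7 is Exitval, the same check can also be written with awk; a small equivalent sketch:
# print joblog lines (skipping the header) whose Exitval is non-zero
awk -F'\t' 'NR > 1 && $7 != 0' my.log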
It's easy.
First, run your jobs in the background and remember their PIDs.
Then, for each child, execute wait $pid and check wait's exit status, which is equal to the exit status of the child whose PID you passed.
If that exit status is zero, the child terminated successfully.
#!/bin/bash
exit 0 &
childs+=($!)
exit 1 &
childs+=($!)
exit 2 &
childs+=($!)
echo 1 &
childs+=($!)
successes=0
for i in "${childs[@]}"; do
wait $i
if (($? == 0)); then
((successes++))
fi
done
# will print that 2 processes (exit 0 and echo 1) terminated successfully
printf "$successes processes terminated correctly and didn't core dump\n"
Related
I'm trying to make GNU parallel abort all processing when one of the subprocesses fails. The option --halt now,fail=1 does this correctly: if a subprocess exits with a non-zero exit code or if it is sigkilled (by the OOM killer), parallel will stop all jobs.
If parallel detects a non-zero exit code, it will itself exit with the same non-zero exit code. This lets the parent script detect that something went wrong.
The problem is that in the case where a subprocess is sigkilled, the return code from parallel itself is zero and the parent script has no way to tell that the process failed.
Here's a demonstration of the problem (a single job to keep it simple, but the same issue exists for multiple jobs taking different amounts of time, etc.).
# Create script that will SIGKILL itself.
cat << 'EOF' > selfkill.sh
echo Process $BASHPID will now terminate itself
kill -9 $BASHPID
EOF
# And another that will exit with code 1.
cat << 'EOF' > exit_one.sh
echo Process $BASHPID will now exit with code 1
exit 1
EOF
echo Running exit_one.sh via parallel with default error handling
parallel --joblog job.log bash exit_one.sh ::: 1
echo Exit code is $?
cat job.log
echo
echo Running selfkill.sh via parallel with default error handling
parallel --joblog job.log bash selfkill.sh ::: 1
echo Exit code is $?
cat job.log
echo
echo Running exit_one.sh via parallel with abort on error
parallel --joblog job.log --halt now,fail=1 bash exit_one.sh ::: 1
echo Exit code is $?
cat job.log
echo
echo Running selfkill.sh via parallel with abort on error
parallel --joblog job.log --halt now,fail=1 bash selfkill.sh ::: 1
echo Exit code is $?
cat job.log
Output:
Running exit_one.sh via parallel with default error handling
Process 1521789 will now exit with code 1
Exit code is 1
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665764801.403 0.002 0 42 1 0 bash exit_one.sh 1
Running selfkill.sh via parallel with default error handling
Process 1521804 will now terminate itself
Exit code is 1
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665764801.573 0.003 0 42 0 9 bash selfkill.sh 1
Running exit_one.sh via parallel with abort on error
Process 1521819 will now exit with code 1
parallel: This job failed:
bash exit_one.sh 1
Exit code is 1
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665764801.730 0.003 0 42 1 0 bash exit_one.sh 1
Running selfkill.sh via parallel with abort on error
Process 1521834 will now terminate itself
parallel: This job failed:
bash selfkill.sh 1
Exit code is 0
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665764801.880 0.001 0 42 0 9 bash selfkill.sh 1
All good except the last case where I need a non-zero exit code since there was a problem.
The test job appears to exit with code 137 when it is sigkilled, so I would expect it to be detected. However parallel is somehow able to see that it was killed (good) but then exits with zero (not good). Maybe the --halt option causes parallel to consider itself successful because it's successfully halted all the other processes? Is there a workaround for this behaviour?
$ bash selfkill.sh
Process 1485899 will now terminate itself
Killed
$ echo $?
137
This is on an Ubuntu 20.04 with GNU parallel 20161222 and bash 5.0.17.
Further context in case this is an XY problem and there's a better approach than using GNU parallel. Initially we weren't using GNU parallel. We had a number of background jobs and pipelines running in parallel via extensive use of &. But we couldn't find a way to detect when jobs had been sigkilled and the parent script would carry on much the same as it does in the example above.
<sigh> I should have tried a more recent version before asking here.
Ubuntu 20.04 is stuck with an old version but I was able to experiment inside a Docker container with docker run --rm -it ubuntu:22.04 where I could install GNU parallel 20210822.
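The setup was something along these lines (approximate, not a transcript):
docker run --rm -it ubuntu:22.04
apt-get update && apt-get install -y parallel
parallel --version        # reports 20210822 on ubuntu:22.04
Re-running the same test script there gives: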
Running exit_one.sh via parallel with default error handling
Process 1179 will now exit with code 1
Exit code is 1
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665765327.768 0.100 0 39 1 0 bash exit_one.sh 1
Running selfkill.sh via parallel with default error handling
Process 1186 will now terminate itself
Exit code is 1
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665765328.292 0.092 0 39 0 9 bash selfkill.sh 1
Running exit_one.sh via parallel with abort on error
Process 1193 will now exit with code 1
parallel: This job failed:
bash exit_one.sh 1
Exit code is 1
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665765328.818 0.111 0 39 1 0 bash exit_one.sh 1
Running selfkill.sh via parallel with abort on error
Process 1200 will now terminate itself
parallel: This job failed:
bash selfkill.sh 1
Exit code is 137
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1665765329.321 0.085 0 39 0 9 bash selfkill.sh 1
The last case, which was problematic before, now exits with 137, which is great and consistent with what bash sees.
I have a bash script where I would like to run two processes in parallel, and have the script fail if either of the processes return non-zero. A minimal example of my initial attempt is:
#!/bin/bash
set -e
(sleep 3 ; true ) &
(sleep 4 ; false ) &
wait %1 && wait %2
echo "Still here, exit code: $?"
As expected this doesn't print the message because wait %1 && wait %2 fails and the script exits due to the set -e. However, if the waits are reversed such that the first one has the non-zero status (wait %2 && wait %1), the message is printed:
$ bash wait_test.sh
Still here, exit code: 1
Putting each wait on its own line works as I want and exits the script if either of the processes fails, but the fact that it doesn't work with && makes me suspect that I'm misunderstanding something here.
Can anyone explain what's going on?
You can achieve what you want quite elegantly with GNU Parallel and its "fail handling".
In general, it will run as many jobs in parallel as you have CPU cores.
In your case, try this, which says "exit with failed status if one or more jobs failed":
#!/bin/bash
cat <<EOF | parallel --halt soon,fail=1
echo Job 1; exit 0
echo Job 2; exit 1
EOF
echo GNU Parallel exit status: $?
Sample Output
Job 1
Job 2
parallel: This job failed:
echo Job 2; exit 1
GNU Parallel exit status: 1
Now run it such that no job fails:
#!/bin/bash
cat <<EOF | parallel --halt soon,fail=1
echo Job 1; exit 0
echo Job 2; exit 0
EOF
echo GNU Parallel exit status: $?
Sample Output
Job 1
Job 2
GNU Parallel exit status: 0
If you dislike the heredoc syntax, you can put the list of jobs in a file called jobs.txt like this:
echo Job 1; exit 0
echo Job 2; exit 0
Then run with:
parallel --halt soon,fail=1 < jobs.txt
From the bash manual's section on set:
-e Exit immediately if a pipeline (which may consist of a single simple command), a list, or a compound command (see SHELL GRAMMAR above), exits with a non-zero status. The shell does not exit if the command that fails is part of the command list immediately following a while or until keyword, part of the test following the if or elif reserved words, part of any command executed in a && or || list except the command following the final && or ||, any command in a pipeline but the last, or if the command's return value is being inverted with !. If a compound command other than a subshell returns a non-zero status because a command failed while -e was being ignored, the shell does not exit. A trap on ERR, if set, is executed before the shell exits. This option applies to the shell environment and each subshell environment separately (see COMMAND EXECUTION ENVIRONMENT above), and may cause subshells to exit before executing all the commands in the subshell.
tl;dr
In a bash script under set -e, for a command list like
command1 && command2
a failure of command1 does not make the script exit: the shell only exits when the failing command is the one following the final && or ||. So in wait %1 && wait %2 the failing wait %2 is the final command and the script exits, while in wait %2 && wait %1 the failing wait %2 is not the final command, the && merely short-circuits, and the script keeps running.
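A tiny illustration of that rule (a sketch, separate from the script above):
#!/bin/bash
set -e
false && true        # false fails, but it is not the command after the final &&, so no exit
echo "still running" # this prints
true && false        # here the failing false IS the command after the final &&
echo "never reached" # the script has already exited with status 1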
In a CI setting, I'd like to run multiple jobs in the background, and use set -e to exit on the first error.
This requires using wait -n instead of wait, but for increasing throughput I'd then want to move the for i in {1..20}; do wait -n; done to the end of the script.
Unfortunately, this means that it is hard to track the errors.
Rather, what I would want is to do the equivalent of a non-blocking wait -n often, and exit as soon as possible.
Is this possible or do I have to write my bash scripts as a Makefile?
Alternative Approach: Emulate set -e for background jobs
Instead of checking the jobs all the time it could be easier and more efficient to exit the script directly when a job fails. To this end, append ... || kill $$ to every job you start:
# before
myCommand &
myProgram arg1 arg2 &
# after
myCommand || kill $$ &
myProgram arg1 arg2 || kill $$ &
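A minimal self-contained sketch of the whole pattern (the jobs here are just placeholders):
#!/bin/bash
set -e
{ sleep 2; }        || kill $$ &   # succeeds, never triggers the kill
{ sleep 1; false; } || kill $$ &   # fails after one second and terminates the main script
wait
echo "only reached if every background job succeeded"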
Non-Blocking wait -n
If you really have to, you can write your own non-blocking wait -n with a little trick:
nextJobExitCode() {
    # Start a short-lived helper job so that wait -n returns within 0.1 seconds
    # even when none of the real jobs has finished yet.
    sleep 0.1 &
    helperPid=$!
    wait -n
    exitCode="$?"
    kill "$helperPid" 2>/dev/null || true    # clean up the helper if it is still running
    return "$exitCode"
}
The function nextJobExitCode waits at most 0.1 seconds for your jobs. If none of your jobs has already finished or finishes within those 0.1 seconds, nextJobExitCode returns 0.
Example usage
set -e
sleep 1 & # job 1
(sleep 3; false) & # job 2
nextJobExitCode # won't exit. No jobs finished yet
sleep 2
nextJobExitCode # won't exit. Job 1 finished with 0
sleep 2
nextJobExitCode # will exit! Job 2 finished with 1
I have a shell script that parses a flatfile and for each line in it, executes a hive script in parallel.
xargs -P 5 -d $'\n' -n 1 bash -c '
IFS='\t' read -r arg1 arg2 arg3 <<<"$1"
eval "hive -hiveconf tableName=$arg1 -f ../hive/LoadTables.hql" 2> ../path/LogFile-$arg1
' _ < ../path/TableNames.txt
Question is how can I capture the exit codes from each parallel process, so even if one child process fails, exit the script at the end with the error code.
Unfortunately I can't use gnu parallel.
I suppose you are looking for something fancier, but a simple solution is to store any errors in a temp file and check it afterwards:
FilewithErrors=/tmp/errors.txt
export FilewithErrors          # export it so the bash -c children below can see it
FinalError=0
xargs -P 5 -d $'\n' -n 1 bash -c '
IFS='\t' read -r arg1 arg2 arg3 <<<"$1"
eval "hive -hiveconf tableName=$arg1 -f ../hive/LoadTables.hql || echo $arg1 >> $FilewithErrors" 2> ../path/LogFile-$arg1
' _ < ../path/TableNames.txt
if [ -e $FilewithErrors ]; then FinalError=1; fi
rm -f $FilewithErrors
return $FinalError
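A related aside: with GNU xargs at least, xargs itself exits with status 123 if any invocation of the command exits with a status of 1-125. So if the inner bash -c script is left to exit with hive's own status (i.e. without the || echo above), the calling script can simply test that; a sketch:
xargs -P 5 -d $'\n' -n 1 bash -c '...same script as above, without the || echo...' _ < ../path/TableNames.txt
status=$?
if [ "$status" -ne 0 ]; then
    echo "at least one job failed (xargs exit status $status)" >&2
fi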
As per the comments: Use GNU Parallel installed as a personal or minimal installation as described in http://git.savannah.gnu.org/cgit/parallel.git/tree/README
From man parallel
EXIT STATUS
    Exit status depends on --halt-on-error if one of these are used: success=X, success=Y%, fail=Y%.

    0       All jobs ran without error. If success=X is used: X jobs ran without error. If success=Y% is used: Y% of the jobs ran without error.
    1-100   Some of the jobs failed. The exit status gives the number of failed jobs. If Y% is used the exit status is the percentage of jobs that failed.
    101     More than 100 jobs failed.
    255     Other error.
If you need the exact error code (and not just whether the job failed or not) use: --joblog mylog.
You can probably do something like:
cat ../path/TableNames.txt |
parallel --colsep '\t' --halt now,fail=1 hive -hiveconf tableName={1} -f ../hive/LoadTables.hql '2>' ../path/LogFile-{1}
fail=1 will stop spawning new jobs if one job fails, and exit with the exit code from the job.
now will kill the remaining jobs. If you want the remaining jobs to exit of "natural causes", use soon instead.
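A short sketch of acting on that exit status in the calling script (same command as above):
cat ../path/TableNames.txt |
    parallel --colsep '\t' --halt now,fail=1 hive -hiveconf tableName={1} -f ../hive/LoadTables.hql '2>' ../path/LogFile-{1}
status=$?
if [ "$status" -ne 0 ]; then
    echo "a hive job failed (exit status $status)" >&2
    exit "$status"
fi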
I'm trying to run 3 commands in parallel in bash shell:
$ (first command) & (second command) & (third command) & wait
The problem with this is that if the first command fails, for example, the exit code is 0 (I guess because wait succeeds).
The desired behavior is that if one of the commands fails, the exit code will be non-zero (and ideally, the other running commands will be stopped).
How could I achieve this?
Please note that I want to run the commands in parallel!
The best I can think of is:
first & p1=$!
second & p2=$!
...
wait $p1 && wait $p2 && ..
or
wait $p1 || ( kill $p2 $p3 && exit 1 )
...
However, this still checks the processes in a fixed order, so if the third one fails immediately you won't notice it until the first and second have finished.
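If your bash has wait -n (4.3 or newer), a rough sketch that avoids the fixed ordering is to wait for whichever process finishes first and bail out on the first failure:
first & second & third &
for _ in 1 2 3; do
    wait -n || { rc=$?; kill $(jobs -p) 2>/dev/null; exit "$rc"; }
done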
This might work for you:
parallel -j3 --halt 2 <list_of_commands.txt
This will run 3 commands in parallel.
If any running job fails it will kill the remaining running jobs and then stop, returning the exit code of the failing job.
You should use && instead of &, e.g.:
first command && second command && third command && wait
However, this will NOT run your commands in parallel, as each subsequent command's execution will depend on the previous command exiting with code 0.
The shell function below will wait for all PIDs passed as arguments to finish, returning 0 if all PIDs executed without error.
The first PID that exits with an error will cause the PIDs that come after it to be killed, and the exit code that caused the error will be returned by the function.
wait_and_fail_on_first() {
    local piderr=0
    while test $# -gt 0; do
        dpid="$1"; shift     # dpid is intentionally not local, so the caller can see which PID failed
        wait "$dpid" || { piderr=$?; kill "$@" 2>/dev/null; return "$piderr" ;}
    done
}
Here's how to use it:
(first command) & pid1=$!
(second command) & pid2=$!
(third command) & pid3=$!
wait_and_fail_on_first $pid1 $pid2 $pid3 || {
echo "PID $dpid failed with code $?"
echo "Other PIDs were killed"
}