Using wait process-id on a bash if-condition returning error code 1 for successful process termination - bash

I know a little about bash return codes for success/failure conditions, but I was experimenting with wait on background processes in an if condition across a couple of scripts, and I was surprised by the behavior of the return codes (0 for success, non-zero for failure).
My scripts:
$cat foo.sh
#!/bin/bash
sleep 5
$cat bar.sh
#!/bin/bash
sleep 10
$cat experiment.sh
./foo.sh &
pid1=$!
./bar.sh &
pid2=$!
if wait $pid1 && wait $pid2
then
    echo "Am getting screwed here!"
else
    echo "Am supposed to be screwed here!"
fi
Running the script as it is, I get the output Am getting screwed here! instead of Am supposed to be screwed here!:
$./experiment.sh
Am getting screwed here!
Now I modify the scripts to force non-zero exit codes using exit in both foo.sh and bar.sh:
$cat foo.sh
#!/bin/bash
sleep 5
exit 2
$cat bar.sh
#!/bin/bash
sleep 10
exit 17
And I am surprised to see the output:
$./experiment.sh
Am supposed to be screwed here!
Apologies for the detailed post, but any help is appreciated.
The man page for reference: http://ss64.com/bash/wait.html

That's correct behavior. The exit status of wait (when called with a single process ID) is the exit status of the process being waited on. Since at least one of them has a non-zero exit status, the && list fails and the else branch is taken.
The rationale is that there is one way (0) for a command to succeed but many ways (any non-zero integer) for it to fail. Don't confuse bash's use of exit statuses with the standard Boolean interpretation of 0 as false and nonzero as true. The shell if statement checks if its command succeeds.
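For illustration, here is a minimal sketch (mine, not from the post) that mirrors the modified foo.sh/bar.sh and shows the statuses wait reports:
#!/bin/bash
( sleep 1; exit 2 ) &    # stand-in for foo.sh
pid1=$!
( sleep 2; exit 17 ) &   # stand-in for bar.sh
pid2=$!
wait "$pid1"; echo "first child exited with $?"    # prints 2
wait "$pid2"; echo "second child exited with $?"   # prints 17
Because the first wait returns 2 (non-zero), the && list fails and the else branch runs, exactly as in the modified experiment.sh.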

Related

Why can't I exit from an exit trap when I'm inside of a function in ZSH, unless I'm in a loop?

I'm really trying to understand the difference in how ZSH and Bash are handling signal traps, but I'm having a very hard time grasping why ZSH is doing what it's doing.
In short, I'm not able to exit a script with exit in ZSH from within a trap if the execution point is within a function, unless it's also within a loop.
Here is an example of how exit in a trap action behaves in the global / file level scope.
#!/bin/zsh
trap 'echo "Trap SIGINT" ; exit 130' SIGINT
sleep 1
echo "1"
sleep 1
echo "2"
sleep 1
echo "3"
If I call the script, I can send an INT signal by pressing Ctrl+C at any time to echo "Trap SIGINT" and exit the script immediately.
If I hit Ctrl+C after I see the first 1, the output looks like this:
$ ./foobar
1
^CTrap SIGINT
But if I wrap the code in a function, then the trap doesn't want to stop script execution until the function finishes. Using exit 130 from within the trap action just continues the code from the execution point within the function.
Here is an example of how using trap behaves in the function level scope.
#!/bin/zsh
trap 'echo "Trap SIGINT" ; exit 130' SIGINT
foobar() {
    sleep 1
    echo "1"
    sleep 1
    echo "2"
    sleep 1
    echo "3"
}
foobar
echo "Finished"
If I call the script, the only thing that an INT signal does is end the sleep command early. The script will just keep on going from the same execution point after that.
If I hit Ctrl+C repeatedly, the output looks like this:
$ ./foobar
^CTrap SIGINT
1
^CTrap SIGINT
2
^CTrap SIGINT
3
It doesn't echo the "Finished" at the end, so it is exiting when the function is finished, but I can't seem to exit before it's finished.
It doesn't make a difference if I set the trap in the global / file scope or from within the function.
If I change exit 130 to return 130, then it will jump out of that function early but continue script execution. This is expected behavior from what I could read in the ZSH documentation.
If I wrap the code inside of a for or while loop as shown in the code below, the code then has no problem breaking out of the loop.
#!/bin/zsh
trap 'echo "Trap SIGINT" ; exit 130' SIGINT
foobar() {
    for i in 1; do
        sleep 1
        echo "1"
        sleep 1
        echo "2"
        sleep 1
        echo "3"
    done
    sleep 1
    echo "Outside of loop"
}
foobar
echo "Finished"
Even if I have the loop in the global / file scope and call foobar from within the loop, it still has no problem exiting within the trap action.
The one thing that does work correctly is defining a TRAPINT function instead of using the trap built-in, and returning a non-zero status from that function. However, exiting from the TRAPINT function works the same way it does with the trap built-in.
I've tried to find documentation on why it acts like this, but I couldn't find anything.
So what's actually happening here? Why is ZSH not letting me exit from the trap action when the execution point is inside a function?
One way to make this work as expected is setting the ERR_EXIT option.
From the documentation:
If a command has a non-zero exit status, execute the ZERR trap, if set, and exit. This is disabled while running initialization scripts.
There's also ERR_RETURN:
If a command has a non-zero exit status, return immediately from the enclosing function. The logic is similar to that for ERR_EXIT, except that an implicit return statement is executed instead of an exit. This will trigger an exit at the outermost level of a non-interactive script.
Both options have some caveats and notes; refer to the documentation.
Adding setopt localoptions err_exit as the first line of the foobar function in your script (you probably don't want to do this globally) causes:
$ ./foobar
1
^CTrap SIGINT
$
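For concreteness, here is a sketch of what that modification looks like (my reconstruction, not the answerer's exact code):
#!/bin/zsh
trap 'echo "Trap SIGINT" ; exit 130' SIGINT
foobar() {
    setopt localoptions err_exit  # scoped to this function; you probably don't want it globally
    sleep 1
    echo "1"
    sleep 1
    echo "2"
    sleep 1
    echo "3"
}
foobar
echo "Finished"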
Now, the interesting bit. In your demonstration script, if you change your exit value from 130 to some other number, and the echo lines to echo "1 - $?" etc., you get:
$ ./foobar
1 - 0
2 - 0
^CTrap SIGINT
3 - 130
The sleep is still exiting with 130, the normal value for a process killed by SIGINT. What happened to your exit in the trap and its value? Not a clue (I'll update the answer if I figure it out).
I'd just stick with the TRAPNAL functions (e.g. TRAPINT) when writing zsh scripts that care about signals.
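For reference, here is a sketch (mine, not part of the original answer) along the lines of the TRAPINT idiom in the zsh manual: returning 128 plus the signal number from the trap function makes the shell behave as if it had been interrupted, so the script stops even inside a function:
#!/bin/zsh
TRAPINT() {
    echo "Trap SIGINT"
    # A non-zero return status makes zsh treat the script as interrupted,
    # so it stops even when the signal arrives inside a function.
    return $(( 128 + $1 ))
}
foobar() {
    sleep 1
    echo "1"
    sleep 1
    echo "2"
}
foobar
echo "Finished"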

Bash: why wait returns prematurely with code 145

This problem is very strange and I cannot find any documentation about this online. In the following code snippet I am merely trying to run a bunch of sub-processes in parallel, printing something when they exit and collect/print their exit code at the end. I find that without catching SIGCHLD things work as I would expect however, things break when I catch the signal. Here is the code:
#!/bin/bash
#enabling job control
set -m
cmd_array=( "$#" ) #array of commands to run in parallel
cmd_count=$# #number of commands to run
cmd_idx=0; #current index of command
cmd_pids=() #array of child proc pids
trap 'echo "Child job existed"' SIGCHLD #setting up signal handler on SIGCHLD
#running jobs in parallel
while [ $cmd_idx -lt $cmd_count ]; do
cmd=${cmd_array[$cmd_idx]} #retreiving the job command as a string
eval "$cmd" &
cmd_pids[$cmd_idx]=$! #keeping track of the job pid
echo "Job #$cmd_idx launched '$cmd']"
(( cmd_idx++ ))
done
#all jobs have been launched, collecting exit codes
idx=0
for pid in "${cmd_pids[#]}"; do
wait $pid
child_exit_code=$?
if [ $child_exit_code -ne 0 ]; then
echo "ERROR: Job #$idx failed with return code $child_exit_code. [job_command: '${cmd_array[$idx]}']"
fi
(( idx++ ))
done
You can tell something is wrong when you run the script with the following command:
./parallel_script.sh "sleep 20; echo done_20" "sleep 3; echo done_3"
The interesting thing here is that as soon as the signal handler is called (when sleep 3 is done), the wait (which is waiting on sleep 20) is interrupted right away with return code 145. I can tell that sleep 20 is still running even after the script is done.
I can't find any documentation about such a return code from wait. Can anyone shed some light as to what is going on here?
(By the way, if I add a while loop around the wait and keep waiting while the return code is 145, I actually get the result I expect.)
Thanks to @muru, I was able to reproduce the "problem" using much less code, which you can see below:
#!/bin/bash
set -m
trap "echo child_exit" SIGCHLD
function test() {
    sleep $1
    echo "'sleep $1' just returned now"
}
echo sleeping for 6 seconds in the background
test 6 &
pid=$!
echo sleeping for 2 second in the background
test 2 &
echo waiting on the 6 second sleep
wait $pid
echo "wait return code: $?"
If you run this you will get the following output:
linux:~$ sh test2.sh
sleeping for 6 seconds in the background
sleeping for 2 second in the background
waiting on the 6 second sleep
'sleep 2' just returned now
child_exit
wait return code: 145
linux:~$ 'sleep 6' just returned now
Explanation:
As @muru pointed out, "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (cf. the Bash manual on Exit Status).
Now what misled me here is the "fatal" signal. I was looking for a command to fail somewhere when nothing did.
Digging a little deeper in Bash manual on Signals: "When Bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed."
So there you have it, what happens in the script above is the following:
sleep 6 starts in the background
sleep 2 starts in the background
wait starts waiting on sleep 6
sleep 2 terminates and the SIGCHLD trap is fired, interrupting wait, which returns 128 + SIGCHLD = 145
my script exits since it does not wait anymore
the background sleep 6 terminates, hence the "'sleep 6' just returned now" after the script has already exited
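Here is a sketch of the re-wait workaround mentioned in the question (my code, assuming the pid of the long-running child is in pid): keep calling wait while it reports 145, i.e. while it was merely interrupted by the trapped SIGCHLD rather than having reaped the child:
wait "$pid"
child_exit_code=$?
# 145 = 128 + SIGCHLD (signal 17 on Linux): wait was interrupted by the trap,
# not finished, so wait again. (Assumes the child itself never exits with 145.)
while [ "$child_exit_code" -eq 145 ]; do
    wait "$pid"
    child_exit_code=$?
done
echo "child exited with $child_exit_code"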

sh -e: collecting a command's exit status from the "else" branch of an if

We are writing shell scripts with set -e as policy, which means a script will exit if any unhandled non-zero exit status appears.
#!/bin/sh -e
if some_command; then
    experience_happyness
else
    print error status of some_command to log or standard error
    experience_sadness
    exit 1
fi
The $? expression evaluates to 0 at the beginning of the else branch. If I don't run some_command inside if, then an eventual error will terminate the shell script immediately.
How can I know the exit status of a program when set -e is effective without terminating the script?
I'm interested in bash specific solutions too if pure sh solutions are not available.
EDIT: My bad. as #"that other guy" answered I was mistaken when I told that "$?" evaluates to 0 at the beginning of the else branch. I tried it, but I made some mistake when tried it. Sorry.
I think we may keep this question because of the pro quality answer. Should we?
The $? expression evaluates to 0 at the beginning of the else branch.
No it doesn't.
#!/bin/sh -e
some_command() { return 42; }
if some_command; then
    echo "Worked"
else
    echo "Command failed with $?"
    exit 1
fi
will print Command failed with 42.
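Another standard pattern under set -e (not from the answer above, just a common idiom) is to capture the status on the same command line, so the failure never triggers set -e:
#!/bin/sh -e
some_command() { return 42; }
status=0
# A failing command on the left of || does not abort the script under set -e
some_command || status=$?
if [ "$status" -ne 0 ]; then
    echo "Command failed with $status" >&2
    exit 1
fi
echo "Worked"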

How can I get both the process id and the exit code from a bash script?

I need a bash script that does the following:
Starts a background process with all output directed to a file
Writes the process's exit code to a file
Returns the process's pid (right away, not when process exits).
The script must exit
I can get the pid but not the exit code:
$ executable >>$log 2>&1 &
pid=`jobs -p`
Or, I can capture the exit code but not the pid:
$ executable >>$log;
# blocked on previous line until process exits
echo $? >>$log;
How can I do all of these at the same time?
The pid is in $!, no need to run jobs. And the return status is returned by wait:
$executable >> $log 2>&1 &
pid=$!
wait $!
echo $? # return status of $executable
EDIT 1
If I understand the additional requirement as stated in a comment, and you want the script to return immediately (without waiting for the command to finish), then it will not be possible to have the initial script write the exit status of the command. But it is easy enough to have an intermediary write the exit status as soon as the child finishes. Something like:
sh -c "$executable"' & echo pid=$! > pidfile; wait $!; echo $? > exit-status' &
should work.
EDIT 2
As pointed out in the comments, that solution has a race condition: the main script terminates before the pidfile is written. The OP solves this by doing a polling sleep loop, which is an abomination and I fear I will have trouble sleeping at night knowing that I may have motivated such a travesty. IMO, the correct thing to do is to wait until the child is done. Since that is unacceptable, here is a solution that blocks on a read until the pid file exists instead of doing the looping sleep:
{ sh -c "$executable > $log 2>&1 &"'
echo $! > pidfile
echo # Alert parent that the pidfile has been written
wait $!
echo $? > exit-status
' & } | read
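A hypothetical follow-up on the parent side (the pidfile and exit-status file names come from the snippet above; the rest is my sketch): once the read unblocks, pidfile is guaranteed to exist, so the script can report the pid and exit immediately, while exit-status appears whenever the command finishes:
pid=$(cat pidfile)
echo "started pid $pid"   # the parent script can exit now
# ...later, from another script or cron job...
if [ -f exit-status ]; then
    echo "command exited with status $(cat exit-status)"
else
    echo "command still running"
fi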

How do I check the exit code of a command executed by flock?

Greetings all. I'm setting up a cron job to execute a bash script, and I'm worried that the next one may start before the previous one ends. A little googling reveals that a popular way to address this is the flock command, used in the following manner:
flock -n lockfile myscript.sh
if [ $? -eq 1 ]; then
echo "Previous script is still running! Can't execute!"
fi
This works great. However, what do I do if I want to check the exit code of myscript.sh? Whatever exit code it returns will be overwritten by flock's, so I have no way of knowing if it executed successfully or not.
It looks like you can use the alternate form of flock, flock <fd>, where <fd> is a file descriptor. If you put this into a subshell, and redirect that file descriptor to your lock file, then flock will wait until it can write to that file (or error out if it can't open it immediately and you've passed -n). You can then do everything in your subshell, including testing the return value of scripts you run:
(
    if flock -n 200
    then
        myscript.sh
        echo $?
    fi
) 200>lockfile
According to the flock man page, flock has a -E or --exit-conflict-code flag you can use to set what the exit code of flock should be in the case a conflict occurs:
-E, --conflict-exit-code number
The exit status used when the -n option is in use, and the conflicting lock exists, or the -w option is in use, and the timeout is reached. The default value is 1. The number has to be in the range of 0 to 255.
The man page also states:
EXIT STATUS
The command uses sysexits.h exit status values for everything, except when using either of the options -n or -w which report a failure to acquire the lock with an exit status given by the -E option, or 1 by default. The exit status given by -E has to be in the range of 0 to 255.
When using the command variant, and executing the child worked, then the exit status is that of the child command.
So, in the case of the -n or -w flags while using the "command" variant, you can see both exit statuses.
Example:
$ flock --exclusive /tmp/flock.lock bash -c 'exit 42'; echo $?
42
$ flock --exclusive /tmp/flock.lock flock --exclusive --nonblock --conflict-exit-code 100 /tmp/flock.lock bash -c 'exit 42'; echo $?
100
In the first example, we see that we get back the exit status of the process we're running with flock. In the second example, we are creating contention for the lock. In that case, flock itself returns the status code we tell it (100). If you do not specify a value with the --conflict-exit-code flag, it will return 1 instead. However, I prefer setting less common values to prevent confusion with other processes/scripts which might also return a value of 1.
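For the original cron scenario, here is a sketch putting this together (the lock path /tmp/myscript.lock and the value 100 are arbitrary choices of mine, not from the question):
#!/bin/bash
# 100 only ever means "lock conflict"; any other status came from myscript.sh itself.
flock --nonblock --conflict-exit-code 100 /tmp/myscript.lock ./myscript.sh
status=$?
if [ "$status" -eq 100 ]; then
    echo "Previous script is still running! Can't execute!"
else
    echo "myscript.sh exited with status $status"
fi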
#!/bin/bash
if ! pgrep myscript.sh; then
    flock -n lockfile myscript.sh
fi
If I understand you right, you want to make sure 'myscript.sh' is not running before cron attempts to run your command again. Assuming that's right, we check to see if pgrep failed to find myscript.sh in the processes list and if so we run the flock command again.
Perhaps something like this would work for you.
#!/bin/bash
RETVAL=0
lockfailed()
{
echo "cannot flock"
exit 1
}
(
flock -w 2 42 || lockfailed
false
RETVAL=$?
echo "original retval $RETVAL"
exit $RETVAL
) 42>|/tmp/flocker
RETVAL=$?
echo "returned $RETVAL"
exit $RETVAL
