I'm creating a startup/shutdown script for WebSEAL. It's written to allow several instances to be stopped/started in parallel. The only problem is verifying that it completed without issue. With other infrastructures, I could simply grep for a particular keyword in the output (which I redirect to a log file), but WebSEAL does not give any success/error message.
Instead, I thought to use $? to store the exit status in a dynamically named variable that will be checked after the startups have occurred (during log consolidation).
Here is the code that starts/stops and then creates the variable
${PDCOMMAND} >> ${LOGDIR}/${APP}.txt 2>&1 &
let return_${APP}=$?
PDCOMMAND is a valid start/stop command, e.g. pdweb start my_instance
APP is the name of the instance, e.g. my_instance
The goal is that return_${APP} (return_my_instance) will have a value of 0 (success) or 1 (failure) when I check it at a later point in the script.
Are there problems using $? for a command that may not have completed at the time $? was read, or is it only set upon completion of that command? So let's say I have 3 instances:
instance_1, instance_2, instance_3
if I ran the following:
pdweb start instance_1 &
let return_instance_1=$?
pdweb start instance_2 &
let return_instance_2=$?
pdweb start instance_3 &
let return_instance_3=$?
would return_instance_[1|2|3] have the correct values if they started in unequal amounts of time? If instance_3 starts before instance_1, for example, will it still output the result of instance_3 to return_instance_3?
Basically, I'm trying to figure out how the command line treats an asynchronous request in regards to the exit status.
No; the exit status code is only available when the command finishes. (That's why it's called "exit status".) If you successfully spawned a service and it is up and running, it does not yet have an exit status.
If I am able to correctly guess what you are trying to accomplish, you could reap the values of $! after starting each instance, wait for a "reasonable" time (a few seconds?) and check that the processes you started are still running. If they have terminated, there was a problem.
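A minimal sketch of that idea follows. It assumes `sleep 3` and `false` as stand-ins for the real `pdweb start ...` commands, and the one-second grace period and the `check` helper are hypothetical choices for illustration, not part of WebSEAL.

```shell
# Stand-ins: `sleep 3` plays a service that stays up, `false` one that fails at once.
sleep 3 &
pid_1=$!                 # $! is the PID of the most recent background job
false &
pid_2=$!

sleep 1                  # the "reasonable" grace period

check() {
    # Sets $result to "running", or to the exit status if the job has ended.
    if kill -0 "$1" 2>/dev/null; then
        result=running
    else
        wait "$1"        # the job has finished, so this returns at once
        result=$?
    fi
}

check "$pid_1"; result_1=$result
check "$pid_2"; result_2=$result
echo "instance_1: $result_1"
echo "instance_2: $result_2"
```

The point is that $? read right after `command &` only says the job was launched, while `wait "$pid"` is what returns the job's own exit status once it has ended.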


I have this code:
for i in $(seq 1 999); do
    sleep 1 &
    pids+=( "$!" )
done
for pid in "${pids[@]}"; do
    wait "$pid"
done
I expect the following behavior:
spin through the first loop
wait about a second on the first pid
spin through the second loop
Instead, I get this error:
./ line 8: wait: pid 24752 is not a child of this shell
(repeated 171 times with different pids)
If I run the script with a shorter loop (50 instead of 999), then I get no errors.
What's going on?
Edit: I am using GNU bash 4.4.23 on Windows.
POSIX says:
The implementation need not retain more than the {CHILD_MAX} most recent entries in its list of known process IDs in the current shell execution environment.
{CHILD_MAX} here refers to the maximum number of simultaneous processes allowed per user. You can get the value of this limit using the getconf utility:
$ getconf CHILD_MAX
Bash stores the statuses of at most twice that many exited background processes in a circular buffer, and says "not a child of this shell" when you call wait on the PID of an old one that has been overwritten. You can see how it's implemented here.
The way you might reasonably expect this to work, as it would if you wrote a similar program in most other languages, is:
sleep is executed in the background via a fork+exec.
At some point, sleep exits leaving behind a zombie.
That zombie remains in place, holding its PID, until its parent calls wait to retrieve its exit code.
However, shells such as bash actually do this a little differently. They proactively reap their zombie children and store their exit codes in memory so that they can deallocate the system resources those processes were using. Then when you wait the shell just hands you whatever value is stored in memory, but the zombie could be long gone by then.
Now, because all of these exit statuses are being stored in memory, there is a practical limit to how many background processes can exit without you calling wait before you've filled up all the memory you have available for this in the shell. I expect that you're hitting this limit somewhere in the several hundreds of processes in your environment, while other users manage to make it into the several thousands in theirs. Regardless, the outcome is the same - eventually there's nowhere to store information about your children and so that information is lost.
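One way to avoid relying on the shell remembering hundreds of exit statuses is to wait in small batches. The sketch below is only an illustration: `true` stands in for the real work, and the batch size of 20 and total of 100 are arbitrary assumptions.

```shell
total=100        # number of jobs to run (arbitrary for this demo)
batch=20         # collect statuses after this many launches
failed=0
pids=""
i=1
while [ "$i" -le "$total" ]; do
    true &                       # stand-in for the real work
    pids="$pids $!"
    if [ $((i % batch)) -eq 0 ] || [ "$i" -eq "$total" ]; then
        for pid in $pids; do     # wait while the shell still remembers these PIDs
            wait "$pid" || failed=$((failed + 1))
        done
        pids=""
    fi
    i=$((i + 1))
done
echo "failed jobs: $failed"
```

This trades some parallelism for bounded bookkeeping: at most `batch` exited children are ever pending a `wait`.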
I can reproduce this on Arch Linux with docker run -ti --rm bash:5.0.18 bash -c 'pids=; for ((i=1;i<550;++i)); do true & pids+=" $!"; done; wait $pids' and with any earlier version. I can't reproduce it with bash:5.1.0.
What's going on?
It looks like a bug in your version of Bash. There were a couple of improvements to jobs.c and wait.def in Bash 5.1, and the changelog mentions "Make sure SIGCHLD is blocked in all cases where waitchld() is not called from a signal handler". From the look of it, the issue is with handling a SIGCHLD signal while already handling another SIGCHLD signal.

I've got a script that takes quite a long time to run, as it has to handle many thousands of files. I want to make this script as foolproof as possible. To this end, I want to check whether the user ran the script using nohup and '&'. E.g.
me@myHost:/home/me/bin $ nohup &. I want to make 100% sure the script was run with nohup and '&', because it's a very painful recovery process if the script dies in the middle for whatever reason.
How can I check those two key parameters inside the script itself? And if they are missing, how can I stop the script before it gets any farther, and complain to the user that they ran the script wrong? Better yet, is there a way I can force the script to run with nohup &?
Edit: the server environment is AIX 7.1
The ps utility can get the process state. The process state code will contain the character + when running in foreground. Absence of + means code is running in background.
However, it will be hard to tell whether the background script was invoked using nohup. It's also almost impossible to rely on the presence of nohup.out as output can be redirected by user elsewhere at will.
There are 2 ways to accomplish what you want to do. Either bail out and warn the user or automatically restart the script in background.
mypid=$$
if [[ $(ps -o stat= -p $mypid) =~ "+" ]]; then
    echo Running in foreground.
    exec nohup $0 "$@" &
    exit
fi
# the rest of the script
In this code, if the process has a state code +, it will print a warning then restart the process in background. If the process was started in the background, it will just proceed to the rest of the code.
If you prefer to bail out and just warn the user, you can remove the exec line and keep the exit. Note that because the exec line ends with &, it runs in a background subshell while the original shell carries on, so the exit after it is what stops the foreground copy.
One good way to find if a script is logging to nohup, is to first check that the nohup.out exists, and then to echo to it and ensure that you can read it there. For example:
echo "complextag"
if ! grep -q "complextag" nohup.out; then
    # various commands complaining to the user, then exiting
    echo "This script must be run with nohup." >&2
    exit 1
fi
This works because if the script's stdout is going to nohup.out, where it should be going (or whatever output file you specified), then when you echo that phrase, it should be appended to the file nohup.out. If it doesn't appear there, then the script was not run using nohup and you can scold them, perhaps by using a wall command on a temporary broadcast file. (If you want me to elaborate on that I can.)
As for being run in the background, if it's not running you should know by checking nohup.

I want to send multiple jobs to a remote computer. Therefore I wrote a for loop which iterates over i jobs which consist of several subcommands. I need to pause the subsequent iteration until a certain subcommand is executed and the job actually runs on the remote computer.
So the idea is to check whether the string "PEND" appears in the output of a command on the remote computer. I want the for loop to continue when "PEND" changes to "RUN". I don't know whether the if statement is the right thing to use here. A fixed waiting time by using sleep wouldn't do the trick as the status change from PEND to RUN is highly irregular.
Additional information: The subcommands comprise compilation of an executable.
Erroneous pseudocode:
for i in {1..10}
if [[ jobs | grep "PEND" == TRUE ]]; then sleep 1
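A loop rather than a single if is probably what the pseudocode above is reaching for: poll, and sleep only while the job is still pending. The sketch below assumes a hypothetical `job_status` function standing in for the real remote query (for example, an ssh call that prints the job table); here it simply reports PEND once and then RUN so that the example runs locally. The cap of 10 polls is also an assumption.

```shell
# Hypothetical stand-in for the remote query: reports PEND once, then RUN.
polls=0
job_status() {
    polls=$((polls + 1))
    if [ "$polls" -le 1 ]; then status_line="job PEND"; else status_line="job RUN"; fi
}

wait_until_running() {
    max_polls=$1                       # give up after this many checks
    n=0
    while [ "$n" -lt "$max_polls" ]; do
        job_status
        case $status_line in
            *PEND*) sleep 1 ;;         # still queued: pause, then ask again
            *)      return 0 ;;        # no longer pending
        esac
        n=$((n + 1))
    done
    return 1                           # still pending after max_polls checks
}

for i in 1 2; do
    # submit job $i here, then block until it has left the PEND state
    polls=0
    wait_until_running 10 && echo "job $i is running"
done
```

Capping the number of polls keeps the outer loop from hanging forever if a job never leaves the queue.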

I have a task that works very well inside a bash for loop. The problem, though, is that a few of the iterations seem not to terminate. What I'm looking for is a way to introduce a timeout, so that if an iteration of the command hasn't terminated after e.g. two hours, it is terminated and the loop moves on to the next iteration.
Rough outline:
for somecondition; do
while time-run(command) < 2h do
continue command
One (tedious) way is to start the process in the background, then start another background process that attempts to kill the first one after a fixed timeout.
timeout=7200 # two hours, in seconds
for somecondition; do
    command & command_pid=$!
    ( sleep $timeout & wait; kill $command_pid 2>/dev/null) & sleep_pid=$!
    wait $command_pid
    kill $sleep_pid 2>/dev/null # If command completes prior to the timeout
done
The wait command blocks until the original command completes, whether naturally or because it was killed after the sleep completes. The wait immediately after sleep is used in case the user tries to interrupt the process, since sleep ignores most signals, but wait is interruptible.
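The same pattern can be wrapped in a function so that each loop iteration is a single call. This is a sketch under assumptions: `run_with_timeout` is a hypothetical name, and the trap inside the watcher is an addition of mine so that the watcher's own sleep is cleaned up when the command finishes early.

```shell
run_with_timeout() {
    limit=$1; shift                    # seconds to allow, then the command and its arguments
    "$@" &
    command_pid=$!
    (
        # Watcher: if told to stop early, take its own sleep down with it.
        trap 'kill "$sleeper" 2>/dev/null; exit 0' TERM
        sleep "$limit" &
        sleeper=$!
        wait "$sleeper"
        kill "$command_pid" 2>/dev/null    # time is up: terminate the command
    ) &
    watcher_pid=$!
    wait "$command_pid"
    status=$?
    kill "$watcher_pid" 2>/dev/null    # command ended first: stop the watcher
    return "$status"
}

# Demo with a one-second limit on a five-second command.
run_with_timeout 1 sleep 5 || echo "timed out or failed with status $?"
```

A status above 128 means the command was ended by a signal (128 plus the signal number). Where GNU coreutils is available, its `timeout` utility covers the same need without hand-rolled process management.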
If I'm understanding your requirement properly, you have a process that needs to run, but you want to make sure that if it gets stuck it moves on, right? I don't know if this will fully help you out, but here is something I wrote a while back to do something similar (I've since improved this a bit, but I only have access to a gist at present, I'll update with the better version later).
# Program:
# Date Created: 22 Aug 2012
# Description: parses logs in real time into daily error files
# Date Updated: N/A
# Developer: @DarrellFX
#Prefix for pid file
#output directory
#Simple function to see if running on primary
checkPrime ()
{
    if /sbin/ifconfig eth0:0|/bin/grep -wq inet;then isPrime=1;else isPrime=0;fi
}
#function to kill previous instances of this script
killScript ()
{
    /usr/bin/find /var/run -name "${pidPrefix}.*.pid" |while read pidFile;do
        if [[ "${pidFile}" != "/var/run/${pidPrefix}.${$}.pid" ]];then
            /bin/kill -- -$(/bin/cat ${pidFile})
            /bin/rm ${pidFile}
        fi
    done
}
#Check to see if primary
checkPrime
#If so, kill any previous instance and start log parsing
#If not, just kill leftover running processes
if [[ "${isPrime}" -eq 1 ]];then
    echo "$$" > /var/run/${pidPrefix}.$$.pid
    killScript
    commands && commands && commands #Where the actual command to run goes.
else
    killScript
    exit 0
fi
I then set this script to run on cron every hour. Every time the script is run, it
creates a lock file named after a variable that describes the script that contains the pid of that instance of the script
calls the function killScript which:
uses the find command to find all lock files for that version of the script (this lets more than one of these scripts be set to run in cron at once, for different tasks). For each file it finds, it kills the processes of that lock file and removes the lock file (it automatically checks that it's not killing itself)
Starts doing whatever it is I need to run and not get stuck (I've omitted that as it's hideous bash string manipulation that I've since redone in python).
If this doesn't get you squared let me know.
A few notes:
the checkPrime function is poorly done, and should either return a status, or just exit the script itself
there are better ways to create lock files and be safe about it, but this has worked for me thus far (famous last words)

I have several binaries in the same folder that I want to run in sequence.
Each binary does not terminate by itself and is waiting for data from a socket interface. Also, I need to decide whether to run the next binary based on the output of the previous binary. I am thinking of running them in the background and redirect the output of the previous binary to a file and "grep" for the keyword. However, unless I use wait, I couldn't capture all the output I want from running the previous binary. But if I use wait, I can't get control back because the binary is listening on socket and wouldn't return.
What can I do here?
a sample code here:
/home/test_1 & > test_1_log
===> I also want to grep "Success" in test_1_log here.
===> can't get here because of wait.
/home/test_2 & >test_2_log
Can you use sleep instead of wait?
The problem is that you can't wait for it to return, because it won't. At the same time, you have to wait for some output. If you know that "Success" or something will be output, then you can loop until that line appears with a sleep.
RC=1
while [ $RC != 0 ]; do
    sleep 1
    grep -q 'Success' test_1_log; RC=$?
done
That also allows you to stop waiting after, say, 10 iterations or so, making sure your script exits.
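A sketch of that bounded variant is below. The function name `wait_for_line`, the temporary log file, and the subshell that writes "Success" after a second are all assumptions made so the example runs on its own. In the real script the log would be test_1_log, written by /home/test_1.

```shell
wait_for_line() {
    pattern=$1 file=$2 tries=$3        # text to look for, log file, maximum checks
    while [ "$tries" -gt 0 ]; do
        if grep -q "$pattern" "$file" 2>/dev/null; then return 0; fi
        sleep 1
        tries=$((tries - 1))
    done
    return 1                           # never saw the line
}

log=$(mktemp)
( sleep 1; echo "Success" >> "$log" ) &    # stand-in for the first binary writing its log
if wait_for_line 'Success' "$log" 10; then
    echo "first binary reported Success; start the next one here"
else
    echo "gave up waiting" >&2
fi
rm -f "$log"
```

Because the loop polls the file instead of calling wait, the first binary can keep listening on its socket in the background while the script moves on.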
