Why can't I use job control in a bash script?

In this answer to another question, I was told that
in scripts you don't have job control
(and trying to turn it on is stupid)
This is the first time I've heard this, and I've pored over the bash.info section on Job Control (chapter 7), finding no mention of either of these assertions. [Update: The man page is a little better, mentioning 'typical' use, default settings, and terminal I/O, but no real reason why job control is particularly ill-advised for scripts.]
So why doesn't script-based job-control work, and what makes it a bad practice (aka 'stupid')?
Edit: The script in question starts a background process, starts a second background process, then attempts to put the first process back into the foreground so that it has normal terminal I/O (as if run directly), which can then be redirected from outside the script. Can't do that to a background process.
As noted by the accepted answer to the other question, there exist other scripts that solve that particular problem without attempting job control. Fine. And the lambasted script uses a hard-coded job number — Obviously bad. But I'm trying to understand whether job control is a fundamentally doomed approach. It still seems like maybe it could work...
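To make the scenario concrete, here is a minimal sketch of the kind of script being discussed (the daemon names are hypothetical; this only illustrates the approach in question, not a recommendation):

#!/bin/bash
set -m            # the contested part: enabling job control inside a script
first-daemon &    # hypothetical first background process
second-daemon &   # hypothetical second background process
fg %1             # try to give the first process normal terminal I/O again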

What he meant is that job control is turned off by default in non-interactive mode (i.e., in a script).
From the bash man page:
JOB CONTROL
Job control refers to the ability to selectively stop (suspend)
the execution of processes and continue (resume) their execution at a
later point.
A user typically employs this facility via an interactive interface
supplied jointly by the system’s terminal driver and bash.
and
set [--abefhkmnptuvxBCHP] [-o option] [arg ...]
...
-m Monitor mode. Job control is enabled. This option is on by
default for interactive shells on systems that support it (see
JOB CONTROL above). Background processes run in a separate
process group and a line containing their exit status is
printed upon their completion.
When he said it "is stupid" he meant two things: not only
is job control meant mostly for facilitating interactive control (whereas a script can work directly with the PIDs), but also,
to quote his original answer, it "... relies on the fact that you didn't start any other jobs previously in the script which is a bad assumption to make." Which is quite correct.
UPDATE
In answer to your comment: yes, nobody will stop you from using job control in your bash script -- there is no hard case for forcefully disabling set -m (i.e. yes, job control from the script will work if you want it to.) Remember that in the end, especially in scripting, there is always more than one way to skin a cat, but some ways are more portable, more reliable, make it simpler to handle error cases, parse the output, etc.
Your particular circumstances may or may not warrant a way different from what lhunath (and other users) deem "best practices".
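As a rough illustration of that last point, a script can opt back in with set -m. This is only a sketch, and it assumes the script is actually run from a terminal (so the shell has terminal I/O to hand over):

#!/bin/bash
set -m          # re-enable job control in this non-interactive shell
sleep 30 &      # becomes job %1, running in its own process group
jobs            # the job table now behaves much as it does interactively
fg %1           # bring the job to the foreground and wait for it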

Job control with bg and fg is useful only in interactive shells. But & in conjunction with wait is useful in scripts too.
On multiprocessor systems, spawning background jobs can greatly improve a script's performance, e.g. in build scripts where you want to start at least one compiler per CPU, or when processing images with ImageMagick tools in parallel, etc.
The following example runs up to 8 parallel gcc's to compile all source files in an array:
#!/bin/bash
...
for ((i = 0, end = ${#sourcefiles[@]}; i < end; )); do
  for ((cpu_num = 0; cpu_num < 8; cpu_num++, i++)); do
    if ((i < end)); then gcc "${sourcefiles[i]}" & fi
  done
  wait
done
There is nothing "stupid" about this. But you'll need the wait command, which waits for all background jobs before the script continues. The PID of the last background job is stored in the $! variable, so you may also wait $!. Note also the nice command.
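For instance, a small sketch of waiting on one specific job via $! rather than on all background jobs, with nice lowering its priority (the command name is made up):

nice -n 10 ./heavy-task &   # hypothetical long-running command, started at lower priority
task_pid=$!                 # PID of the most recently started background job
wait "$task_pid"            # block until that particular job finishes
echo "heavy-task exited with status $?"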
Sometimes such code is useful in makefiles:
buildall:
	for cpp_file in *.cpp; do gcc -c $$cpp_file & done; wait
This gives much finer control than make -j.
Note that & is a line terminator like ; (write command& not command&;).
Hope this helps.

Job control is useful only when you are running an interactive shell, i.e., you know that stdin and stdout are connected to a terminal device (/dev/pts/* on Linux). Then, it makes sense to have something on foreground, something else on background, etc.
Scripts, on the other hand, don't have such a guarantee. Scripts can be made executable and run without any terminal attached. It doesn't make sense to have foreground or background processes in that case.
You can, however, run other commands non-interactively in the background (by appending "&" to the command line) and capture their PIDs with $!. Then you can use kill to terminate or suspend them (simulating Ctrl-C or Ctrl-Z on the terminal, as if the shell were interactive). You can also use wait (instead of fg) to wait for the background process to finish.
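A minimal sketch of that pattern (the command name is hypothetical):

./long-task &          # start the command in the background
pid=$!                 # capture its PID
kill -STOP "$pid"      # suspend it -- roughly what Ctrl-Z does interactively
kill -CONT "$pid"      # resume it again
wait "$pid"            # block until it finishes, instead of using fg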

It could be useful to turn on job control in a script to set traps on
SIGCHLD. The JOB CONTROL section in the manual says:
The shell learns immediately whenever a job changes state. Normally,
bash waits until it is about to print a prompt before reporting
changes in a job's status so as to not interrupt any other output. If
the -b option to the set builtin command is enabled, bash reports
such changes immediately. Any trap on SIGCHLD is executed for each
child that exits.
(emphasis is mine)
Take the following script, as an example:
dualbus@debian:~$ cat children.bash
#!/bin/bash
set -m
count=0 limit=3
trap 'counter && { job & }' CHLD
job() {
  local amount=$((RANDOM % 8))
  echo "sleeping $amount seconds"
  sleep "$amount"
}
counter() {
  ((count++ < limit))
}
counter && { job & }
wait
dualbus@debian:~$ chmod +x children.bash
dualbus@debian:~$ ./children.bash
sleeping 6 seconds
sleeping 0 seconds
sleeping 7 seconds
Note: CHLD trapping seems to be broken as of bash 4.3
In bash 4.3, you could use 'wait -n' to achieve the same thing,
though:
dualbus@debian:~$ cat waitn.bash
#!/home/dualbus/local/bin/bash
count=0 limit=3
trap 'kill "$pid"; exit' INT
job() {
  local amount=$((RANDOM % 8))
  echo "sleeping $amount seconds"
  sleep "$amount"
}
for ((i=0; i<limit; i++)); do
  ((i>0)) && wait -n; job & pid=$!
done
dualbus@debian:~$ chmod +x waitn.bash
dualbus@debian:~$ ./waitn.bash
sleeping 3 seconds
sleeping 0 seconds
sleeping 5 seconds
You could argue that there are other ways to do this in a more
portable way, that is, without CHLD or wait -n:
dualbus@debian:~$ cat portable.sh
#!/bin/sh
count=0 limit=3
trap 'counter && { brand; job & }; wait' USR1
unset RANDOM; rseed=123459876$$
brand() {
  [ "$rseed" -eq 0 ] && rseed=123459876
  h=$((rseed / 127773))
  l=$((rseed % 127773))
  rseed=$((16807 * l - 2836 * h))
  RANDOM=$((rseed & 32767))
}
job() {
  amount=$((RANDOM % 8))
  echo "sleeping $amount seconds"
  sleep "$amount"
  kill -USR1 "$$"
}
counter() {
  [ "$count" -lt "$limit" ]; ret=$?
  count=$((count+1))
  return "$ret"
}
counter && { brand; job & }
wait
dualbus@debian:~$ chmod +x portable.sh
dualbus@debian:~$ ./portable.sh
sleeping 2 seconds
sleeping 5 seconds
sleeping 6 seconds
So, in conclusion, set -m is not that useful in scripts, since
the only interesting feature it brings to scripts is being able to
work with SIGCHLD. And there are other ways to achieve the same thing
either shorter (wait -n) or more portable (sending signals yourself).

Bash does support job control, as you say. In shell script writing, there is often an assumption that you can't rely on the fact that you have bash, but that you have the vanilla Bourne shell (sh), which historically did not have job control.
I'm hard-pressed these days to imagine a system in which you are honestly restricted to the real Bourne shell. Most systems' /bin/sh will be linked to bash. Still, it's possible. One thing you can do is instead of specifying
#!/bin/sh
You can do:
#!/bin/bash
That, and your documentation, would make it clear your script needs bash.

Possibly o/t, but I quite often use nohup when I ssh into a server to run a long-running job, so that if I get logged out the job still completes.
I wonder if people are confusing stopping and starting from a master interactive shell and spawning background processes? The wait command allows you to spawn a lot of things and then wait for them all to complete, and like I said I use nohup all the time. It's more complex than this and very underused - sh supports this mode too. Have a look at the manual.
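For instance, a typical nohup invocation might look like this (the command and log file names are arbitrary):

nohup ./long-job > long-job.log 2>&1 &   # survives logout; output is collected in the log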
You've also got
kill -STOP pid
I quite often do that if I want to suspend the currently running sudo, as in:
kill -STOP $$
But woe betide you if you've jumped out to the shell from an editor - it will all just sit there.
I tend to use mnemonic -KILL etc. because there's a danger of typing
kill - 9 pid # note the space
and in the old days you could sometimes bring the machine down because it would kill init!

jobs DO work in bash scripts.
BUT, you NEED to watch out for where the spawned stuff lives. For example, in:
ls -1 /usr/share/doc/ | while read -r doc ; do ... done
jobs will have a different context on each side of the |.
One way to bypass this is to use for instead of while:
for doc in $(ls -1 /usr/share/doc) ; do ... done
The following should demonstrate how to use jobs in a script, with the mention that my commented note about needing jobs inside the loop reflects real observed behaviour (I don't know why it behaves that way):
#!/bin/bash
for i in `seq 7` ; do ( sleep 100 ) & done
jobs
while [ `jobs | wc -l` -ne 0 ] ; do
  for jobnr in `jobs | awk '{print $1}' | cut -d\[ -f2- | cut -d\] -f1` ; do
    kill %$jobnr
  done
  # this is REALLY ODD ... but the while loop won't exit without this ... dunno why
  jobs >/dev/null 2>/dev/null
done
sleep 1
jobs

Related

shell: clean up leaked background processes which hang due to shared stdout/stderr

I need to run essentially arbitrary commands on a (remote) shell in ephemeral containers/VMs for a test execution engine. Sometimes these leak background processes which then cause the entire command to hang. This can be boiled down to this simple command:
$ sh -c 'sleep 30 & echo payload'
payload
$
Here the backgrounded sleep 30 plays the role of a leaked process (which in reality will be something like dbus-daemon) and the echo is the actual thing I want to run. The sleep 30 & echo payload should be considered as an atomic opaque example command here.
The above command is fine and returns immediately as the shell's and also sleep's stdout/stderr are a PTY. However, when capturing the output of the command to a pipe/file (a test runner wants to save everything into a log, after all), the whole command hangs:
$ sh -c 'sleep 30 & echo payload' | cat
payload
# ... does not return to the shell (until the sleep finishes)
Now, this could be fixed with some rather ridiculously complicated shell magic which determines the FDs of stdout/err from /proc/$$/fd/{1,2}, iterating over ls /proc/[0-9]*/fd/* and killing every process which also has the same stdout/stderr. But this involves a lot of brittle shell code and expensive shell string comparisons.
Is there a way to clean up these leaked background processes in a more elegant and simpler way? setsid does not help:
$ sh -c 'setsid -w sh -c "sleep 30 & echo payload"' | cat
payload
# hangs...
Note that process groups/sessions and killing them wholesale isn't sufficient as leaked processes (like dbus-daemon) often setsid themselves.
P.S. I can only assume POSIX shell or bash in these environments; no Python, Perl, etc.
Thank you in advance!
We had this problem with parallel tests in Launchpad. The simplest solution we had then - which worked well - was just to make sure that no processes share stdout/stdin/stderr (except ones where you actually want to hang if they haven't finished - e.g. the test workers themselves).
Hmm, having re-read this I cannot give you the solution you are after (use systemd to kill them). What we came up with is to simply ignore the processes but reliably not hang when the single process we were waiting for is done. Note that this is distinctly different from the pipes getting closed.
Another option, not perfect but useful, is to become a local reaper with prctl(2) and PR_SET_CHILD_SUBREAPER. This will allow you to be the parent of all the processes that would otherwise reparent to init. With this arrangement you could try to kill all the processes that have you as ppid. This is terrible but it's the closest best thing to using cgroups.
But note, that unless you are running this helper as root you will find that practical testing might spawn some setuid thing that will lurk and won't be killable. It's an annoying problem really.
Use script -qfc instead of sh -c.
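Sketching how that might look for the example above, assuming the util-linux script(1), which allocates a PTY for the command so that the pipe is held by script itself rather than by the leaked process (exact behaviour may vary between script versions):

$ script -qfc 'sleep 30 & echo payload' /dev/null | cat
payload
$    # returns promptly; the leaked sleep keeps running against the PTY, not the pipe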

How to make bash interpreter stop until a command is finished?

I have a bash script with a loop that calls a heavy calculation routine on every iteration. I use the results from each calculation as input to the next. I need to make bash stop reading the script until each calculation is finished.
for i in $(cat calculation-list.txt)
do
  ./calculation
  (other commands)
done
I know about the sleep program, and I used to use it, but now the time the calculations take varies greatly.
Thanks for any help you can give.
P.S.
The "./calculation" is another program, and it opens a subprocess. The script then passes instantly to the next step, but I get an error in the calculation because the previous one is not finished yet.
If your calculation daemon will work with a precreated empty logfile, then the inotify-tools package might serve:
touch $logfile
inotifywait -qqe close $logfile & ipid=$!
./calculation
wait $ipid
(edit: stripped a stray semicolon)
if it closes the file just once.
If it's doing an open/write/close loop, perhaps you can mod the daemon process to wrap some other filesystem event around the execution?
#!/bin/sh
# Uglier, but handles logfile being closed multiple times before exit:
# Have the ./calculation start this shell script, perhaps by substituting
# this for the program it's starting
trap 'echo >closed-on-calculation-exit' 0 1 2 3 15
./real-calculation-daemon-program
Well, guys, I've solved my problem with a different approach. When the calculation is finished, a logfile is created. I then wrote a simple until loop with a sleep command. Although this is very ugly, it works for me and it's enough.
for i in $(cat calculation-list.txt)
do
  (calculations routine)
  until [[ -f $logfile ]]; do
    sleep 60
  done
  (other commands)
done
Easy. Get the process ID (PID) via some awk magic and then use wait to wait for that PID to end. Here are the details on wait from the Advanced Bash-Scripting Guide:
Suspend script execution until all jobs running in background have
terminated, or until the job number or process ID specified as an
option terminates. Returns the exit status of waited-for command.
You may use the wait command to prevent a script from exiting before a
background job finishes executing (this would create a dreaded orphan
process).
And using it within your code should work like this:
for i in $(cat calculation-list.txt)
do
  ./calculation >/dev/null 2>&1 & CALCULATION_PID=(`jobs -l | awk '{print $2}'`);
  wait ${CALCULATION_PID}
  (other commands)
done
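Since $! already holds the PID of the most recent background job, a slightly simpler variant of the same idea (a sketch, with the same limitation that it only waits for the directly spawned process) is:

for i in $(cat calculation-list.txt)
do
  ./calculation >/dev/null 2>&1 &
  wait $!               # block until this calculation's process exits
  (other commands)
done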

Introduce timeout in a bash for-loop

I have a task that fits very well inside a bash for loop. The situation, though, is that a few of the iterations seem not to terminate. What I'm looking for is a way to introduce a timeout so that if an iteration of the command hasn't terminated after e.g. two hours, it will be killed and the loop will move on to the next iteration.
Rough outline:
for somecondition; do
  while time-run(command) < 2h do
    continue command
  done
done
One (tedious) way is to start the process in the background, then start another background process that attempts to kill the first one after a fixed timeout.
timeout=7200 # two hours, in seconds
for somecondition; do
  command & command_pid=$!
  ( sleep $timeout & wait; kill $command_pid 2>/dev/null ) & sleep_pid=$!
  wait $command_pid
  kill $sleep_pid 2>/dev/null # If command completes prior to the timeout
done
The wait command blocks until the original command completes, whether naturally or because it was killed after the sleep completes. The wait immediately after sleep is used in case the user tries to interrupt the process, since sleep ignores most signals, but wait is interruptible.
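As an aside (not part of the answer above): if GNU coreutils' timeout(1) is available on the system, the same effect is considerably simpler:

for somecondition; do
  timeout 2h command    # command is killed after two hours; exit status 124 indicates a timeout
done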
If I'm understanding your requirement properly, you have a process that needs to run, but you want to make sure that if it gets stuck it moves on, right? I don't know if this will fully help you out, but here is something I wrote a while back to do something similar (I've since improved this a bit, but I only have access to a gist at present, I'll update with the better version later).
#!/bin/bash
######################################################
# Program: logGen.sh
# Date Created: 22 Aug 2012
# Description: parses logs in real time into daily error files
# Date Updated: N/A
# Developer: #DarrellFX
######################################################

#Prefix for pid file
pidPrefix="logGen"

#output directory
outDir="/opt/Redacted/logs/allerrors"

#Simple function to see if running on primary
checkPrime ()
{
    if /sbin/ifconfig eth0:0 | /bin/grep -wq inet; then isPrime=1; else isPrime=0; fi
}

#function to kill previous instances of this script
killScript ()
{
    /usr/bin/find /var/run -name "${pidPrefix}.*.pid" | while read pidFile; do
        if [[ "${pidFile}" != "/var/run/${pidPrefix}.${$}.pid" ]]; then
            /bin/kill -- -$(/bin/cat ${pidFile})
            /bin/rm ${pidFile}
        fi
    done
}

#Check to see if primary
#If so, kill any previous instance and start log parsing
#If not, just kill leftover running processes
checkPrime
if [[ "${isPrime}" -eq 1 ]]; then
    echo "$$" > /var/run/${pidPrefix}.$$.pid
    killScript
    commands && commands && commands #Where the actual command to run goes.
else
    killScript
    exit 0
fi
I then set this script to run on cron every hour. Every time the script is run, it:
- creates a lock file, named after a variable that describes the script, containing the pid of that instance of the script
- calls the function killScript, which uses the find command to find all lock files for that version of the script (this lets more than one of these scripts be set to run in cron at once, for different tasks); for each file it finds, it kills the processes of that lock file and removes the lock file (it automatically checks that it's not killing itself)
- starts doing whatever it is I need to run and not get stuck (I've omitted that as it's hideous bash string manipulation that I've since redone in python).
If this doesn't get you squared away, let me know.
A few notes:
the checkPrime function is poorly done, and should either return a status, or just exit the script itself
there are better ways to create lock files and be safe about it, but this has worked for me thus far (famous last words)
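On the "better ways to create lock files" note, one common pattern is flock(1); a hedged sketch (the lock path is arbitrary):

(
  flock -n 9 || exit 1        # give up immediately if another instance holds the lock
  commands && commands        # the actual work, protected by the lock
) 9>/var/run/logGen.lock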

How to switch a sequence of tasks to background?

I'm running two tests on a remote server, here is the command I used several hours ago:
% ./test1.sh; ./test2.sh
The two tests are supposed to run one after the other. If the second runs before the first completes, everything will be ruined, and I'll have to restart the whole procedure.
The dilemma is that these two tasks take many hours to complete, and now that I'm preparing to log out of the server and wait for the result, I don't know how to switch both of them to the background... If I use Ctrl+Z, only the first task will be suspended, while the second starts, doing nothing useful while wiping out current data.
Is it possible to switch both of them to the background, preserving their order? Actually I should have put these two tasks in the same process group, like (./test1.sh; ./test2.sh) &, but sadly the first test has already run for several hours, and it's quite a pity to restart the tests.
An option is to kill the second test before it starts, but is there any mechanism to cope with this?
First rename the ./test2.sh to ./test3.sh. Then do [CTRL+Z], followed by bg and disown -h. Then save this script (test4.sh):
while :; do
  sleep 5
  pgrep -f test1.sh &> /dev/null
  if [ $? -ne 0 ]; then
    nohup ./test3.sh &
    break
  fi
done
then do: nohup ./test4.sh &.
and you can logout.
First, screen or tmux are your friends here, if you don't already work with them (they make remote machine work an order of magnitude easier).
To use conditional consecutive execution you can write:
./test1.sh && ./test2.sh
which will only execute test2.sh if test1.sh returns with 0 (conventionally meaning: no error). Example:
$ true && echo "first command was successful"
first command was successful
$ ! true && echo "ain't gonna happen"
More on control operators: http://www.humbug.in/docs/the-linux-training-book/ch08s01.html

Is there a way to make bash job control quiet?

Bash is quite verbose when running jobs in the background:
$ echo toto&
toto
[1] 15922
[1]+ Done echo toto
Since I'm trying to run jobs in parallel and use the output, I'd like to find a way to silence bash. Is there a way to remove this superfluous output?
You can use parentheses to run a background command in a subshell, and that will silence the job control messages. For example:
(sleep 10 & )
Note: The following applies to interactive Bash sessions. In scripts, job-control messages are never printed.
There are 2 basic scenarios for silencing Bash's job-control messages:
Launch-and-forget:
CodeGnome's helpful answer suggests enclosing the background command in a simple subshell - e.g., (sleep 10 &) - which effectively silences job-control messages - both on job creation and on job termination.
This has an important side effect:
By using control operator & inside the subshell, you lose control of the background job - jobs won't list it, and neither %% (the spec. (ID) of the most recently launched job) nor $! (the PID of the (last) process launched (as part of) the most recent job) will reflect it.[1]
For launch-and-forget scenarios, this is not a problem:
You just fire off the background job,
and you let it finish on its own (and you trust that it runs correctly).
[1] Conceivably, you could go looking for the process yourself, by searching running processes for ones matching its command line, but that is cumbersome and not easy to make robust.
Launch-and-control-later:
If you want to remain in control of the job, so that you can later:
kill it, if need be.
synchronously wait (at some later point) for its completion,
a different approach is needed:
Silencing the creation job-control messages is handled below, but in order to silence the termination job-control messages categorically, you must turn the job-control shell option OFF:
set +m (set -m turns it back on)
Caveat: This is a global setting that has a number of important side effects, notably:
Stdin for background commands is then /dev/null rather than the current shell's.
The keyboard shortcuts for suspending (Ctrl-Z) and delay-suspending (Ctrl-Y) a foreground command are disabled.
For the full story, see man bash and (case-insensitively) search for occurrences of "job control".
To silence the creation job-control messages, enclose the background command in a group command and redirect the latter's stderr output to /dev/null:
{ sleep 5 & } 2>/dev/null
The following example shows how to quietly launch a background job while retaining control of the job in principle.
$ set +m; { sleep 5 & } 2>/dev/null # turn job-control option off and launch quietly
$ jobs # shows the job just launched; it will complete quietly due to set +m
If you do not want to turn off the job-control option (set +m), the only way to silence the termination job-control message is to either kill the job or wait for it:
Caveat: There are two edge cases where this technique still produces output:
If the background command tries to read from stdin right away.
If the background command terminates right away.
To launch the job quietly (as above, but without set +m):
$ { sleep 5 & } 2>/dev/null
To wait for it quietly:
$ wait %% 2>/dev/null # use of %% is optional here
To kill it quietly:
{ kill %% && wait; } 2>/dev/null
The additional wait is necessary to make the termination job-control message that is normally displayed asynchronously by Bash (at the time of actual process termination, shortly after the kill) a synchronous output from wait, which then allows silencing.
But, as stated, if the job completes by itself, a job-control message will still be displayed.
Wrap it in a dummy script:
quiet.sh:
#!/bin/bash
"$@" &
then call it, passing your command to it as an argument:
./quiet.sh echo toto
You may need to play with quotes depending on your input.
Interactively, no. It will always display job status. You can influence when the status is shown using set -b.
There's nothing preventing you from using the output of your commands (via pipes, or storing it variables, etc). The job status is sent to the controlling terminal by the shell and doesn't mix with other I/O. If you're doing something complex with jobs, the solution is to write a separate script.
The job messages are only really a problem if you have, say, functions in your bashrc which make use of job control which you want to have direct access to your interactive environment. Unfortunately there's nothing you can do about it.
One solution (in bash anyway) is to route all the output to /dev/null
echo 'hello world' > /dev/null &
The above will not give you any output other than the id for the bg process.
