How to create an anonymous pipe between 2 child processes and know their pids (while not using files/named pipes)? - bash

Please note that this question was edited after a couple of comments I received. Initially I wanted to split my goal into smaller pieces to make it simpler (and perhaps expand my knowledge on various fronts), but it seems I went too far with the simplicity :). So, here I am asking the big question.
Using bash, is there a way one can actually create an anonymous pipe between two child processes and know their pids?
The reason I'm asking is when you use the classic pipeline, e.g.
cmd1 | cmd2 &
you lose the ability to send signals to cmd1. In my case the actual commands I am running are these
./my_web_server | ./my_log_parser &
my_web_server is a basic web server that dumps a lot of logging information to its stdout.
my_log_parser is a log parser that I wrote that reads through all the logging information it receives from my_web_server and it basically selects only certain values from the log (in reality it actually stores the whole log as it received it, but additionally it creates an extra csv file with the values it finds).
The issue I am having is that my_web_server actually never stops by itself (it is a web server, you don't want that from a web server :)). So after I am done, I need to stop it myself. I would like for the bash script to do this when I stop it (the bash script), either via SIGINT or SIGTERM.
For something like this, traps are the way to go. In essence I would create a trap for INT and TERM and the function it would call would kill my_web_server, but... I don't have the pid and even though I know I could look for it via ps, I am looking for a pretty solution :).
Some of you might say: "Well, why don't you just kill my_log_parser and let my_web_server die on its own with SIGPIPE?". The reason why I don't want to kill it is that when you kill a process that's at the end of the pipeline, the output buffer of the process before it is not flushed. Ergo, you lose stuff.
I've seen several solutions here and in other places that suggested storing the pid of my_web_server in a file. This is a solution that works; it is possible to write the pipeline by fiddling with the file descriptors a bit. I, however, don't like this solution, because I have to generate files. I don't like the idea of creating arbitrary files just to store a 5-character PID :).
What I ended up doing for now is this:
#!/bin/bash

# No-op handler so SIGHUP does not kill the script
trap " " HUP

# Create a uniquely named pipe, wire both processes to it, then unlink it
fifo="$( mktemp -u "$( basename "${0}" ).XXXXXX" )"
mkfifo "${fifo}"

<"${fifo}" ./my_log_parser &
parser_pid="$!"
>"${fifo}" ./my_web_server &
server_pid="$!"
rm "${fifo}"

# On INT/TERM, signal the web server; the parser then sees EOF and exits
trap '2>/dev/null kill -TERM '"${server_pid}"'' INT TERM

# wait returns early when a trap fires, so keep waiting until the parser exits
while true; do
    wait "${parser_pid}" && break
done
This solves the issue of not being able to terminate my_web_server when the script receives SIGINT or SIGTERM, and it seems more readable than any file-descriptor hackery whose only purpose is to eventually store my_web_server's pid in a file.
But it still uses a file (named pipe). Even though I know it uses the file (named pipe) for my_web_server and my_log_parser to talk (which is a pretty good reason) and the file gets wiped from the disk very shortly after it's created, it's still a file :).
Would any of you guys know of a way to do this task without using any files (named pipes)?

From the Bash man pages:
! Expands to the process ID of the most recently executed background (asynchronous) command.
You are not running a background command; you are using process substitution to read from file descriptor 3.
The following works, but I'm not sure if it is what you are trying to achieve:
sleep 120 &
child_pid="$!"
wait "${child_pid}"
sleep 120
Edit:
Comment was: I know I can pretty much do this the silly 'while read i; do blah blah; done < <( ./my_proxy_server )'-way, but I don't particularly like the fact that when a script using this approach receives INT or TERM, it simply dies without telling ./my_proxy_server to bugger off too :)
It seems like your problem stems from the fact that it is not so easy to get the PID of the proxy server. So, how about using your own named pipe, together with the trap command:
pipe='/tmp/mypipe'
mkfifo "$pipe"
./my_proxy_server > "$pipe" &
child_pid="$!"
echo "child pid is $child_pid"
# Tell the proxy server to bugger off
trap 'kill "$child_pid"' INT TERM
while read -r
do
    echo "$REPLY"
    # blah blah blah
done < "$pipe"
rm "$pipe"
You could probably also use kill %1 instead of using $child_pid.
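For instance, a minimal sketch of that (assuming the proxy server is the only background job and is therefore job %1):
trap 'kill %1 2>/dev/null' INT TERM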
YAE (Yet Another Edit):
You ask how to get the PIDs from:
./my_web_server | ./my_log_parser &
Simples, sort of. To test I used sleep, just like your original.
sleep 400 | sleep 500 &
jobs -l
Gives:
[1]+ 8419 Running sleep 400
8420 Running | sleep 500 &
So it's just a question of extracting those PIDs:
pid1=$(jobs -l|awk 'NR==1{print $2}')
pid2=$(jobs -l|awk 'NR==2{print $1}')
I hate calling awk twice here, but anything else is just jumping through hoops.
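If the double call bothers you, an untested single-pass sketch over the same jobs -l output (assuming the two-line layout shown above) would be:
pids=( $(jobs -l | awk 'NR==1{print $2} NR==2{print $1}') )
pid1=${pids[0]}
pid2=${pids[1]}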

Related

Why does bash "forget" about my background processes?

I have this code:
#!/bin/bash
pids=()
for i in $(seq 1 999); do
    sleep 1 &
    pids+=( "$!" )
done
for pid in "${pids[@]}"; do
    wait "$pid"
done
I expect the following behavior:
spin through the first loop
wait about a second on the first pid
spin through the second loop
Instead, I get this error:
./foo.sh: line 8: wait: pid 24752 is not a child of this shell
(repeated 171 times with different pids)
If I run the script with shorter loop (50 instead of 999), then I get no errors.
What's going on?
Edit: I am using GNU bash 4.4.23 on Windows.
POSIX says:
The implementation need not retain more than the {CHILD_MAX} most recent entries in its list of known process IDs in the current shell execution environment.
{CHILD_MAX} here refers to the maximum number of simultaneous processes allowed per user. You can get the value of this limit using the getconf utility:
$ getconf CHILD_MAX
13195
Bash stores the statuses of at most twice that many exited background processes in a circular buffer, and says not a child of this shell when you call wait on the PID of an old one that has been overwritten. You can see how it's implemented here.
The way you might reasonably expect this to work, as it would if you wrote a similar program in most other languages, is:
sleep is executed in the background via a fork+exec.
At some point, sleep exits leaving behind a zombie.
That zombie remains in place, holding its PID, until its parent calls wait to retrieve its exit code.
However, shells such as bash actually do this a little differently. They proactively reap their zombie children and store their exit codes in memory so that they can deallocate the system resources those processes were using. Then when you wait the shell just hands you whatever value is stored in memory, but the zombie could be long gone by then.
Now, because all of these exit statuses are being stored in memory, there is a practical limit to how many background processes can exit without you calling wait before you've filled up all the memory you have available for this in the shell. I expect that you're hitting this limit somewhere in the several hundreds of processes in your environment, while other users manage to make it into the several thousands in theirs. Regardless, the outcome is the same - eventually there's nowhere to store information about your children and so that information is lost.
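If you don't actually need each individual exit status, a sketch of a workaround is to call wait with no arguments, which blocks until every background child has exited and never has to look up a specific (possibly already overwritten) PID:
#!/bin/bash
for i in $(seq 1 999); do
    sleep 1 &
done
wait    # waits for all children, no per-PID bookkeeping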
I can reproduce this on ArchLinux with docker run -ti --rm bash:5.0.18 bash -c 'pids=; for ((i=1;i<550;++i)); do true & pids+=" $!"; done; wait $pids' and with any earlier version. I can't reproduce it with bash:5.1.0.
What's going on?
It looks like a bug in your version of Bash. There were a couple of improvements in jobs.c and wait.def in Bash 5.1, and "Make sure SIGCHLD is blocked in all cases where waitchld() is not called from a signal handler" is mentioned in the changelog. From the look of it, it is an issue with handling a SIGCHLD signal while already handling another SIGCHLD signal.

Bash script lingers after exiting (issues with named pipe I/O)

Summary
I have worked out a solution to the issue of this question.
Basically, the callee (wallpaper) was not itself exiting because it was waiting on another process to finish.
Over the course of 52 days, this problematic side effect had snowballed until 10,000+ lingering processes were consuming 10+ gigabytes of RAM, almost crashing my system.
The offending process turned out to be a call to printf from a function called log that I had sent into the background and forgotten about, because it was writing to a pipe and hanging.
As it turns out, a process writing to a named pipe will block until another process comes along and reads from it.
This, in turn, changed the requirements of the question from "I need a way to stop these processes from building up" to "I need a better way of getting around FIFO I/O than throwing it to the background".
Note that while the question has been solved, I'm more than happy to accept an answer that goes into detail on the technical level. For example, the unsolved mystery of why the caller script's (wallpaper-run) process was being duplicated as well, even though it was only called once, or how to properly read a pipe's state information, rather than relying on open's failure when called with O_NONBLOCK.
The original question follows.
The Question
I have two bash scripts meant to run in a loop. The first, wallpaper-run, runs in an infinite loop and calls the second, wallpaper.
They are part of my "desktop", which is a bunch of hacked together shell scripts augmenting the dwm window manager.
wallpaper-run:
log "starting wallpaper runner"
while true; do
log "..."
$scr/wallpaper
sleep 900 # 15 minutes
done &
wallpaper:
log "changing wallpaper"
# several utility functions ...
if [[ $1 ]]; then
parse_arg $1
else
load_random
fi
Some notes:
log is an exported function from init, which, as its name suggests, logs a message.
init calls wallpaper-run (among other things) in its foreground (hence the while loop being in the background)
$scr is also defined by init; it is the directory where so-called "init-scripts" go
parse_arg and load_random are local to wallpaper
in particular, images are loaded into the background via the program feh
The manner in which wallpaper-run is loaded is as such: $mod/wallpaper-run
init is called directly by startx, and starts dwm before it runs wallpaper-run (and the other "modules")
Now on to the problem, which is that for some reason, both wallpaper-run and wallpaper "linger" in memory. That is to say that after each iteration of the loop, two new instances of wallpaper and wallpaper-run are created, while the "old" ones don't get cleaned up and get stuck in sleep status. It's like a memory leak, but with lingering processes instead of bad memory management.
I found out about this "process leak" after having my system up for 52 days when everything broke ( something like bash: cannot fork: resource temporarily unavailable spammed the terminal whenever I tried to run a command ) because the system ran out of memory. I had to kill over 10,000 instances of wallpaper/run to bring my system back to working order.
I have absolutely no idea why this is the case. I see no reason for these scripts to linger in memory because a script exiting should mean that its process gets cleaned up.
Why are they lingering and eating up resources?
Update 1
With some help from the comments (much thanks to I'L'I), I've traced the problem to the function log, which makes background calls to printf (though why I chose to do that, I don't recall). Here is the function as it appears in init:
log(){
    local pipe=$pipe_front
    if ! [[ -p $pipe ]]; then
        mkfifo $pipe
    fi
    printf ... >> $initlog
    printf ... > $pipe &
    printf ... &
    [[ $2 == "-g" ]] && notify-send "[DWM Init] $1"
    sleep 0.001
}
As you can see, the function is very poorly written. I hacked it together to make it work, not to make it robust.
The second and third printf are sent to the background. I don't recall why I did this, but it's presumably because the first printf must have been making log hang.
The printf lines have been abridged to "...", because they are fairly complex and not relevant to the issue at hand (And also I have better things to do with 40 minutes of my time than fighting with Android's garbage text input interface). In particular, things like the current time, name of the calling process, and the passed message are printed, depending on which printf we're talking about. The first has the most detail because it's saved to a file where immediate context is lost, while the notify-send line has the least amount of detail because it's going to be displayed on the desktop.
The whole pipe debacle is for interfacing directly with init via a rudimentary shell that I wrote for it.
The third printf is intentional; it prints to the tty that I log into at the beginning of a session. This is so that if init suddenly crashes on me, I can see a log of what went wrong, or at least what was happening before it crashed.
I'm including this in the question because this is the root cause of the "leak". If I can fix this function, the issue will be resolved.
The function needs to log the messages to their respective sources and halt until each call to printf finishes, but it also must finish within a timely manner; hanging for an indefinite period of time and/or failing to log the messages is unacceptable behavior.
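For what it's worth, one hedged sketch of such a bounded write (assuming coreutils timeout is available and using $msg as a stand-in for the formatted message):
timeout 0.1 sh -c 'printf "%s\n" "$1" > "$2"' _ "$msg" "$pipe"
The open() on the FIFO blocks while there is no reader, and timeout kills the helper shell after 0.1 seconds, so log can never hang indefinitely (at the cost of dropping the message, and returning a non-zero status, when nobody is reading the pipe).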
Update 2
After isolating the log function (see update 1) into a test script and setting up a mock environment, I've boiled it down to printf.
The printf call which is redirected into a pipe,
printf "..." > $pipe
hangs if nothing is listening to it, because it's waiting for a second process to pick up the read end of the pipe and consume the data. This is probably why I had initially forced them into the background, so that a process could, at some point, read the data from the pipe while, in the immediate case, the system could move on and do other things.
The call to sleep, then, was a not-well-thought-out hack to work around data race problems resulting from one reader trying to read from multiple writers simultaneously. The theory was that if each writer had to wait for 0.001 seconds (despite the fact that the printf in the background has nothing to do with the sleep following it), somehow, that would make the data appear in order and fix the bug. Of course, looking back, that really does nothing useful.
The end result is several background processes hanging on to the pipe, waiting for something to read from it.
The answer to "Prevent hanging of "echo STRING > fifo" when nothing..." presents the same "solution" that caused the bug that spawned this question. Obviously incorrect. However, an interesting comment by user R.. mentioned something about fifos containing state which includes information such as what processes are reading the pipe.
Storing state? You mean the absence/presence of a reader? That's part of the state of the fifo; any attempt to store it outside would be bogus and would be subject to race conditions.
Obtaining this information and refusing to write if there is no reader is the key to solving this.
However, no matter what I search for on Google, I can't seem to find anything about reading the state of a pipe, even in C. I am perfectly willing to use C if need be, but a bash solution (or an existing core util) would be preferred.
So now the question becomes: how in the heck do I read the state information of a FIFO, particularly the process(es) who has (have) the pipe open for reading and/or writing?
https://stackoverflow.com/a/20694422
The above linked answer shows a C program attempting to open a file with O_NONBLOCK. So I tried writing a program whose job is to return 0 (success) if open returns a valid file descriptor, and 1 (fail) if open returns -1.
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int fd = open(argv[1], O_WRONLY | O_NONBLOCK);
    if (fd == -1)
        return 1;
    close(fd);
    return 0;
}
I didn't bother checking whether argv[1] is null or whether open failed because the file doesn't exist, since I only plan to use this program from a shell script where it is guaranteed to be given the correct arguments.
That said, the program does its job:
$ gcc pipe-open.c
$ ./a.out ./pipe && echo "pipe has a reader" || echo "pipe has no reader"
$ ./a.out ./pipe && echo "pipe has a reader" || echo "pipe has no reader"
Assuming the pipe exists, and that between the first and second invocations another process opens it for reading (cat pipe), the output looks like this:
pipe has no reader
pipe has a reader
The program also works if the pipe has a second writer (i.e. it will still fail, because there is no reader).
The only problem is that after closing the file, the reader closes its end of the pipe as well. And removing the call to close won't do any good because all open file descriptors are automatically closed after main returns (control goes to exit, which walks the list of open file descriptors and closes them one by one). Not good!
This means that the only window in which to actually write to the pipe is before it is closed, i.e. from within the C program itself.
#include <fcntl.h>
#include <unistd.h>

int
write_to_pipe(int fd)
{
    char buf[1024];
    ssize_t nread;
    int nsuccess = 0;
    while ((nread = read(0, buf, 1024)) > 0 && ++nsuccess)
        write(fd, buf, nread);
    close(fd);
    return nsuccess > 0 ? 0 : 2;
}

int
main(int argc, char **argv)
{
    int fd = open(argv[1], O_WRONLY | O_NONBLOCK);
    if (fd == -1)
        return 1;
    return write_to_pipe(fd);
}
Invocation:
$ echo hello world | ./a.out pipe
$ ret=$?
$ if [[ $ret == 1 ]]; then echo no reader
> elif [[ $ret == 2 ]]; then echo an error occurred trying to write to the pipe
> else echo success
> fi
Output with same conditions as before (1st call has no reader; 2nd call does):
no reader
success
Additionally, the text "hello world" can be seen in the terminal reading the pipe.
And finally, the problem is solved. I have a program which acts as a middle man between a writer and a pipe, which exits immediately with a failure code if no reader is attached to the pipe at the time of invocation, or if there is, attempts to write to the pipe and communicates failure if nothing is written.
That last part is new. I thought it might be useful in the future to know if nothing got written.
I'll probably add more error detection in the future, but since log checks for the existence of the pipe before trying to write to it, this is fine for now.
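As an aside, if a pure core-utility route is still preferred over the C helper, GNU dd can open the write end non-blocking, which should behave much the same way; an untested sketch, with a placeholder message:
printf '%s\n' "some log message" | dd oflag=nonblock of="$pipe" 2>/dev/null || echo "no reader"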
The issue is that you are starting the wallpaper process without checking if the previous run finished or not. So, in 52 days, potentially 4 * 24 * 52 = ~5000 instances could be running (not sure how you found 10000, though)! Is it possible to use flock to make sure there is only one instance of wallpaper running at a time?
See this post: Quick-and-dirty way to ensure only one instance of a shell script is running at a time
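A minimal flock sketch along those lines (the lock file path /tmp/wallpaper.lock is just a hypothetical placeholder), placed at the top of wallpaper:
exec 9> /tmp/wallpaper.lock
if ! flock -n 9; then
    echo "another wallpaper instance is still running" >&2
    exit 1
fi
# ... rest of wallpaper ...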

How can I tell if a script was run in the background and with nohup?

I've got a script that takes quite a long time to run, as it has to handle many thousands of files. I want to make this script as foolproof as possible. To this end, I want to check if the user ran the script using nohup and '&', e.g.
me#myHost:/home/me/bin $ nohup doAlotOfStuff.sh &
I want to make 100% sure the script was run with nohup and '&', because it's a very painful recovery process if the script dies in the middle for whatever reason.
How can I check those two key parameters inside the script itself? And if they are missing, how can I stop the script before it gets any further and complain to the user that they ran the script wrong? Better yet, is there a way I can force the script to run with nohup and '&'?
Edit: the server enviornment is AIX 7.1
The ps utility can report the process state. The process state code will contain the character + when the process is running in the foreground; the absence of + means it is running in the background.
However, it will be hard to tell whether the background script was invoked using nohup. It's also almost impossible to rely on the presence of nohup.out, as the output can be redirected elsewhere by the user at will.
There are two ways to accomplish what you want to do: either bail out and warn the user, or automatically restart the script in the background.
#!/bin/bash

mypid=$$
if [[ $(ps -o stat= -p "$mypid") =~ "+" ]]; then
    echo "Running in foreground."
    exec nohup "$0" "$@" &
    exit
fi

# the rest of the script
...
In this code, if the process has a state code containing +, it prints a message and then restarts itself in the background. If the process was started in the background, it just proceeds to the rest of the script.
If you prefer to bail out and just warn the user, you can remove the exec line. Note that the exit is not needed after exec; I left it there just in case you choose to remove the exec line.
One good way to find out whether a script is logging to nohup.out is to first check that the file exists, then echo a marker to stdout and ensure that you can read it back from the file. For example:
echo "complextag"
if ! grep -q "complextag" nohup.out 2>/dev/null; then
    # various commands complaining to the user, then exiting
    exit 1
fi
This works because if the script's stdout is going to nohup.out (or whatever output file you specified), which is where it should be going, then when you echo that phrase it should be appended to nohup.out. If it doesn't appear there, then the script was not run using nohup and you can scold the user, perhaps by using a wall command on a temporary broadcast file (if you want me to elaborate on that, I can).
As for being run in the background, if it's not running you should know by checking nohup.

shell: clean up leaked background processes which hang due to shared stdout/stderr

I need to run essentially arbitrary commands on a (remote) shell in ephemeral containers/VMs for a test execution engine. Sometimes these leak background processes which then cause the entire command to hang. This can be boiled down to this simple command:
$ sh -c 'sleep 30 & echo payload'
payload
$
Here the backgrounded sleep 30 plays the role of a leaked process (which in reality will be something like dbus-daemon) and the echo is the actual thing I want to run. The sleep 30 & echo payload should be considered as an atomic opaque example command here.
The above command is fine and returns immediately, since the shell's (and also sleep's) stdout/stderr are a PTY. However, when capturing the output of the command to a pipe or file (a test runner wants to save everything into a log, after all), the whole command hangs:
$ sh -c 'sleep 30 & echo payload' | cat
payload
# ... does not return to the shell (until the sleep finishes)
Now, this could be fixed with some rather ridiculously complicated shell magic which determines the FDs of stdout/err from /proc/$$/fd/{1,2}, iterating over ls /proc/[0-9]*/fd/* and killing every process which also has the same stdout/stderr. But this involves a lot of brittle shell code and expensive shell string comparisons.
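For concreteness, a rough sketch of that brittle approach (assuming Linux /proc, a readlink utility, and comparing stdout targets only):
out=$(readlink /proc/$$/fd/1)
for fd in /proc/[0-9]*/fd/1; do
    pid=${fd#/proc/}
    pid=${pid%%/*}
    [ "$pid" = "$$" ] && continue
    [ "$(readlink "$fd" 2>/dev/null)" = "$out" ] && kill "$pid" 2>/dev/null
done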
Is there a way to clean up these leaked background processes in a more elegant and simpler way? setsid does not help:
$ sh -c 'setsid -w sh -c "sleep 30 & echo payload"' | cat
payload
# hangs...
Note that process groups/sessions and killing them wholesale isn't sufficient as leaked processes (like dbus-daemon) often setsid themselves.
P.S. I can only assume POSIX shell or bash in these environments; no Python, Perl, etc.
Thank you in advance!
We had this problem with parallel tests in Launchpad. The simplest solution we had then - which worked well - was just to make sure that no processes share stdout/stdin/stderr (except ones where you actually want to hang if they haven't finished - e.g. the test workers themselves).
Hmm, having re-read this I cannot give you the solution you are after (use systemd to kill them). What we came up with is to simply ignore the processes but reliably not hang when the single process we were waiting for is done. Note that this is distinctly different from the pipes getting closed.
Another option, not perfect but useful, is to become a local reaper with prctl(2) and PR_SET_CHILD_SUBREAPER. This will allow you to be the parent of all the processes that would otherwise reparent to init. With this arrangement you could try to kill all the processes that have you as ppid. This is terrible, but it's the closest thing to using cgroups.
But note, that unless you are running this helper as root you will find that practical testing might spawn some setuid thing that will lurk and won't be killable. It's an annoying problem really.
Use script -qfc instead of sh -c.
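For example, something along these lines (a hedged sketch; script here is the util-linux command, and /dev/null discards the typescript file):
$ script -qfc 'sleep 30 & echo payload' /dev/null | cat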

How to make bash interpreter stop until a command is finished?

I have a bash script with a loop that calls a hard calculation routine every iteration. I use the results from every calculation as input to the next. I need to make bash stop reading the script until every calculation is finished.
for i in $(cat calculation-list.txt)
do
    ./calculation
    (other commands)
done
I know about the sleep program, and I used to use it, but now the time the calculations take varies greatly.
Thanks for any help you can give.
P.S. ./calculation is another program that opens a subprocess. The script then passes instantly to the next step, but I get an error in the calculation because the previous one is not finished yet.
If your calculation daemon will work with a precreated empty logfile, then the inotify-tools package might serve:
touch $logfile
inotifywait -qqe close $logfile & ipid=$!
./calculation
wait $ipid
(edit: stripped a stray semicolon)
if it closes the file just once.
If it's doing an open/write/close loop, perhaps you can mod the daemon process to wrap some other filesystem event around the execution?
#!/bin/sh
# Uglier, but handles logfile being closed multiple times before exit:
# Have the ./calculation start this shell script, perhaps by substituting
# this for the program it's starting
trap 'echo >closed-on-calculation-exit' 0 1 2 3 15
./real-calculation-daemon-program
Well, guys, I've solved my problem with a different approach. When the calculation is finished, a logfile is created. I then wrote a simple until loop with a sleep command. Although this is very ugly, it works for me and it's enough.
for i in $(cat calculation-list.txt)
do
    (calculations routine)
    until [[ -f $logfile ]]; do
        sleep 60
    done
    (other commands)
done
Easy. Get the process ID (PID) via some awk magic and then use wait to wait for that PID to end. Here are the details on wait from the Advanced Bash-Scripting Guide:
Suspend script execution until all jobs running in background have
terminated, or until the job number or process ID specified as an
option terminates. Returns the exit status of waited-for command.
You may use the wait command to prevent a script from exiting before a
background job finishes executing (this would create a dreaded orphan
process).
And using it within your code should work like this:
for i in $(cat calculation-list.txt)
do
    ./calculation >/dev/null 2>&1 &
    CALCULATION_PID=( $(jobs -l | awk '{print $2}') )
    wait ${CALCULATION_PID}
    (other commands)
done
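A simpler variant of the same idea, sketched here on the assumption that using $! (the PID of the most recent background command) is acceptable in place of the jobs/awk detour:
for i in $(cat calculation-list.txt)
do
    ./calculation >/dev/null 2>&1 &
    wait "$!"
    (other commands)
done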
