lazy (non-buffered) processing of shell pipeline - bash

I'm trying to figure out how to perform the laziest possible processing of a standard UNIX shell pipeline. For example, let's say I have a command which does some calculations and produces output along the way, but the calculations get more and more expensive, so that the first few lines of output arrive quickly but subsequent lines get slower and slower. If I'm only interested in the first few lines then I want to obtain those via lazy evaluation, terminating the calculations ASAP before they get too expensive.
This can be achieved with a straightforward shell pipeline, e.g.:
./expensive | head -n 2
However this does not work optimally. Let's simulate the calculations with a script which gets drastically slower with each line:
#!/bin/sh
i=1
while true; do
    echo line $i
    sleep $(( i ** 4 ))
    i=$(( i+1 ))
done
Now when I pipe this script through head -n 2, I observe the following:
line 1 is output.
After sleeping one second, line 2 is output.
Despite head -n 2 having already received two (\n-terminated) lines and exiting, expensive carries on running and now waits a further 16 seconds (2 ** 4) before completing, at which point the pipeline also completes.
Obviously this is not as lazy as desired, because ideally expensive would terminate as soon as head has received two lines. However, this does not happen; IIUC it actually terminates only after trying to write its third line, because at that point it tries to write to its STDOUT, which is connected through a pipe to the STDIN of the head process, which has already exited and is therefore no longer reading from the pipe. This causes expensive to receive a SIGPIPE, which causes the bash interpreter running the script to invoke its SIGPIPE handler, which by default terminates the script (although this can be changed via the trap command).
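(For illustration only, not a fix: here is a minimal sketch of such a trap-modified variant, using bash explicitly since the exact behaviour of the failed write differs between /bin/sh implementations.)
#!/bin/bash
# hypothetical variant of ./expensive that ignores SIGPIPE via trap
trap '' PIPE
i=1
while true; do
    echo line $i || echo >&2 "write to the closed pipe failed, but the loop carries on"
    sleep $(( i ** 4 ))
    i=$(( i+1 ))
done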
So the question is, how can I make it so that expensive quits immediately when head quits, not just when expensive tries to write its third line to a pipe which no longer has a listener at the other end? Since the pipeline is constructed and managed by the interactive shell process I typed the ./expensive | head -n 2 command into, presumably that interactive shell is the place where any solution for this problem would lie, rather than in any modification of expensive or head? Is there any native trick or extra utility which can construct pipelines with the behaviour I want? Or maybe it's simply impossible to achieve what I want in bash or zsh, and the only way would be to write my own pipeline manager (e.g. in Ruby or Python) which spots when the reader terminates and immediately terminates the writer?

If all you care about is foreground control, you can run expensive in a process substitution; it still blocks until it next tries to write, but head exits immediately (and your script's flow control can continue) after it's received its input:
head -n 2 < <(exec ./expensive)
# expensive still runs 16 seconds in the background, but doesn't block your program
In bash 4.4 and newer, these process substitutions store their PIDs in $! and allow process management in the same manner as other background processes.
# REQUIRES BASH 4.4 OR NEWER
exec {expensive_fd}< <(exec ./expensive); expensive_pid=$!
head -n 2 <&"$expensive_fd" # read the content we want
exec {expensive_fd}<&- # close the descriptor
kill "$expensive_pid" # and kill the process
Another approach is a coprocess, which has the advantage of only requiring bash 4.0:
# magic: store stdin and stdout FDs in an array named "expensive", and PID in expensive_PID
coproc expensive { exec ./expensive; }
# read two lines from input FD...
head -n 2 <&"${expensive[0]}"
# ...and kill the process.
kill "$expensive_PID"

I'll answer with a POSIX shell in mind.
What you can do is use a fifo instead of a pipe and kill the first link the moment the second finishes.
If the expensive process is a leaf process or if it takes care of killing its children, you can use a simple kill. If it's a process-spawning shell script, you should run it in a process group (doable with set -m) and kill it with a process-group kill.
Example code:
#!/bin/sh -e
expensive()
{
    i=1
    while true; do
        echo line $i
        sleep 0.$i  # sped it up a little
        echo >&2 slept
        i=$(( i+1 ))
    done
}
echo >&2 NORMAL
expensive | head -n2
#line 1
#slept
#line 2
#slept
echo >&2 SPED-UP
mkfifo pipe
exec 3<>pipe
rm pipe
set -m; expensive >&3 & set +m
<&3 head -n 2
kill -- -$!
#line 1
#slept
#line 2
If you run this, the second run should not print the second slept line, meaning the first link was killed the moment head finished, rather than when it next tried to write after head had exited.
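If this runs as part of a longer script rather than as a standalone demo, you would presumably also want to close the descriptor afterwards; a one-line sketch:
exec 3<&-   # close fd 3; since the fifo was already rm'd, this releases it entirely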

Related

pipe operator between two blocks

I've found an interesting bash script that with some modifications would likely solve my use case. But I'm unsure if I understand how it works, in particular the pipe between the blocks.
How do these two blocks work together, and what is the behaviour of the pipe that separates them?
function isTomcatUp {
    # Use FIFO pipeline to check catalina.out for server startup notification rather than
    # ping with an HTTP request. This was recommended by ForgeRock (Zoltan).
    FIFO=/tmp/notifytomcatfifo
    mkfifo "${FIFO}" || exit 1
    {
        # run tail in the background so that the shell can
        # kill tail when notified that grep has exited
        tail -f $CATALINA_HOME/logs/catalina.out &
        # remember tail's PID
        TAILPID=$!
        # wait for notification that grep has exited
        read foo <${FIFO}
        # grep has exited, time to go
        kill "${TAILPID}"
    } | {
        grep -m 1 "INFO: Server startup"
        # notify the first pipeline stage that grep is done
        echo >${FIFO}
    }
    # clean up
    rm "${FIFO}"
}
Code Source: https://www.manthanhd.com/2016/01/15/waiting-for-tomcat-to-start-up-in-a-script/
bash has a whole set of compound commands, which work much like simple commands. Most relevant here is that each compound command has its own standard input and standard output.
{ ... } is one such compound command. Each command inside the group inherits its standard input and output from the group, so the effect is that the standard output of a group is the concatenation of its children's standard output. Likewise, each command inside reads in turn from the group's standard input. In your example, nothing interesting happens, because grep consumes all of the standard input and no other command tries to read from it. But consider this example:
$ cat tmp.txt
foo
bar
$ { read a; read b; echo "$b then $a"; } < tmp.txt
bar then foo
The first read gets a single line from standard input, and the second read gets the second. Importantly, the first read consumes a line of input before the second read could see it. Contrast this with
$ read a < tmp.txt
$ read b < tmp.txt
where a and b will both contain foo, because each read command opens tmp.txt anew and both will read the first line.
The { …; } construct groups the commands such that the I/O redirections apply to all the commands within it. The { must be separate as if it were a command name; the } must be preceded by either a semicolon or a newline and be separate too. The commands are not executed in a sub-shell, unlike ( … ), which also has some syntactic differences.
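A quick way to see that difference (a minimal sketch):
$ x=0; { x=1; }; echo "$x"    # braces run in the current shell: the assignment survives
1
$ x=0; ( x=2 ); echo "$x"     # parentheses run in a subshell: the assignment is lost
0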
In your script, you have two such groupings connected by a pipe. Because of the pipe, each group runs in a sub-shell; it is the pipe, not the braces, that puts them there.
The first group runs tail -f on a file in the background, and then waits for something to be written to a FIFO so it can kill the tail -f. The second part looks for the first occurrence of some specific information and, when it finds it, stops reading and writes to the FIFO to free everything up.
As with any pipeline, the exit status is the status of the last group — which is likely to be 0 because the echo succeeds.
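If you do also care about the first group's status, bash records every stage's status in PIPESTATUS; a minimal sketch:
$ { false; } | { true; }
$ echo "${PIPESTATUS[@]}"   # one exit status per pipeline stage, left to right
1 0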

Cancel "tail -f" if the file hasn't changed for N seconds?

I am debugging an application that has a long running time and produces only a logfile as reliably complete output, so usually I use tail -f to monitor the output. Since it also requires some specific setup of the environment variables, I have wrapped the whole invocation, including tail -f LOGFILE &, into a bash script.
However, this creates a tail process that won't be terminated automatically and will remain running. Cleanup with trap leads to complicated code once there is more than a single cleanup task, and there is no obvious way to account for all the ways the script may be terminated.
Using the timeout command, I could limit tail -f to terminate after a fixed total time, but that will break cases where it is SUPPOSED to run longer.
So I was wondering if there is a way to limit tail -f such that it terminates if the followed file doesn't change for a specified amount of time.
Update: The script below worked for me when executed on its own, but in some instances the tail process would not terminate regardless. It isn't entirely clear whether tail -f detects that the process it is piping to has terminated.
Lacking a builtin solution, a stdout-based timeout can be produced in bash, exploiting the fact that tail will terminate if its stdout is closed.
# Usage: withTimeout TIMEOUT COMMAND [ARGS ...]
# Execute COMMAND. Terminate if it hasn't produced new output in TIMEOUT seconds.
# Depending on the platform, TIMEOUT may be fractional. See `help read`.
withTimeout () {
    local timeout="$1"; shift 1
    "$@" | while IFS= read -r -t "${timeout}" line || return 0; do
        printf "%s\n" "$line"
    done
}
withTimeout 2 tail -f LOGFILE &
Note that tail may resort to polling the file once per second if it cannot use inotify. If faster output is needed, the -s option can be supplied.
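For example (assuming GNU tail, where -s sets the polling interval):
# poll every 0.2 seconds instead of the 1-second default when inotify is unavailable
withTimeout 2 tail -s 0.2 -f LOGFILE &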

Bash redirect stdout and stderr to separate files with timestamps

Want to log all stdout and stderr to separate files and add a timestamp to each line.
Tried the following, which works but is missing timestamps.
#!/bin/bash
debug_file=./stdout.log
error_file=./stderr.log
exec > >(tee -a "$debug_file") 2> >(tee -a "$error_file")
echo "hello"
echo "hello world"
this-will-fail
and-so-will-this
Adding timestamps. (Only want timestamps prefixed to log output.)
#!/bin/bash
debug_file=./stdout.log
error_file=./stderr.log
log () {
    file=$1; shift
    while read -r line; do
        printf '%(%s)T %s\n' -1 "$line"
    done >> "$file"
}
exec > >(tee >(log "$debug_file")) 2> >(tee >(log "$error_file"))
echo "hello"
echo "hello world"
this-will-fail
and-so-will-this
The latter adds timestamps to the logs, but it also has a chance of messing up my terminal window. Reproducing this behavior was not straightforward; it only happened every now and then. I suspect it has to do with the subroutine/buffer still having output flowing through it.
Examples of the script messing up my terminal.
# expected/desired behavior
user#system:~ ./log_test
hello
hello world
./log_test: line x: this-will-fail: command not found
./log_test: line x: and-so-will-this: command not found
user#system:~ # <-- cursor blinks here
# erroneous behavior
user#system:~ ./log_test
hello
hello world
user#system:~ ./log_test: line x: this-will-fail: command not found
./log_test: line x: and-so-will-this: command not found
# <-- cursor blinks here
# erroneous behavior
user#system:~ ./log_test
hello
hello world
./log_test: line x: this-will-fail: command not found
user#system:~
./log_test: line x: and-so-will-this: command not found
# <-- cursor blinks here
# erroneous behavior
user#system:~ ./log_test
hello
hello world
user#system:~
./log_test: line x: this-will-fail: command not found
./log_test: line x: and-so-will-this: command not found
# <-- cursor blinks here
For fun I put a sleep 2 at the end of the script to see what would happen, and the problem never occurred again.
Hopefully someone knows the answer or can point me in the right direction.
Thanks
Edit
Judging from another question answered by Charles Duffy, what I'm trying to achieve is not really possible in bash.
Separately redirecting and recombining stderr/stdout without losing ordering
The trick is to make sure that tee, and the process substitution running your log function, exit before the script as a whole does -- so that when the shell that started the script prints its prompt, there isn't any backgrounded process that might write more output after it's done.
As a working example (albeit one focused more on being explicit than on terseness):
#!/usr/bin/env bash
stdout_log=stdout.log; stderr_log=stderr.log
log () {
    file=$1; shift
    while read -r line; do
        printf '%(%s)T %s\n' -1 "$line"
    done >> "$file"
}
# first, make backups of your original stdout and stderr
exec {stdout_orig_fd}>&1 {stderr_orig_fd}>&2
# for stdout: start your process substitution, record its PID, start tee, record *its* PID
exec {stdout_log_fd}> >(log "$stdout_log"); stdout_log_pid=$!
exec {stdout_tee_fd}> >(tee "/dev/fd/$stdout_log_fd"); stdout_tee_pid=$!
exec {stdout_log_fd}>&- # close stdout_log_fd so the log process can exit when tee does
# for stderr: likewise
exec {stderr_log_fd}> >(log "$stderr_log"); stderr_log_pid=$!
exec {stderr_tee_fd}> >(tee "/dev/fd/$stderr_log_fd" >&2); stderr_tee_pid=$!
exec {stderr_log_fd}>&- # close stderr_log_fd so the log process can exit when tee does
# now actually swap out stdout and stderr for the processes we started
exec 1>&$stdout_tee_fd 2>&$stderr_tee_fd {stdout_tee_fd}>&- {stderr_tee_fd}>&-
# ...do the things you want to log here...
echo "this goes to stdout"; echo "this goes to stderr" >&2
# now, replace the FDs going to tee with the backups...
exec >&"$stdout_orig_fd" 2>&"$stderr_orig_fd"
# ...and wait for the associated processes to exit.
while :; do
    ready_to_exit=1
    for pid_var in stderr_tee_pid stderr_log_pid stdout_tee_pid stdout_log_pid; do
        # kill -0 just checks whether a PID exists; it doesn't actually send a signal
        kill -0 "${!pid_var}" &>/dev/null && ready_to_exit=0
    done
    (( ready_to_exit )) && break
    sleep 0.1 # avoid a busy-loop eating unnecessary CPU by sleeping before next poll
done
So What's With The File Descriptor Manipulation?
A few key concepts to make sure we have clear:
All subshells have their own copies of the file descriptor table as created when they were fork()ed off from their parent. From that point forward, each file descriptor table is effectively independent.
A process reading from (the read end of) a FIFO (or pipe) won't see an EOF until all programs writing to (the write end of) that FIFO have closed their copies of the descriptor.
...so, if you create a FIFO pair, fork() off a child process, and let the child process write to the write end of the FIFO, whatever's reading from the read end will never see an EOF until not just the child, but also the parent, closes their copies.
Thus, the gymnastics you see here:
When we run exec {stdout_log_fd}>&-, we're closing the parent shell's copy of the FIFO writing to the log function for stdout, so the only remaining copy is the one used by the tee child process -- so that when tee exits, the subshell running log exits too.
When we run exec 1>&$stdout_tee_fd {stdout_tee_fd}>&-, we're doing two things: First, we make FD 1 a copy of the file descriptor whose number is stored in the variable stdout_tee_fd; second, we delete the stdout_tee_fd entry from the file descriptor table, so only the copy on FD 1 remains. This ensures that later, when we run exec >&"$stdout_orig_fd", we're deleting the last remaining write handle to the stdout tee function, causing tee to get an EOF on stdin (so it exits, thus closing the handle it holds on the log function's subshell and letting that subshell exit as well).
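To see that EOF rule in isolation, here's a minimal sketch (demo.fifo is just a throwaway name):
$ mkfifo demo.fifo
$ cat demo.fifo &       # reader: cannot see EOF while any write handle stays open
$ exec 3>demo.fifo      # the shell now holds a write handle
$ echo hello >&3        # data flows through, but cat keeps waiting for more
hello
$ exec 3>&-             # closing the last write handle delivers EOF; cat exits
$ rm demo.fifo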
Some Final Notes On Process Management
Unfortunately, how bash handles subshells created for process substitutions has changed substantially between still-actively-deployed releases; so while in theory it's possible to use wait "$pid" to let a process substitution exit and collect its exit status, this isn't always reliable -- hence the use of kill -0.
However, if wait "$pid" worked, it would be strongly preferable, because the wait() syscall is what removes a previously-exited process's entry from the process table: It is guaranteed that a PID will not be reused (and a zombie process-table entry is left as a placeholder) if no wait() or waitpid() invocation has taken place.
Modern operating systems try fairly hard to avoid short-term PID reuse, so wraparound is not an active concern in most scenarios. However, if you're worried about this, consider using the flock-based mechanism discussed in https://stackoverflow.com/a/31552333/14122 for waiting for your process substitutions to exit, instead of kill -0.

In bash, is it generally better to use process substitution or pipelines

In the use-case of having the output of a singular command being consumed by only one other, is it better to use | (pipelines) or <() (process substitution)?
Better is, of course, subjective. For my specific use case I am after performance as the primary driver, but also interested in robustness.
The while read do done < <(cmd) benefits I already know about and have switched over to.
I have several var=$(cmd1|cmd2) instances that I suspect might be better replaced as var=$(cmd2 < <(cmd1)).
I would like to know what specific benefits the latter case brings over the former.
tl;dr: Use pipes, unless you have a convincing reason not to.
Piping and redirecting stdin from a process substitution are essentially the same thing: both will result in two processes connected by an anonymous pipe.
There are three practical differences:
1. Bash defaults to creating a fork for every stage in a pipeline.
Which is why you started looking into this in the first place:
#!/bin/bash
cat "$1" | while IFS= read -r last; do true; done
echo "Last line of $1 is $last"
This script won't work by default with a pipeline, because unlike ksh and zsh, bash will fork a subshell for each stage, so the value of last set in the loop is lost when that subshell exits.
If you set shopt -s lastpipe in bash 4.2+, bash mimics the ksh and zsh behavior and works just fine.
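For example, a minimal sketch of that fix:
#!/bin/bash
# requires bash 4.2+; lastpipe takes effect here because scripts run without job control
shopt -s lastpipe
cat "$1" | while IFS= read -r last; do true; done
echo "Last line of $1 is $last"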
2. Bash does not wait for process substitutions to finish.
POSIX only requires a shell to wait for the last process in a pipeline, but most shells including bash will wait for all of them.
This makes a notable difference when you have a slow producer, like in a /dev/random password generator:
tr -cd 'a-zA-Z0-9' < /dev/random | head -c 10 # Slow?
head -c 10 < <(tr -cd 'a-zA-Z0-9' < /dev/random) # Fast?
The first example will not benchmark favorably. Once head is satisfied and exits, tr will wait around for its next write() call to discover that the pipe is broken.
Since bash waits for both head and tr to finish, it will seem slower.
In the procsub version, bash only waits for head, and lets tr finish in the background.
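You can reproduce the effect without /dev/random by using a deliberately slow producer (slow below is a made-up helper, purely for the sketch):
# a producer that emits one line, then dawdles before its next write
slow() { while true; do echo tick; sleep 5; done; }

time { slow | head -n 1; }      # ~5s: bash also waits for slow's next (failing) write
time { head -n 1 < <(slow); }   # ~0s: bash only waits for head
(The leftover slow from the second line still dies on its next write; it just doesn't hold up the shell.)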
3. Bash does not currently optimize away forks for single simple commands in process substitutions.
If you invoke an external command like sleep 1, then the Unix process model requires that bash forks and executes the command.
Since forks are expensive, bash optimizes the cases that it can. For example, the command:
bash -c 'sleep 1'
Would naively incur two forks: one to run bash, and one to run sleep. However, bash can optimize it because there's no need for bash to stay around after sleep finishes, so it can instead just replace itself with sleep (execve with no fork). This is very similar to tail call optimization.
( sleep 1 ) is similarly optimized, but <( sleep 1 ) is not. The source code does not offer a particular reason why, so it may just not have come up.
$ strace -f bash -c '/bin/true | /bin/true' 2>&1 | grep -c clone
2
$ strace -f bash -c '/bin/true < <(/bin/true)' 2>&1 | grep -c clone
3
Given the above, you can create a benchmark favoring whichever position you want, but since the number of forks is generally the more relevant factor (and, as the strace output shows, pipes fork less), pipes would be the best default.
And obviously, it doesn't hurt that pipes are the POSIX standard, canonical way of connecting the stdin/stdout of two processes, and work equally well on all platforms.

Bash function hangs once conditions are met

All,
I am trying to run a bash script that kicks off several subprocesses. The processes redirect to their own log files and I must kick them off in parallel. To do this I have written a check_procs function that monitors the number of processes using the same parent PID. Once the number reaches 1 again, the script should continue. However, it seems to just hang. I am not sure why, but the code is below:
check_procs() {
    while true; do
        mypid=$$
        backup_procs=`ps -eo ppid | grep -w $mypid | wc -w`
        until [ $backup_procs == 1 ]; do
            echo $backup_procs
            sleep 5
            backup_procs=`ps -eo ppid | grep -w $mypid | wc -w`
        done
    done
}
This function is called after the processes are kicked off, and I can see it echoing out the number of processes, but then the echoing stops (suggesting the count has reached 1 and the inner loop has completed). But then nothing happens, and I can see the script is still in the process list of the server. I have to kill it off manually. The part where the function is called is below:
for ((i=1; i <= $threads; i++)); do
    <Some trickery here to generate $cmdfile and $logfile>
    nohup rman target / cmdfile=$cmdfile log=$logfile &
    x=$(($x+1))
done
check_procs
$threads is a command line parameter passed to the script, and is a small number like 4 or 6. These are kicked off using nohup, as shown. When the loop condition in check_procs is satisfied, everything hangs instead of executing the remainder of the script. What's wrong with my function?
Maybe I'm mistaken, but isn't that expected? Your outer while true loop runs forever; there is no exit point. Once the inner until loop finishes, the outer loop just spins: unless the process count rises above 1 again, it loops infinitely (and without any delay, which is not recommended).
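One possible fix, sketched here while keeping the polling approach (since the jobs are started with &, a plain wait after the for loop would do the same job without polling ps at all):
check_procs() {
    # no outer "while true": once the child count drops to 1 we simply return
    local mypid=$$ backup_procs
    backup_procs=$(ps -eo ppid | grep -w "$mypid" | wc -w)
    until [ "$backup_procs" -le 1 ]; do
        echo "$backup_procs"
        sleep 5
        backup_procs=$(ps -eo ppid | grep -w "$mypid" | wc -w)
    done
}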
