Asynchronously consuming pipe with bash - shell

I have a bash script like this
data_generator_that_never_quits | while read data
do
    an_expensive_process_with "$data"
done
The first process continuously generates events (at irregular intervals) which need to be processed as they become available. A problem with this script is that read only consumes a single line of the output; and as the processing is very expensive, I want it to consume all the data that is currently available. On the other hand, the processing must start immediately if new data becomes available. In a nutshell, I want to do something like this
data_generator_that_never_quits | while read_all_available data
do
    an_expensive_process_with "$data"
done
where the command read_all_available waits if no data is available for consumption, or copies all the currently available data into the variable. It is perfectly fine if the data does not consist of full lines. Basically, I am looking for an analog of read which reads the entire pipe buffer instead of just a single line from the pipe.
For the curious among you, the background of the question is that I have a build script which needs to trigger a rebuild on a source file change. I want to avoid triggering rebuilds too often. Please do not suggest using grunt, gulp or other available build systems; they do not work well for my purpose.
Thanks!

I think I have found the solution after I gained a better insight into how subshells work. This script appears to do what I need:
data_generator_that_never_quits | while true
do
    # wait until the next element becomes available
    read LINE
    # consume any remaining elements; a small timeout ensures that
    # rapidly fired events are batched together
    while read -t 1 LINE; do true; done
    # the data buffer is empty, launch the process
    an_expensive_process
done
It would be possible to collect all the read lines into a single batch, but I don't really care about their contents at this point, so I didn't bother figuring that part out :)
Added on 25.09.2014
Here is the final subroutine, in case it could be useful to someone one day:
flushpipe() {
    # wait until the next line becomes available
    read -d "" buffer
    # consume any remaining elements; a small timeout ensures that
    # rapidly fired events are batched together
    while read -d "" -t 1 line; do buffer="$buffer"$'\n'"$line"; done
    echo "$buffer"
}
To be used like this:
data_generator_that_never_quits | while true
do
    # wait until data becomes available
    data=$(flushpipe)
    # the data buffer is empty, launch the process
    an_expensive_process_with "$data"
done

Something like read -N 4096 -t 1 might do the trick, or perhaps read -t 0 with additional logic. See the Bash reference manual for details. Otherwise, you might have to move from Bash to e.g. Perl.
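A rough sketch of that suggestion (assuming a reasonably recent bash; the 4096-character chunk size and the 1-second window are arbitrary): block until at least one character arrives, then keep draining whatever shows up within the timeout before launching the expensive step.
data_generator_that_never_quits | while true
do
    # block until at least one character of new data arrives
    IFS= read -r -N 1 first || break
    chunk=$first
    # drain anything else that arrives within one second; a timed-out read
    # still stores the partial input it collected in "more"
    while IFS= read -r -N 4096 -t 1 more; do
        chunk+=$more
    done
    chunk+=$more
    an_expensive_process_with "$chunk"
done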

Related

Counting lines or enumerating line numbers so I can loop over them - why is this an anti-pattern?

I posted the following code and got scolded. Why is this not acceptable?
numberOfLines=$(wc -l <"$1")
for ((i=1; i<=numberOfLines; i++)); do
    lineN=$(sed -n "$i!d;p;q" "$1")
    # ... do things with "$lineN"
done
We collect the number of lines in the input file into numberOfLines, then loop from 1 to that number, pulling out the next line from the file with sed in each iteration.
The feedback I received complained that reading the same file repeatedly with sed inside the loop to get the next line is inefficient. I guess I could use head -n "$i" "$1" | tail -n 1 but that's hardly more efficient, is it?
Is there a better way to do this? Why would I want to avoid this particular approach?
The shell (and basically every programming language which is above assembly language) already knows how to loop over the lines in a file; it does not need to know how many lines there will be to fetch the next one — strikingly, in your example, sed already does this, so if the shell couldn't do it, you could loop over the output from sed instead.
The proper way to loop over the lines in a file in the shell is with while read. There are a couple of complications — commonly, you reset IFS to avoid having the shell needlessly split the input into tokens, and you use read -r to avoid some pesky legacy behavior with backslashes in the original Bourne shell's implementation of read, which have been retained for backward compatibility.
while IFS='' read -r lineN; do
    # do things with "$lineN"
done <"$1"
Besides being much simpler than your sed script, this avoids the problem that you read the entire file once to obtain the line count, then read the same file again and again in each loop iteration. With a typical modern OS, some of the repeated reading will be avoided thanks to caching (the disk driver keeps a buffer of recently accessed data in memory, so that reading it again will not actually require fetching it from the disk again), but the basic fact remains that reading information from disk is on the order of 1000x slower than not doing it when you can avoid it. Especially with a large file, the cache will fill up eventually, so you end up reading in and discarding the same bytes over and over, adding a significant amount of CPU overhead and an even more significant amount of time spent simply waiting for the disk to deliver the bytes you read, again and again.
In a shell script, you also want to avoid the overhead of an external process if you can. Invoking sed (or the functionally equivalent but even more expensive two-process head -n "$i"| tail -n 1) thousands of times in a tight loop will add significant overhead for any non-trivial input file. On the other hand, if the body of your loop could be done in e.g. sed or Awk instead, that's going to be a lot more efficient than a native shell while read loop, because of the way read is implemented. This is why while read is also frequently regarded as an antipattern.
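For example, a task as trivial as printing each line prefixed with its line number (a made-up stand-in for a loop body) is a one-liner in Awk, with no per-line read or external-process overhead:
awk '{ print FNR ": " $0 }' "$1"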
And make sure you are reasonably familiar with the standard palette of Unix text processing tools: cut, paste, nl, pr, etc.
In many, many cases you should avoid looping over the lines in a shell script and use an external tool instead. There is basically only one exception to this: when the body of the loop itself makes significant use of built-in shell commands.
The q in the sed script is a very partial remedy for repeatedly reading the input file; and frequently, you see variations where the sed script will read the entire input file through to the end each time, even if it only wants to fetch one of the very first lines out of the file.
With a small input file, the effects are negligible, but perpetuating this bad practice just because it's not immediately harmful when the input file is small is simply irresponsible. Just don't teach this technique to beginners. At all.
If you really need to display the number of lines in the input file, for a progress indicator or similar, at least make sure you don't spend a lot of time seeking through to the end just to obtain that number. Maybe stat the file and keep track of how many bytes there are on each line, so you can project the number of lines you have left (and instead of line 1/10345234 display something like line 1/approximately 10000000?) ... or use an external tool like pv.
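For example, pv can meter the bytes it passes through; given a file name it knows the total size, so it can display a percentage without a separate counting pass:
# note: because of the pipe, the loop body now runs in a subshell
pv "$1" | while IFS='' read -r lineN; do
    : # do things with "$lineN"
done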
Tangentially, there is a vaguely related antipattern you want to avoid, too; you don't want to read an entire file into memory when you are only going to process one line at a time. Doing that in a for loop also has some additional gotchas, so don't do that, either; see https://mywiki.wooledge.org/DontReadLinesWithFor
Another common variation is to find the line you want to modify with grep, only so you can find it with sed ... which already knows full well how to perform a regex search by itself. (See also useless use of grep.)
# XXX FIXME: wrong
line=$(grep "foo" file)
sed -i "s/$line/thing/" file
The correct way to do this would be to simply change the sed script to contain a search condition:
sed -i '/foo/s/.*/thing/' file
This also avoids the complications when the value of $line in the original, faulty script contains something which needs to be escaped in order to actually match itself. (For example, foo\bar* in a regular expression does not match the literal text itself.)

Named and Unnamed Pipes

Ok, here's something that I cannot wrap my head around. I bumped into this while working on a rather complex script. I managed to simplify it to the bare minimum, but it still doesn't make sense.
Let's say I have a fifo:
mkfifo a.fifo
Running the command below in one terminal, and then writing things into the pipe (echo "abc" > a.fifo) in another, seems to work fine:
while true; do read LINE <a.fifo; echo "LINE=$LINE"; done
LINE=abc
However, if I change the command ever so slightly, the read command fails to wait for the next line after it has read the first one:
cat a.fifo | while true; do read LINE; echo "LINE=$LINE"; done
LINE=abc
LINE=
LINE=
LINE=
[...] # And this keeps repeating endlessly
The really disturbing part is that it waits for the first line, but then it just reads an empty string into $LINE and fails to block. (Funnily enough, this is one of the few times I want an I/O operation to block :))
I thought I really understood how I/O redirection and such things work, but now I am rather confused.
So, what's the solution? What am I missing? Can anyone explain this phenomenon?
UPDATE: For a short answer and a quick solution, see William's answer. For a more in-depth and complete insight, you'll want to go with rici's explanation!
Really, the two command lines in the question are very similar, if we eliminate the UUOC (useless use of cat):
while true; do read LINE <a.fifo; echo "LINE=$LINE"; done
and
while true; do read LINE; echo "LINE=$LINE"; done <a.fifo
They act in slightly different ways, but the important point is that neither of them is correct.
The first one opens and reads from the fifo and then closes the fifo every time through the loop. The second one opens the fifo, and then attempts to read from it every time through the loop.
A fifo is a slightly complicated state machine, and it's important to understand the various transitions.
Opening a fifo for reading or writing will block until some process has it open in the other direction. That makes it possible to start a reader and a writer independently; the open calls will return at the same time.
A read from a fifo succeeds if there is data in the fifo buffer. It blocks if there is no data in the fifo buffer but there is at least one writer which holds the fifo open. It returns EOF if there is no data in the fifo buffer and no writer.
A write to a fifo succeeds if there is space in the fifo buffer and there is at least one reader which has the fifo open. It blocks if there is no space in the fifo buffer, but at least one reader has the fifo open. And it triggers SIGPIPE (and then fails with EPIPE if that signal is being ignored) if there is no reader.
Once both ends of the fifo are closed, any data left in the fifo buffer is discarded.
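You can watch the open-blocks-until-there-is-a-peer rule from an interactive shell; a small demonstration (the fifo name here is just an example):
mkfifo demo.fifo
cat demo.fifo &            # the open for reading blocks until a writer shows up
echo hello > demo.fifo     # the open for writing unblocks both ends; the write then succeeds
wait                       # cat prints "hello", sees EOF when the writer closes, and exits
rm demo.fifo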
Now, based on that, let's consider the first scenario, where the fifo is redirected to the read. We have two processes:
   reader              writer
   --------------      --------------
1. OPEN blocks
2. OPEN succeeds       OPEN succeeds immediately
3. READ blocks
4.                     WRITE
5. READ succeeds
6. CLOSE               CLOSE
(The writer could equally well have started first, in which case it would block at line 1 instead of the reader. But the result is the same. The CLOSE operations at line 6 are not synchronized. See below.)
At line 6, the fifo no longer has readers nor writers, so its buffer is flushed. Consequently, if the writer had written two lines instead of one, the second line would be tossed into the bit bucket, before the loop continues.
Let's contrast that with the second scenario, in which the reader is the while loop and not just the read:
    reader               writer
    ---------            ---------
 1. OPEN blocks
 2. OPEN succeeds        OPEN succeeds immediately
 3. READ blocks
 4.                      WRITE
 5. READ succeeds
 6.                      CLOSE
    --loop--
 7. READ returns EOF
 8. READ returns EOF
    ... and again
42. and again            OPEN succeeds immediately
43. and again            WRITE
44. READ succeeds
Here, the reader will continue to read lines until it runs out. If no writer has appeared by then, the reader will start getting EOFs. If it ignores them (eg. while true; do read...), then it will get a lot of them, as indicated.
Finally, let's return for a moment to the first scenario, and consider the possibilities when both processes loop. In the description above, I assumed that both CLOSE operations would succeed before either OPEN operation was attempted. That would be the common case, but nothing guarantees it. Suppose instead that the writer succeeds in doing both a CLOSE and an OPEN before the reader manages to do its CLOSE. Now we have the sequence:
   reader              writer
   --------------      --------------
1. OPEN blocks
2. OPEN succeeds       OPEN succeeds immediately
3. READ blocks
4.                     WRITE
5.                     CLOSE
6. READ succeeds       OPEN
7. CLOSE
8.                     WRITE  !! SIGPIPE !!
In short, the first invocation will skip lines, and has a race condition in which the writer will occasionally receive a spurious error. The second invocation will read everything written, and the writer will be safe, but the reader will continuously receive EOF indications instead of blocking until data is available.
So what is the correct solution?
Aside from the race condition, the optimal strategy for the reader is to read until EOF, and then close and reopen the fifo. The second open will block if there is no writer. That can be achieved with a nested loop:
while :; do
    while read line; do
        echo "LINE=$line"
    done < fifo
done
Unfortunately, the race condition which generates SIGPIPE is still possible, although it is going to be extremely rare [See note 1]. All the same, a writer would have to be prepared for its write to fail.
A simpler and more robust solution is available on Linux, because Linux allows fifos to be opened for reading and writing. Such an open always succeeds immediately. And since there is always a process which holds the fifo open for writing, the reads will block, as expected:
while read line; do
    echo "LINE=$line"
done <> fifo
(Note that in bash, the "redirect both ways" operator <> still only redirects stdin -- or fd n in the n<> form -- so the above does not mean "redirect stdin and stdout to fifo".)
Notes
The fact that a race condition is extremely rare is not a reason to ignore it. Murphy's law states that it will happen at the most critical moment; for example, when the correct functioning was necessary in order to create a backup just before a critical file was corrupted. But in order to trigger the race condition, the writer process needs to arrange for its actions to happen in some extremely tight time bands:
   reader               writer
   --------------       --------------
   fifo is open         fifo is open
1. READ blocks
2.                      CLOSE
3. READ returns EOF
4.                      OPEN
5. CLOSE
6.                      WRITE  !! SIGPIPE !!
7. OPEN
In other words, the writer needs to perform its OPEN in the brief interval between the moment the reader receives an EOF and responds by closing the fifo. (That's the only way the writer's OPEN won't block.) And then it needs to do the write in the (different) brief interval between the moment that the reader closes the fifo, and the subsequent reopen. (The reopen wouldn't block because now the writer has the fifo open.)
That's one of those once in a hundred million race conditions that, as I said, only pops up at the most inopportune moment, possibly years after the code was written. But that doesn't mean you can ignore it. Make sure that the writer is prepared to handle SIGPIPE and retry a write which fails with EPIPE.
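In bash, a writer prepared for that could look roughly like this (a sketch; the fifo name and the payload variable are placeholders): ignore SIGPIPE so that a vanished reader surfaces as a failed write rather than killing the script, and retry.
trap '' PIPE                          # EPIPE now shows up as a write error instead of killing the shell
while ! echo "$payload" > fifo 2>/dev/null; do
    sleep 1                           # the reader was between its close and its reopen; try again
done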
When you do
cat a.fifo | while true; do read LINE; echo "LINE=$LINE"; done
which, incidentally, ought to be written:
while true; do read LINE; echo "LINE=$LINE"; done < a.fifo
that script will block until someone opens the fifo for writing. As soon as that happens, the while loop begins. If the writer (the echo "abc" > a.fifo you ran in another shell) terminates and there is no one else with the pipe open for writing, then the read returns because the pipe is empty and there are no processes that have the other end open. Try this:
in one shell:
while true; do date; read LINE; echo "LINE=$LINE"; done < a.fifo
in a second shell:
cat > a.fifo
in a third shell:
echo hello > a.fifo
echo world > a.fifo
By keeping the cat running in the second shell, the read in the while loop blocks instead of returning.
I guess the key insight is that when you do the redirection inside the loop, the shell does not start the read until someone opens the pipe for writing. When you do the redirection to the while loop, the shell only blocks before it starts the loop.

bash: wait for specific command output before continuing

I know there are several posts asking similar things, but none address the problem I'm having.
I'm working on a script that handles connections to different Bluetooth low energy devices, reads from some of their handles using gatttool and dynamically creates a .json file with those values.
The problem I'm having is that the gatttool commands take a while to execute (and are not always successful in connecting to the devices, due to "device busy" or similar messages). These "errors" not only translate into wrong data in the .json file, they also let later lines of the script keep writing to the file (e.g. adding an extra } or similar). An example of the commands I'm using would be the following:
sudo gatttool -l high -b <MAC_ADDRESS> --char-read -a <#handle>
How can I approach this in a way that I can wait for a certain output? In this case, the ideal output when you --char-read using gatttool would be:
Characteristic value/description: some_hexadecimal_data
This way I can make sure I am following the script line by line instead of having these "jumps".
grep allows you to filter the output of gatttool for the data you are looking for.
If you are actually looking for a way to wait until a specific output is encountered before continuing, expect might be what you are looking for.
From the manpage:
expect [[-opts] pat1 body1] ... [-opts] patn [bodyn]
waits until one of the patterns matches the output of a spawned
process, a specified time period has passed, or an end-of-file is
seen. If the final body is empty, it may be omitted.
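If you'd rather stay in plain bash, you can combine the two suggestions and simply retry until the expected line shows up before letting the script carry on writing the .json file (a sketch; $MAC and $HANDLE stand in for the question's placeholders):
# $MAC and $HANDLE are placeholders for your device's address and handle
until output=$(sudo gatttool -l high -b "$MAC" --char-read -a "$HANDLE" 2>&1) &&
      grep -q 'Characteristic value/description:' <<< "$output"
do
    sleep 2    # "device busy" or a failed connection; try again
done
value=${output#*Characteristic value/description: }   # the hexadecimal payload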

Emulating 'named' process substitutions

Let's say I have a big gzipped file data.txt.gz, but often the ungzipped version needs to be given to a program. Of course, instead of creating a standalone unpacked data.txt, one could use the process substitution syntax:
./program <(zcat data.txt.gz)
However, depending on the situation, this can be tiresome and error-prone.
Is there a way to emulate a named process substitution? That is, to create a pseudo-file data.txt that would 'unfold' into the process substitution zcat data.txt.gz whenever it is accessed, not unlike the way a symbolic link forwards a read operation to another file, except that here it would need to be a temporary named pipe.
Thanks.
PS. Somewhat similar question
Edit (from comments): The actual use case is a large gzipped corpus that, besides being used in its raw form, sometimes also needs to be processed with a series of lightweight operations (tokenized, lowercased, etc.) and then fed to some "heavier" code. Storing a preprocessed copy wastes disk space, and repeatedly retyping the full preprocessing pipeline can introduce errors. At the same time, running the pipeline on the fly incurs only a tiny computational overhead, hence the idea of a long-lived pseudo-file that hides the details under the hood.
As far as I know, what you are describing does not exist, although it's an intriguing idea. It would require kernel support so that opening the file would actually run an arbitrary command or script instead.
Your best bet is to just save the long command to a shell function or script to reduce the difficulty of invoking the process substitution.
There's a spectrum of options, depending on what you need and how much effort you're willing to put in.
If you need a single-use file, you can just use mkfifo to create the file, start up a redirection of your archive into the fifo, and pass the fifo's filename to whoever needs to read from it.
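For the single-use case, a minimal sketch (using the file names from the question):
mkfifo data.txt                   # the pseudo-file the consumer will read
zcat data.txt.gz > data.txt &     # this open blocks until something opens data.txt for reading
./program data.txt                # reads the decompressed stream through the fifo
rm data.txt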
If you need to repeatedly access the file (perhaps simultaneously), you can set up a socket using netcat that serves the decompressed file over and over.
With "traditional netcat" this is as simple as while true; do nc -l -p 1234 -c "zcat myfile.tar.gz"; done. With BSD netcat it's a little more annoying:
# Make a dummy FIFO
mkfifo foo
# Use the FIFO to track new connections
while true; do cat foo | zcat myfile.tar.gz | nc -l 127.0.0.1 1234 > foo; done
Anyway, once the server (or file-based domain socket) is up, you just do nc localhost 1234 to read the decompressed file. You can of course use nc localhost 1234 as part of a process substitution somewhere else.
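For instance, with the question's placeholder program name:
./program <(nc localhost 1234)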
Depending on your needs, you may want to make the bash script more sophisticated for caching etc, or just dump this thing and go for a regular web server in some scripting language you're comfortable with.
Finally, and this is probably the most "exotic" solution, you can write a FUSE filesystem that presents virtual files backed by whatever logic your heart desires. At this point you should probably have a good hard think about whether the maintainability and complexity costs of where you're going really offset someone having to call zcat a few extra times.

create a rolling buffer in bash

I want to use curl to get a stream from a remote server and write it to a buffer. So far so good: I just do curl http://the.stream > /path/to/thebuffer. The thing is, I don't want this file to get too large, so I want to be able to delete the first bytes of the file while I simultaneously append to the end. Is there a way of doing this?
Alternatively, I could write n bytes to buffer1, then switch to buffer2, buffer3, ... and when buffer x is reached, delete buffer1 and start again - without losing the data coming in from curl (it's a live stream, so I can't stop curl). I've been reading the man pages for curl, cat and read, but can't see anything promising.
There isn't any particularly easy way to do what you are seeking to do.
Probably the nearest approach is to create a FIFO and redirect the output of curl to it. You then have a program such as split or csplit read from the FIFO and write to different files. If you decide that the split programs are not the right tool, you may need to write your own variation on them. You can then decide how to process the files that are created, and when to remove them.
Note that curl will hang until there is a process reading from the FIFO. When the process reading the FIFO exits, curl will get either a SIGPIPE signal or a write error, either of which should stop it.
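A rough sketch of that approach (assuming GNU split, head and xargs; the chunk size and the number of chunks kept are arbitrary):
mkfifo stream.fifo
curl -s http://the.stream > stream.fifo &           # curl blocks until split opens the fifo
split -b 10M - /path/to/thebuffer. < stream.fifo &  # cut the stream into 10 MB pieces
# prune the oldest pieces every minute, keeping only the five most recent
while sleep 60; do
    ls /path/to/thebuffer.* 2>/dev/null | head -n -5 | xargs -r rm --
done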

Resources