Counting lines or enumerating line numbers so I can loop over them - why is this an anti-pattern? - bash

I posted the following code and got scolded. Why is this not acceptable?
numberOfLines=$(wc -l <"$1")
for ((i=1; i<=numberOfLines; ++i)); do
    lineN=$(sed -n "$i!d;p;q" "$1")
    # ... do things with "$lineN"
done
We collect the number of lines in the input file into numberOfLines, then loop from 1 to that number, pulling out the next line from the file with sed in each iteration.
The feedback I received complained that reading the same file repeatedly with sed inside the loop to get the next line is inefficient. I guess I could use head -n "$i" "$1" | tail -n 1 but that's hardly more efficient, is it?
Is there a better way to do this? Why would I want to avoid this particular approach?

The shell (and basically every programming language above assembly) already knows how to loop over the lines in a file; it does not need to know how many lines there will be in order to fetch the next one. Strikingly, in your example, sed itself already does exactly that, so if the shell couldn't do it, you could simply loop over the output from sed instead.
The proper way to loop over the lines in a file in the shell is with while read. There are a couple of complications: commonly, you set IFS to the empty string to avoid having the shell needlessly split the input into tokens, and you use read -r to avoid some pesky legacy behavior around backslashes in the original Bourne shell's implementation of read, which has been retained for backward compatibility.
while IFS='' read -r lineN; do
    # do things with "$lineN"
done <"$1"
Besides being much simpler than your sed script, this avoids the problem that you read the entire file once to obtain the line count, then read the same file again and again in each loop iteration. With a typical modern OS, some of the repeated reading will be avoided thanks to caching (the disk driver keeps a buffer of recently accessed data in memory, so that reading it again does not actually require fetching it from the disk again), but the basic fact remains that reading information from disk is on the order of 1000x slower than not doing it when you can avoid it. Especially with a large file, the cache will eventually fill up, and you end up reading in and discarding the same bytes over and over, adding significant CPU overhead and an even more significant amount of time spent simply waiting for the disk to deliver the bytes you have already read, again and again.
In a shell script, you also want to avoid the overhead of an external process if you can. Invoking sed (or the functionally equivalent but even more expensive two-process head -n "$i" | tail -n 1) thousands of times in a tight loop will add significant overhead for any non-trivial input file. On the other hand, if the body of your loop could be done in e.g. sed or Awk instead, that's going to be a lot more efficient than a native shell while read loop, because of the way read is implemented. This is why while read is also frequently regarded as an antipattern.
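For example, a loop body that merely transforms each line is better expressed as a single Awk (or sed) program. A small sketch, with a made-up transformation (prefixing each line with its length), to illustrate the contrast:

# slow: one round trip through the read built-in per input line
while IFS='' read -r lineN; do
    printf '%d %s\n' "${#lineN}" "$lineN"
done <"$1"

# fast: a single Awk process handles all the lines
awk '{ print length($0), $0 }' "$1"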
And make sure you are reasonably familiar with the standard palette of Unix text processing tools: cut, paste, nl, pr, etc.
In many, many cases you should avoid looping over the lines in a shell script altogether and use an external tool instead. There is basically only one exception: when the body of the loop consists mainly of built-in shell commands.
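As a sketch of that exception (the input file and its format are just placeholders here), a loop whose body uses nothing but shell built-ins pays no per-line process cost:

count=0
while IFS='' read -r lineN; do
    case $lineN in
        ''|'#'*) continue ;;                 # skip blank lines and comments
        *) entries+=("$lineN"); count=$((count + 1)) ;;
    esac
done <"$1"
echo "loaded $count entries"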
The q in the sed script is a very partial remedy for repeatedly reading the input file; and frequently, you see variations where the sed script will read the entire input file through to the end each time, even if it only wants to fetch one of the very first lines out of the file.
With a small input file, the effects are negligible, but perpetuating this bad practice just because it's not immediately harmful when the input file is small is simply irresponsible. Just don't teach this technique to beginners. At all.
If you really need to display the number of lines in the input file, for a progress indicator or similar, at least make sure you don't spend a lot of time seeking through to the end just to obtain that number. Maybe stat the file and keep track of how many bytes there are on each line, so you can project the number of lines you have left (and instead of line 1/10345234 display something like line 1/approximately 10000000?) ... or use an external tool like pv.
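A rough sketch of the pv option, assuming pv is installed: piping the file through pv gives you a byte-based progress bar without a separate counting pass.

pv "$1" | while IFS='' read -r lineN; do
    # ... do things with "$lineN"
    :
done

Note that the loop now runs in a subshell because of the pipe, so any variables you set inside it will not be visible after done.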
Tangentially, there is a vaguely related antipattern you want to avoid, too; you don't want to read an entire file into memory when you are only going to process one line at a time. Doing that in a for loop also has some additional gotchas, so don't do that, either; see https://mywiki.wooledge.org/DontReadLinesWithFor
Another common variation is to find the line you want to modify with grep, only so that you can then find it again with sed ... which already knows full well how to perform a regex search by itself. (See also useless use of grep.)
# XXX FIXME: wrong
line=$(grep "foo" file)
sed -i "s/$line/thing/" file
The correct way to do this would be to simply change the sed script to contain a search condition:
sed -i '/foo/s/.*/thing/' file
This also avoids the complications you get when the value of $line in the original, faulty script contains something which needs to be escaped in order to actually match itself. (For example, the text foo\bar* used as a regular expression does not match itself literally.)

Related

Asynchronously consuming pipe with bash

I have a bash script like this
data_generator_that_never_quits | while read data
do
    an_expensive_process_with "$data"
done
The first process continuously generates events (at irregular intervals) which need to be processed as they become available. A problem with this script is that read consumes only a single line of the output; and since the processing is very expensive, I want it to consume all the data that is currently available. On the other hand, the processing must start immediately if new data becomes available. In a nutshell, I want to do something like this
data_generator_that_never_quits | while read_all_available data
do
    an_expensive_process_with "$data"
done
where the command read_all_available waits if no data is available for consumption, or else copies all the currently available data into the variable. It is perfectly fine if the data does not consist of full lines. Basically, I am looking for an analog of read which would read the entire pipe buffer instead of reading just a single line from the pipe.
For the curious among you, the background of the question is that I have a build script which needs to trigger a rebuild on a source file change. I want to avoid triggering rebuilds too often. Please do not suggest grunt, gulp or other available build systems; they do not work well for my purpose.
Thanks!
I think I have found the solution after getting a better insight into how subshells work. This script appears to do what I need:
data_generator_that_never_quits | while true
do
    # wait until the next element becomes available
    read LINE
    # consume any remaining elements — a small timeout ensures that
    # rapidly fired events are batched together
    while read -t 1 LINE; do true; done
    # the data buffer is empty, launch the process
    an_expensive_process
done
It would be possible to collect all the read lines to a single batch, but I don't really care about their contents at this point, so I didn't bother figuring that part out :)
Added on 25.09.2014
Here is a final subroutine, in case it could be useful for someone one day:
flushpipe() {
    # wait until the next line becomes available
    read -d "" buffer
    # consume any remaining elements — a small timeout ensures that
    # rapidly fired events are batched together
    while read -d "" -t 1 line; do buffer="$buffer"$'\n'"$line"; done
    echo "$buffer"
}
To be used like this:
data_generator_that_never_quits | while true
do
    # wait until data becomes available
    data=$(flushpipe)
    # the data buffer is empty, launch the process
    an_expensive_process_with "$data"
done
Something like read -N 4096 -t 1 might do the trick, or perhaps read -t 0 with additional logic. See the Bash reference manual for details. Otherwise, you might have to move from Bash to e.g. Perl.
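A rough sketch along the read -t 0 line, untested against the OP's generator: block for one line, then drain whatever further lines are already waiting before starting the expensive step.

data_generator_that_never_quits | while IFS= read -r line; do
    batch=$line
    # read -t 0 reports whether input is pending without consuming any;
    # caveat: pending input is not necessarily a complete line yet
    while read -r -t 0 && IFS= read -r line; do
        batch+=$'\n'$line
    done
    an_expensive_process_with "$batch"
done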

Emulating 'named' process substitutions

Let's say I have a big gzipped file data.txt.gz, but often the ungzipped version needs to be given to a program. Of course, instead of creating a standalone unpacked data.txt, one could use the process substitution syntax:
./program <(zcat data.txt.gz)
However, depending on the situation, this can be tiresome and error-prone.
Is there a way to emulate a named process substitution? That is, to create a pseudo-file data.txt that would 'unfold' into the process substitution zcat data.txt.gz whenever it is accessed. Not unlike the way a symbolic link forwards a read operation to another file, except that in this case it would have to behave like a temporary named pipe.
Thanks.
PS. Somewhat similar question
Edit (from comments) The actual use-case is a large gzipped corpus that, besides being used in its raw form, sometimes also needs to be processed with a series of lightweight operations (tokenized, lowercased, etc.) and then fed to some "heavier" code. Storing a preprocessed copy wastes disk space, and repeatedly retyping the full preprocessing pipeline can introduce errors. At the same time, running the pipeline on the fly incurs only a tiny computational overhead, hence the idea of a long-lived pseudo-file that hides the details under the hood.
As far as I know, what you are describing does not exist, although it's an intriguing idea. It would require kernel support so that opening the file would actually run an arbitrary command or script instead.
Your best bet is to just save the long command to a shell function or script to reduce the difficulty of invoking the process substitution.
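A minimal sketch of that suggestion, where tokenize and lowercase are stand-ins for whatever the real preprocessing pipeline looks like:

corpus() {
    # hypothetical preprocessing stages; substitute the real pipeline
    zcat data.txt.gz | tokenize | lowercase
}

./program <(corpus)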
There's a spectrum of options, depending on what you need and how much effort you're willing to put in.
If you need a single-use file, you can just use mkfifo to create the file, start up a redirection of your archive into the fifo, and pass the fifo's filename to whoever needs to read from it.
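A minimal single-use sketch: the fifo looks like an ordinary file name to the consumer, but the decompressor and the reader have to be running at the same time for data to flow.

mkfifo data.txt
zcat data.txt.gz > data.txt &    # blocks until something opens the fifo for reading
./program data.txt
rm data.txt

This only works if the program reads the file once, front to back; a fifo cannot be rewound, and a second open would find no writer.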
If you need to repeatedly access the file (perhaps simultaneously), you can set up a socket using netcat that serves the decompressed file over and over.
With "traditional netcat" this is as simple as while true; do nc -l -p 1234 -c "zcat myfile.tar.gz"; done. With BSD netcat it's a little more annoying:
# Make a dummy FIFO
mkfifo foo
# Use the FIFO to track new connections
while true; do cat foo | zcat myfile.tar.gz | nc -l 127.0.0.1 1234 > foo; done
Anyway once the server (or file based domain socket) is up, you just do nc localhost 1234 to read the decompressed file. You can of course use nc localhost 1234 as part of a process substitution somewhere else.
Depending on your needs, you may want to make the bash script more sophisticated for caching etc, or just dump this thing and go for a regular web server in some scripting language you're comfortable with.
Finally, and this is probably the most "exotic" solution, you can write a FUSE filesystem that presents virtual files backed by whatever logic your heart desires. At this point you should have a good hard think about whether the maintainability and complexity costs of where you're going are really worth saving someone a few extra calls to zcat.

How to display command output in a whiptail textbox

The whiptail command has an option --textbox that has the following description:
--textbox <file> <height> <width>
The option takes a file as its first argument; I would like to use the output of a command in its place. It seems like this should be possible in sh or bash. For the sake of the question, let's say I'd like to view the output of ls -l in a whiptail textbox.
Note that process substitution does not appear to work in whiptail (e.g. whiptail --textbox <(ls -l) 40 80 does not work).
This question is a re-asking of this other stackoverflow question, which technically was answered.
For the record, the question says that
whiptail --textbox <(ls -l) 40 80
"doesn't work". It's most certainly worth stating clearly that the nature of the failure is that whiptail displays an empty textbox. (That's mentioned in a comment to an answer to the original question, linked in this question, but that's a pretty obscure place to look for a problem report.)
In 2014, this workaround was available (and was the original contents of this answer):
whiptail --textbox /dev/stdin 40 80 <<<"$(ls -l)"
That will still work in 2022, if ls -l produces enough output (at least 64k on a standard Linux/Bash install).
Another possible workaround is to use a msgbox instead of a textbox:
whiptail --msgbox "$(ls -l)" 40 80
However, that will fail if the output from the command is too large to use as a command-line argument, which might be the case at 128k.
So if you can guess reasonably accurately how big the output will be, one of those solutions will work. Up to around 100k, you can use the msgbox solution; beyond that, you can use the textbox with a here-string.
But that's far from ideal, since it's really hard to reliably guess the size of the output of a command, even within a factor of two.
What will always work is to save the output of the command to a temporary file, then provide the file to whiptail, and then delete the file. (In fact, you can delete the file immediately, since Posix systems don't delete files until there are no open file handles.) But no matter how hard you try, you will occasionally end up with a file which should have been deleted. On Linux, your best bet is to create temporary files in the /tmp directory, which is an in-memory filesystem which is emptied automatically on reboot.
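A sketch of that temporary-file approach, with a trap so the file is removed even if the script is interrupted:

tmpfile=$(mktemp) || exit 1
trap 'rm -f "$tmpfile"' EXIT
ls -l > "$tmpfile"
whiptail --textbox "$tmpfile" 40 80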
Why does this happen?
I came up with the solution above, eight years prior to this edit, on the assumption that OP was correct in their guess that the problem had to do with not being able to seek() a process substituted fd. Indeed, it's true that you can't seek() /dev/fd/63. It was also true at the time that bash implemented here-strings and here-docs by creating a temporary file to hold the expanded text, and then redirecting standard input (or whatever fd was specified) to that temporary file. So the suggested workaround did work; it ensured that /dev/stdin was just a regular file.
But in 2022, the same question was asked, this time because the suggested workaround failed. As it turns out, the reason is that Bash v5.1, which was released in late 2020, attempts to optimise small here-strings and here-docs:
c. Here documents and here strings now use pipes for the expanded document if it's smaller than the pipe buffer size, reverting to temporary files if it's larger.
(from the Bash CHANGES file; changes between bash 5.1alpha and bash 5.0, in section 3, New features in Bash.)
So with the new Bash version, unless the here-string is bigger than a pipe buffer (on Linux, 16 pages), it will no longer be a regular file.
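On a Linux box with Bash 5.1 or later you can observe the difference yourself (the outputs in the comments are illustrative, not verbatim):

readlink /proc/self/fd/0 <<< 'hello'
# pipe:[1234567]                  <- small here-string: now a pipe
readlink /proc/self/fd/0 <<< "$(seq 1 20000)"
# /tmp/sh-thd.XXXXXX (deleted)    <- larger than the pipe buffer: still a temp file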
One slightly confusing aspect of this issue is that whiptail does not, in fact, try to call lseek() on the textbox file. So the initial guess about the nature of the problem was not exact. That's not all that surprising, since using lseek() on a FIFO to position the stream at SEEK_END produces an explicit error, and it's reasonable to expect software to actually report error returns. But whiptail does not attempt to get the filesize by seeking to the end. Instead, it fstat()s the file and gets the file size from the st_size field in the returned struct stat. It then allocates exactly enough memory to hold the contents of the file, and reads the indicated number of bytes.
Of course, fstat cannot report the size of a FIFO, since that's not known until the FIFO is completely drained. But unlike lseek, that's not considered an error. fstat is documented as not filling in the st_size field on FIFOs, sockets, character devices, and other stream types. As it happens, on Linux the st_size field is filled in as 0 for such file descriptors, but Posix actually allows it to be unset. In any case, there is no error indication, and it's essentially impossible to distinguish between a stream which doesn't have a known size and a stream which is known to have size 0, like an empty file. Thus, whiptail treats a FIFO as though it were an empty file, reading 0 bytes and presenting an empty textbox.
What about dialog?
One alternative to whiptail is Dialog, currently maintained by Thomas Dickey. You can often directly substitute dialog for whiptail, and it has some additional widgets which can be useful. However, it does not provide a simple solution in this case.
Unlike whiptail, dialog's textbox attempts to avoid reading the entire file into memory before drawing the widget. As a result, it does depend on lseek() in order to read out of order, and thus cannot work on special files at all. Attempting to use a FIFO with dialog produces an error message, rather than drawing an empty textbox; that makes diagnosis easier but doesn't really solve the underlying problem. Dialog does have a variety of widgets which can read from stdin, but as far as I know none of them allow scrolling, so they're only useful if the command output fits in a single window. But it's possible that I've missed something.
Drawing out a moral: just read until you reach the end
(Skip this section if you're only interested in using command-line utilities, not writing them.)
The tragic part of this complicated tale is that it was all completely unnecessary. Whiptail is going to read the entire file in any case; trying to get the size of the file before reading was either laziness or a misguided attempt to optimise. Had the code just read until an end of file indication, possibly allocating more memory as needed, all these problems would have been avoided. And not just these problems. As it happens, Posix does not guarantee that the st_size field is accurate even for apparently normal files. Stat is only required to report accurate sizes for symlinks (the length of the link itself) and shared memory objects. The Linux documentation indicates that st_size will be returned as 0 for certain automatically-generated files:
For example, the value 0 is returned for many files under the /proc directory, while various files under /sys report a size of 4096 bytes, even though the file content is smaller. For such files, one should simply try to read as many bytes as possible (and append '\0' to the returned buffer if it is to be interpreted as a string). (from man 7 inode).
lseek will also fail on many autogenerated files, as well as sockets, FIFOs and character devices. So the more common idiom for this particular optimization is also unreliable, as well as leading to TOCTOU-like race conditions when the file is truncated or appended to while it is being read.

Embarrassingly parallel workflow creates too many output files

On a Linux cluster I run many (N > 10^6) independent computations. Each computation takes only a few minutes and the output is a handful of lines. When N was small I was able to store each result in a separate file to be parsed later. With large N however, I find that I am wasting storage space (for the file creation) and simple commands like ls require extra care due to internal limits of bash: -bash: /bin/ls: Argument list too long.
Each computation is required to run through a qsub scheduling algorithm, so I am unable to create a master program which simply aggregates the output data to a single file. The simple solution of appending to a single file fails when two programs finish at the same time and interleave their output. I have no admin access to the cluster, so installing a system-wide database is not an option.
How can I collate the output data from embarrassingly parallel computation before it gets unmanageable?
1) As you say, it's not ls which is failing; it's the shell which does glob expansion before starting up ls. You can fix that problem easily enough by using something like
find . -type f -name 'GLOB' | xargs UTILITY
eg.:
find . -type f -name '*.dat' | xargs ls -l
You might want to sort the output, since find (for efficiency) doesn't sort the filenames (usually). There are many other options to find (like setting directory recursion depth, filtering in more complicated ways, etc.) and to xargs (maximum number of arguments for each invocation, parallel execution, etc.). Read the man pages for details.
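One caveat worth adding: if the filenames can contain whitespace or quotes, prefer the NUL-separated variants (supported by GNU find and xargs):

find . -type f -name '*.dat' -print0 | xargs -0 ls -l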
2) I don't know how you are creating the individual files, so it's a bit hard to provide specific solutions, but here are a couple of ideas:
If you get to create the files yourself, and you can delay the file creation until the end of the job (say, by buffering output), and the files are stored on a filesystem which supports advisory locking or some other locking mechanism like atomic linking, then you can multiplex various jobs into a single file by locking it before spewing the output, and then unlocking. But that's a lot of requirements. In a cluster you might well be able to do that with a single file for all the jobs running on a single host, but then again you might not.
Again, if you get to create the files yourself, you can atomically write each line to a shared file. (Even NFS supports atomic writes; what it doesn't support is atomic append, see below.) You'd need to prepend a unique job identifier to each line so that you can demultiplex it later. However, this won't work if you're using some automatic mechanism such as "my job writes to stdout and then the scheduling framework copies it to a file", which is sadly common. (In essence, this suggestion is pretty similar to the MapReduce strategy. Maybe that's available to you?)
Failing everything else, maybe you can just use sub-directories. A few thousand directories of a thousand files each is a lot more manageable than a single directory with a few million files.
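A sketch of that last idea, assuming each job carries a numeric id and my_computation stands in for the real job:

job=$1
shard=$(( job % 1000 ))                      # 1000 sub-directories, ~N/1000 files each
mkdir -p "results/$shard"
my_computation "$job" > "results/$shard/$job.out"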
Good luck.
Edit As requested, some more details on 2.2:
You need to use Posix I/O functions for this, because, afaik, the C library does not provide atomic write. In Posix, the write function always writes atomically, provided that you specify O_APPEND when you open the file. (Actually, it writes atomically in any case, but if you don't specify O_APPEND then each process retains its own position in the file, so they will end up overwriting each other.)
So what you need to do is:
At the beginning of the program, open a file with options O_WRONLY|O_CREAT|O_APPEND. (Contrary to what I said earlier, this is not guaranteed to work on NFS, because NFS may not handle O_APPEND properly. Newer versions of NFS could theoretically handle append-only files, but they probably don't. Some thoughts about this a bit later.) You probably don't want to always use the same file, so put a random number somewhere into its name so that your various jobs have a variety of alternatives. O_CREAT is always atomic, afaik, even with crappy NFS implementations.
For each output line, sprintf the line to an internal buffer, putting a unique id at the beginning. (Your job must have some sort of unique id; just use that.) [If you're paranoid, start the line with some kind of record separator, followed by the number of bytes in the remaining line -- you'll have to put this value in after formatting -- so the line will look something like ^0274:xx3A7B29992A04:<274 bytes>\n, where ^ is hex 01 or some such.]
write the entire line to the file. Check the return code and the number of bytes written. If the write fails, try again. If the write was short, hopefully you followed the "if you're paranoid" instructions above; again, just try again.
Really, you shouldn't get short writes, but you never know. Writing the length is pretty simple; demultiplexing is a bit more complicated, but you could cross that bridge when you need to :)
The problem with using NFS is a bit more annoying. As with 2.1, the simplest solution is to try to write the file locally, or use some cluster filesystem which properly supports append. (NFSv4 allows you to ask for only "append" permissions and not "write" permissions, which would cause the server to reject the write if some other process already managed to write to the offset you were about to use. In that case, you'd need to seek to the end of the file and try the write again, until eventually it succeeds. However, I have the impression that this feature is not actually implemented. I could be wrong.)
If the filesystem doesn't support append, you'll have another option: decide on a line length, and always write that number of bytes. (Obviously, it's easier if the selected fixed line length is longer than the longest possible line, but it's possible to write multiple fixed-length lines as long as they have a sequence number.) You'll need to guarantee that each job writes at different offsets, which you can do by dividing the job's job number into a file number and an interleave number, and write all the lines for a particular job at its interleave modulo the number of interleaves, into a file whose name includes the file number. (This is easiest if the jobs are numbered sequentially.) It's OK to write beyond the end of the file, since unix filesystems will -- or at least, should -- either insert NULs or create discontiguous files (which waste less space, but depend on the blocksize of the file).
Another way to handle filesystems which don't support append but do support advisory byte-range locking (NFSv4 supports this) is to use the fixed-line-length idea, as above, but obtaining a lock on the range about to be written before writing it. Use a non-blocking lock, and if the lock cannot be obtained, try again at the next line-offset multiple. If the lock can be obtained, read the file at that offset to verify that it doesn't have data before writing it; then release the lock.
Hope that helps.
If you are only concerned by space:
parallel --header : --tag computation {foo} {bar} {baz} ::: foo 1 2 ::: bar I II ::: baz . .. | pbzip2 > out.bz2
or shorter:
parallel --tag computation ::: 1 2 ::: I II ::: . .. | pbzip2 > out.bz2
GNU Parallel ensures output is not mixed.
If you are concerned with finding a subset of the results, then look at --results.
Watch the intro videos to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Another possibility would be to use N files, with N greater than or equal to the number of nodes in the cluster, and assign the files to your computations in a round-robin fashion. This should avoid concurrent writes to any of the files, provided you have a reasonable guarantee on the order of execution of your computations.

Fastest Way to Parse a Large File in Ruby

I have a simple text file that is ~150mb. My code will read each line, and if it matches certain regexes, it gets written to an output file.
But right now, it just takes a long time to iterate through all of the lines of the file (several minutes) doing it like
File.open(filename).each do |line|
  # do some stuff
end
I know that it is the looping through the lines of the file that is taking a while because even if I do nothing with the data in "#do some stuff", it still takes a long time.
I know that some unix programs can parse large files like this almost instantly (like grep), so I am wondering why ruby (MRI 1.9) takes so long to read the file, and is there some way to make it faster?
It's not really fair to compare to grep, because that is a highly tuned utility that only scans the data; it doesn't store any of it. When you're reading that file using Ruby you end up allocating memory for each line, then releasing it during the garbage collection cycle. grep is a pretty lean and mean regexp processing machine.
You may find that you can achieve the speed you want by using an external program like grep called using system or through the pipe facility:
`grep ABC bigfile`.split(/\n/).each do |line|
  # ... (called on each matching line) ...
end
File.readlines(filename).each do |line|
  # do stuff with each line
end
This will read the whole file into one array of lines. It should be a lot faster, but it takes more memory.
You should read it into memory and then parse it. Of course, it depends on what you are looking for. Don't expect miracle performance from Ruby, especially compared to C/C++ programs which have been optimized for the past 30 years ;-)

Resources