Embarrassingly parallel workflow creates too many output files - bash

On a Linux cluster I run many (N > 10^6) independent computations. Each computation takes only a few minutes and the output is a handful of lines. When N was small I was able to store each result in a separate file to be parsed later. With large N however, I find that I am wasting storage space (for the file creation) and simple commands like ls require extra care due to internal limits of bash: -bash: /bin/ls: Argument list too long.
Each computation is required to run through a qsub scheduling algorithm, so I am unable to create a master program which simply aggregates the output data into a single file. The simple solution of appending to a single file fails when two programs finish at the same time and interleave their output. I have no admin access to the cluster, so installing a system-wide database is not an option.
How can I collate the output data from embarrassingly parallel computation before it gets unmanageable?

1) As you say, it's not ls itself which is failing; it's the shell, which expands the glob and then hits the kernel's limit on argument-list length when it tries to exec ls. You can work around that easily enough by using something like
find . -type f -name 'GLOB' | xargs UTILITY
eg.:
find . -type f -name '*.dat' | xargs ls -l
You might want to sort the output, since find (for efficiency) doesn't sort the filenames (usually). There are many other options to find (like setting directory recursion depth, filtering in more complicated ways, etc.) and to xargs (maximum number of arguments for each invocation, parallel execution, etc.). Read the man pages for details.
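For example, to get a sorted long listing that also copes with spaces or other odd characters in the filenames, something along these lines should work with GNU findutils and coreutils (the *.dat glob is just a placeholder):
find . -type f -name '*.dat' -print0 | sort -z | xargs -0 ls -l
The -print0 / -z / -0 trio keeps the filenames NUL-separated all the way through the pipeline, so nothing gets split on whitespace.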
2) I don't know how you are creating the individual files, so it's a bit hard to provide specific solutions, but here are a couple of ideas:
If you get to create the files yourself, and you can delay the file creation until the end of the job (say, by buffering output), and the files are stored on a filesystem which supports advisory locking or some other locking mechanism like atomic linking, then you can multiplex various jobs into a single file by locking it before spewing the output, and then unlocking. But that's a lot of requirements. In a cluster you might well be able to do that with a single file for all the jobs running on a single host, but then again you might not.
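If flock(1) from util-linux happens to be installed on the cluster and the filesystem honours advisory locks, the "lock, dump, unlock" idea could be sketched roughly like this (the file names and the my_computation command are placeholders):
./my_computation > "${TMPDIR:-/tmp}/job_$$.out"   # buffer the job's output locally first
{
    flock -x 9                                    # block until we hold the exclusive lock
    cat "${TMPDIR:-/tmp}/job_$$.out" >> results_all.txt
} 9>> results_all.txt.lock
rm -f "${TMPDIR:-/tmp}/job_$$.out"
The lock is held only for the brief append at the end of the job, so contention stays low even with many jobs finishing close together.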
Again, if you get to create the files yourself, you can atomically write each line to a shared file. (Even NFS supports atomic writes but it doesn't support atomic append, see below.) You'd need to prepend a unique job identifier to each line so that you can demultiplex it. However, this won't work if you're using some automatic mechanism such as "my job writes to stdout and then the scheduling framework copies it to a file", which is sadly common. (In essence, this suggestion is pretty similar to the MapReduce strategy. Maybe that's available to you?)
Failing everything else, maybe you can just use sub-directories. A few thousand directories of a thousand files each is a lot more manageable than a single directory with a few million files.
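For instance, if each job knows its own sequential number, a sketch like this keeps every directory down to about a thousand files (JOB_NUM is a hypothetical variable supplied by your submission script):
bucket=$(( JOB_NUM / 1000 ))        # ~1000 files per directory
mkdir -p "results/$bucket"
mv "job_${JOB_NUM}.dat" "results/$bucket/"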
Good luck.
Edit As requested, some more details on 2.2:
You need to use Posix I/O functions for this, because, afaik, the C library does not provide atomic write. In Posix, the write function always writes atomically, provided that you specify O_APPEND when you open the file. (Actually, it writes atomically in any case, but if you don't specify O_APPEND then each process retains its own position in the file, so they will end up overwriting each other.)
So what you need to do is:
At the beginning of the program, open a file with options O_WRONLY|O_CREAT|O_APPEND. (Contrary to what I said earlier, this is not guaranteed to work on NFS, because NFS may not handle O_APPEND properly. Newer versions of NFS could theoretically handle append-only files, but they probably don't. Some thoughts about this a bit later.) You probably don't want to always use the same file, so put a random number somewhere into its name so that your various jobs have a variety of alternatives. O_CREAT is always atomic, afaik, even with crappy NFS implementations.
For each output line, sprintf the line to an internal buffer, putting a unique id at the beginning. (Your job must have some sort of unique id; just use that.) [If you're paranoid, start the line with some kind of record separator, followed by the number of bytes in the remaining line -- you'll have to put this value in after formatting -- so the line will look something like ^0274:xx3A7B29992A04:<274 bytes>\n, where ^ is hex 01 or some such.]
write the entire line to the file with a single call. Check the return code and the number of bytes written. If the write fails, try again. If the write was short, hopefully you followed the "if you're paranoid" instructions above; in that case just write the line again, and the length prefix will let the demultiplexer skip the truncated fragment.
Really, you shouldn't get short writes, but you never know. Writing the length is pretty simple; demultiplexing is a bit more complicated, but you could cross that bridge when you need to :)
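If the jobs are themselves shell scripts, the open/format/write sequence above can be approximated directly in bash, since ">>" opens the file with O_APPEND. One assumption worth verifying for your shell and line lengths is that a single short printf ends up as one write(2) call; PBS_JOBID is just an example of a scheduler-provided unique id.
job_id=${PBS_JOBID:-$$}                 # unique id used to demultiplex lines later
out="results.$(( RANDOM % 8 )).log"     # spread the jobs over a handful of files

emit() {
    # ">>" opens with O_APPEND, so concurrent appends land at the end of the
    # file instead of overwriting each other; one printf per output line.
    printf '%s:%s\n' "$job_id" "$1" >> "$out"
}

emit "t=42 energy=-1.234"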
The problem with using NFS is a bit more annoying. As with 2.1, the simplest solution is to try to write the file locally, or use some cluster filesystem which properly supports append. (NFSv4 allows you to ask for only "append" permissions and not "write" permissions, which would cause the server to reject the write if some other process already managed to write to the offset you were about to use. In that case, you'd need to seek to the end of the file and try the write again, until eventually it succeeds. However, I have the impression that this feature is not actually implemented. I could be wrong.)
If the filesystem doesn't support append, you'll have another option: decide on a line length, and always write that number of bytes. (Obviously, it's easier if the selected fixed line length is longer than the longest possible line, but it's possible to write multiple fixed-length lines as long as they have a sequence number.) You'll need to guarantee that each job writes at different offsets, which you can do by dividing the job's job number into a file number and an interleave number, and writing all the lines for a particular job at its interleave modulo the number of interleaves, into a file whose name includes the file number. (This is easiest if the jobs are numbered sequentially.) It's OK to write beyond the end of the file, since unix filesystems will -- or at least, should -- either fill the gap with NULs or create sparse files (which waste less space, but depend on the blocksize of the file).
Another way to handle filesystems which don't support append but do support advisory byte-range locking (NFSv4 supports this) is to use the fixed-line-length idea, as above, but obtaining a lock on the range about to be written before writing it. Use a non-blocking lock, and if the lock cannot be obtained, try again at the next line-offset multiple. If the lock can be obtained, read the file at that offset to verify that it doesn't have data before writing it; then release the lock.
Hope that helps.

If you are only concerned by space:
parallel --header : --tag computation {foo} {bar} {baz} ::: foo 1 2 ::: bar I II ::: baz . .. | pbzip2 > out.bz2
or shorter:
parallel --tag computation ::: 1 2 ::: I II ::: . .. | pbzip2 > out.bz2
GNU Parallel ensures output is not mixed.
If you are concerned with finding a subset of the results, then look at --results.
Watch the intro videos to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Another possibility would be to use N files, with N greater than or equal to the number of nodes in the cluster, and assign the files to your computations in a round-robin fashion. This should avoid concurrent writes to any of the files, provided you have a reasonable guarantee on the order of execution of your computations.
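For instance, with sequentially numbered jobs, a sketch along these lines picks the output file by taking the job number modulo the number of files (JOB_NUM and my_computation are placeholders; the scheme relies on jobs that share a file never actually running at the same time):
N_FILES=64                                   # >= number of nodes in the cluster
out="results.$(( JOB_NUM % N_FILES )).txt"
./my_computation >> "$out"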

Related

Counting lines or enumerating line numbers so I can loop over them - why is this an anti-pattern?

I posted the following code and got scolded. Why is this not acceptable?
numberOfLines=$(wc -l <"$1")
for ((i=1; i<=numberOfLines; ++i)); do
    lineN=$(sed -n "$i!d;p;q" "$1")
    # ... do things with "$lineN"
done
We collect the number of lines in the input file into numberOfLines, then loop from 1 to that number, pulling out the next line from the file with sed in each iteration.
The feedback I received complained that reading the same file repeatedly with sed inside the loop to get the next line is inefficient. I guess I could use head -n "$i" "$1" | tail -n 1 but that's hardly more efficient, is it?
Is there a better way to do this? Why would I want to avoid this particular approach?
The shell (and basically every programming language above assembly language) already knows how to loop over the lines in a file; it does not need to know how many lines there will be in order to fetch the next one. Indeed, in your example, sed already does this, so if the shell couldn't do it, you could simply loop over the output from sed instead.
The proper way to loop over the lines in a file in the shell is with while read. There are a couple of complications — commonly, you reset IFS to avoid having the shell needlessly split the input into tokens, and you use read -r to avoid some pesky legacy behavior with backslashes in the original Bourne shell's implementation of read, which have been retained for backward compatibility.
while IFS='' read -r lineN; do
    # do things with "$lineN"
done <"$1"
Besides being much simpler than your sed construction, this avoids the problem that you read the entire file once to obtain the line count, then read the same file again and again in each loop iteration. With a typical modern OS, some of the repeated reading will be avoided thanks to caching (the disk driver keeps a buffer of recently accessed data in memory, so that reading it again will not actually require fetching it from the disk again), but the basic fact is still that reading from disk is orders of magnitude slower than reading from memory, and should be avoided when you can. Especially with a large file, the cache will fill up eventually, and so you end up reading in and discarding the same bytes over and over, adding a significant amount of CPU overhead and an even more significant amount of time spent simply waiting for the disk to deliver the bytes you read, again and again.
In a shell script, you also want to avoid the overhead of an external process if you can. Invoking sed (or the functionally equivalent but even more expensive two-process head -n "$i"| tail -n 1) thousands of times in a tight loop will add significant overhead for any non-trivial input file. On the other hand, if the body of your loop could be done in e.g. sed or Awk instead, that's going to be a lot more efficient than a native shell while read loop, because of the way read is implemented. This is why while read is also frequently regarded as an antipattern.
And make sure you are reasonably familiar with the standard palette of Unix text-processing tools - cut, paste, nl, pr, and so on.
In many, many cases you should avoid looping over the lines in a shell script and use an external tool instead. There is basically only one exception: when the body of the loop is itself mostly doing work with built-in shell commands.
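As a hypothetical illustration: if all the loop body did was sum a numeric column, a single Awk pass replaces the whole while read loop and finishes in a fraction of the time.
# one pass over the file instead of one external process per line
awk '{ sum += $2 } END { print sum }' "$1"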
The q in the sed script is a very partial remedy for repeatedly reading the input file; and frequently, you see variations where the sed script will read the entire input file through to the end each time, even if it only wants to fetch one of the very first lines out of the file.
With a small input file, the effects are negligible, but perpetuating this bad practice just because it's not immediately harmful when the input file is small is simply irresponsible. Just don't teach this technique to beginners. At all.
If you really need to display the number of lines in the input file, for a progress indicator or similar, at least make sure you don't spend a lot of time seeking through to the end just to obtain that number. Maybe stat the file and keep track of how many bytes there are on each line, so you can project the number of lines you have left (and instead of line 1/10345234 display something like line 1/approximately 10000000?) ... or use an external tool like pv.
Tangentially, there is a vaguely related antipattern you want to avoid, too; you don't want to read an entire file into memory when you are only going to process one line at a time. Doing that in a for loop also has some additional gotchas, so don't do that, either; see https://mywiki.wooledge.org/DontReadLinesWithFor
Another common variation is to find the line you want to modify with grep, only so you can find it with sed ... which already knows full well how to perform a regex search by itself. (See also useless use of grep.)
# XXX FIXME: wrong
line=$(grep "foo" file)
sed -i "s/$line/thing/" file
The correct way to do this would be to simply change the sed script to contain a search condition:
sed -i '/foo/s/.*/thing/' file
This also avoids the complications when the value of $line in the original, faulty script contains something which needs to be escaped in order to actually match itself. (For example, foo\bar* in a regular expression does not match the literal text itself.)

Makefile: store warning count into variable without using temp file

I would like to improve an existing Makefile, so it prints out the number of warnings and/or errors that were encountered during the build process.
My basic idea is that there must be a way to pipe the output to grep and have the number of occurrences of a certain string in either the stderr or stdout stream (e.g. "Warning:") stored into a variable that can then simply be echoed at the end of the make command.
Requirements / Challenges:
Current console output and exit code must remain exactly the same
That means also without even changing control characters. Devs using the Makefile must not notice any difference from what the output was prior to my change (except for a nice, additional warning-count output at the end of the make process). Any approaches with tee I tried so far were not successful, as the color coding of stderr messages in the console is lost, changing them to all black & white.
Must be system-independent
The project is currently being built by Win/OSX/Linux devs and thus needs to work with standard tools available out-of-the-box in most *nix / Cygwin shells. Introducing another dependency such as unbuffer is not an option.
It must be stable and free of side-effects (see also 5.)
If the make process is interrupted (e.g. by the user pressing CTRL+C, or for any other reason), there should be no side-effects (such as an orphaned log file of the output left behind on disk).
(Bonus) It should be efficient
The amount of output may reach 1 MB or more, so just piping to a file and grepping it would be a small performance hit, and there would also be additional disk I/O (thus unnecessarily slowing down the build). I'm simply wondering if this cannot be done without a temp file, as I understand pipes as a sort of "stream" that just needs to be analysed as it flows through.
(Bonus) Make it locale-independent w/o changing the current output
Depending on the current locale, the string to grep for and count is localized differently, e.g. "Warning:" (en_US.utf8) or "Warnung:" (de_DE.utf8). Surely I could switch the locale to en_US in the Makefile, but that would change the console output for users (hence breaking requirement 1), so I'd like to know if there's any (efficient) approach you could think of for this.
At the end of the day, I could make do with a solid solution that just fulfills requirements 1 and 2.
If 3 to 5 cannot be achieved, then I'd have to convince the project maintainers to accept some changes to .gitignore, have the build process take up slightly more time and resources, and/or fix the make output to English only, but I assume they will agree it would be worth it.
Current solution
The best i have so far is:
script -eqc "make" make.log && WARNING_COUNT=$(grep -i --count "Warning:" make.log) && rm make.log || rm make.log
That fulfills requirements 1 and 2, and almost no. 3: still, if the machine has a power outage while running the command, make.log will remain as an unwanted artifact. Also, the repetition of rm make.log looks ugly.
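One untested variation on the same approach would be to count inside the pipeline so that no log file is written at all; this assumes bash, util-linux script and an attached terminal, and make's exit code is not propagated by this sketch, so it would still have to be recovered somehow:
# script keeps make on a pty (colors survive), tee mirrors the output to the
# terminal, and grep counts warning lines inside the command substitution.
WARNING_COUNT=$(script -eqc "make" /dev/null | tee /dev/tty | grep -ci "warning:")
echo "Warnings: ${WARNING_COUNT:-0}"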
So I'm open to alternative approaches and improvements from anybody. Thanks in advance.

Emulating 'named' process substitutions

Let's say I have a big gzipped file data.txt.gz, but often the ungzipped version needs to be given to a program. Of course, instead of creating a standalone unpacked data.txt, one could use the process substitution syntax:
./program <(zcat data.txt.gz)
However, depending on the situation, this can be tiresome and error-prone.
Is there a way to emulate a named process substitution? That is, to create a pseudo-file data.txt that would 'unfold' into a process substitution zcat data.txt.gz whenever it is accessed. Not unlike a symbolic link forwards a read operation to another file, but, in this case, it needs to be a temporary named pipe.
Thanks.
PS. Somewhat similar question
Edit (from comments) The actual use-case is having a large gzipped corpus that, besides its usage in its raw form, also sometimes needs to be processed with a series of lightweight operations (tokenized, lowercased, etc.) and then fed to some "heavier" code. Storing a preprocessed copy wastes disk space, and repeatedly retyping the full preprocessing pipeline can introduce errors. At the same time, running the pipeline on the fly incurs only a tiny computational overhead, hence the idea of a long-lived pseudo-file that hides the details under the hood.
As far as I know, what you are describing does not exist, although it's an intriguing idea. It would require kernel support so that opening the file would actually run an arbitrary command or script instead.
Your best bet is to just save the long command to a shell function or script to reduce the difficulty of invoking the process substitution.
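A minimal sketch of that suggestion, reusing the names from the question (the tr stage just stands in for whatever lightweight preprocessing is actually needed):
corpus() { zcat data.txt.gz | tr '[:upper:]' '[:lower:]'; }
./program <(corpus)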
There's a spectrum of options, depending on what you need and how much effort you're willing to put in.
If you need a single-use file, you can just use mkfifo to create the file, start up a redirection of your archive into the fifo, and pass the fifo's filename to whoever needs to read from it.
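A one-shot sketch with the names from the question; the background zcat blocks until ./program opens the pipe for reading, then streams the decompressed data through it:
mkfifo data.txt
zcat data.txt.gz > data.txt &   # blocks until a reader opens the pipe
./program data.txt
rm data.txt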
If you need to repeatedly access the file (perhaps simultaneously), you can set up a socket using netcat that serves the decompressed file over and over.
With "traditional netcat" this is as simple as while true; do nc -l -p 1234 -c "zcat myfile.tar.gz"; done. With BSD netcat it's a little more annoying:
# Make a dummy FIFO
mkfifo foo
# Use the FIFO to track new connections
while true; do cat foo | zcat myfile.tar.gz | nc -l 127.0.0.1 1234 > foo; done
Anyway once the server (or file based domain socket) is up, you just do nc localhost 1234 to read the decompressed file. You can of course use nc localhost 1234 as part of a process substitution somewhere else.
Depending on your needs, you may want to make the bash script more sophisticated for caching etc, or just dump this thing and go for a regular web server in some scripting language you're comfortable with.
Finally, and this is probably the most "exotic" solution, you can write a FUSE filesystem that presents virtual files backed by whatever logic your heart desires. At this point you should probably have a good hard think about whether the maintainability and complexity costs of where you're going really offset someone having to call zcat a few extra times.

Locking output file for shell script invoked multiple times in parallel

I have close to a million files over which I want to run a shell script and append the result to a single file.
For example suppose I just want to run wc on the files.
So that it runs fast I can parallelize it with xargs. But I do not want the scripts to step over each other when writing the output. It is probably better to write to a few separate files rather than one, and then cat them later. But I still want the number of such temporary output files to be significantly smaller than the number of input files. Is there a way to get the kind of locking I want, or is that always ensured by default?
Is there any utility that will recursively cat two files in parallel?
I can write a script to do that, but I have to deal with the temporaries and clean up. So I was wondering if there is a utility which does that.
GNU parallel claims that it:
makes sure output from the commands is the same output as you would get had you run the commands sequentially
If that's the case, then I presume it should be safe to simply pipe the output to your file and let parallel handle the intermediate data.
Use the -k option to maintain the order of the output.
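For the wc example from the question, that could look roughly like this (assuming GNU find and GNU parallel; -0/--null keeps odd filenames intact, and -k keeps the output in input order):
find . -type f -print0 | parallel -0 -k wc >> all_counts.txt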
Update: (non-Perl solution)
Another alternative would be prll, which is implemented with shell functions and some C extensions. It is less feature-rich compared to GNU parallel but should do the job for basic use cases.
The feature listing claims:
Does internal buffering and locking to prevent mangling/interleaving of output from separate jobs.
so it should meet your needs, as long as the order of the output is not important.
However, note the following statement on that page:
prll generates a lot of status information on STDERR which makes it harder to use the STDERR output of the job directly as input for another program.
Disclaimer: I've tried neither of the tools and am merely quoting from their respective docs.

How can I speed up Perl's readdir for a directory with 250,000 files?

I am using Perl's readdir to get the file listing; however, the directory contains more than 250,000 files, and this results in readdir taking a long time (longer than 4 minutes) and using over 80 MB of RAM. As this is intended to be a recurring job every 5 minutes, this lag time will not be acceptable.
More info:
Another job (run once per day) will fill the directory being scanned.
This Perl script is responsible for processing the files. A file count is specified for each script iteration, currently 1000 per run.
The Perl script is to run every 5 min and process (if applicable) up to 1000 files.
The file count limit is intended to allow downstream processing to keep up, as Perl pushes data into a database which triggers a complex workflow.
Is there another way to obtain filenames from a directory, ideally limited to 1000 (set by a variable), which would greatly increase the speed of this script?
What exactly do you mean when you say readdir is taking minutes and 80 MB? Can you show that specific line of code? Are you using readdir in scalar or list context?
Are you doing something like this:
foreach my $file ( readdir($dir) ) {
    # do stuff here
}
If that's the case, you are reading the entire directory listing into memory. No wonder it takes a long time and a lot of memory.
The rest of this post assumes that this is the problem; if you are not using readdir in list context, ignore the rest of the post.
The fix for this is to use a while loop and use readdir in a scalar context.
while ( defined( my $file = readdir $dir ) ) {
    # do stuff.
}
Now you only read one item at a time. You can add a counter to keep track of how many files you process, too.
The solution may lie at the other end: in the script that fills the directory...
Why not create a directory hierarchy to store all those files, and that way have lots of directories, each with a manageable number of files?
Instead of creating "mynicefile.txt", why not "m/my/mynicefile", or something like that?
Your file system would thank you for that (especially if you remove the empty directories when you have finished with them).
This is not exactly an answer to your query, but I think having that many files in the same directory is not a very good thing for overall speed (including the speed at which your filesystem handles add and delete operations, not just listing as you have seen).
A solution to that design problem is to have sub-directories for each possible first letter of the file names, and have all files beginning with that letter inside that directory. Recurse to the second, third, etc. letter if need be.
You will probably see a definite speed improvement on many operations.
You're saying that the content gets there by unpacking zip file(s). Why don't you just work on the zip files instead of creating/using 250k of files in one directory?
Basically, to speed it up you don't need anything specific in Perl, but rather something at the filesystem level. If you are 100% sure that you have to work with 250k files in one directory (and I can't imagine a situation in which something like this would be required), you're much better off finding a better filesystem to handle it than finding some "magical" module in Perl that would scan it faster.
Probably not. I would guess most of the time is in reading the directory entry.
However, you could preprocess the entire directory listing, creating one file per 1000 entries. Then your process could work through one of those listing files each time and not incur the expense of reading the entire directory.
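A rough sketch of that pre-processing idea, assuming a GNU userland (ls -U skips sorting; the paths are placeholders): list the directory once, split the listing into 1000-entry chunks, and let each 5-minute run consume one chunk file.
ls -U /path/to/big_dir > /tmp/all_files.list
split -l 1000 /tmp/all_files.list /tmp/file_chunk.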
Have you tried just readdir() through the directory without any other processing at all to get a baseline?
You aren't going to be able to speed up readdir, but you can speed up the task of monitoring a directory. You can ask the OS for updates -- Linux has inotify, for example. Here's an article about using it:
http://www.ibm.com/developerworks/linux/library/l-ubuntu-inotify/index.html?ca=drs-
You can use Inotify from Perl:
http://metacpan.org/pod/Linux::Inotify2
The difference is that you will have one long-running app instead of a script that is started by cron. In the app, you'll keep a queue of files that are new (as provided by inotify). Then, you set a timer to go off every 5 minutes, and process 1000 items. After that, control returns to the event loop, and you either wake up in 5 minutes and process 1000 more items, or inotify sends you some more files to add to the queue.
(BTW, You will need an event loop to handle the timers; I recommend EV.)
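If a long-running Perl app is not an option, the same idea can be sketched at the shell level with inotifywait from the inotify-tools package (paths are placeholders): each newly created filename is appended to a queue file as it arrives, so nothing ever has to re-scan the whole directory.
# -m: stay in monitor mode; -e create: report only newly created files
inotifywait -m -e create --format '%f' /path/to/big_dir >> /tmp/new_files.queue &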
