I asked the other day if data integrity (of flushed data) is kept
even when there are more than one PIPEs streaming into
localhost's STDIN. The answer is NO if the data flushed is large.
Data integrity question when collecting STDOUTs from multiple remote hosts over SSH
But I would like to guarantee every line flushed on each end is
passed to the single STDIN in full and won't be mixed up with
data from other pipes. Is there any way to do so? How can that be done?
(Note that it can be done if I create multiple STDINs locally.
But it is more convenient if I can process line streams through a
single STDIN. So my question focuses on the case when there is
only one STDIN at localhost with multiple (STDOUT) PIPEs into it.)
This can be done via a congestion-backoff system like that used in Ethernet.
First, assign each pipe a unique delimiter. This delimiter cannot appear unescaped in the contents of any pipe. Now, use the following pseudocode:
Check for other process' delimiter; while an odd number of a single other process' delimiters is present, wait.
Write delimiter character.
Check if another process has also written an unmatched delimiter. If so, back off a random (increasing) amount and return to first step.
Write data.
Write delimiter character.
This will ensure that, although you will have some junk, every whole message will eventually get through.
Related
I posted the following code and got scolded. Why is this not acceptable?
numberOfLines=$(wc -l <"$1")
for ((i=1; $i<=$numberOfLines; ++$i)); do
lineN=$(sed -n "$i!d;p;q" "$1")
# ... do things with "$lineN"
done
We collect the number of lines in the input file into numberOfLines, then loop from 1 to that number, pulling out the next line from the file with sed in each iteration.
The feedback I received complained that reading the same file repeatedly with sed inside the loop to get the next line is inefficient. I guess I could use head -n "$i" "$1" | tail -n 1 but that's hardly more efficient, is it?
Is there a better way to do this? Why would I want to avoid this particular approach?
The shell (and basically every programming language which is above assembly language) already knows how to loop over the lines in a file; it does not need to know how many lines there will be to fetch the next one — strikingly, in your example, sed already does this, so if the shell couldn't do it, you could loop over the output from sed instead.
The proper way to loop over the lines in a file in the shell is with while read. There are a couple of complications — commonly, you reset IFS to avoid having the shell needlessly split the input into tokens, and you use read -r to avoid some pesky legacy behavior with backslashes in the original Bourne shell's implementation of read, which have been retained for backward compatibility.
while IFS='' read -r lineN; do
# do things with "$lineN"
done <"$1"
Besides being much simpler than your sed script, this avoids the problem that you read the entire file once to obtain the line count, then read the same file again and again in each loop iteration. With a typical modern OS, some repeated reading will be avoided thanks to caching (the disk driver keeps a buffer of recently accessed data in memory, so that reading it again will not actually require fetching it from the disk again), but the basic fact is still that reading information from disk is on the order of 1000x slower than not doing it when you can avoid it. Especially with a large file, the cache will fill up eventually, and so you end up reading in and discarding the same bytes over and over, adding a significant amount of CPU overhead and an even more significant amount of the CPU simply doing something else while waiting for the disk to deliver the bytes you read, again and again.
In a shell script, you also want to avoid the overhead of an external process if you can. Invoking sed (or the functionally equivalent but even more expensive two-process head -n "$i"| tail -n 1) thousands of times in a tight loop will add significant overhead for any non-trivial input file. On the other hand, if the body of your loop could be done in e.g. sed or Awk instead, that's going to be a lot more efficient than a native shell while read loop, because of the way read is implemented. This is why while read is also frequently regarded as an antipattern.
And make sure you are reasonably familiar with the standard palette of Unix text processing tools - cut, paste, nl, pr, etc etc.
In many, many cases you should avoid looping over the lines in a shell script and use an external tool instead. There is basically only one exception to this; when the body of the loop is also significantly using built-in shell commands.
The q in the sed script is a very partial remedy for repeatedly reading the input file; and frequently, you see variations where the sed script will read the entire input file through to the end each time, even if it only wants to fetch one of the very first lines out of the file.
With a small input file, the effects are negligible, but perpetuating this bad practice just because it's not immediately harmful when the input file is small is simply irresponsible. Just don't teach this technique to beginners. At all.
If you really need to display the number of lines in the input file, for a progress indicator or similar, at least make sure you don't spend a lot of time seeking through to the end just to obtain that number. Maybe stat the file and keep track of how many bytes there are on each line, so you can project the number of lines you have left (and instead of line 1/10345234 display something like line 1/approximately 10000000?) ... or use an external tool like pv.
Tangentially, there is a vaguely related antipattern you want to avoid, too; you don't want to read an entire file into memory when you are only going to process one line at a time. Doing that in a for loop also has some additional gotchas, so don't do that, either; see https://mywiki.wooledge.org/DontReadLinesWithFor
Another common variation is to find the line you want to modify with grep, only so you can find it with sed ... which already knows full well how to perform a regex search by itself. (See also useless use of grep.)
# XXX FIXME: wrong
line=$(grep "foo" file)
sed -i "s/$line/thing/" file
The correct way to do this would be to simply change the sed script to contain a search condition:
sed -i '/foo/s/.*/thing/' file
This also avoids the complications when the value of $line in the original, faulty script contains something which needs to be escaped in order to actually match itself. (For example, foo\bar* in a regular expression does not match the literal text itself.)
I have a Ruby script that performs some substitutions on the output of mysqldump.
The input can have very long lines (hundreds of MB), because a single line can represent a multi-row INSERT statement for all the data in a table. The mysqldump utility can be coerced to produce one INSERT statement per row, but I don't have control of every client.
My script naively expects IO#each_line to control memory usage:
$stdin.each_line do |line|
next if options[:entity_excludes].any? { |entity| line =~ /^(DROP TABLE IF EXISTS|INSERT INTO) `custom_#{entity}(s|_meta)`/ }
line.gsub!(/^CREATE TABLE `/, "CREATE TABLE IF NOT EXISTS `")
line.gsub!('{{__OPF_SITEURL__}}', siteurl) if siteurl
$stdout.write(line)
end
I've already seen input with maximum line length over 400MB, and this translates directly into process resident memory.
Are there libraries for Ruby that allow text transforms on an input stream using buffers instead of relying on line-delimited input?
This was marked as a duplicate of a simpler question. But there's quite a bit more to this. You need to keep track of multiple buffers and test for application of transforms even when they apply across a buffer boundary. It's easy to get wrong, which is why I'm hoping a library already exists.
I know there are several posts asking similar things, but none address the problem I'm having.
I'm working on a script that handles connections to different Bluetooth low energy devices, reads from some of their handles using gatttool and dynamically creates a .json file with those values.
The problem I'm having is that gatttool commands take a while to execute (and are not always successful in connecting to the devices due to device is busy or similar messages). These "errors" translate not only in wrong data to fill the .json file but they also allow lines of the script to continue writing to the file (e.g. adding extra } or similar). An example of the commands I'm using would be the following:
sudo gatttool -l high -b <MAC_ADDRESS> --char-read -a <#handle>
How can I approach this in a way that I can wait for a certain output? In this case, the ideal output when you --char-read using gatttool would be:
Characteristic value/description: some_hexadecimal_data`
This way I can make sure I am following the script line by line instead of having these "jumps".
grep allows you to filter the output of gatttool for the data you are looking for.
If you are actually looking for a way to wait until a specific output is encountered before continuing, expect might be what you are looking for.
From the manpage:
expect [[-opts] pat1 body1] ... [-opts] patn [bodyn]
waits until one of the patterns matches the output of a spawned
process, a specified time period has passed, or an end-of-file is
seen. If the final body is empty, it may be omitted.
On a Linux cluster I run many (N > 10^6) independent computations. Each computation takes only a few minutes and the output is a handful of lines. When N was small I was able to store each result in a separate file to be parsed later. With large N however, I find that I am wasting storage space (for the file creation) and simple commands like ls require extra care due to internal limits of bash: -bash: /bin/ls: Argument list too long.
Each computation is required to run through a qsub scheduling algorithm so I am unable to create a master program which simply aggregates the output data to a single file. The simple solution of appending to a single fails when two programs finish at the same time and interleave their output. I have no admin access to the cluster, so installing a system-wide database is not an option.
How can I collate the output data from embarrassingly parallel computation before it gets unmanageable?
1) As you say, it's not ls which is failing; it's the shell which does glob expansion before starting up ls. You can fix that problem easily enough by using something like
find . -type f -name 'GLOB' | xargs UTILITY
eg.:
find . -type f -name '*.dat' | xargs ls -l
You might want to sort the output, since find (for efficiency) doesn't sort the filenames (usually). There are many other options to find (like setting directory recursion depth, filtering in more complicated ways, etc.) and to xargs (maximum number of arguments for each invocation, parallel execution, etc.). Read the man pages for details.
2) I don't know how you are creating the individual files, so it's a bit hard to provide specific solutions, but here are a couple of ideas:
If you get to create the files yourself, and you can delay the file creation until the end of the job (say, by buffering output), and the files are stored on a filesystem which supports advisory locking or some other locking mechanism like atomic linking, then you can multiplex various jobs into a single file by locking it before spewing the output, and then unlocking. But that's a lot of requirements. In a cluster you might well be able to do that with a single file for all the jobs running on a single host, but then again you might not.
Again, if you get to create the files yourself, you can atomically write each line to a shared file. (Even NFS supports atomic writes but it doesn't support atomic append, see below.) You'd need to prepend a unique job identifier to each line so that you can demultiplex it. However, this won't work if you're using some automatic mechanism such as "my job writes to stdout and then the scheduling framework copies it to a file", which is sadly common. (In essence, this suggestion is pretty similar to the MapReduce strategy. Maybe that's available to you?)
Failing everything else, maybe you can just use sub-directories. A few thousand directories of a thousand files each is a lot more manageable than a single directory with a few million files.
Good luck.
Edit As requested, some more details on 2.2:
You need to use Posix I/O functions for this, because, afaik, the C library does not provide atomic write. In Posix, the write function always writes atomically, provided that you specify O_APPEND when you open the file. (Actually, it writes atomically in any case, but if you don't specify O_APPEND then each process retains it's own position into the file, so they will end up overwriting each other.)
So what you need to do is:
At the beginning of the program, open a file with options O_WRONLY|O_CREATE|O_APPEND. (Contrary to what I said earlier, this is not guaranteed to work on NFS, because NFS may not handle O_APPEND properly. Newer versions of NFS could theoretically handle append-only files, but they probably don't. Some thoughts about this a bit later.) You probably don't want to always use the same file, so put a random number somewhere into its name so that your various jobs have a variety of alternatives. O_CREAT is always atomic, afaik, even with crappy NFS implementations.
For each output line, sprintf the line to an internal buffer, putting a unique id at the beginning. (Your job must have some sort of unique id; just use that.) [If you're paranoid, start the line with some kind of record separator, followed by the number of bytes in the remaining line -- you'll have to put this value in after formatting -- so the line will look something like ^0274:xx3A7B29992A04:<274 bytes>\n, where ^ is hex 01 or some such.]
write the entire line to the file. Check the return code and the number of bytes written. If the write fails, try again. If the write was short, hopefully you followed the "if you're paranoid" instructions above, also just try again.
Really, you shouldn't get short writes, but you never know. Writing the length is pretty simple; demultiplexing is a bit more complicated, but you could cross that bridge when you need to :)
The problem with using NFS is a bit more annoying. As with 2.1, the simplest solution is to try to write the file locally, or use some cluster filesystem which properly supports append. (NFSv4 allows you to ask for only "append" permissions and not "write" permissions, which would cause the server to reject the write if some other process already managed to write to the offset you were about to use. In that case, you'd need to seek to the end of the file and try the write again, until eventually it succeeds. However, I have the impression that this feature is not actually implemented. I could be wrong.)
If the filesystem doesn't support append, you'll have another option: decide on a line length, and always write that number of bytes. (Obviously, it's easier if the selected fixed line length is longer than the longest possible line, but it's possible to write multiple fixed-length lines as long as they have a sequence number.) You'll need to guarantee that each job writes at different offsets, which you can do by dividing the job's job number into a file number and an interleave number, and write all the lines for a particular job at its interleave modulo the number of interleaves, into a file whose name includes the file number. (This is easiest if the jobs are numbered sequentially.) It's OK to write beyond the end of the file, since unix filesystems will -- or at least, should -- either insert NULs or create discontiguous files (which waste less space, but depend on the blocksize of the file).
Another way to handle filesystems which don't support append but do support advisory byte-range locking (NFSv4 supports this) is to use the fixed-line-length idea, as above, but obtaining a lock on the range about to be written before writing it. Use a non-blocking lock, and if the lock cannot be obtained, try again at the next line-offset multiple. If the lock can be obtained, read the file at that offset to verify that it doesn't have data before writing it; then release the lock.
Hope that helps.
If you are only concerned by space:
parallel --header : --tag computation {foo} {bar} {baz} ::: foo 1 2 ::: bar I II ::: baz . .. | pbzip2 > out.bz2
or shorter:
parallel --tag computation ::: 1 2 ::: I II ::: . .. | pbzip2 > out.bz2
GNU Parallel ensures output is not mixed.
If you are concerned with finding a subset of the results, then look at --results.
Watch the intro videos to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Another possibility would be to use N files, with N greater or equal to the number of nodes in the cluster, and assign the files to your computations in a round-robin fashion. This should avoid concurrent writes to any of the files, provided you have a reasonnable guarantee on the order of execution of your computations.
I want to use curl to get a stream from a remote server, and write it to a buffer. So far so good I just do curl http://the.stream>/path/to/thebuffer. Thing is I don't want this file to get too large, so I want to be able to delete the first bytes of the file as I simultaneously add to the last bytes. Is there a way of doing this?
Alternatively if I could write n bytes to buffer1, then switch to buffer2, buffer3.. and when buffer x was reached delete buffer1 and start again - without losing the data coming in from curl (it's a live stream, so I can't stop curl). I've been reading up the man pages for curl and cat and read, but can't see anything promising.
There isn't any particularly easy way to do what you are seeking to do.
Probably the nearest approach creates a FIFO, and redirects the output of curl to the FIFO. You then have a program such as split or csplit reading the FIFO and writing to different files. If you decide that the split programs are not the tool, you may need to write your own variation on them. You can then decide how to process the files that are created, and when to remove them.
Note that curl will hang until there is a process reading from the FIFO. When the process reading the FIFO exits, curl will get either a SIGPIPE signal or a write error, either of which should stop it.