Behaviour of shuf and split in a multi-process awk/bash script

I need to take a long list of IDs ("list_of_unique_ID"), shuffle it ("shuffled_ID"), then look those IDs up in a huge file ("huge_file") and extract the matching lines.
"list_of_unique_ID" has been created from "huge_file", but contains only its unique IDs.
The single-process scenario with AWK works fine.
I tried to run it as multiple processes with the following code (bash, pseudocode, because I can't extract the code from the machine it's running on):
not_split=TRUE
for sample in $list_of_samples; do
    shuf -n "$lines_to_be_extracted" list_of_unique_ID > shuffled_ID.txt
    if [ "$not_split" = TRUE ]; then
        split --number=l/"$number_of_processes" --numeric-suffixes huge_file huge_file_splitted_
        not_split=FALSE
    fi
done
for part in huge_file_splitted_*; do
    # "id_file" is a placeholder name for the -v assignment
    awk -v id_file=shuffled_ID.txt '[AWK function to look up the shuffled IDs]' "$part" > "output_${part##*_}" &
done
wait    # wait for all the awk processes to finish
I'm running 5 processes for 3 parameters in parallel.
Which is fairly "simple". What surprises me is that, even though this script takes a long time to run and the IDs are shuffled, output_00 always fills up first, then output_02, then output_05, in that order, every single time, before the other ones. I ran it multiple times and tried swapping the split pieces of huge_file, but the order seems to be dictated by what split outputs.
That doesn't seem logical to me, unless shuf actually "jumps" over lines but keeps the same order as the original file? (From what I have read, it doesn't.)
I might be wrong, but I don't really see the point of multiprocessing if the processes run one after another instead of in parallel.
Is there something wrong in my code? Is there something I don't understand clearly?
Thanks !
Edit : thanks for the correction. I corrected my vocabulary :)

Related

Counting lines or enumerating line numbers so I can loop over them - why is this an anti-pattern?

I posted the following code and got scolded. Why is this not acceptable?
numberOfLines=$(wc -l <"$1")
for ((i=1; i<=numberOfLines; ++i)); do
    lineN=$(sed -n "$i!d;p;q" "$1")
    # ... do things with "$lineN"
done
We collect the number of lines in the input file into numberOfLines, then loop from 1 to that number, pulling out the next line from the file with sed in each iteration.
The feedback I received complained that reading the same file repeatedly with sed inside the loop to get the next line is inefficient. I guess I could use head -n "$i" "$1" | tail -n 1 but that's hardly more efficient, is it?
Is there a better way to do this? Why would I want to avoid this particular approach?
The shell (and basically every programming language which is above assembly language) already knows how to loop over the lines in a file; it does not need to know how many lines there will be to fetch the next one — strikingly, in your example, sed already does this, so if the shell couldn't do it, you could loop over the output from sed instead.
The proper way to loop over the lines in a file in the shell is with while read. There are a couple of complications — commonly, you reset IFS to avoid having the shell needlessly split the input into tokens, and you use read -r to avoid some pesky legacy behavior with backslashes in the original Bourne shell's implementation of read, which have been retained for backward compatibility.
while IFS='' read -r lineN; do
    # do things with "$lineN"
done <"$1"
Besides being much simpler than your sed script, this avoids the problem that you read the entire file once to obtain the line count, then read the same file again and again in each loop iteration. With a typical modern OS, some repeated reading will be avoided thanks to caching (the disk driver keeps a buffer of recently accessed data in memory, so that reading it again will not actually require fetching it from the disk again), but the basic fact remains that reading information from disk is on the order of 1000x slower than not doing it when you can avoid it. Especially with a large file, the cache will fill up eventually, so you end up reading in and discarding the same bytes over and over, adding a significant amount of CPU overhead and an even more significant amount of time simply waiting for the disk to deliver the bytes you are reading, again and again.
In a shell script, you also want to avoid the overhead of an external process if you can. Invoking sed (or the functionally equivalent but even more expensive two-process head -n "$i"| tail -n 1) thousands of times in a tight loop will add significant overhead for any non-trivial input file. On the other hand, if the body of your loop could be done in e.g. sed or Awk instead, that's going to be a lot more efficient than a native shell while read loop, because of the way read is implemented. This is why while read is also frequently regarded as an antipattern.
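For example, if the body of the loop just extracts and reshuffles fields (a hypothetical body, not something from your script), the whole while read loop can collapse into a single Awk invocation:
# one awk process handles every line; no shell loop, no per-line subprocesses
awk -F: '{ print $3 " " $1 }' "$1"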
And make sure you are reasonably familiar with the standard palette of Unix text processing tools - cut, paste, nl, pr, etc.
In many, many cases you should avoid looping over the lines in a shell script and use an external tool instead. There is basically only one exception to this: when the body of the loop itself consists mainly of built-in shell commands.
The q in the sed script is only a partial remedy for repeatedly reading the input file; frequently, you see variations where the sed script reads the entire input file through to the end each time, even when it only wants to fetch one of the very first lines.
With a small input file, the effects are negligible, but perpetuating this bad practice just because it's not immediately harmful when the input file is small is simply irresponsible. Just don't teach this technique to beginners. At all.
If you really need to display the number of lines in the input file, for a progress indicator or similar, at least make sure you don't spend a lot of time seeking through to the end just to obtain that number. Maybe stat the file and keep track of how many bytes there are on each line, so you can project the number of lines you have left (and instead of line 1/10345234 display something like line 1/approximately 10000000?) ... or use an external tool like pv.
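For instance, a minimal sketch of the pv route (assuming pv is installed; it reports byte-based progress on stderr while the loop keeps consuming lines on stdout):
# progress bar without counting lines first; pv just tracks how many bytes have passed through
pv "$1" | while IFS='' read -r lineN; do
    : # do things with "$lineN"
done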
Tangentially, there is a vaguely related antipattern you want to avoid, too; you don't want to read an entire file into memory when you are only going to process one line at a time. Doing that in a for loop also has some additional gotchas, so don't do that, either; see https://mywiki.wooledge.org/DontReadLinesWithFor
Another common variation is to find the line you want to modify with grep, only so you can find it with sed ... which already knows full well how to perform a regex search by itself. (See also useless use of grep.)
# XXX FIXME: wrong
line=$(grep "foo" file)
sed -i "s/$line/thing/" file
The correct way to do this would be to simply change the sed script to contain a search condition:
sed -i '/foo/s/.*/thing/' file
This also avoids the complications when the value of $line in the original, faulty script contains something which needs to be escaped in order to actually match itself. (For example, foo\bar* in a regular expression does not match the literal text itself.)

How can I use real time monitoring (tail -f), cut, sort, and uniq together in Unix?

I am trying to delete duplicate text that's being written into a log, while continuously monitoring it.
The only issue is that this particular log is timestamped, so before it's possible to determine if the same text is written twice or three times in a row, the timestamp must be cut.
I'm not a Unix expert, but this is my attempt:
tail -f log.txt | cut -c 28- | sort | uniq
The terminal behaves unexpectedly and just hangs, whereas either of the following two commands works fine on its own:
tail -f log.txt | cut -c 28-
or
tail -f log.txt | uniq
Ideally I'd like to filter out non-adjacent text entries, i.e. I would like to be able to use sort as well, but currently I can't get it to work with the -f flag on tail.
You can't get sorted output of a stream of text before it has ended, as the next item to come in might belong ahead of the first one you've seen so far. This makes the sort | uniq part of your pipeline not useful for your situation.
While it's probably possible to filter out your duplicates with some more complicated shell scripting, you might find it easier to write a script in some other language. Many scripting languages have efficient set data structures that can quickly check whether an item has been seen before. Here's a fairly trivial script that should do the job, using Python 3:
#!/usr/bin/env python3
import sys

seen = set()
for line in sys.stdin:
    if line not in seen:
        sys.stdout.write(line)
        seen.add(line)
The downside to this approach is that the filtering script will use much more memory than uniq does, since it must remember every unique line it has seen before. So, this might not be an appropriate solution if your pipeline may see a great many different lines in a single run.
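If you would rather stay in the shell, a roughly equivalent sketch (with the same unbounded-memory caveat) keeps a seen-set inside Awk; the stdbuf call is a GNU coreutils workaround for cut block-buffering its output once it writes to a pipe instead of a terminal:
# print each de-timestamped line only the first time it appears
tail -f log.txt | stdbuf -oL cut -c 28- | awk '!seen[$0]++'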

Asynchronously consuming pipe with bash

I have a bash script like this
data_generator_that_never_quits | while read data
do
    an_expensive_process_with "$data"
done
The first process continuously generates events (at irregular intervals) which need to be processed as they become available. A problem with this script is that read only consumes a single line of the output; and as the processing is very expensive, I'd want it to consume all the data that is currently available. On the other hand, the processing must start immediately if new data becomes available. In a nutshell, I want to do something like this:
data_generator_that_never_quits | while read_all_available data
do
    an_expensive_process_with "$data"
done
where the command read_all_available will wait if no data is available for consumption, or copy all the currently available data into the variable. It is perfectly fine if the data does not consist of full lines. Basically, I am looking for an analog of read which would read the entire pipe buffer instead of reading just a single line from the pipe.
For the curious among you, the background of the question is that I have a build script which needs to trigger a rebuild on a source file change. I want to avoid triggering rebuilds too often. Please do not suggest using grunt, gulp or other available build systems; they do not work well for my purpose.
Thanks!
I think I have found the solution after getting better insight into how subshells work. This script appears to do what I need:
data_generator_that_never_quits | while true
do
    # wait until the next element becomes available
    read LINE
    # consume any remaining elements; a small timeout ensures that
    # rapidly fired events are batched together
    while read -t 1 LINE; do true; done
    # the data buffer is empty, launch the process
    an_expensive_process
done
It would be possible to collect all the read lines into a single batch, but I don't really care about their contents at this point, so I didn't bother figuring that part out :)
Added on 25.09.2014
Here is a final subroutine, in case it could be useful for someone one day:
flushpipe() {
    # wait until the next line becomes available
    IFS= read -r buffer
    # consume any remaining lines; a small timeout ensures that
    # rapidly fired events are batched together
    while IFS= read -r -t 1 line; do buffer="$buffer"$'\n'"$line"; done
    printf '%s\n' "$buffer"
}
To be used like this:
data_generator_that_never_quits | while true
do
    # wait until data becomes available
    data=$(flushpipe)
    # the data buffer is empty, launch the process
    an_expensive_process_with "$data"
done
Something like read -N 4096 -t 1 might do the trick, or perhaps read -t 0 with additional logic. See the Bash reference manual for details. Otherwise, you might have to move from Bash to e.g. Perl.
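As a rough illustration of the read -t 0 idea (bash 4 or newer, where read -t 0 only reports whether input is pending, without consuming anything), the drain loop could look something like this:
data_generator_that_never_quits | while IFS= read -r line; do
    batch=$line
    # as long as more input is already waiting in the pipe, keep draining it
    while IFS= read -r -t 0 _ && IFS= read -r line; do
        batch="$batch"$'\n'"$line"
    done
    an_expensive_process_with "$batch"
done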

Multiple Wait Commands

I haven't been able to phrase my question concisely, which means search engines can't give me a straight answer, and I've used this site many times in the past, so it's time I ask my own question.
So, I have a script (in bash) that runs through my list of pictures and determines if they need to be resized, which works well. I want to have a maximum of 3 running at a time. This is what I have.
if [ "$COUNT" = "$(echo "$(cat /proc/cpuinfo|grep processor|wc -l)+1"|bc)" ]; then
    wait
    COUNT="0"
fi
Above, I set a COUNT variable to 0 and add 1 to it in a for loop (the same for loop that runs the convert statement). Once it hits (in this case) 3, i.e. number of processors + 1, it will wait until the children are done, then continue.
This is all fine and dandy, but let's say all my images are 48 KB except for one which is 250 MB (not true, but for explanation purposes). Somewhere down the line, the script would wait for that ONE picture when the rest could be running.
So finally, my question. Given the context and background info above, is there a way (using the wait command or not) to have the script cap itself at 3 (processors + 1) children and start a new one as soon as one finishes? In context: is there a way to have 2 pictures running while the 250 MB picture is doing its thing, and then 3 again once it finishes, as opposed to what I do now, which is wait for all 3 and then start 3 more?
Thank you for any and all suggestions!
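One common pattern for this (a sketch, not a drop-in replacement for your loop; pictures, max_jobs and the convert line are stand-ins for your real variables and command) uses bash 4.3's wait -n to start a new job the moment any running one exits:
max_jobs=3
for picture in "${pictures[@]}"; do
    # block only while the pool is full; wait -n returns as soon as any one job finishes
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
        wait -n
    done
    convert "$picture" -resize 50% "resized/$picture" &
done
wait    # wait for the last few jobs
For many workloads, xargs -P 3 or GNU parallel -j 3 gives you the same throttling without any bookkeeping in the script.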

Embarrassingly parallel workflow creates too many output files

On a Linux cluster I run many (N > 10^6) independent computations. Each computation takes only a few minutes and the output is a handful of lines. When N was small I was able to store each result in a separate file to be parsed later. With large N, however, I find that I am wasting storage space (on file-creation overhead), and simple commands like ls require extra care because the argument list gets too long: -bash: /bin/ls: Argument list too long.
Each computation is required to run through a qsub scheduling algorithm, so I am unable to create a master program which simply aggregates the output data into a single file. The simple solution of appending to a single file fails when two programs finish at the same time and interleave their output. I have no admin access to the cluster, so installing a system-wide database is not an option.
How can I collate the output data from embarrassingly parallel computation before it gets unmanageable?
1) As you say, it's not ls which is failing; it's the shell which does glob expansion before starting up ls. You can fix that problem easily enough by using something like
find . -type f -name 'GLOB' | xargs UTILITY
eg.:
find . -type f -name '*.dat' | xargs ls -l
You might want to sort the output, since find (for efficiency) doesn't sort the filenames (usually). There are many other options to find (like setting directory recursion depth, filtering in more complicated ways, etc.) and to xargs (maximum number of arguments for each invocation, parallel execution, etc.). Read the man pages for details.
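For instance (a sketch with made-up patterns and limits; adjust the -name pattern and depth to your naming scheme), NUL-delimited handling plus a sorted, batched listing could look like:
# -print0 / -0 keep odd filenames intact; sort -z sorts the NUL-delimited list;
# -n 1000 caps how many arguments each ls invocation receives
find . -maxdepth 2 -type f -name '*.dat' -print0 | sort -z | xargs -0 -n 1000 ls -l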
2) I don't know how you are creating the individual files, so it's a bit hard to provide specific solutions, but here are a couple of ideas:
If you get to create the files yourself, and you can delay the file creation until the end of the job (say, by buffering output), and the files are stored on a filesystem which supports advisory locking or some other locking mechanism like atomic linking, then you can multiplex various jobs into a single file by locking it before spewing the output, and then unlocking. But that's a lot of requirements. In a cluster you might well be able to do that with a single file for all the jobs running on a single host, but then again you might not.
Again, if you get to create the files yourself, you can atomically write each line to a shared file. (Even NFS supports atomic writes but it doesn't support atomic append, see below.) You'd need to prepend a unique job identifier to each line so that you can demultiplex it. However, this won't work if you're using some automatic mechanism such as "my job writes to stdout and then the scheduling framework copies it to a file", which is sadly common. (In essence, this suggestion is pretty similar to the MapReduce strategy. Maybe that's available to you?)
Failing everything else, maybe you can just use sub-directories. A few thousand directories of a thousand files each is a lot more manageable than a single directory with a few million files.
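A sketch of that last idea (job_id and the computation command are stand-ins for whatever your scheduler and code actually provide): bucket outputs by a cheap function of the job id so no single directory grows unmanageable.
# spread ~10^6 result files over 1000 sub-directories of ~1000 files each
bucket=$(( job_id % 1000 ))
mkdir -p "results/$bucket"
./computation "$job_id" > "results/$bucket/out_${job_id}.txt"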
Good luck.
Edit As requested, some more details on 2.2:
You need to use Posix I/O functions for this, because, afaik, the C library does not provide atomic write. In Posix, the write function always writes atomically, provided that you specify O_APPEND when you open the file. (Actually, it writes atomically in any case, but if you don't specify O_APPEND then each process retains its own position into the file, so they will end up overwriting each other.)
So what you need to do is:
At the beginning of the program, open a file with options O_WRONLY|O_CREAT|O_APPEND. (Contrary to what I said earlier, this is not guaranteed to work on NFS, because NFS may not handle O_APPEND properly. Newer versions of NFS could theoretically handle append-only files, but they probably don't. Some thoughts about this a bit later.) You probably don't want to always use the same file, so put a random number somewhere into its name so that your various jobs have a variety of alternatives. O_CREAT is always atomic, afaik, even with crappy NFS implementations.
For each output line, sprintf the line to an internal buffer, putting a unique id at the beginning. (Your job must have some sort of unique id; just use that.) [If you're paranoid, start the line with some kind of record separator, followed by the number of bytes in the remaining line -- you'll have to put this value in after formatting -- so the line will look something like ^0274:xx3A7B29992A04:<274 bytes>\n, where ^ is hex 01 or some such.]
write the entire line to the file. Check the return code and the number of bytes written. If the write fails, try again. If the write was short, hopefully you followed the "if you're paranoid" instructions above; either way, just try again.
Really, you shouldn't get short writes, but you never know. Writing the length is pretty simple; demultiplexing is a bit more complicated, but you could cross that bridge when you need to :)
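The answer above is framed in terms of C and the raw write(2) call; purely as an illustration of the same tag-and-append idea at the shell level (>> opens the file with O_APPEND, and a short printf normally goes out as a single write, so this is only reasonably safe for short lines on a local filesystem; JOB_ID and result_line are placeholders), a sketch might look like:
# JOB_ID is whatever unique id the scheduler provides; the random component in the
# file name spreads the appends over a few alternative files, as suggested above
outfile="results_$(( RANDOM % 8 )).log"
printf '%s\t%s\n' "$JOB_ID" "$result_line" >> "$outfile"
Demultiplexing afterwards is then just a grep or sort -k1,1 on the first field of the collected files.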
The problem with using NFS is a bit more annoying. As with 2.1, the simplest solution is to try to write the file locally, or use some cluster filesystem which properly supports append. (NFSv4 allows you to ask for only "append" permissions and not "write" permissions, which would cause the server to reject the write if some other process already managed to write to the offset you were about to use. In that case, you'd need to seek to the end of the file and try the write again, until eventually it succeeds. However, I have the impression that this feature is not actually implemented. I could be wrong.)
If the filesystem doesn't support append, you'll have another option: decide on a line length, and always write that number of bytes. (Obviously, it's easier if the selected fixed line length is longer than the longest possible line, but it's possible to write multiple fixed-length lines as long as they have a sequence number.) You'll need to guarantee that each job writes at different offsets, which you can do by dividing the job's job number into a file number and an interleave number, and write all the lines for a particular job at its interleave modulo the number of interleaves, into a file whose name includes the file number. (This is easiest if the jobs are numbered sequentially.) It's OK to write beyond the end of the file, since unix filesystems will -- or at least, should -- either insert NULs or create discontiguous files (which waste less space, but depend on the blocksize of the file).
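At the shell level, the fixed-length-record variant could be sketched with dd, which can write at an arbitrary byte offset without truncating the file (record_len, slot, record and file_number are illustrative names, not something from the question):
record_len=128                      # fixed record size, longer than any expected line
offset=$(( slot * record_len ))     # slot = this job's next interleaved record index
printf '%-127s\n' "$record" |       # pad the record to exactly 128 bytes
    dd of="results_${file_number}.dat" bs=1 seek="$offset" conv=notrunc status=none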
Another way to handle filesystems which don't support append but do support advisory byte-range locking (NFSv4 supports this) is to use the fixed-line-length idea, as above, but obtaining a lock on the range about to be written before writing it. Use a non-blocking lock, and if the lock cannot be obtained, try again at the next line-offset multiple. If the lock can be obtained, read the file at that offset to verify that it doesn't have data before writing it; then release the lock.
Hope that helps.
If you are only concerned about space:
parallel --header : --tag computation {foo} {bar} {baz} ::: foo 1 2 ::: bar I II ::: baz . .. | pbzip2 > out.bz2
or shorter:
parallel --tag computation ::: 1 2 ::: I II ::: . .. | pbzip2 > out.bz2
GNU Parallel ensures output is not mixed.
If you are concerned with finding a subset of the results, then look at --results.
Watch the intro videos to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Another possibility would be to use N files, with N greater than or equal to the number of nodes in the cluster, and assign the files to your computations in a round-robin fashion. This should avoid concurrent writes to any of the files, provided you have a reasonable guarantee on the order of execution of your computations.
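A sketch of the round-robin assignment (node_count and task_id are placeholders for the cluster size and a sequential task index provided by your scheduler):
# computations with the same (task_id mod node_count) share one output file,
# and by assumption never run at the same time
outfile="results_$(( task_id % node_count )).txt"
./computation "$task_id" >> "$outfile"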
