I need some help optimizing GNU Parallel when the input data is contained in many files that must be concatenated together and piped into several different commands, each run in parallel.
I am parsing data from an archive whose contents are spread across many files. The goal is to parse the different data types into one file per type for the whole archive. To accomplish this I am concatenating the files together and piping them to each parsing command. The parser accepts data on stdin and takes the data type to parse as an argument (e.g. 'parser type1' to parse data of type1, etc.).
At the moment I have something like this:
parallel --xapply ::: \
'cat datadir/*.dat | parser type1 > type1.txt' \
'cat datadir/*.dat | parser type2 > type2.txt' \
'cat datadir/*.dat | parser type3 > type3.txt'
But this requires concatenating the data several times, which is slow and seems unnecessarily costly. Plus, my understanding is that there is a throughput limit on a pipe. Is there a better way to achieve this?
This is a bit complex, but it avoids reading datadir/*.dat more than once, and it uses fifos instead of temporary files.
# Example parser: 'parser type1' simply greps for 'type1'
parser() { cat | grep "$1"; }
# Make function visible to GNU Parallel
export -f parser
types="type1 type2 type3"
# Create fifos: myfifo-type1 myfifo-type2 myfifo-type3
parallel mkfifo myfifo-{} ::: $types
# Send datadir/*.dat to 'parser type1', 'parser type2', 'parser type3'
# send output to myfifo-type1 myfifo-type2 myfifo-type3
cat datadir/*.dat |
parallel -j0 --pipe --tee parser {} '>' myfifo-{} ::: $types &
# Read from the fifos
parallel --xapply echo :::: myfifo*
rm myfifo*
It is a bit unclear what command you want in the final --xapply. Maybe what you are looking for is simply sending the same input to parser with different parameters:
cat datadir/*.dat |
parallel -j0 --pipe --tee parser {} '>' output-{} ::: $types
which is much simpler than the mkfifo setup above.
Pipes/fifos are very fast (>2 GB/s) and there is no limit to the amount of data you can put through them. They avoid having the data ever hit the disk.
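If you want a rough feel for pipe throughput on your own machine, you can push zeros through a pipe and let dd report the rate (a sanity check only; the numbers vary a lot by hardware):
dd if=/dev/zero bs=1M count=2048 | cat > /dev/null
dd prints a summary with the achieved throughput on stderr when it finishes.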
On several occasions I want to take xargs-generated arguments and use them in an elaborate command pipeline (piping the results through several steps).
I typically end up using bash -c with the xargs-generated argument as its only argument. E.g.:
find *.obj | xargs -I{} bash -c 'sha256sum $0 | tee $0.sha256' {}
Is there any better or more concise way to do this?
I think it is more concise with GNU Parallel as follows:
find "*.obj" -print0 | parallel -0 sha256sum {} \| tee {}.sha256
In addition, it is:
potentially more performant since it works in parallel across all CPU cores
debuggable by using parallel --dry-run ...
more informative by using --bar or --eta to get a progress bar or ETA for completion
more flexible, since it gives you many predefined replacement strings, such as {.} meaning the current file minus its extension, {/} meaning the basename of the current file, {//} meaning the directory of the current file, and so on (see the sketch right after this list)
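As a quick illustration of those replacement strings (the path is made up, and --dry-run only prints the command it would run):
parallel --dry-run 'echo {} {.} {/} {//}' ::: /tmp/build/foo.obj
which prints: echo /tmp/build/foo.obj /tmp/build/foo foo.obj /tmp/build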
I have a text file infile.txt as such:
abc what's the foo bar.
foobar hello world, hhaha cluster spatio something something.
xyz trying to do this in parallel
kmeans you're mean, who's mean?
Each line in the file is processed by this perl command into out.txt:
cat infile.txt | perl dosomething > out.txt
Imagine if the text file is 100,000,000 lines. I want to parallelize the bash command, so I tried something like this:
$ mkdir splitfiles
$ mkdir splitfiles_processed
$ cd splitfiles
$ split -n3 ../infile.txt
$ for i in *; do cat "$i" | perl dosomething > "../splitfiles_processed/$i" & done
$ wait
$ cd ../splitfiles_processed
$ cat * > ../infile_processed.txt
But is there a less verbose way to do the same?
The answer from @Ulfalizer gives you a good hint about the solution, but it lacks some details.
You can use GNU parallel (apt-get install parallel on Debian)
So your problem can be solved using the following command:
cat infile.txt | parallel -l 1000 -j 10 -k --spreadstdin perl dosomething > result.txt
Here is the meaning of the arguments:
-l 1000: send blocks of 1000 lines to the command
-j 10: launch 10 jobs in parallel
-k: keep the output in the same order as the input (a quick check follows this list)
--spreadstdin: spread those 1000-line blocks onto the stdin of the command
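As a quick check that -k really preserves the input order, you can reuse the same flags with cat as a toy stand-in for the perl script (smaller numbers, same idea):
seq 20 | parallel -l 5 -j 4 --spreadstdin -k cat
which prints 1 through 20 in their original order even though the four 5-line blocks are processed concurrently.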
I've never tried it myself, but GNU parallel might be worth checking out.
Here's an excerpt from the man page (parallel(1)) that's similar to what you're currently doing. It can split the input in other ways too.
EXAMPLE: Processing a big file using more cores
To process a big file or some output you can use --pipe to split up
the data into blocks and pipe the blocks into the processing program.
If the program is gzip -9 you can do:
cat bigfile | parallel --pipe --recend '' -k gzip -9 >bigfile.gz
This will split bigfile into blocks of 1 MB and pass that to gzip -9
in parallel. One gzip will be run per CPU core. The output of gzip -9
will be kept in order and saved to bigfile.gz
Whether this is worthwhile depends on how CPU-intensive your processing is. For simple scripts you'll spend most of the time shuffling data to and from the disk, and parallelizing won't get you much.
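A rough way to check which regime you are in is to time the script single-threaded on a slice of the input and compare wall-clock time against CPU time (a sketch; the line count is arbitrary):
time head -n 1000000 infile.txt | perl dosomething > /dev/null
If real is close to user + sys, the work is CPU-bound and parallelizing should pay off; if real is much larger, you are mostly waiting on I/O.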
You can find some introductory videos by the GNU Parallel author here.
Assuming your limiting factor is NOT your disk, you can do this in perl with fork() and specifically Parallel::ForkManager:
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my $max_forks = 8;    # 2x the number of processors is usually optimal

sub process_line {
    my ($line) = @_;
    # do something with this line
}

my $fork_manager = Parallel::ForkManager->new($max_forks);

open( my $input, '<', 'infile.txt' ) or die $!;
while ( my $line = <$input> ) {
    $fork_manager->start and next;    # fork; the parent moves on to the next line
    process_line($line);
    $fork_manager->finish;            # the child exits here
}
close($input);
$fork_manager->wait_all_children();
The downside of doing something like this, though, is coalescing your output. Each parallel task doesn't necessarily finish in the sequence it started, so you have all sorts of potential problems with serialising the results.
You can work around these with something like flock, but you need to be careful, as too many locking operations can take away your parallel advantage in the first place. (Hence my first statement: if your limiting factor is disk IO, then parallelism doesn't help very much at all anyway.)
There are various possible solutions though, enough that there's a whole chapter on it in the perl docs: perlipc. But keep in mind you can retrieve data from the children with Parallel::ForkManager too.
Given a set of files, I need to pass 2 arguments and direct the output to a newly named file, based on either input filename. The input list follows a defined format: S1_R1.txt, S1_R2.txt; S2_R1.txt, S2_R2.txt; S3_R1.txt, S3_R2.txt, etc. The first numeric is incremented by 1 and each has an R1 and corresponding R2.
The output file is a combination of each S#-pair and should be named accordingly, e.g. S1_interleave.txt, S2_interleave.txt, S3_interleave.txt, etc.
The following works to print to the screen:
find S*R*.txt -maxdepth 0 | xargs -n 2 python interleave.py
How can I utilize the input filenames for use as output?
Just to make it a bit more fun: let us assume the files are gzipped (as paired-end reads often are) and you want the result gzipped, too:
parallel --xapply 'python interleave.py <(zcat {1}) <(zcat {2}) |gzip > {=1 s/_R1.txt.gz/_interleave.txt.gz/=}' ::: *R1.txt.gz ::: *R2.txt.gz
You need the pre-release of GNU Parallel to do this http://git.savannah.gnu.org/cgit/parallel.git/snapshot/parallel-1a1c0ebe0f79c0ada18527366b1eabeccd18bdf5.tar.gz (or wait for the release 20140722).
As asked, it is even simpler (you still need the pre-release, though):
parallel --xapply 'python interleave.py {1} {2} > {=1 s/_R1.txt/_interleave.txt/=}' ::: *R1.txt ::: *R2.txt
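If you want to preview the generated commands without running anything (assuming files like S1_R1.txt and S1_R2.txt are present), --dry-run works here as well:
parallel --dry-run --xapply 'python interleave.py {1} {2} > {=1 s/_R1.txt/_interleave.txt/=}' ::: *R1.txt ::: *R2.txt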
I'm trying to split a very large file to one new file per line.
Why? It's going to be input for Mahout, but there are too many lines and not enough suffixes for split.
Is there a way to do this in bash?
Increase Your Suffix Length with Split
If you insist on using split, then you have to increase your suffix length. For example, assuming you have 10,000 lines in your file:
split --suffix-length=5 --lines=1 foo.txt
If you really want to go nuts with this approach, you can even set the suffix length dynamically with the wc command and some shell arithmetic. For example:
file='foo.txt'
split \
--suffix-length=$(( $(wc --chars < <(wc --lines < "$file")) - 1 )) \
--lines=1 \
"$file"
Use Xargs Instead
However, the above is really just a kludge anyway. A more correct solution would be to use xargs from the GNU findutils package to invoke some command once per line. For example:
xargs --max-lines=1 --arg-file=foo.txt your_command
This will pass one line at a time to your command. This is a much more flexible approach and will dramatically reduce your disk I/O.
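To see exactly how the lines are handed over, substitute echo for your_command; each line of foo.txt becomes the argument list of one invocation (i.e. the data arrives as arguments, not on stdin):
xargs --max-lines=1 --arg-file=foo.txt echo
This prints each input line back, one invocation per line.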
split --lines=1 --suffix-length=5 input.txt output.
This will use 5 characters per suffix, which is enough for 26^5 = 11,881,376 files. If you really have more than that, increase --suffix-length.
Here's another way to do something for each line:
while IFS= read -r line; do
do_something_with "$line"
done < big.file
GNU Parallel can do this:
cat big.file | parallel --pipe -N1 'cat > {#}'
But if Mahout can read from stdin then you can avoid the temporary files:
cat big.file | parallel --pipe -N1 mahout --input-file -
Learn more about GNU Parallel https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1 and walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
I'm loading a pretty gigantic file into a PostgreSQL database. To do this I first use split on the file to get smaller files (30 GB each) and then I load each smaller file into the database using GNU Parallel and psql copy.
The problem is that it takes about 7 hours to split the file, and only then does it start to load one file per core. What I need is a way to tell split to print each file name to standard output as soon as it finishes writing that file, so I can pipe it to Parallel and it starts loading the files as split finishes writing them. Something like this:
split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}
I have read the split man pages and I can't find anything. Is there a way to do this with split or any other tool?
You could let parallel do the splitting:
<2011.psv parallel --pipe -N 50000000 ./carga_postgres.sh
Note that the man page recommends using --block over -N; this will still split the input at record separators, \n by default, e.g.:
<2011.psv parallel --pipe --block 250M ./carga_postgres.sh
Testing --pipe and -N
Here's a test that splits a sequence of 100 numbers into 5 files:
seq 100 | parallel --pipe -N23 'cat > /tmp/parallel_test_{#}'
Check result:
wc -l /tmp/parallel_test_[1-5]
Output:
23 /tmp/parallel_test_1
23 /tmp/parallel_test_2
23 /tmp/parallel_test_3
23 /tmp/parallel_test_4
8 /tmp/parallel_test_5
100 total
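An analogous sanity check for --block (the byte counts are approximate, since parallel cuts each block at the \n record separator):
seq 1000000 | parallel --pipe --block 1M -k wc -c
This should print a handful of counts, each close to 1 MB, plus a smaller remainder at the end.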
If you use GNU split, you can do this with the --filter option
‘--filter=command’
With this option, rather than simply writing to each output file, write through a pipe to the specified shell command for each output file. command should use the $FILE environment variable, which is set to a different output file name for each invocation of the command.
You can create a shell script that writes the data to the file and then starts carga_postgres.sh on it in the background:
#! /bin/sh
# $FILE is set by split to the name of the current output chunk
cat >"$FILE"
./carga_postgres.sh "$FILE" &
and use that script as the filter
split -l 50000000 --filter=./filter.sh 2011.psv
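If you also want to keep the carga/2011_ prefix from your original command, split accepts the prefix argument after the input file, and $FILE will then include it, so filter.sh sees names like carga/2011_aa (the carga directory must already exist):
split -l 50000000 --filter=./filter.sh 2011.psv carga/2011_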