GNU Parallel: split file into children - bash

Goal
Use GNU Parallel to split a large .gz file into children. Since the server has 16 CPUs, create 16 children. Each child should contain, at most, N lines. Here, N = 104,214,420 lines. Children should be in .gz format.
Input File
name: file1.fastq.gz
size: 39 GB
line count: 1,667,430,708 (uncompressed)
Hardware
36 GB Memory
16 CPUs
HPCC environment (I'm not admin)
Code
Version 1
zcat "${input_file}" | parallel --pipe -N 104214420 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
Three days later, the job was not finished. split_log.txt was empty. No children were visible in the output directory. Log files indicated that Parallel had increased the --block-size from 1 MB (the default) to over 2 GB. This inspired me to change my code to Version 2.
Version 2
# --block-size 3000000000 means a single record could be 3 GB long. Parallel will increase this value if needed.
zcat "${input_file}" | "${parallel}" --pipe -N 104214420 --block-size 3000000000 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
The job has been running for ~2 hours. split_log.txt is empty. No children are visible in the output directory yet. So far, log files show the following warning:
parallel: Warning: --blocksize >= 2G causes problems. Using 2G-1.
Questions
How can my code be improved?
Is there a faster way to accomplish this goal?

Let us assume that the file is a fastq file, and that the record size therefore is 4 lines.
You tell that to GNU Parallel with -L 4.
In a fastq file the order does not matter, so you want to pass blocks of n*4 lines to the children.
To do that efficiently you use --pipepart, except --pipepart does not work with compressed files and does not work with -L, so you have to settle for --pipe.
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
This will pass a block to 16 children, and a block defaults to 1 MB, which is chopped at a record boundary (i.e. 4 lines). It will run a job for each block. But what you really want is to have the input passed to only 16 jobs in total, and you can do that round robin. Unfortunately there is an element of randomness in --round-robin, so --resume-failed will not work:
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --round-robin "gzip > ${input_file}_child_{#}.gz"
parallel will be struggling to keep up with the 16 gzips, but you should be able to compress 100-200 MB/s.
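As a quick sanity check afterwards you could, for example, confirm that every child holds whole records, i.e. a line count that is a multiple of 4. A sketch that just reuses the output names from the command above:
# every child should contain complete FASTQ records, i.e. a multiple of 4 lines
for f in "${input_file}"_child_*.gz; do
    echo "$f: $(zcat "$f" | wc -l) lines"
done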
Now if you had the fastq file uncompressed we could do it even faster, but we would have to cheat a little: often in fastq files the seqnames start with the same string:
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
Here it is @EAS54_6_R. Unfortunately this is also a valid string in a quality line (which is a really dumb design), but in practice we would be extremely surprised to see a quality line starting with @EAS54_6_R. It just does not happen.
We can use that to our advantage: you can use \n followed by @EAS54_6_R as a record separator, and then we can use --pipepart. The added benefit is that the order will remain the same. Here you would have to give the block size as 1/16 of the size of file1.fastq:
parallel -a file1.fastq --block <<1/16th of the size of file1.fastq>> -j16 --pipepart --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
If you use GNU Parallel 20161222 then GNU Parallel can do that computation for you. --block -1 means: Choose a block-size so that you can give one block to each of the 16 jobslots.
parallel -a file1.fastq --block -1 -j16 --pipepart --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
Here GNU Parallel will not be the limiting factor: It can easily transfer 20 GB/s.
It is annoying having to open the file to see what the recstart value should be, so this will work in most cases:
parallel -a file1.fastq --pipepart --block -1 -j16 \
    --regexp --recend '\n' --recstart '@.*\n[A-Za-z\n\.~]' \
    my_command
Here we assume that the lines will start like this:
@<anything>
[A-Za-z\n\.~]<anything>
<anything>
<anything>
Even if you have a few quality lines starting with '@', they will never be followed by a line starting with [A-Za-z\n.~], because a quality line is always followed by the seqname line, which starts with @.
You could also have a block size so big that it corresponded to 1/16 of the uncompressed file, but that would be a bad idea:
You would have to be able to keep the full uncompressed file in RAM.
The last gzip will only be started after the last byte has been read (and the first gzip will probably be done by then).
By setting the number of records to 104214420 (using -N) this is basically what you are doing, and your server is probably struggling with keeping the 150 GB of uncompressed data in its 36 GB of RAM.

Paired end poses a restriction: The order does not matter, but the order must be predictable for different files. E.g. record n in file1.r1.fastq.gz must match record n in file1.r2.fastq.gz.
split -n r/16 is very efficient for doing simple round robin. It does not, however, support multi-line records. So we insert \0 as a record separator after every 4th line, and remove it again after the splitting. --filter runs a command on each output chunk, so we do not need to save the uncompressed data:
doit() { perl -pe 's/\0//' | gzip > $FILE.gz; }
export -f doit
zcat big.gz | perl -pe '($.-1)%4 or print "\0"' | split -t '\0' -n r/16 --filter doit - big.
The output files will be named big.aa.gz .. big.ap.gz.
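For paired-end data you can simply run the same pipeline on both files: split -n r/16 distributes records round robin in a fixed, deterministic order, so record n of each file should end up at the same position in the same-lettered child. A sketch under that assumption, reusing the doit filter above (the file names are the ones from the paired-end example):
for f in file1.r1.fastq.gz file1.r2.fastq.gz; do
    # same record count and same deterministic round robin => pairing is preserved
    zcat "$f" | perl -pe '($.-1)%4 or print "\0"' |
        split -t '\0' -n r/16 --filter doit - "${f%.fastq.gz}".
done
This would give children named file1.r1.aa.gz .. file1.r1.ap.gz and file1.r2.aa.gz .. file1.r2.ap.gz.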

Related

Skipping chunk of a big file with bash

Using bash, how can I awk/grep from the middle of a given file and skip, for instance, the first 1 GiB? In other words, I don't want awk/grep to search through the first 1 GiB of the file; I want to start my search in the middle of the file.
You can use dd like this:
# make a 10GB file of zeroes
dd if=/dev/zero bs=1G count=10 > file
# read it, skipping first 9GB and count what you get
dd if=file bs=1G skip=9 | wc -c
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.84402 s, 582 MB/s
1073741824
Note that I am just demonstrating a concept of how easily you can skip 9GB. In practice, you may prefer to use a 100MB memory buffer and skip 90 of them, rather than allocating a whole GB. So, in practice, you might prefer:
dd if=file bs=100M skip=90 | wc -c
Note also that I am piping to wc rather than awk because my test data is not line oriented - it is just zeros.
Or, if your record size is 30kB and you want to skip a million records and discard diagnostic output:
dd if=file bs=30K skip=1000000 2> /dev/null | awk ...
Note that:
your line numbers will be "wrong" in awk (because awk didn't "see" them), and
your first line may be incomplete (because dd isn't "line oriented"), but that usually doesn't matter - see the sketch below.
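If the exact starting point does not matter, a simple way to deal with the partial first line is to drop it with tail -n +2 before the line-oriented tool sees it. A sketch building on the command above ('your regex' is just a placeholder for your pattern):
# skip ~9GB, discard the (almost certainly partial) first line, then process the rest
dd if=file bs=100M skip=90 2>/dev/null | tail -n +2 | awk '/your regex/'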
Note also that it is generally very advantageous to use a large block size. So, if you want 8MB, you will do much better with bs=1M count=8 than with bs=8 count=1000000, which would cause a million writes of 8 bytes each.
Note also that if you like processing very large files, you can get GNU Parallel to divide them up for parallel processing by multiple subprocesses. So, for example, the following code takes the 10GB file we made at the start and starts 10 parallel jobs counting the bytes in each 1GB chunk:
parallel -a file --recend "" --pipepart --block 1G wc -c
If you know the full size of the file (let's say 5 million lines) you can do this:
tail -n 2000000 filename | grep "yourfilter"
This way you will do whatever editing or printing starting after the first 3 million lines.
I have not tested the performance on very large files compared to tail | grep, but you could try GNU sed:
sed -n '3000001,$ {/your regex/p}' file
This skips the first 3 million lines and then prints all lines matching the your regex regular expression. The same with awk:
awk 'NR>3000000 && /your regex/' file

bash: split ascii file into n parts; iterate over ONLY those files

I have an ASCII file of a few thousand lines, processed one line at a time by a bash script. Because the processing is embarrassingly parallel, I'd like to split the file into parts of roughly the same size, preserving line breaks, one part per CPU core. Unfortunately the file suffixes made by split -n r/numberOfCores aren't easily iterated over.
split --numeric-suffixes=1 -n r/42 ... makes files foo.01, foo.02, ..., foo.42, which can be iterated over with for i in `seq -w 1 42` (because -w adds a leading zero). But if the 42 changes to something smaller than 10, the files still have the leading zero but the seq doesn't, so it fails. This concern is valid, because nowadays some PCs have fewer than 10 cores, some more than 10. A ghastly workaround:
[[ $numOfCores < 10 ]] && optionForSeq="" || optionForSeq="-w"
The naive solution for f in foo.* is risky: the wildcard might match files other than the ones that split made.
An ugly way to make the suffixes seq-friendly, but with the same risk:
split -n r/$numOfCores infile foo.
for i in `seq 1 $numOfCores`; do
    mv `ls foo.* | head -1` newPrefix.$i
done
for i in `seq 1 $numOfCores`; do
    ... newPrefix.$i ...
done
Is there a cleaner, robust way of splitting the file into n parts, where 1<=n<=64 isn't known until runtime, and then iterating over those parts? Should split write only into a freshly created directory?
(Edit: To clarify "if the 42 changes to something smaller than 10," the same code should work on a PC with 8 cores and on another PC with 42 cores.)
A seq-based solution is clunky. A wildcard-based solution is risky. Is there an alternative to split? (csplit with line numbers would be even clunkier.) A gawk one-liner?
How about using a format string with seq?
$ seq -f '%02g' 1 4
01
02
03
04
$ seq -f '%02g' 1 12
01
02
03
...
09
10
11
12
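Combined with split's numeric suffixes, that gives an iterable set of names. A sketch; infile, the foo. prefix and myscript are placeholders:
numOfCores=$(nproc)                                    # or however many parts you want
split -n r/"$numOfCores" --numeric-suffixes=1 infile foo.
for i in $(seq -f '%02g' 1 "$numOfCores"); do
    myscript foo."$i" &                                # one background job per part
done
wait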
With GNU bash 4:
Use printf to format your numbers:
for ((i=1;i<=4;i++)); do printf -v num "%02d" $i; echo "$num"; done
Output:
01
02
03
04
Are you sure this is not a job for GNU Parallel?
cat file | parallel --pipe -N1 myscript_that_reads_one_line_from_stdin
This way you do not need to have the temporary files at all.
If your script can read more than one line (so it is in practice a UNIX filter), then this should be very close to optimal:
parallel --pipepart -k --roundrobin -a file myscript_that_reads_from_stdin
It will spawn one job per core and split the file into one part per core on the fly. If some lines are harder to process than others (i.e. you can get "stuck" for a while on a single line), then this solution might be better:
parallel --pipepart -k -a file myscript_that_reads_from_stdin
It will spawn one job per core and split the file into 10 parts per core on the fly, thus running 10 jobs per core in total.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new job whenever one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
Get the filenames with ls and then use a regex:
for n in $(ls foo.* | grep '^foo\.[0-9][0-9]*$'); do
    ... newPrefix.$n ...
done
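Another way to avoid both the ls parsing and the suffix-formatting problem is to split into a freshly created (and therefore empty) directory, as the question itself suggests, so the wildcard can only ever match split's own output. A sketch; infile, the part. prefix and myscript are placeholders:
outdir=$(mktemp -d)                          # fresh, empty directory
split -n r/"$numOfCores" infile "$outdir"/part.
for f in "$outdir"/part.*; do
    myscript "$f" &                          # one background job per part
done
wait                                         # wait for all parts to finish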

Grepping a 1M row file with 320K patterns stored in another file

I tried to grep a 1M row '|' separated file with 320K patterns from another file with piping to Ole Tange's parallel package and piping the matched results into another file. I am using Cygwin on Windows 7 with 24 cores and 16GB physical memory.
The command I used, after going through this link:
Grepping a huge file (80GB) any way to speed it up?
< matchReport1.dat parallel --pipe --block 2M LC_ALL=C grep --file=nov15.DAT > test.match
where matchReport1.dat is the 1M row '|' separated file and the 320K patterns are stored in nov15.DAT. The task manager activity hits all 24 cores, physical memory usage jumps to ~15GB, and I start getting messages from grep that memory has been exhausted.
I then tried to split the nov15.DAT patterns file into 10 smaller chunks and run grep with those:
parallel --bar -j0 -a xaa "LC_ALL=C grep {} matchReport1.dat" > testxaa
but this just takes too long (processing only 1.6K out of the 30K lines took about 15 minutes).
My nov15.DAT pattern file consists of strings like 'A12345M', and the file these patterns need to match against, i.e. matchReport1.dat, has strings like 'A12345M_dfdf' and 'A12345M_02', so I cannot use the -F option in grep. Could someone suggest a fix, or any option other than using databases?
Here's a sample:
nov15.DAT -> http://pastebin.com/raw/cUeGcYLb
matchReport1.dat -> http://pastebin.com/raw/01KSGN6k
I assume that you only want to compare the strings from nov15.DAT with the start of the second column of matchReport1.dat.
Try this: modify nov15.DAT so that grep does not have to compare every row from the first to the last character:
sed 's/.*/^"[^|]*"|"&/' nov15.DAT > mov15_mod1.DAT
And then use mov15_mod1.DAT with your parallel command.
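Concretely, that means rerunning the command from the question with the rewritten pattern file, something like:
< matchReport1.dat parallel --pipe --block 2M LC_ALL=C grep --file=mov15_mod1.DAT > test.match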
Not very accurate, but if the IDs in nov15 are unique and do not match other places in the line, then this might just work. And it is fast:
perl -F'\|' -ane 'BEGIN{chomp(@nov15=`cat nov15.DAT`);@m{@nov15}=1..$#nov15+1;} for $l (split/"|_/,$F[1]) { if($m{$l}) { print }}' matchReport1.dat

Using GNU Parallel With Split

I'm loading a pretty gigantic file into a PostgreSQL database. To do this I first use split on the file to get smaller files (30 GB each) and then load each smaller file into the database using GNU Parallel and psql copy.
The problem is that it takes about 7 hours to split the file, and only then does it start loading one file per core. What I need is a way to tell split to print each file name to standard output as soon as it finishes writing that file, so I can pipe the names to Parallel and have it start loading each file as soon as split finishes writing it. Something like this:
split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}
I have read the split man pages and I can't find anything. Is there a way to do this with split or any other tool?
You could let parallel do the splitting:
<2011.psv parallel --pipe -N 50000000 ./carga_postgres.sh
Note that the man page recommends using --block over -N; this will still split the input at record separators, \n by default, e.g.:
<2011.psv parallel --pipe --block 250M ./carga_postgres.sh
Testing --pipe and -N
Here's a test that splits a sequence of 100 numbers into 5 files:
seq 100 | parallel --pipe -N23 'cat > /tmp/parallel_test_{#}'
Check result:
wc -l /tmp/parallel_test_[1-5]
Output:
23 /tmp/parallel_test_1
23 /tmp/parallel_test_2
23 /tmp/parallel_test_3
23 /tmp/parallel_test_4
8 /tmp/parallel_test_5
100 total
If you use GNU split, you can do this with the --filter option
‘--filter=command’
With this option, rather than simply writing to each output file, write through a pipe to the specified shell command for each output file. command should use the $FILE environment variable, which is set to a different output file name for each invocation of the command.
You can create a shell script that writes the file and then starts carga_postgres.sh in the background:
#! /bin/sh
cat >$FILE
./carga_postgres.sh $FILE &
and use that script as the filter
split -l 50000000 --filter=./filter.sh 2011.psv

Grepping a huge file (80GB) any way to speed it up?

grep -i -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
This has been running for an hour on a fairly powerful linux server which is otherwise not overloaded.
Any alternative to grep? Anything about my syntax that can be improved (is egrep or fgrep better)?
The file is actually in a directory which is shared with a mount to another server, but the actual disk space is local so that shouldn't make any difference?
The grep is using up to 93% CPU.
Here are a few options:
1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.
2) Use fgrep because you're searching for a fixed string, not a regular expression.
3) Remove the -i option, if you don't need it.
So your command becomes:
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
It will also be faster if you copy your file to RAM disk.
If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:
< eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'
Depending on your disks and CPUs it may be faster to read larger blocks:
< eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'
It's not entirely clear from your question, but other options for grep include (combined in an example below):
Dropping the -i flag.
Using the -F flag for a fixed string
Disabling NLS with LANG=C
Setting a max number of matches with the -m flag.
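Combined, that could look something like the following (the -m value of 100 is only a placeholder; use whatever limit fits your case):
LANG=C grep -F -m 100 -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql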
Some trivial improvements:
Remove the -i option if you can; case-insensitive matching is quite slow.
Replace the . with \.
A bare dot is the regex symbol that matches any character, which is also slow.
Two lines of attack:
Are you sure you need the -i, or is there a possibility to get rid of it?
Do you have more cores to play with? grep is single-threaded, so you might want to start more of them at different offsets.
< eightygigsfile.sql parallel -k -j120% -n10 -m grep -F -i -C 5 'db_pd.Clients'
If you need to search for multiple strings, grep -f strings.txt saves a ton of time. The above is a translation of something that I am currently testing. The -j and -n option values seemed to work best for my use case. The -F grep also made a big difference.
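As a sketch, fixed-string matching against a pattern file (strings.txt stands in for your file of search strings) could look like:
LC_ALL=C grep -F -f strings.txt -C 5 eightygigsfile.sql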
Try ripgrep
It provides much better results compared to grep.
All the above answers were great. What really helped me on my 111 GB file was using LC_ALL=C fgrep -m <maxnum> fixed_string filename.
However, sometimes there may be 0 or more repeating patterns, in which case calculating the maxnum isn't possible. The workaround is to use the start and end patterns for the event(s) you are trying to process, and then work on the line numbers between them. Like so -
startline=$(grep -n -m 1 "$start_pattern" file | awk -F":" '{print $1}')
endline=$(grep -n -m 1 "$end_pattern" file | awk -F":" '{print $1}')
logs=$(tail -n +$startline file | head -n $(($endline - $startline + 1)))
Then work on this subset of logs!
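Once you have the two line numbers, GNU sed can also print the range directly; a roughly equivalent sketch using the same variables:
logs=$(sed -n "${startline},${endline}p" file)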
Hmm, what speeds do you need? I created a synthetic 77.6 GB file with nearly 525 million rows and plenty of Unicode:
rows = 524759550. | UTF8 chars = 54008311367. | bytes = 83332269969.
and randomly selected rows, at an average rate of 1 in every 3^5 (using rand(), not just NR % 243), in which to place the string db_pd.Clients at a random position in the middle of the existing text, totaling 2.16 million rows where the regex pattern hits:
rows = 2160088. | UTF8 chars = 42286394. | bytes = 42286394.
% dtp; pvE0 < testfile_gigantic_001.txt|
mawk2 '
_^(_<_)<NF { print (__=NR-(_+=(_^=_<_)+(++_)))<!_\
?_~_:__,++__+_+_ }' FS='db_pd[.]Clients' OFS=','
in0: 77.6GiB 0:00:59 [1.31GiB/s] [1.31GiB/s] [===>] 100%
out9: 40.3MiB 0:00:59 [ 699KiB/s] [ 699KiB/s] [ <=> ]
524755459,524755470
524756132,524756143
524756326,524756337
524756548,524756559
524756782,524756793
524756998,524757009
524757361,524757372
And mawk2 took just 59 seconds to extract a list of the row ranges it needs. From there it should be relatively trivial. Some overlap may exist.
At throughput rates of 1.3 GiB/s, as calculated by pv above, it might even be detrimental to use utilities like parallel to split up the task.

Resources