Split up text and process in parallel - bash

I have a program that generates lots (terabytes) of output and sends it to stdout.
I want to split that output and process it in parallel with a bunch of instances of another program. It can be distributed in any way, as long as the lines are left intact.
Parallel can do this, but it takes a fixed number of lines and restartes the filter process after this:
./relgen | parallel -l 100000 -j 32 --spreadstdin ./filter
Is there a way to keep a constant number of processes running and distribute data among them?

-l is no good for performance. Use --block instead if possible.
You can have the data distributed roundrobin with: --roundrobin.
./relgen | parallel --block 3M --round-robin -j 32 --pipe ./filter

Related

How to convert SIMD bash commands into GPU processable commands?

Consider a SIMD kind of code which extracts all instances of a pattern match from a file like this:
grep grep -n <some_pattern>
This can be made faster using GNU Parallel and some modifications like this
cat fileName | parallel -j{cores} --pipe --block {chunk_size}M --cat LC_ALL=C grep -n '/some_pattern/'
I can also use xargs to make the parallel execution if the single input file is split into multiple separate files:
xargs -P {cores} -L {line_per_process} bash -c grep {1}< fileID*
But this kind of parallelism is limited by the number of CPU cores that you can have.
I am interested in knowing whether there is any way to convert such commands into GPU(CUDA) threads?
The whole task can be broken into chunks equal to the number of CPU cores and then each CPU Core processes those chunks as individual threads in GPUs?
I will be surprised if there is such a way. grep is not like a matrix multiplication where you do exactly the same machine code instruction for every byte. On the contrary, grep does a lot of optimization for different situations (e.g. if current byte does not match, skip this many bytes ahead).
So while you may call this Same Command Multiple Data (SCMD), it does not qualify as SIMD at the machine code level.
That does not mean that there is no way to convert grep into real SIMD, but this is not going to be automatic. You will have to rewrite grep using algorithms that are suitable for GPUs. And that can clearly be done: https://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/competition/bkase.github.com/CUDA-grep/finalreport.html
If you want to convert another tool than grep you will again have to rewrite that tool. Possibly using some of the algorithms that you used for grep, but not necessarily: It might be that you have to use completely different algorithms.
Normally you will be limited by your disk (your disk is slow, grep is fast).
If you have really fast disks try:
parallel -a filename -k --pipepart --block -1 LC_ALL=C grep '/some_pattern/'
--pipe can deliver in the order of 100MB/s total. --pipepart can deliver in the order of 1 GB/s per core (and usually your disks cannot deliver 1 GB/s/core). --block -1 chops filename into one block per job on the fly.
Unfortunately you lose the ability to see the line number (so grep -n will give the wrong answer).
If your grep is still limited by CPU, then you should probably ask another question and elaborate on why your grep is so CPU intense.

Optimize bash command to count all lines in HDFS txt files

Summary:
I need to count all unique lines in all .txt files in a HDFS instance.
Total size of .txt files ~450GB.
I use this bash command:
hdfs dfs -cat /<top-level-dir>/<sub-dir>/*/*/.txt | cut -d , -f 1 | sort --parallel=<some-number> | uniq | wc -l
The problem is that this command takes all free ram and the HDFS instance exits with code 137 (out of memory).
Question:
Is there any way I can limit the ram usage of this entire command to let's say half of what's free in the hdfs OR somehow clean the memory while the command is still running?
Update:
I need to remove | sort | because it is a merge sort implementation so O(n) space complexity.
I can use only | uniq | without | sort |.
Some things you can try to limit sort's memory consumption:
Use sort -u instead of sort | uniq. That way sort has a chance to remove duplicates on the spot instead of having to keep them until the end. 🞵
Write the input to a file and sort the file instead of running sort in a pipe. Sorting pipes is slower than sorting files and I assume that sorting pipes requires more memory than sorting files:
hdfs ... | cut -d, -f1 > input && sort -u ... input | wc -l
Set the buffer size manually using -S 2G. The size buffer is shared between all threads. The size specified here roughly equals the overall memory consumption when running sort.
Change the temporary directory using -T /some/dir/different/from/tmp. On many linux systems /tmp is a ramdisk so be sure to use the actual hard drive.
If the hard disk is not an option you could also try --compress-program=PROG to compress sort's temporary files. I'd recommend a fast compression algorithm like lz4.
Reduce parallelism using --parallel=N as more threads need more memory. With a small buffer too much threads are less efficient.
Merge at most two temporary files at once using --batch-size=2.
🞵 I assumed that sort was smart enough to immediately remove sequential duplicates in the unsorted input. However, from my experiments it seems that (at least) sort (GNU coreutils) 8.31 does not.
If you know that your input contains a lot of sequential duplicates as in the input generated by the following commands …
yes a | head -c 10m > input
yes b | head -c 10m >> input
yes a | head -c 10m >> input
yes b | head -c 10m >> input
… then you can drastically save resources on sort by using uniq first:
# takes 6 seconds and 2'010'212 kB of memory
sort -u input
# takes less than 1 second and 3'904 kB of memory
uniq input > preprocessed-input &&
sort -u preprocessed-input
Times and memory usage were measured using GNU time 1.9-2 (often installed in /usr/bin/time) and its -v option. My system has an Intel Core i5 M 520 (two cores + hyper-threading) and 8 GB memory.
Reduce number of sorts run in parallel.
From info sort:
--parallel=N: Set the number of sorts run in parallel to N. By default, N is set
to the number of available processors, but limited to 8, as there
are diminishing performance gains after that. Note also that using
N threads increases the memory usage by a factor of log N.
it runs out of memory.
From man sort:
--batch-size=NMERGE
merge at most NMERGE inputs at once; for more use temp files
--compress-program=PROG
compress temporaries with PROG; decompress them with PROG -d-T,
-S, --buffer-size=SIZE
use SIZE for main memory buffer
-T, --temporary-directory=DIR
use DIR for temporaries, not $TMPDIR or /tmp; multiple options
specify multiple directories
These are the options you could be looking into. Specify a temporary directory on the disc and specify buffer size ex. 1GB. So like sort -u -T "$HOME"/tmp -S 1G.
Also as advised in other answers, use sort -u instead of sort | uniq.
Is there any way I can limit the ram usage of this entire command to let's say half of what's free in the hdfs
Kind-of, use -S option. You could sort -S "$(free -t | awk '/Total/{print $4}')".

Is it safe to pipe output of gnu parallel to a single file or pipe

With a construct similar to
find . -type f -name '*log' \
| parallel grep 'somestuff'
| moreComplexLineRearrangementScript
| sort
I am wondering whether the moreComplexLineRearrangementScript has the risk to see garbled lines due to the fact that several grep instance write into the same pipe without any buffer synchronization.
Can this be a problem for naive uses of grep as above or can I rely on the fact that grep's implementation writes lines always with a flush()?
If it not were grep, can there be some magic in parallel that does the flush()?
Is there a way to use parallel that guarantees lines to be kept intact --- apart from redirecting each parallel process' output to a separate file and then go from there?
By default, GNU Parallel buffers output by job, so the output from different jobs is not all mixed up, that is:
parallel --group
If you want GNU Parallel to do line-at-a-time output, possibly mixing output from different jobs, but always in whole lines, use:
parallel --line-buffer
If you like your output really higgeldy-piggeldy and all mixed up even mid-line, use:
parallel --ungroup

GNU Parallel: split file into children

Goal
Use GNU Parallel to split a large .gz file into children. Since the server has 16 CPUs, create 16 children. Each child should contain, at most, N lines. Here, N = 104,214,420 lines. Children should be in .gz format.
Input File
name: file1.fastq.gz
size: 39 GB
line count: 1,667,430,708 (uncompressed)
Hardware
36 GB Memory
16 CPUs
HPCC environment (I'm not admin)
Code
Version 1
zcat "${input_file}" | parallel --pipe -N 104214420 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
Three days later, the job was not finished. split_log.txt was empty. No children were visible in the output directory. Log files indicated that Parallel had increased the --block-size from 1 MB (the default) to over 2 GB. This inspired me to change my code to Version 2.
Version 2
# --block-size 3000000000 means a single record could be 3 GB long. Parallel will increase this value if needed.
zcat "${input_file}" | "${parallel}" --pipe -N 104214420 --block-size 3000000000 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
The job has been running for ~2 hours. split_log.txt is empty. No children are visible in the output directory yet. So far, log files show the following warning:
parallel: Warning: --blocksize >= 2G causes problems. Using 2G-1.
Questions
How can my code be improved ?
Is there a faster way to accomplish this goal ?
Let us assume that the file is a fastq file, and that the record size therefore is 4 lines.
You tell that to GNU Parallel with -L 4.
In a fastq file the order does not matter, so you want to pass blocks of n*4 lines to the children.
To do that efficiently you use --pipe-part, except --pipe-part does not work with compressed files and does not work with -L, so you have to settle for --pipe.
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
This will pass a block to 16 children, and a block defaults to 1 MB, which is chopped at a record boundary (i.e. 4 lines). It will run a job for each block. But what you really want is to have the input passed to only 16 jobs in total, and you can do that round robin. Unfortunately there is an element of randomness in --round-robin, so --resume-failed will not work:
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --round-robin "gzip > ${input_file}_child_{#}.gz"
parallel will be struggling to keep up with the 16 gzips, but you should be able to compress 100-200 MB/s.
Now if you had the fastq-file uncompressed we can do it even faster, but we will have to cheat a little: Often in fastq files you will have a seqname that starts the same string:
#EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
#EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
#EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
Here it is #EAS54_6_R. Unfortunately this is also a valid string in the quality line (which is a really dumb design), but in practice we would be extremely surprised to see a quality line starting with #EAS54_6_R. It just does not happen.
We can use that to our advantage, because now you can use \n followed by #EAS54_6_R as a record separator, and then we can use --pipe-part. The added benefit is that the order will remain the same. Here you would have to give the block size to 1/16 of the size of file1-fastq:
parallel -a file1.fastq --block <<1/16th of the size of file1.fastq>> -j16 --pipe-part --recend '\n' --recstart '#EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
If you use GNU Parallel 20161222 then GNU Parallel can do that computation for you. --block -1 means: Choose a block-size so that you can give one block to each of the 16 jobslots.
parallel -a file1.fastq --block -1 -j16 --pipe-part --recend '\n' --recstart '#EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
Here GNU Parallel will not be the limiting factor: It can easily transfer 20 GB/s.
It is annoying having to open the file to see what the recstart value should be, so this will work in most cases:
parallel -a file1.fastq --pipe-part --block -1 -j16
--regexp --recend '\n' --recstart '#.*\n[A-Za-z\n\.~]'
my_command
Here we assume that the lines will start like this:
#<anything>
[A-Za-z\n\.~]<anything>
<anything>
<anything>
Even if you have a few quality lines starting with '#', then they will never be followed by a line starting with [A-Za-z\n.~], because a quality line is always followed by the seqname line, which starts with #.
You could also have a block size so big that it corresponded to 1/16 of the uncompressed file, but that would be a bad idea:
You would have to be able to keep the full uncompressed file in RAM.
The last gzip will only be started after the last byte had been read (and the first gzip will probably be done by then).
By setting the number of records to 104214420 (using -N) this is basically what you are doing, and your server is probably struggling with keeping the 150 GB of uncompressed data in its 36 GB of RAM.
Paired end poses a restriction: The order does not matter, but the order must be predictable for different files. E.g. record n in file1.r1.fastq.gz must match record n in file1.r2.fastq.gz.
split -n r/16 is very efficient for doing simple round-robin. It does, however, not support multiline records. So we insert \0 as a record separator after every 4th line, which we remove after the splitting. --filter runs a command on the input, so we do not need to save the uncompressed data:
doit() { perl -pe 's/\0//' | gzip > $FILE.gz; }
export -f doit
zcat big.gz | perl -pe '($.-1)%4 or print "\0"' | split -t '\0' -n r/16 --filter doit - big.
Filenames will be named big.aa.gz .. big.ap.gz.

Using GNU Parallel With Split

I'm loading a pretty gigantic file to a postgresql database. To do this I first use split in the file to get smaller files (30Gb each) and then I load each smaller file to the database using GNU Parallel and psql copy.
The problem is that it takes about 7 hours to split the file, and then it starts to load a file per core. What I need is a way to tell split to print the file name to std output each time it finishes writing a file so I can pipe it to Parallel and it starts loading the files at the time split finish writing it. Something like this:
split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}
I have read the split man pages and I can't find anything. Is there a way to do this with split or any other tool?
You could let parallel do the splitting:
<2011.psv parallel --pipe -N 50000000 ./carga_postgres.sh
Note, that the manpage recommends using --block over -N, this will still split the input at record separators, \n by default, e.g.:
<2011.psv parallel --pipe --block 250M ./carga_postgres.sh
Testing --pipe and -N
Here's a test that splits a sequence of 100 numbers into 5 files:
seq 100 | parallel --pipe -N23 'cat > /tmp/parallel_test_{#}'
Check result:
wc -l /tmp/parallel_test_[1-5]
Output:
23 /tmp/parallel_test_1
23 /tmp/parallel_test_2
23 /tmp/parallel_test_3
23 /tmp/parallel_test_4
8 /tmp/parallel_test_5
100 total
If you use GNU split, you can do this with the --filter option
‘--filter=command’
With this option, rather than simply writing to each output file, write through a pipe to the specified shell command for each output file. command should use the $FILE environment variable, which is set to a different output file name for each invocation of the command.
You can create a shell script, which creates a file and start carga_postgres.sh at the end in the background
#! /bin/sh
cat >$FILE
./carga_postgres.sh $FILE &
and use that script as the filter
split -l 50000000 --filter=./filter.sh 2011.psv

Resources