bash: split ascii file into n parts; iterate over ONLY those files - bash

I have an ASCII file of a few thousand lines, processed one line at a time by a bash script. Because the processing is embarrassingly parallel, I'd like to split the file into parts of roughly the same size, preserving line breaks, one part per CPU core. Unfortunately the file suffixes made by split r/numberOfCores aren't easily iterated over.
split --numeric-suffixes=1 r/42 ... makes files foo.01, foo.02, ..., foo.42, which can be iterated over with for i in `seq -w 1 42 ` because -w adds a leading zero). But if the 42 changes to something smaller than 10, the files still have the leading zero but the seq doesn't, so it fails. This concern is valid, because nowadays some PCs have fewer than 10 cores, some more than 10. A ghastly workaround:
[[ $numOfCores < 10 ]] && optionForSeq="" || optionForSeq="-w"
The naive solution for f in foo.* is risky: the wildcard might match files other than the ones that split made.
An ugly way to make the suffixes seq-friendly, but with the same risk:
split -n r/numOfCores infile foo.
for i in `seq 1 $numOfCores`; do
mv `ls foo.* | head -1` newPrefix.$i
done
for i in `seq 1 $numofCores`; do
... newPrefix.$i ...
done
Is there a cleaner, robust way of splitting the file into n parts, where 1<=n<=64 isn't known until runtime, and then iterating over those parts? split only into a freshly created directory?
(Edit: To clarify "if the 42 changes to something smaller than 10," the same code should work on a PC with 8 cores and on another PC with 42 cores.)
A seq-based solution is clunky. A wildcard-based solution is risky. Is there an alternative to split? (csplit with line numbers would be even clunkier.) A gawk one-liner?

How about using a format string with seq?
$ seq -f '%02g' 1 4
01
02
03
04
$ seq -f '%02g' 1 12
01
02
03
...
09
10
11
12

With GNU bash 4:
Use printf to format your numbers:
for ((i=1;i<=4;i++)); do printf -v num "%02d" $i; echo "$num"; done
Output:
01
02
03
04

Are you sure this is not a job for GNU Parallel?
cat file | parallel --pipe -N1 myscript_that_reads_one_line_from_stdin
This way you do not need to have the temporary files at all.
If your script can read more than one line (so it is in practice a UNIX filter), then this should be very close to optimal:
parallel --pipepart -k --roundrobin -a file myscript_that_reads_from_stdin
It will spawn one job per core and split file into one part per core on the fly. If some lines are harder to process than others (i.e. you can get "stuck" for a while on a single line), then this solution might be better:
parallel --pipepart -k -a file myscript_that_reads_from_stdin
It will spawn one job per core and split file into 10 part per core on the fly, thus running on average 10 jobs per core in total.
GNU Parallel is a general parallelizer and makes is easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straight forward way to parallelize is to run 8 jobs on each CPU:
GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Get the filenames with ls and then use a regex:
for n in $(ls foo.* |grep "^foo\.[0-9][0-9]*$") ; do

Related

Concatenate specific number of files

I have a bunch of files named uv_set_XXXXXXXX where the 6 Xs stand for the usual format year, month and day. Imagine I have 325 files of this type. I would like to concatenate by groups of 50 files, so in the end I have 7 files (6 files of 50 and 1 of 25).
I have been thinking in using cat but I can't see an option to select a number of files from a list. I could do this with Python, but just wondering if some Unix command line utility does it more directly.
Thanks.
With GNU parallel you can use the following command
parallel -n50 "cat {} > out{#}" ::: uv_set_*
This will merge the first 50 files into out1, the next 50 files into out2, and so on.
I would just break down and do this in Awk.
awk 'FNR==1 && (++i%50 == 0) {
if(NR>1) close p;
p = "dest_" ++j }
{ print >p }' uv_set_????????
This creates files dest_1 through dest_7, the first 6 with 50 files in each and the last with the remainder.
Closing the previous file is necessary because the system only allows Awk to have a limited number of open file handles (though the limit is typically higher than 7 so it's probably not important in your example).
Thinking out loud dept, just to prevent anyone else from wasting time on repeating this dead end.
You could use xargs -L 50 cat to concatenate 50 files at a time, but there is no simple way to pass in a new redirection for standard output for each invocation. You could try to hack your way around that with something like
# XXX Do not use: incomplete
printf '%s\n' uv_set_???????? |
xargs -L 50 sh -c 'cat "$#" > ... something' _
but I can't come up with elegant way to have a different something each time.

GNU Parallel: split file into children

Goal
Use GNU Parallel to split a large .gz file into children. Since the server has 16 CPUs, create 16 children. Each child should contain, at most, N lines. Here, N = 104,214,420 lines. Children should be in .gz format.
Input File
name: file1.fastq.gz
size: 39 GB
line count: 1,667,430,708 (uncompressed)
Hardware
36 GB Memory
16 CPUs
HPCC environment (I'm not admin)
Code
Version 1
zcat "${input_file}" | parallel --pipe -N 104214420 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
Three days later, the job was not finished. split_log.txt was empty. No children were visible in the output directory. Log files indicated that Parallel had increased the --block-size from 1 MB (the default) to over 2 GB. This inspired me to change my code to Version 2.
Version 2
# --block-size 3000000000 means a single record could be 3 GB long. Parallel will increase this value if needed.
zcat "${input_file}" | "${parallel}" --pipe -N 104214420 --block-size 3000000000 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
The job has been running for ~2 hours. split_log.txt is empty. No children are visible in the output directory yet. So far, log files show the following warning:
parallel: Warning: --blocksize >= 2G causes problems. Using 2G-1.
Questions
How can my code be improved ?
Is there a faster way to accomplish this goal ?
Let us assume that the file is a fastq file, and that the record size therefore is 4 lines.
You tell that to GNU Parallel with -L 4.
In a fastq file the order does not matter, so you want to pass blocks of n*4 lines to the children.
To do that efficiently you use --pipe-part, except --pipe-part does not work with compressed files and does not work with -L, so you have to settle for --pipe.
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
This will pass a block to 16 children, and a block defaults to 1 MB, which is chopped at a record boundary (i.e. 4 lines). It will run a job for each block. But what you really want is to have the input passed to only 16 jobs in total, and you can do that round robin. Unfortunately there is an element of randomness in --round-robin, so --resume-failed will not work:
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --round-robin "gzip > ${input_file}_child_{#}.gz"
parallel will be struggling to keep up with the 16 gzips, but you should be able to compress 100-200 MB/s.
Now if you had the fastq-file uncompressed we can do it even faster, but we will have to cheat a little: Often in fastq files you will have a seqname that starts the same string:
#EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
#EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
#EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
Here it is #EAS54_6_R. Unfortunately this is also a valid string in the quality line (which is a really dumb design), but in practice we would be extremely surprised to see a quality line starting with #EAS54_6_R. It just does not happen.
We can use that to our advantage, because now you can use \n followed by #EAS54_6_R as a record separator, and then we can use --pipe-part. The added benefit is that the order will remain the same. Here you would have to give the block size to 1/16 of the size of file1-fastq:
parallel -a file1.fastq --block <<1/16th of the size of file1.fastq>> -j16 --pipe-part --recend '\n' --recstart '#EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
If you use GNU Parallel 20161222 then GNU Parallel can do that computation for you. --block -1 means: Choose a block-size so that you can give one block to each of the 16 jobslots.
parallel -a file1.fastq --block -1 -j16 --pipe-part --recend '\n' --recstart '#EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
Here GNU Parallel will not be the limiting factor: It can easily transfer 20 GB/s.
It is annoying having to open the file to see what the recstart value should be, so this will work in most cases:
parallel -a file1.fastq --pipe-part --block -1 -j16
--regexp --recend '\n' --recstart '#.*\n[A-Za-z\n\.~]'
my_command
Here we assume that the lines will start like this:
#<anything>
[A-Za-z\n\.~]<anything>
<anything>
<anything>
Even if you have a few quality lines starting with '#', then they will never be followed by a line starting with [A-Za-z\n.~], because a quality line is always followed by the seqname line, which starts with #.
You could also have a block size so big that it corresponded to 1/16 of the uncompressed file, but that would be a bad idea:
You would have to be able to keep the full uncompressed file in RAM.
The last gzip will only be started after the last byte had been read (and the first gzip will probably be done by then).
By setting the number of records to 104214420 (using -N) this is basically what you are doing, and your server is probably struggling with keeping the 150 GB of uncompressed data in its 36 GB of RAM.
Paired end poses a restriction: The order does not matter, but the order must be predictable for different files. E.g. record n in file1.r1.fastq.gz must match record n in file1.r2.fastq.gz.
split -n r/16 is very efficient for doing simple round-robin. It does, however, not support multiline records. So we insert \0 as a record separator after every 4th line, which we remove after the splitting. --filter runs a command on the input, so we do not need to save the uncompressed data:
doit() { perl -pe 's/\0//' | gzip > $FILE.gz; }
export -f doit
zcat big.gz | perl -pe '($.-1)%4 or print "\0"' | split -t '\0' -n r/16 --filter doit - big.
Filenames will be named big.aa.gz .. big.ap.gz.

How to split files up and process them in parallel and then stitch them back? unix

I have a text file infile.txt as such:
abc what's the foo bar.
foobar hello world, hhaha cluster spatio something something.
xyz trying to do this in parallel
kmeans you're mean, who's mean?
Each line in the file will be processed by this perl command into the out.txt
`cat infile.txt | perl dosomething > out.txt`
Imagine if the textfile is 100,000,000 lines. I want to parallelize the bash command so i tried something like this:
$ mkdir splitfiles
$ mkdir splitfiles_processed
$ cd splitfiles
$ split -n3 ../infile.txt
$ for i in $(ls); do "cat $i | perl dosomething > ../splitfiles_processed/$i &"; done
$ wait
$ cd ../splitfiles_processed
$ cat * > ../infile_processed.txt
But is there a less verbose way to do the same?
The answer from #Ulfalizer gives you a good hint about the solution, but it lacks some details.
You can use GNU parallel (apt-get install parallel on Debian)
So your problem can be solved using the following command:
cat infile.txt | parallel -l 1000 -j 10 -k --spreadstdin perl dosomething > result.txt
Here is the meaning of the arguments:
-l 1000: send 1000 lines blocks to command
-j 10: launch 10 jobs in parallel
-k: keep sequence of output
--spreadstdin: sends the above 1000 line block to the stdin of the command
I've never tried it myself, but GNU parallel might be worth checking out.
Here's an excerpt from the man page (parallel(1)) that's similar to what you're currently doing. It can split the input in other ways too.
EXAMPLE: Processing a big file using more cores
To process a big file or some output you can use --pipe to split up
the data into blocks and pipe the blocks into the processing program.
If the program is gzip -9 you can do:
cat bigfile | parallel --pipe --recend '' -k gzip -9 >bigfile.gz
This will split bigfile into blocks of 1 MB and pass that to gzip -9
in parallel. One gzip will be run per CPU core. The output of gzip -9
will be kept in order and saved to bigfile.gz
Whether this is worthwhile depends on how CPU-intensive your processing is. For simple scripts you'll spend most of the time shuffling data to and from the disk, and parallelizing won't get you much.
You can find some introductory videos by the GNU Parallel author here.
Assuming your limiting factor is NOT your disk, you can do this in perl with fork() and specifically Parallel::ForkManager:
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
my $max_forks = 8; #2x procs is usually optimal
sub process_line {
#do something with this line
}
my $fork_manager = Parallel::ForkManager -> new ( $max_forks );
open ( my $input, '<', 'infile.txt' ) or die $!;
while ( my $line = <$input> ) {
$fork_manager -> start and next;
process_line ( $line );
$fork_manager -> finish;
}
close ( $input );
$fork_manager -> wait_all_children();
The downside of doing something like this though is that of coalescing your output. Each parallel task doesn't necessarily finish in the sequence it started, so you have all sorts of potential problems regarding serialising the results.
You can work around these with something like flock but you need to be careful, as too many locking operations can take away your parallel advantage in the first place. (Hence my first statement - if your limiting factor is disk IO, then parallelism doesn't help very much at all anyway).
There's various possible solutions though - so much the wrote a whole chapter on it in the perl docs: perlipc - but keep in mind you can retrieve data with Parallel::ForkManager too.

Using GNU Parallel With Split

I'm loading a pretty gigantic file to a postgresql database. To do this I first use split in the file to get smaller files (30Gb each) and then I load each smaller file to the database using GNU Parallel and psql copy.
The problem is that it takes about 7 hours to split the file, and then it starts to load a file per core. What I need is a way to tell split to print the file name to std output each time it finishes writing a file so I can pipe it to Parallel and it starts loading the files at the time split finish writing it. Something like this:
split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}
I have read the split man pages and I can't find anything. Is there a way to do this with split or any other tool?
You could let parallel do the splitting:
<2011.psv parallel --pipe -N 50000000 ./carga_postgres.sh
Note, that the manpage recommends using --block over -N, this will still split the input at record separators, \n by default, e.g.:
<2011.psv parallel --pipe --block 250M ./carga_postgres.sh
Testing --pipe and -N
Here's a test that splits a sequence of 100 numbers into 5 files:
seq 100 | parallel --pipe -N23 'cat > /tmp/parallel_test_{#}'
Check result:
wc -l /tmp/parallel_test_[1-5]
Output:
23 /tmp/parallel_test_1
23 /tmp/parallel_test_2
23 /tmp/parallel_test_3
23 /tmp/parallel_test_4
8 /tmp/parallel_test_5
100 total
If you use GNU split, you can do this with the --filter option
‘--filter=command’
With this option, rather than simply writing to each output file, write through a pipe to the specified shell command for each output file. command should use the $FILE environment variable, which is set to a different output file name for each invocation of the command.
You can create a shell script, which creates a file and start carga_postgres.sh at the end in the background
#! /bin/sh
cat >$FILE
./carga_postgres.sh $FILE &
and use that script as the filter
split -l 50000000 --filter=./filter.sh 2011.psv

Very slow loop using grep or fgrep on large datasets

I’m trying to do something pretty simple; grep from a list, an exact match for the string, on the files in a directory:
#try grep each line from the files
for i in $(cat /data/datafile); do
LOOK=$(echo $i);
fgrep -r $LOOK /data/filestosearch >>/data/output.txt
done
The file with the matches to grep with has 20 million lines, and the directory has ~600 files, with a total of ~40Million lines
I can see that this is going to be slow but we estimated it will take 7 years. Even if I use 300 cores on our HPC splitting the job by files to search, it looks like it could take over a week.
there are similar questions:
Loop Running VERY Slow
:
Very slow foreach loop
here and although they are on different platforms, I think possibly if else might help me.
or fgrep which is potentially faster (but seems to be a bit slow as I'm testing it now)
Can anyone see a faster way to do this?
Thank you in advance
sounds like the -f flag for grep would be suitable here:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
so grep can already do what your loop is doing, and you can replace the loop with:
grep -F -r -f /data/datafile /data/filestosearch >>/data/output.txt
Now I'm not sure about the performance of 20 million patterns, but at least you aren't starting 20 million processes this way so it's probably significantly faster.
As Martin has already said in his answer, you should use the -f option instead of looping. I think it should be faster than looping.
Also, this looks like an excellent use case for GNU parallel. Check out this answer for usage examples. It looks difficult, but is actually quite easy to set up and run.
Other than that, 40 million lines should not be a very big deal for grep if there was only one string to match. It should be able to do it in a minute or two on any decent machine. I tested that 2 million lines takes 6 s on my laptop. So 40 mil lines should take 2 mins.
The problem is with the fact that there are 20 million strings to be matched. I think it must be running out of memory or something, especially when you run multiple instances of it on different directories. Can you try splitting the input match-list file? Try splitting it into chunks of 100000 words each for example.
EDIT: Just tried parallel on my machine. It is amazing. It automatically takes care of splitting the grep on to several cores and several machines.
Here's one way to speed things up:
while read i
do
LOOK=$(echo $i)
fgrep -r $LOOK /deta/filetosearch >> /data/output.txt
done < /data/datafile
When you do that for i in $(cat /data/datafile), you first spawn another process, but that process must cat out all of those lines before running the rest of the script. Plus, there's a good possibility that you'll overload the command line and lose some of the files on the end.
By using q while read loop and redirecting the input from /data/datafile, you eliminate the need to spawn a shell. Plus, your script will immediately start reading through the while loop without first having to cat out the entire /data/datafile.
If $i are a list of directories, and you are interested in the files underneath, I wonder if find might be a bit faster than fgrep -r.
while read i
do
LOOK=$(echo $i)
find $i -type f | xargs fgrep $LOOK >> /data/output.txt
done < /data/datafile
The xargs will take the output of find, and run as many files as possible under a single fgrep. The xargs can be dangerous if file names in those directories contain whitespace or other strange characters. You can try (depending upon the system), something like this:
find $i -type f -print0 | xargs --null fgrep $LOOK >> /data/output.txt
On the Mac it's
find $i -type f -print0 | xargs -0 fgrep $LOOK >> /data/output.txt
As others have stated, if you have the GNU version of grep, you can give it the -f flag and include your /data/datafile. Then, you can completely eliminate the loop.
Another possibility is to switch to Perl or Python which actually will run faster than the shell will, and give you a bit more flexibility.
Since you are searching for simple strings (and not regexp) you may want to use comm:
comm -12 <(sort find_this) <(sort in_this.*) > /data/output.txt
It takes up very little memory, whereas grep -f find_this can gobble up 100 times the size of 'find_this'.
On a 8 core this takes 100 sec on these files:
$ wc find_this; cat in_this.* | wc
3637371 4877980 307366868 find_this
16000000 20000000 1025893685
Be sure to have a reasonably new version of sort. It should support --parallel.
You can write perl/python script, that will do the job for you. It saves all the forks you need to do when you do this with external tools.
Another hint: you can combine strings that you are looking for in one regular expression.
In this case grep will do only one pass for all combined lines.
Example:
Instead of
for i in ABC DEF GHI JKL
do
grep $i file >> results
done
you can do
egrep "ABC|DEF|GHI|JKL" file >> results

Resources