Grepping a huge file (80GB) any way to speed it up? - bash

grep -i -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
This has been running for an hour on a fairly powerful linux server which is otherwise not overloaded.
Any alternative to grep? Anything about my syntax that can be improved, (egrep,fgrep better?)
The file is actually in a directory which is shared with a mount to another server but the actual diskspace is local so that shouldn't make any difference?
the grep is grabbing up to 93% CPU

Here are a few options:
1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.
2) Use fgrep because you're searching for a fixed string, not a regular expression.
3) Remove the -i option, if you don't need it.
So your command becomes:
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
It will also be faster if you copy your file to RAM disk.

If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:
< eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'
Depending on your disks and CPUs it may be faster to read larger blocks:
< eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'
It's not entirely clear from your question, but other options for grep include:
Dropping the -i flag.
Using the -F flag for a fixed string
Disabling NLS with LANG=C
Setting a max number of matches with the -m flag.

Some trivial improvement:
Remove the -i option, if you can, case insensitive is quite slow.
Replace the . by \.
A single point is the regex symbol to match any character, which is also slow
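A small sketch of why the unescaped dot matters, run against a throwaway two-line sample (the sample rows and variable names are made up for illustration). It compares the plain regex, the escaped dot, and -F:

```shell
# Hypothetical sample: one literal match and one near-miss where the
# character between "db_pd" and "Clients" is not a dot.
sample=$(mktemp)
printf '%s\n' \
    'INSERT INTO db_pd.Clients VALUES (1);' \
    'INSERT INTO db_pdXClients VALUES (2);' > "$sample"

# Unescaped '.' matches any character, so both lines are counted.
regex_hits=$(LC_ALL=C grep -c 'db_pd.Clients' "$sample")
# The escaped dot and -F (fixed string) match only the literal text.
escaped_hits=$(LC_ALL=C grep -c 'db_pd\.Clients' "$sample")
fixed_hits=$(LC_ALL=C grep -F -c 'db_pd.Clients' "$sample")

echo "regex=$regex_hits escaped=$escaped_hits fixed=$fixed_hits"
rm -f "$sample"
```

Besides any speed difference, the escaped or fixed-string form also avoids false positives.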

Two lines of attack:
Are you sure you need the -i, or is there a way to get rid of it?
Do you have more cores to play with? grep is single-threaded, so you might want to start more of them at different offsets.

< eightygigsfile.sql parallel -k -j120% -n10 -m grep -F -i -C 5 'db_pd.Clients'
If you need to search for multiple strings, grep -f strings.txt saves a ton of time. The above is a translation of something that I am currently testing. The -j and -n option values seemed to work best for my use case. The -F grep also made a big difference.

Try ripgrep
It provides much better results compared to grep.

All the above answers were great. What really did help me on my 111GB file was using the LC_ALL=C fgrep -m < maxnum > fixed_string filename.
However, sometimes there may be 0 or more repeating patterns, in which case calculating the maxnum isn't possible. The workaround is to use the start and end patterns for the event(s) you are trying to process, and then work on the line numbers between them. Like so -
startline=$(grep -n -m 1 "$start_pattern" file | awk -F: '{print $1}')
endline=$(grep -n -m 1 "$end_pattern" file | awk -F: '{print $1}')
logs=$(tail -n +"$startline" file | head -n "$((endline - startline + 1))")
Then work on this subset of logs!
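The start/end approach can be sketched as a small self-contained helper. The function name extract_between, the sample log, and the choice to treat both patterns as fixed strings are assumptions for illustration, not part of the answer above:

```shell
# extract_between START END FILE
# Prints the lines from the first START match through the first END match.
# Hypothetical helper; both patterns are treated as fixed strings (-F).
extract_between() {
    startline=$(LC_ALL=C grep -F -n -m 1 -- "$1" "$3" | cut -d: -f1)
    endline=$(LC_ALL=C grep -F -n -m 1 -- "$2" "$3" | cut -d: -f1)
    # Bail out if either marker is missing or the end precedes the start.
    [ -n "$startline" ] && [ -n "$endline" ] || return 1
    [ "$endline" -ge "$startline" ] || return 1
    tail -n "+$startline" "$3" | head -n "$((endline - startline + 1))"
}

# Made-up log for demonstration.
log=$(mktemp)
printf '%s\n' 'noise' 'BEGIN job 42' 'step 1' 'step 2' 'END job 42' \
    'more noise' > "$log"
subset=$(extract_between 'BEGIN job 42' 'END job 42' "$log")
printf '%s\n' "$subset"
rm -f "$log"
```

Because -m 1 lets each grep stop at its first hit, only the part of the file up to the end marker has to be scanned.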

hmm…… what speeds do you need ? i created a synthetic 77.6 GB file with nearly 525 mn rows with plenty of unicode :
rows = 524759550. | UTF8 chars = 54008311367. | bytes = 83332269969.
and randomly selected rows at an avg. rate of 1 every 3^5, using rand() not just NR % 243, to place the string db_pd.Clients at a random position in the middle of the existing text, totaling 2.16 mn rows where the regex pattern hits
rows = 2160088. | UTF8 chars = 42286394. | bytes = 42286394.
% dtp; pvE0 < testfile_gigantic_001.txt|
mawk2 '
_^(_<_)<NF { print (__=NR-(_+=(_^=_<_)+(++_)))<!_\
?_~_:__,++__+_+_ }' FS='db_pd[.]Clients' OFS=','
in0: 77.6GiB 0:00:59 [1.31GiB/s] [1.31GiB/s] [===>] 100%
out9: 40.3MiB 0:00:59 [ 699KiB/s] [ 699KiB/s] [ <=> ]
524755459,524755470
524756132,524756143
524756326,524756337
524756548,524756559
524756782,524756793
524756998,524757009
524757361,524757372
And mawk2 took just 59 seconds to extract out a list of row ranges it needs. From there it should be relatively trivial. Some overlapping may exist.
At throughput rates of 1.3GiB/s, as seen above calculated by pv, it might even be detrimental to use utils like parallel to split the tasks.

Related

GNU Parallel: split file into children

Goal
Use GNU Parallel to split a large .gz file into children. Since the server has 16 CPUs, create 16 children. Each child should contain, at most, N lines. Here, N = 104,214,420 lines. Children should be in .gz format.
Input File
name: file1.fastq.gz
size: 39 GB
line count: 1,667,430,708 (uncompressed)
Hardware
36 GB Memory
16 CPUs
HPCC environment (I'm not admin)
Code
Version 1
zcat "${input_file}" | parallel --pipe -N 104214420 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
Three days later, the job was not finished. split_log.txt was empty. No children were visible in the output directory. Log files indicated that Parallel had increased the --block-size from 1 MB (the default) to over 2 GB. This inspired me to change my code to Version 2.
Version 2
# --block-size 3000000000 means a single record could be 3 GB long. Parallel will increase this value if needed.
zcat "${input_file}" | "${parallel}" --pipe -N 104214420 --block-size 3000000000 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
The job has been running for ~2 hours. split_log.txt is empty. No children are visible in the output directory yet. So far, log files show the following warning:
parallel: Warning: --blocksize >= 2G causes problems. Using 2G-1.
Questions
How can my code be improved ?
Is there a faster way to accomplish this goal ?
Let us assume that the file is a fastq file, and that the record size therefore is 4 lines.
You tell that to GNU Parallel with -L 4.
In a fastq file the order does not matter, so you want to pass blocks of n*4 lines to the children.
To do that efficiently you use --pipe-part, except --pipe-part does not work with compressed files and does not work with -L, so you have to settle for --pipe.
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --resume-failed "gzip > ${input_file}_child_{#}.gz"
This will pass a block to 16 children, and a block defaults to 1 MB, which is chopped at a record boundary (i.e. 4 lines). It will run a job for each block. But what you really want is to have the input passed to only 16 jobs in total, and you can do that round robin. Unfortunately there is an element of randomness in --round-robin, so --resume-failed will not work:
zcat file1.fastq.gz |
parallel -j16 --pipe -L 4 --joblog split_log.txt --round-robin "gzip > ${input_file}_child_{#}.gz"
parallel will be struggling to keep up with the 16 gzips, but you should be able to compress 100-200 MB/s.
Now if you had the fastq-file uncompressed we can do it even faster, but we will have to cheat a little: Often in fastq files you will have a seqname that starts with the same string:
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333
Here it is @EAS54_6_R. Unfortunately this is also a valid string in the quality line (which is a really dumb design), but in practice we would be extremely surprised to see a quality line starting with @EAS54_6_R. It just does not happen.
We can use that to our advantage, because now you can use \n followed by @EAS54_6_R as a record separator, and then we can use --pipe-part. The added benefit is that the order will remain the same. Here you would have to set the block size to 1/16 of the size of file1.fastq:
parallel -a file1.fastq --block <<1/16th of the size of file1.fastq>> -j16 --pipe-part --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
If you use GNU Parallel 20161222 then GNU Parallel can do that computation for you. --block -1 means: Choose a block-size so that you can give one block to each of the 16 jobslots.
parallel -a file1.fastq --block -1 -j16 --pipe-part --recend '\n' --recstart '@EAS54_6_R' --joblog split_log.txt "gzip > ${input_file}_child_{#}.gz"
Here GNU Parallel will not be the limiting factor: It can easily transfer 20 GB/s.
It is annoying having to open the file to see what the recstart value should be, so this will work in most cases:
parallel -a file1.fastq --pipe-part --block -1 -j16 \
--regexp --recend '\n' --recstart '@.*\n[A-Za-z\n\.~]' \
my_command
Here we assume that the lines will start like this:
@<anything>
[A-Za-z\n\.~]<anything>
<anything>
<anything>
Even if you have a few quality lines starting with '@', then they will never be followed by a line starting with [A-Za-z\n.~], because a quality line is always followed by the seqname line, which starts with @.
You could also have a block size so big that it corresponded to 1/16 of the uncompressed file, but that would be a bad idea:
You would have to be able to keep the full uncompressed file in RAM.
The last gzip will only be started after the last byte had been read (and the first gzip will probably be done by then).
By setting the number of records to 104214420 (using -N) this is basically what you are doing, and your server is probably struggling with keeping the 150 GB of uncompressed data in its 36 GB of RAM.
Paired end poses a restriction: The order does not matter, but the order must be predictable for different files. E.g. record n in file1.r1.fastq.gz must match record n in file1.r2.fastq.gz.
split -n r/16 is very efficient for doing simple round-robin. It does, however, not support multiline records. So we insert \0 as a record separator after every 4th line, which we remove after the splitting. --filter runs a command on the input, so we do not need to save the uncompressed data:
doit() { perl -pe 's/\0//' | gzip > $FILE.gz; }
export -f doit
zcat big.gz | perl -pe '($.-1)%4 or print "\0"' | split -t '\0' -n r/16 --filter doit - big.
Filenames will be named big.aa.gz .. big.ap.gz.
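If compressing the children one after another is acceptable, the "at most N lines per child" goal can also be sketched with split alone. This is a different technique from the round-robin approach above: it keeps the original record order but does not use several CPUs at once. The toy input, the records_per_child value and the child_ prefix are made up for the demo, and --filter and -d assume GNU split:

```shell
workdir=$(mktemp -d)
# Made-up toy input: 5 four-line records, gzip-compressed.
for i in 1 2 3 4 5; do
    printf '@read%s\nACGT\n+\n;;;;\n' "$i"
done | gzip > "$workdir/toy.fastq.gz"

records_per_child=2   # hypothetical; the question's N is 104214420 lines
# Cut every records_per_child * 4 lines so no record is split, and let
# split hand each chunk to gzip; split sets $FILE to the chunk name.
gzip -dc "$workdir/toy.fastq.gz" |
    split -l "$((records_per_child * 4))" -d \
        --filter='gzip > "$FILE.gz"' - "$workdir/child_"

children=$(ls "$workdir" | grep -c '^child_')
last_lines=$(gzip -dc "$workdir/child_02.gz" | wc -l)
echo "children=$children lines_in_last=$last_lines"
rm -rf "$workdir"
```

The line count per child must be a multiple of 4 for this to respect record boundaries; 104214420 is.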

`uniq` without sorting an immense text file?

I have a stupidly large text file (i.e. 40 gigabytes as of today) that I would like to filter for unique lines without sorting the file.
The file has unix line endings, and all content matches [[:print:]]. I tried the following awk script to display only unique lines:
awk 'a[$0] {next} 1' stupid.txt > less_stupid.txt
The thought was that I'd populate an array by referencing its elements, using the contents of the file as keys, then skip lines that were already in the array. But this fails for two reasons -- firstly, because it inexplicably just doesn't work (even on small test files), and secondly because I know that my system will run out of memory before the entire set of unique lines is loaded into memory by awk.
After searching, I found this answer which recommended:
awk '!x[$0]++'
And while this works on small files, it also will run out of memory before reading my entire file.
What's a better (i.e. working) solution? I'm open to just about anything, though I'm more partial to solutions in languages I know (bash & awk, hence the tags). In trying to visualize the problem, the best I've come up with would be to store an array of line checksums or MD5s rather than the lines themselves, but that only saves a little space and runs the risk of checksum collisions.
Any tips would be very welcome. Telling me this is impossible would also be welcome, so that I stop trying to figure it out. :-P
The awk '!x[$0]++' trick is one of the most elegant solutions to de-duplicate a file or stream without sorting. However, it is inefficient in terms of memory and unsuitable for large files, since it saves all unique lines into memory.
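For readers unfamiliar with the idiom, here is a tiny sketch of how it behaves on made-up input (the array name seen is arbitrary):

```shell
# seen[$0]++ evaluates to the previous count (0 on first sight), so the
# negation is true only the first time a line appears; awk then prints it.
result=$(printf '%s\n' one two three one four two five | awk '!seen[$0]++')
printf '%s\n' "$result"
```

Every distinct line becomes an array key, which is exactly why memory use grows with the number of unique lines.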
However, a much more efficient implementation would be to save a constant-length hash representation of the lines in the array rather than the whole line. You can achieve this with Perl in one line and it is quite similar to the awk script.
perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' huge.txt
Here I used md5_base64 instead of md5_hex because the base64 encoding takes 22 bytes, while the hex representation 32.
However, since the Perl implementation of hashes still requires around 120bytes for each key, you may quickly run out of memory for your huge file.
The solution in this case is to process the file in chunks, splitting manually or using GNU Parallel with the --pipe, --keep-order and --block options (taking advantage of the fact that duplicate lines are not far apart, as you mentioned). Here is how you could do it with parallel:
cat huge.txt | pv |
parallel --pipe --keep-order --block 100M -j4 -q \
perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' > uniq.txt
The --block 100M option tells parallel to process the input in chunks of 100MB. -j4 means start 4 processes in parallel. An important argument here is --keep-order since you want the unique lines output to remain in the same order. I have included pv in the pipeline to get some nice statistics while the long running process is executing.
In a benchmark I performed with a random-data 1GB file, I reached a 130MB/sec throughput with the above settings, meaning you may de-duplicate your 40GB file in 4 minutes (if you have a sufficiently fast hard disk able to write at this rate).
Other options include:
Use an efficient trie structure to store keys and check for duplicates. For example a very efficient implementation is marisa-trie coded in C++ with wrappers in Python.
Sort your huge file with an external merge sort or distribution/bucket sort
Store your file in a database and use SELECT DISTINCT on an indexed column containing your lines or most efficiently md5_sums of your lines.
Or use bloom filters
Here is an example of using the Bloom::Faster module of Perl:
perl -e 'use Bloom::Faster; my $f = new Bloom::Faster({n => 100000000, e => 0.00001}); while(<>) { print unless $f->add($_); }' huge.txt > uniq.txt
You may install Bloom::Faster from CPAN (run sudo cpan and then install Bloom::Faster).
Explanation:
You have to specify the probabilistic error rate e and the number of available buckets n. The memory required for each bucket is about 2.5 bytes. If your file has 100 million unique lines then you will need 100 million buckets and around 260MB of memory.
The $f->add($_) function adds the hash of a line to the filter and returns true if the key (i.e. the line here) is a duplicate.
You can get an estimation of the number of unique lines in your file, parsing a small section of your file with dd if=huge.txt bs=400M count=1 | awk '!a[$0]++' | wc -l (400MB) and multiplying that number by 100 (40GB). Then set the n option a little higher to be on the safe side.
In my benchmarks, this method achieved a 6MB/s processing rate. You may combine this approach with the GNU parallel suggestion above to utilize multiple cores and achieve a higher throughput.
I don't have your data (or anything like it) handy, so I can't test this, but here's a proof of concept for you:
$ t='one\ntwo\nthree\none\nfour\nfive\n'
$ printf "$t" | nl -w14 -nrz -s, | sort -t, -k2 -u | sort -n | cut -d, -f2-
one
two
three
four
five
Our raw data includes one duplicated line. The pipes function as follows:
nl adds line numbers. It's a standard, low-impact unix tool.
sort the first time 'round sorts on the SECOND field -- what would have been the beginning of the line before nl. Adjust this as required for your data.
sort the second time puts things back in the order defined by the nl command.
cut merely strips off the line numbers. There are multiple ways to do this, but some of them depend on your OS. This one's portable, and works for my example.
Now... For obscenely large files, the sort command will need some additional options. In particular, --buffer-size and --temporary-directory. Read man sort for details about this.
I can't say I expect this to be fast, and I suspect you'll be using a ginormous amount of disk IO, but I don't see why it wouldn't at least work.
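A sketch of the same pipeline wrapped in a function, with the extra sort options filled in. The function name, the 64M buffer and the scratch directory are placeholders to adjust; -S and -s are GNU sort options, -ba makes nl number blank lines too, and -s is added so that the first occurrence of each line is the one kept:

```shell
# dedupe_keep_order FILE SCRATCH_DIR
# Hypothetical wrapper: number the lines, drop duplicates by content,
# restore the original order, then strip the numbers again.
dedupe_keep_order() {
    nl -ba -w14 -nrz -s, "$1" |
        LC_ALL=C sort -s -t, -k2 -u -S 64M -T "$2" |
        LC_ALL=C sort -t, -k1,1n -S 64M -T "$2" |
        cut -d, -f2-
}

# Made-up input containing a comma and blank lines.
workdir=$(mktemp -d)
printf '%s\n' 'one' 'two,2' '' 'one' '' 'three' > "$workdir/input.txt"
deduped=$(dedupe_keep_order "$workdir/input.txt" "$workdir")
printf '%s\n' "$deduped"
rm -rf "$workdir"
```

Pointing -T at a disk with plenty of free space matters more than the buffer size once the input no longer fits in memory.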
Assuming you can sort the file in the first place (i.e. that you can get sort file to work) then I think something like this might work (depends on whether a large awk script file is better than a large awk array in terms of memory usage/etc.).
sort file | uniq -dc | awk '{gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + 2)"\"{x["NR"]++; if (x["NR"]>1){next}}"} END{print 7}' > dedupe.awk
awk -f dedupe.awk file
Which on a test input file like:
line 1
line 2
line 3
line 2
line 2
line 3
line 4
line 5
line 6
creates an awk script of:
$0=="line 2"{x[1]++; if (x[1]>1){next}}
$0=="line 3"{x[2]++; if (x[2]>1){next}}
7
and run as awk -f dedupe.awk file outputs:
line 1
line 2
line 3
line 4
line 5
line 6
If the size of the awk script itself is a problem (probably unlikely) you could cut that down by using another sentinel value something like:
sort file | uniq -dc | awk 'BEGIN{print "{f=1}"} {gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + 2)"\"{x["NR"]++;f=(x["NR"]<=1)}"} END{print "f"}'
which cuts seven characters off each line (six if you remove the space from the original too) and generates:
{f=1}
$0=="line 2"{x[1]++;f=(x[1]<=1)}
$0=="line 3"{x[2]++;f=(x[2]<=1)}
f
This solution will probably run slower though because it doesn't short-circuit the script as matches are found.
If runtime of the awk script is too great it might even be possible to improve the time by sorting the duplicate lines based on match count (but whether that matters is going to be fairly data dependent).
I'd do it like this:
#! /bin/sh
usage ()
{
echo "Usage: ${0##*/} <file> [<lines>]" >&2
exit 1
}
if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
if [ "$2" ]; then
expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
fi
LC_ALL=C
export LC_ALL
split -l ${2:-10000} -d -a 6 "$1"
for x in x*; do
awk '!x[$0]++' "$x" >"y${x}" && rm -f "$x"
done
cat $(sort -n yx*) | sort | uniq -d | \
while IFS= read -r line; do
fgrep -x -n "$line" /dev/null yx* | sort -n | sed 1d | \
while IFS=: read -r file nr rest; do
sed -i -e ${nr}d "$file"
done
done
cat $(sort -n yx*) >uniq_"$1" && rm -f yx*
(proof of concept; needs more polishing before being used in production).
What's going on here:
split splits the file in chunks of 10000 lines (configurable); the chunks are named x000000, x000001, ...
awk removes duplicates from each chunk, without messing with the line order; the resulting files are yx000000, yx000001, ... (since awk can't portably do changes in place)
cat $(sort -n yx*) | sort | uniq -d reassembles the chunks and finds a list of duplicates; because of the way the chunks were constructed, each duplicated line can appear at most once in each chunk
fgrep -x -n "$line" /dev/null yx* finds where each duplicated line lives; the result is a list of lines yx000005:23:some text
sort -n | sed 1d removes the first chunk from the list above (this is the first occurrence of the line, and it should be left alone)
IFS=: read -r file nr rest splits yx000005:23:some text into file=yx000005, nr=23, and the rest
sed -i -e ${nr}d "$file" removes line $nr from chunk $file
cat $(sort -n yx*) reassembles the chunks; they need to be sorted, to make sure they come in the right order.
This is probably not very fast, but I'd say it should work. Increasing the number of lines in each chunk from 10000 can speed things up, at the expense of using more memory. The operation is O(N^2) in the number of duplicate lines across chunks; with luck, this wouldn't be too large.
The above assumes GNU sed (for -i). It also assumes there are no files named x* or yx* in the current directory (that's the part that could use some cleanup, perhaps by moving the junk into a directory created by mktemp -d).
Edit: Second version, after feedback from @EtanReisner:
#! /bin/sh
usage ()
{
echo "Usage: ${0##*/} <file> [<lines>]" >&2
exit 1
}
if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
if [ "$2" ]; then
expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
fi
tdir=$(mktemp -d -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
dupes=$(mktemp -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
trap 'rm -rf "$tdir" "$dupes"' EXIT HUP INT QUIT TERM
LC_ALL=C
export LC_ALL
split -l ${2:-10000} -d -a 6 "$1" "${tdir}/x"
ls -1 "$tdir" | while IFS= read -r x; do
awk '!x[$0]++' "${tdir}/${x}" >"${tdir}/y${x}" && \
rm -f "${tdir}/$x" || exit 1
done
find "$tdir" -type f -name 'yx*' | \
xargs -n 1 cat | \
sort | \
uniq -d >"$dupes" || exit 1
find "$tdir" -type f -name 'yx*' -exec fgrep -x -n -f "$dupes" /dev/null {} + | \
sed 's!.*/!!' | \
sort -t: -n -k 1.3,1 -k 2,2 | \
perl -e '
while(<STDIN>) {
chomp;
m/^(yx\d+):(\d+):(.*)$/o;
if ($dupes{$3}++)
{ push @{$del{$1}}, int($2) }
else
{ $del{$1} = [] }
}
undef %dupes;
chdir $ARGV[0];
for $fn (sort <"yx*">) {
open $fh, "<", $fn
or die qq(open $fn: $!);
$line = $idx = 0;
while(<$fh>) {
$line++;
if ($idx < @{$del{$fn}} and $line == $del{$fn}->[$idx])
{ $idx++ }
else
{ print }
}
close $fh
or die qq(close $fn: $!);
unlink $fn
or die qq(remove $fn: $!);
}
' "$tdir" >uniq_"$1" || exit 1
If there's a lot of duplication, one possibility is to split the file using split(1) into manageable pieces and using something conventional like sort/uniq to make a summary of unique lines. This will be shorter than the actual piece itself. After this, you can compare the pieces to arrive at an actual summary.
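A minimal sketch of that split/summarize/merge idea on made-up data (the piece size and file names are placeholders). Note that the result comes out sorted rather than in the original order:

```shell
workdir=$(mktemp -d)
printf '%s\n' b a b c a d c e > "$workdir/input.txt"

# 1. Split into pieces small enough to sort comfortably (3 lines here).
split -l 3 "$workdir/input.txt" "$workdir/piece_"
# 2. Reduce each piece to its sorted unique lines.
for piece in "$workdir"/piece_*; do
    LC_ALL=C sort -u -o "$piece.uniq" "$piece"
done
# 3. Merge the already-sorted summaries, dropping duplicates across pieces.
unique_lines=$(LC_ALL=C sort -m -u "$workdir"/piece_*.uniq)
printf '%s\n' "$unique_lines"
rm -rf "$workdir"
```

The merge step (-m) streams through the summaries without loading them into memory, which is what makes this workable for inputs larger than RAM.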
Maybe not the answer you've been looking for but here goes: use a bloom filter.
https://en.wikipedia.org/wiki/Bloom_filter This sort of problem is one of the main reasons they exist.

how to split a file into smaller files (one file per line) [split doesn't work]

I'm trying to split a very large file to one new file per line.
Why? It's going to be input for Mahout. but there are too many lines and not enough suffixes for split.
Is there a way to do this in bash?
Increase Your Suffix Length with Split
If you insist on using split, then you have to increase your suffix length. For example, assuming you have 10,000 lines in your file:
split --suffix-length=5 --lines=1 foo.txt
If you really want to go nuts with this approach, you can even set the suffix length dynamically with the wc command and some shell arithmetic. For example:
file='foo.txt'
split \
--suffix-length=$(( $(wc --chars < <(wc --lines < "$file")) - 1 )) \
--lines=1 \
"$file"
Use Xargs Instead
However, the above is really just a kludge anyway. A more correct solution would be to use xargs from the GNU findutils package to invoke some command once per line. For example:
xargs --max-lines=1 --arg-file=foo.txt your_command
This will pass one line at a time to your command. This is a much more flexible approach and will dramatically reduce your disk I/O.
split --lines=1 --suffix-length=5 input.txt output.
This will use 5 characters per suffix, which is enough for 26^5 = 11881376 files. If you really have more than that, increase suffix-length.
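Since the default suffixes are lowercase letters, the smallest suffix length that is enough for a given number of lines can be worked out with a short loop. The function name is made up for this sketch:

```shell
# suffix_length_for LINES
# Prints the smallest suffix length whose 26^length covers LINES files.
suffix_length_for() {
    lines=$1
    len=1
    capacity=26
    while [ "$capacity" -lt "$lines" ]; do
        len=$((len + 1))
        capacity=$((capacity * 26))
    done
    echo "$len"
}

for n in 10000 11881376 11881377; do
    echo "$n lines need suffix length $(suffix_length_for "$n")"
done
```

The result could then be passed on as --suffix-length="$(suffix_length_for "$(wc -l < input.txt)")".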
Here's another way to do something for each line:
while IFS= read -r line; do
do_something_with "$line"
done < big.file
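If the per-line files really are needed, awk is another option that avoids split's suffix limit altogether. This is a sketch rather than anything from the answers above; the line_ prefix, the 7-digit padding and the sample file are assumptions:

```shell
workdir=$(mktemp -d)
printf '%s\n' 'first line' 'second line' 'third line' > "$workdir/big.file"

# Write line N to line_000000N.txt; close() after each write so awk never
# holds more than one output file open at a time.
awk -v dir="$workdir" '{
    out = sprintf("%s/line_%07d.txt", dir, NR)
    print > out
    close(out)
}' "$workdir/big.file"

file_count=$(ls "$workdir" | grep -c '^line_')
second=$(cat "$workdir/line_0000002.txt")
echo "files=$file_count second='$second'"
rm -rf "$workdir"
```

Without the close() call, awk would eventually hit the limit on open file descriptors.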
GNU Parallel can do this:
cat big.file | parallel --pipe -N1 'cat > {#}'
But if Mahout can read from stdin then you can avoid the temporary files:
cat big.file | parallel --pipe -N1 mahout --input-file -
Learn more about GNU Parallel https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1 and walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html

Very slow loop using grep or fgrep on large datasets

I’m trying to do something pretty simple; grep from a list, an exact match for the string, on the files in a directory:
#try grep each line from the files
for i in $(cat /data/datafile); do
LOOK=$(echo $i);
fgrep -r $LOOK /data/filestosearch >>/data/output.txt
done
The file with the matches to grep with has 20 million lines, and the directory has ~600 files, with a total of ~40Million lines
I can see that this is going to be slow but we estimated it will take 7 years. Even if I use 300 cores on our HPC splitting the job by files to search, it looks like it could take over a week.
There are similar questions ("Loop Running VERY Slow" and "Very slow foreach loop"), and although they are on different platforms, I think possibly if else might help me, or fgrep, which is potentially faster (but seems to be a bit slow as I'm testing it now).
Can anyone see a faster way to do this?
Thank you in advance
sounds like the -f flag for grep would be suitable here:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
so grep can already do what your loop is doing, and you can replace the loop with:
grep -F -r -f /data/datafile /data/filestosearch >>/data/output.txt
Now I'm not sure about the performance of 20 million patterns, but at least you aren't starting 20 million processes this way so it's probably significantly faster.
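A small self-contained sketch of the single-grep approach on made-up files. The -h and -w flags are additions for the demo: -h drops the file name prefix, and -w is one way to get closer to the "exact match" the question asks for:

```shell
workdir=$(mktemp -d)
mkdir "$workdir/filestosearch"
printf '%s\n' 'alpha' 'gamma' > "$workdir/datafile"
printf '%s\n' 'x alpha y' 'beta' > "$workdir/filestosearch/a.txt"
printf '%s\n' 'gamma ray' 'delta' 'alphabet' > "$workdir/filestosearch/b.txt"

# One grep process, all patterns loaded once, fixed-string matching.
matches=$(LC_ALL=C grep -F -r -h -f "$workdir/datafile" \
    "$workdir/filestosearch" | LC_ALL=C sort)
# With -w a pattern only matches as a whole word, so "alphabet" drops out.
word_matches=$(LC_ALL=C grep -F -r -h -w -f "$workdir/datafile" \
    "$workdir/filestosearch" | LC_ALL=C sort)

printf 'substring matches:\n%s\n' "$matches"
printf 'whole-word matches:\n%s\n' "$word_matches"
rm -rf "$workdir"
```

Use -x instead of -w if a pattern has to match the entire line.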
As Martin has already said in his answer, you should use the -f option instead of looping. I think it should be faster than looping.
Also, this looks like an excellent use case for GNU parallel. Check out this answer for usage examples. It looks difficult, but is actually quite easy to set up and run.
Other than that, 40 million lines should not be a very big deal for grep if there was only one string to match. It should be able to do it in a minute or two on any decent machine. I tested that 2 million lines takes 6 s on my laptop. So 40 mil lines should take 2 mins.
The problem is with the fact that there are 20 million strings to be matched. I think it must be running out of memory or something, especially when you run multiple instances of it on different directories. Can you try splitting the input match-list file? Try splitting it into chunks of 100000 words each for example.
EDIT: Just tried parallel on my machine. It is amazing. It automatically takes care of splitting the grep on to several cores and several machines.
Here's one way to speed things up:
while read i
do
LOOK=$(echo $i)
fgrep -r $LOOK /data/filestosearch >> /data/output.txt
done < /data/datafile
When you do that for i in $(cat /data/datafile), you first spawn another process, but that process must cat out all of those lines before running the rest of the script. Plus, there's a good possibility that you'll overload the command line and lose some of the files on the end.
By using a while read loop and redirecting the input from /data/datafile, you eliminate the need to spawn a shell. Plus, your script will immediately start reading through the while loop without first having to cat out the entire /data/datafile.
If $i are a list of directories, and you are interested in the files underneath, I wonder if find might be a bit faster than fgrep -r.
while read i
do
LOOK=$(echo $i)
find $i -type f | xargs fgrep $LOOK >> /data/output.txt
done < /data/datafile
The xargs will take the output of find, and run as many files as possible under a single fgrep. The xargs can be dangerous if file names in those directories contain whitespace or other strange characters. You can try (depending upon the system), something like this:
find $i -type f -print0 | xargs --null fgrep $LOOK >> /data/output.txt
On the Mac it's
find $i -type f -print0 | xargs -0 fgrep $LOOK >> /data/output.txt
As others have stated, if you have the GNU version of grep, you can give it the -f flag and include your /data/datafile. Then, you can completely eliminate the loop.
Another possibility is to switch to Perl or Python which actually will run faster than the shell will, and give you a bit more flexibility.
Since you are searching for simple strings (and not regexp) you may want to use comm:
comm -12 <(sort find_this) <(sort in_this.*) > /data/output.txt
It takes up very little memory, whereas grep -f find_this can gobble up 100 times the size of 'find_this'.
On a 8 core this takes 100 sec on these files:
$ wc find_this; cat in_this.* | wc
3637371 4877980 307366868 find_this
16000000 20000000 1025893685
Be sure to have a reasonably new version of sort. It should support --parallel.
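A sketch of the comm approach on made-up files, written with temporary files instead of process substitution so it also runs in a plain POSIX shell. One caveat it makes visible: comm compares whole lines, so it only finds lines that are identical on both sides, not substrings:

```shell
workdir=$(mktemp -d)
printf '%s\n' pear apple kiwi > "$workdir/find_this"
printf '%s\n' banana apple 'apple pie' > "$workdir/in_this.1"
printf '%s\n' kiwi cherry > "$workdir/in_this.2"

# comm needs both inputs sorted the same way, so fix the locale throughout.
LC_ALL=C sort -u "$workdir/find_this" > "$workdir/needles.sorted"
LC_ALL=C sort -u "$workdir"/in_this.* > "$workdir/haystack.sorted"
# -12 suppresses lines unique to either side, leaving only common lines.
common=$(LC_ALL=C comm -12 "$workdir/needles.sorted" "$workdir/haystack.sorted")
printf '%s\n' "$common"
rm -rf "$workdir"
```

Here "apple pie" is not reported even though it contains "apple"; if substring hits are needed, grep -F -f is the tool to use.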
You can write perl/python script, that will do the job for you. It saves all the forks you need to do when you do this with external tools.
Another hint: you can combine strings that you are looking for in one regular expression.
In this case grep will do only one pass for all combined lines.
Example:
Instead of
for i in ABC DEF GHI JKL
do
grep $i file >> results
done
you can do
egrep "ABC|DEF|GHI|JKL" file >> results
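When the list of strings is not hard-coded, the alternation can be built on the fly. A sketch, assuming the strings contain no regex metacharacters (otherwise grep -F -f with a pattern file is safer); the sample file is made up:

```shell
workdir=$(mktemp -d)
printf '%s\n' 'xxABCxx' 'nothing here' 'JKL at start' 'DEF' > "$workdir/file"

# Join the search strings with '|' to form one alternation.
pattern=$(printf '%s\n' ABC DEF GHI JKL | paste -s -d '|' -)
results=$(grep -E "$pattern" "$workdir/file")
echo "pattern=$pattern"
printf '%s\n' "$results"
rm -rf "$workdir"
```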

Fastest possible grep

I'd like to know if there is any tip to make grep as fast as possible. I have a rather large base of text files to search in the quickest possible way. I've made them all lowercase, so that I could get rid of -i option. This makes the search much faster.
Also, I've found out that -F and -P modes are quicker than the default one. I use the former when the search string is not a regular expression (just plain text), the latter if regex is involved.
Does anyone have any experience in speeding up grep? Maybe compile it from scratch with some particular flag (I'm on Linux CentOS), organize the files in a certain fashion or maybe make the search parallel in some way?
Try with GNU parallel, which includes an example of how to use it with grep:
grep -r greps recursively through directories. On multicore CPUs GNU
parallel can often speed this up.
find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}
This will run 1.5 jobs per core, and give 1000 arguments to grep.
For big files, it can split the input into several chunks with the --pipe and --block arguments:
parallel --pipe --block 2M grep foo < bigfile
You could also run it on several different machines through SSH (ssh-agent needed to avoid passwords):
parallel --pipe --sshlogin server.example.com,server2.example.net grep foo < bigfile
If you're searching very large files, then setting your locale can really help.
GNU grep goes a lot faster in the C locale than with UTF-8.
export LC_ALL=C
Ripgrep claims to now be the fastest.
https://github.com/BurntSushi/ripgrep
Also includes parallelism by default
-j, --threads ARG
The number of threads to use. Defaults to the number of logical CPUs (capped at 6). [default: 0]
From the README
It is built on top of Rust's regex engine. Rust's regex engine uses
finite automata, SIMD and aggressive literal optimizations to make
searching very fast.
Apparently using --mmap can help on some systems:
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
Not strictly a code improvement but something I found helpful after running grep on 2+ million files.
I moved the operation onto a cheap SSD drive (120GB). At about $100, it's an affordable option if you are crunching lots of files regularly.
If you don't care about which files contains the string, you might want to separate reading and grepping into two jobs, since it might be costly to spawn grep many times – once for each small file.
If you've one very large file:
parallel -j100% --pipepart --block 100M -a <very large SEEKABLE file> grep <...>
Many small compressed files (sorted by inode)
ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j80% --group "gzcat {}" | parallel -j50% --pipe --round-robin -u -N1000 grep <..>
I usually compress my files with lz4 for maximum throughput.
If you want just the filename with the match:
ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j100% --group "gzcat {} | grep -lq <..> && echo {}"
Building on the response by Sandro I looked at the reference he provided here and played around with BSD grep vs. GNU grep. My quick benchmark results showed: GNU grep is way, way faster.
So my recommendation to the original question "fastest possible grep": Make sure you are using GNU grep rather than BSD grep (which is the default on MacOS for example).
I personally use the ag (silver searcher) instead of grep and it's way faster, also you can combine it with parallel and pipe block.
https://github.com/ggreer/the_silver_searcher
Update:
I now use https://github.com/BurntSushi/ripgrep which is faster than ag depending on your use case.
One thing I've found faster for using grep to search (especially for changing patterns) in a single big file is to use split + grep + xargs with its parallel flag. For instance:
Having a file of ids you want to search for in a big file called my_ids.txt
Name of bigfile bigfile.txt
Use split to split the file into parts:
# Use split to split the file into x number of files, consider your big file
# size and try to stay under 26 split files to keep the filenames
# easy from split (xa[a-z]), in my example I have 10 million rows in bigfile
split -l 1000000 bigfile.txt
# Produces output files named xa[a-t]
# Now use split files + xargs to iterate and launch parallel greps with output
for id in $(cat my_ids.txt) ; do ls xa* | xargs -n 1 -P 20 grep $id >> matches.txt ; done
# Here you can tune your parallel greps with -P, in my case I am being greedy
# Also be aware that there's no point in allocating more greps than x files
In my case this cut what would have been a 17 hour job into a 1 hour 20 minute job. I'm sure there's some sort of bell curve here on efficiency and obviously going over the available cores won't do you any good but this was a much better solution than any of the above comments for my requirements as stated above. This has an added benefit over the script parallel in using mostly (linux) native tools.
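A possible variation on the same split + xargs idea: instead of looping over every id, each chunk is scanned once with all ids loaded through grep -F -f. This is a sketch on made-up data, not the author's measured setup; the part_ prefix, chunk size and -P 4 are placeholders, and xargs -P is a GNU/BSD extension:

```shell
workdir=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 100; i++) print "id" i }' > "$workdir/bigfile.txt"
printf '%s\n' id7 id42 id100 > "$workdir/my_ids.txt"

split -l 25 "$workdir/bigfile.txt" "$workdir/part_"

# One grep per chunk, up to 4 at a time. -x matches whole lines so id7
# does not also hit id70..id79. grep exits 1 for chunks without a match,
# which xargs reports as a failure, hence the "|| true".
matches=$(
    { printf '%s\n' "$workdir"/part_* |
        xargs -n 1 -P 4 grep -F -x -h -f "$workdir/my_ids.txt" || true; } |
    LC_ALL=C sort
)
printf '%s\n' "$matches"
rm -rf "$workdir"
```

The output order of parallel greps is not deterministic, which is why the demo sorts the result.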
cgrep, if it's available, can be orders of magnitude faster than grep.
MCE 1.508 includes a dual chunk-level {file, list} wrapper script supporting many C binaries; agrep, grep, egrep, fgrep, and tre-agrep.
https://metacpan.org/source/MARIOROY/MCE-1.509/bin/mce_grep
https://metacpan.org/release/MCE
One does not need to convert to lowercase when wanting -i to run fast. Simply pass --lang=C to mce_grep.
Output order is preserved. The -n and -b output is also correct. Unfortunately, that is not the case for GNU parallel mentioned on this page. I was really hoping for GNU Parallel to work here. In addition, mce_grep does not sub-shell (sh -c /path/to/grep) when calling the binary.
Another alternative is the MCE::Grep module included with MCE.
A slight deviation from the original topic: the indexed search command line utilities from the googlecodesearch project are way faster than grep: https://github.com/google/codesearch
Once you compile it (the golang package is needed), you can index a folder with:
# index current folder
cindex .
The index will be created under ~/.csearchindex
Now you can search:
# search folders previously indexed with cindex
csearch eggs
I'm still piping the results through grep to get colorized matches.

Resources