Very slow loop using grep or fgrep on large datasets - bash

I'm trying to do something pretty simple: for each string in a list, grep for an exact match to that string in the files in a directory:
#try grep each line from the files
for i in $(cat /data/datafile); do
    LOOK=$(echo $i);
    fgrep -r $LOOK /data/filestosearch >>/data/output.txt
done
The file with the strings to grep for has 20 million lines, and the directory has ~600 files with a total of ~40 million lines.
I can see that this is going to be slow, but we estimated it will take 7 years. Even if I use 300 cores on our HPC and split the job by the files to search, it looks like it could take over a week.
There are similar questions here (Loop Running VERY Slow, Very slow foreach loop), and although they are on different platforms, I think an if/else approach might help me, or fgrep, which is potentially faster (but seems to be a bit slow as I'm testing it now).
Can anyone see a faster way to do this?
Thank you in advance

Sounds like the -f flag for grep would be suitable here:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
so grep can already do what your loop is doing, and you can replace the loop with:
grep -F -r -f /data/datafile /data/filestosearch >>/data/output.txt
Now I'm not sure about the performance of 20 million patterns, but at least you aren't starting 20 million processes this way so it's probably significantly faster.
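If "exact match" here means that the whole line (or a whole word) must match, grep can enforce that too; a hedged variant of the same command:
# -x requires each pattern to match an entire line; use -w instead for whole-word matches
grep -F -x -r -f /data/datafile /data/filestosearch >> /data/output.txt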

As Martin has already said in his answer, you should use the -f option instead of looping; I think it should be noticeably faster.
Also, this looks like an excellent use case for GNU parallel. Check out this answer for usage examples. It looks difficult, but is actually quite easy to set up and run.
Other than that, 40 million lines should not be a very big deal for grep if there were only one string to match; it should finish in a minute or two on any decent machine. I tested that 2 million lines take 6 s on my laptop, so 40 million lines should take about 2 minutes.
The problem is that there are 20 million strings to be matched. It must be running out of memory or something, especially when you run multiple instances of it on different directories. Try splitting the input match-list file into chunks of, say, 100,000 patterns each (a sketch follows below).
EDIT: Just tried parallel on my machine. It is amazing. It automatically takes care of splitting the grep on to several cores and several machines.
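A minimal sketch of the splitting idea combined with parallel (the chunk size, job count, and the /tmp chunk names are assumptions to adjust):
# split the 20M-line pattern list into 100k-line chunks named /tmp/patterns.aa, .ab, ...
split -l 100000 /data/datafile /tmp/patterns.
# run one fixed-string grep per chunk, one job per core, appending all matches
parallel -j "$(nproc)" grep -F -r -f {} /data/filestosearch ::: /tmp/patterns.* >> /data/output.txt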

Here's one way to speed things up:
while read i
do
    LOOK=$(echo $i)
    fgrep -r $LOOK /data/filestosearch >> /data/output.txt
done < /data/datafile
When you do for i in $(cat /data/datafile), you first spawn another process, and that process must cat out all of those lines before the rest of the script can run. Plus, there's a good chance that you'll overload the command line and lose some of the entries at the end.
By using a while read loop and redirecting the input from /data/datafile, you eliminate the need to spawn a shell. Plus, your script will immediately start reading through the while loop without first having to cat out the entire /data/datafile.
If the entries in $i are directories, and you are interested in the files underneath, I wonder if find might be a bit faster than fgrep -r.
while read i
do
    LOOK=$(echo $i)
    find $i -type f | xargs fgrep $LOOK >> /data/output.txt
done < /data/datafile
xargs will take the output of find and run as many files as possible under a single fgrep. xargs can be dangerous if file names in those directories contain whitespace or other strange characters. Depending upon the system, you can try something like this:
find $i -type f -print0 | xargs --null fgrep $LOOK >> /data/output.txt
On the Mac it's
find $i -type f -print0 | xargs -0 fgrep $LOOK >> /data/output.txt
As others have stated, if you have the GNU version of grep, you can give it the -f flag and include your /data/datafile. Then, you can completely eliminate the loop.
Another possibility is to switch to Perl or Python which actually will run faster than the shell will, and give you a bit more flexibility.

Since you are searching for simple strings (and not regexps), you may want to use comm:
comm -12 <(sort find_this) <(sort in_this.*) > /data/output.txt
It takes up very little memory, whereas grep -f find_this can gobble up 100 times the size of 'find_this'.
On an 8-core machine this takes 100 seconds on these files:
$ wc find_this; cat in_this.* | wc
3637371 4877980 307366868 find_this
16000000 20000000 1025893685
Be sure to have a reasonably new version of sort. It should support --parallel.
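A sketch of the same pipeline with the sort parallelism (and the faster C collation) made explicit; the thread count is an assumption to adjust to your machine:
export LC_ALL=C
comm -12 <(sort --parallel=8 find_this) <(sort --parallel=8 in_this.*) > /data/output.txt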

You can write a Perl/Python script that will do the job for you. It saves all the forks you need when you do this with external tools.
Another hint: you can combine the strings you are looking for into one regular expression.
In that case grep will make only one pass over the file for all of the combined patterns.
Example:
Instead of
for i in ABC DEF GHI JKL
do
    grep $i file >> results
done
you can do
egrep "ABC|DEF|GHI|JKL" file >> results

Related

grep - how to output progress bar or status

Sometimes I'm grep-ing thousands of files and it'd be nice to see some kind of progress (bar or status).
I know this is not trivial, because grep outputs the search results to STDOUT, and my default workflow is to redirect the results to a file; I would like the progress bar/status to be output to STDOUT or STDERR.
Would this require modifying source code of grep?
Ideal command is:
grep -e "STRING" --results="FILE.txt"
and the progress:
[curr file being searched], number x/total number of files
written to STDOUT or STDERR
This wouldn't necessarily require modifying grep, although you could probably get a more accurate progress bar with such a modification.
If you are grepping "thousands of files" with a single invocation of grep, it is most likely that you are using the -r option to recursively search a directory structure. In that case, it is not even clear that grep knows how many files it will examine, because I believe it starts examining files before it explores the entire directory structure. Exploring the directory structure first would probably increase the total scan time (and, indeed, there is always a cost to producing progress reports, which is why few traditional Unix utilities do this).
In any case, a simple but slightly inaccurate progress bar could be obtained by constructing the complete list of files to be scanned and then feeding them to grep in batches of some size, maybe 100, or maybe based on the total size of the batch. Small batches would allow for more accurate progress reports but they would also increase overhead since they would require additional grep process start-up, and the process start-up time can be more than grepping a small file. The progress report would be updated for each batch of files, so you would want to choose a batch size that gave you regular updates without increasing overhead too much. Basing the batch size on the total size of the files (using, for example, stat to get the filesize) would make the progress report more exact but add an additional cost to process startup.
One advantage of this strategy is that you could also run two or more greps in parallel, which might speed the process up a bit.
In broad terms, here is a simple script (which just divides the files by count, not by size, and which doesn't attempt to parallelize):
# Requires bash 4 and GNU grep
shopt -s globstar
files=(**)
total=${#files[@]}
for ((i=0; i<total; i+=100)); do
    echo $i/$total >>/dev/stderr
    grep -d skip -e "$pattern" "${files[@]:i:100}" >>results.txt
done
For simplicity, I use a globstar (**) to safely put all the files in an array. If your version of bash is too old, then you can do it by looping over the output of find, but that's not very efficient if you have lots of files. Unfortunately, there is no way that I know of to write a globstar expression which only matches files. (**/ only matches directories.) Fortunately, GNU grep provides the -d skip option which silently skips directories. That means that the file count will be slightly inaccurate, since directories will be counted, but it probably doesn't make much difference.
You probably will want to make the progress report cleaner by using some console codes. The above is just to get you started.
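For example, a carriage return keeps the report on one self-updating line; replacing the echo in the loop above with something like this is a small cosmetic tweak, not a full progress bar:
printf '\r%d/%d files' "$i" "$total" >&2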
The simplest way to divide that into different processes would be to just divide the list into X different segments and run X different for loops, each with a different starting point. However, they probably won't all finish at the same time so that is sub-optimal. A better solution is GNU parallel. You might do something like this:
find . -type f -print0 |
parallel --progress -L 100 -m -j 4 grep -e "$pattern" > results.txt
(Here -L 100 specifies that up to 100 files should be given to each grep instance, and -j 4 specifies four parallel processes. I just pulled those numbers out of the air; you'll probably want to adjust them.)
Try the parallel program
find * -name \*.[ch] | parallel -j5 --bar '(grep grep-string {})' > output-file
Though I found this to be slower than a simple
find * -name \*.[ch] | xargs grep grep-string > output-file
This command shows the progress (speed and offset), but not the total amount; that could be estimated manually, however.
dd if=/input/file bs=1c skip=<offset> | pv | grep -aob "<string>"
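If the input is a regular file, pv can also read it directly; since it then knows the total size, it can show a percentage and ETA as well (a variant of the above without the offset skip):
pv /input/file | grep -aob "<string>"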
I'm pretty sure you would need to alter the grep source code. And those changes would be huge.
Currently grep does not know how many lines a file has until it has finished parsing the whole file. For your requirement it would need to parse the file twice, or at least determine the full line count some other way.
The first pass would determine the line count for the progress bar. The second pass would actually do the work and search for your pattern.
This would not only increase the runtime but also violate one of the main UNIX philosophies:
Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features". (source)
There might be other tools out there for your need, but afaik grep won't fit here.
I normally use something like this:
grep | tee "FILE.txt" | cat -n | sed 's/^/match: /;s/$/ /' | tr '\n' '\r' 1>&2
It is not perfect, as it only displays the matches, and if they are too long or differ too much in length the display gets garbled, but it should give you the general idea.
Or a simple dots:
grep | tee "FILE.txt" | sed 's/.*//' | tr '\n' '.' 1>&2

interrupt pdfgrep after N matches found

It appears as though pdfgrep, unlike regular grep, does not offer an option to stop searching after N matches are returned. I have a script that has to search a somewhat large .pdf file, and it takes pdfgrep over a minute to search through the entire file. But searching the entire document is unnecessarily inefficient, especially since I may need to run this command several times a day and must wait for it to complete before getting on with my work.
Since pdfgrep prints results as it finds them while searching, I wonder how bash might be able to interrupt (Ctrl-C) the process after N matches have been output at the terminal? Although such a feat seems quite plausible, I am uncertain how to implement such a solution.
Any suggestions or ideas are welcome.
Two options I can think of:
pdfgrep ... | head -n 1
IFS= read -r line < <(pdfgrep ...) && echo "$line"
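The head approach works because once head has printed its N lines and exits, pdfgrep's next write hits a broken pipe (SIGPIPE) and the process terminates early, assuming pdfgrep flushes results as it finds them, as the question notes. For example, to stop after five matching lines:
pdfgrep 'pattern' some.pdf | head -n 5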
Since version 1.3.1, pdfgrep also has the -m /--max-count option from grep:
pdfgrep --max-count 4 pattern some.pdf

Fastest possible grep

I'd like to know if there is any tip to make grep as fast as possible. I have a rather large base of text files to search in the quickest possible way. I've made them all lowercase, so that I could get rid of -i option. This makes the search much faster.
Also, I've found out that -F and -P modes are quicker than the default one. I use the former when the search string is not a regular expression (just plain text), the latter if regex is involved.
Does anyone have any experience in speeding up grep? Maybe compile it from scratch with some particular flag (I'm on Linux CentOS), organize the files in a certain fashion or maybe make the search parallel in some way?
Try with GNU parallel, which includes an example of how to use it with grep:
grep -r greps recursively through directories. On multicore CPUs GNU
parallel can often speed this up.
find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}
This will run 1.5 jobs per core and give 1000 arguments to grep.
For big files, it can split the input into several chunks with the --pipe and --block arguments:
parallel --pipe --block 2M grep foo < bigfile
You could also run it on several different machines through SSH (ssh-agent needed to avoid passwords):
parallel --pipe --sshlogin server.example.com,server2.example.net grep foo < bigfile
If you're searching very large files, then setting your locale can really help.
GNU grep goes a lot faster in the C locale than with UTF-8.
export LC_ALL=C
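An informal way to compare on your own data (timings vary with the pattern, the data, and the grep version; substitute whichever UTF-8 locale your system actually uses):
time LC_ALL=en_US.UTF-8 grep 'pattern' bigfile > /dev/null
time LC_ALL=C grep 'pattern' bigfile > /dev/null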
Ripgrep claims to now be the fastest.
https://github.com/BurntSushi/ripgrep
It also includes parallelism by default:
-j, --threads ARG
The number of threads to use. Defaults to the number of logical CPUs (capped at 6). [default: 0]
From the README
It is built on top of Rust's regex engine. Rust's regex engine uses
finite automata, SIMD and aggressive literal optimizations to make
searching very fast.
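A hedged usage example with a placeholder pattern and path (ripgrep picks a thread count automatically, so --threads is only needed to override it):
rg --fixed-strings --threads 8 'needle' /data/filestosearch > results.txt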
Apparently using --mmap can help on some systems:
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
Not strictly a code improvement but something I found helpful after running grep on 2+ million files.
I moved the operation onto a cheap SSD drive (120GB). At about $100, it's an affordable option if you are crunching lots of files regularly.
If you don't care about which files contain the string, you might want to separate reading and grepping into two jobs, since it might be costly to spawn grep many times (once for each small file).
If you have one very large file:
parallel -j100% --pipepart --block 100M -a <very large SEEKABLE file> grep <...>
For many small compressed files (sorted by inode):
ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j80% --group "gzcat {}" | parallel -j50% --pipe --round-robin -u -N1000 grep <..>
I usually compress my files with lz4 for maximum throughput.
If you want just the filename with the match:
ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j100% --group "gzcat {} | grep -lq <..> && echo {}"
Building on the response by Sandro, I looked at the reference he provided here and played around with BSD grep vs. GNU grep. My quick benchmark results showed that GNU grep is way, way faster.
So my recommendation to the original question "fastest possible grep": Make sure you are using GNU grep rather than BSD grep (which is the default on MacOS for example).
I personally use ag (the silver searcher) instead of grep and it's way faster; you can also combine it with parallel and pipe blocks.
https://github.com/ggreer/the_silver_searcher
Update:
I now use https://github.com/BurntSushi/ripgrep which is faster than ag depending on your use case.
One thing I've found faster when using grep to search (especially for changing patterns) in a single big file is to use split + grep + xargs with its parallel flag. For instance:
Say the IDs you want to search for are in a file called my_ids.txt and the big file is named bigfile.txt.
Use split to split the file into parts:
# Use split to split the file into x number of files, consider your big file
# size and try to stay under 26 split files to keep the filenames
# easy from split (xa[a-z]), in my example I have 10 million rows in bigfile
split -l 1000000 bigfile.txt
# Produces output files named xa[a-t]
# Now use split files + xargs to iterate and launch parallel greps with output
for id in $(cat my_ids.txt) ; do ls xa* | xargs -n 1 -P 20 grep $id >> matches.txt ; done
# Here you can tune your parallel greps with -P, in my case I am being greedy
# Also be aware that there's no point in allocating more greps than x files
In my case this cut what would have been a 17-hour job down to 1 hour 20 minutes. I'm sure there's some sort of efficiency curve here, and obviously going over the available cores won't do you any good, but this was a much better solution than any of the above for my requirements. It has the added benefit over the parallel script of using mostly native (Linux) tools.
cgrep, if it's available, can be orders of magnitude faster than grep.
MCE 1.508 includes a dual chunk-level {file, list} wrapper script supporting many C binaries: agrep, grep, egrep, fgrep, and tre-agrep.
https://metacpan.org/source/MARIOROY/MCE-1.509/bin/mce_grep
https://metacpan.org/release/MCE
One does not need to convert to lowercase when wanting -i to run fast. Simply pass --lang=C to mce_grep.
Output order is preserved. The -n and -b output is also correct. Unfortunately, that is not the case for GNU parallel mentioned on this page. I was really hoping for GNU Parallel to work here. In addition, mce_grep does not sub-shell (sh -c /path/to/grep) when calling the binary.
Another alternate is the MCE::Grep module included with MCE.
A slight deviation from the original topic: the indexed search command-line utilities from the Google Code Search project are way faster than grep: https://github.com/google/codesearch
Once you compile it (the golang package is needed), you can index a folder with:
# index current folder
cindex .
The index will be created under ~/.csearchindex
Now you can search:
# search folders previously indexed with cindex
csearch eggs
I'm still piping the results through grep to get colorized matches.

Performance issue with parsing large log files (~5gb) using awk, grep, sed

I am currently dealing with log files of approximately 5 GB. I'm quite new to parsing log files and using UNIX bash, so I'll try to be as precise as possible. While searching through log files, I do the following: provide the request number to look for, then optionally provide the action as a secondary filter. A typical command looks like this:
fgrep '2064351200' example.log | fgrep 'action: example'
This is fine for smaller files, but with a log file that is 5 GB it's unbearably slow. I've read online that it's good to use sed or awk to improve performance (or possibly even a combination of both), but I'm not sure how that is accomplished. For example, using awk, I have a typical command:
awk '/2064351200/ {print}' example.log
Basically, my ultimate goal is to be able to efficiently print/return the records (or line numbers) in a log file that contain the strings to match (there could be up to 4-5 of them, and I've read piping is bad).
On a side note, in bash shell, if I want to use awk and do some processing, how is that achieved? For example:
BEGIN { print "File\tOwner" }
{ print $8, "\t", \
$3}
END { print " - DONE -" }
That is a pretty simple awk script, and I would assume there's a way to put it into a one-liner bash command? But I'm not sure what the structure would be.
Thanks in advance for the help. Cheers.
You need to perform some tests to find out where your bottlenecks are, and how fast your various tools perform. Try some tests like this:
time fgrep '2064351200' example.log >/dev/null
time egrep '2064351200' example.log >/dev/null
time sed -e '/2064351200/!d' example.log >/dev/null
time awk '/2064351200/ {print}' example.log >/dev/null
Traditionally, egrep should be the fastest of the bunch (yes, faster than fgrep), but some modern implementations are adaptive and automatically switch to the most appropriate searching algorithm. If you have bmgrep (which uses the Boyer-Moore search algorithm), try that. Generally, sed and awk will be slower because they're designed as more general-purpose text manipulation tools rather than being tuned for the specific job of searching. But it really depends on the implementation, and the correct way to find out is to run tests. Run them each several times so you don't get messed up by things like caching and competing processes.
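If you want each timing to start from a cold cache, one Linux-specific way (requires root) is to drop the page cache between runs; otherwise just repeat the runs and compare the warm-cache numbers:
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null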
As @Ron pointed out, your search process may be disk I/O bound. If you will be searching the same log file a number of times, it may be faster to compress the log file first; this makes it faster to read off disk, but then requires more CPU time to process because it has to be decompressed first. Try something like this:
compress -c example2.log >example2.log.Z
time zgrep '2064351200' example2.log.Z >/dev/null
gzip -c example2.log >example2.log.gz
time zgrep '2064351200' example2.log.gz >/dev/null
bzip2 -k example.log
time bzgrep '2064351200' example.log.bz2 >/dev/null
I just ran a quick test with a fairly compressible text file, and found that bzip2 compressed best, but then took far more CPU time to decompress, so the gzip option wound up being fastest overall. Your computer will have different disk and CPU performance than mine, so your results may be different. If you have any other compressors lying around, try them as well, and/or try different levels of gzip compression, etc.
Speaking of preprocessing: if you're searching the same log over and over, is there a way to preselect out just the log lines that you might be interested in? If so, grep them out into a smaller (maybe compressed) file, then search that instead of the whole thing. As with compression, you spend some extra time up front, but then each individual search runs faster.
A note about piping: other things being equal, piping a huge file through multiple commands will be slower than having a single command do all the work. But all things are not equal here, and if using multiple commands in a pipe (which is what zgrep and bzgrep do) buys you better overall performance, go for it. Also, consider whether you're actually passing all of the data through the entire pipe. In the example you gave, fgrep '2064351200' example.log | fgrep 'action: example', the first fgrep will discard most of the file; the pipe and second command only have to process the small fraction of the log that contains '2064351200', so the slowdown will likely be negligible.
tl;dr TEST ALL THE THINGS!
EDIT: if the log file is "live" (i.e. new entries are being added), but the bulk of it is static, you may be able to use a partial preprocess approach: compress (and maybe prescan) the log, then when scanning use the compressed (and/or prescanned) version plus a tail of the part of the log added since you did the prescan. Something like this:
# Precompress:
gzip -v -c example.log >example.log.gz
compressedsize=$(gzip -l example.log.gz | awk '{if(NR==2) print $2}')
# Search the compressed file + recent additions:
{ gzip -cdfq example.log.gz; tail -c +$compressedsize example.log; } | egrep '2064351200'
If you're going to be doing several related searches (e.g. a particular request, then specific actions with that request), you can save prescanned versions:
# Prescan for a particular request (repeat for each request you'll be working with):
gzip -cdfq example.log.gz | egrep '2064351200' > prescan-2064351200.log
# Search the prescanned file + recent additions:
{ cat prescan-2064351200.log; tail -c +$compressedsize example.log | egrep '2064351200'; } | egrep 'action: example'
If you don't know the sequence of your strings, then:
awk '/str1/ && /str2/ && /str3/ && /str4/' filename
If you know that they will appear one following another in the line:
grep 'str1.*str2.*str3.*str4' filename
(note for awk, {print} is the default action block, so it can be omitted if the condition is given)
Dealing with files that large is going to be slow no matter how you slice it.
As to multi-line programs on the command line,
$ awk 'BEGIN { print "File\tOwner" }
> { print $8, "\t", \
> $3}
> END { print " - DONE -" }' infile > outfile
Note the single quotes.
If you process the same file multiple times, it might be faster to read it into a database, and perhaps even create an index.

How do you handle the "Too many files" problem when working in Bash?

I often have to work with directories containing hundreds of thousands of files, doing text matching, replacing, and so on. If I go the standard route of, say,
grep foo *
I get the too many files error message, so I end up doing
for i in *; do grep foo $i; done
or
find ../path/ | xargs -I{} grep foo "{}"
But these are less than optimal (they create a new grep process for each file).
This looks more like a limitation on the size of the arguments programs can receive, because the * in the for loop works all right. But, in any case, what's the proper way to handle this?
PS: Don't tell me to do grep -r instead, I know about that, I'm thinking about tools that do not have a recursive option.
In newer versions of findutils, find can do the work of xargs (including the glomming behavior, such that only as many grep processes as needed are used):
find ../path -exec grep foo '{}' +
The use of + rather than ; as the last argument triggers this behavior.
If there is a risk of filenames containing spaces, you should remember to use the -print0 flag to find together with the -0 flag to xargs:
find . -print0 | xargs -0 grep -H foo
xargs does not start a new process for each file. It bunches together the arguments. Have a look at the -n option to xargs - it controls the number of arguments passed to each execution of the sub-command.
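For example, the following hands at most 500 NUL-delimited filenames to each grep invocation (500 is an arbitrary batch size):
find ../path -type f -print0 | xargs -0 -n 500 grep -H foo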
I can't see that
for i in *; do
    grep foo $i
done
would work since I thought the "too many files" was a shell limitation, hence it would fail for the for loop as well.
Having said that, I always let xargs do the grunt-work of splitting the argument list into manageable bits thus:
find ../path/ | xargs grep foo
It won't start a process per file but per group of files.
Well, I had the same problems, but it seems that everything I came up with has already been mentioned. Mostly, I had two problems: globs are expensive, so ls on a directory with a million files takes forever (20+ minutes on one of my servers), and ls * on such a directory takes forever and then fails with an "argument list too long" error.
find /some -type f -exec some command {} \;
seems to help with both problems. Also, if you need to do more complex operations on these files, you might consider scripting your stuff across multiple threads. Here is a Python primer for scripting CLI stuff:
http://www.ibm.com/developerworks/aix/library/au-pythocli/?ca=dgr-lnxw06pythonunixtool&S_TACT=105AGX59&S_CMP=GR
