I was very surprised to see that adding the --ignore-case option to grep can slow the search down by a factor of about 50. I've tested this on two different machines with the same result. I am curious to find an explanation for the huge performance difference.
I would also like to see an alternative command to grep for case-insensitive searches. I don't need regular expressions, just fixed-string searching. First, the test file will be a 50 MB plain-text file with some dummy data; you can use the following commands to generate it:
Create test.txt
yes all work and no play makes Jack a dull boy | head -c 50M > test.txt
echo "Jack is no fun" >> test.txt
echo "Jack is no Fun" >> test.txt
Demonstration
Below is a demonstration of the slowness. Adding the --ignore-case option makes the command about 57 times slower.
$ time grep fun test.txt
all work and no plJack is no fun
real 0m0.061s
$ time grep --ignore-case fun test.txt
all work and no plJack is no fun
Jack is no Fun
real 0m3.498s
Possible Explanations
Googling around, I found a discussion about grep being slow in the UTF-8 locale. So I ran the following test, and it did speed things up. The default locale on my machine is en_US.UTF-8, so setting it to POSIX gives a noticeable performance boost, but of course I then can't search Unicode text correctly, which is undesirable. It is also still more than twice as slow as the case-sensitive search.
$ time LANG=POSIX grep --ignore-case fun test.txt
all work and no plJack is no fun
Jack is no Fun
real 0m0.142s
Alternatives
We could use Perl instead. It is faster than the case-insensitive grep, but still about six times slower than the case-sensitive grep, and the POSIX-locale grep above is more than twice as fast as Perl.
$ time perl -ne '/fun/i && print' test.txt
all work and no plJack is no fun
Jack is no Fun
real 0m0.388s
So I'd love to find a fast correct alternative and an explanation if anyone has one.
UPDATE - CentOS
The two machines tested above were both running Ubuntu, one 11.04 (Natty Narwhal) and the other 12.04 (Precise Pangolin). Running the same tests on a CentOS 5.3 machine produces the following interesting results: the performance in the two cases is almost identical. CentOS 5.3 was released in January 2009 and runs grep 2.5.1, while Ubuntu 12.04 runs grep 2.10, so the behaviour may have changed between grep versions, or may differ between the two distributions.
$ time grep fun test.txt
Jack is no fun
real 0m0.026s
$ time grep --ignore-case fun test.txt
Jack is no fun
Jack is no Fun
real 0m0.027s
I think this bug report helps in understanding why it is slow:
bug report grep, slow on ignore-case
This slowness is due to grep (in a UTF-8 locale) constantly accessing the files "/usr/lib/locale/locale-archive" and "/usr/lib/gconv/gconv-modules.cache".
It can be shown using the strace utility. Both files are from glibc.
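For instance, something like this (a sketch; the exact output varies by system) filters the trace down to those files:
strace -e trace=file grep --ignore-case fun test.txt 2>&1 | grep -E 'locale-archive|gconv'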
The reason is that it needs to do a Unicode-aware comparison for the current locale, and judging by Marat's answer, it's not very efficient in doing so.
This shows how much faster it is when Unicode is not taken into consideration:
$ time LC_CTYPE=C grep -i fun test.txt
all work and no plJack is no fun
Jack is no Fun
real 0m0.192s
Of course, this alternative won't work with characters in other languages such as Ñ/ñ, Ø/ø, Ð/ð, Æ/æ and so on.
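A quick way to see that limitation (assuming a UTF-8 terminal and that the en_US.UTF-8 locale is installed):
$ printf 'JAMÓN\n' | LC_CTYPE=C grep -i 'jamón'
$ printf 'JAMÓN\n' | LC_CTYPE=en_US.UTF-8 grep -i 'jamón'
JAMÓN
The first command prints nothing, because in the C locale -i only folds ASCII letters.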
Another alternative is to rewrite the pattern with bracket expressions so that it matches both cases explicitly:
$ time grep '[Ff][Uu][Nn]' test.txt
all work and no plJack is no fun
Jack is no Fun
real 0m0.193s
This is reasonably fast, but of course it's a pain to convert each character into a bracket expression by hand, and unlike the approach above it's not easy to turn into an alias or an sh script.
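That said, the conversion can be automated with GNU sed's case-conversion escapes. A rough sketch (the cigrep name is mine; it assumes GNU sed and a pattern made of plain letters with no regex metacharacters):
# Wrap every letter of the pattern in a [Xx] bracket expression, then grep for it.
cigrep() {
    local pattern
    pattern=$(printf '%s' "$1" | sed -E 's/([[:alpha:]])/[\U\1\L\1]/g')
    shift
    grep "$pattern" "$@"
}
cigrep fun test.txt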
For comparison, on my system:
$ time grep fun test.txt
all work and no plJack is no fun
real 0m0.085s
$ time grep -i fun test.txt
all work and no plJack is no fun
Jack is no Fun
real 0m3.810s
To do a case-insensitive search, grep first has to convert your entire 50 MB file to one case or the other. That's going to take time. Not only that, but there are memory copies...
In your test case, you first generate the file. This means that it will be memory cached. The first grep run only has to mmap the cached pages; it doesn't even have to access the disk.
The case-insensitive grep does the same, but then it tries to modify that data. This means the kernel will take an exception for each modified 4 kB page, and will end up having to copy the entire 50 MB into new memory, one page at a time.
Basically, I'd expect this to be slower. Maybe not 57 times slower, but definitely slower.
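If you want to separate the page-cache effect from the case-handling cost, one quick check (a sketch; it assumes Linux and needs root to drop the caches) is to compare a cold-cache run against warm-cache runs:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
time grep fun test.txt > /dev/null      # cold cache, includes disk reads
time grep fun test.txt > /dev/null      # warm cache
time grep -i fun test.txt > /dev/null   # warm cache, case-insensitive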
Related
Is there an efficient way to search for a list of strings in another text file or in piped output?
I have tried the following methods:
FINDSTR /G:patternlist.txt <filetocheck>
or
Some program whose output is piped to FINDSTR
SOMEPROGRAM | FINDSTR /G:patternlist.txt
Similarly, I tried GREP from MSYS, UnixUtils, the GNU package, etc.:
GREP -w -F -f patternlist.txt <filetocheck>
or
Some program whose output is piped to GREP
SOMEPROGRAM | GREP -w -F -f patternlist.txt
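Based on the locale discussion earlier on this page, one variant that may be worth timing (an untested sketch; it assumes the patterns are plain ASCII) is forcing the byte-oriented C locale:
SOMEPROGRAM | LC_ALL=C grep -w -F -f patternlist.txt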
The pattern list file is a text file that contains one literal string per line. For example:
Patternlist.txt
65sM547P
Bu83842T
t400N897
s20I9U10
i1786H6S
Qj04404e
tTV89965
etc.,
The file to be checked contains similar strings, but in some cases there may be multiple words on a single line. For example:
Filetocheck.txt
3Lo76SZ4 CjBS8WeS
iI9NvIDC
TIS0jFUI w6SbUuJq joN2TOVZ
Ee3z83an rpAb8rWp
Rmd6vBcg
O2UYJOos
hKjL91CB
Dq0tpL5R R04hKmeI W9Gs34AU
etc.,
They work as expected when the number of pattern literals is below 50,000, and become very slow somewhere up to 100,000 patterns.
Also, filetocheck.txt will contain up to 250,000 lines and grow to about 30 MB in size.
The problem comes when the pattern file grows beyond this. I have an instance of the pattern file that is around 20 MB and contains 600,000 string literals.
Matching this against a list or output of 250,000 to 300,000 lines of text practically stalls the processor.
I tried SIFT and multiple other text-search tools, but they just kill the system with their memory requirements and processor usage and make it unresponsive.
I need a command-line solution or utility that can achieve this, because it is part of a larger script.
I have tried multiple programs and methods to speed things up, such as indexing the pattern file and sorting it alphabetically, but all in vain.
Since the input will come from a program, there is no option to split the input file either; it is all one big piped command.
Example:
PASSWORDGEN | <COMMAND_TO_FILTER_KNOWN_PASSWORDS> >> FILTERED_OUTPUT
The problem is the part where the system hangs or takes a very long time to filter the stdout stream or a saved results file.
System configuration details, in case they help:
I am running this on a modest machine: 8 GB RAM, SATA HDD, Core i7, Windows 7 64-bit, and I do not currently have any better configuration available.
Any help with this issue is much appreciated.
I am also trying to write specific code to achieve this if no existing solution fits (help is appreciated in that sense as well).
I've hit a wall with my custom installation script.
At one point in the script, I need to enable the 64-bit repository on 64-bit machines, which means (for instance) I need to go from this format:
#[multilib-testing]
#Include [...]

#[multilib]
#Include [...]
to this format:
#[multilib-testing]
#Include [...]

[multilib]
Include [...]
But as you can see, there are Include lines everywhere, and I can't use a naive sed because it would uncomment every "Include" in that file, which is not what I want...
I can't seem to find a solution with sed. I tried something I saw on another thread with
cat /etc/pacman.conf | grep -A 1 "multilib"
But I didn't really understand it, and I'm out of options...
Ideally, I would like a sed solution (but feel free to suggest other options, as long as you explain them!).
The pattern (and the beginning of the command) should be something like this:
sed -i '/multilib/ s/#//' /etc/pacman.conf
And it should affect both the matched line and the line after it (which is the Include).
Also, I would be pleased if you could teach me why you do it this way or that, as I'm learning and I can't remember something if I can't figure out why it was done that way. (Also, please excuse my intermediate English.)
We can use a range of two patterns as the address, matching from the [multilib] line through the following Include line. Within that range we then match the # at the beginning of each line and remove it.
sed -i "/\[multilib\]/,/Include/"'s/^#//' /etc/pacman.conf
I know how to find the number of video cards from the command line, but in a Ruby script I wrote I had this small method to determine it:
def getNumCards
_numGpu = %x{lspci | grep VGA}.split("\n").size
end
But have determined I need to do a search for 3D as well as VGA so I changed it to:
def getNumCards
_numGpu = %x{lspci | grep "VGA\|3D"}.split("\n").size
end
But I am finding that it returns 0 when I run the second one. If I run the command on its own on the command line, it shows me 3 video cards (one onboard VGA card and two NVIDIA Tesla cards that show up as 3D controllers). I am not sure whether something in the split part is messing things up.
Any help would be awesome!
Cheers
man grep:
-E, --extended-regexp
...
egrep is the same as grep -E.
So egrep should help.
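With ERE, the alternation needs no backslash, so nothing gets mangled on the way through the %x{} string. A quick sanity check from the shell (a sketch; -c makes grep print the match count directly):
lspci | egrep -c 'VGA|3D'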
I'd go after this information one of two ways.
The almost-purely command-line version would be:
def getNumCards
# note the doubled backslashes: in a backtick string "\b" would be a Ruby backspace escape
`lspci | grep -P '\\b(?:VGA|3D)\\b' | wc -l`.to_i
end
which lets the OS do almost all the work, except for the final conversion to an integer.
-P '\b(?:VGA|3D)\b' is a Perl regex that says "find a word-break, then look for VGA or 3D, followed by another word-break". That'll help avoid any hits due to the targets being embedded in other strings.
The more-Ruby version would be:
def getNumCards
`lspci`.split("\n").grep(/\b(?:VGA|3D)\b/).count
end
It does the same thing, only all in Ruby.
I have a file system containing PNG images. The layout of the filesystem is: ZOOM/X/Y.png where ZOOM, X, and Y are all integers.
I need to change the names of the PNG files. Basically, I need to convert Y from its current value to 2^ZOOM-Y-1. I've written a bash script to accomplish this task. However, I suspect it can be optimized substantially. (I also suspect that I may have been better off writing it in Perl, but that is another story.)
Here is the script. Is this about as good as it gets? Or can the performance be optimized? Are there tools I can use that would profile the script for me and tell me where I'm spending all my execution time?
#!/bin/bash
tiles=`ls -d */*/*`
for oldPath in $tiles
do
oldY=`basename -s .png $oldPath`
zoomX=`dirname $oldPath`
zoom=`echo $zoomX | sed 's#\([^\]\)/.*#\1#'`
newY=`echo 2^$zoom-$oldY-1|bc`
mv ${zoomX}/${oldY}.png ${zoomX}/${newY}.png
done
for oldpath in */*/*
do
    oldy=$(basename "$oldpath" .png)   # Y
    zoomx=$(dirname "$oldpath")        # ZOOM/X
    zoom=$(dirname "$zoomx")           # ZOOM
    newy=$(echo "2^$zoom-$oldy-1" | bc)
    mv "$oldpath" "$zoomx/$newy.png"
done
This avoids using sed. I like basename and dirname. However, you can also use bash (and Korn) shell notations such as:
oldy=${oldpath##*/} ; oldy=${oldy%.png}
zoom=${zoomx%/*}
You might be able to do it all without invoking basename or dirname at all.
REWRITE due to a misunderstanding of the formula and the updated variable names. Still no subprocesses apart from mv and ls.
#!/bin/bash
tiles=`ls -d */*/*`
for thisPath in $tiles
do
thisFile=${thisPath#*/*/}
oldY=${thisFile%.png}
zoomX=${thisPath%/*}
zoom=${thisPath%/*/*}
newY=$(((1<<zoom) - oldY - 1))
mv ${zoomX}/${oldY}.png ${zoomX}/${newY}.png
done
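To sanity-check the renames before running them for real, a dry-run variant of the same loop (my suggestion) just prints the planned mv commands:
for thisPath in */*/*
do
    thisFile=${thisPath#*/*/}; oldY=${thisFile%.png}
    zoomX=${thisPath%/*};      zoom=${thisPath%/*/*}
    echo mv "${zoomX}/${oldY}.png" "${zoomX}/$(( (1<<zoom) - oldY - 1 )).png"
done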
It's likely that the overall throughput of your rename is limited by the filesystem. Choosing the right filesystem and tuning it for this sort of operation would speed up the overall job much more than tweaking the script.
If you optimize the script you'll probably see less CPU consumed but the same total duration. Since forking off the various subprocesses (basename, dirname, sed, bc) probably costs more than the actual work, you are probably right that a Perl implementation would use less CPU, because it can do all of those operations internally (including the mv).
I see three improvements I would make if it were my script, though I don't think they will have a huge impact.
But you should avoid parsing the output of ls at all costs. Maybe this directory is very predictable, judging by what it contains, but if I read your script correctly, you can use globbing with for directly:
for thisPath in */*/*
Also, $(cmd) is better than the deprecated backticks around cmd, which aren't nestable:
thisDir=$(dirname $thisPath)
Do arithmetic in bash directly:
newTile=$((2**$zoom-$thisTile-1))
as long as you don't need floating point and the numbers don't get too big.
I don't get the sed-part:
zoom=`echo $zoomX | sed 's#\([^\]\)/.*#\1#'`
Is there something missing after the backslash, maybe a second one? As written, you're searching for a character which isn't a backslash, followed by a slash and anything after it. Maybe it could be done purely in bash too.
One precept of computing credited to Donald Knuth is, "don't optimize too early." Scripts run pretty fast, and mv operations (as long as they're not going across filesystems, where you're really copying to another disk and then deleting the file) are pretty fast as well, since in most cases all the filesystem has to do is rename the file or change its parentage.
Probably where it's spending most of its time is in that initial ls operation. I suspect you have a lot of files. There isn't much that can be done there; doing it in another language like Perl or Python will face the same hurdle. However, you might be able to get more intelligence into the traversal and not limit yourself to three levels (*/*/*).
I have a log that should keep the latest N entries. It's not a problem if the file is a bit bigger some of the time.
My first attempt is periodically running:
tail -n 20 file.log > file.log
Unfortunately, that just empties the file. I could:
tail -n 20 file.log > .file.log; mv .file.log file.log
However, that seems messy. Is there a better way?
It sounds like you are looking for logrotate.
I agree, logrotate is probably what you need. If you still want a command-line solution, this will get the job done. Ex is a line editor; nobody uses line editors any more except in shell scripts. The syntax below is for sh/ksh/bash shells; I think it's the same in the C shell. (In the script, $ moves to the last line, -20 moves up 20 lines, 1,-1d deletes everything from the first line up to the line before the current one, then w writes the file and q quits.)
ex log.001 << HERE
$
-20
1,-1d
w
q
HERE
Use logrotate with size=xxx, where xxx is the approximate size of 20 lines, and possibly delaycompress to keep the previous file human-readable as well.
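A minimal sketch of such a config (path and numbers are placeholders, not tested), e.g. dropped into /etc/logrotate.d/mylog:
/var/log/file.log {
    size 2k
    rotate 1
    copytruncate
    compress
    delaycompress
}
copytruncate lets the writing program keep its file handle, and compress plus delaycompress keeps the most recent rotated file uncompressed, as suggested above.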