How long can I expect grep to take on a 10 TB file?

I have a 10 TB file with words from multiple books, and I'm trying to grep for some uncommon strings (no regex). For example:
grep "cappucino" filename
I'm trying to estimate how long this will take. I'm not really looking for whether it's the right approach or not. I'd like to learn more about what really happens under the hood when I call grep.
Please correct me if I'm wrong:
I use a mechanical hard drive with roughly 200 MB/s read speed, so it should take roughly 10,000,000 MB / 200 MB/s = 50,000 seconds ≈ 14 hours to finish. Is this an accurate estimate?

The short answer is: no.
The longer answer is: it depends.
The even longer answer is: grep's performance depends on a lot of things:
are you running a fixed-string search (-F, fgrep) or not: grep uses the Boyer-Moore algorithm, which by itself cannot match regular expressions, so what grep does (or at least used to do) is first pick a fixed string out of your regexp, locate it in the text with BM, and only then run a real regexp match around the candidate (I'm not sure whether the current implementation uses an NFA or a DFA, probably a hybrid); a sketch of the BM idea follows this list
how long your pattern is: BM works faster with longer patterns
how many matches you will have: the fewer the matches, the faster it will be
what your CPU and memory are like: the hard drive only helps during reading, not during the computation
what other options you are using with your grep
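To make the skipping idea concrete, here is a minimal Boyer-Moore-Horspool sketch in Python. It only illustrates the idea described above; it is not grep's actual implementation (GNU grep's matcher is considerably more elaborate). On a mismatch the search may jump ahead by up to the pattern length, which is why longer fixed patterns tend to be found faster.
def horspool_find(text: bytes, pattern: bytes) -> int:
    m = len(pattern)
    if m == 0:
        return 0
    # Bad-character table: how far we may shift when a given byte is aligned
    # with the last position of the pattern.
    skip = {pattern[i]: m - 1 - i for i in range(m - 1)}
    i = m - 1
    while i < len(text):
        # Compare the window ending at position i against the pattern, right to left.
        j = 0
        while j < m and text[i - j] == pattern[m - 1 - j]:
            j += 1
        if j == m:
            return i - m + 1              # match starts here
        i += skip.get(text[i], m)         # bytes not in the pattern allow a full-length skip
    return -1

print(horspool_find(b"a cup of cappucino, please", b"cappucino"))  # prints 9
The longer the pattern, the larger the typical jump, which is the second point in the list above.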
14 hours might not even be your lower bound, because Boyer-Moore is smart enough to compute an offset at which the next possible match could occur, so it doesn't need to examine every byte of the file. This depends on the implementation, though, and is just my speculation. After re-running the test below with a much longer pattern I was able to get down to 0.23 seconds, and I don't think my disk is that fast; there might be some caching involved instead.
For instance, I'm running on a 500 MB/s SSD (at least that's what the manufacturer says), and grepping a 200 MB file with a very short pattern (a few chars) gives me:
With 808320 hits
real 0m1.734s
user 0m1.334s
sys 0m0.120s
With 0 hits:
real 0m0.059s
user 0m0.046s
sys 0m0.016s
Edit: in short, read about Boyer-Moore :-)
Edit 2: to check how grep really works you should read the source code; I only described a very general workflow above.

Related

Parameter expansion slow for large data sets

If I take the first 1,000 bytes from a file, Bash can replace some characters pretty quickly
$ cut -b-1000 get_video_info
muted=0&status=ok&length_seconds=24&endscreen_module=http%3A%2F%2Fs.ytimg.com%2F
yts%2Fswfbin%2Fendscreen-vfl4_CAIR.swf&plid=AATWGZfL-Ysy64Mp&sendtmp=1&view_coun
t=3587&author=hye+jeong+Jeong&pltype=contentugc&threed_layout=1&storyboard_spec=
http%3A%2F%2Fi1.ytimg.com%2Fsb%2FLHelEIJVxiE%2Fstoryboard3_L%24L%2F%24N.jpg%7C48
%2327%23100%2310%2310%230%23default%23cTWfBXjxZMDvzL5cyCgHdDJ3s_A%7C80%2345%2324
%2310%2310%231000%23M%24M%23m1lhUvkKk6sTnuyKXnPBojTIqeM%7C160%2390%2324%235%235%
231000%23M%24M%23r-fWFZpjrP1oq2uq_Y_1im4iu2I%7C320%23180%2324%233%233%231000%23M
%24M%23uGg7bth0q6XSYb8odKLRqkNe7ao&approx_threed_layout=1&allow_embed=1&allow_ra
tings=1&url_encoded_fmt_stream_map=fallback_host%3Dtc.v11.cache2.c.youtube.com%2
6quality%3Dhd1080%26sig%3D610EACBDE06623717B1DC2265696B473C47BD28F.98097DEC78411
95A074D6D6EBFF8B277F9C071AE%26url%3Dhttp%253A%252F%252Fr9---sn-q4f7dney.c.youtub
e.com%252Fvideoplayback%253Fms%253Dau%2526ratebypass%253Dyes%2526ipbits%253D8%25
26key%253Dyt1%2526ip%253D99.109.97.214%2
$ read aa < <(cut -b-1000 get_video_info)
$ time set "${aa//%/\x}"
real 0m0.025s
user 0m0.031s
sys 0m0.000s
However, if I take 10,000 bytes, it slows down dramatically
$ read aa < <(cut -b-10000 get_video_info)
$ time set "${aa//%/\x}"
real 0m8.125s
user 0m8.127s
sys 0m0.000s
I read Greg Wooledge’s post but it lacks an explanation as to why Bash parameter expansion is slow.
For the why, you can see the implementation of this code in pat_subst in subst.c in the bash source code.
For each match in the string, the length of the string is counted numerous times (in pat_subst, match_pattern and match_upattern), both as a C string and more expensively as a multibyte string. This makes the function both slower than necessary, and more importantly, quadratic in complexity.
This is why it's slow for larger inputs. (The original answer included a graph of runtime against input size.)
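To see what that per-match cost does in practice, here is a hedged toy sketch in Python (not bash's actual code): the first version does work proportional to the whole string for every match, much like the repeated length counting described above, while the second makes a single pass.
def quadratic_replace(s, old, new):
    # Whole-string copy per match: roughly quadratic once the number of
    # matches grows with the input size.
    out, i = s, 0
    while True:
        i = out.find(old, i)
        if i == -1:
            return out
        out = out[:i] + new + out[i + len(old):]
        i += len(new)

def linear_replace(s, old, new):
    # Single left-to-right pass; every character is copied a constant number of times.
    parts, i = [], 0
    while True:
        j = s.find(old, i)
        if j == -1:
            parts.append(s[i:])
            return "".join(parts)
        parts.append(s[i:j])
        parts.append(new)
        i = j + len(old)
With matches proportional to the input, a tenfold larger input costs the first version far more than tenfold the time, which is the same shape as the bash timings above.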
As for workarounds, just use sed. It's more likely to be optimized for string replacement operations (though you should be aware that POSIX only guarantees 8192 bytes per line, even though GNU sed handles arbitrarily large ones).
Originally, older shells and other utilities imposed LINE_MAX = 2048
on file input for this kind of reason. For huge variables bash has no
problem parking them in memory. But substitution requires at least two
concurrent copies. And lots of thrashing: as groups of characters are
removed whole strings get rewritten. Over and over and over.
There are tools meant for this: sed is the premier choice, with bash a distant second. sed works on streams; bash works on memory blocks.
Another option: bash is extensible, so you can write custom C code (a loadable builtin) for jobs that bash was not meant to do well.
Chris F. A. Johnson has good articles on how to do that:
Some ready-to-load builtins:
http://cfajohnson.com/shell/bash/loadables/
DIY builtins explained:
http://cfajohnson.com/shell/articles/dynamically-loadable/

How to compare all the lines in a sorted file (file size > 1GB) in a very efficient manner

Let's say the input file is:
Hi my name NONE
Hi my name is ABC
Hi my name is ABC
Hi my name is DEF
Hi my name is DEF
Hi my name is XYZ
I have to create the following output:
Hi my name NONE 1
Hi my name is ABC 2
Hi my name is DEF 2
Hi my name is XYZ 1
The number of words in a single line can vary from 2 to 10. File size will be more than 1GB.
How can I get the required output in the minimum possible time? My current implementation uses a C++ program that reads a line from the file and compares it with the next line. The running time of this implementation will always be O(n), where n is the number of characters in the file.
To improve the running time, the next option is to use mmap. But before implementing it, I just wanted to confirm: is there a faster way to do it, using any other language or scripting?
uniq -c filename | perl -lane 'print "@F[1..$#F] $F[0]"'
The perl step is only to take the output of uniq (which looks like "2 Hi my name is ABC") and re-order it into "Hi my name is ABC 2". You can use a different language for it, or else leave it off entirely.
As for your question about runtime, big-O seems misplaced here; surely there isn't any chance of scanning the whole file in less than O(n). mmap and strchr seem like possibilities for constant-factor speedups, but a stdio-based approach is probably good enough unless your stdio sucks.
The code for BSD uniq could be illustrative here. It does a very simple job with fgets, strcmp, and a very few variables.
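As an illustration of that same streaming approach, here is a minimal Python sketch (assuming the input is already sorted, as in the question; "file.txt" is a placeholder name). It keeps only the previous line and a counter, so memory use is constant and the file is read exactly once.
def count_adjacent(path):
    # Compare each line with the previous one; print a group when it ends.
    prev, count = None, 0
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line == prev:
                count += 1
            else:
                if prev is not None:
                    print(prev, count)
                prev, count = line, 1
    if prev is not None:
        print(prev, count)

count_adjacent("file.txt")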
In most cases this operation will be completely I/O bound (especially with well-designed C++).
Given that, it's likely the only bottleneck you need to care about is the disk.
I think you will find this to be relevant:
mmap() vs. reading blocks
Ben Collins has a very good answer comparing mmap to standard read/write.
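If you want to try the mmap route without committing to C++ yet, here is a hedged Python sketch of the same adjacent-line count over a memory-mapped file (placeholder path; assumes a non-empty file of newline-separated lines). Whether it actually beats buffered reads is workload-dependent, so measure both.
import mmap

def count_adjacent_mmap(path):
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        prev, count, start = None, 0, 0
        while start < len(mm):
            end = mm.find(b"\n", start)
            if end == -1:
                end = len(mm)             # last line without a trailing newline
            line = mm[start:end]
            if line == prev:
                count += 1
            else:
                if prev is not None:
                    print(prev.decode(), count)
                prev, count = line, 1
            start = end + 1
        if prev is not None:
            print(prev.decode(), count)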
Well, there are two time scales being compared here that aren't really related to each other. The first is algorithmic complexity, which you express in big-O notation. It has, however, nothing to do with the complexity of reading from a file.
Say, in the ideal case, you have all your data in memory and you have to find the duplicates with an algorithm. Depending on how your data is organized (e.g. a simple list, a hash map, etc.) you could go with O(n^2), O(n), or even O(1) per lookup if you have a perfect hash (just for detecting the item); a hash-based sketch follows below.
Reading from a file or mapping it to memory has no relation to big-O notation at all, so you don't consider it in the complexity calculation. You just pick whichever takes less measured time, nothing more.
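For the in-memory, hash-based variant mentioned above, a one-pass Python sketch (it does not need sorted input, but unlike the streaming version it keeps every distinct line in memory):
from collections import Counter

def count_lines(path):
    # Expected O(n): one hash lookup per line.
    with open(path) as f:
        return Counter(line.rstrip("\n") for line in f)

for line, n in count_lines("file.txt").items():
    print(line, n)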

How can I determine why my Racket code runs so slowly?

Just for fun, I wrote a quick Racket command-line script to parse old Unix fortune files. Fortune files are just giant text files, with a single % on a blank line separating entries.
Just as a quick first hack, I wrote the following Racket code:
(define fortunes
  (with-input-from-file "fortunes.txt"
    (λ ()
      (regexp-split #rx"%" (port->string)))))
I thought it would run nearly instantly. Instead, it takes a very long time to run—on the order of a couple of minutes. In comparison, what I think of as equivalent Python:
with open('fortunes.txt') as f:
    fortunes = f.read().split('%')
executes immediately, with equivalent results to the Racket code.
What am I doing wrong here? Yes, there's some obvious low-hanging fruit, such as I'm sure that things would be better if I didn't slurp the whole file into RAM with port->string, but the behavior is so pathologically bad I feel as if I must be doing something stupid at a much higher level than that.
Is there a more Racket-like way to do this with equivalently better performance? Is Racket I/O really poor for some operations? Is there some way to profile my code slightly deeper than the naïve profiler in DrRacket so I can figure out what about a given line is causing a problem?
EDIT: The fortunes file I'm using is FreeBSD's as found at http://fortunes.cat-v.org/freebsd/, which weighs in at about 2 MB. The best runtime for Racket 5.1.3 x64 on OS X Lion was:
real 1m1.479s
user 0m57.400s
sys 0m0.691s
For Python 2.7.1 x64, it was:
real 0m0.057s
user 0m0.029s
sys 0m0.015s
Eli's right that the time is being spent almost entirely in regexp-split (although a full second appears to be spent in port->string), but it's not clear to me that there's a preferred yet equally simple method.
Looks like most of the cost is due to running regexp-split on a string. The fastest alternative that I found was splitting a byte string, then converting the results to strings:
(map bytes->string/utf-8
     (call-with-input-file "db"
       (λ (i) (regexp-split #rx#"%" (port->bytes i)))))
With a random fortune DB of ~2MB, your code takes about 35s, and this version takes 33ms.
(I'm not sure why it takes so long on a string, yet, but it's definitely way too slow.)
EDIT: We tracked it to an efficiency bug. Rough description: when Racket does a regexp-match on a string, it will convert large parts of the string to a byte string (in UTF-8) for the search. This function is the core one that is implemented in C. regexp-split uses it repeatedly to find all of the matches, and therefore keeps re-doing this conversion. I'm looking at a way to do things better, but until it's fixed, use the above workaround.
This is now fixed in the latest Git HEAD version of Racket, see: github.com/plt/racket/commit/8eefaba. Your example now runs in 0.1 seconds for me.

Print 1 followed by googolplex number of zeros

Assuming we are not concerned about the running time of the program (which is practically infinite for us mortals) and only use a limited amount of memory (2^64 bytes), we want to print, in base 10, the exact value of 10^(googolplex), one digit at a time on screen (mostly zeros).
Describe an algorithm (which can be coded on present-day computers), or write a program to do this.
Since we cannot practically check the output, we will rely on collective opinion about the correctness of the program.
NOTE: I do not know the solution, or whether a solution exists. The problem is my own invention. To those readers who are quick to mark this off-topic: kindly reconsider. This is difficult and a bit theoretical, but definitely CS.
This is impossible. The program would need more states (10^(10^100)) than there are electrons in the universe (~10^80). Therefore, in our universe, there can be no realization of a machine capable of executing the task.
First of all, we note that 10^(10^100) is equivalent to ((((10^10)^10)^...)^10), i.e. raising to the 10th power a hundred times over.
(In Knuth up-arrow terms that is still only somewhere between 10↑↑3 and 10↑↑4.)
This gives rise to the following solution:
print 1
for i in A(10, 100)
    print 0
in bash:
printf 1
while true; do
    printf 0
done
... close enough.
Here's an algorithm that solves this:
print 1
for i = 1 to 10^(10^100)
    print 0
One can trivially prove correctness using Hoare logic:
There are no pre-conditions
The postcondition is that a one followed by 10^(10^100) zeros has been printed
The loop invariant is that the number of zeros printed so far is equal to i
EDIT: A machine that solves the problem needs to be able to distinguish between one googolplex of distinct states: each state is the result of printing one more zero than the previous one. The amount of memory needed for this is the same as that needed to store the number one googolplex (the arithmetic is sketched below). If there isn't that much memory available, this problem cannot be solved.
This does not mean it isn't a computable problem: it can be solved by a Turing machine because a Turing machine has a limitless amount of memory.
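For scale, here is a quick back-of-the-envelope check of that memory argument in Python (assuming a plain binary counter; Python is used only because it handles the large numbers comfortably):
import math

# Distinguishing 10**(10**100) states needs about log2(10**(10**100)) bits.
bits_needed = 10**100 * math.log2(10)    # roughly 3.3e100 bits
bits_available = 2**64 * 8               # the 2^64 bytes allowed by the question
print(f"{bits_needed:.2e} bits needed vs {bits_available:.2e} bits available")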
There definitely is a solution to this problem in theory, assuming of course you have a machine that is capable of producing that sort of output. I'm pretty sure that a googolplex is larger than the number of atoms in the universe, at least according to what the physicists tell us, so I don't think that any physically realizable model of computation could print it out. However, mathematically speaking, you could define a Turing machine capable of printing out the value by just giving it a googolplex-ish number of states and having each write a zero and then move to the next lower state.
Consider the following:
The console window to which you are printing the output will have a maximum buffer size.
When this buffer size is exceeded, anything printed earlier is discarded, and the user will not be able to scroll back to see it.
The maximum buffer size will be minuscule compared to a googolplex.
Therefore, if you want to mimic the user experience of your program running to completion, find the maximum buffer size of the console you will print to and print that many zeroes.
Hurray laziness!

Do comments affect Perl performance?

I'm optimizing some frequently run Perl code (once per day per file).
Do comments slow Perl scripts down? My experiments lean towards no:
use Benchmark;
timethese(20000000, {
    'comments' => '$b=1;
# comment ... (100 times)
',
    'nocomments' => '$b=1;'
});
Gives pretty much identical values (apart from noise).
Benchmark: timing 10000000 iterations of comments, nocomments...
comments: 1 wallclock secs ( 0.53 usr + 0.00 sys = 0.53 CPU) @ 18832391.71/s (n=10000000)
nocomments: 0 wallclock secs ( 0.44 usr + 0.00 sys = 0.44 CPU) @ 22935779.82/s (n=10000000)
Benchmark: timing 20000000 iterations of comments, nocomments...
comments: 0 wallclock secs ( 0.86 usr + -0.01 sys = 0.84 CPU) @ 23696682.46/s (n=20000000)
nocomments: 1 wallclock secs ( 0.90 usr + 0.00 sys = 0.90 CPU) @ 22099447.51/s (n=20000000)
I get similar results if I run the comments and no-comments versions as separate Perl scripts.
It seems counter-intuitive, though; if nothing else, the interpreter needs to read the comments into memory every time.
Runtime performance? No.
Parsing and lexing performance? Yes, of course.
Since Perl tends to parse and lex on the fly, then comments will affect "start up" performance.
Will they affect it noticeably? Unlikely.
Perl is a just-in-time compiled language, so comments and POD have no effect on run-time performance.
Comments and POD have a minuscule effect on compile-time, but they're so easy and fast for Perl to parse it's almost impossible to measure the performance hit. You can see this for yourself by using the -c flag to just compile.
On my MacBook, a Perl program with 2 statements and 1,000 lines of 70-character comments takes the same time to compile as one with 1,000 empty comment lines, and the same as one with just the 2 print statements. Be sure to run each benchmark twice to allow your OS to cache the file; otherwise what you're benchmarking is the time to read the file from the disk.
If startup time is a problem for you, it's not because of comments and POD.
Perl compiles a script and then executes it. Comments marginally slow the compile phase, but have zero effect on the run phase.
Perl is not a scripting language in the same sense that shell scripts are. The interpreter does not read the file line by line. The execution of a Perl program is done in two basic stages: compilation and runtime [1]. During the compilation stage the source code is parsed and converted into bytecode. During the runtime stage the bytecode is executed on a virtual machine.
Comments will slow down the parsing stage but the difference is negligible compared to the time required to parse the script itself (which is already very small for most programs). About the only time you're really concerned with parsing time is in a webserver environment where the program could be called many times per second. mod_perl exists to solve this problem.
You're using Benchmark. That's good! You should be looking for ways to improve the algorithm -- not micro-optimizing. Devel::DProf might be helpful to find any hot spots. You absolutely should not strip comments in a misguided attempt to make your program faster. You'll just make it unmaintainable.
[1] This is commonly called "just in time" compilation. Perl actually has several more stages like INIT and END that don't matter here.
The point is: optimize bottlenecks. Reading in a file consists of:
opening the file,
reading in its contents,
closing the file,
parsing the contents.
Of these steps, reading is the fastest part by far (I am not sure about closing; it is a syscall, but you don't have to wait for it to finish). Even if it were 10% of the whole thing (which it is not, I think), then reducing it by half would only give a 5% performance improvement, at the cost of missing comments (which is a very bad thing). For the parser, throwing away a line that begins with # is not a tangible slowdown. And after that, the comments are gone, so there can be no slowdown.
Now, imagine that you could actually improve the "reading in the script" part by 5% by stripping all comments (which is a really optimistic estimate, see above). How big is the share of "reading in the script" in the overall run time of the script? It depends on how much the script does, of course, but since Perl scripts usually read at least one more file, it is 50% at most; and since Perl scripts usually do quite a bit more than that, an honest estimate brings it down to somewhere around 1%. So the expected improvement from stripping all comments is at most (very optimistically) 2.5%, but realistically closer to 0.05%. And the scripts where it actually gives more than 1% are already fast, since they do almost nothing, so you would again be optimizing in the wrong place.
Concluding, optimize bottlenecks.
The Benchmark module is useless in this case. It's only measuring the time to run the code over and over again. Since your code doesn't actually do anything, most of it is optimized away. That's why you're seeing it run 22 million times a second.
I have almost an entire chapter about this in Mastering Perl. The error of measurement in the Benchmark technique is about 7%. Your benchmark numbers are well within that, so there's virtually no difference.
From Paul Tomblin's comment:
Doesn't perl do some sort of on-the-fly compilation? Maybe the comments get discarded early?
Yes Perl does.
It is a programming language somewhere between compiled and interpreted: the code gets compiled on the fly and then run. The comments usually don't make any difference. The most they could affect is the initial line-by-line parsing and pre-compilation of the file, where you might see a nanosecond difference.
I would expect that the one comment would only get parsed once, not multiple times in the loop, so I doubt it is a valid test.
I would expect that comments would slightly slow compilation, but I expect it would be too minor to bother removing them.
Do Perl comments slow a script down? Parsing it, yes. Executing it after parsing? No. How often is a script parsed? Only once. So if you have a comment inside a for loop, the comment is discarded by the parser once, before the script even runs; once the script is running, the comment is already gone (the script is not stored as source internally by Perl), so no matter how many times the for loop repeats, the comment has no influence. How fast can the parser skip over comments? The way Perl comments work, very fast, so I doubt you will notice. You would notice a higher start-up time if you had 5 lines of code with a million lines of comments between each of them... but how likely is that, and of what use would a comment that large be?

Resources