Do comments affect Perl performance?

I'm optimizing some frequently run Perl code (once per day per file).
Do comments slow Perl scripts down? My experiments lean towards no:
use Benchmark;
timethese(20000000, {
    'comments'   => '$b=1;
# comment ... (100 times)
',
    'nocomments' => '$b=1;',
});
Gives pretty much identical values (apart from noise).
Benchmark: timing 10000000 iterations of comments, nocomments...
  comments:  1 wallclock secs ( 0.53 usr + 0.00 sys = 0.53 CPU) @ 18832391.71/s (n=10000000)
nocomments:  0 wallclock secs ( 0.44 usr + 0.00 sys = 0.44 CPU) @ 22935779.82/s (n=10000000)
Benchmark: timing 20000000 iterations of comments, nocomments...
  comments:  0 wallclock secs ( 0.86 usr + -0.01 sys = 0.84 CPU) @ 23696682.46/s (n=20000000)
nocomments:  1 wallclock secs ( 0.90 usr + 0.00 sys = 0.90 CPU) @ 22099447.51/s (n=20000000)
I get similar results if I run the comments and no-comments versions as separate Perl scripts.
It seems counter-intuitive, though; if nothing else, the interpreter needs to read the comments into memory every time.

Runtime performance? No.
Parsing and lexing performance? Yes, of course.
Since Perl parses and lexes on the fly, comments will affect "start-up" performance.
Will they affect it noticeably? Unlikely.

Perl is a just-in-time compiled language, so comments and POD have no effect on run-time performance.
Comments and POD have a minuscule effect at compile time, but they're so easy and fast for Perl to parse that it's almost impossible to measure the performance hit. You can see this for yourself by using the -c flag to compile only.
On my MacBook, a Perl program with 2 print statements and 1000 lines of 70-character comments takes the same time to compile as one with 1000 lines of empty comments, and the same as one with just the 2 print statements. Be sure to run each benchmark twice so your OS can cache the file; otherwise what you're benchmarking is the time to read the file from disk.
If startup time is a problem for you, it's not because of comments and POD.
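As a sketch of that experiment (the file names and the generated comment block below are made up for illustration), you could write two scripts that differ only in a large block of comments and time perl -c on each. Note that most of what you measure is interpreter start-up, so expect the two numbers to be nearly identical:

use strict;
use warnings;
use Time::HiRes qw(time);

# Two generated scripts: one plain, one with 1000 lines of 70-character comments.
my $body     = qq{print "hello\\n";\nprint "goodbye\\n";\n};
my $comments = ("# " . ("x" x 68) . "\n") x 1000;

# Hypothetical file names, for illustration only.
write_file('plain.pl',     $body);
write_file('commented.pl', $comments . $body);

# Run this loop more than once so the OS file cache doesn't skew the first pass.
for my $file ('plain.pl', 'commented.pl') {
    my $start = time;
    system($^X, '-c', $file) == 0 or die "compile failed for $file";
    printf "%-13s compiled in %.4f s\n", $file, time - $start;
}

sub write_file {
    my ($name, $text) = @_;
    open my $fh, '>', $name or die "cannot write $name: $!";
    print {$fh} $text;
    close $fh;
}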

Perl compiles a script and then executes it. Comments marginally slow the compile phase, but have zero effect on the run phase.

Perl is not a scripting language in the same sense that shell scripts are. The interpreter does not read the file line by line. The execution of a Perl program is done in two basic stages: compilation and runtime [1]. During the compilation stage the source code is parsed and converted into bytecode. During the runtime stage the bytecode is executed on a virtual machine.
Comments will slow down the parsing stage but the difference is negligible compared to the time required to parse the script itself (which is already very small for most programs). About the only time you're really concerned with parsing time is in a webserver environment where the program could be called many times per second. mod_perl exists to solve this problem.
You're using Benchmark. That's good! You should be looking for ways to improve the algorithm -- not micro-optimizing. Devel::DProf might be helpful to find any hot spots. You absolutely should not strip comments in a misguided attempt to make your program faster. You'll just make it unmaintainable.
[1] This is commonly called "just in time" compilation. Perl actually has several more stages like INIT and END that don't matter here.
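For anyone who wants to see the two stages directly, here is a minimal sketch: a BEGIN block runs as soon as it has been compiled, while ordinary statements wait for the runtime stage, and perl -c stops after compilation.

use strict;
use warnings;

# Runs during the compilation stage, as soon as the parser has seen this block.
BEGIN { print "compile stage\n"; }

# Runs during the runtime stage; under `perl -c` this line never executes.
print "run stage\n";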

The point is: optimize bottlenecks. Reading in a file consists of:
opening the file,
reading in its contents,
closing the file,
parsing the contents.
Of these steps, reading is the fastest part by far (I'm not sure about closing; it is a syscall, but you don't have to wait for it to finish). Even if reading were 10% of the whole thing (which it is not, I think), reducing it by half would only give a 5% improvement, at the cost of losing the comments (which is a very bad thing). For the parser, throwing away a line that begins with # is not a tangible slowdown. And after that, the comments are gone, so there can be no slowdown.
Now, imagine that you could actually improve the "reading in the script" part by 5% by stripping all comments (a really optimistic estimate, see above). How big is the share of "reading in the script" in the overall running time of the script? It depends on how much the script does, of course, but since Perl scripts usually read at least one more file, it is 50% at most; and since Perl scripts usually do quite a bit more than that, an honest estimate brings it down to something in the range of 1%. So the expected improvement from stripping all comments is at most 2.5% (very optimistic), but realistically closer to 0.05%. And the scripts where it actually gives more than 1% are already fast, since they do almost nothing, so once again you would be optimizing in the wrong place.
Concluding, optimize bottlenecks.

The Benchmark module is useless in this case. It's only measuring the time to run the code over and over again. Since your code doesn't actually do anything, most of it is optimized away. That's why you're seeing it run 22 million times a second.
I have almost an entire chapter about this in Mastering Perl. The error of measurement in the Benchmark technique is about 7%. Your benchmark numbers are well within that, so there's virtually no difference.
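As an illustration of the same point, here is a minimal sketch of a benchmark whose body does real work (the subs and the 1..1_000 sum are invented for this example). Both subs are compiled once before the benchmark starts, so any comments in their source cannot show up in the numbers, and the rates should differ only by noise:

use strict;
use warnings;
use Benchmark qw(cmpthese);

sub with_comments {
    # This comment was discarded at compile time.
    my $sum = 0;
    $sum += $_ for 1 .. 1_000;
    return $sum;
}

sub without_comments {
    my $sum = 0;
    $sum += $_ for 1 .. 1_000;
    return $sum;
}

# Run each sub for at least 3 CPU seconds and compare the rates.
cmpthese(-3, {
    with_comments    => \&with_comments,
    without_comments => \&without_comments,
});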

From Paul Tomblin's comment:
Doesn't Perl do some sort of on-the-fly compilation? Maybe the comments get discarded early?
Yes, Perl does.
It is a programming language somewhere between compiled and interpreted: the code is compiled on the fly and then run, so the comments usually don't make any difference. The most they could affect is the initial pass where the file is parsed and precompiled, and there you might see a nanosecond of difference.

I would expect the comment to be parsed only once, not on every pass through the loop, so I doubt it is a valid test.
I would expect that comments would slightly slow compilation, but I expect it would be too minor to bother removing them.

Do Perl comments slow a script down? Parsing it, yes. Executing it after it has been parsed? No. How often is a script parsed? Only once. If you have a comment inside a for loop, the parser discards it once, before the script even runs; by the time the loop executes, the comment is already gone (Perl does not store the program internally as source text), so no matter how many times the loop repeats, the comment has no influence. How fast can the parser skip over comments? Given how Perl comments work, very fast, so I doubt you will notice. You would see a higher start-up time if you had 5 lines of code with a million lines of comments between each of them... but how likely is that, and of what use would a comment that large be?

Related

Fortran unformatted I/O optimization

I'm working on a set of Fortran programs that are heavily I/O bound, so I'm trying to optimize this. I've read in multiple places that writing entire arrays is faster than writing individual elements, i.e. WRITE(10) arr is faster than DO i=1,n; WRITE(10) arr(i); ENDDO. But I'm unclear where my case falls in this regard. Conceptually, my code is something like:
OPEN(10, FILE='testfile', FORM='UNFORMATTED')
DO i = 1, n
   [calculations to determine m values stored in array arr]
   WRITE(10) m
   DO j = 1, m
      WRITE(10) arr(j)
   ENDDO
ENDDO
But m may change each time through the DO i=1,n loop, so writing the whole array arr isn't an option. Collapsing the inner DO loop for writing would end up with WRITE(10) arr(1:m), which isn't the same as writing the whole array. Would this still provide a speed-up for writing? What about reading? I could allocate an array of size m after the calculations, assign the values to that array, write it, then deallocate it, but that seems too involved.
I've also seen differing information on implied DO loop writes, i.e. WRITE(10) (arr(j),j=1,m), as to whether they help/hurt on I/O overhead.
I'm running a couple of tests now, and intend to update with my observations. Other suggestions on applicable optimizations are welcome.
Additional details:
The first program creates a large file, the second reads it. And, no, merging the two programs and keeping everything in memory isn't a valid option.
I'm using unformatted I/O and have access to the Portland Group and gfortran compilers. It's my understanding that PG's is generally faster, so that's what I'm using.
The output file is currently ~600 GB, the codes take several hours to run.
The second program (reading in the file) seems especially costly. I've monitored the system and seen that it's mostly CPU-bound, even when I reduce the code to little more than reading the file, indicating that there is very significant CPU overhead on all the I/O calls when each value is read in one-at-a-time.
Compiler flags: -O3 (high optimization), -fastsse (various performance enhancements optimized for SSE hardware), -Mipa=fast,inline (enables aggressive inter-procedural analysis/optimization in the compiler).
UPDATE
I ran the codes with WRITE(10) arr(1:m) and READ(10) arr(1:m). My tests with these agreed, and showed a reduction in runtime of about 30% for the WRITE code; the output file is also slightly less than half the original's size. For the second code, reading in the file, I made it do basically nothing but read the file, to compare pure read time. This reduced the run time by a factor of 30.
If you use normal unformatted (record-oriented) I/O, you also write a record marker before and after the data itself. So you add (usually) eight bytes of overhead to each data item, which can easily (almost) double the data written to disc if each value is a double-precision number. The runtime overhead mentioned in the other answers is also significant.
The argument above does not apply if you use unformatted stream.
So, use
WRITE (10) m
WRITE (10) arr(1:m)
For gfortran, this is faster than an implied DO loop (i.e. the solution WRITE (10) (arr(i),i=1,m)).
In the suggested solution, an array descriptor is built and passed to the library with a single call. I/O can then be done much more efficiently, in your case taking advantage of the fact that the data is contiguous.
For the implied DO loop, gfortran issues multiple library calls, with much more overhead. This could be optimized, and is subject of a long-standing bug report, PR 35339, but some complicated corner cases and the presence of a viable alternative have kept this from being optimized.
I would also suggest doing I/O with stream access, not because of the rather insignificant saving in space (see above), but because keeping the leading record marker up to date while writing requires a seek, which is additional effort.
If your data size is very large, above ~ 2^31 bytes, you might run into different behavior with record markers. gfortran uses subrecords in this case (compatible to Intel), but it should just work. I don't know what Portland does in this case.
For reading, of course, you can read m, then allocate an allocatable array, then read the whole array in one READ statement.
The point of avoiding outputting an array by looping over multiple WRITE() operations is to avoid the multiple WRITE() operations. It's not particularly important that the data being output are all the members of the array.
Writing either an array section or a whole array via a single WRITE() operation is a good bet. An implied DO loop cannot be worse than an explicit outer loop, but whether it's any better is a question of compiler implementation. (Though I'd expect the implied-DO to be better than an outer loop.)

How long can I expect grep to take on a 10 TB file?

I have a 10 TB file with words from multiple books, and I'm trying to grep for some uncommon strings (no regex). For example:
grep "cappucino" filename
I'm trying to estimate how long this will take. I'm not really looking for whether it's the right approach or not. I'd like to learn more about what really happens under the hood when I call grep.
Please correct me if I'm wrong:
I'm using a mechanical hard drive with roughly 200 MB/s read speed, so a 10 TB file is about 10,000,000 MB, and 10,000,000 / 200 = 50,000 seconds ≈ 14 hours to finish. Is this an accurate estimate?
The short answer is: no.
The longer answer is: it depends.
The even longer answer is: grep's performance depends on a lot of things:
are you running a fixed-string search (-F, fgrep) or not? grep uses the Boyer-Moore algorithm, which by itself can't match regular expressions, so what grep does (or at least used to do) is first pick a fixed string out of your regexp, look for it in the text with BM, and only then attempt the regexp match (I'm not sure whether the current implementation uses an NFA or a DFA, probably a hybrid)
how long your pattern is - BM works faster for longer patterns
how many matches you will have - the fewer the matches, the faster it will be
what your CPU and memory are - a fast hard drive only helps during reading, not during computation
what other options you are using with your grep
14 hours might not even be your lower bound, because Boyer-Moore is smart enough to compute an offset at which the next possible match might occur, so it doesn't need to read in the whole file. This does depend on the implementation, though, and is just my speculation. After re-running the test below with a much longer pattern I was able to get down to 0.23 sec, and I don't think my disk is that fast; there might be some caching involved instead.
For instance, I'm running on a 500 MB/s SSD (at least that's what the manufacturer says), and grepping a 200 MB file with a very short pattern (a few chars) gives me:
With 808320 hits
real 0m1.734s
user 0m1.334s
sys 0m0.120s
With 0 hits:
real 0m0.059s
user 0m0.046s
sys 0m0.016s
Edit: in short, read about Boyer-Moore :-)
Edit 2: to check how grep actually works you should look at the source code; I described a very general workflow above.
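To make the skipping idea concrete, here is a rough sketch, in Perl to match the rest of this page, of the Horspool simplification of Boyer-Moore. This is not grep's actual code, just an illustration of why longer patterns allow bigger jumps:

use strict;
use warnings;

sub horspool_search {
    my ($text, $pat) = @_;
    my $n = length $text;
    my $m = length $pat;
    return -1 if $m == 0 || $m > $n;

    # Bad-character shift table: for every pattern character except the last,
    # record how far the window may jump when that character sits under the
    # pattern's final position. Characters not in the pattern allow a full skip.
    my %shift;
    for my $i (0 .. $m - 2) {
        $shift{ substr($pat, $i, 1) } = $m - 1 - $i;
    }

    my $pos = 0;
    while ($pos <= $n - $m) {
        return $pos if substr($text, $pos, $m) eq $pat;   # match at this alignment
        my $last = substr($text, $pos + $m - 1, 1);       # text char under the pattern's end
        $pos += $shift{$last} // $m;                      # skip ahead
    }
    return -1;
}

print horspool_search("a cup of cappucino, please", "cappucino"), "\n";   # prints 9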

Ruby Benchmarking Accuracy - Branch Prediction at its finest?

So this morning I decided to play around with Benchmarking for the first time.
I was curious about the speed difference between code with "do-end" block formatting vs. "{ }" formatting.
So I stored the Benchmark code in a Proc so I could call it multiple times consecutively:
n = 100_000_000
bmp = Proc.new do
  Benchmark.bm do |x|
    x.report { n.times { a = "1" } }
    x.report { n.times do; a = "1"; end }
  end
end
My results were as expected when I ran it once.
>> bmp.call
user system total real
1.840000 0.030000 1.870000 ( 1.874507)
1.860000 0.050000 1.910000 ( 1.926101)
=> true
But then I ran it again.
>> bmp.call
user system total real
1.870000 0.050000 1.920000 ( 1.922810)
1.840000 0.000000 1.840000 ( 1.850615)
To me this looks like the exact opposite of what I expected. I am familiar with the concept of branch prediction. Is this a classic example of branch prediction? If not, what is going on? Is there any way to prevent inaccuracies like this (if this is even considered one)?
EDIT: After some suggestions, I ran this code over 30 times. Frequently it would alternate between the two results. A sample of the data is found here:
gist.github.com/TheLarkInn/5599676
First of all, your benchmark is utterly pointless. The difference between the do / end syntax and the { / } syntax is just that: syntax. There is no semantic difference. Ergo, there cannot possibly be any runtime performance difference whatsoever between the two. It's just not logically possible. You don't need to benchmark it.
The only performance difference that could exist is that one takes longer to parse than the other. However, neither of the two is harder to parse than the other. The only difference is precedence. Therefore, there very likely isn't any performance difference in parsing, either.
And even if there were a performance difference in parsing, your benchmark wouldn't show it. You are using a benchmark written in Ruby, but in order to run Ruby code, the Ruby execution engine has to parse it first, which means that parsing will already have happened, before your benchmark even starts. So, even if your benchmark weren't pointless, it would still be useless, since it cannot possibly measure the performance difference in parsing.
As to your question about Branch Prediction: there are no branches in your code, there is nothing to predict.
BTW: even if your benchmark were intended for a different purpose, it still wouldn't be measuring anything, since at least the more advanced Ruby implementations would recognize that your blocks are essentially no-ops and simply optimize them away. And even if they aren't optimized away, all you're measuring is memory allocator performance (allocating a couple hundred megabytes of tiny String objects), not the performance of blocks.
Just a quick primer on stats:
I'm not sure that two runs are enough to spot a trend. What if there was a difference in system load between the two test blocks the second time you ran it?
A rule of thumb for determining a statistical difference between two samples is that 30 or more data points will give you a statistically relevant result.
I'd run your tests at least that many times, store the results for the two versions independently, and then compare them internally to ensure they're consistent, before comparing the two sets to one-another.
It could be that your initial premise is incorrect :)

Do speeds of if statements in a repetitive loop affect overall performance?

If I have code that will take a while to execute, printing out results every iteration will slow down the program a lot. To still receive occasional output to check on the progress of the code, I might have:
if (i % 10000 == 0) {
    # print progress here
}
Does the if check on every iteration slow it down at all? Should I just leave the output out and wait, and will that make it noticeably faster?
Also, is it faster to do: (i % 10000 == 0) or (i == 10000)?
Is checking equality or modulus faster?
In the general case, it won't matter at all.
A slightly longer answer: it won't matter unless the loop runs millions of times and the rest of the loop body is actually less demanding than an if statement (for example, a simple multiplication). In that case, you might see a slight performance drop.
Regarding (i % 10000 == 0) vs. (i == 10000), the latter is obviously faster, because it only compares, whereas the former does a (fairly costly) modulus and then a comparison.
That said, neither an if statement nor a modulus will make any difference unless your loop takes up 90% of the program's running time, which is usually the case only at school :). You probably spent more time asking this question than you would save by not printing anything. For development and debugging, this is not a bad way to go.
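If you want to gauge the size of the effect yourself, here is a rough sketch in Perl (the loop bodies and counts are invented for illustration) comparing a bare loop against loops that add an equality test or a modulus test on every iteration; any difference tends to shrink as soon as the loop body does real work:

use strict;
use warnings;
use Benchmark qw(cmpthese);

my $n = 1_000_000;

# Each sub does the same trivial work; two of them add a per-iteration check.
# The absolute rates are machine-dependent; what matters is how close they are.
cmpthese(-3, {
    bare     => sub { my $x = 0; for my $i (1 .. $n) { $x += $i } },
    eq_test  => sub { my $x = 0; for my $i (1 .. $n) { $x += $i; $x++ if $i == 10_000 } },
    mod_test => sub { my $x = 0; for my $i (1 .. $n) { $x += $i; $x++ if $i % 10_000 == 0 } },
});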
The golden rule for this kind of decision:
Write the most readable and explicit code you can imagine to do the thing you want it to do. If you have a performance problem, look at wrong data structures and algorithmic choices first. If you have done all that and still need a really quick program, profile it to see which part takes the most time. After all of that, you're allowed to make this kind of low-level guess.

How to benchmark Matlab processes?

Searching for ideas on how to avoid using a loop in my Matlab code, I found the following comments under one question on SE:
The statement "for loops are slow in Matlab" is no longer generally true since Matlab...euhm, R2008a?
and
Have you tried to benchmark a for loop vs what you already have? sometimes it is faster than vectorized code...
So I would like to ask: is there a commonly used way to test the speed of a process in Matlab? Can the user see somewhere how much time a process takes, or is the only way to stretch the process out over several minutes in order to compare the times against each other?
The best tool for testing the performance of MATLAB code is Steve Eddins' timeit function, available here from the MATLAB Central File Exchange.
It handles many subtle issues related to benchmarking MATLAB code for you, such as:
ensuring that JIT compilation is used by wrapping the benchmarked code in a function
warming up the code
running the code several times and averaging
Update: As of release R2013b, timeit is part of core MATLAB.
Update: As of release R2016a, MATLAB also includes a performance testing framework that handles the above issues for you in a similar way to timeit.
You can use the profiler to assess how much time your functions, and the blocks of code within them, are taking.
>> profile on; % Starts the profiler
>> myfunctiontorun( ); % This can be a function, script or block of code
>> profile viewer; % Opens the viewer showing you how much time everything took
The viewer also clears the current profile data for next time.
Bear in mind, profile does tend to slow execution a bit, but I believe it does so in a uniform way across everything.
Obviously, if your function is very quick, you might find that you don't get reliable results, so if you can run it many times or extend the computation, that will improve matters.
If it's really simple stuff you're testing, you can also just time it using tic and toc:
>> tic; % Start the timer
>> myfunctionname( );
>> toc; % End the timer and display elapsed time
Also if you want multiple timers, you can assign them to variables:
>> mytimer = tic;
>> myfunctionname( );
>> toc(mytimer);
Finally, if you want to store the elapsed time instead of display it:
>> myresult = toc;
I think that I am right to state that many of us time Matlab by wrapping the block of code we're interested in between tic and toc. Furthermore, we take care to ensure that the total time is of the order of 10s of seconds (rather than 1s of seconds or 100s of seconds) and repeat it 3 - 5 times and take some measure of central tendency (such as the mean) and draw our conclusions from that.
If the piece of code takes less than, say 10s, then repeat it as many times as necessary to bring it into the range, being careful to avoid any impact of one iteration on the next. And if the code naturally takes 100s of seconds or longer, either spend longer on the testing or try it with artificially small input data to run more quickly.
In my experience it's not necessary to run programs for minutes to get data on average run time with acceptably low variance. If I run a program 5 times and one (or two) of the results is wildly different from the mean I'll re-run it.
Of course, if the code has any features which make its run time non-deterministic then it's a different matter.
