Ruby v2.3.3p222
MacOS Catalina v10.15.3
Processor: 2.6GHz 6-Core Intel Core i7
I have the following performance benchmark script, which was designed to test the difference between one larger loop operation vs. two smaller loops:
require 'benchmark'
N = 10_000_000
def one_loop
N.times do
foo = 1+1
bar = 2+2
end
end
def two_loops
N.times do
foo = 1+1
end
N.times do
bar = 2+2
end
end
Benchmark.bmbm do |performance|
performance.report("two smaller loops") { two_loops }
performance.report("one large loop") { one_loop }
end
My hypothesis was that these two methods would execute in approximately the same amount of time, because (I thought) they were both executing the same number of instructions: the larger loop was doing 2 * 10,000,000 operations, while each of the 2 smaller loops was doing 1 * 10,000,000 operations.
However, that doesn't appear to be what I observed. When I run the script, I get the following output:
Rehearsal -----------------------------------------------------
two smaller loops 0.840000 0.000000 0.840000 ( 0.838101)
one large loop 0.500000 0.010000 0.510000 ( 0.506283)
-------------------------------------------- total: 1.350000sec
user system total real
two smaller loops 0.850000 0.000000 0.850000 ( 0.863052)
one large loop 0.500000 0.000000 0.500000 ( 0.494525)
This is really disappointing, because I was hoping to convince my team that we wouldn't see any reduced performance by splitting up our 1 large loop of code into several more concise loops which each did one thing and did it well.
I thought it might be due to the order in which the reports are generated, but when I reverse the order of the two calls to performance.report, I get similarly disappointing results:
Rehearsal -----------------------------------------------------
one large loop 0.500000 0.010000 0.510000 ( 0.508246)
two smaller loops 0.850000 0.000000 0.850000 ( 0.852467)
-------------------------------------------- total: 1.360000sec
user system total real
one large loop 0.490000 0.000000 0.490000 ( 0.496130)
two smaller loops 0.830000 0.000000 0.830000 ( 0.831476)
Am I missing something? Are the 2 smaller loops really doing a much greater amount of work than the single larger loop? Or did I somehow construct my benchmark script in a misleading or inaccurate way?
the larger loop was doing 2 * 10,000,000 operations, while each of the 2 smaller loops was doing 1 * 10,000,000 operations
It doesn't make sense to talk about "operations" without defining the machine model and cost model within which we model those "operations". Or, to put it simply: it doesn't make sense to count things until you are clear what you are counting.
In this case, you are counting additions. And you are correct: within your model which counts only additions, both versions have the same amount of additions.
They do, however, not have the same amount of block activations.
Remember, Integer#times looks roughly like this:
class Integer
def times
return enum_for(__callee__) unless block_given?
return self unless positive?
i = -1
yield i while (i += 1) < self
self
end
end
So, for each iteration of the loop, there is an activation (i.e. a yield) of the block passed to Integer#times.
If we add that as a new class of "operation", we have the following:
one_loop: 20,000,000 additions and 10,000,000 block activations
two_loops: 20,000,000 additions and 20,000,000 block activations
So, both methods have the same number of additions but two_loops has twice the number of block activations.
Which means, we also have to take into account the relative costs of additions versus block activations. Now, semantically, an addition is just a normal method call. And activating a block is sort-of similar to a method call.
So, we would expect that an addition and a block activation have roughly similar cost, which means our cost would be:
one_loop: 30,000,000 "method call like operations"
two_loops: 40,000,000 "method call like operations"
In other words, we would expect two_loops to be 33% slower or one_loop to be 25% faster, depending on how you look at it.
However, we actually find that the difference is much bigger, so clearly we are missing something in our model.
What we are missing are optimizations. Arithmetic operations on integers are extremely common and extremely performance-critical, and so all Ruby implementations go to great lengths to make them fast. In fact, on all Ruby implementations, simple additions such as the one you are using will be mapped directly to single CPU ADD instructions, and will not incur the overhead of a method call at all.
Block activations are also very important in Ruby, and thus they are also heavily optimized, but they are just fundamentally several orders of magnitude more complex than adding two machine-word integers.
In fact, the relative complexity of block activations to machine-word integer additions is so big, that we can actually ignore the additions completely in our model:
one_loop: 10,000,000 block activations
two_loops: 20,000,000 block activations
This gives us a factor of 2:1, so we would expect two_loops to be 100% slower or one_loop to be 50% faster.
By the way, I ignored another operation that is happening: the definition and initialization of the local variables. The argument is similar: that is an operation that is so fast that it is negligible compared to the block activation.
Actually, so far, we have only talked about the relative cost of those operations and how they mean that we can ignore the cost of the additions and the local variables. There is, however, an even stronger reason to ignore those: optimizations.
Even the most simple Ruby implementation will be able to optimize away the local variables completely: they are defined and initialized in only one place, and they are never accessed ever again. They only exist within the scope of the block, for the duration of one activation of the block, so even a very simple optimizer can see that they are completely useless, so even the most simple of optimizers will optimize the code to something roughly like this:
def one_loop
N.times do
1+1
2+2
end
end
def two_loops
N.times do
1+1
end
N.times do
2+2
end
end
Meaning that we can not only ignore the cost of the local variables because it is small compared to the other costs, but actually, the local variables don't even exist.
Likewise, a slightly smarter optimizer will recognize that the first addition in one_loop has no side-effects, is not returned, is not stored in a variable (or at least not in one that is used anywhere), and in general does not influence the outcome of the computation in any way, shape, or form, and thus will optimize the code to this:
def one_loop
N.times do
2+2
end
end
def two_loops
N.times do
1+1
end
N.times do
2+2
end
end
Furthermore, the same argument actually applies to the remaining addition. It has no side-effect, all it does is being returned from the block, but Integer#times ignores the return value of the block. I haven't taken a look at the generated code, but I strongly suspect that even the most stupid optimizer can easily prove that your blocks are no-ops, and thus it will optimize the code to something roughly like this:
def one_loop
N.times do
end
end
def two_loops
N.times do
end
N.times do
end
end
Which means that one_loop has N iterations of the block, two_loops has 2 * N iterations, and thus should take roughly twice as long.
Now, we can see in your benchmarks that the numbers are not actually 2:1. They are 1.75:1 or roughly 7:4.
I can confirm those results on my machine, here with YARV 2.7.1 without JIT, I get almost exactly 7:4:
user system total real
two smaller loops 0.711479 0.000099 0.711578 ( 0.711680)
one large loop 0.401808 0.000059 0.401867 ( 0.401916)
However, when I turn on the JIT, I get almost exactly the 2:1 we would expect:
user system total real
two smaller loops 0.587017 0.000279 0.587296 ( 0.587098)
one large loop 0.291713 0.000062 0.291775 ( 0.291779)
You will also notice that it is faster in general.
With JRuby 9.2.9.0, we get again slightly faster execution and almost 2:1:
user system total real
two smaller loops 0.740000 0.010000 0.750000 ( 0.517670)
one large loop 0.260000 0.000000 0.260000 ( 0.263270)
This was with default options, here are the results with some more aggressive compiler flags:
user system total real
two smaller loops 0.370000 0.000000 0.370000 ( 0.362050)
one large loop 0.390000 0.010000 0.400000 ( 0.213861)
TruffleRuby 20.1.0 is again much faster than even JRuby:
user system total real
two smaller loops 0.009955 0.000039 0.009994 ( 0.010035)
one large loop 0.004759 0.000007 0.004766 ( 0.004742)
And again, very close to 2:1. Also, even though we are only interested in the relative performance of the two methods, it is quite pleasing to see that TruffleRuby is 70x-100x faster than YARV on this benchmark!
Actually, I am somewhat surprised that TruffleRuby could not prove that Integer#times with an empty block body is a no-op. I would have expected it to be able to optimize the code like this:
def one_loop
end
def two_loops
end
And thus have no runtime difference between the two versions at all.
Am I missing something? Are the 2 smaller loops really doing a much greater amount of work than the single larger loop? Or did I somehow construct my benchmark script in a misleading or inaccurate way?
I would say all of the above.
The main problem is that you are measuring almost the exact opposite of what you are counting. You are counting only additions and ignoring block activations, and there is nothing wrong with that, IFF all you are interested in is the number of additions and nothing else.
And you are measuring only the cost of block activations and ignoring the cost of additions, which is also totally fine if that is what you are interested in.
The problem is that those two do not match up: you are not measuring what you are counting and you are not counting what you are measuring, so you simply cannot draw any conclusions from the results of your experiments towards your hypothesis.
In one of your comments, you asked:
so does this mean that each iteration of each loop counts as its own operation, in addition to whatever operation(s) happen inside the loop?
We can't tell you that. You need to define what operations you are interested in, and what operations you want to ignore. If you define "operation" to only mean "addition", then no, each iteration of the loop does not count as its own operation, and both of your examples have the exact same amount of operations.
Another problem is that your hypothesis "the number of additions is the same, therefore the execution time is the same" is invalid, because additions are not the only operations that take time. And even if you count other kinds of operations, then your hypothesis still assumes that every operation takes the same amount of time, which is also not true.
There are also some more problems with your benchmarking approach in general, which however are not the source of your confusion. Here are some of the problems with your benchmarks I have identified, although I am sure there others:
Your benchmarks are written in a way that all the operations you are interested about are optimized away, leaving only operations you are not interested about.
Even if they weren't optimized away, their execution time negligible compared to the execution time of the operations you don't care about.
Your benchmarks don't run long and often enough to give the optimizer a chance. For example, the default threshold for when a method is compiled is between 20 and 20000 calls depending on the Ruby implementation, compiler flags, etc. Both your methods are only called twice, once during rehearsal, once during the real thing. You would need to make sure they are called a lot more than 20000 times to ensure that a) they get compiled at all, and b) there are enough iterations after they have been compiled that the slower iterations before they were compiled do not significantly affect the outcome.
I always recommend that someone who wants to write a benchmark read and understand the following mailing list thread:
JMH vs Caliper: reference thread
Especially the sub-thread and following discussion starting at the linked message.
While this thread is about a specific benchmark tool for benchmarking Java code, anything that is discussed in the thread applies to all benchmarking on all modern high-performance language implementations.
There is a reason why benchmarks are written by benchmark engineers who have writing benchmarks as their full-time job: it takes a lot of knowledge and expertise to write benchmarks.
You need, at least
Deep knowledge of computer organization in general.
Deep knowledge of all the specific hardware platforms that you are benchmarking on.
Deep knowledge of programming languages in general.
Deep knowledge of al the specific programming languages you are writing benchmark code in.
Deep knowledge of language implementations in general, including but not limited to ahead-of-time compilers, JIT compilers, interpreters, VMs, garbage collectors, memory allocators, optimizers, inlining, loop unrolling, dead code elimination, constant folding, common subexpression elimination, tail call elimination, peephole optimizations, compile-time evaluation, Polymorphic Inline Caching, and lots more.
Deep knowledge of the specific language implementations you are running your code on.
And more, e.g. operating systems, scheduling, NUMA, multi-threading, SMT, CMT, …
And when you have all of that, you are presented with a bunch of numbers you need to know how to interpret, which requires deep knowledge of statistics.
The benchmark library in the Ruby stdlib is an example of many of those "sins". It was perfectly fine 25 years ago, when there was only one single Ruby implementation which was simply a dumb AST-walking interpreter with no optimizations whatsoever, and computers had one CPU with no out-of-order execution or speculation, and that was it. But for current times, where we have a multitude of aggressively optimizing Ruby implementations (most prominently TruffleRuby) and complex CPUs that perform their own optimizations as well, it just doesn't cut it anymore.
Unfortunately, there is no benchmark tool of comparable sophistication to what's available e.g. in the Java world, but there are at least some alternatives such as better-benchmark (no longer maintained), benchmark-ips (by Evan Phoenix, founder of Rubinius), or fruity (by Marc-André Lafortune, ruby-core team member).
This is 10 million iterations, and in each iteration two calculations are done, for a total of 30 million of what we'll call operations:
N.times do
foo = 1+1
bar = 2+2
end
This is 20 million iterations, and in each iteration one calculation is done, for a total of 40 million of what we'll call operations:
N.times do
foo = 1+1
end
N.times do
bar = 2+2
end
30 < 40, therefore the first example is faster.
I'm trying to process a large list of numbers:
require 'benchmark'
N = 999999
Benchmark.bm 10 do |bm|
bm.report 'Eager:' do
(0..N).select(&:even?).map{|x| x * x}.reduce(&:+)
end
bm.report 'Lazy:' do
(0..N).lazy.select(&:even?).map{|x| x * x}.reduce(&:+)
end
end;
To my understanding, the lazy version should be much faster, because the eager version needs to allocate two lists of half a million items each(one for select and one for map) while the lazy version is streaming everything.
However, when I run it, the lazy version takes more than twice as long as the eager one! (http://rextester.com/OTEX7399)
user system total real
Eager: 0.210000 0.010000 0.220000 ( 0.216572)
Lazy: 0.580000 0.000000 0.580000 ( 0.635091)
How can it be?
I'd say an Enumerator is a far slower middle man than memory is.
This was also reported a while back and Ruby core team member Yusuke Endoh said:
Enumerator::Lazy is not a silver bullet; it removes the overhead for
creating an intermediate array, but brings the drawback for calling
a block. Unfortunately, the latter is much bigger than the former.
Thus, in general, Lazy does bring performance drawback.
An analogy I just thought of: Imagine you're building some furniture for a friend.
Non-lazy: You build the whole thing, rent a truck, and drive it to your friend.
Lazy: You build a little piece and drive it to your friend with your car. You build the next little piece and drive it to your friend with your car. You build the next little piece and drive it to your friend with your car. And so on.
Yes, renting that truck is extra overhead, but it's nothing compared to driving over and over again with your car.
The real reason that being lazy can save time is that after the first few pieces your friend finds out that you slept with his wife so now you're no longer friends and he doesn't want your stupid furniture anymore and you're not building the remaining pieces at all.
function w=oja(X, varargin)
% get the dimensionality
[m n] = size(X);
% random initial weights
w = randn(m,1);
options = struct( ...
'rate', .00005, ...
'niter', 5000, ...
'delta', .0001);
options = getopt(options, varargin);
success = 0;
% run through all input samples
for iter = 1:options.niter
y = w'*X;
for ii = 1:n
% y is a scalar, not a vector
w = w + options.rate*(y(ii)*X(:,ii) - y(ii)^2*w);
end
end
if (any(~isfinite(w)))
warning('Lost convergence; lower learning rate?');
end
end
size(X)= 400 153600
This code implements oja's rule and runs slow. I am not able to vectorize it any more. To make it run faster I wanted to do computations on the GPU, therefore I changed
X=gpuArray(X)
But the code instead ran slower. The computation used seems to be compatible with GPU. Please let me know my mistake.
Profile Code Output:
Complete details:
https://drive.google.com/file/d/0B16PrXUjs69zRjFhSHhOSTI5RzQ/view?usp=sharing
This is not a full answer on how to solve it, but more an explanation why GPUs does not speed up, but actually enormously slow down your code.
GPUs are fantastic to speed up code that is parallel, meaning that they can do A LOT of things at the same time (i.e. my GPU can do 30070 things at the same time, while a modern CPU cant go over 16). However, GPU processors are very slow! Nowadays a decent CPU has around 2~3Ghz speed while a modern GPU has 700Mhz. This means that a CPU is much faster than a GPU, but as GPUs can do lots of things at the same time they can win overall.
Once I saw it explained as: What do you prefer, A million dollar sports car or a scooter? A million dolar car or a thousand scooters? And what if your job is to deliver pizza? Hopefully you answered a thousand scooters for this last one (unless you are a scooter fan and you answered the scooters in all of them, but that's not the point). (source and good introduction to GPU)
Back to your code: your code is incredibly sequential. Every inner iteration depends in the previous one and the same with the outer iteration. You can not run 2 of these in parallel, as you need the result from one iteration to run the next one. This means that you will not get a pizza order until you have delivered the last one, thus what you want is to deliver 1 by 1, as fast as you can (so sports car is better!).
And actually, each of these 1 line equations is incredibly fast! If I run 50 of them in my computer I get 13.034 seconds on that line which is 1.69 microseconds per iteration (7680000 calls).
Thus your problem is not that your code is slow, is that you call it A LOT of times. The GPU will not accelerate this line of code, because it is already very fast, and we know that CPUs are faster than GPUs for these kind of things.
Thus, unfortunately, GPUs suck for sequential code and your code is very sequential, therefore you can not use GPUs to speed up. An HPC will neither help, because every loop iteration depends in the previous one (no parfor :( ).
So, as far I can say, you will need to deal with it.
So this morning I decided to play around with Benchmarking for the first time.
I was curious about the speed different between code with "do-end" block formatting vs. "{ }" formatting.
So I stored the Benchmark code in a Proc so I could call it multiple times consecutively:
n = 100_000_000
bmp = Proc.new do
Benchmark.bm do |x|
x.report {n.times {a = "1"}}
x.report {n.times do; a = "1"; end}
end
end
My results we're expected when I ran it once.
>> bmp.call
user system total real
1.840000 0.030000 1.870000 ( 1.874507)
1.860000 0.050000 1.910000 ( 1.926101)
=> true
But then ran it again.
>> bmp.call
user system total real
1.870000 0.050000 1.920000 ( 1.922810)
1.840000 0.000000 1.840000 ( 1.850615)
To me this looks like the exact opposite of what I'm expecting. I am familiar with the concept of Branch Prediction. Is this a classic example of Branch Prediction? If not, what? Is there anyway to prevent any inaccuracies like this (if this is even considered one)?
EDIT: I did run this code over 30 times, after some suggestions. Frequently it would alternate between the two results. The sample of the data is found here:
gist.github.com/TheLarkInn/5599676
First of all, your benchmark is utterly pointless. The difference between the do / end syntax and the { / } syntax is just that: syntax. There is no semantic difference. Ergo, there cannot possibly be any runtime performance difference whatsoever between the two. It's just not logically possible. You don't need to benchmark it.
The only performance difference that could exist, is that one takes longer to parse than the other. However, none of the two is harder to parse than the other. The only difference is precedence. Therefore, there very likely isn't any performance difference in parsing, either.
And even if there were a performance difference in parsing, your benchmark wouldn't show it. You are using a benchmark written in Ruby, but in order to run Ruby code, the Ruby execution engine has to parse it first, which means that parsing will already have happened, before your benchmark even starts. So, even if your benchmark weren't pointless, it would still be useless, since it cannot possibly measure the performance difference in parsing.
As to your question about Branch Prediction: there are no branches in your code, there is nothing to predict.
BTW: even if your benchmark was intended for a different purpose, it still wouldn't be measuring anything, since at least the more advanced Ruby implementations would recognize that your blocks are essentially no-ops and simply optimize them away. And even if they aren't optimized away, all they are measuring is memory allocator performance (allocating a couple hundred megabytes of tiny String objects), not the performance of blocks.
Just a quick primer on stats:
I'm not sure if two runs is enough to spot a trend. What if there was a difference in system load between the two test blocks the second time you ran it?
A rule of thumb for determining a statistical difference between two samples is that 30 or more data points will give you a statistically relevant result.
I'd run your tests at least that many times, store the results for the two versions independently, and then compare them internally to ensure they're consistent, before comparing the two sets to one-another.
It could be that your initial premise is incorrect :)
I just got a gig to help speed up a program in R by improving the efficiency of the algorithms used to calculate data. There are many loops that do different calculations, and I'm wondering which loops end up using the most resources. I want to know how can I count the amount of time it takes for a loop to completely finish. I can use that information to figure out which algorithms to optimize, or even to write a C extension that will handle the calculations.
You can use:
Sys.time() or system.time()
The rbenchmark package
The microbenchmark package
Or a profiler (e.g. ?RProf)
I use Rprof to tell where to look.
It generates a file of stack samples, and I just look at a small number of those, like 10, chosen randomly.
Or I just make the time between samples large enough so I don't get too many samples to begin with.
There are 2 reasons this works.
1) By actually examining individual stack samples, with your own eyes, you can see problems that simple statistics don't expose, because by looking at the stack, you can see the reasons why things are being done. That tells you if you could get rid of it, and that's the essential information.
2) If you see such a pattern of activity that you could improve, you only have to see it on more than one sample to know it's worth fixing.
All the extra samples, if they mean you cannot do (1), are actually detrimental.
here is an example of using benchmark from another SO questions which compared using tapply vs by vs data.table: Edited as per on comments
library(rbenchmark)
# Different tests being compared
benchmark( using.tapply = tapply(x[, 1], x[, "f"], mean),
using.by = by(x[, 1], x[, "f"], mean),
using.dtable = dt[,mean(col1),by=key(dt)]),
# Number of reps. How results are.
replications = 250, order = "relative"
)
#------------------------#
# RESULTS #
#------------------------#
# COMPARING data.table VS tapply VS by #
#------------------------------------------#
# test elapsed relative
# 2 using.dtable 0.168 1.000
# 1 using.tapply 2.396 14.262
# 3 using.by 8.566 50.988