Speed differences between proc, Proc.new, lambda, and stabby lambda - ruby

Procs and lambdas differ with respect to method scoping and the effect of the return keyword. I am rather interested in the performance differences between them. I wrote a test as shown below:
def time(&block)
  start = Time.now
  block.call
  p "that took #{Time.now - start}"
end
def test(proc)
  time{(0..10000000).each{|n| proc.call(n)}}
end
def test_block(&block)
  time{(0..10000000).each{|n| block.call(n)}}
end
def method_test
  time{(1..10000000).each{|n| my_method(n)}}
end
proc1 = Proc.new{|x| x*x}
proc2 = proc{|x| x*x}
lam1 = lambda{|x| x*x}
lam2 = ->x{x*x}
def my_method(x)
  x*x
end
test(proc1)
test(proc2)
test(lam1)
test(lam2)
test_block{|x| x*x}
test(method(:my_method))
method_test
The result of this code is shown below.
"that took 0.988388739"
"that took 0.963193172"
"that took 0.943111226"
"that took 0.950506263"
"that took 0.960760843"
"that took 1.090146951"
"that took 0.644500627"
method(:my_method) is the slowest, which is because it checks a lookup table for each iteration in the loop.
Similarly, another test as below:
def test2(&block)
  time{(0..1000000).each{block.call}}
end
test2{Proc.new{|x| x*x}}
test2{proc{|x| x*x}}
test2{lambda{|x| x*x}}
test2{->(x){x*x}}
returns this result:
"that took 0.415290453"
"that took 0.378787963"
"that took 0.3888118"
"that took 0.391414639"
Proc.new is the slowest creation method, which is because we have the overhead of creating an entire object to wrap our proc.
I assert that the execution times of the procs and lambdas are the same as one another, regardless of how they were created.
Why is normal method invocation so much faster than procs and lambdas (1/3 time reduction)?
Do people think this is likely to change with different block functions etc.?
Are there any other (performance based) reasons to choose between the different approaches?

So it seems you have three questions. The middle one is unclear to me, so I will address the other two:
Why is normal method invocation so much faster?
This is the easier of the questions.
First realize that the times involved here are for function call overhead. I did my own timings based on your code (but with an identity function instead of multiplication), and non-direct invocations took 49% longer. With one multiplication, non-direct invocations took only 43% longer. In other words, one reason why you're seeing a large disparity is that your function itself is doing almost nothing. Even a single multiplication makes 6% of the difference "vanish". In a method of any reasonable complexity, the method call overhead is usually going to be a relatively small percentage of the overall time.
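A minimal sketch of the kind of comparison described here, assuming one just wants to see how much of the gap is call overhead (labels and counts are illustrative, not the answerer's actual harness):
require 'benchmark'

iterations = 10_000_000
identity = proc { |x| x }
square   = proc { |x| x * x }

Benchmark.bm(24) do |bm|
  bm.report("proc, identity")         { iterations.times { |n| identity.call(n) } }
  bm.report("proc, multiplication")   { iterations.times { |n| square.call(n) } }
  bm.report("inline, identity")       { iterations.times { |n| n } }
  bm.report("inline, multiplication") { iterations.times { |n| n * n } }
end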
Next, remember that a proc/block/lambda is essentially a chunk of code that is being carried around (though a block literal cannot be saved into a variable). This implies one more level of indirection than a method call...meaning that at the very least the CPU is going to have to traverse a pointer to something.
Also, remember that Ruby supports closures, and I'm betting there is some overhead in deciding which environment the indirect code should run in.
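For illustration, here is what "carrying its environment around" means in practice:
# The lambda closes over `count`; every call has to run against that
# captured environment, which is part of the extra indirection.
def make_counter
  count = 0
  -> { count += 1 }
end

counter = make_counter
counter.call  # => 1
counter.call  # => 2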
On my machine, running a C program that invokes a function directly has 10% less overhead than one that uses a pointer to a function. An interpreted language like Ruby, where closures are also involved, is definitely going to use more.
My measurements (Ruby 2.2) indicate a direct method invocation takes about as long as 6 multiplications, and an indirect invocation takes about as long as 10.
So while the overhead is nearly twice as large, remember that the overhead in both cases is often relatively small.
Are there any other (performance based) reasons to choose between the different approaches?
I'd say given the above data the answer is usually no: you're much better off using the construct that gives you the most maintainable code.
There are definitely good reasons to choose one over the other. To be honest, I'm surprised about the difference I see between lambdas and blocks (on my machine, lambdas have 20% less overhead). Lambdas create anonymous methods that include parameter list checking, so if anything I would expect them to be slightly slower, but my measurements put them ahead.
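For reference, the parameter list checking mentioned above looks like this in practice:
strict  = ->(x, y) { [x, y] }
lenient = proc { |x, y| [x, y] }

lenient.call(1)        # => [1, nil]   missing arguments become nil
lenient.call(1, 2, 3)  # => [1, 2]     extra arguments are dropped
strict.call(1)         # raises ArgumentError: wrong number of arguments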
That the direct invocation is faster simply isn't surprising at all.
The place where this kind of thing makes a difference is in very frequently called code where the overhead adds up to be noticeable in wall-clock kinds of ways. In this case, it can make sense to do all manner of ugly optimizations to try to squeeze out a bit more speed, all the way up to inlining code (dodging the function-call overhead altogether).
For example, say your application uses a database framework. You profile it and find a frequently-called, small (e.g. less than 20 multiplications worth of work) function that copies the data from the database result into data structures. Such a function might comprise the lion's share of the result processing, and simply inlining that function might shave off significant amounts of CPU time at the expense of some code clarity.
Simply inlining your "square function" into a long numeric calculation with a billion steps could save you dramatic amounts of time because the operation itself takes a lot less time than even a direct method invocation.
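A contrived sketch of that trade-off, assuming a tiny square method in a hot loop (names and counts are illustrative):
require 'benchmark'

def square(x)
  x * x
end

iterations = 100_000_000

Benchmark.bm(12) do |bm|
  # calls a tiny method once per iteration
  bm.report("method call") { sum = 0; iterations.times { |n| sum += square(n) }; sum }
  # same work with the multiplication inlined, dodging the call overhead
  bm.report("inlined")     { sum = 0; iterations.times { |n| sum += n * n }; sum }
end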
In most cases, though, you're better off with clean, clear code.

Related

Why are these two smaller loops so much slower than a single loop containing the same instructions?

Ruby v2.3.3p222
MacOS Catalina v10.15.3
Processor: 2.6GHz 6-Core Intel Core i7
I have the following performance benchmark script, which was designed to test the difference between one larger loop operation vs. two smaller loops:
require 'benchmark'
N = 10_000_000
def one_loop
  N.times do
    foo = 1+1
    bar = 2+2
  end
end
def two_loops
  N.times do
    foo = 1+1
  end
  N.times do
    bar = 2+2
  end
end
Benchmark.bmbm do |performance|
  performance.report("two smaller loops") { two_loops }
  performance.report("one large loop") { one_loop }
end
My hypothesis was that these two methods would execute in approximately the same amount of time, because (I thought) they were both executing the same number of instructions: the larger loop was doing 2 * 10,000,000 operations, while each of the 2 smaller loops was doing 1 * 10,000,000 operations.
However, that doesn't appear to be what I observed. When I run the script, I get the following output:
Rehearsal -----------------------------------------------------
two smaller loops 0.840000 0.000000 0.840000 ( 0.838101)
one large loop 0.500000 0.010000 0.510000 ( 0.506283)
-------------------------------------------- total: 1.350000sec
user system total real
two smaller loops 0.850000 0.000000 0.850000 ( 0.863052)
one large loop 0.500000 0.000000 0.500000 ( 0.494525)
This is really disappointing, because I was hoping to convince my team that we wouldn't see any reduced performance by splitting up our 1 large loop of code into several more concise loops which each did one thing and did it well.
I thought it might be due to the order in which the reports are generated, but when I reverse the order of the two calls to performance.report, I get similarly disappointing results:
Rehearsal -----------------------------------------------------
one large loop 0.500000 0.010000 0.510000 ( 0.508246)
two smaller loops 0.850000 0.000000 0.850000 ( 0.852467)
-------------------------------------------- total: 1.360000sec
user system total real
one large loop 0.490000 0.000000 0.490000 ( 0.496130)
two smaller loops 0.830000 0.000000 0.830000 ( 0.831476)
Am I missing something? Are the 2 smaller loops really doing a much greater amount of work than the single larger loop? Or did I somehow construct my benchmark script in a misleading or inaccurate way?
the larger loop was doing 2 * 10,000,000 operations, while each of the 2 smaller loops was doing 1 * 10,000,000 operations
It doesn't make sense to talk about "operations" without defining the machine model and cost model within which we model those "operations". Or, to put it simply: it doesn't make sense to count things until you are clear what you are counting.
In this case, you are counting additions. And you are correct: within your model, which counts only additions, both versions have the same number of additions.
They do not, however, have the same number of block activations.
Remember, Integer#times looks roughly like this:
class Integer
  def times
    return enum_for(__callee__) unless block_given?
    return self unless positive?
    i = -1
    yield i while (i += 1) < self
    self
  end
end
So, for each iteration of the loop, there is an activation (i.e. a yield) of the block passed to Integer#times.
If we add that as a new class of "operation", we have the following:
one_loop: 20,000,000 additions and 10,000,000 block activations
two_loops: 20,000,000 additions and 20,000,000 block activations
So, both methods have the same number of additions but two_loops has twice the number of block activations.
Which means, we also have to take into account the relative costs of additions versus block activations. Now, semantically, an addition is just a normal method call. And activating a block is sort-of similar to a method call.
So, we would expect that an addition and a block activation have roughly similar cost, which means our cost would be:
one_loop: 30,000,000 "method call like operations"
two_loops: 40,000,000 "method call like operations"
In other words, we would expect two_loops to be 33% slower or one_loop to be 25% faster, depending on how you look at it.
However, we actually find that the difference is much bigger, so clearly we are missing something in our model.
What we are missing are optimizations. Arithmetic operations on integers are extremely common and extremely performance-critical, and so all Ruby implementations go to great lengths to make them fast. In fact, on all Ruby implementations, simple additions such as the one you are using will be mapped directly to single CPU ADD instructions, and will not incur the overhead of a method call at all.
Block activations are also very important in Ruby, and thus they are also heavily optimized, but they are just fundamentally several orders of magnitude more complex than adding two machine-word integers.
In fact, the relative complexity of block activations to machine-word integer additions is so big, that we can actually ignore the additions completely in our model:
one_loop: 10,000,000 block activations
two_loops: 20,000,000 block activations
This gives us a factor of 2:1, so we would expect two_loops to be 100% slower or one_loop to be 50% faster.
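One rough way to sanity-check that the block activations really dominate (a sketch, not a rigorous benchmark) is to compare an empty-block Integer#times loop against a bare while loop over the same count:
require 'benchmark'

n = 10_000_000

Benchmark.bm(18) do |bm|
  bm.report("times, empty block") { n.times { } }                 # n block activations
  bm.report("while, no block")    { i = 0; i += 1 while i < n }   # n additions, no yields
end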
By the way, I ignored another operation that is happening: the definition and initialization of the local variables. The argument is similar: that is an operation that is so fast that it is negligible compared to the block activation.
Actually, so far, we have only talked about the relative cost of those operations and how they mean that we can ignore the cost of the additions and the local variables. There is, however, an even stronger reason to ignore those: optimizations.
Even the simplest Ruby implementation will be able to optimize away the local variables completely: they are defined and initialized in only one place, they are never accessed again, and they exist only within the scope of the block for the duration of a single activation. Even a very simple optimizer can see that they are completely useless and will optimize the code to something roughly like this:
def one_loop
  N.times do
    1+1
    2+2
  end
end
def two_loops
  N.times do
    1+1
  end
  N.times do
    2+2
  end
end
Meaning that we can not only ignore the cost of the local variables because it is small compared to the other costs, but actually, the local variables don't even exist.
Likewise, a slightly smarter optimizer will recognize that the first addition in one_loop has no side-effects, is not returned, is not stored in a variable (or at least not in one that is used anywhere), and in general does not influence the outcome of the computation in any way, shape, or form, and thus will optimize the code to this:
def one_loop
  N.times do
    2+2
  end
end
def two_loops
  N.times do
    1+1
  end
  N.times do
    2+2
  end
end
Furthermore, the same argument actually applies to the remaining addition. It has no side effect; all it does is get returned from the block, and Integer#times ignores the return value of the block. I haven't taken a look at the generated code, but I strongly suspect that even the most stupid optimizer can easily prove that your blocks are no-ops, and thus it will optimize the code to something roughly like this:
def one_loop
  N.times do
  end
end
def two_loops
  N.times do
  end
  N.times do
  end
end
Which means that one_loop has N iterations of the block, two_loops has 2 * N iterations, and thus should take roughly twice as long.
Now, we can see in your benchmarks that the numbers are not actually 2:1. They are 1.75:1 or roughly 7:4.
I can confirm those results on my machine, here with YARV 2.7.1 without JIT, I get almost exactly 7:4:
user system total real
two smaller loops 0.711479 0.000099 0.711578 ( 0.711680)
one large loop 0.401808 0.000059 0.401867 ( 0.401916)
However, when I turn on the JIT, I get almost exactly the 2:1 we would expect:
user system total real
two smaller loops 0.587017 0.000279 0.587296 ( 0.587098)
one large loop 0.291713 0.000062 0.291775 ( 0.291779)
You will also notice that it is faster in general.
With JRuby 9.2.9.0, we get again slightly faster execution and almost 2:1:
user system total real
two smaller loops 0.740000 0.010000 0.750000 ( 0.517670)
one large loop 0.260000 0.000000 0.260000 ( 0.263270)
This was with default options, here are the results with some more aggressive compiler flags:
user system total real
two smaller loops 0.370000 0.000000 0.370000 ( 0.362050)
one large loop 0.390000 0.010000 0.400000 ( 0.213861)
TruffleRuby 20.1.0 is again much faster than even JRuby:
user system total real
two smaller loops 0.009955 0.000039 0.009994 ( 0.010035)
one large loop 0.004759 0.000007 0.004766 ( 0.004742)
And again, very close to 2:1. Also, even though we are only interested in the relative performance of the two methods, it is quite pleasing to see that TruffleRuby is 70x-100x faster than YARV on this benchmark!
Actually, I am somewhat surprised that TruffleRuby could not prove that Integer#times with an empty block body is a no-op. I would have expected it to be able to optimize the code like this:
def one_loop
end
def two_loops
end
And thus have no runtime difference between the two versions at all.
Am I missing something? Are the 2 smaller loops really doing a much greater amount of work than the single larger loop? Or did I somehow construct my benchmark script in a misleading or inaccurate way?
I would say all of the above.
The main problem is that you are measuring almost the exact opposite of what you are counting. You are counting only additions and ignoring block activations, and there is nothing wrong with that, IFF all you are interested in is the number of additions and nothing else.
And you are measuring only the cost of block activations and ignoring the cost of additions, which is also totally fine if that is what you are interested in.
The problem is that those two do not match up: you are not measuring what you are counting and you are not counting what you are measuring, so you simply cannot draw any conclusions from the results of your experiments towards your hypothesis.
In one of your comments, you asked:
so does this mean that each iteration of each loop counts as its own operation, in addition to whatever operation(s) happen inside the loop?
We can't tell you that. You need to define what operations you are interested in, and what operations you want to ignore. If you define "operation" to only mean "addition", then no, each iteration of the loop does not count as its own operation, and both of your examples have the exact same amount of operations.
Another problem is that your hypothesis "the number of additions is the same, therefore the execution time is the same" is invalid, because additions are not the only operations that take time. And even if you count other kinds of operations, then your hypothesis still assumes that every operation takes the same amount of time, which is also not true.
There are also some more problems with your benchmarking approach in general, which however are not the source of your confusion. Here are some of the problems with your benchmarks I have identified, although I am sure there are others:
Your benchmarks are written in a way that all the operations you are interested about are optimized away, leaving only operations you are not interested about.
Even if they weren't optimized away, their execution time is negligible compared to the execution time of the operations you don't care about.
Your benchmarks don't run long and often enough to give the optimizer a chance. For example, the default threshold for when a method is compiled is between 20 and 20000 calls depending on the Ruby implementation, compiler flags, etc. Both your methods are only called twice, once during rehearsal, once during the real thing. You would need to make sure they are called a lot more than 20000 times to ensure that a) they get compiled at all, and b) there are enough iterations after they have been compiled that the slower iterations before they were compiled do not significantly affect the outcome.
I always recommend that someone who wants to write a benchmark read and understand the following mailing list thread:
JMH vs Caliper: reference thread
Especially the sub-thread and following discussion starting at the linked message.
While this thread is about a specific benchmark tool for benchmarking Java code, anything that is discussed in the thread applies to all benchmarking on all modern high-performance language implementations.
There is a reason why benchmarks are written by benchmark engineers who have writing benchmarks as their full-time job: it takes a lot of knowledge and expertise to write benchmarks.
You need, at least
Deep knowledge of computer organization in general.
Deep knowledge of all the specific hardware platforms that you are benchmarking on.
Deep knowledge of programming languages in general.
Deep knowledge of all the specific programming languages you are writing benchmark code in.
Deep knowledge of language implementations in general, including but not limited to ahead-of-time compilers, JIT compilers, interpreters, VMs, garbage collectors, memory allocators, optimizers, inlining, loop unrolling, dead code elimination, constant folding, common subexpression elimination, tail call elimination, peephole optimizations, compile-time evaluation, Polymorphic Inline Caching, and lots more.
Deep knowledge of the specific language implementations you are running your code on.
And more, e.g. operating systems, scheduling, NUMA, multi-threading, SMT, CMT, …
And when you have all of that, you are presented with a bunch of numbers you need to know how to interpret, which requires deep knowledge of statistics.
The benchmark library in the Ruby stdlib is an example of many of those "sins". It was perfectly fine 25 years ago, when there was only one single Ruby implementation which was simply a dumb AST-walking interpreter with no optimizations whatsoever, and computers had one CPU with no out-of-order execution or speculation, and that was it. But for current times, where we have a multitude of aggressively optimizing Ruby implementations (most prominently TruffleRuby) and complex CPUs that perform their own optimizations as well, it just doesn't cut it anymore.
Unfortunately, there is no benchmark tool of comparable sophistication to what's available e.g. in the Java world, but there are at least some alternatives such as better-benchmark (no longer maintained), benchmark-ips (by Evan Phoenix, founder of Rubinius), or fruity (by Marc-André Lafortune, ruby-core team member).
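For example, a rough benchmark-ips version of the loop comparison above might look something like this (the warmup and measurement times here are arbitrary, and the gem must be installed separately):
require 'benchmark/ips'   # gem install benchmark-ips

N = 10_000_000

def one_loop
  N.times { foo = 1 + 1; bar = 2 + 2 }
end

def two_loops
  N.times { foo = 1 + 1 }
  N.times { bar = 2 + 2 }
end

Benchmark.ips do |x|
  x.config(warmup: 2, time: 5)          # give the implementation time to warm up
  x.report("one large loop")    { one_loop }
  x.report("two smaller loops") { two_loops }
  x.compare!                            # prints the ratio between the two
end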
This is 10 million iterations, and in each iteration two calculations are done; counting each iteration and each calculation as one operation, that is a total of 30 million of what we'll call operations:
N.times do
  foo = 1+1
  bar = 2+2
end
This is 20 million iterations, and in each iteration one calculation is done; counted the same way, that is a total of 40 million of what we'll call operations:
N.times do
  foo = 1+1
end
N.times do
  bar = 2+2
end
30 < 40, therefore the first example is faster.

How does JIT optimize branching while processing elements of collections? (in Scala)

This is a question about performance of code written in Scala.
Consider the following two code snippets, assume that x is some collection containing ~50 million elements:
def process(x: Traversable[T]) = {
  processFirst x.head
  x reduce processPair
  processLast x.last
}
Versus something like this (assume for now we have some way to determine if we're operating on the first element versus the last element):
def isFirstElement[T](x: T) = ???
def isLastElement[T](x: T) = ???
def process(x: Traversable[T]) = {
  x reduce {
    (left, right) =>
      if (isFirstElement(left))
        processFirst(left)
      else if (isLastElement(right))
        processLast(right)
      processPair(left, right)
  }
}
Which approach is faster? and for ~50 million elements, how much faster?
It seems to me that the first example would be faster because there are fewer conditional checks occurring for all but the first and last elements. However for the latter example there is some argument to suggest that the JIT might be clever enough to optimize away those additional head/last conditional checks that would otherwise occur for all but the first/last elements.
Is the JIT clever enough to perform such optimizations? The obvious advantage of the latter approach is that all the business logic can be placed in the same function body, while in the former case the business logic must be partitioned into three separate function bodies invoked separately.
** EDIT **
Thanks for all the great responses. While I am leaving the second code snippet above to illustrate its incorrectness, I want to revise the first approach slightly to reflect better the problem I am attempting to solve:
// x is some iterator
def process(x: Iterator[T]) = {
  if (x.hasNext)
  {
    var previous = x.next
    var current = null
    processFirst previous
    while(x.hasNext)
    {
      current = x.next
      processPair(previous, current)
      previous = current
    }
    processLast previous
  }
}
While there are no additional checks occurring in the body, there is an additional reference assignment that appears to be unavoidable (previous = current). This is also a much more imperative approach that relies on nullable mutable variables. Implementing this in a functional yet high performance manner would be another exercise for another question.
How does this code snippet stack up against the second of the two examples above (the single-iteration block approach containing all the branches)? The other thing I realize is that the latter of those two examples is also broken on collections containing fewer than two elements.
If your underlying collection has an inexpensive head and last method (not true for a generic Traversable), and the reduction operations are relatively inexpensive, then the second way takes about 10% longer (maybe a little less) than the first on my machine. (You can use a var to get first, and you can keep updating a second var with the right argument to obtain last, and then do the final operation outside of the loop.)
If you have an expensive last (i.e. you have to traverse the whole collection), then the first operation takes about 10% longer (maybe a little more).
Mostly you shouldn't worry too much about it and instead worry more about correctness. For instance, in a 2-element list your second code has a bug (because there is an else instead of a separate test). In a 1-element list, the second code never calls reduce's lambda at all, so again fails to work.
This argues that you should do it the first way unless you're sure last is really expensive in your case.
Edit: if you switch to a manual reduce-like-operation using an iterator, you might be able to shave off up to about 40% of your time compared to the expensive-last case (e.g. list). For inexpensive last, probably not so much (up to ~20%). (I get these values when operating on lengths of strings, for example.)
First of all, note that, depending on the concrete implementation of Traversable, doing something like x.last may be really expensive. Like, more expensive than all the rest of what's going on here.
Second, I doubt the cost of conditionals themselves is going to be noticeable, even on a 50 million collection, but actually figuring out whether a given element is the first or the last, might again, depending on implementation, get pricey.
Third, JIT will not be able to optimize the conditionals away: if there was a way to do that, you would have been able to write your implementation without conditionals to begin with.
Finally, if you are at a point where it starts looking like an extra if statement might affect performance, you might consider switching to java or even "C". Don't get me wrong, I love scala, it is a great language, with lots of power and useful features, but being super-fast just isn't one of them.

Ruby Benchmarking Accuracy - Branch Prediction at its finest?

So this morning I decided to play around with Benchmarking for the first time.
I was curious about the speed different between code with "do-end" block formatting vs. "{ }" formatting.
So I stored the Benchmark code in a Proc so I could call it multiple times consecutively:
require 'benchmark'

n = 100_000_000
bmp = Proc.new do
  Benchmark.bm do |x|
    x.report {n.times {a = "1"}}
    x.report {n.times do; a = "1"; end}
  end
end
My results were as expected when I ran it once.
>> bmp.call
user system total real
1.840000 0.030000 1.870000 ( 1.874507)
1.860000 0.050000 1.910000 ( 1.926101)
=> true
But then I ran it again.
>> bmp.call
user system total real
1.870000 0.050000 1.920000 ( 1.922810)
1.840000 0.000000 1.840000 ( 1.850615)
To me this looks like the exact opposite of what I'm expecting. I am familiar with the concept of Branch Prediction. Is this a classic example of Branch Prediction? If not, what is it? Is there any way to prevent inaccuracies like this (if this is even considered one)?
EDIT: I did run this code over 30 times, after some suggestions. Frequently it would alternate between the two results. The sample of the data is found here:
gist.github.com/TheLarkInn/5599676
First of all, your benchmark is utterly pointless. The difference between the do / end syntax and the { / } syntax is just that: syntax. There is no semantic difference. Ergo, there cannot possibly be any runtime performance difference whatsoever between the two. It's just not logically possible. You don't need to benchmark it.
The only performance difference that could exist, is that one takes longer to parse than the other. However, none of the two is harder to parse than the other. The only difference is precedence. Therefore, there very likely isn't any performance difference in parsing, either.
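That precedence difference is easy to demonstrate, but it has nothing to do with runtime speed:
puts [1, 2, 3].map { |n| n * 2 }     # { } binds to map: prints 2, 4, 6
puts [1, 2, 3].map do |n| n * 2 end  # do/end binds to puts: map gets no block,
                                     # so an Enumerator is printed instead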
And even if there were a performance difference in parsing, your benchmark wouldn't show it. You are using a benchmark written in Ruby, but in order to run Ruby code, the Ruby execution engine has to parse it first, which means that parsing will already have happened, before your benchmark even starts. So, even if your benchmark weren't pointless, it would still be useless, since it cannot possibly measure the performance difference in parsing.
As to your question about Branch Prediction: there are no branches in your code, there is nothing to predict.
BTW: even if your benchmark was intended for a different purpose, it still wouldn't be measuring anything, since at least the more advanced Ruby implementations would recognize that your blocks are essentially no-ops and simply optimize them away. And even if they aren't optimized away, all they are measuring is memory allocator performance (allocating a couple hundred megabytes of tiny String objects), not the performance of blocks.
Just a quick primer on stats:
I'm not sure if two runs is enough to spot a trend. What if there was a difference in system load between the two test blocks the second time you ran it?
A rule of thumb for determining a statistical difference between two samples is that 30 or more data points will give you a statistically relevant result.
I'd run your tests at least that many times, store the results for the two versions independently, and then compare them internally to ensure they're consistent, before comparing the two sets to one-another.
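A minimal sketch of that approach, assuming plain Benchmark.realtime samples are good enough for a first look (n is reduced here so the sketch finishes quickly):
require 'benchmark'

n = 10_000_000
samples = { braces: [], do_end: [] }

30.times do
  samples[:braces] << Benchmark.realtime { n.times { a = "1" } }
  samples[:do_end] << Benchmark.realtime { n.times do; a = "1"; end }
end

samples.each do |label, times|
  mean = times.sum / times.size
  puts format("%-7s mean %.4fs  min %.4fs  max %.4fs", label, mean, times.min, times.max)
end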
It could be that your initial premise is incorrect :)

Ruby performance in .each loops

Consider the following two pieces of Ruby code
Example 1
name = user.first_name
round_number = rounds.count
users.each do |u|
  puts "#{name} beat #{u.first_name} in round #{round_number}"
end
Example 2
users.each do |u|
  puts "#{user.first_name} beat #{u.first_name} in #{rounds.count}"
end
For both pieces of code imagine
#user.rb
def first_name
  name.split.first
end
So in a classical analysis of algorithms, the first piece of code would be more efficient. However, in most modern compiled languages, the compiler would optimize the second piece of code to look like the first, eliminating the need to hand-optimize code in this manner.
Will ruby optimize or cache values for this code before execution? Should my ruby code look like example 1 or example 2?
Example 1 will run faster, as first_name() is only called once and its value is stored in the variable.
In Example 2, Ruby will not memoize this value automatically, since the value could have changed between iterations of the each() loop.
Therefore, expensive-to-calculate methods should be explicitly memoized if they are expected to be used more than once without the return value changing.
Making use of Ruby's Benchmark Module can be useful when making decisions like this. It will likely only be worth memoizing if there are a lot of values in users, or if first_name() is expensive to calculate.
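For example, a common way to memoize first_name explicitly, assuming name does not change over the object's lifetime:
#user.rb
def first_name
  @first_name ||= name.split.first  # computed once, then reused
end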
A compiler can only perform this optimization if it can prove that the method has no side effects. This is even more difficult in Ruby than most languages, as everything is mutable and can be overridden at runtime. Whether it happens or not is implementation dependent, but since it's hard to do in Ruby, most do not. I actually don't know of any that do at the time of this posting.
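A contrived illustration of why such caching cannot be done automatically in Ruby (the class and data here are made up):
class User
  attr_accessor :name
  def first_name
    name.split.first
  end
end

user = User.new
user.name = "Alice Smith"

3.times do |i|
  user.name = "Bob Jones" if i == 1  # state can change between iterations...
  puts user.first_name               # ...so the result of first_name changes too:
end                                  # prints Alice, Bob, Bob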

Performance difference between iterating once and iterating twice?

Consider something like...
for (int i = 0; i < test.size(); ++i) {
    test[i].foo();
    test[i].bar();
}
Now consider..
for (int i = 0; i < test.size(); ++i) {
    test[i].foo();
}
for (int i = 0; i < test.size(); ++i) {
    test[i].bar();
}
Is there a large difference in time spent between these two? I.e. what is the cost of the actual iteration? It seems like the only real operations you are repeating are an increment and a comparison (though I suppose this would become significant for a very large n). Am I missing something?
First, as noted above, if your compiler can't optimize the size() method out so that it's just called once, or reduce it to nothing more than a single read (no function call overhead), then it will hurt.
There is a second effect you may want to be concerned with, though. If your container size is large enough, then the first case will perform faster. This is because, when it gets to test[i].bar(), test[i] will already be in the cache. The second case, with split loops, will thrash the cache, since test[i] will need to be reloaded from main memory in each of the two loops.
Worse, if your container (std::vector, I'm guessing) has so many items that it won't all fit in memory, and some of it has to live in swap on your disk, then the difference will be huge as you have to load things in from disk twice.
However, there is one final thing that you have to consider: all this only makes a difference if there is no order dependency between the function calls (really, between different objects in the container). Because, if you work it out, the first case does:
test[0].foo();
test[0].bar();
test[1].foo();
test[1].bar();
test[2].foo();
test[2].bar();
// ...
test[test.size()-1].foo();
test[test.size()-1].bar();
while the second does:
test[0].foo();
test[1].foo();
test[2].foo();
// ...
test[test.size()-1].foo();
test[0].bar();
test[1].bar();
test[2].bar();
// ...
test[test.size()-1].bar();
So if your bar() assumes that all foo()'s have run, you will break it if you change the second case to the first. Likewise, if bar() assumes that foo() has not been run on later objects, then moving from the second case to the first will break your code.
So be careful and document what you do.
There are many aspects in such comparison.
First, the complexity of both options is O(n), so the difference isn't very big anyway. I mean, you need not care about it if you are writing a fairly big and complex program with a large n and "heavy" operations .foo() and bar(). So, you only need to care about it for very small, simple programs (the kind of programs written for embedded devices, for example).
Second, it will depend on the programming language and compiler. I'm fairly confident that, for instance, most C++ compilers would optimize your second option to produce the same code as the first one.
Third, if the compiler hasn't optimized your code, the performance difference will heavily depend on the target processor. Consider the loop in terms of assembly instructions; it will look something like this (pseudo assembly language):
LABEL L1:
do this ;; some commands
call that
IF condition
goto L1
;; some more instructions, ELSE part
I.e. every loop pass ends in an IF statement. But modern processors don't like IFs. This is because processors may rearrange instructions to execute them ahead of time or simply to avoid idling. With an IF (in fact, a conditional goto or jump) instruction, a processor does not know whether it may rearrange the operations or not.
There's also a mechanism called branch predictor. From material of Wikipedia:
branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure.
This "soften" effect of IF's, through if the predictor's guess is wrong, no optimization will be performed.
So, you can see that there are a lot of conditions affecting both of your options: target language and compiler, target machine, its processor and branch predictor. This all makes for a very complex system, and you cannot foresee what exact result you will get. I believe that if you don't deal with embedded systems or something like that, the best solution is just to use the form you are more comfortable with.
For your examples you have the additional concern of how expensive .size() is, since it is re-evaluated each time i increments in most languages.
How expensive is it? Well that depends, it's certainly all relative. If .foo() and .bar() are expensive, the cost of the actual iteration is probably minuscule in comparison. If they're pretty lightweight, then it'll be a larger percentage of your execution time. If you want to know about a particular case test it, this is the only way to be sure about your specific scenario.
Personally, I'd go with the single iteration to be on the cheap side for sure (unless you need the .foo() calls to happen before the .bar() calls).
I assume .size() will be constant. Otherwise, the first code example might not give the same as the second one.
Most compilers would probably store .size() in a variable before the loop starts, so the .size() time will be cut down.
Therefore the time of the stuff inside the two for loops will be the same, but the other part will be twice as much.
Performance tag, right.
As long as you are concentrating on the "cost" of this or that minor code segment, you are oblivious to the bigger picture (isolation); and your intention is to justify something that, at a higher level (outside your isolated context), is simply bad practice, and breaks guidelines. The question is too low level and therefore too isolated. A system or program which is a set of integrated components will perform much better than a collection of isolated components.
The fact that this or that isolated component (work inside the loop) is fast or faster is irrelevant when the loop itself is repeated unnecessarily, and which would therefore take twice the time.
Given that you have one family car (CPU), why on Earth would you:
sit at home and send your wife out to do her shopping
wait until she returns
take the car, go out and do your shopping
leaving her to wait until you return
If it needs to be stated: by making one organised trip and doing the shopping together, you would (a) spend almost half of your hard-earned resources and (b) have those resources available to have fun together when you get home.
It has nothing to do with the price of petrol at 9:00 on a Saturday, or the time it takes to grind coffee at the café, or cost of each iteration.
Yes, there is a large diff in the time and the resources used. But the cost is not merely in the overhead per iteration; it is in the overall cost of the one organised trip vs the two serial trips.
Performance is about architecture; never doing anything twice (that you can do once); those are the higher levels of organisation: the integration of the parts that make up the whole. It is not about counting pennies at the bowser or cycles per iteration; those are lower orders of organisation, which concern just a collection of fragmented parts (not a systemic whole).
Maseratis cannot get through traffic jams any faster than station wagons.
