Expression performance of overloaded operators? - performance

(i++) and (i = i + 1)
(i += n) and (i = i + n)
Which is better for performance?

It doesn't matter.
The compiler will convert statements like these into what it considers (and usually is) their most efficient form.
I'd recommend you write statements like this in the same way as the rest of your code base, in order to keep consistency.
If you are just doing your own thing on a personal project, you can either do what you prefer or what is common for your particular language.

It does not matter; the performance is the same. Back in the 1970s, when C was young, these could map to different PDP-11 instructions, making ++ and += faster. These days, however, the operations are optimized into exactly the same sequence of instructions.
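As a quick sanity check, here is a minimal C++ sketch (function names made up for illustration); with optimization enabled, a compiler will typically emit identical machine code for all three bodies, which you can verify by inspecting the assembly output (e.g. with g++ -O2 -S):
// Three spellings of the same increment; under optimization these usually
// compile to the same instruction sequence.
int inc_postfix(int i)  { i++;       return i; }
int inc_compound(int i) { i += 1;    return i; }
int inc_assign(int i)   { i = i + 1; return i; }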

Related

OpenCL, substituting branches with arithmetic operations

The following question is more related to design than to actual coding. I don't know if there's a technical term for such a problem, so I'll proceed with an example.
I have some openCL code not optimized at all, and in the Kernel there's essentially a switch statement similar to the following
switch (value) {   // 'value' stands in for the actual selector expression
    case const_a: do_something_a(...); break;
    case const_b: do_something_b(...); break;
    ... // etc.
}
I cannot write the actual statement here since it is quite long. As a simple example, consider the following switch statement:
int a;
switch (input) {
    case 13: a = 3; break;
    case 1:  a = 7; break;
    case 23: a = 1; break;
    default: { ... }
}
The question is... would it be better to replace such a switch with an expression like
a = (input == 13)*3 + (input == 1)*7 + (input == 23)*1;
?
If it's not, is it possible to make it more efficient anyway?
You can assume input only takes values in the set of cases of the switch statement.
You've discovered an interesting question that GPU compilers wrestle with. The general advice is: try not to branch. Tricks to make that possible are splitting kernels up (as suggested above) and using the preprocessor (program-build-time definitions). Research in GPU algorithm development basically works from this axiom.
Branching all over the place won't get great efficiency because of the inherent divergence (channel = work item within the SIMD thread/warp). Remember that all of these channels must execute together, so in a switch where each one takes a different path, everyone else goes along for the ride, silently waiting for their "case" to execute. Now, if input is always the same value, it can still be a win.
Another popular option is a table indirection.
kernel void foo(global const int *tbl, ...)
{
    ...
    a = tbl[input];
}
This approach has a few problems too, depending on hardware, inputs, and problem size.
Without more specific context, I can conjure up a case where any of these can run well or poorly.
Switching (or big if-then-else chains).
PROS: If all work items generally take the same path (input is mostly the same value), it's going to be efficient. You could also write an if-then-else chain putting the most common cases first. (On GPUs a switch is not necessarily as easy as an indirect jump since there are multiple work items and they may take different paths.)
CONS: Might generate lots of program code and could blow out the instruction cache. Branching all over the place can get a little costly depending on how many cases need to be evaluated. It might just be better to grind through the compute with the predicated code.
Predicated Code (Your (input == 13)*3 ... code).
PROS: This will probably generate smaller programs and stress the I$ (instruction cache) less. (Look up the OpenCL select function to see a more general approach for your case.)
CONS: We've basically punted and decided to evaluate every "case" in the switch. If input is usually the same value, we're wasting time here.
Lookup-table based approaches (my example).
PROS: If the switch you are evaluating has a massive number of cases (branches), but can be indexed by an integer, you might be ahead to just use a lookup table. On some hardware this means a read from global memory (far, far away). Other architectures have a dedicated constant cache, but I understand that a vector lookup will serialize (K cycles for each channel), so it might be only marginally better than the global-memory table. However, the generated table-lookup code will be short (I$ friendly), and as the number of branches (case statements) grows, this will win in the limit. This approach also deals well with uniform or scattered distributions of input's value.
CONS: The read from global memory (or serialized access from the constant cache) has a big latency even compared to branching. In some cases, to eliminate the extra memory traffic, I've seen compilers convert lookup tables into if-then-else/switch chains. It's rare that we have 100-element case statements.
I am now inspired to go study this cutoff. :-)
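To make the trade-off concrete, here is a hedged, host-side C++ sketch of the three strategies for the simple example above (the function names and the fallback value 0 are made up for illustration; in a real OpenCL kernel the table would live in constant or global memory):
// 1. Branching version: cheap when all work items take the same path.
int decode_switch(int input) {
    switch (input) {
        case 13: return 3;
        case 1:  return 7;
        case 23: return 1;
        default: return 0;   // assumed fallback, not part of the original question
    }
}
// 2. Predicated version: every "case" is evaluated, but there are no branches.
int decode_predicated(int input) {
    return (input == 13) * 3 + (input == 1) * 7 + (input == 23) * 1;
}
// 3. Table version: one indexed load; short code, but a memory access.
int decode_table(int input) {
    static const int tbl[32] = {
        /*  0 */ 0, 7, 0, 0, 0, 0, 0, 0,
        /*  8 */ 0, 0, 0, 0, 0, 3, 0, 0,
        /* 16 */ 0, 0, 0, 0, 0, 0, 0, 1,
        /* 24 */ 0, 0, 0, 0, 0, 0, 0, 0
    };
    return tbl[input & 31];   // mask keeps the index in range for this sketch
}
Which of the three wins depends on the input distribution, the number of cases, and the hardware, exactly as the pros and cons above describe.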

Is it worth it to rewrite an if statement to avoid branching?

Recently I realized I have been doing too much branching without considering its negative impact on performance, so I have made up my mind to learn all about avoiding branches. Here is a more extreme case, an attempt to make the code have as few branches as possible.
Hence for the code
if (expression)
    A = C; // A and C have to be the same type here, obviously
expression can be A == B or Q <= B; it can be anything that resolves to true or false, and I like to think of the result as being 1 or 0 here.
I have come up with this non-branching version:
A += (expression)*(C-A); //Edited with thanks
So my question is: is this a good solution that maximizes efficiency?
If so, why, and if not, why not?
Depends on the compiler, instruction set, optimizer, etc. When you use a boolean expression as an int value, e.g., (A == B) * C, the compiler has to do the compare and then set some register to 0 or 1 based on the result. Some instruction sets might not have any way to do that other than branching. Generally speaking, it's better to write simple, straightforward code and let the optimizer figure it out, or to find a different algorithm that branches less.
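To illustrate the "simple, straightforward code" recommendation, here is a hedged C++ sketch: written plainly as a conditional assignment, many optimizing compilers will emit a branch-free conditional move (cmov on x86) for this on their own, with no manual arithmetic tricks:
// The plain conditional form recommended above; optimizers frequently turn
// this into a conditional move rather than a branch.
int assign_if_equal(int a, int b, int c) {
    if (a == b)
        a = c;
    return a;
}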
Jeez, no, don't do that!
Anyone who "penalize[s] [you] a lot for branching" would hopefully send you packing for using something that awful.
How is it awful, let me count the ways:
There's no guarantee you can multiply a quantity (e.g., C) by a boolean value (e.g., (A==B) yields true or false). Some languages will, some won't.
Anyone casually reading it is going to see a calculation, not a conditional assignment.
You're replacing a comparison and a conditional branch with comparisons, multiplications, a subtraction, and an addition. Seriously non-optimal.
It only works for integral numeric quantities. Try this with a wide variety of floating point numbers, or with an object, and if you're really lucky it will be rejected by the compiler/interpreter/whatever.
You should only ever consider doing this if you have analyzed the runtime behavior of the program and determined that there is a frequent branch misprediction here, and that it is causing an actual performance problem. It makes the code much less clear, and it's not obvious that it would be any faster in general (this is something you would also have to measure, under the circumstances you are interested in).
After doing some research, I came to the conclusion that when there is a bottleneck, it is worth adding a timed profiler, as this kind of code is usually not portable and is mainly used for optimization.
A concrete example I tried, after reading the question below:
Why is it faster to process a sorted array than an unsorted array?
I tested my code in C++ using that setup and found that my implementation was actually slower, due to the extra arithmetic.
HOWEVER!
For this case below
if (expression)   // branched version
    A += C;
// OR
A += (expression) * C;   // non-branching version
The timings were as follows:
Branched, sorted list: approximately 2 seconds.
Branched, unsorted list: approximately 10 seconds.
My implementation (whether sorted or unsorted): approximately 3 seconds.
This goes to show that at an unsorted bottleneck, where a trivial branch can simply be replaced by a single multiplication,
the implementation I have suggested is probably worth considering.
** Once again, this is mainly for the areas identified as the bottleneck. **
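For reference, a hedged sketch of the kind of measurement described above, in the spirit of the linked question (array size, iteration counts, and the 128 threshold are arbitrary choices; absolute timings will vary with compiler and hardware):
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> data(32768);
    for (int& v : data) v = std::rand() % 256;   // unsorted input; sort it to compare
    long long sumBranched = 0, sumBranchless = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (int pass = 0; pass < 10000; ++pass)
        for (int v : data)
            if (v >= 128)                        // branched version
                sumBranched += v;
    auto t1 = std::chrono::steady_clock::now();
    for (int pass = 0; pass < 10000; ++pass)
        for (int v : data)
            sumBranchless += (v >= 128) * v;     // non-branching version
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "branched:   " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "branchless: " << std::chrono::duration<double>(t2 - t1).count() << " s\n"
              << sumBranched << " " << sumBranchless << "\n";   // keep the sums live
}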

Go switch vs if-else efficiency

In Go, switches are much more flexible than in C (and C++), since cases can be arbitrary boolean expressions and can seemingly replace large else-if ladders entirely, especially with the expressionless switch { ... } form.
switch {
case x < 5 && y > 2:
//...
case y == 1 || x > 2:
//...
default:
}
Is there any efficiency advantage to using a switch over else-if in Go? It seems that any efficiency boost would be lost to the switch's flexibility. Is it just up to the compiler to figure it out and see if it can build a jump table?
Is there any performance advantage to using switch over if and else?
Unless all your cases are integral constants, you lose the possibility of transforming the switch into a jump table.
So, at best, Go's switch might be equivalent to C++'s switch if you only use integral constants, but otherwise it will be no more efficient than if/else.
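To see the distinction in C++ terms, here is a hedged sketch (whether a compiler actually emits a jump table also depends on how many cases there are and how dense they are):
// Small, dense integral cases: a classic jump-table candidate.
int classify_dense(int x) {
    switch (x) {
        case 0: return 10;
        case 1: return 20;
        case 2: return 30;
        case 3: return 40;
        default: return -1;
    }
}
// Arbitrary boolean conditions, like Go's expressionless switch: this can
// only become a compare-and-branch chain.
int classify_conditions(int x, int y) {
    if (x < 5 && y > 2)  return 10;
    if (y == 1 || x > 2) return 20;
    return -1;
}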
It's completely up to the compiler to figure it out and choose a good implementation strategy for your code. You can always find out what code the compiler is generating by requesting an assembly listing of the compiler output. See the -S option to the Go compiler.
It's almost certainly irrelevant to your application's performance. There are probably other, more significant places where you can improve performance; saving a single SQL query is probably worth something like a million if/else/switch evaluations.
Do not worry much about details like that and focus on higher-level concerns.

What's more costly on current CPUs: arithmetic operations or conditionals?

20-30 years ago, arithmetic operations like division were among the most costly operations for CPUs. Saving one division in a piece of repeatedly called code was a significant performance gain. But today CPUs have fast arithmetic operations, and since they rely heavily on instruction pipelining, conditionals can disrupt efficient execution. If I want to optimize code for speed, should I prefer arithmetic operations over conditionals?
Example 1
Suppose we want to implement operations modulo n. What will perform better:
int c = a + b;
result = (c >= n) ? (c - n) : c;
or
result = (a + b) % n;
?
Example 2
Let's say we're converting 24-bit signed numbers to 32-bit. What will perform better:
int32_t x = ...;
result = (x & 0x800000) ? (x | 0xff000000) : x;
or
result = (x << 8) >> 8;
?
All the low-hanging fruit has already been picked, and pickled, by the authors of compilers and the people who build hardware. If you are the kind of person who needs to ask such a question, you are unlikely to be able to optimize anything by hand.
While 20 years ago it was possible for a relatively competent programmer to make some optimizations by dropping down to assembly, nowadays it is the domain of experts specializing in the target architecture; also, optimization requires knowing not only the program but also the data it will process. Everything comes down to heuristics, tests under different conditions, etc.
Simple performance questions no longer have simple answers.
If you want to optimise for speed, you should just tell your compiler to optimise for speed. Modern compilers will generally outperform you in this area.
I've sometimes been surprised trying to relate assembly code back to the original source for this very reason.
Optimise your source code for readability and let the compiler do what it's best at.
I expect that in example #1, the first will perform better. The compiler will probably apply some bit-twiddling trick to avoid a branch. But you're taking advantage of knowledge that it's extremely unlikely that the compiler can deduce: namely that the sum is always in the range [0:2*n-2] so a single subtraction will suffice.
For example #2, the second way is both faster on modern CPUs and simpler to follow. A judicious comment would be appropriate in either version. (I wouldn't be surprised to see the compiler convert the first version into the second.)
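For completeness, here is a hedged C++ sketch of both examples as small functions (names are made up; the modular-add version relies on the assumption discussed above that the sum is at most 2n - 2, i.e. both operands are already reduced mod n):
#include <cstdint>

// Example 1: (a + b) mod n, assuming 0 <= a < n and 0 <= b < n, so a single
// conditional subtraction suffices.
int add_mod(int a, int b, int n) {
    int c = a + b;
    return (c >= n) ? (c - n) : c;
    // return (a + b) % n;   // division-based alternative from the question
}

// Example 2: sign-extend a 24-bit value held in the low bits of x
// (assumes the top 8 bits of x are already clear).
int32_t sign_extend_24(int32_t x) {
    return (x & 0x800000) ? (int32_t)(x | 0xff000000) : x;
    // return (x << 8) >> 8; // shift-based alternative; relies on arithmetic
    //                       // right shift, which is implementation-defined
}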

Is Scala functional programming slower than traditional coding?

In one of my first attempts to create functional code, I ran into a performance issue.
I started with a common task - multiply the elements of two arrays and sum up the results:
var first: Array[Float] = ...
var second: Array[Float] = ...
var sum = 0f
for (ix <- 0 until first.length)
  sum += first(ix) * second(ix)
Here is how I rewrote the work in a functional style:
sum = first.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
When I benchmarked the two approaches, the second method takes 40 times as long to complete!
Why does the second method take so much longer? How can I rewrite the work to be both speed-efficient and in a functional programming style?
The main reasons why these two examples are so different in speed are:
the faster one doesn't use any generics, so it doesn't face boxing/unboxing.
the faster one doesn't create temporary collections and, thus, avoids extra memory copies.
Let's consider the slower one by parts. First:
first.zip(second)
That creates a new array, an array of Tuple2. It will copy all elements from both arrays into Tuple2 objects, and then copy a reference to each of these objects into a third array. Now, notice that Tuple2 is parameterized, so it can't store Float directly. Instead, new instances of java.lang.Float are created for each number, the numbers are stored in them, and then a reference for each of them is stored into the Tuple2.
map{ case (a,b) => a*b }
Now a fourth array is created. To compute the values of these elements, it needs to read the reference to the tuple from the third array, read the reference to the java.lang.Float stored in them, read the numbers, multiply, create a new java.lang.Float to store the result, and then pass this reference back, which will be de-referenced again to be stored in the array (arrays are not type-erased).
We are not finished, though. Here's the next part:
reduceLeft(_+_)
That one is relatively harmless, except that it still does boxing/unboxing and java.lang.Float creation at each iteration, since reduceLeft receives a Function2, which is parameterized.
Scala 2.8 introduces a feature called specialization which will get rid of a lot of these boxing/unboxing. But let's consider alternative faster versions. We could, for instance, do map and reduceLeft in a single step:
sum = first.zip(second).foldLeft(0f) { case (a, (b, c)) => a + b * c }
We could use view (Scala 2.8) or projection (Scala 2.7) to avoid creating intermediary collections altogether:
sum = first.view.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
This last one doesn't actually save much, so I think the non-strictness is being "lost" pretty fast (i.e., one of these methods is strict even in a view). There's also an alternative way of zipping that is non-strict (i.e., avoids some intermediary results) by default:
sum = (first,second).zipped.map{ case (a,b) => a*b }.reduceLeft(_+_)
This gives a much better result than the former. It is better than the foldLeft one too, though not by much. Unfortunately, we can't combine zipped with foldLeft, because the former doesn't support the latter.
The last one is the fastest I could get. Faster than that, only with specialization. Now, Function2 happens to be specialized, but only for Int, Long and Double. The other primitives were left out, as specialization increases code size rather dramatically for each primitive. In my tests, though, Double actually takes longer. That might be a result of it being twice the size, or it might be something I'm doing wrong.
So, in the end, the problem is a combination of factors, including producing intermediary copies of elements and the way the JVM handles primitives and generics. Similar code in Haskell using supercompilation would be the equal of anything short of assembler. On the JVM, you have to be aware of the trade-offs and be prepared to optimize critical code.
I did some variations of this with Scala 2.8. The loop version is as you wrote it, but the functional version is slightly different:
(xs, ys).zipped map (_ * _) reduceLeft(_ + _)
I ran with Double instead of Float, because currently specialization only kicks in for Double. I then tested with arrays and vectors as the carrier type. Furthermore, I tested boxed variants, which work on java.lang.Double instead of primitive Double, to measure the effect of primitive-type boxing and unboxing. Here is what I got (running the Java 1.6_10 server VM, Scala 2.8 RC1, 5 runs per test):
loopArray 461 437 436 437 435
reduceArray 6573 6544 6718 6828 6554
loopVector 5877 5773 5775 5791 5657
reduceVector 5064 4880 4844 4828 4926
loopArrayBoxed 2627 2551 2569 2537 2546
reduceArrayBoxed 4809 4434 4496 4434 4365
loopVectorBoxed 7577 7450 7456 7463 7432
reduceVectorBoxed 5116 4903 5006 4957 5122
The first thing to notice is that by far the biggest difference is between primitive array loops and primitive array functional reduce. It's about a factor of 15 instead of the 40 you have seen, which reflects improvements in Scala 2.8 over 2.7. Still, primitive array loops are the fastest of all tests whereas primitive array reduces are the slowest. The reason is that primitive Java arrays and generic operations are just not a very good fit. Accessing elements of primitive Java arrays from generic functions requires a lot of boxing/unboxing and sometimes even requires reflection. Future versions of Scala will specialize the Array class and then we should see some improvement. But right now that's what it is.
If you go from arrays to vectors, you notice several things. First, the reduce version is now faster than the imperative loop! This is because vector reduce can make use of efficient bulk operations. Second, vector reduce is faster than array reduce, which illustrates the inherent overhead that arrays of primitive types pose for generic higher-order functions.
If you eliminate the overhead of boxing/unboxing by working only with boxed java.lang.Double values, the picture changes. Now reduce over arrays is a bit less than 2 times slower than looping, instead of the 15 times difference before. That more closely approximates the inherent overhead of the three loops with intermediate data structures instead of the fused loop of the imperative version. Looping over vectors is now by far the slowest solution, whereas reducing over vectors is a little bit slower than reducing over arrays.
So the overall answer is: it depends. If you have tight loops over arrays of primitive values, nothing beats an imperative loop. And there's no problem writing the loops because they are neither longer nor less comprehensible than the functional versions. In all other situations, the FP solution looks competitive.
This is a microbenchmark, and it depends on how the compiler optimizes your code. You have three loops composed here:
zip . map . fold
Now, I'm fairly sure the Scala compiler cannot fuse those three loops into a single loop, and the underlying data type is strict, so each (.) corresponds to an intermediate array being created. The imperative/mutable solution would reuse the buffer each time, avoiding copies.
Now, an understanding of what composing those three functions means is key to understanding performance in a functional programming language -- and indeed, in Haskell, those three loops will be optimized into a single loop that reuses an underlying buffer -- but Scala cannot do that.
There are benefits to sticking to the combinator approach, however -- by distinguishing those three functions, it will be easier to parallelize the code (replace map with parMap etc). In fact, given the right array type, (such as a parallel array) a sufficiently smart compiler will be able to automatically parallelize your code, yielding more performance wins.
So, in summary:
naive translations may have unexpected copies and inefficiencies
clever FP compilers remove this overhead (but Scala can't yet)
sticking to the high level approach pays off if you want to retarget your code, e.g. to parallelize it
Don Stewart has a fine answer, but it might not be obvious how going from one loop to three creates a factor-of-40 slowdown. I'll add to his answer that Scala compiles to JVM bytecode, and not only does the Scala compiler not fuse the three loops into one, but it is almost certainly allocating all the intermediate arrays. Notoriously, implementations of the JVM are not designed to handle the allocation rates required by functional languages. Allocation is a significant cost in functional programs, and that's one reason the loop-fusion transformations that Don Stewart and his colleagues have implemented for Haskell are so powerful: they eliminate lots of allocations. When you don't have those transformations, and you're using an expensive allocator such as is found on a typical JVM, that's where the big slowdown comes from.
Scala is a great vehicle for experimenting with the expressive power of an unusual mix of language ideas: classes, mixins, modules, functions, and so on. But it's a relatively young research language, and it runs on the JVM, so it's unreasonable to expect great performance except on the kind of code that JVMs are good at. If you want to experiment with the mix of language ideas that Scala offers, great—it's a really interesting design—but don't expect the same performance on pure functional code that you'd get with a mature compiler for a functional language, like GHC or MLton.
Is Scala functional programming slower than traditional coding?
Not necessarily. Stuff to do with first-class functions, pattern matching, and currying need not be especially slow. But with Scala, more than with other implementations of other functional languages, you really have to watch out for allocations—they can be very expensive.
The Scala collections library is fully generic, and the operations provided are chosen for maximum capability, not maximum speed. So, yes, if you use a functional paradigm with Scala without paying attention (especially if you are using primitive data types), your code will take longer to run (in most cases) than if you use an imperative/iterative paradigm without paying attention.
That said, you can easily create non-generic functional operations that perform quickly for your desired task. In the case of working with pairs of floats, we might do the following:
class FastFloatOps(a: Array[Float]) {
  def fastMapOnto(f: Float => Float) = {
    var i = 0
    while (i < a.length) { a(i) = f(a(i)); i += 1 }
    this
  }
  def fastMapWith(b: Array[Float])(f: (Float, Float) => Float) = {
    val len = a.length min b.length
    val c = new Array[Float](len)
    var i = 0
    while (i < len) { c(i) = f(a(i), b(i)); i += 1 }
    c
  }
  def fastReduce(f: (Float, Float) => Float) = {
    if (a.length == 0) Float.NaN
    else {
      var r = a(0)
      var i = 1
      while (i < a.length) { r = f(r, a(i)); i += 1 }
      r
    }
  }
}
implicit def farray2fastfarray(a: Array[Float]) = new FastFloatOps(a)
and then these operations will be much faster. (Faster still if you use Double and 2.8.RC1, because then the functions (Double,Double)=>Double will be specialized, not generic; if you're using something earlier, you can create your own abstract class F { def f(a: Float) : Float } and then call with new F { def f(a: Float) = a*a } instead of (a: Float) => a*a.)
Anyway, the point is that it's not the functional style that makes functional coding in Scala slow, it's that the library is designed with maximum power/flexibility in mind, not maximum speed. This is sensible, since each person's speed requirements are typically subtly different, so it's hard to cover everyone supremely well. But if it's something you're doing more than just a little, you can write your own stuff where the performance penalty for a functional style is extremely small.
I am not an expert Scala programmer, so there is probably a more efficient method, but what about something like this? It can be tail-call optimized, so performance should be OK.
def multiply_and_sum(l1: List[Int], l2: List[Int], sum: Int): Int = {
  if (l1 != Nil && l2 != Nil) {
    multiply_and_sum(l1.tail, l2.tail, sum + (l1.head * l2.head))
  } else {
    sum
  }
}

val first = Array(1, 2, 3, 4, 5)
val second = Array(6, 7, 8, 9, 10)
multiply_and_sum(first.toList, second.toList, 0) // Returns: 130
To answer the question in the title: Simple functional constructs may be slower than imperative on the JVM.
But if we consider only simple constructs, then we might as well throw out all modern languages and stick with C or assembler. If you look at the programming language shootout, C always wins.
So why choose a modern language? Because it lets you express a cleaner design, and a cleaner design leads to performance gains in the overall operation of the application, even if some low-level methods are slower. One of my favorite examples is the performance of BuildR vs. Maven. BuildR is written in Ruby, an interpreted, slow language; Maven is written in Java. Yet a build in BuildR is twice as fast as one in Maven. This is due mostly to the design of BuildR, which is lightweight compared with that of Maven.
Your functional solution is slow because it is generating unnecessary temporary data structures. Removing these is known as deforestation, and it is easily done in strict functional languages by rolling your anonymous functions into a single anonymous function and using a single aggregator. For example, your solution written in F# using zip, map and reduce:
let dot xs ys = Array.zip xs ys |> Array.map (fun (x, y) -> x * y) |> Array.reduce ( * )
may be rewritten using fold2 so as to avoid all temporary data structures:
let dot xs ys = Array.fold2 (fun t x y -> t + x * y) 0.0 xs ys
This is a lot faster and the same transformation can be done in Scala and other strict functional languages. In F#, you can also define the fold2 as inline in order to have the higher-order function inlined with its functional argument whereupon you recover the optimal performance of the imperative loop.
Here is dbyrnes' solution with arrays (assuming arrays are to be used), just iterating over the index:
def multiplyAndSum(l1: Array[Double], l2: Array[Double]): Double =
{
  def productSum(idx: Int, sum: Double): Double =
    if (idx < l1.length)
      productSum(idx + 1, sum + (l1(idx) * l2(idx)))
    else
      sum
  if (l2.length == l1.length)
    productSum(0, 0.0)
  else
    error("lengths don't fit " + l1.length + " != " + l2.length)
}

val first = (1 to 500).map(_ * 1.1).toArray
val second = (11 to 510).map(_ * 1.2).toArray

def loopi(n: Int) = (1 to n).foreach(dummy => multiplyAndSum(first, second))
println(timed(loopi(100 * 1000)))   // 'timed' is a timing helper defined elsewhere
That needs about 1/40 of the time of the list approach. I don't have 2.8 installed, so you will have to test @tailrec yourself. :)
