Scala: branching statement optimisation using callback function degrading performance

It is common in C/C++ programming to use function pointers to optimize branching in the main data path, so I wrote a test program to find out whether similar performance savings can be obtained in Scala using functional programming techniques. The use case is a function that is invoked millions of times and branches on a global flag. The code using an if() statement:
val b = true

def test() = {
  if (b) {
    // do something
  } else {
    // do something else
  }
}

for (i <- 0 to 100000) test()
And to try to get rid of the if(), I did this:
def onTrue() = { /* do something */ }
def onFalse() = { /* do something else */ }

lazy val callback: () => Unit = if (b) onTrue else onFalse
def test() = callback()

for (i <- 0 to 100000) test()
I compared the two programs by running them with large loop counters, repeating the runs many times, and measuring the elapsed time with System.nanoTime() differences.
My tests suggest that the callback version is actually SLOWER than using if() in the loop. The reason could be that each callback invocation requires parameters and return addresses to be pushed onto the stack and a new stack frame to be created. Given this finding, I wanted to know:
1. Is there a functional way to write this in Scala that beats the performance of the if() in the loop?
2. @inline works at compile time. Is there a runtime equivalent that avoids this stack overhead (similar to tail-call optimization)?
3. Could my test or results be inaccurate/erroneous in some way?

3) It's very easy to get your methodology wrong when testing this way. Use something like JMH if you want quasi-trustable microbenchmarks! (A minimal JMH sketch follows below.)
2) The JVM does inlining at runtime.
1) You aren't measuring a difference in whether something is "functional". You're measuring the difference between using a lazy val and not. If you don't have the lazy val in there, the JVM will probably be able to optimize your code (depending on what "do something" is).
If you remove the lazy val, the second one optimizes to the same speed in my hands. (It has an extra mandatory check for every access that it isn't being initialized in a multi-threaded context.)
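For reference, a sketch of the lazy-val-free variant (based on the question's own code; the explicit function literals just make the wrapping obvious):

val b = true
def onTrue(): Unit  = { /* do something */ }
def onFalse(): Unit = { /* do something else */ }

// A plain val: the branch is taken once at initialization, and each call goes
// straight through the stored function with no lazy-initialization check.
val callback: () => Unit = if (b) () => onTrue() else () => onFalse()
def test() = callback()

And for the methodology point, a minimal JMH sketch (assuming the sbt-jmh plugin; the class, method, and field names here are illustrative, not from the post):

import org.openjdk.jmh.annotations.{Benchmark, Scope, State}

@State(Scope.Thread)
class BranchBench {
  val b = true
  var sink = 0

  def onTrue(): Unit  = { sink += 1 }
  def onFalse(): Unit = { sink -= 1 }

  val callback: () => Unit = if (b) () => onTrue() else () => onFalse()

  @Benchmark
  def withIf(): Int = { if (b) onTrue() else onFalse(); sink }   // branch in the hot path

  @Benchmark
  def withCallback(): Int = { callback(); sink }                 // indirect call in the hot path
}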


Removing mutability without losing speed

I have a function like this:
fun randomWalk(numSteps: Int): Int {
    var n = 0
    repeat(numSteps) { n += (-1 + 2 * Random.nextInt(2)) }
    return n.absoluteValue
}
This works fine, except that it uses a mutable variable, and I would like to make everything immutable when possible, for better safety and readability. So I came up with an equivalent version that doesn't use any mutable variables:
fun randomWalk_seq(numSteps: Int): Int =
    generateSequence(0) { it + (-1 + 2 * Random.nextInt(2)) }
        .elementAt(numSteps)
        .absoluteValue
This also works fine and produces the same results, but it takes 3 times longer.
I used the following way to measure it:
@OptIn(ExperimentalTime::class)
fun main() {
    val numSamples = 100000
    val numSteps = 15708
    repeat(5) {
        val randomWalkSamples: IntArray
        val duration = measureTime {
            randomWalkSamples = IntArray(numSamples) { randomWalk(numSteps) }
        }
        println(duration)
    }
}
I know it's a bit hacky (I could have used JMH but this is just a quick test - at least I know that measureTime uses a monotonic clock). The results for the iterative (mutable) version:
2.965358406s
2.560777033s
2.554363661s
2.564279403s
2.608323586s
As expected, the first line shows it took a bit longer on the first run due to the warming up of the JIT, but the next 4 lines have fairly small variation.
After replacing randomWalk with randomWalk_seq:
6.636866719s
6.980840906s
6.993998111s
6.994038706s
7.018054467s
Somewhat surprisingly, I don't see any warm-up effect here: the first run is always faster than the following four, every time I run this. Also, the durations keep increasing from run to run, with the fifth always being the slowest.
Can someone explain the findings, and also is there any way of making this function not use any mutable variables but still have performance that is close to the mutable version?
Your solution is slower for two main reasons: boxing and the complexity of the iterator used by generateSequence()'s Sequence implementation.
Boxing happens because a Sequence uses its types generically, so it cannot use primitive 32-bit Ints directly, but must wrap them in classes and unwrap them when retrieving the items.
You can see the complexity of the iterator by Ctrl+clicking the generateSequence function to view the source code.
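The same trade-off shows up across JVM languages. As a rough Scala analogue (illustrative only, not from this answer), summing through a generic collection boxes every element, while a primitive array and a plain loop stay unboxed:

val xs = Array.fill(1000000)(1)   // compiles to a primitive int[] on the JVM

val boxedSum = xs.toList.sum      // List[Int] stores boxed java.lang.Integer values

var primitiveSum = 0
var i = 0
while (i < xs.length) {           // no boxing anywhere in this loop
  primitiveSum += xs(i)
  i += 1
}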
@Михаил Нафталь's suggestion is faster because it avoids the sequence's complex iterator, but it still boxes.
I tried writing an overload of sumOf that works on IntProgression directly instead of Iterable<T>, so it avoids boxing, and that resulted in performance equivalent to your imperative code with the var. It is inline, so when combined with the { -1 + 2 * Random.nextInt(2) } lambda suggested by @Михаил Нафталь, the resulting compiled code is equivalent to your imperative code.
inline fun IntProgression.sumOf(selector: (Int) -> Int): Int {
    var sum = 0
    for (element in this) {
        sum += selector(element)
    }
    return sum
}
Ultimately, I don't think you're buying yourself much in the way of code clarity by removing a single var in such a small function. I would say the sequence code is arguably harder to read. vars may add to code complexity in complex algorithms, but I don't think they do in such simple algorithms, especially when there's only one of them and it's local to the function.
An equivalent immutable one-liner is:
fun randomWalk2(numSteps: Int) =
(1..numSteps).sumOf { -1 + 2 * Random.nextInt(2) }.absoluteValue
Probably even more performant would be to replace the per-step term -1 + 2 * Random.nextInt(2) with plain Random.nextInt(2), applying the -1 and the multiplication by 2 once outside the sum, so that you'll have one multiplication and n additions instead of n multiplications and (2*n-1) additions:
fun randomWalk3(numSteps: Int) =
(-numSteps + 2 * (1..numSteps).sumOf { Random.nextInt(2) }).absoluteValue
Update
As @Tenfour04 noted, there is no specific stdlib implementation for IntProgression.sumOf, so it's resolved to Iterable<T>.sumOf, which will add unnecessary overhead for int boxing.
So, it's better to use IntArray here instead of IntProgression:
fun randomWalk4(numSteps: Int) =
(-numSteps + 2 * IntArray(numSteps).sumOf { Random.nextInt(2) }).absoluteValue
I still encourage you to check all of this with JMH.
I think:"Removing mutability without losing speed" is wrong title .because
mutability thing comes to deal with the flow that program want to achieve .
you are using var inside function.... and 100% this var will not ever change from outside this function and that is mutability concept.
if we git rid off from var everywhere why we need it in programming ?

Why can't dead code detection be fully solved by a compiler?

The compilers I've been using for C or Java have dead code detection (warning when a line will never be executed). My professor says this problem can never be fully solved by compilers, though. I was wondering why that is. I am not too familiar with the actual coding of compilers, as this is a theory-based class, but I was wondering what they check (such as possible input strings vs. acceptable inputs, etc.), and why that is insufficient.
The dead code problem is related to the Halting problem.
Alan Turing proved that it is impossible to write a general algorithm that will be given a program and be able to decide whether that program halts for all inputs. You may be able to write such an algorithm for specific types of programs, but not for all programs.
How does this relate to dead code?
The Halting problem is reducible to the problem of finding dead code. That is, if you find an algorithm that can detect dead code in any program, then you can use that algorithm to test whether any program will halt. Since that has been proven to be impossible, it follows that writing an algorithm for dead code is impossible as well.
How do you transfer an algorithm for dead code into an algorithm for the Halting problem?
Simple: you add a line of code after the end of the program you want to check for halting. If your dead-code detector reports that this line is dead, then you know that the program does not halt. If it doesn't, then you know that the program halts (it reaches its last line, and then your added line of code).
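To make the reduction concrete, here is a purely hypothetical sketch in Scala; the dead-code oracle is imagined, not a real API, which is exactly the point:

// detectsDeadCode(programText, statement) is a hypothetical oracle that reports
// whether `statement` is unreachable in `programText`. If it existed, it would decide halting.
def wouldHalt(programText: String, detectsDeadCode: (String, String) => Boolean): Boolean = {
  val marked = programText + "\nmarker()"   // append one extra statement after the program
  // marker() is dead  <=>  the program never reaches its end  <=>  it does not halt
  !detectsDeadCode(marked, "marker()")
}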
Compilers usually check for things that can be proven at compile-time to be dead. For example, blocks that are dependent on conditions that can be determined to be false at compile time. Or any statement after a return (within the same scope).
These are specific cases, and therefore it's possible to write an algorithm for them. It may be possible to write algorithms for more complicated cases (like an algorithm that checks whether a condition is syntactically a contradiction and therefore will always return false), but still, that wouldn't cover all possible cases.
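For example, the two easy cases just mentioned look like this in Scala (illustrative; with -Ywarn-dead-code or -Wdead-code the compiler can flag the unreachable expression):

def f(x: Int): Int = {
  if (false) println("never printed")   // condition provably false at compile time
  return x + 1
  x - 1                                 // unreachable: it follows an unconditional return
}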
Well, let's take the classical proof of the undecidability of the halting problem and change the halting-detector to a dead-code detector!
C# program
using System;
using YourVendor.Compiler;
class Program
{
    static void Main(string[] args)
    {
        string quine_text = @"using System;
using YourVendor.Compiler;
class Program
{{
    static void Main(string[] args)
    {{
        string quine_text = @{0}{1}{0};
        quine_text = string.Format(quine_text, (char)34, quine_text);
        if (YourVendor.Compiler.HasDeadCode(quine_text))
        {{
            System.Console.WriteLine({0}Dead code!{0});
        }}
    }}
}}";
        quine_text = string.Format(quine_text, (char)34, quine_text);
        if (YourVendor.Compiler.HasDeadCode(quine_text))
        {
            System.Console.WriteLine("Dead code!");
        }
    }
}
If YourVendor.Compiler.HasDeadCode(quine_text) returns false, then the line System.Console.WriteLine("Dead code!"); will never be executed, so this program actually does have dead code, and the detector was wrong.
But if it returns true, then the line System.Console.WriteLine("Dead code!"); will be executed, and since there is no more code in the program, there is no dead code at all, so again, the detector was wrong.
So there you have it, a dead-code detector that returns only "There is dead code" or "There is no dead code" must sometimes yield wrong answers.
If the halting problem is too obscure, think of it this way.
Take a mathematical statement that is believed to be true for all positive integers n, but hasn't been proven to be true for every n. A good example is Goldbach's conjecture: that any positive even integer greater than two can be represented as the sum of two primes. Then (with an appropriate bigint library) run this program (pseudocode follows):
for (BigInt n = 4; ; n += 2) {
    if (!isGoldbachsConjectureTrueFor(n)) {
        print("Conjecture is false for at least one value of n\n");
        exit(0);
    }
}
The implementation of isGoldbachsConjectureTrueFor() is left as an exercise for the reader, but for this purpose it could be a simple iteration over all primes less than n.
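One possible Scala sketch of that check (assumed, not from the answer): an even n satisfies the conjecture if some prime p up to n/2 has a prime complement n - p.

def isPrime(k: BigInt): Boolean = {
  if (k < 2) return false
  var d = BigInt(2)
  while (d * d <= k) {          // trial division is enough for a sketch
    if (k % d == 0) return false
    d += 1
  }
  true
}

def isGoldbachsConjectureTrueFor(n: BigInt): Boolean =
  (BigInt(2) to n / 2).exists(p => isPrime(p) && isPrime(n - p))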
Now, logically the above must either be the equivalent of:
for (; ;) {
}
(i.e. an infinite loop) or
print("Conjecture is false for at least one value of n\n");
as Goldbach's conjecture must either be true or not true. If a compiler could always eliminate dead code, there would definitely be dead code to eliminate here, in either case. However, in doing so your compiler would, at the very least, need to solve arbitrarily hard problems. We could hand it provably hard problems (e.g. NP-complete ones) that it would have to solve in order to determine which bit of code to eliminate. For instance, if we take this program:
String target = "f3c5ac5a63d50099f3b5147cabbbd81e89211513a92e3dcd2565d8c7d302ba9c";
for (BigInt n = 0; n < 2**2048; n++) {
String s = n.toString();
if (sha256(s).equals(target)) {
print("Found SHA value\n");
exit(0);
}
}
print("Not found SHA value\n");
we know that the program will either print out "Found SHA value" or "Not found SHA value" (bonus points if you can tell me which one is true). However, for a compiler to be able to reasonably optimise that, it would need on the order of 2^2048 iterations. It would in fact be a great optimisation, as I predict that without it the above program would (or might) run until the heat death of the universe before printing anything.
I don't know if C++ or Java have an Eval-type function, but many languages do allow you to call methods by name. Consider the following (contrived) VBA example.
Dim methodName As String
If foo Then
methodName = "Bar"
Else
methodName = "Qux"
End If
Application.Run(methodName)
The name of the method to be called is impossible to know until runtime. Therefore, by definition, the compiler cannot know with absolute certainty that a particular method is never called.
Actually, given the example of calling a method by name, the branching logic isn't even necessary. Simply saying
Application.Run("Bar")
is more than the compiler can determine. When the code is compiled, all the compiler knows is that a certain string value is being passed to that method. It doesn't check whether that method exists until runtime. If the method isn't called elsewhere through more conventional means, an attempt to find dead methods can return false positives. The same issue exists in any language that allows code to be called via reflection.
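A rough JVM analogue, sketched in Scala (the names are illustrative, not from the answer): the method to invoke is chosen by a runtime string, so the compiler cannot prove that either target is never called.

class Handlers {
  def bar(): Unit = println("Bar")
  def qux(): Unit = println("Qux")
}

val methodName = if (scala.util.Random.nextBoolean()) "bar" else "qux"
val handlers = new Handlers
// java.lang.reflect dispatch by name: the target is resolved only at runtime
handlers.getClass.getMethod(methodName).invoke(handlers)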
Unconditional dead code can be detected and removed by advanced compilers.
But there is also conditional dead code: code whose reachability cannot be known at compile time and can only be detected at runtime. For example, software may be configurable to include or exclude certain features depending on user preferences, making certain sections of code seemingly dead in particular scenarios. That is not real dead code.
There are specific tools that can do testing, resolve dependencies, remove conditional dead code and recombine the useful code at runtime for efficiency. This is called dynamic dead code elimination. But as you can see, it is beyond the scope of compilers.
A simple example:
int readValueFromPort(const unsigned int portNum);
int x = readValueFromPort(0x100); // just an example, nothing meaningful
if (x < 2)
{
std::cout << "Hey! X < 2" << std::endl;
}
else
{
std::cout << "X is too big!" << std::endl;
}
Now assume that the port 0x100 is designed to return only 0 or 1. In that case the compiler cannot figure out that the else block will never be executed.
However in this basic example:
bool boolVal = /*anything boolean*/;
if (boolVal)
{
// Do A
}
else if (!boolVal)
{
// Do B
}
else
{
// Do C
}
Here the compiler can work out that the final else block is dead code.
So the compiler can warn about dead code only if it has enough data to figure it out, and it must also know how to apply that data in order to determine whether a given block is dead.
EDIT
Sometimes the data is just not available at compile time:
// File a.cpp
bool boolMethod();
bool boolVal = boolMethod();
if (boolVal)
{
// Do A
}
else
{
// Do B
}
//............
// File b.cpp
bool boolMethod()
{
return true;
}
While compiling a.cpp the compiler cannot know that boolMethod always returns true.
The compiler will always lack some context information. E.g., you might know that a double value never exceeds 2, because that is a property of the mathematical function you use from a library. The compiler does not even see the code in the library, and it can never know all the properties of all mathematical functions, or detect all the weird and complicated ways they can be implemented.
The compiler doesn't necessarily see the whole program. I could have a program that calls a shared library, which calls back into a function in my program which isn't called directly.
So a function which is dead with respect to the library it's compiled against could become alive if that library was changed at runtime.
If a compiler could eliminate all dead code accurately, it would be called an interpreter.
Consider this simple scenario:
if (my_func()) {
am_i_dead();
}
my_func() can contain arbitrary code and in order for the compiler to determine whether it returns true or false, it will either have to run the code or do something that is functionally equivalent to running the code.
The idea of a compiler is that it only performs a partial analysis of the code, thus simplifying the job of a separate running environment. If you perform a full analysis, that isn't a compiler any more.
If you consider the compiler as a function c(), where c(source)=compiled code, and the running environment as r(), where r(compiled code)=program output, then to determine the output for any source code you have to compute the value of r(c(source code)). If calculating c() requires the knowledge of the value of r(c()) for any input, there is no need for a separate r() and c(): you can just derive a function i() from c() such that i(source)=program output.
Others have commented on the halting problem and so forth. These generally apply to portions of functions. However it can be hard/impossible to know whether even an entire type (class/etc) is used or not.
In .NET/Java/JavaScript and other runtime driven environments there's nothing stopping types being loaded via reflection. This is popular with dependency injection frameworks, and is even harder to reason about in the face of deserialisation or dynamic module loading.
The compiler cannot know whether such types would be loaded. Their names could come from external config files at runtime.
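A small Scala sketch of that situation (the property key and class name are made up for illustration): the type to load comes from configuration at runtime, so nothing in the compiled code statically references it.

// Which class gets instantiated is decided by a system property read at runtime.
val className = sys.props.getOrElse("app.plugin", "com.example.DefaultPlugin")
val plugin = Class.forName(className).getDeclaredConstructor().newInstance()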
You might like to search around for tree shaking which is a common term for tools that attempt to safely remove unused subgraphs of code.
Take a function
void DoSomeAction(int actnumber)
{
switch(actnumber)
{
case 1: Action1(); break;
case 2: Action2(); break;
case 3: Action3(); break;
}
}
Can you prove that actnumber will never be 2 so that Action2() is never called...?
I disagree about the halting problem. I wouldn't call such code dead even though in reality it will never be reached.
Instead, lets consider:
for (int N = 3; ; N++)
    for (int A = 2; A < int.MaxValue; A++)
        for (int B = 2; B < int.MaxValue; B++)
        {
            double Sum = Math.Pow(A, N) + Math.Pow(B, N);
            double Root = Math.Pow(Sum, 1.0 / N);
            if (Root == Math.Truncate(Root))
                FermatWasWrong();
        }

private void FermatWasWrong()
{
    Press.Announce("Fermat was wrong!");
    Nobel.Claim();
}
(Ignore the type and overflow errors) Dead code?
Look at this example:
public boolean isEven(int i){
if(i % 2 == 0)
return true;
if(i % 2 == 1)
return false;
return false;
}
The compiler can't know that an int can only be even or odd here, so it would need to understand the semantics of your code. How should that be implemented? The compiler can't prove that the lowest return will never be executed, and therefore it can't detect the dead code.

performance of many if statements/switch cases

If I had literally 1000s of simple if statements or switch statements
ex:
if 'a':
return 1
if 'b':
return 2
if 'c':
return 3
...
...
Would a chain of trivial if statements be faster than searching a list for something? I imagine that because every if statement must be tested until the desired output is found (worst case O(n)), it would perform about the same as searching through a list. This is just an assumption; I have no evidence to prove it, and I am curious to know.
You could potentially put these things into delegates that are then stored in a map, the key of which is the input you've specified.
C# Example:
// declare a map. The input(key) is a char, and we have a function that will return an
// integer based on that char. The function may do something more complicated.
var map = new Dictionary<char, Func<char, int>>();
// Add some:
map['a'] = (c) => { return 1; };
map['b'] = (c) => { return 2; };
map['c'] = (c) => { return 3; };
// etc... ad infinitum.
Now that we have this map, we can quite cleanly return something based on the input
public int Test(char c)
{
Func<char, int> func;
if(map.TryGetValue(c, out func))
return func(c);
return 0;
}
In the above code, we can call Test and it will find the appropriate function to call (if present). This approach is better (imho) than a list as you'd have to potentially search the entire list to find the desired input.
This depends on the language and the compiler/interpreter you use. In many interpreted languages, the performance will be the same, in other languages, the switch statements gives the compiler crucial additional information that it can use to optimize the code.
In C, for instance, I expect a long switch statement like the one you present to use a lookup table under the hood, avoiding explicit comparison with all the different values. With that, your switch decision takes the same time, no matter how many cases you have. A compiler might also hardcode a binary search for the matching case. These optimizations are typically not performed when evaluating a long else if() ladder.
In any case, I repeat, it depends on the interpreter/compiler: if your compiler optimized else if() ladders but not switch statements, what it could do with a switch statement is quite irrelevant. However, for mainstream languages, you should be able to expect all of these constructs to be optimized.
Apart from that, I advise to use a switch statement wherever applicable, it carries a lot more semantic information to the reader than an equivalent else if() ladder.
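As a JVM-side illustration of the lookup-table point (a sketch, not from the answers above): a Scala match on literal Char or Int cases typically compiles to a single tableswitch/lookupswitch instruction, so dispatch cost does not grow linearly with the number of cases the way an if/else chain can.

def code(c: Char): Int = c match {
  case 'a' => 1
  case 'b' => 2
  case 'c' => 3
  case _   => 0   // javap -c should show one switch instruction, not a chain of comparisons
}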

Automated GOTO removal algorithm

I've heard that it's been proven theoretically possible to express any control flow in a Turing-complete language using only structured programming constructs (conditionals, loops and loop-breaks, and subroutine calls), without any arbitrary GOTO statements. Is there any way to use that theory to automate refactoring of code that contains GOTOs into code that does not?
Let's say I have an arbitrary single subroutine in a simple imperative language, such as C or Pascal. I also have a parser that can verify that this subroutine is valid, and produce an Abstract Syntax Tree from it. But the code contains GOTOs and Labels, which could jump forwards or backwards to any arbitrary point, including into or out of conditional or loop blocks, but not outside of the subroutine itself.
Is there an algorithm that could take this AST and rework it into new code which is semantically identical, but does not contain any Labels or GOTO statements?
In principle, it is always possible to do this, though the results might not be pretty.
One way to always eliminate gotos is to transform the program in the following way. Start off by numbering all the instructions in the original program. For example, given this program:
start:
    while (x < 3) {
        if (x < 5) goto start;
        x++
    }
You could number the statements like this:
0  start:
1      while (x < 3) {
2          if (x < 5) goto start;
3          x++
       }
To eliminate all gotos, you can simulate the flow of control through this function by using a while loop, an explicit variable holding the program counter, and a bunch of if statements. For example, you might translate the above code like this:
int PC = 0;
while (PC <= 3) {
    if (PC == 0) {
        PC = 1;              // Label has no effect.
    } else if (PC == 1) {
        if (x < 3) PC = 2;   // Enter loop.
        else PC = 4;         // Skip loop, which ends this function.
    } else if (PC == 2) {
        if (x < 5) PC = 0;   // Simulate goto.
        else PC = 3;         // Simulate if-statement fall-through.
    } else if (PC == 3) {
        x++;
        PC = 1;              // Simulate jump back up to the top of the loop.
    }
}
This is a really, really bad way to do the translation, but it shows that in theory it is always possible to do this. Actually implementing this would be very messy - you'd probably number the basic blocks of the function, then generate code that puts the basic blocks into a loop, tracks which basic block is currently executing, then simulates the effect of running a basic block and the transition from that basic block to the appropriate next basic block.
Hope this helps!
I think you want to read Taming Control Flow by Erosa and Hendren, 1994 (findable via Google Scholar).
By the way, loop-breaks are also easy to eliminate. There is a simple mechanical procedure involving the creation of a boolean state variable and the restructuring of nested conditionals to create straight-line control flow. It does not produce pretty code :)
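A small Scala sketch of that procedure (the helper parameters are placeholders, not from the answer): the break becomes an assignment to the state variable, and the rest of the body is guarded so control simply falls through to the loop test.

def drainUntilStop(items: Iterator[Int], shouldStop: Int => Boolean, process: Int => Unit): Unit = {
  var running = true                       // boolean state variable standing in for the break
  while (running && items.hasNext) {
    val item = items.next()
    if (shouldStop(item)) running = false  // was: break
    else process(item)
  }
}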
If your target language has tail-call optimization (and, ideally, inlining), you can mechanically remove both break and continue by turning the loop into a tail-recursive function. (If the index variable is modified by the loop body, you need to work harder at this. I'll just show the simplest case.) Here's the transformation of a simple loop:
The original loop:

for (Type Index = Start; Condition(Index); Index = Advance(Index)) {
    Body
}

becomes the tail-recursive function:

function loop(Index: Type):
    if (not Condition(Index))
        return                       // break
    Body
    return loop(Advance(Index))      // continue

loop(Start)
The return statements labeled "continue" and "break" are precisely the transformation of continue and break. Indeed, the first step in the procedure might have been to rewrite the loop into its equivalent form in the original language:
{
Type Index = Start;
while (true) {
if (!Condition(Index))
break;
Body;
continue;
}
}
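A concrete Scala rendering of the tail-recursive form above (a sketch; the condition/body/advance parameters are placeholders). The @tailrec annotation makes the compiler verify that the self-call is emitted as a jump, so no stack frames accumulate.

import scala.annotation.tailrec

def runLoop(start: Int)(condition: Int => Boolean, body: Int => Unit, advance: Int => Int): Unit = {
  @tailrec
  def loop(index: Int): Unit =
    if (condition(index)) {   // falling out of the if is the "break"
      body(index)
      loop(advance(index))    // the tail call is the "continue"
    }
  loop(start)
}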
I use either or both of Polyhedron's spag and VAST's 77to90 to begin the process of refactoring Fortran and then converting it to MATLAB source. However, these tools always leave 1/4 to 1/2 of the gotos in the program.
I wrote a goto remover which accomplishes something similar to what you were describing: it takes Fortran code, refactors all the remaining gotos out of a program, and replaces them with conditionals and do/cycle/exit constructs, which can then be converted into other languages like MATLAB. You can read more about the process I use here:
http://engineering.dartmouth.edu/~d30574x/consulting/consulting_gotorefactor.html
This program could be adapted to work with other languages, but I have not gotten that far yet.

Does the @inline annotation in Scala really help performance?

Or does it just clutter up the code for something the JIT would take care of automatically anyway?
I have yet to find a case where it improves performance, and I've tried in quite a few different spots. The JVM seems to be quite good at inlining when it's possible, and even if you ask for @inline in Scala, it can't always do it (and sometimes I've noticed that it doesn't even when I think it ought to be able to).
The place where you expect to see a bytecode difference is in something like this:
object InlineExample {
  final class C(val i: Int) {
    @inline def t2 = i*2
    @inline def t4 = t2*2
  }

  final class D(val i: Int) {
    def t2 = i*2
    def t4 = t2*2
  }
}
when compiled with -optimise. And you do see the difference, but it generally doesn't run any faster since the JIT compiler can notice that the same optimizations apply to D.
So it may be worth a try in the final stages of optimization, but I wouldn't bother doing it routinely without checking to see if it makes a difference in performance.

Resources