I'm curious if I break an expression into different declarations e.g
int x = (c / 23) % (5 *3);
into
int a = c /23;
int b = 5 * 3;
int x = a % b;
does it slow down execution, or the compiler should recognize that it can be declared into one variable. (or some other trick)
I want to know if it can be a concern to give up some readability at a performance sensitive function.
Of course, I'm not asking about this particular example, my question is about the the general rule here.
I'm using C++, but I guess is that this question can be generalized to any - at least compiled - language.
Depending on the optimization level selected for your compiler, separation into several lines of code should not matter.
You can always use the Godbolt Compiler Explorer to try and look at the resulting assembly code.
Related
I'm studying processors and one thing that caught my attention is the fact that high-performance CPUs have the ability to execute more than one instruction during a clock cycle and even execute them out of order in order to improve performance. All this without any help from the compilers.
As far as I understood, the processors are able to do that by analysing data dependencies to determine which instructions can be run first/in a same ILP-paralell-step (issue).
#edit
I'll try giving an example. Imagine these two pieces of code:
int myResult;
myResult = myFunc1(); // 1
myResult = myFunc2(); // 2
j = myResult + 3; // 3
-
int myFirstResult, mySecondResult;
myFirstResult = myFunc1(); // 1
mySecondResult = myFunc2(); // 2
j = mySecondResult + 3; // 3
They both do the same thing, the difference is that on the first I reuse my variable and in the second I don't.
I assume (and please correct me if I'm wrong) that the processor could run instructions 2 and 3 before instruction 1 on the second example, because the data would be stored in two different places (registers?).
The same would not be possible for the first example, because if it runs instruction 2 and 3 before instruction 1, the value assigned on instruction 1 would be kept in memory (instead of the value from instruction 2).
Question is :
Is there any strategy to run instructions 2 and 3 before 1 if I reuse the variable (like in the first example)?
Or reusing variables prevent instruction-level parallelism and OoO execution?
A modern microprocessor is an extremely sophisticated piece of equipment and already has enough complexity that understanding every single aspect of how it functions is beyond the reach of most people. There's an additional layer introduced by your compiler or runtime which increases the complexity. It's only really possible to speak in terms of generalities here, as ARM processor X might handle this than ARM processor Y, and both of those differently from Intel U or AMD V.
Looking more closely at your code:
int myResult;
myResult = myFunc1(); // 1
myResult = myFunc2(); // 2
j = myResult + 3; // 3
The int myResult line doesn't necessarily do anything CPU-wise. It's just instructing the compiler that there will be a variable named myResult of type int. It's not initialized, so there's no need to do anything yet.
On the first assignment the value is not used. By default the compiler usually does a pretty straight-forward conversion of your code to machine instructions, but when you turn on optimization, which you normally do for production code, that assumption goes out the window. A good compiler will recognize that this value is never used and will omit the assignment. A better compiler will warn you that the value is never used.
The second one actually assigns to the variable and that variable is later used. Obviously before the third assignment can happen the second assignment must be completed. There's not much optimizing that can go on here unless those functions are trivial and end up inlined. Then it's a matter of what those functions do.
A "superscalar" processsor, or one capable of running things out-of-order, has limitations on how ambitious it can get. The type of code it works best with resembles the following:
int a = 1;
int b = f();
int c = a * 2;
int d = a + 2;
int e = g(b);
The assignment of a is straightforward and immediate. b is a computed value. Where it gets interesting is that c and d have the same dependency and can actually execute in parallel. They also don't depend on b so theoretically they could run before, during, or after the f() call so long as the end-state is correct.
A single thread can execute multiple operations concurrently, but most processors have limits on the types and number of them. For example, a floating-point multiply and an integer add could happen, or two integer adds, but not two floating point multiply ops. It depends on what operations the CPU has, what registers those can operate on, and how the compiler has arranged the data in advance.
If you're looking to optimize code and shave nanoseconds off of things you'll need to find a really good technical manual on the CPU(s) you're targeting, plus spend untold hours trying different approaches and benchmarking things.
The short answer is variables don't matter. It's all about dependencies, your compiler, and what capabilities your CPU has.
I have written some code which assigns variables using the results of condition expressions without the explicit use of IF-ELSE statements.
In the simplest form, the problem looks like this:
Version 1
if (x < K)
y = A;
else
y = B;
I've seen a "trick" in the past in which people accomplish the same task in one line without the conditional like this:
Version 2
y = (x < K) * A + !(x < K) * B;
This approach extends relatively easily to handle IF-ELSE IF-ELSE assignments. The trick is to ensure that the conditions are all mutually exclusive.
From a unit testing perspective, I'm required to achieve 100% code path coverage.
My coworkers agree that the Version 2 is more elegant, but they contend it is less readable. Furthermore, they argue that I am "side-stepping" the path coverage requirement and that I would be able to achieve 100% path coverage by "hiding" the conditional logic inside the single line of code without actually exercising both conditions ((x < K) and !(x < K)).
I argue that I am able to blend Boolean algebra and numeric algebra to perform variable assignment because the computer treats Boolean 'true' and 'false' as '1' and '0' which can be multiplied by 'float' and 'int' variables. To me, it becomes simply an arithmetic expression with zeros and ones multiplying variables.
Why am I doing this?
I am doing this blend of Boolean and numeric algebra to minimize the number of IF-statements, minimize lines of code, and general code cleanup. Obviously performance can be improved by saving the result of the condition to a variable and referencing.
The Question
Is this practice (and ternary operators) frowned upon from a unit testing perspective?
If this question is too subjective, please suggest edits.
I'd suggest avoiding it (this trick is actually useful when the intention is to avoid branching, which may be the context you've seen it in). Given that the language doesn't have a conditional operator, you should be able to define the equivalent of
cond(bool, x, y) { if (bool) return x; else return y; }
yourself and write y = cond(x < K, A, B). It's more readable, harder to make a mistake when writing, is usable with non-number types, and is considered correctly in path coverage. It evaluates both sides, unlike the actual conditional operator (unless the language has macros or lazy evaluation), but so does the described trick.
Imagine that we have a been given an Excel spreadsheet with three columns, labeled COND, X and Y.
COND = TRUE or FALSE (user input)
X = if(COND == TRUE) then 0 else Y
Y = if(COND == TRUE) then X else 1;
These formulas evaluate perfectly fine in Excel, and Excel does not generate a Circular Dependency error.
I am writing a compiler that tries to convert these Excel formulas to C code. In my compiler, these formulas do generate a circular dependency error. The issue is that (naïvely) the expression of X depends on Y and the expression for Y depends on X and my compiler is unable to logically continue.
Excel is able to accomplish this feat because it is a lazy, interpreted language. Excel will just lazily evaluate the formulas at run-time (with user inputs), and since no circular dependency occurs at run-time Excel has no problem evaluating such logic.
Unfortunately, I need to convert these formulas to a compiled language (not an interpreted one). The actual formulas, in the actual spreadsheets, have more complicated dependencies between multiple cells/variables (involving up to over half a dozen different cells). This means that my compiler has to perform some kind of sophisticated static, semantic analysis of the formulas and be smart enough to detect that there are no circular references if we "look inside" the conditional branches. The compiler would then have to generate the following C code from the above Excel formulas:
bool COND;
int X, Y;
if(COND) { X = 0; Y = X; } else { Y = 1; X = Y; }
Notice that the order of the assignment instructions is different in each branch of the if-statement in C.
My question is, is there any established algorithm or literature on compilers that explains how to implement this type of analysis in a compiler? Do functional programming language compilers have to solve this problem?
Why aren't standard optimization techniques adequate?
Presumably, the Excel formulas form a DAG with the leaves being primitive values and the nodes being computations/assignments. (If the Excel computation forms a cycle, then you need
some kind of iterative solver assuming you want a fixpoint).
If you simply propagate the conditional by lifting it (a class compiler optimization), we start with your original equations, where each computation is evaluated in any order WRT to others, such that the result computes dag-like (that "anyorder" is an operator intending to model that):
X = if(COND == TRUE) then 0 else Y;
anyorder
Y = if(COND == TRUE) then X else 1;
then lifting the conditional:
if (COND) { X=0; } else { X = 1; }
anyorder
if (COND) { Y=X; } else { Y = 1; }
then
if (COND) { X=0; anyorder Y=X; } else { X = Y; anyorder Y = 1; }
Each of the arms must be dag-like.
The first arm is daglike evaluating the X=0 assignment first.
The second arm is daglike evaluating Y=1 first. So, we get the answer you wanted:
if (COND) { X=0; Y=X; } else { Y = 1; X = Y; }
So conventional transformations and knowledge about anyorder-if-daglike knowledges
seems to give the right effect.
I'm not sure what you do if COND is computed as a function of the cells.
I suspect the way to do this is to generate a dependency graph of computations with
with conditionals on the dependencies. You probably have to propagate/group those conditionals over the arcs more as less as I did over the syntax.
Yes, literature exists, sorry I cannot quote any, I simply don't remember and would it just google up just as you can..
Basic algos for dependency and cycle analysis are really simple. I.e. detect symbols in the expression, build a set of expressions and dependencies in form:
inps expr outs
cell_A6, cell_B7 -> expr3 -> cell_A7
cell_A1, cell_B4 -> expr1 -> cell_A5
cell_A1, cell_A5 -> expr2 -> cell_A6
and then by comparing and iteratively expanding/replacing sets of inputs/outputs:
step0:
cell_A6, cell_B7 -> expr3 -> cell_A7
cell_A1, cell_B4 -> expr1 -> cell_A5 <--1 note that cell_A5 ~ (A1,B4)
cell_A1, cell_A5 -> expr2 -> cell_A6 <--1 apply that knowledge here
so dependency
cell_A1, cell_A5 -> expr2 -> cell_A6
morphs into
cell_A1, cell_B4 -> expr2 -> cell_A6 <--2 note that cell_A6 ~ (A1,B4) and so on
Finally, you will get either a set of full dependencies, where you can easily detect circular dependencies, like for example:
cell_A1, cell_D6, cell_F7 -> exprN -> cell_D6
or, if none found - you will be able to determine a safe, incremental order of the execution.
If the expressions contain branches or sideeffects other than the 'returned value', you can apply various transformations to reduce/expand the expressions into new ones, or into groups of new expressions that will be of the form above. For example:
B5 = { if(A5 + A3 > 0) A3-1 else A5+1 }
so
inps ... outs
A3, A5 -> theExpr -> B5
the condition can be 'lifted' and form two conditional rules:
A5 + A3 > 0 : A3 -> reducedexpr "A3-1" -> B5
A5 + A3 <= 0 : A5 -> reducedexpr "A5-1" -> B5
but now, your execution/analysis must also take care of the conditions before applying the rules. Lifting is only one of possible transformations.
However, you stil need something more than that, at least some an 'extension' for it. The hard part of your problem is that your expressions are complex, have branches, and you need to include user-random input to resolve branches to eliminate the dead branches and break dead dependencies.
Since the key is elimination of dead dependencies, you have to somehow detect dead branches. Conditions can be of any arbitrary complexity, and user-input is random, so you cannot work it out completely statically, really. After playing with transformations, you would still have to analyze the conditions and generate code accordingly. To do so, you would need to generate code for all possible combinations of the outcomes of the conditions, and all resulting branching and rule combinations, which is simply infeasible except for some trivial cases. With number of unknown the number of leafs can grow exponentially (2^N) which is a huge bloat after crossing some threshold.
Of course while analyzing conditions based on Bools, you can analyze, group and eliminate conflicting conditions like (a & b & !a)..
..but if your input values and conditions include NON-BOOL data, like integers or floating or strings, just imagine your condition is have a condition that executes some external weird statistical function and checks its result.. Ignore the 'weird' part and focus on 'external'. If you meet some expressions that use complex functions like AVG or MAX, you cannot chew through something like that statically(*). Even simple arithmetic is hard to analyze: (a+b)*(c+d) - you could derive a fact that c+d can be ignored when a+b==0, but this a really tough task to cover fully..
IIRC, doing a satisfiability analysis (SAT) for boolean expressions with basic operators is an NP-hard problem, not mentioning integers or floating points with all their math.. Calculating the result of expression is much easier than telling which values does it really depend on!!
So, since input values may be either hardcoded (cool) or user-supplied at runtime (doh!), your compler most probably will not be able to fully analyze it up front. Now link it with the fact marked as (*) and it's quite obvious that you can include some static analysis and try to eliminate some branches at 'compilation time', but still there might be some parts that must be delayed until the user provides the actual input.
So, if part of the analysis must be done at runtime, all the branch elimination is just an optional optimisation and I think you should focus on the runtime part now.
At minimal unoptimized version, your generated program could simply remember all the excel-expressions and wait for input data. Once the program is run and input is given, the program has to substitute the input in the expressions, and then try to iteratively reduce them to output values.
Writing such algo in imperative language is completely possible. Actually, you'd need to write it once, and later you'd just merge it with a different sets of rules derived from cell-formulas and done. Runtime part of the program would be the same, formulas would change.
You could then expand the 'compiler' side to try to help by i.e. preliminarily partially analyzing the dependencies and trying to reorder the rules so later they will be checked in a "better order", or by precalculating constants, or inlining some expressions and so on but as I said, it's all optimizations, not core feature.
Sadly, I cannot really tell you much anything serious about the "functional languages", but since usually their runtimes are 'very dynamic' and sometimes they even execute the code in terms of symbols and transformations, it could reduce the complexity of your 'compiler' and 'engine' part. The most valuable asset here is the dynamism. So, even a Ruby would do much better than C - but in no way it's a "compiled" language as you'd say.
For example, you could try to transform excel rules directly into functions:
def cell_A5 = expr1(cell_A1, cell_B4)
def cell_A7 = expr3(cell_A6, cell_B7)
def cell_A6 = expr2(cell_A1, cell_A5)
write it down as part of the program, then when at runtime when the user provides some values, you'd those would just redefine some of the parts of the program
cell_B7 = 11.2 // filling up undefined variable
cell_A1 = 23 // filling up undefined variable
cell_A5 = 13 // overwriting the function with a value
That's the power of dynamic platforms, nothing very 'functional' here. Dynamic platforms make it easy to fill/override bits. But then, once the user provided some bits and once the program has been "corrected on the fly", which one function would you call first?
The answer is somewhat sad.. You don't know.
If your dynamic language has some rule-engine built into it, you can try generating rules instead of functions and later rely on that engine to "fill up" everything that is possible to calculate.
But if it doesn't have rule engine, you are back to point one..
afterthought:
Hm.. sorry, I think I just wrote too much and too vaguely/chatty. If you think it's helpful, please drop me a comment. Otherwise I'll delete it after few days or a week.
20-30 years ago arithmetic operations like division were one of the most costly operations for CPUs. Saving one division in a piece of repeatedly called code was a significant performance gain. But today CPUs have fast arithmetic operations and since they heavily use instruction pipelining, conditionals can disrupt efficient execution. If I want to optimize code for speed, should I prefer arithmetic operations in favor of conditionals?
Example 1
Suppose we want to implement operations modulo n. What will perform better:
int c = a + b;
result = (c >= n) ? (c - n) : c;
or
result = (a + b) % n;
?
Example 2
Let's say we're converting 24-bit signed numbers to 32-bit. What will perform better:
int32_t x = ...;
result = (x & 0x800000) ? (x | 0xff000000) : x;
or
result = (x << 8) >> 8;
?
All the low hanging fruits are already picked and pickled by authors of compilers and guys who build hardware. If you are the kind of person who needs to ask such question, you are unlikely to be able to optimize anything by hand.
While 20 years ago it was possible for a relatively competent programmer to make some optimizations by dropping down to assembly, nowadays it is the domain of experts, specializing in the target architecture; also, optimization requires not only knowing the program, but knowing the data it will process. Everything comes down to heuristics, tests under different conditions etc.
Simple performance questions no longer have simple answers.
If you want to optimise for speed, you should just tell your compiler to optimise for speed. Modern compilers will generally outperform you in this area.
I've sometimes been surprised trying to relate assembly code back to the original source for this very reason.
Optimise your source code for readability and let the compiler do what it's best at.
I expect that in example #1, the first will perform better. The compiler will probably apply some bit-twiddling trick to avoid a branch. But you're taking advantage of knowledge that it's extremely unlikely that the compiler can deduce: namely that the sum is always in the range [0:2*n-2] so a single subtraction will suffice.
For example #2, the second way is both faster on modern CPUs and simpler to follow. A judicious comment would be appropriate in either version. (I wouldn't be surprised to see the compiler convert the first version into the second.)
I was trying to solve ITA Software's "Word Nubmers" puzzle using a brute force approach. It looks like my Haskell version is more than 10 times slower than a C#/C++ version.
The answer
Thanks to Bryan O'Sullivan's answer, I was able to "correct" my program to acceptable performance. You can read his code which is much cleaner than mine. I am going to outline the key points here.
Int is Int64 on Linux GHC x64. Unless you unsafeCoerce, you should just use Int. This saves you from having to fromIntegral. Doing Int64 on Windows 32-bit GHC is just darn slow, avoid it. (This is in fact not GHC's fault. As mentioned in my blog post below, 64 bit integers in 32-bit programs is slow in general (at least in Windows))
-fllvm or -fvia-C for performance.
Prefer quotRem to divMod, quotRem already suffices. That gave me 20% speed up.
In general, prefer Data.Vector to Data.Array as an "array"
Use the wrapper-worker pattern liberally.
The above points were enough to give me about 100% boost over my original version.
In my blog post, I have detailed a step-by-step illustrated example of how I turned the original program to match Bryan's program. There are other points mentioned there as well.
The original question
(This may sound like a "could you do the work for me" post, but I argue that such a concrete example would be very instructive since profiling Haskell performance is often seen as a myth)
(As noted in the comments, I think I have misinterpreted the problem. But who cares, we can focus on performance in a different problem)
Here's a my version of a quick recap of the problem:
A wordNumber is defined as
wordNumber 1 = "one"
wordNumber 2 = "onetwo"
wordNumber 3 = "onethree"
wordNumber 15 = "onetwothreefourfivesixseveneightnineteneleventwelvethirteenfourteenfifteen"
...
Problem: Find the 51-billion-th letter of (wordNumber Infinity); assume that letter is found at 'wordNumber x', also find 'sum [1..x]'
From an imperative perspective, a naive algorithm would be to have 2 counters, one for sum of numbers and one for sum of lengths. Keep counting the length of each wordNumber and "break" to return the result.
The imperative brute-force approach is implemented in C# here: http://ideone.com/JjCb3. It takes about 1.5 minutes to find the answer on my computer. There is also an C++ implementation that runs in 45 seconds on my computer.
Then I implemented a brute-force Haskell version: http://ideone.com/ngfFq. It cannot finish the calculation in 5 minutes on my machine. (Irony: it's has more lines than the C# version)
Here is the -p profile of the Haskell program: http://hpaste.org/49934
Question: How to make it perform comparatively to the C# version? Are there obvious mistakes I am making?
(Note: I am fully aware that brute-forcing it is not the correct solution to this problem. I am mainly interested in making the Haskell version perform comparatively to the C# version. Right now it is at least 5x slower so obviously I am missing something obvious)
(Note 2: It does not seem to be space leaking. The program runs with constant memory (about 2MB) on my computer)
(Note 3: I am compiling with `ghc -O2 WordNumber.hs)
To make the question more reader friendly, I include the "gist" of the two versions.
// C#
long sumNum = 0;
long sumLen = 0;
long target = 51000000000;
long i = 1;
for (; i < 999999999; i++)
{
// WordiLength(1) = 3 "one"
// WordiLength(101) = 13 "onehundredone"
long newLength = sumLen + WordiLength(i);
if (newLength >= target)
break;
sumNum += i;
sumLen = newLength;
}
Console.WriteLine(Wordify(i)[Convert.ToInt32(target - sumLen - 1)]);
-
-- Haskell
-- This has become totally ugly during my squeeze for
-- performance
-- Tail recursive
-- n-th number (51000000000 in our problem) -> accumulated result -> list of 'zipped' left to try
-- accumulated has the format (sum of numbers, current lengths of the whole chain, the current number)
solve :: Int64 -> (Int64, Int64, Int64) -> [(Int64, Int64)] -> (Int64, Int64, Int64)
solve !n !acc#(!sumNum, !sumLen, !curr) ((!num, !len):xs)
| sumLen' >= n = (sumNum', sumLen, num)
| otherwise = solve n (sumNum', sumLen', num) xs
where
sumNum' = sumNum + num
sumLen' = sumLen + len
-- wordLength 1 = 3 "one"
-- wordLength 101 = 13 "onehundredone"
wordLength :: Int64 -> Int64
-- wordLength = ...
solution :: Int64 -> (Int64, Char)
solution !x =
let (sumNum, sumLen, n) = solve x (0,0,1) (map (\n -> (n, wordLength n)) [1..])
in (sumNum, (wordify n) !! (fromIntegral $ x - sumLen - 1))
I've written a gist that contains both a C++ version (a copy of yours from a Haskell-cafe message, with a bug fixed) and a Haskell translation.
Notice that the two are structurally almost identical. When compiled with -fllvm, the Haskell code runs at about half the speed of the C++ code, which is pretty good.
Now let's compare my Haskell wordLength code to yours. You're passing around an extra unnecessary parameter, which is unnecessary (you apparently figured that out when writing the C++ code that I translated). Also, the large number of bang patterns suggests panic; they're almost all useless.
Your solve function is also very confused.
You're passing parameters in three different ways: a regular Int, a 3-tuple, and a list! Whoa.
This function is necessarily not very regular in its behaviour, so while you gain nothing stylistically by using a list to supply your counter, you probably force GHC to allocate memory. In other words, this both obfuscates the code and makes it slower.
By using a tuple for three parameters (for no obvious reason), you're again working hard to force GHC to allocate memory for every step through the loop, when it could avoid doing so if you passed the parameters directly.
Only your n parameter is dealt with in a sensible way, but you don't need a bang pattern on it.
The only parameter that needs a bang pattern is sumNum, because you never inspect its value until after the loop has finished. GHC's strictness analyser will deal with the others. All of your other bang patterns are unnecessary at best, misdirections at worst.
Here are two pointers I could come up with in a quick investigation:
Note that using Int64 is really slow when you are using a 32 bit build of GHC, as is the default for Haskell Platform, currently. This also turned out to be the main villain in a previous performance problem (there I give a few more details).
For reasons I don't quite understand the divMod function does not seem to get inlined. As a result, the numbers are returned on the heap. When using div and mod separately, wordLength' executes purely on the stack as it should be.
Sadly I currently have no 64-bit GHC around to test whether this is enough to solve the problem.