Do speeds of if statements in a repetitive loop affect overall performance? - performance

If I have code that will take a while to execute, printing out results every iteration will slow down the program a lot. To still receive occasional output to check on the progress of the code, I might have:
if (i % 10000 == 0) {
# print progress here
}
Does the if statement checking every time slow it down at all? Should I just not put output and just wait, will that make it noticeably faster at all?
Also, is it faster to do: (i % 10000 == 0) or (i == 10000)?
Is checking equality or modulus faster?

In general case, it won't matter at all.
A slightly longer answer: It won't matter unless the loop is run millions of times and the other statement in it is actually less demanding than an if statement (for example, a simple multiplication etc.). In that case, you might see a slight performance drop.
Regarding (i % 10000 == 0) vs. (i == 10000), the latter is obviously faster, because it only compares, whereas the former possibility does a (fairly costly) modulus and a comparison.
That said, both an if statement and a modulus count won't make any difference if your loop doesn't take up 90 % of the program's running time. Which usually is the case only at school :). You probably spent a lot more time by asking this question than you would have saved by not printing anything. For development and debugging, this is not a bad way to go.
The golden rule for this kind of decisions:
Write the most readable and explicit code you can imagine to do the
thing you want it to do. If you have a performance problem, look at
wrong data structures and algorithmic choices first. If you have done
all those and need a really quick program, profile it to see which
part takes most time. After all those, you're allowed to do this kind
of low-level guesses.

Related

Fast check if element is in MATLAB matrix

I would like to verify whether an element is present in a MATLAB matrix.
At the beginning, I implemented as follows:
if ~isempty(find(matrix(:) == element))
which is obviously slow. Thus, I changed to:
if sum(matrix(:) == element) ~= 0
but this is again slow: I am calling a lot of times the function that contains this instruction, and I lose 14 seconds each time!
Is there a way of further optimize this instruction?
Thanks.
If you just need to know if a value exists in a matrix, using the second argument of find to specify that you just want one value will be slightly faster (25-50%) and even a bit faster than using sum, at least on my machine. An example:
matrix = randi(100,1e4,1e4);
element = 50;
~isempty(find(matrix(:)==element,1))
However, in recent versions of Matlab (I'm using R2014b), nnz is finally faster for this operation, so:
matrix = randi(100,1e4,1e4);
element = 50;
nnz(matrix==element)~=0
On my machine this is about 2.8 times faster than any other approach (including using any, strangely) for the example provided. To my mind, this solution also has the benefit of being the most readable.
In my opinion, there are several things you could try to improve performance:
following your initial idea, i would go for the function any to test is any of the equality tests had a success:
if any(matrix(:) == element)
I tested this on a 1000 by 1000 matrix and it is faster than the solutions you have tested.
I do not think that the unfolding matrix(:) is penalizing since it is equivalent to a reshape and Matlab does this in a smart way where it does not actually allocate and move memory since you are not modifying the temporary object matrix(:)
If your does not change between the calls to the function or changes rarely you could simply use another vector containing all the elements of your matrix, but sorted. This way you could use a more efficient search algorithm O(log(N)) test for the presence of your element.
I personally like the ismember function for this kind of problems. It might not be the fastest but for non critical parts of the code it greatly improves readability and code maintenance (and I prefer to spend one hour coding something that will take day to run than spending one day to code something that will run in one hour (this of course depends on how often you use this program, but it is something one should never forget)
If you can have a sorted copy of the elements of your matrix, you could consider using the undocumented Matlab function ismembc but remember that inputs must be sorted non-sparse non-NaN values.
If performance really is critical you might want to write your own mex file and for this task you could even include some simple parallelization using openmp.
Hope this helps,
Adrien.

Are boolean operations slower than mathematical operations in loops?

I really tried to find something about this kind of operations but I don't find specific information about my question... It's simple: Are boolean operations slower than typical math operations in loops?
For example, this can be seen when working with some kind of sorting. The method will make an iteration and compare X with Y... But is this slower than a summatory or substraction loop?
Example:
Boolean comparisons
for(int i=1; i<Vector.Length; i++) if(Vector[i-1] < Vector[i])
Versus summation:
Double sum = 0;
for(int i=0; i<Vector.Length; i++) sum += Vector[i];
(Talking about big length loops)
Which is faster for the processor to complete?
Do booleans require more operations in order to return "true" or "false" ?
Short version
There is no correct answer because your question is not specific enough (the two examples of code you give don't achieve the same purpose).
If your question is:
Is bool isGreater = (a > b); slower or faster than int sum = a + b;?
Then the answer would be: It's about the same unless you're very very very very very concerned about how many cycles you spend, in which case it depends on your processor and you need to read its documentation.
If your question is:
Is the first example I gave going to iterate slower or faster than the second example?
Then the answer is: It's going to depend primarily on the values the array contains, but also on the compiler, the processor, and plenty of other factors.
Longer version
On most processors a boolean operation has no reason to significantly be slower or faster than an addition: both are basic instructions, even though comparison may take two of them (subtracting, then comparing to zero). The number of cycles it takes to decode the instruction depends on the processor and might be different, but a few cycles won't make a lot of difference unless you're in a critical loop.
In the example you give though, the if condition could potentially be harmful, because of instruction pipelining. Modern processors try very hard to guess what the next bunch of instructions are going to be so they can pre-fetch them and treat them in parallel. If there is branching, the processor doesn't know if it will have to execute the then or the else part, so it guesses based on the previous times.
If the result of your condition is the same most of the time, the processor will likely guess it right and this will go well. But if the result of the condition keeps changing, then the processor won't guess correctly. When such a branch misprediction happens, it means it can just throw away the content of the pipeline and do it all over again because it just realized it was moot. That. does. hurt.
You can try it yourself: measure the time it takes to run your loop over a million elements when they are of same, increasing, decreasing, alternating, or random value.
Which leads me to the conclusion: processors have become some seriously complex beasts and there is no golden answers, just rules of thumb, so you need to measure and profile. You can read what other people did measure though to get an idea of what you should or should not do.
Have fun experimenting. :)

Is it worth it to rewrite an if statement to avoid branching?

Recently I realized I have been doing too much branching without caring the negative impact on performance it had, therefore I have made up my mind to attempt to learn all about not branching. And here is a more extreme case, in attempt to make the code to have as little branch as possible.
Hence for the code
if(expression)
A = C; //A and C have to be the same type here obviously
expression can be A == B, or Q<=B, it could be anything that resolve to true or false, or i would like to think of it in term of the result being 1 or 0 here
I have come up with this non branching version
A += (expression)*(C-A); //Edited with thanks
So my question would be, is this a good solution that maximize efficiency?
If yes why and if not why?
Depends on the compiler, instruction set, optimizer, etc. When you use a boolean expression as an int value, e.g., (A == B) * C, the compiler has to do the compare, and the set some register to 0 or 1 based on the result. Some instruction sets might not have any way to do that other than branching. Generally speaking, it's better to write simple, straightforward code and let the optimizer figure it out, or find a different algorithm that branches less.
Jeez, no, don't do that!
Anyone who "penalize[s] [you] a lot for branching" would hopefully send you packing for using something that awful.
How is it awful, let me count the ways:
There's no guarantee you can multiply a quantity (e.g., C) by a boolean value (e.g., (A==B) yields true or false). Some languages will, some won't.
Anyone casually reading it is going observe a calculation, not an assignment statement.
You're replacing a comparison, and a conditional branch with two comparisons, two multiplications, a subtraction, and an addition. Seriously non-optimal.
It only works for integral numeric quantities. Try this with a wide variety of floating point numbers, or with an object, and if you're really lucky it will be rejected by the compiler/interpreter/whatever.
You should only ever consider doing this if you had analyzed the runtime properties of the program and determined that there is a frequent branch misprediction here, and that this is causing an actual performance problem. It makes the code much less clear, and its not obvious that it would be any faster in general (this is something you would also have to measure, under the circumstances you are interested in).
After doing research, I came to the conclusion that when there are bottleneck, it would be good to include timed profiler, as these kind of codes are usually not portable and are mainly used for optimization.
An exact example I had after reading the following question below
Why is it faster to process a sorted array than an unsorted array?
I tested my code on C++ using that, that my implementation was actually slower due to the extra arithmetics.
HOWEVER!
For this case below
if(expression) //branched version
A += C;
//OR
A += (expression)*(C); //non-branching version
The timing was as of such.
Branched Sorted list was approximately 2seconds.
Branched unsorted list was aproximately 10 seconds.
My implementation (whether sorted or unsorted) are both 3seconds.
This goes to show that in an unsorted area of bottleneck, when we have a trivial branching that can be simply replaced by a single multiplication.
It is probably more worthwhile to consider the implementation that I have suggested.
** Once again it is mainly for the areas that is deemed as the bottleneck **

Algorithm Efficiency - Is partially unrolling a loop effective if it requires more comparisons?

How to judge if putting two extra assignments in an iteration is expensive or setting a if condition to test another thing? here I elaborate. question is to generate and PRINT the first n terms of the Fibonacci sequence where n>=1. my implement in C was:
#include<stdio.h>
void main()
{
int x=0,y=1,output=0,l,n;
printf("Enter the number of terms you need of Fibonacci Sequence ? ");
scanf("%d",&n);
printf("\n");
for (l=1;l<=n;l++)
{
output=output+x;
x=y;
y=output;
printf("%d ",output);
}
}
but the author of the book "how to solve it by computer" says it is inefficient since it uses two extra assignments for a single fibonacci number generated. he suggested:
a=0
b=1
loop:
print a,b
a=a+b
b=a+b
I agree this is more efficient since it keeps a and b relevant all the time and one assignment generates one number. BUT it is printing or supplying two fibonacci numbers at a time. suppose question is to generate an odd number of terms, what would we do? author suggested put a test condition to check if n is an odd number. wouldn't we be losing the gains of reducing number of assignments by adding an if test in every iteration?
I consider it very bad advice from the author to even bring this up in a book targeted at beginning programmers. (Edit: In all fairness, the book was originally published in 1982, a time when programming was generally much more low-level than it is now.)
99.9% of code does not need to be optimized. Especially in code like this that mixes extremely cheap operations (arithmetic on integers) with very expensive operations (I/O), it's a complete waste of time to optimize the cheap part.
Micro-optimizations like this should only be considered in time-critical code when it is necessary to squeeze every bit of performance out of your hardware.
When you do need it, the only way to know which of several options performs best is to measure. Even then, the results may change with different processors, platforms, memory configurations...
Without commenting on your actual code: As you are learning to program, keep in mind that minor efficiency improvements that make code harder to read are not worth it. At least, they aren't until profiling of a production application reveals that it would be worth it.
Write code that can be read by humans; it will make your life much easier and keep maintenance programmers from cursing the name of you and your offspring.
My first advice echoes the others: Strive first for clean, clear code, then optimize where you know there is a performance issue. (It's hard to imagine a time-critical fibonacci sequencer...)
However, speaking as someone who does work on systems where microseconds matter, there is a simple solution to the question you ask: Do the "if odd" test only once, not inside the loop.
The general pattern for loop unrolling is
create X repetitions of the loop logic.
divide N by X.
execute the loop N/X times.
handle the N%X remaining items.
For your specific case:
a=0;
b=1;
nLoops = n/2;
while (nloops-- > 0) {
print a,b;
a=a+b;
b=a+b;
}
if (isOdd(n)) {
print a;
}
(Note also that N/2 and isOdd are trivially implemented and extremely fast on a binary computer.)

Performance difference between iterating once and iterating twice?

Consider something like...
for (int i = 0; i < test.size(); ++i) {
test[i].foo();
test[i].bar();
}
Now consider..
for (int i = 0; i < test.size(); ++i) {
test[i].foo();
}
for (int i = 0; i < test.size(); ++i) {
test[i].bar();
}
Is there a large difference in time spent between these two? I.e. what is the cost of the actual iteration? It seems like the only real operations you are repeating are an increment and a comparison (though I suppose this would become significant for a very large n). Am I missing something?
First, as noted above, if your compiler can't optimize the size() method out so it's just called once, or is nothing more than a single read (no function call overhead), then it will hurt.
There is a second effect you may want to be concerned with, though. If your container size is large enough, then the first case will perform faster. This is because, when it gets to test[i].bar(), test[i] will be cached. The second case, with split loops, will thrash the cache, since test[i] will always need to be reloaded from main memory for each function.
Worse, if your container (std::vector, I'm guessing) has so many items that it won't all fit in memory, and some of it has to live in swap on your disk, then the difference will be huge as you have to load things in from disk twice.
However, there is one final thing that you have to consider: all this only makes a difference if there is no order dependency between the function calls (really, between different objects in the container). Because, if you work it out, the first case does:
test[0].foo();
test[0].bar();
test[1].foo();
test[1].bar();
test[2].foo();
test[2].bar();
// ...
test[test.size()-1].foo();
test[test.size()-1].bar();
while the second does:
test[0].foo();
test[1].foo();
test[2].foo();
// ...
test[test.size()-1].foo();
test[0].bar();
test[1].bar();
test[2].bar();
// ...
test[test.size()-1].bar();
So if your bar() assumes that all foo()'s have run, you will break it if you change the second case to the first. Likewise, if bar() assumes that foo() has not been run on later objects, then moving from the second case to the first will break your code.
So be careful and document what you do.
There are many aspects in such comparison.
First, complexity for both options is O(n), so difference isn't very big anyway. I mean, you must not care about it if you write quite big and complex program with a large n and "heavy" operations .foo() and bar(). So, you must care about it only in case of very small simple programs (this is kind of programs for embedded devices, for example).
Second, it will depend on programming language and compiler. I'm assured that, for instance, most of C++ compilers will optimize your second option to produce same code as for the first one.
Third, if compiler haven't optimized your code, performance difference will heavily depend on the target processor. Consider loop in a term of assembly commands - it will look something like this (pseudo assembly language):
LABEL L1:
do this ;; some commands
call that
IF condition
goto L1
;; some more instructions, ELSE part
I.e. every loop passage is just IF statement. But modern processors don't like IF. This is because processors may rearrange instructions to execute them beforehand or just to avoid idles. With the IF (in fact, conditional goto or jump) instructions, processors do not know if they may rearrange operation or not.
There's also a mechanism called branch predictor. From material of Wikipedia:
branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure.
This "soften" effect of IF's, through if the predictor's guess is wrong, no optimization will be performed.
So, you can see that there's a big amount of conditions for both your options: target language and compiler, target machine, it's processor and branch predictor. This all makes very complex system, and you cannot foresee what exact result you will get. I believe, that if you don't deal with embedded systems or something like that, the best solution is just to use the form which your are more comfortable with.
For your examples you have the additional concern of how expensive .size() is, since it's compared for each time i increments in most languages.
How expensive is it? Well that depends, it's certainly all relative. If .foo() and .bar() are expensive, the cost of the actual iteration is probably minuscule in comparison. If they're pretty lightweight, then it'll be a larger percentage of your execution time. If you want to know about a particular case test it, this is the only way to be sure about your specific scenario.
Personally, I'd go with the single iteration to be on the cheap side for sure (unless you need the .foo() calls to happen before the .bar() calls).
I assume .size() will be constant. Otherwise, the first code example might not give the same as the second one.
Most compilers would probably store .size() in a variable before the loop starts, so the .size() time will be cut down.
Therefore the time of the stuff inside the two for loops will be the same, but the other part will be twice as much.
Performance tag, right.
As long as you are concentrating on the "cost" of this or that minor code segment, you are oblivious to the bigger picture (isolation); and your intention is to justify something that, at a higher level (outside your isolated context), is simply bad practice, and breaks guidelines. The question is too low level and therefore too isolated. A system or program which is set of integrated components will perform much better that a collection of isolated components.
The fact that this or that isolated component (work inside the loop) is fast or faster is irrelevant when the loop itself is repeated unnecessarily, and which would therefore take twice the time.
Given that you have one family car (CPU), why on Earth would you:
sit at home and send your wife out to do her shopping
wait until she returns
take the car, go out and do your shopping
leaving her to wait until you return
If it needs to be stated, you would spend (a) almost half of your hard-earned resources executing one trip and shopping at the same time and (b) have those resources available to have fun together when you get home.
It has nothing to do with the price of petrol at 9:00 on a Saturday, or the time it takes to grind coffee at the café, or cost of each iteration.
Yes, there is a large diff in the time and the resources used. But the cost is not merely in the overhead per iteration; it is in the overall cost of the one organised trip vs the two serial trips.
Performance is about architecture; never doing anything twice (that you can do once), which are the higher levels of organisation; integrated of the parts that make up the whole. It is not about counting pennies at the bowser or cycles per iteration; those are lower orders of organisation; which ajust a collection of fragmented parts (not a systemic whole).
Masseratis cannot get through traffic jams any faster than station wagons.

Resources