When, if ever, is loop unrolling still useful? - performance

I've been trying to optimize some extremely performance-critical code (a quicksort algorithm that's being called millions and millions of times inside a Monte Carlo simulation) by loop unrolling. Here's the inner loop I'm trying to speed up:
// Search for elements to swap.
while(myArray[++index1] < pivot) {}
while(pivot < myArray[--index2]) {}
I tried unrolling to something like:
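while(true) {
if(!(myArray[++index1] < pivot)) break; // stop at the first element that is not below the pivot
if(!(myArray[++index1] < pivot)) break;
// More unrolling
}
while(true) {
if(!(pivot < myArray[--index2])) break; // stop at the first element that is not above the pivot
if(!(pivot < myArray[--index2])) break;
// More unrolling
}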
This made absolutely no difference so I changed it back to the more readable form. I've had similar experiences other times I've tried loop unrolling. Given the quality of branch predictors on modern hardware, when, if ever, is loop unrolling still a useful optimization?

Loop unrolling makes sense if you can break dependency chains. This gives an out-of-order or superscalar CPU the possibility to schedule things better and thus run faster.
A simple example:
for (int i=0; i<n; i++)
{
sum += data[i];
}
Here every addition depends on the previous value of sum, so there is only a single dependency chain. If you get a stall because of a cache miss on the data array, the CPU cannot do anything but wait.
On the other hand this code:
for (int i=0; i<n-3; i+=4) // note the n-3 bound for starting i + 0..3
{
sum1 += data[i+0];
sum2 += data[i+1];
sum3 += data[i+2];
sum4 += data[i+3];
}
sum = sum1 + sum2 + sum3 + sum4;
// if n%4 != 0, handle final 0..3 elements with a rolled up loop or whatever
could run faster. If you get a cache miss or other stall in one calculation, there are still three other dependency chains that don't depend on the stall. An out-of-order CPU can execute these in parallel.
(See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an in-depth look at how register renaming helps CPUs find that parallelism, and at the details for FP dot-product on modern x86-64 CPUs with their throughput vs. latency characteristics for pipelined floating-point SIMD FMA ALUs. Hiding the latency of FP addition or FMA is a major benefit of multiple accumulators, since FP latencies are longer than integer latencies but SIMD throughput is often similar.)
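As a self-contained illustration of the multiple-accumulator idea above, here is a minimal C++ sketch. The function name sum_unrolled and the choice of int elements with a long long total are assumptions made for the example. Whether it actually beats the plain loop depends on the compiler and CPU, so measure before relying on it.

```cpp
#include <cstddef>
#include <vector>

// Sums the elements using four independent accumulators, so that
// consecutive additions do not all depend on one another.
// (Hypothetical helper name, for illustration only.)
long long sum_unrolled(const std::vector<int>& data) {
    const std::size_t n = data.size();
    long long sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
    std::size_t i = 0;

    // Main loop: four elements per iteration, one per accumulator.
    for (; i + 4 <= n; i += 4) {
        sum1 += data[i + 0];
        sum2 += data[i + 1];
        sum3 += data[i + 2];
        sum4 += data[i + 3];
    }

    long long sum = sum1 + sum2 + sum3 + sum4;

    // Tail loop: the final n % 4 elements.
    for (; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
```

Writing the bound as i + 4 <= n rather than i < n - 3 avoids unsigned wrap-around when n is smaller than 3.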

Those wouldn't make any difference because you're doing the same number of comparisons. Here's a better example. Instead of:
for (int i=0; i<200; i++) {
doStuff();
}
write:
for (int i=0; i<50; i++) {
doStuff();
doStuff();
doStuff();
doStuff();
}
Even then it almost certainly won't matter but you are now doing 50 comparisons instead of 200 (imagine the comparison is more complex).
Manual loop unrolling in general is largely an artifact of history, however. It's another of the growing list of things that a good compiler will do for you when it matters. For example, most people don't bother to write x <<= 1 or x += x instead of x *= 2. You just write x *= 2 and the compiler will optimize it to whatever is best.
Basically there's increasingly less need to second-guess your compiler.

Regardless of branch prediction on modern hardware, most compilers do loop unrolling for you anyway.
It would be worthwhile finding out how many optimizations your compiler does for you.
I found Felix von Leitner's presentation very enlightening on the subject. I recommend you read it. Summary: Modern compilers are VERY clever, so hand optimizations are almost never effective.

As far as I understand it, modern compilers already unroll loops where appropriate. An example is gcc, which, if passed the optimisation flags, will according to the manual:
Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop.
So, in practice it's likely that your compiler will do the trivial cases for you. It's therefore up to you to make sure that as many of your loops as possible have an iteration count the compiler can easily determine.

Loop unrolling, whether it's hand unrolling or compiler unrolling, can often be counter-productive, particularly with more recent x86 CPUs (Core 2, Core i7). Bottom line: benchmark your code with and without loop unrolling on whatever CPUs you plan to deploy this code on.

Trying without knowing is not the way to do it.
Does this sort take a high percentage of overall time?
All loop unrolling does is reduce the loop overhead of incrementing/decrementing, comparing for the stop condition, and jumping. If what you're doing in the loop takes more instruction cycles than the loop overhead itself, you're not going to see much improvement percentage-wise.
Here's an example of how to get maximum performance.

Loop unrolling can be helpful in specific cases. The only gain isn't skipping some tests!
It can for instance allow scalar replacement, efficient insertion of software prefetching... You would be surprised actually how useful it can be (you can easily get 10% speedup on most loops even with -O3) by aggressively unrolling.
As was said before, though, it depends a lot on the loop and the compiler, and experimentation is necessary. It's hard to make a rule (or the compiler's heuristic for unrolling would be perfect).

Loop unrolling depends entirely on your problem size, and on whether your algorithm can break the work into smaller groups. What you did above does not look like that. I am not sure if a Monte Carlo simulation can even be unrolled.
A good scenario for loop unrolling would be rotating an image, since you could rotate separate groups of work. To get this to work you would have to reduce the number of iterations.

Loop unrolling is still useful if a lot of local variables are in use in and around the loop: the registers can then be reused for them instead of one being reserved for the loop index.
In your example you use only a small number of local variables, so you are not running short of registers.
The comparison against the loop end is also a major cost if it is heavy (i.e. not a simple test instruction), especially if it depends on an external function.
Loop unrolling can also help the CPU's branch prediction, but the branches are predicted either way.

Related

Time Complexity single loop vs multiple sequential loops

Today, my colleague and I had a small argument about one particular code snippet. The code looks something like this. At least, this is what he imagined it to be.
for(int i = 0; i < n; i++) {
// Some operations here
}
for (int i = 0; i < m; i++) { // m is always small
// Some more operations here
}
He wanted me to remove the second loop, since it would cause performance issues.
However, I was sure that since I don't have any nested loops here, the complexity will always be O(n), no matter how many sequential loops I put (only 2 we had).
His argument was that if n is 1,000,000 and the loop takes 5 seconds, my code will take 10 seconds, since it has 2 for loops. I was confused after this statement.
What I remember from my DSA lessons is that we ignore such constants while calculating Big Oh.
What am I missing here?
Yes, complexity theory may help to compare two distinct methods of calculation in [?TIME][?SPACE], but do not use [PTIME] complexity as an argument for poor efficiency.
Fact #1: O( f(N) ) is relevant for comparing complexities in areas near N ~ INFTY, so the principal limits of the processes can be compared "there".
Fact #2: Given N ~ { 10k | 10M | 10G }, none of these cases meets the above cited condition.
Fact #3: If the process ( algorithm ) allows the loops to get merged without any side-effects ( on resources / blocking / etc ) into a single pass, the single-loop processing may always benefit from the reduced looping overheads.
A micro benchmark will decide, not the O( f( N ) ) for N ~ INFTY, as many additional effects gain a stronger influence: better or poorer cache-line alignment, the amount of possible L1/L2/L3-cache re-use, and smart harnessing of more / fewer CPU registers. All of these are driven by possible compiler optimisations and may further increase code-execution speeds for small N-s, beyond any expectations from above.
So, do perform several scaling-dependent microbenchmarks before resorting to arguing about the limits of O( f( N ) ). Always do.
In asymptotic notation, your code has time complexity O(n + n) = O(2n) = O(n)
Side note:
If the first loop takes n iterations and the second loop m, then the time complexity would be O(n + m).
PS: I assume that the bodies of your for loops are not heavy enough to affect the overall complexity, as you mentioned too.
You may be confusing time complexity and performance. These are two different (but related) things.
Time complexity deals with comparing the rate of growth of algorithms and ignores constant factors and messy real-world conditions. These simplifications make it a valuable theoretical framework for reasoning about algorithm scalability.
Performance is how fast code runs on an actual computer. Unlike in Big O-land, constant factors exist and often play a dominant role in determining execution time. Your coworker is reasonable to acknowledge this. It's easy to forget that O(1000000n) is the same as O(n) in Big O-land, but to an actual computer, the constant factor is a very real thing.
The bird's-eye view that Big O provides is still valuable; it can help determine if your coworker is getting lost in the details and pursuing a micro-optimization.
Furthermore, your coworker considers simple instruction counting as a step towards comparing actual performance of these loop arrangements, but this is still a major simplification. Consider cache characteristics; out-of-order execution potential; friendliness to prefetching, loop unrolling, vectorization, branch prediction, register allocation and other compiler optimizations; garbage collection/allocation overhead and heap vs stack memory accesses as just a few of the factors that can make enormous differences in execution time beyond including simple operations in the analysis.
For example, if your code is something like
for (int i = 0; i < n; i++) {
foo(arr[i]);
}
for (int i = 0; i < m; i++) {
bar(arr[i]);
}
and n is large enough that arr doesn't fit neatly in the cache (perhaps elements of arr are themselves large, heap-allocated objects), you may find that the second loop has a dramatically harmful effect due to having to bring evicted blocks back into the cache all over again. Rewriting it as
for (int i = 0, end = max(n, m); i < end; i++) {
if (i < n) {
foo(arr[i]);
}
if (i < m) {
bar(arr[i]);
}
}
may have a disproportionate efficiency increase because blocks from arr are brought into the cache once. The if statements might seem to add overhead, but branch prediction may make the impact negligible, avoiding pipeline flushes.
Conversely, if arr fits in the cache, the second loop's performance impact may be negligible (particularly if m is bounded and, better still, small).
Yet again, what is happening in foo and bar could be a critical factor. There simply isn't enough information here to tell which is likely to run faster by looking at these snippets, simple as they are, and the same applies to the snippets in the question.
In some cases, the compiler may have enough information to generate the same code for both of these examples.
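One way to fuse the two loops without the per-iteration if tests is to run a combined loop over the shared prefix and then finish whichever range is longer. The C++ sketch below is only an illustration: fused_loops is a hypothetical name, foo and bar are passed in as callables, and it assumes that both n and m are at most arr.size() and that foo and bar are independent of each other, because fusing interleaves the calls instead of running every foo before any bar.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Applies foo to arr[0..n) and bar to arr[0..m) while touching each
// shared element only once. Assumes n <= arr.size(), m <= arr.size(),
// and that the order of foo calls relative to bar calls does not matter.
template <typename Foo, typename Bar>
void fused_loops(const std::vector<int>& arr, std::size_t n, std::size_t m,
                 Foo foo, Bar bar) {
    const std::size_t common = std::min(n, m);

    // Shared prefix: both operations see the element while it is in cache.
    for (std::size_t i = 0; i < common; ++i) {
        foo(arr[i]);
        bar(arr[i]);
    }
    // At most one of the two tail loops below executes any iterations.
    for (std::size_t i = common; i < n; ++i) {
        foo(arr[i]);
    }
    for (std::size_t i = common; i < m; ++i) {
        bar(arr[i]);
    }
}
```

As with the other variants, only a benchmark on realistic data can tell whether this arrangement is faster than two separate loops.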
Ultimately, the only hope to settle debates like this is to write an accurate benchmark (not necessarily an easy task) that measures the code under its normal working conditions (not always possible) and evaluate the outcome against other constraints and metrics you may have for the app (time, budget, maintainability, customer needs, energy efficiency, etc...).
If the app meets its goals or business needs either way, it may be premature to debate performance. Profiling is a great way to determine if the code under discussion is even a problem. See Eric Lippert's Which is Faster?, which makes a strong case for (usually) not worrying about these sorts of things.
This is a benefit of Big O--if two pieces of code only differ by a small constant factor, there's a decent chance it's not worth worrying about until it proves to be worth attention through profiling.

Removing finished futures to keep their number constant

I have a program that needs to launch a large number of futures; specifically, more than the maximum value of size_t. A normal way to have many futures is to keep them in a container, but since there are too many of them, I would have to remove the finished ones. The program needs to count the number of newlines in parallel.
This is what I want to work for n>size_t:
vector<future<int>> vf;
for(size_t i=0; i<n;++i){
vf.emplace_back(async([&](){ return count_lines(part_of_an_array); }));
}
double cnt=0;
for(auto& f : vf) cnt += f.get(); // futures are move-only, so iterate by reference and call get()
One way I thought of doing it is to keep a vector<char> busy_f (vector<bool> is probably not thread safe). As count_lines starts --> busy_f[i_future]=0, and when it would finish --> busy_f[i_future]=1.
Is there a faster approach?
Creating the threads or even the futures "manually" in such cases is usually not a good idea, because it is difficult to create the "right amount" of them: remember you only have a relatively small number of actual cores/threads to execute on, and creating all the extra futures, which do not immediately map to a thread and just block and wait and take space in memory is wasteful.
I'd use some sort of higher-level parallelization primitive, like a 'parallel for' or a parallel map-reduce implementation.
I don't know what OS/compiler you're using, so I'm going to suggest to use TBB as a cross-platform solution. If you're on Microsoft stack, they have their own parallel library, which in some aspects is better than TBB.
In TBB they have a parallel_reduce template function, which looks exactly like what you need, and note what they promise:
If the range and body take O(1) space, and the range splits into
nearly equal pieces, then the space complexity is O(P log(N)), where N
is the size of the range and P is the number of threads.
However, all ranges in TBB are limited to size_t... Maybe you can write an outer loop, which "makes" "chunks" of size_t elements from the larger problem, and then for each chunk you could call a parallel_reduce and sum up their results.
double result = 0;
for(BigNumber offset = 0; offset < n; offset += BigNumber(size_t_size))
{
result += parallel_reduce( ... );
}
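If TBB is not an option, the same idea of keeping the number of live futures tied to the number of hardware threads can be sketched with the standard library alone. This is a swapped-in illustration, not the TBB approach described above: count_lines and count_lines_parallel are hypothetical helpers, the input is assumed to be a single std::string already in memory, and the fallback of 2 workers is an arbitrary choice for when the hardware thread count is unknown.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <string>
#include <thread>
#include <vector>

// Counts '\n' characters in text[begin, end).
std::size_t count_lines(const std::string& text, std::size_t begin, std::size_t end) {
    return static_cast<std::size_t>(
        std::count(text.data() + begin, text.data() + end, '\n'));
}

// Splits the text into roughly one slice per hardware thread, so the
// number of futures stays small no matter how large the input is.
std::size_t count_lines_parallel(const std::string& text) {
    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0) {
        workers = 2;  // the call may return 0 when the count is unknown
    }
    // Ceiling division; 0 only when the text is empty.
    const std::size_t chunk = (text.size() + workers - 1) / workers;

    std::vector<std::future<std::size_t>> futures;
    for (std::size_t begin = 0; begin < text.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, text.size());
        futures.push_back(std::async(std::launch::async, [&text, begin, end] {
            return count_lines(text, begin, end);
        }));
    }

    std::size_t total = 0;
    for (auto& f : futures) {
        total += f.get();  // blocks until that slice has been counted
    }
    return total;
}
```

For inputs too large to index with size_t, an outer loop over chunks, as suggested above, would call a function like this once per chunk and add up the results.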

Are boolean operations slower than mathematical operations in loops?

I really tried to find something about this kind of operations but I don't find specific information about my question... It's simple: Are boolean operations slower than typical math operations in loops?
For example, this can be seen when working with some kind of sorting. The method will make an iteration and compare X with Y... But is this slower than a summation or subtraction loop?
Example:
Boolean comparisons
for(int i=1; i<Vector.Length; i++) if(Vector[i-1] < Vector[i])
Versus summation:
Double sum = 0;
for(int i=0; i<Vector.Length; i++) sum += Vector[i];
(Talking about big length loops)
Which is faster for the processor to complete?
Do booleans require more operations in order to return "true" or "false" ?
Short version
There is no correct answer because your question is not specific enough (the two examples of code you give don't achieve the same purpose).
If your question is:
Is bool isGreater = (a > b); slower or faster than int sum = a + b;?
Then the answer would be: It's about the same unless you're very very very very very concerned about how many cycles you spend, in which case it depends on your processor and you need to read its documentation.
If your question is:
Is the first example I gave going to iterate slower or faster than the second example?
Then the answer is: It's going to depend primarily on the values the array contains, but also on the compiler, the processor, and plenty of other factors.
Longer version
On most processors a boolean operation has no reason to significantly be slower or faster than an addition: both are basic instructions, even though comparison may take two of them (subtracting, then comparing to zero). The number of cycles it takes to decode the instruction depends on the processor and might be different, but a few cycles won't make a lot of difference unless you're in a critical loop.
In the example you give though, the if condition could potentially be harmful, because of instruction pipelining. Modern processors try very hard to guess what the next bunch of instructions are going to be so they can pre-fetch them and treat them in parallel. If there is branching, the processor doesn't know if it will have to execute the then or the else part, so it guesses based on the previous times.
If the result of your condition is the same most of the time, the processor will likely guess it right and this will go well. But if the result of the condition keeps changing, then the processor won't guess correctly. When such a branch misprediction happens, the processor has to throw away the contents of the pipeline and start over, because the work it did speculatively turned out to be moot. That. does. hurt.
You can try it yourself: measure the time it takes to run your loop over a million elements when they are of same, increasing, decreasing, alternating, or random value.
Which leads me to the conclusion: processors have become some seriously complex beasts and there is no golden answer, just rules of thumb, so you need to measure and profile. You can read what other people have measured, though, to get an idea of what you should or should not do.
Have fun experimenting. :)
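A minimal C++ harness for the experiment suggested above might look like the sketch below. All names (count_ascending_pairs, time_ms, make_increasing, make_random) are hypothetical, and a single timed call is far too crude for real conclusions. Note also that an optimizing compiler may turn the if into branch-free or vectorized code, in which case the difference between input patterns can disappear entirely.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// The comparison loop from the question: counts positions where the
// previous element is smaller than the current one.
std::size_t count_ascending_pairs(const std::vector<int>& v) {
    std::size_t count = 0;
    for (std::size_t i = 1; i < v.size(); ++i) {
        if (v[i - 1] < v[i]) {
            ++count;
        }
    }
    return count;
}

// Times one call in milliseconds and stores the count in result so the
// work cannot be discarded as unused.
double time_ms(const std::vector<int>& v, std::size_t& result) {
    const auto start = std::chrono::steady_clock::now();
    result = count_ascending_pairs(v);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Strictly increasing values: the branch is taken every time.
std::vector<int> make_increasing(std::size_t n) {
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 0);
    return v;
}

// Pseudo-random values: the branch outcome is hard to predict.
std::vector<int> make_random(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, 1000000);
    std::vector<int> v(n);
    for (int& x : v) {
        x = dist(gen);
    }
    return v;
}
```

To use it, call time_ms on make_increasing(1000000) and on make_random(1000000, 42) several times each and compare the typical timings rather than a single run.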

Intel Cilk Plus code example with cilk_for keyword

cilk_for is a keyword of Intel Cilk Plus, and we can use it in the following way:
cilk_for (int i = 0; i < 8; ++i)
{
do_work(i);
}
I need some more code examples of Intel Cilk Plus with the cilk_for keyword.
That's pretty much all there is. A cilk_for loop is one of the easiest ways you can parallelize your code. Things to watch out for:
Don't try to size your loop to the number of cores. Tuning your code like this is inherently fragile. Instead, expose the full range of your data in the for loop and let the Cilk Plus runtime worry about scheduling the loop iterations.
Beware of races! If you haven't tested your application with a race detector like Cilkscreen or Intel Inspector, you've probably got races slowing you down (at best) and generating anomalous results.
cilk_for loops (examples) are implemented using a divide-and-conquer algorithm that recursively splits the range in half until the number of iterations remaining is less than the "grainsize". The runtime calculates grainsize by dividing the range by 8P, or 8 times the number of cores. This is usually a pretty good value - not too much so there's excess overhead, not too little so you're starved for parallelism. You can specify the grainsize using a pragma of the form "#pragma cilk grainsize=value", where "value" can be a constant or an expression. But our experience is that there are some specialized places where the correct grainsize is 1, and in most others you're best off using the default.
If your code is accumulating a result, consider using reducers instead of locks. Reducers provide lock-free "views" of the data that get merged automatically by the Cilk Plus runtime so that sequential ordering is preserved.
Barry Tannenbaum, Intel Cilk Plus Development

Algorithm Efficiency - Is partially unrolling a loop effective if it requires more comparisons?

How do you judge which is more expensive: putting two extra assignments in each iteration, or adding an if condition to test something else? Let me elaborate. The task is to generate and PRINT the first n terms of the Fibonacci sequence, where n>=1. My implementation in C was:
#include<stdio.h>
int main(void)
{
int x=0,y=1,output=0,l,n;
printf("Enter the number of terms you need of Fibonacci Sequence ? ");
scanf("%d",&n);
printf("\n");
for (l=1;l<=n;l++)
{
output=output+x;
x=y;
y=output;
printf("%d ",output);
}
}
But the author of the book "How to Solve It by Computer" says it is inefficient, since it uses two extra assignments for each Fibonacci number generated. He suggested:
a=0
b=1
loop:
print a,b
a=a+b
b=a+b
I agree this is more efficient, since it keeps a and b relevant all the time and one assignment generates one number. BUT it prints or supplies two Fibonacci numbers at a time. Suppose the task is to generate an odd number of terms; what would we do? The author suggested adding a test condition to check whether n is odd. Wouldn't we lose the gains of reducing the number of assignments by adding an if test in every iteration?
I consider it very bad advice from the author to even bring this up in a book targeted at beginning programmers. (Edit: In all fairness, the book was originally published in 1982, a time when programming was generally much more low-level than it is now.)
99.9% of code does not need to be optimized. Especially in code like this that mixes extremely cheap operations (arithmetic on integers) with very expensive operations (I/O), it's a complete waste of time to optimize the cheap part.
Micro-optimizations like this should only be considered in time-critical code when it is necessary to squeeze every bit of performance out of your hardware.
When you do need it, the only way to know which of several options performs best is to measure. Even then, the results may change with different processors, platforms, memory configurations...
Without commenting on your actual code: As you are learning to program, keep in mind that minor efficiency improvements that make code harder to read are not worth it. At least, they aren't until profiling of a production application reveals that it would be worth it.
Write code that can be read by humans; it will make your life much easier and keep maintenance programmers from cursing the name of you and your offspring.
My first advice echoes the others: Strive first for clean, clear code, then optimize where you know there is a performance issue. (It's hard to imagine a time-critical fibonacci sequencer...)
However, speaking as someone who does work on systems where microseconds matter, there is a simple solution to the question you ask: Do the "if odd" test only once, not inside the loop.
The general pattern for loop unrolling is
create X repetitions of the loop logic.
divide N by X.
execute the loop N/X times.
handle the N%X remaining items.
For your specific case:
a=0;
b=1;
nLoops = n/2;
while (nLoops-- > 0) {
print a,b;
a=a+b;
b=a+b;
}
if (isOdd(n)) {
print a;
}
(Note also that N/2 and isOdd are trivially implemented and extremely fast on a binary computer.)
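The pseudocode above can be turned into a small runnable C++ sketch. To keep it easy to check, this version collects the terms in a vector instead of printing them; the function name fibonacci and the use of 64-bit unsigned integers are assumptions of the example, and the values wrap around once the terms no longer fit in 64 bits.

```cpp
#include <cstdint>
#include <vector>

// Returns the first n Fibonacci numbers (0, 1, 1, 2, ...), producing two
// terms per loop iteration and testing for an odd n only once, after
// the loop.
std::vector<std::uint64_t> fibonacci(unsigned n) {
    std::vector<std::uint64_t> out;
    out.reserve(n);

    std::uint64_t a = 0;
    std::uint64_t b = 1;

    // n / 2 iterations, each emitting a pair of terms.
    for (unsigned pairs = n / 2; pairs > 0; --pairs) {
        out.push_back(a);
        out.push_back(b);
        a += b;  // a becomes the next term after b
        b += a;  // b becomes the term after that
    }

    // The n % 2 leftover term, handled outside the loop.
    if (n % 2 != 0) {
        out.push_back(a);
    }
    return out;
}
```

Because the parity check sits after the loop, an odd n costs a single extra test rather than one test per iteration.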

Resources