Loop takes less than 1 cycle despite dependency between iterations

Loop takes less than 1 cycle despite dependency between iterations - performance

I wanted to benchmark the time needed to do a single addition on my Skylake (i5-6500) CPU. C is low-level enough for me, so I wrote the following code:
// Initializing stuffs
int a = rand();
int b = rand();
const unsigned long loop_count = 1000000000;
unsigned int ignored; // used for __rdtscp
// Warming up whatever needs to be warmed up
for (int i = 0; i < 100000; i++) {
asm volatile("" : "+r" (a)); // prevents Clang from replacing the loop with a multiplication
a += b;
}
// The actual measurement
uint64_t timer = __rdtscp(&ignored);
for (unsigned long i = 0; i < loop_count; i++) {
asm volatile("" : "+r" (a)); // prevents Clang from replacing the loop with a multiplication
a += b;
}
timer = __rdtscp(&ignored) - timer;
printf("%.2f cycles/iteration\n", (double)timer / loop_count);
Compiling with Clang 7.0.0 -O3, I get the following assembly (for the loop only):
# %bb.2:
rdtscp
movq %rdx, %rdi
movl %ecx, 4(%rsp)
shlq $32, %rdi
orq %rax, %rdi
movl $1000000000, %eax # imm = 0x3B9ACA00
.p2align 4, 0x90
.LBB0_3: # =>This Inner Loop Header: Depth=1
#APP
#NO_APP
addl %esi, %ebx
addq $-1, %rax
jne .LBB0_3
# %bb.4:
rdtscp
And running this code outputs
0.94 cycles/iteration
(or a number pretty much always between 0.93 and 0.96)
I'm surprised that this loop can execute in less than 1 cycle/iteration, since there is a data dependency on a that should prevent parallel execution of a += b.
IACA also confirms that the expected throughput is 0.96 cycles. llvm-mca on the other hand predicts a total of 104 cycles to execute 100 iterations of the loop. (I can edit in the traces if needed; let me know)
I observe a similar behavior when I use SSE registers rather than general purpose ones.
I can imagine that the CPU is smart enough to notice that b is constant and since addition is commutative, it could unroll the loop and optimize the additions somehow. However, I've never heard nor read anything about this. And furthermore, if this was what's going on, I'd expect better performances (ie. fewer cycles/iteration) than 0.94 cycles/iteration.
What is going on? How is this loop able to execute in less than 1 cycle per iteration?
Some background, for completeness. Ignore the remaining of the question if you're not interested in why I'm trying to benchmark a single addition.
I know that there are tools (llvm-exegesis for instance) designed to benchmark a single instruction and that I should instead of them (or just look at agner fog's docs). However, I'm actually trying to compare three different additions: one doing a single addition in a loop (the object of my question); one doing 3 additions per loop (on SSE registers, that should maximize port usage and not be limited by data dependencies), and one where the addition is implemented as a circuit in software. While the results are mostly as I expected; the 0.94 cycles/iteration for the version with a single addition in a loop left me puzzled.

The core frequency and the TSC frequency can be different. Your loop is expected to run at 1 core cycles per iteration. If the core frequency happens to be twice as the TSC frequency for the duration of the loop execution, the throughput would be 0.5 TSC cycles per iteration, which is equivalent to 1 core cycles per iteration.
In your case, it appears that the average core frequency was slightly higher than the TSC frequency. If you don't want to take dynamic frequency scaling into account when doing experiments, it'd be easier to just fix the core frequency to be equal to the TSC frequency so that you don't have to convert the numbers. Otherwise, you'd have to measure the average the core frequency as well.
On processors that support per-core frequency scaling, you have to either fix the frequency on all the cores or pin the experiments to a single core with fixed frequency. Alternatively, instead of measuring in TSC cycles, you can use a tool like perf to easily measure the time in core cycles or seconds.
See also: How to get the CPU cycle count in x86_64 from C++?.

Related

Interpreting Absurdly-Low Measured Latency in Careful Profile (Superscalarity Effects?)

I've written some code for profiling small functions. At the high level it:
Sets the thread affinity to only one core and the thread priority to maximum.
Computes statistics from doing the following 100 times:
Estimate the latency of a function that does nothing.
Estimate the latency of the test function.
Subtract the first from the second to remove the cost of doing function-call overhead, thereby roughly getting the cost of the test function's contents.
To estimate the latency of a function, it:
Invalidates caches (this is difficult to actually do in user-mode, but I allocate and write a buffer the size of the L3 to memory, which should maybe help).
Yields the thread, so that the profile loop has as-few-as-possible context switches.
Gets the current time from a std::chrono::high_resolution_clock (which seems to compile to system_clock, but).
Runs the profile loop 100,000,000 times, calling the tested function within.
Gets the current time from a std::chrono::high_resolution_clock and subtracts to get latency.
Because at this level, individual instructions matter, at all points we have to write very careful code to ensure that the compiler doesn't elide, inline, cache, or treat-differently the functions. I have manually validated the generated assembly in various test cases, including the one which I present below.
I am getting extremely low (sub-nanosecond) latencies reported in some cases. I have tried everything I can think of to account for this, but cannot find an error.
I am looking for an explanation accounting for this behavior. Why are my profiled functions taking so little time?
Let's take the example of computing a square root for float.
The function signature is float(*)(float), and the empty function is trivial:
empty_function(float):
ret
Let's compute the square root by using the sqrtss instruction, and by the multiplication-by-reciprocal-square-root hack. I.e., the tested functions are:
sqrt_sseinstr(float):
sqrtss xmm0, xmm0
ret
sqrt_rcpsseinstr(float):
movaps xmm1, xmm0
rsqrtss xmm1, xmm0
mulss xmm0, xmm1
ret
Here's the profile loop. Again, this same code is called with the empty function and with the test functions:
double profile(float):
...
mov rbp,rdi
push rbx
mov ebx, 0x5f5e100
call 1c20 <invalidate_caches()>
call 1110 <sched_yield()>
call 1050 <std::chrono::high_resolution_clock::now()>
mov r12, rax
xchg ax, ax
15b0:
movss xmm0,DWORD PTR [rip+0xba4]
call rbp
sub rbx, 0x1
jne 15b0 <double profile(float)+0x20>
call 1050 <std::chrono::high_resolution_clock::now()>
...
The timing result for sqrt_sseinstr(float) on my Intel 990X is 3.60±0.13 nanoseconds. At this processor's rated 3.46 GHz, that works out to be 12.45±0.44 cycles. This seems pretty spot-on, given that the docs say the latency of sqrtss is around 13 cycles (it's not listed for this processor's Nehalem architecture, but it seems likely to also be around 13 cycles).
the timing result for sqrt_rcpsseinstr(float) is stranger: 0.01±0.07 nanoseconds (or 0.02±0.24 cycles). This is flatly implausible unless another effect is going on.
I thought perhaps the processor is able to hide the latency of the tested function somewhat or perfectly because the tested function uses different instruction ports (i.e. superscalarity is hiding something)? I tried to analyze this by hand, but didn't get very far because I didn't really know what I was doing.
(Note: I cleaned up some of the assembly notation for your convenience. An unedited objdump of the whole program, which includes several other variants, is here, and I am temporarily hosting the binary here (x86-64 SSE2+, Linux).)
The question, again: Why are some profiled functions producing implausibly small values? If it is a higher-order effect, explain?

The problem is with the basic approach of subtracting out the "latency"1 of an empty function, as described:
Estimate the latency of a function that does nothing.
Estimate the latency of the test function.
Subtract the first from the second to remove the cost of doing function-call overhead, thereby roughly getting the cost of the test
function's contents.
The built-in assumption is that the cost of calling a function is X, and if the latency of the work done in the function is Y, then the total cost will be something like X + Y.
This isn't generally true for any two blocks of work and especially isn't true when when one of them is "calling a function". A more sophisticated view would be that the total time would be somewhere between min(X, Y) and X + Y - but even this is often wrong depending on the details. Still, it's enough of a refinement to explain what is going on here: the cost of the function is not additive with the work being doing in the function: they happen in parallel.
The cost of an empty function call is something like 4 to 5 cycles on modern Intel, probably bottlenecked on the front-end throughput for the two taken branches, and possibly by branch and return predictor latency.
However, when you add additional work to an empty function, it generally won't compete for the same resources, and its execution instructions won't depend on the "output" of the call (i.e., the work will form a separate dependency chain), except perhaps in rare cases where the stack pointer is manipulated and the stack engine doesn't remove the dependency.
So essentially the function will take the greater of the time needed for the function call mechanics, or the actual work done by the function. This approximation isn't exact, because some types of work may actually add to the overhead of the function call (e.g., if there are enough instructions for the front end to get through before getting to the ret, the total time may increase on top of the 4-5 cycle empty function time, even if the total work is less than that) - but it's a good first order approximation.
Your first function takes enough time that the actual work dominates the execution time. The second function is much faster, however, enabling it to "hide under" the existing time taken by the call/ret mechanics.
The solution is simple: duplicate the work within the function N times, so that the work always dominates. N=10 or N=50 or something like that is fine. You have to decide whether you want to test latency, in which case the output of one copy of the work should feed into the next, or throughput, in which case it shouldn't.
On other hand, if you actually want to test the cost of the function call + work, e.g., because that's how you'll be using it in real life, it is likely the results you have gotten is already close to correct: stuff really can be "incrementally free" when it hides behind a function call.
1 I'm putting "latency" in quotes here because it isn't clear whether we should be talking about the latency of call/ret or the throughput. call and ret don't have any explicit outputs (and ret has no input), so it doesn't participate in a classic register-based dependency chain - but it might make sense to think of latency if you consider other hidden architectural components like the instruction pointer. In either case latency of throughput mostly points down to the same thing because all call and ret on a thread operate on the same state, so it doesn't make sense to have say "independent" vs "dependent" call chains.

Your benchmarking approach is fundamentally wrong, and your "careful code" is bogus.
First, emptying the cache is bogus. Not only will it quickly be repopulated with the required data, but also the examples you have posted have very little memory interaction (only cache access by call/ret and a load we'll get to.
Second, yielding before the benchmarking loop is bogus. You iterate 100000000 times, which even on a reasonably fast modern processor will take longer than typical scheduling clock interrupts on a stock operating system. If, on the other hand, you disable scheduling clock interrupts, then yielding before the benchmark doesn't do anything.
Now that the useless incidental complexity is out of the way, about the fundamental misunderstanding of modern CPUs:
You expect loop_time_gross/loop_count to be the time spent in each loop iteration. This is wrong. Modern CPUs do not execute instructions one after the other, sequentially. Modern CPUs pipeline, predict branches, execute multiple instructions in parallel, and (reasonably fast CPUs) out of order.
So after the first handful of iterations of the benchmarking loop, all branches are perfectly predicted for the next almost 100000000 iterations. This enables the CPU to speculate. Effectively, the conditional branch in the benchmarking loop goes away, as does most of the cost of the indirect call. Effectively, the CPU can unroll the loop:
movss xmm0, number
movaps xmm1, xmm0
rsqrtss xmm1, xmm0
mulss xmm0, xmm1
movss xmm0, number
movaps xmm1, xmm0
rsqrtss xmm1, xmm0
mulss xmm0, xmm1
movss xmm0, number
movaps xmm1, xmm0
rsqrtss xmm1, xmm0
mulss xmm0, xmm1
...
or, for the other loop
movss xmm0, number
sqrtss xmm0, xmm0
movss xmm0, number
sqrtss xmm0, xmm0
movss xmm0, number
sqrtss xmm0, xmm0
...
Notable, the load of number is always the same (thus quickly cached), and it overwrites the just computed value, breaking the dependency chain.
To be fair, the
call rbp
sub rbx, 0x1
jne 15b0 <double profile(float)+0x20>
are still executed, but the only resources they take from the floating-point code are decode/micro-op cache and execution ports. Notably, while the integer loop code has a dependency chain (ensuring a minimum execution time), the floating-point code does not carry a dependency on it. Furthermore, the floating-point code consists of many mutually totally independent short dependency chains.
Where you expect the CPU to execute instructions sequentially, the CPU can instead execute them in parallel.
A small look at https://agner.org/optimize/instruction_tables.pdf reveals why this parallel execution doesn't work for sqrtss on Nehalem:
instruction: SQRTSS/PS
latency: 7-18
reciprocal throughput: 7-18
i.e., the instruction cannot be pipelined and only runs on one execution port.
In contrast, for movaps, rsqrtss, mulss:
instruction: MOVAPS/D
latency: 1
reciprocal throughput: 1
instruction: RSQRTSS
latency: 3
reciprocal throughput: 2
instruction: MULSS
latency: 4
reciprocal throughput: 1
the maximum reciprocal throughput of the dependency chain is 2, so you can expect the code to finish executing one dependency chain every 2 cycles in the steady state. At this point, the execution time of the floating-point part of the benchmarking loop is less than or equal to the loop overhead and overlapped with it, so your naive approach to subtract the loop overhead leads to nonsensical results.
If you wanted to do this properly, you would ensure that separate loop iterations are dependent on each other, for example by changing your benchmarking loop to
float x = INITIAL_VALUE;
for (i = 0; i < 100000000; i++)
x = benchmarked_function(x);
Obviously you will not benchmark the same input this way, unless INITIAL_VALUE is a fixed point of benchmarked_function(). However, you can arrange for it to be a fixed point of an expanded function by computing float diff = INITIAL_VALUE - benchmarked_function(INITIAL_VALUE); and then making the loop
float x = INITIAL_VALUE;
for (i = 0; i < 100000000; i++)
x = diff + benchmarked_function(x);
with relatively minor overhead, though you should then ensure that floating-point errors do not accumulate to significantly change the value passed to benchmarked_function().

Why are loops always compiled into "do...while" style (tail jump)?

When trying to understand assembly (with compiler optimization on), I see this behavior:
A very basic loop like this
outside_loop;
while (condition) {
statements;
}
Is often compiled into (pseudocode)
; outside_loop
jmp loop_condition ; unconditional
loop_start:
loop_statements
loop_condition:
condition_check
jmp_if_true loop_start
; outside_loop
However, if the optimization is not turned on, it compiles to normally understandable code:
loop_condition:
condition_check
jmp_if_false loop_end
loop_statements
jmp loop_condition ; unconditional
loop_end:
According to my understanding, the compiled code is better resembled by this:
goto condition;
do {
statements;
condition:
}
while (condition_check);
I can't see a huge performance boost or code readability boost, so why is this often the case? Is there a name for this loop style, for example "trailing condition check"?

Related: asm loop basics: While, Do While, For loops in Assembly Language (emu8086)
Terminology: Wikipedia says "loop inversion" is the name for turning a while(x) into if(x) do{}while(x), putting the condition at the bottom of the loop where it belongs.
Fewer instructions / uops inside the loop = better. Structuring the code outside the loop to achieve this is very often a good idea.
Sometimes this requires "loop rotation" (peeling part of the first iteration so the actual loop body has the conditional branch at the bottom). So you do some of the first iteration and maybe skip the loop entirely, then fall into the loop. Sometimes you also need some code after the loop to finish the last iteration.
Sometimes loop rotation is extra useful if the last iteration is a special case, e.g. a store you need to skip. This lets you implement a while(1) {... ; if(x)break; ...; } loop as a do-while, or put one of the conditions of a multiple-condition loop at the bottom.
Some of these optimizations are related to or enable software pipelining, e.g. loading something for the next iteration. (OoO exec on x86 makes SW pipelining not very important these days but it's still useful for in-order cores like many ARM. And unrolling with multiple accumulators is still very valuable for hiding loop-carried FP latency in a reduction loop like a dot product or sum of an array.)
do{}while() is the canonical / idiomatic structure for loops in asm on all architectures, get used to it. IDK if there's a name for it; I would say such a loop has a "do while structure". If you want names, you could call the while() structure "crappy unoptimized code" or "written by a novice". :P Loop-branch at the bottom is universal, and not even worth mentioning as a Loop Optimization. You always do that.
This pattern is so widely used that on CPUs that use static branch prediction for branches without an entry in the branch-predictor caches, unknown forward conditional branches are predicted not-taken, unknown backwards branches are predicted taken (because they're probably loop branches). See Static branch prediction on newer Intel processors on Matt Godbolt's blog, and Agner Fog's branch-prediction chapter at the start of his microarch PDF.
This answer ended up using x86 examples for everything, but much of this applies across the board for all architectures. I wouldn't be surprised if other superscalar / out-of-order implementations (like some ARM, or POWER) also have limited branch-instruction throughput whether they're taken or not. But fewer instructions inside the loop is nearly universal when all you have is a conditional branch at the bottom, and no unconditional branch.
If the loop might need to run zero times, compilers more often put a test-and-branch outside the loop to skip it, instead of jumping to the loop condition at the bottom. (i.e. if the compiler can't prove the loop condition is always true on the first iteration).
BTW, this paper calls transforming while() to if(){ do{}while; } an "inversion", but loop inversion usually means inverting a nested loop. (e.g. if the source loops over a row-major multi-dimensional array in the wrong order, a clever compiler could change for(i) for(j) a[j][i]++; into for(j) for(i) a[j][i]++; if it can prove it's correct.) But I guess you can look at the if() as a zero-or-one iteration loop. Fun fact, compiler devs teaching their compilers how to invert a loop (to allow auto-vectorization) for a (very) specific case is why SPECint2006's libquantum benchmark is "broken". Most compilers can't invert loops in the general case, just ones that looks almost exactly like the one in SPECint2006...
You can help the compiler make more compact asm (fewer instructions outside the loop) by writing do{}while() loops in C when you know the caller isn't allowed to pass size=0 or whatever else guarantees a loop runs at least once.
(Actually 0 or negative for signed loop bounds. Signed vs. unsigned loop counters is a tricky optimization issue, especially if you choose a narrower type than pointers; check your compiler's asm output to make sure it isn't sign-extending a narrow loop counter inside the loop very time if you use it as an array index. But note that signed can actually be helpful, because the compiler can assume that i++ <= bound will eventually become false, because signed overflow is UB but unsigned isn't. So with unsigned, while(i++ <= bound) is infinite if bound = UINT_MAX.) I don't have a blanket recommendation for when to use signed vs. unsigned; size_t is often a good choice for looping over arrays, though, but if you want to avoid the x86-64 REX prefixes in the loop overhead (for a trivial saving in code size) but convince the compiler not to waste any instructions zero or sign-extending, it can be tricky.
I can't see a huge performance boost
Here's an example where that optimization will give speedup of 2x on Intel CPUs before Haswell, because P6 and SnB/IvB can only run branches on port 5, including not-taken conditional branches.
Required background knowledge for this static performance analysis: Agner Fog's microarch guide (read the Sandybridge section). Also read his Optimizing Assembly guide, it's excellent. (Occasionally outdated in places, though.) See also other x86 performance links in the x86 tag wiki. See also Can x86's MOV really be "free"? Why can't I reproduce this at all? for some static analysis backed up by experiments with perf counters, and some explanation of fused vs. unfused domain uops.
You could also use Intel's IACA software (Intel Architecture Code Analyzer) to do static analysis on these loops.
; sum(int []) using SSE2 PADDD (dword elements)
; edi = pointer, esi = end_pointer.
; scalar cleanup / unaligned handling / horizontal sum of XMM0 not shown.
; NASM syntax
ALIGN 16 ; not required for max performance for tiny loops on most CPUs
.looptop: ; while (edi<end_pointer) {
cmp edi, esi ; 32-bit code so this can macro-fuse on Core2
jae .done ; 1 uop, port5 only (macro-fused with cmp)
paddd xmm0, [edi] ; 1 micro-fused uop, p1/p5 + a load port
add edi, 16 ; 1 uop, p015
jmp .looptop ; 1 uop, p5 only
; Sandybridge/Ivybridge ports each uop can use
.done: ; }
This is 4 total fused-domain uops (with macro-fusion of the cmp/jae), so it can issue from the front-end into the out-of-order core at one iteration per clock. But in the unfused domain there are 4 ALU uops and Intel pre-Haswell only has 3 ALU ports.
More importantly, port5 pressure is the bottleneck: This loop can execute at only one iteration per 2 cycles because cmp/jae and jmp both need to run on port5. Other uops stealing port5 could reduce practical throughput somewhat below that.
Writing the loop idiomatically for asm, we get:
ALIGN 16
.looptop: ; do {
paddd xmm0, [edi] ; 1 micro-fused uop, p1/p5 + a load port
add edi, 16 ; 1 uop, p015
cmp edi, esi ; 1 uop, port5 only (macro-fused with cmp)
jb .looptop ; } while(edi < end_pointer);
Notice right away, independent of everything else, that this is one fewer instruction in the loop. This loop structure is at least slightly better on everything from simple non-pipelined 8086 through classic RISC (like early MIPS), especially for long-running loops (assuming they don't bottleneck on memory bandwidth).
Core2 and later should run this at one iteration per clock, twice as fast as the while(){}-structured loop, if memory isn't a bottleneck (i.e. assuming L1D hits, or at least L2 actually; this is only SSE2 16-bytes per clock).
This is only 3 fused-domain uops, so can issue at better than one per clock on anything since Core2, or just one per clock if issue groups always end with a taken branch.
But the important part is that port5 pressure is vastly reduced: only cmp/jb needs it. The other uops will probably be scheduled to port5 some of the time and steal cycles from loop-branch throughput, but this will be a few % instead of a factor of 2. See How are x86 uops scheduled, exactly?.
Most CPUs that normally have a taken-branch throughput of one per 2 cycles can still execute tiny loops at 1 per clock. There are some exceptions, though. (I forget which CPUs can't run tight loops at 1 per clock; maybe Bulldozer-family? Or maybe just some low-power CPUs like VIA Nano.) Sandybridge and Core2 can definitely run tight loops at one per clock. They even have loop buffers; Core2 has a loop buffer after instruction-length decode but before regular decode. Nehalem and later recycle uops in the queue that feeds the issue/rename stage. (Except on Skylake with microcode updates; Intel had to disable the loop buffer because of a partial-register merging bug.)
However, there is a loop-carried dependency chain on xmm0: Intel CPUs have 1-cycle latency paddd, so we're right up against that bottleneck, too. add esi, 16 is also 1 cycle latency. On Bulldozer-family, even integer vector ops have 2c latency, so that would bottleneck the loop at 2c per iteration. (AMD since K8 and Intel since SnB can run two loads per clock, so we need to unroll anyway for max throughput.) With floating point, you definitely want to unroll with multiple accumulators. Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators).
If I'd used an indexed addressing mode, like paddd xmm0, [edi + eax], I could have used sub eax, 16 / jnc at the loop condition. SUB/JNC can macro-fuse on Sandybridge-family, but the indexed load would un-laminate on SnB/IvB (but stay fused on Haswell and later, unless you use the AVX form).
; index relative to the end of the array, with an index counting up towards zero
add rdi, rsi ; edi = end_pointer
xor eax, eax
sub eax, esi ; eax = -length, so [rdi+rax] = first element
.looptop: ; do {
paddd xmm0, [rdi + rax]
add eax, 16
jl .looptop ; } while(idx+=16 < 0); // or JNC still works
(It's usually better to unroll some to hide the overhead of pointer increments instead of using indexed addressing modes, especially for stores, partly because indexed stores can't use the port7 store AGU on Haswell+.)
On Core2/Nehalem add/jl don't macro-fuse, so this is 3 fused-domain uops even in 64-bit mode, without depending on macro-fusion. Same for AMD K8/K10/Bulldozer-family/Ryzen: no fusion of the loop condition, but PADDD with a memory operand is 1 m-op / uop.
On SnB, paddd un-laminates from the load, but add/jl macro-fuse, so again 3 fused-domain uops. (But in the unfused domain, only 2 ALU uops + 1 load, so probably fewer resource conflicts reducing throughput of the loop.)
On HSW and later, this is 2 fused-domain uops because an indexed load can stay micro-fused with PADDD, and add/jl macro-fuses. (Predicted-taken branches run on port 6, so there are never resource conflicts.)
Of course, the loops can only run at best 1 iteration per clock because of taken branch throughput limits even for tiny loops. This indexing trick is potentially useful if you had something else to do inside the loop, too.
But all of these loops had no unrolling
Yes, that exaggerates the effect of loop overhead. But gcc doesn't unroll by default even at -O3 (unless it decides to fully unroll). It only unrolls with profile-guided optimization to let it know which loops are hot. (-fprofile-use). You can enable -funroll-all-loops, but I'd only recommend doing that on a per-file basis for a compilation unit you know has one of your hot loops that needs it. Or maybe even on a per-function basis with an __attribute__, if there is one for optimization options like that.
So this is highly relevant for compiler-generated code. (But clang does default to unrolling tiny loops by 4, or small loops by 2, and extremely importantly, using multiple accumulators to hide latency.)
Benefits with very low iteration count:
Consider what happens when the loop body should run once or twice: There's a lot more jumping with anything other than do{}while.
For do{}while, execution is a straight-line with no taken branches and one not-taken branch at the bottom. This is excellent.
For an if() { do{}while; } that might run the loop zero times, it's two not-taken branches. That's still very good. (Not-taken is slightly cheaper for the front-end than taken when both are correctly predicted).
For a jmp-to-the-bottom jmp; do{}while(), it's one taken unconditional branch, one taken loop condition, and then the loop branch is not-taken. This is kinda clunky but modern branch predictors are very good...
For a while(){} structure, this is one not-taken loop exit, one taken jmp at the bottom, then one taken loop-exit branch at the top.
With more iterations, each loop structure does one more taken branch. while(){} also does one more not-taken branch per iteration, so it quickly becomes obviously worse.
The latter two loop structures have more jumping around for small trip counts.
Jumping to the bottom also has a disadvantage for non-tiny loops that the bottom of the loop might be cold in L1I cache if it hasn't run for a while. Code fetch / prefetch is good at bringing code to the front-end in a straight line, but if prediction didn't predict the branch early enough, you might have a code miss for the jump to the bottom. Also, parallel decode will probably have (or could have) decoded some of the top of the loop while decoding the jmp to the bottom.
Conditionally jumping over a do{}while loop avoids all that: you only jump forwards into code that hasn't been run yet in cases where the code you're jumping over shouldn't run at all. It often predicts very well because a lot of code never actually takes 0 trips through the loop. (i.e. it could have been a do{}while, the compiler just didn't manage to prove it.)
Jumping to the bottom also means the core can't start working on the real loop body until after the front-end chases two taken branches.
There are cases with complicated loop conditions where it's easiest to write it this way, and the performance impact is small, but compilers often avoid it.
Loops with multiple exit conditions:
Consider a memchr loop, or a strchr loop: they have to stop at the end of the buffer (based on a count) or the end of an implicit-length string (0 byte). But they also have to break out of the loop if they find a match before the end.
So you'll often see a structure like
do {
if () break;
blah blah;
} while(condition);
Or just two conditions near the bottom. Ideally you can test multiple logical conditions with the same actual instruction (e.g. 5 < x && x < 25 using sub eax, 5 / cmp eax, 20 / ja .outside_range, unsigned compare trick for range checking, or combine that with an OR to check for alphabetic characters of either case in 4 instructions) but sometimes you can't and just need to use an if()break style loop-exit branch as well as a normal backwards taken branch.
Further reading:
Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” for good ways to look at compiler output (e.g. what kind of inputs give interesting output, and a primer on reading x86 asm for beginners). related: How to remove "noise" from GCC/clang assembly output?
Modern Microprocessors A 90-Minute Guide!. Details look at superscalar pipelined CPUs, mostly architecture neutral. Very good. Explains instruction-level parallelism and stuff like that.
Agner Fog's x86 optimization guide and microarch pdf. This will take you from being able to write (or understand) correct x86 asm to being able to write efficient asm (or see what the compiler should have done).
other links in the x86 tag wiki, including Intel's optimization manuals. Also several of my answers (linked in the tag wiki) have things that Agner missed in his testing on more recent microarchitectures (like un-lamination of micro-fused indexed addressing modes on SnB, and partial register stuff on Haswell+).
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators): how to use multiple accumulators to hide latency of a reduction loop (like an FP dot product).
Lecture 7: Loop Transformations (also on archive.org). Lots of cool stuff that compilers do to loops, using C syntax to describe the asm.
https://en.wikipedia.org/wiki/Loop_optimization
Sort of off topic:
Memory bandwidth is almost always important, but it's not widely known that a single core on most modern x86 CPUs can't saturate DRAM, and not even close on many-core Xeons where single-threaded bandwidth is worse than on a quad-core with dual channel memory controllers.
How much of ‘What Every Programmer Should Know About Memory’ is still valid? (my answer has commentary on what's changed and what's still relevant in Ulrich Drepper's well-known excellent article.)

Unnecessary instructions generated for _mm_movemask_epi8 intrinsic in x64 mode

The intrinsic function _mm_movemask_epi8 from SSE2 is defined by Intel with the following prototype:
int _mm_movemask_epi8 (__m128i a);
This intrinsic function directly corresponds to the pmovmskb instruction, which is generated by all compilers.
According to this reference, the pmovmskb instruction can write the resulting integer mask to either a 32-bit or a 64-bit general purpose register in x64 mode. In any case, only 16 lower bits of the result can be nonzero, i.e. the result is surely within range [0; 65535].
Speaking of the intrinsic function _mm_movemask_epi8, its returned value is of type int, which as a signed integer of 32-bit size on most platforms. Unfortunately, there is no alternative function which returns a 64-bit integer in x64 mode. As a result:
Compiler usually generates pmovmskb instruction with 32-bit destination register (e.g. eax).
Compiler cannot assume that upper 32 bits of the whole register (e.g. rax) are zero.
Compiler inserts unnecessary instruction (e.g. mov eax, eax) to zero the upper half of 64-bit register, given that the register is later used as 64-bit value (e.g. as an index of array).
An example of code and generated assembly with such a problem can be seen in this answer. Also the comments to that answer contain some related discussion. I regularly experience this problem with MSVC2013 compiler, but it seems that it is also present on GCC.
The questions are:
Why is this happening?
Is there any way to reliably avoid generation of unnecessary instructions on popular compilers? In particular, when result is used as index, i.e. in x = array[_mm_movemask_epi8(xmmValue)];
What is the approximate cost of unnecessary instructions like mov eax, eax on modern CPU architectures? Is there any chance that these instructions are completely eliminated by CPU internally and they do not actually occupy time of execution units (Agner Fog's instruction tables document mentions such a possibility).

Why is this happening?
gcc's internal instruction definitions that tells it what pmovmskb does must be failing to inform it that the upper 32-bits of rax will always be zero. My guess is that it's treated like a function call return value, where the ABI allows a function returning a 32bit int to leave garbage in the upper 32bits of rax.
GCC does know about 32-bit operations in general zero-extending for free, but this missed optimization is widespread for intrinsics, also affecting scalar intrinsics like _mm_popcnt_u32.
There's also the issue of gcc (not) knowing that the actual result has set bits only in the low 16 of its 32-bit int result (unless you used AVX2 vpmovmskb ymm). So actual sign extension is unnecessary; implicit zero extension is totally fine.
Is there any way to reliably avoid generation of unnecessary instructions on popular compilers? In particular, when result is used as index, i.e. in x = array[_mm_movemask_epi8(xmmValue)];
No, other than fixing gcc. Has anyone reported this as a compiler missed-optimization bug?
clang doesn't have this bug. I added code to Paul R's test to actually use the result as an array index, and clang is still fine.
gcc always either zero or sign extends (to a different register in this case, perhaps because it wants to "keep" the 32-bit value in the bottom of RAX, not because it's optimizing for mov-elimination.
Casting to unsigned helps with GCC6 and later; it will use the pmovmskb result directly as part of an addressing mode, but also returning it results in a mov rax, rdx.
And with older GCC, at least gets it to use mov instead of movsxd or cdqe.
What is the approximate cost of unnecessary instructions like mov eax, eax on modern CPU architectures? Is there any chance that these instructions are completely eliminated by CPU internally and they do not actually occupy time of execution units (Agner Fog's instruction tables document mentions such a possibility).
mov same,same is never eliminated on SnB-family microarchitectures or AMD zen. mov ecx, eax would be eliminated. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for details.
Even if it doesn't take an execution unit, it still takes a slot in the fused-domain part of the pipeline, and a slot in the uop-cache. And code-size. If you're close to the front-end 4 fused-domain uops per clock limit (pipeline width), then it's a problem.
It also costs an extra 1c of latency in the dep chain.
(Back-end throughput is not a problem, though. On Haswell and newer, it can run on port6 which has no vector execution units. On AMD, the integer ports are separate from the vector ports.)

gcc.godbolt.org is a great online resource for testing this kind of issue with different compilers.
clang seems to do the best with this, e.g.
#include <xmmintrin.h>
#include <cstdint>
int32_t test32(const __m128i v) {
int32_t mask = _mm_movemask_epi8(v);
return mask;
}
int64_t test64(const __m128i v) {
int64_t mask = _mm_movemask_epi8(v);
return mask;
}
generates:
test32(long long __vector(2)): # #test32(long long __vector(2))
vpmovmskb eax, xmm0
ret
test64(long long __vector(2)): # #test64(long long __vector(2))
vpmovmskb eax, xmm0
ret
Whereas gcc generates an extra cdqe instruction in the 64-bit case:
test32(long long __vector(2)):
vpmovmskb eax, xmm0
ret
test64(long long __vector(2)):
vpmovmskb eax, xmm0
cdqe
ret

How to determine CPE: Cycles Per Element

How do I determine the CPE of a program?
For example, I have this assembly code for a loop:
# inner4: data_t = float
# udata in %rbx, vdata in %rax, limit in %rcx,
# i in %rdx, sum in %xmm1
1 .L87: # loop:
2 movss (%rbx,%rdx,4), %xmm0 # Get udata[i]
3 mulss (%rax,%rdx,4), %xmm0 # Multiply by vdata[i]
4 addss %xmm0, %xmm1 # Add to sum
5 addq $1, %rdx # Increment i
6 cmpq %rcx, %rdx # Compare i:limit
7 jl .L87 # If <, goto loop
I have to find the lower bound of the CPE determined by the critical path using the data type float. I believe that the critical path would refer to the slowest possible path, and would thus be the one where the program has to execute the mulss instruction because that takes up the longest number of clock cycles.
However, there doesn't seem to be any clear way to determine the CPE. If one instruction takes two clock cycles, and another takes one, can the latter start after the first clock cycle of the former? Any help would be appreciated. Thanks

If you want to know how long it needs, you should measure it. Execute the loop some about 10^10 times, take the time it needs and multiply by the clock frequency. You get the total count of cycles, divide by 10^10 to get the number of clock cycles per loop iteration.
A theoretical prediction of the execution time will almost never be correct (and most of time to low) because the are numerous effects which determine the speed:
Pipelining (there can be easily about 20 stages in the pipeline)
Superscalar execution (up to 5 instructions in parallel, cmp and jl may be fused)
Decoding to µOps and reordering
The latencies of Caches or Memory
The throughput of the instructions (are there enough executions ports free)
The latencies of the instructions
Bank conflicts, aliasing issues and more esoteric stuff
Depending on the CPU and provided the memory accesses all hit the L1 cache, I believe the loop should need at least 3 clock cycles per iteration, because the longest dependency chain is 3 elements long. On an older CPU with slower mulss or addss instruction the time needed increases.
If you are actually interested in speeding up the code and not only some theoretical observations you should vectorize it. You can increase the performance by a factor of 4-8 with something like
.L87: # loop:
vmovdqa (%rbx,%rdx,4), %ymm0 # Get udata[i]..udata[i+7]
vmulps (%rax,%rdx,4), %ymm0, %ymm0 # Multiply by vdata[i]..vdata[i+7]
vaddps %ymm0, %ymm1, %ymm1 # Add to sum
addq $8, %rdx # Increment i
cmpq %rcx, %rdx # Compare i:limit
jl .L87 # If <, goto loop
You need to horizontal add all 8 elements after that and of course make sure alignment is 32 and loop counter divisible by 8.

If you're running an an Intel CPU, you can find some good documentation on instruction latency and throughput for various CPUs. Here's the link:
Intel® 64 and IA-32
Architectures
Optimization Reference Manual

LPC1768 / ARM Cortex-M3 microsecond delay

I'm trying to implement a microsecond delay in a bare metal arm environment( LPC1768 ) / GCC. I've seen the examples that use the SysTimer to generate an interrupt that then does some counting in C, which is used as a time base
https://bitbucket.org/jpc/lpc1768/src/dea43fb213ff/main.c
However at 12MHz system clock I don't think that will scale very well to microsecond delays. Basically the processor will spend all it's time servicing the interrupt.
Is it possible to query the value of SYSTICK_GetCurrentValue in a loop and determine how many ticks go in a microsecond and bail out of the loop once the number of ticks exceeds the calculated number?
I'd rather not use a separate hardware timer for this (but will if there is no other choice)

One way is just to use a loop to create the delay, something like shown below. You need to calibrate your factor. A more general purpose approach is to calculate the factor on startup based on some known timebase.
#define CAL_FACTOR ( 100 )
void delay (uint32_t interval)
{
uint32_t iterations = interval / CAL_FACTOR;
for(int i=0; i<iterations; ++i)
{
__asm__ volatile // gcc-ish syntax, don't know what compiler is used
(
"nop\n\t"
"nop\n\t"
:::
);
}
}

First off interrupts are not needed for this sort of thing, you can poll the timer, no need to overkill with an interrupt. yes reasons why those examples use interrupts, but that doesnt mean that is the only way to use a timer.
Guy Sirton's answer is sound, but I prefer assembler as I can control it exactly to the clock cycle (so long as there are no interrupts or other items that get in the way). A timer is usually easier though as the code is a bit more portable (change the processor clocks frequency and you have to re-tune the loop, with a timer, sometimes all you have to do is change the init code to use a different prescaler, or change the one line looking for the computed count), and allows for interrupts and such things in the system.
In this case though you are talking about 12mhz, and one microsecond, that is 12 instructions yes? Put in 12 nops. Or branch to some assembler with like 10 nops or 8, whatever it comes out to compensate for the pipeline flush on the two branches. A timer and interrupts is going to burn more than 12 instruction cycles in overhead. Even polling the timer in a loop is going to be sloppy. A counter loop would work too, you need to understand the branch costs though and tune for that:
delay_one_ms:
mov r0,#3
wait:
sub r0,#1 #cortex-m3 means thumb/thumb2 and gas complains about subs.
bne wait
nop #might need some nops to tune the loop accurately
nop
bx lr
Call this function, what 30 million times in a loop using a gpio led or uart output and a stop watch and see that the blinks are 30 seconds apart.
ldr r4,=uart_tx_register_address
mov r5,#0x55
again:
ldr r6,=24000000
str r5,[r4]
top:
bl delay_one_ms
sub r6,#1
bne top
str r5,[r4]
b again
Actually since I assumed 2 clocks per branch, the test loop has 3 clocks, the delay is assumed to be a total 12 clocks so 15 clocks per loop, 30 seconds is 30,000,000 microseconds, ideally 30million loops, but I needed 12/15ths the number of loops to compensate. This is far easier if you have an oscilloscope whose timebase is somewhat accurate, or at least as accurate as you want this delay.
I have not studied ARM's branch costs myself otherwise I would comment on that. It is likely two or three clocks. So the mov is one, the sub is one times the number of loops the bne is lets say two times the number of loops. Two for the branch to get here two for the return. 5+(3*loops)+nops=12. (3*loops)+nops=7 loops is 2 and nops is 1, yes? I think stringing a number of nops together is far easier:
delay_one_ms:
nop
nop
nop
nop
nop
nop
nop
nop
bx lr
You might have to burn a few more instructions temporarily disabling interrupts, if you use them. If you are looking for "at least" one microsecond then dont worry about it.

You can use SYSTICK inside ARM processor for that. Just program it to count each 1uS, or less if you have enough clock speed, and do-while a loop till your delay value expires.
Like this:
void WaitUs(int us) {
unsigned int cnt;
while(us-- >0) {
cnt = STK_VAL; // get systick counter, ticking each 500nS
while( (STK_VAL-cnt) < 2); // repeat till 2 ticks
}
}
Bear in mind that this is an example, you'll need to adjust it for counter roll-over among other things.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio