Interpreting Absurdly-Low Measured Latency in Careful Profile (Superscalarity Effects?)

I've written some code for profiling small functions. At the high level it:
Sets the thread affinity to only one core and the thread priority to maximum.
Computes statistics from doing the following 100 times:
Estimate the latency of a function that does nothing.
Estimate the latency of the test function.
Subtract the first from the second to remove the cost of doing function-call overhead, thereby roughly getting the cost of the test function's contents.
To estimate the latency of a function, it:
Invalidates caches (this is difficult to actually do in user-mode, but I allocate and write a buffer the size of the L3 to memory, which should maybe help).
Yields the thread, so that the profile loop has as-few-as-possible context switches.
Gets the current time from a std::chrono::high_resolution_clock (which seems to compile to system_clock).
Runs the profile loop 100,000,000 times, calling the tested function within.
Gets the current time from a std::chrono::high_resolution_clock and subtracts to get latency.
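For concreteness, the timing step looks roughly like this sketch (simplified; names are placeholders, and the real harness also does the affinity/priority/cache/yield setup described above):
#include <chrono>

// Simplified sketch of the timing step. `fn` is called through a pointer the
// compiler cannot see through, matching the `call rbp` in the asm below; the
// result is deliberately discarded, as in the real loop.
static double time_ns_per_call(float (*fn)(float), long iters) {
    auto t0 = std::chrono::high_resolution_clock::now();
    for (long i = 0; i < iters; ++i)
        fn(1.0f);
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}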
Because at this level individual instructions matter, at all points we have to write very careful code to ensure that the compiler doesn't elide, inline, cache, or otherwise treat the functions differently. I have manually validated the generated assembly in various test cases, including the one I present below.
I am getting extremely low (sub-nanosecond) latencies reported in some cases. I have tried everything I can think of to account for this, but cannot find an error.
I am looking for an explanation accounting for this behavior. Why are my profiled functions taking so little time?
Let's take the example of computing a square root for float.
The function signature is float(*)(float), and the empty function is trivial:
empty_function(float):
ret
Let's compute the square root by using the sqrtss instruction, and by the multiplication-by-reciprocal-square-root hack. I.e., the tested functions are:
sqrt_sseinstr(float):
sqrtss xmm0, xmm0
ret
sqrt_rcpsseinstr(float):
movaps xmm1, xmm0
rsqrtss xmm1, xmm0
mulss xmm0, xmm1
ret
Here's the profile loop. Again, this same code is called with the empty function and with the test functions:
double profile(float):
...
mov rbp,rdi
push rbx
mov ebx, 0x5f5e100
call 1c20 <invalidate_caches()>
call 1110 <sched_yield()>
call 1050 <std::chrono::high_resolution_clock::now()>
mov r12, rax
xchg ax, ax
15b0:
movss xmm0,DWORD PTR [rip+0xba4]
call rbp
sub rbx, 0x1
jne 15b0 <double profile(float)+0x20>
call 1050 <std::chrono::high_resolution_clock::now()>
...
The timing result for sqrt_sseinstr(float) on my Intel 990X is 3.60±0.13 nanoseconds. At this processor's rated 3.46 GHz, that works out to be 12.45±0.44 cycles. This seems pretty spot-on, given that the docs say the latency of sqrtss is around 13 cycles (it's not listed for this processor's Nehalem architecture, but it seems likely to also be around 13 cycles).
The timing result for sqrt_rcpsseinstr(float) is stranger: 0.01±0.07 nanoseconds (or 0.02±0.24 cycles). This is flatly implausible unless another effect is going on.
I thought perhaps the processor is able to hide the latency of the tested function somewhat or perfectly because the tested function uses different instruction ports (i.e. superscalarity is hiding something)? I tried to analyze this by hand, but didn't get very far because I didn't really know what I was doing.
(Note: I cleaned up some of the assembly notation for your convenience. An unedited objdump of the whole program, which includes several other variants, is here, and I am temporarily hosting the binary here (x86-64 SSE2+, Linux).)
The question, again: Why are some profiled functions producing implausibly small values? If it is a higher-order effect, what explains it?

The problem is with the basic approach of subtracting out the "latency"1 of an empty function, as described:
Estimate the latency of a function that does nothing.
Estimate the latency of the test function.
Subtract the first from the second to remove the cost of doing function-call overhead, thereby roughly getting the cost of the test function's contents.
The built-in assumption is that the cost of calling a function is X, and if the latency of the work done in the function is Y, then the total cost will be something like X + Y.
This isn't generally true for any two blocks of work, and especially isn't true when one of them is "calling a function". A more sophisticated view would be that the total time will be somewhere between min(X, Y) and X + Y - but even this is often wrong depending on the details. Still, it's enough of a refinement to explain what is going on here: the cost of the function call is not additive with the work being done in the function: they happen in parallel.
The cost of an empty function call is something like 4 to 5 cycles on modern Intel, probably bottlenecked on the front-end throughput for the two taken branches, and possibly by branch and return predictor latency.
However, when you add additional work to an empty function, it generally won't compete for the same resources, and its instructions won't depend on the "output" of the call (i.e., the work will form a separate dependency chain), except perhaps in rare cases where the stack pointer is manipulated and the stack engine doesn't remove the dependency.
So essentially the function will take the greater of the time needed for the function call mechanics, or the actual work done by the function. This approximation isn't exact, because some types of work may actually add to the overhead of the function call (e.g., if there are enough instructions for the front end to get through before getting to the ret, the total time may increase on top of the 4-5 cycle empty function time, even if the total work is less than that) - but it's a good first order approximation.
Your first function takes enough time that the actual work dominates the execution time. The second function is much faster, however, enabling it to "hide under" the existing time taken by the call/ret mechanics.
The solution is simple: duplicate the work within the function N times, so that the work always dominates. N=10 or N=50 or something like that is fine. You have to decide whether you want to test latency, in which case the output of one copy of the work should feed into the next, or throughput, in which case it shouldn't.
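As a sketch (hypothetical helpers with N=8, using the sqrt example from the question), the two variants could look like:
#include <cmath>

// Latency variant: each copy consumes the previous result, so the copies form
// one serial chain of roughly 8 * latency.
float sqrt_latency8(float x) {
    for (int i = 0; i < 8; ++i)
        x = std::sqrt(x);
    return x;
}

// Throughput variant: eight independent computations. The running sum adds a
// short chain of its own, but it is much cheaper than the sqrt work here, and
// it keeps the results from being optimized away.
float sqrt_throughput8(float x) {
    float s = 0.0f;
    for (int i = 0; i < 8; ++i)
        s += std::sqrt(x + i);
    return s;
}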
On the other hand, if you actually want to test the cost of the function call + work, e.g., because that's how you'll be using it in real life, it is likely that the results you have gotten are already close to correct: stuff really can be "incrementally free" when it hides behind a function call.
1 I'm putting "latency" in quotes here because it isn't clear whether we should be talking about the latency of call/ret or the throughput. call and ret don't have any explicit outputs (and ret has no input), so they don't participate in a classic register-based dependency chain - but it might make sense to think of latency if you consider other hidden architectural components like the instruction pointer. In either case, latency and throughput mostly boil down to the same thing, because all call and ret instructions on a thread operate on the same state, so it doesn't make sense to talk about, say, "independent" vs. "dependent" call chains.

Your benchmarking approach is fundamentally wrong, and your "careful code" is bogus.
First, emptying the cache is bogus. Not only will it quickly be repopulated with the required data, but the examples you have posted have very little memory interaction (only the cache accesses by call/ret and a load we'll get to).
Second, yielding before the benchmarking loop is bogus. You iterate 100000000 times, which even on a reasonably fast modern processor takes longer than the interval between scheduling clock interrupts on a stock operating system, so you will be interrupted anyway. If, on the other hand, you disable scheduling clock interrupts, then yielding before the benchmark doesn't do anything.
Now that the useless incidental complexity is out of the way, about the fundamental misunderstanding of modern CPUs:
You expect loop_time_gross/loop_count to be the time spent in each loop iteration. This is wrong. Modern CPUs do not execute instructions one after the other, sequentially. Modern CPUs pipeline, predict branches, execute multiple instructions in parallel, and (on reasonably fast CPUs) out of order.
So after the first handful of iterations of the benchmarking loop, all branches are perfectly predicted for the next almost 100000000 iterations. This enables the CPU to speculate. Effectively, the conditional branch in the benchmarking loop goes away, as does most of the cost of the indirect call. Effectively, the CPU can unroll the loop:
movss xmm0, number
movaps xmm1, xmm0
rsqrtss xmm1, xmm0
mulss xmm0, xmm1
movss xmm0, number
movaps xmm1, xmm0
rsqrtss xmm1, xmm0
mulss xmm0, xmm1
movss xmm0, number
movaps xmm1, xmm0
rsqrtss xmm1, xmm0
mulss xmm0, xmm1
...
or, for the other loop
movss xmm0, number
sqrtss xmm0, xmm0
movss xmm0, number
sqrtss xmm0, xmm0
movss xmm0, number
sqrtss xmm0, xmm0
...
Notably, the load of number is always from the same address (thus quickly cached), and it overwrites the just-computed value, breaking the dependency chain.
To be fair, the
call rbp
sub rbx, 0x1
jne 15b0 <double profile(float)+0x20>
are still executed, but the only resources they take from the floating-point code are decode/micro-op cache and execution ports. Notably, while the integer loop code has a dependency chain (ensuring a minimum execution time), the floating-point code does not carry a dependency on it. Furthermore, the floating-point code consists of many mutually totally independent short dependency chains.
Where you expect the CPU to execute instructions sequentially, the CPU can instead execute them in parallel.
A small look at https://agner.org/optimize/instruction_tables.pdf reveals why this parallel execution doesn't work for sqrtss on Nehalem:
instruction: SQRTSS/PS
latency: 7-18
reciprocal throughput: 7-18
i.e., the instruction cannot be pipelined and only runs on one execution port.
In contrast, for movaps, rsqrtss, mulss:
instruction: MOVAPS/D
latency: 1
reciprocal throughput: 1
instruction: RSQRTSS
latency: 3
reciprocal throughput: 2
instruction: MULSS
latency: 4
reciprocal throughput: 1
The maximum reciprocal throughput along the dependency chain is 2, so you can expect the code to finish executing one dependency chain every 2 cycles in the steady state. At this point, the execution time of the floating-point part of the benchmarking loop is less than or equal to the loop overhead and overlapped with it, so your naive approach of subtracting the loop overhead leads to nonsensical results.
If you wanted to do this properly, you would ensure that separate loop iterations are dependent on each other, for example by changing your benchmarking loop to
float x = INITIAL_VALUE;
for (int i = 0; i < 100000000; i++)
    x = benchmarked_function(x);
Obviously you will not benchmark the same input this way, unless INITIAL_VALUE is a fixed point of benchmarked_function(). However, you can arrange for it to be a fixed point of an expanded function by computing float diff = INITIAL_VALUE - benchmarked_function(INITIAL_VALUE); and then making the loop
float x = INITIAL_VALUE;
for (int i = 0; i < 100000000; i++)
    x = diff + benchmarked_function(x);
with relatively minor overhead, though you should then ensure that floating-point errors do not accumulate to significantly change the value passed to benchmarked_function().
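Putting it together, a minimal self-contained version of that serialized benchmark might look like this (std::sqrt stands in for benchmarked_function(); the constants are placeholders; compile with optimization, and optionally -fno-math-errno so std::sqrt compiles to a bare sqrtss):
#include <chrono>
#include <cmath>
#include <cstdio>

int main() {
    const long  iters = 100000000;
    const float INITIAL_VALUE = 2.0f;
    // Make INITIAL_VALUE an (approximate) fixed point of diff + f(x).
    const float diff = INITIAL_VALUE - std::sqrt(INITIAL_VALUE);

    float x = INITIAL_VALUE;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (long i = 0; i < iters; ++i)
        x = diff + std::sqrt(x);   // each iteration depends on the previous one
    auto t1 = std::chrono::high_resolution_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
    std::printf("%.3f ns per serialized call (x = %f)\n", ns, x);
    return 0;
}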

Related

Why is my loop much faster when it is contained in one cache line?

When I run this little assembly program on my Ryzen 9 3900X:
_start: xor rax, rax
xor rcx, rcx
loop0: add rax, 1
mov rdx, rax
and rdx, 1
add rcx, rdx
cmp rcx, 1000000000
jne loop0
It completes in 450 ms if all the instructions from loop0 up to and including the jne are contained entirely in one cache line. That is, if:
round((address of loop0)/64) == round((address of end of jne-instruction)/64)
However, if the above condition does not hold, the loop takes 900 ms instead.
I've made a repo with the code https://github.com/avl/strange_performance_repro .
Why is the inner loop much slower in some specific cases?
Edit: Removed a claim with a conclusion from a mistake in testing.
Your issue lies in the variable cost of the jne instruction.
First of all, to understand the impact of the effect, we need to analyze the full loop itself. The architecture of the Ryzen 9 3900X is Zen 2. We can retrieve information about it on the AMD website or on WikiChip.
This architecture has 4 ALUs and 3 AGUs. This roughly means it can execute up to 4 instructions like add/and/cmp per cycle.
Here is the cost of each instruction of the loop (based on the Agner Fog instruction table for Zen1):
# Reciprocal throughput
loop0: add rax, 1 # 0.25
mov rdx, rax # 0.2
and rdx, 1 # 0.25
add rcx, rdx # 0.25
cmp rcx, 1000000000 # 0.25 | Fused 0.5-2 (2 if jumping)
jne loop0 # 0.5-2 |
As you can see, the first 4 computing instructions of the loop can be executed in ~1 cycle. The last 2 instructions can be fused by your processor into a single faster one.
Your main issue is that this last jne instruction could be quite slow compared to the rest of the loop. So you are very likely to measure only the overhead of this instruction. From this point, things start to get complicated.
Engineers and researchers have worked hard over the last decades to reduce the cost of such instructions at (almost) any price. Nowadays, processors (like the Ryzen 9 3900X) use out-of-order instruction scheduling to execute the dependent instructions required by the jne instruction as soon as possible. Most processors can also predict the address of the next instruction to execute after the jne and fetch new instructions (e.g. the ones of the next loop iteration) while the other instructions of the current loop iteration are being executed.
Performing the fetch as soon as possible is important to prevent any stall in the execution pipeline of the processor (especially in your case).
From the AMD document "Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors", we can read:
2.8.3 Loop Alignment:
For the processor loop alignment is not usually a significant issue. However, for hot loops, some further knowledge of trade-offs can be helpful. Since the processor can read an aligned 64-byte fetch block every cycle, aligning the end of the loop to the last byte of a 64-byte cache line is the best thing to do, if possible.
2.8.1.1 Next Address Logic
The next-address logic determines addresses for instruction fetch. [...]. When branches are identified, the next-address logic is redirected by the branch target and branch direction prediction hardware to generate a non-sequential fetch block address. The processor facilities that are designed to predict the next instruction to be executed following a branch are detailed in the following sections.
Thus, performing a conditional branch to instructions located in another cache line introduces an additional latency overhead due to a fetch from the op cache (an instruction cache faster than the L1) that is not needed if the whole loop fits in one cache line. Indeed, if a loop crosses a cache line, 2 cache-line fetches are required, which takes no less than 2 cycles. If the whole loop fits in a cache line, only one cache-line fetch is needed, which takes only 1 cycle. As a result, since your loop iterations are very fast, paying 1 additional cycle introduces a significant slowdown. But how much?
You say that the program takes about 450 ms.
Since the Ryzen 9 3900X turbo frequency is 4.6 GHz and your loop is executed 2e9 times (rcx is incremented only every other iteration, so reaching 1e9 takes about 2e9 iterations), the number of cycles per loop iteration is about 1.035. Note that this is better than what we could expect beforehand (I guess this processor is able to rename registers, eliminate the mov instruction, and execute the jne instruction in parallel in only 1 cycle while the other instructions of the loop are perfectly pipelined; which is astounding). This also shows that paying an additional fetch overhead of 1 cycle doubles the number of cycles needed to execute each loop iteration, and so the overall execution time.
If you do not want to pay this overhead, consider unrolling your loop to significantly reduce the number of conditional branches and non-sequential fetches.
This problem can occur on other architectures, such as Intel Skylake. Indeed, the same loop on an i5-9600KF takes 0.70s with loop alignment and 0.90s without (also due to the additional 1-cycle fetch latency). With 8x unrolling, the result is 0.53s (whatever the alignment).

Assembly - How to score a CPU instruction by latency and throughput

I'm looking for a formula / way to measure how fast an instruction is, or more specifically, to give each instruction a "score" in CPU cycles.
Let's take the following assembly program as an example:
nop
mov eax,dword ptr [rbp+34h]
inc eax
mov dword ptr [rbp+34h],eax
and the following Intel Skylake information:
mov r,m : Throughput=0.5 Latency=2
mov m,r : Throughput=1 Latency=2
nop : Throughput=0.25 Latency=non
inc : Throughput=0.25 Latency=1
I know that the order of the instructions in the program matters here, but I'm looking to create something general that doesn't need to be accurate to the single cycle. Does anyone have an idea how I can do that?
There is no formula you can apply; you have to measure.
The same instruction on different versions of the same uarch family can have different performance. e.g. mulps:
Sandybridge 1c / 5c throughput/latency.
HSW 0.5 / 5. BDW 0.5 / 3 (faster multiply path in the FMA unit? FMA is still 5c).
SKL 0.5 / 4 (lower latency FMA, too). SKL runs addps on the FMA unit as well, dropping the dedicated FP multiply unit so add latency is higher, but throughput is higher.
There's no way you could predict any of this without measuring, or knowing some microarchitectural details. We expect FP math ops won't be single-cycle latency, because they're much more complicated than integer ops. (So if they were single cycle, the clock speed is set too low for integer ops.)
You measure by repeating the instruction many times in an unrolled loop. Or fully unrolled with no looping, but then you defeat the uop-cache and can get front-end bottlenecks. (e.g. for decoding 10-byte mov r64, imm64)
https://uops.info/ has already automated this testing for every form of every (unprivileged) instruction, and you can even click on any table entry to see what test loops they used. e.g. Skylake xchg r32, eax latency testing (https://uops.info/html-lat/SKL/XCHG_R32_EAX-Measurements.html) from each input operand to each output. (2 cycle latency from EAX -> R8D, but 1 cycle latency from R8D -> EAX.) So we can guess that the 3 uops include copying EAX to an internal temporary, but moving directly from the other operand to EAX.
https://uops.info/ is the current best source of test data; when it and Agner's tables disagree, my own measurements and/or other sources have always confirmed uops.info's testing was accurate. And they don't try to make up a latency number for 2 halves of a round-trip like movd xmm0,eax and back, they show you the range of possible latencies assuming the rest of the chain was the minimum plausible.
Agner Fog creates his instruction tables (which you appear to be reading) by timing large non-looping blocks of code that repeat an instruction. https://agner.org/optimize/. The intro section of his instruction-tables explains briefly how he measures, and his microarch guide explains more details of how different x86 microarchitectures work internally. Unfortunately there are occasional typos or copy/paste errors in his hand-edited tables.
http://instlatx64.atw.hu/ also has results of experimental measurements. I think they use a similar technique of a large block of the same instruction repeated, maybe small enough to fit in the uop cache. But they don't use perf counters to measure what execution port each instruction needs, so their throughput numbers don't help you figure out which instructions compete with which other instructions.
These latter two sources have been around for longer than uops.info, and cover some older CPUs, especially older AMD.
To measure latency yourself, you make the output of each instruction an input for the next.
mov ecx, 10000000
inc_latency:
inc eax
inc eax
inc eax
inc eax
inc eax
inc eax
sub ecx,1 ; avoid partial-flag false dep for P4
jnz inc_latency ; dec or sub/jnz macro-fuses into 1 uop on Intel SnB-family
This dependency chain of 6 inc instructions will bottleneck the loop at 1 iteration per 6 * inc_latency cycles. Using perf counters for core clock cycles (not RDTSC cycles), you can easily measure the time for all the iterations to 1 part in 10k, and with more care probably even more precisely than that. The repeat count of 10000000 hides start/stop overhead of whatever timing you use.
I normally put a loop like this in a Linux static executable that just makes a sys_exit(0) system call directly (with a syscall instruction), and time the whole executable with perf stat ./testloop to get time and a cycle count. (See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example).
Another example is Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths, with the added complication of using lfence to drain the out-of-order execution window for two dep chains.
To measure throughput, you use separate registers, and/or include an xor-zeroing occasionally to break dep chains and let out-of-order exec overlap things. Don't forget to also use perf counters to see which ports it can run on, so you can tell which other instructions it will compete with. (e.g. FMA (p01) and shuffles (p5) don't compete at all for back-end resources on Haswell/Skylake, only for front-end throughput.) Don't forget to measure front-end uop counts, too: some instructions decode to multiply uops.
How many different dependency chains do we need to avoid a bottleneck? Well we know the latency (measure it first), and we know the max possible throughput (number of execution ports, or front-end throughput.)
For example, if FP multiply had 0.25c throughput (4 per clock), we could keep 20 in flight at once on Haswell (5c latency). That's more than we have registers, so we could just use all 16 and discover that in fact the throughput is only 0.5c. But if it had turned out that 16 registers was a bottleneck, we could add xorps xmm0,xmm0 occasionally and let out-of-order execution overlap some blocks.
More is normally better; having just barely enough to hide latency can slow down with imperfect scheduling. If we wanted to go nuts measuring inc, we'd do this:
mov ecx, 10000000
inc_latency:
%rep 10 ;; source-level repeat of a block, no runtime branching
inc eax
inc ebx
; not ecx, we're using it as a loop counter
inc edx
inc esi
inc edi
inc ebp
inc r8d
inc r9d
inc r10d
inc r11d
inc r12d
inc r13d
inc r14d
inc r15d
%endrep
sub ecx,1 ; break partial-flag false dep for P4
jnz inc_latency ; dec/jnz macro-fuses into 1 uop on Intel SnB-family
If we were worried about partial-flag false dependencies or flag-merging effects, we might experiment with mixing in an xor eax,eax somewhere to let OoO exec overlap more than just when sub wrote all flags. (See INC instruction vs ADD 1: Does it matter?)
There's a similar problem for measuring throughput and latency of shl r32, cl on Sandybridge-family: the flag dependency chain isn't normally relevant for a computation, but putting shl back-to-back creates a dependency through FLAGS as well as through the register. (Or for throughput, there isn't even a register dep).
I posted about this on Agner Fog's blog: https://www.agner.org/optimize/blog/read.php?i=415#860. I mixed shl edx,cl in with four add edx,1 instructions, to see what incremental slowdown adding one more instruction had, where the FLAGS dependency was a non-issue. On SKL, it only slows down by an extra 1.23 cycles on average, so the true latency cost of that shl was only ~1.23 cycles, not 2. (It's not a whole number or just 1 because of resource conflicts to run the flag-merging uops of the shl, I guess. BMI2 shlx edx, edx, ecx would be exactly 1c because it's only a single uop.)
Related: for static performance analysis of whole blocks of code (containing different instructions), see What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?. (It's using the word "latency" for the end-to-end latency of a whole computation, but actually asking about things small enough for OoO exec to overlap different parts, so instruction latency and throughput both matter.)
The Latency=2 numbers for load/store appear to be from Agner Fog's instruction tables (https://agner.org/optimize/). They unfortunately aren't accurate for a chain of mov rax, [rax]. You'll find that's 4c latency if you measure it by putting that in a loop.
Agner splits up load/store latency into something that makes the total store/reload latency come out correct, but for some reason he doesn't make the load part equal to the L1d load-use latency when it comes from cache instead of the store buffer. (But also note that if the load feeds an ALU instruction instead of another load, the latency is 5c. So the simple addressing-mode fast-path only helps for pure pointer-chasing.)
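For reference, a minimal pointer-chasing sketch (hypothetical names) that measures this: each load's address comes from the previous load, so the loop runs at one load per load-use latency as long as the chain stays in L1d.
#include <cstddef>
#include <cstdint>

struct Node { Node* next; };

// Compiles to a dependent `mov rax, [rax]` chain; cycles per iteration is the
// L1d load-use latency for this simple addressing mode (4c on SnB-family).
// Pointing a single node at itself keeps the chain resident in L1d.
std::uintptr_t chase(Node* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        p = p->next;
    return reinterpret_cast<std::uintptr_t>(p);   // use the result so the chain isn't dead code
}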

What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

I want to be able to predict, by hand, exactly how long arbitrary arithmetical (i.e. no branching or memory, though that would be nice too) x86-64 assembly code will take given a particular architecture, taking into account instruction reordering, superscalarity, latencies, CPIs, etc.
What rules must be followed to achieve this?
I think I've got some preliminary rules figured out, but I haven't been able to find any references on breaking down any example code to this level of detail, so I've had to take some guesses. (For example, the Intel optimization manual barely even mentions instruction reordering.)
At minimum, I'm looking for (1) confirmation that each rule is correct or else a correct statement of each rule, and (2) a list of any rules that I may have forgotten.
As many instructions as possible are issued each cycle, starting in-order from the current cycle and potentially as far ahead as the reorder buffer size.
An instruction can be issued on a given cycle if:
No instructions that affect its operands are still being executed. And:
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering). And:
There is a functional unit available for that instruction on that cycle. Every (?) functional unit is pipelined, meaning it can accept 1 new instruction per cycle, and the number of total functional units is 1/CPI, for the CPI of a given function class (nebulous here: presumably e.g. addps and subps use the same functional unit? How do I determine this?). And:
Fewer than the superscalar width (typically 4) number of instructions have already been issued this cycle.
If no instructions can be issued, the processor simply doesn't issue any—a condition called a "stall".
As an example, consider the following example code (which computes a cross-product):
shufps xmm3, xmm2, 210
shufps xmm0, xmm1, 201
shufps xmm2, xmm2, 201
mulps xmm0, xmm3
shufps xmm1, xmm1, 210
mulps xmm1, xmm2
subps xmm0, xmm1
My attempt to predict the latency for Haswell looks something like this:
; `mulps` Haswell latency=5, CPI=0.5
; `shufps` Haswell latency=1, CPI=1
; `subps` Haswell latency=3, CPI=1
shufps xmm3, xmm2, 210 ; cycle 1
shufps xmm0, xmm1, 201 ; cycle 2
shufps xmm2, xmm2, 201 ; cycle 3
mulps xmm0, xmm3 ; (superscalar execution)
shufps xmm1, xmm1, 210 ; cycle 4
mulps xmm1, xmm2 ; cycle 5
; cycle 6 (stall `xmm0` and `xmm1`)
; cycle 7 (stall `xmm1`)
; cycle 8 (stall `xmm1`)
subps xmm0, xmm1 ; cycle 9
; cycle 10 (stall `xmm0`)
TL:DR: look for dependency chains, especially loop-carried ones. For a long-running loop, see whether latency, front-end throughput, or back-end port contention/throughput is the worst bottleneck. That's how many cycles your loop probably takes per iteration, on average, if there are no cache misses or branch mispredicts.
Latency bounds and throughput bounds for processors for operations that must occur in sequence is a good example of analyzing loop-carried dependency chains in a specific loop with two dep chains, one pulling values from the other.
Related: How many CPU cycles are needed for each assembly instruction? is a good introduction to throughput vs. latency on a per-instruction basis, and what that means for sequences of multiple instructions. See also Assembly - How to score a CPU instruction by latency and throughput for how to measure a single instruction.
This is called static (performance) analysis. Wikipedia says (https://en.wikipedia.org/wiki/List_of_performance_analysis_tools) that AMD's AMD CodeXL has a "static kernel analyzer" (i.e. for computational kernels, aka loops). I've never tried it.
Intel also has a free tool for analyzing how loops will go through the pipeline in Sandybridge-family CPUs: What is IACA and how do I use it?
IACA is not bad, but has bugs (e.g. wrong data for shld on Sandybridge, and last I checked, it doesn't know that Haswell/Skylake can keep indexed addressing modes micro-fused for some instructions. But maybe that will change now that Intel's added details on that to their optimization manual.) IACA is also unhelpful for counting front-end uops to see how close to a bottleneck you are (it likes to only give you unfused-domain uop counts).
Static analysis is often pretty good, but definitely check by profiling with performance counters. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example of profiling a simple loop to investigate a microarchitectural feature.
Essential reading:
Agner Fog's microarch guide (chapter 2: Out of order exec) explains some of the basics of dependency chains and out-of-order execution. His "Optimizing Assembly" guide has more good introductory and advanced performance stuff.
The later chapters of his microarch guide cover the details of the pipelines in CPUs like Nehalem, Sandybridge, Haswell, K8/K10, Bulldozer, and Ryzen. (And Atom / Silvermont / Jaguar).
Agner Fog's instruction tables (spreadsheet or PDF) are also normally the best source for instruction latency / throughput / execution-port breakdowns.
David Kanter's microarch analysis docs are very good, with diagrams. e.g. https://www.realworldtech.com/sandy-bridge/, https://www.realworldtech.com/haswell-cpu/, and https://www.realworldtech.com/bulldozer/.
See also other performance links in the x86 tag wiki.
I also took a stab at explaining how a CPU core finds and exploits instruction-level parallelism in this answer, but I think you've already grasped those basics as far as it's relevant for tuning software. I did mention how SMT (Hyperthreading) works as a way to expose more ILP to a single CPU core, though.
In Intel terminology:
"issue" means to send a uop into the out-of-order part of the core; along with register-renaming, this is the last step in the front-end. The issue/rename stage is often the narrowest point in the pipeline, e.g. 4-wide on Intel since Core2. (With later uarches like Haswell and especially Skylake often actually coming very close to that in some real code, thanks to SKL's improved decoders and uop-cache bandwidth, as well as back-end and cache bandwidth improvements.) This is fused-domain uops: micro-fusion lets you send 2 uops through the front-end and only take up one ROB entry. (I was able to construct a loop on Skylake that sustains 7 unfused-domain uops per clock). See also http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ re: out-of-order window size.
"dispatch" means the scheduler sends a uop to an execution port. This happens as soon as all the inputs are ready, and the relevant execution port is available. How are x86 uops scheduled, exactly?. Scheduling happens in the "unfused" domain; micro-fused uops are tracked separately in the OoO scheduler (aka Reservation Station, RS).
A lot of other computer-architecture literature uses these terms in the opposite sense, but this is the terminology you will find in Intel's optimization manual, and the names of hardware performance counters like uops_issued.any or uops_dispatched_port.port_5.
exactly how long arbitrary arithmetical x86-64 assembly code will take
It depends on the surrounding code as well, because of OoO exec
Your final subps result doesn't have to be ready before the CPU starts running later instructions. Latency only matters for later instructions that need that value as an input, not for integer looping and whatnot.
Sometimes throughput is what matters, and out-of-order exec can hide the latency of multiple independent short dependency chains. (e.g. if you're doing the same thing to every element of a big array of multiple vectors, multiple cross products can be in flight at once.) You'll end up with multiple iterations in flight at once, even though in program order you finish all of one iteration before doing any of the next. (Software pipelining can help for high-latency loop bodies if OoO exec has a hard time doing all the reordering in HW.)
There are three major dimensions to analyze for a short block
You can approximately characterize a short block of non-branching code in terms of these three factors. Usually only one of them is the bottleneck for a given use-case. Often you're looking at a block that you will use as part of a loop, not as the whole loop body, but OoO exec normally works well enough that you can just add up these numbers for a couple different blocks, if they're not so long that OoO window size prevents finding all the ILP.
latency from each input to the output(s). Look at which instructions are on the dependency chain from each input to each output. e.g. one choice might need one input to be ready sooner.
total uop count (for front-end throughput bottlenecks), fused-domain on Intel CPUs. e.g. Core2 and later can in theory issue/rename 4 fused-domain uops per clock into the out-of-order scheduler/ROB. Sandybridge-family can often achieve that in practice with the uop cache and loop buffer, especially Skylake with its improved decoders and uop-cache throughput.
uop count for each back-end execution port (unfused domain). e.g. shuffle-heavy code will often bottleneck on port 5 on Intel CPUs. Intel usually only publishes throughput numbers, not port breakdowns, which is why you have to look at Agner Fog's tables (or IACA output) to do anything meaningful if you're not just repeating the same instruction a zillion times.
Generally you can assume best-case scheduling/distribution, with uops that can run on other ports not stealing the busy ports very often, but it does happen sometimes. (How are x86 uops scheduled, exactly?)
Looking at CPI is not sufficient; two CPI=1 instructions might or might not compete for the same execution port. If they don't, they can execute in parallel. e.g. Haswell can only run psadbw on port 0 (5c latency, 1c throughput, i.e. CPI=1) but it's a single uop so a mix of 1 psadbw + 3 add instructions could sustain 4 instructions per clock. There are vector ALUs on 3 different ports in Intel CPUs, with some operations replicated on all 3 (e.g. booleans) and some only on one port (e.g. shifts before Skylake).
Sometimes you can come up with a couple different strategies, one maybe lower latency but costing more uops. A classic example is multiplying by constants like imul eax, ecx, 10 (1 uop, 3c latency on Intel) vs. lea eax, [rcx + rcx*4] / add eax,eax (2 uops, 2c latency). Modern compilers tend to choose 2 LEA vs. 1 IMUL, although clang up to 3.7 favoured IMUL unless it could get the job done with only a single other instruction.
See What is the efficient way to count set bits at a position or lower? for an example of static analysis for a few different ways to implement a function.
See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (which ended up being way more detailed than you'd guess from the question title) for another summary of static analysis, and some neat stuff about unrolling with multiple accumulators for a reduction.
Every (?) functional unit is pipelined
The divider is pipelined in recent CPUs, but not fully pipelined. (FP divide is single-uop, though, so if you do one divps mixed in with dozens of mulps / addps, it can have negligible throughput impact if latency doesn't matter: Floating point division vs floating point multiplication. rcpps + a Newton iteration is worse throughput and about the same latency.)
Everything else is fully pipelined on mainstream Intel CPUs; no multi-cycle (reciprocal) throughput for a single uop. (Variable-count integer shifts like shl eax, cl have lower-than-expected throughput for their 3 uops, because they create a dependency through the flag-merging uops. But if you break that dependency through FLAGS with an add or something, you can get better throughput and latency.)
On AMD before Ryzen, the integer multiplier is also only partially pipelined. e.g. Bulldozer's imul ecx, edx is only 1 uop, but with 4c latency, 2c throughput.
Xeon Phi (KNL) also has some not-fully-pipelined shuffle instructions, but it tends to bottleneck on the front-end (instruction decode), not the back-end, and does have a small buffer + OoO exec capability to hide back-end bubbles.
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering)
No.
Maybe you read that for Silvermont, which doesn't do OoO exec for FP/SIMD, only integer (with a small ~20 uop window). Maybe some ARM chips are like that, too, with simpler schedulers for NEON? I don't know much about ARM uarch details.
The mainstream big-core microarchitectures like P6 / SnB-family, and all AMD OoO chips, do OoO exec for SIMD and FP instructions the same as for integer. AMD CPUs use a separate scheduler, but Intel uses a unified scheduler so its full size can be applied to finding ILP in integer or FP code, whichever is currently running.
Even the silvermont-based Knight's Landing (in Xeon Phi) does OoO exec for SIMD.
x86 is generally not very sensitive to instruction ordering, but uop scheduling doesn't do critical-path analysis. So it could sometimes help to put instructions on the critical path first, so they aren't stuck waiting with their inputs ready while other instructions run on that port, leading to a bigger stall later when we get to instructions that need the result of the critical path. (i.e. that's why it is the critical path.)
My attempt to predict the latency for Haswell looks something like this:
Yup, that looks right. shufps runs on port 5, addps runs on p1, mulps runs on p0 or p1. Skylake drops the dedicated FP-add unit and runs SIMD FP add/mul/FMA on the FMA units on p0/p1, all with 4c latency (up/down from 3/5/5 in Haswell, or 3/3/5 in Broadwell).
This is a good example of why keeping a whole XYZ direction vector in a SIMD vector usually sucks. Keeping an array of X, an array of Y, and an array of Z, would let you do 4 cross products in parallel without any shuffles.
The SSE tag wiki has a link to these slides: SIMD at Insomniac Games (GDC 2015) which covers that array-of-structs vs. struct-of-arrays issues for 3D vectors, and why it's often a mistake to always try to SIMD a single operation instead of using SIMD to do multiple operations in parallel.
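As a sketch of that layout (hypothetical function; plain scalar C++ that an auto-vectorizer can map straight onto packed multiplies and subtracts with no shuffles):
// SoA cross product: with separate X/Y/Z arrays, each output element is just
// multiplies and subtracts, so the compiler can do 4 (SSE) or 8 (AVX) cross
// products per vector iteration. (__restrict on the pointers helps it vectorize.)
void cross_soa(const float* ax, const float* ay, const float* az,
               const float* bx, const float* by, const float* bz,
               float* cx, float* cy, float* cz, int n) {
    for (int i = 0; i < n; ++i) {
        cx[i] = ay[i] * bz[i] - az[i] * by[i];
        cy[i] = az[i] * bx[i] - ax[i] * bz[i];
        cz[i] = ax[i] * by[i] - ay[i] * bx[i];
    }
}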

Why are loops always compiled into "do...while" style (tail jump)?

When trying to understand assembly (with compiler optimization on), I see this behavior:
A very basic loop like this
outside_loop;
while (condition) {
statements;
}
Is often compiled into (pseudocode)
; outside_loop
jmp loop_condition ; unconditional
loop_start:
loop_statements
loop_condition:
condition_check
jmp_if_true loop_start
; outside_loop
However, if the optimization is not turned on, it compiles to normally understandable code:
loop_condition:
condition_check
jmp_if_false loop_end
loop_statements
jmp loop_condition ; unconditional
loop_end:
According to my understanding, the compiled code is better resembled by this:
goto condition;
do {
statements;
condition:
}
while (condition_check);
I can't see a huge performance boost or code readability boost, so why is this often the case? Is there a name for this loop style, for example "trailing condition check"?
Related: asm loop basics: While, Do While, For loops in Assembly Language (emu8086)
Terminology: Wikipedia says "loop inversion" is the name for turning a while(x) into if(x) do{}while(x), putting the condition at the bottom of the loop where it belongs.
Fewer instructions / uops inside the loop = better. Structuring the code outside the loop to achieve this is very often a good idea.
Sometimes this requires "loop rotation" (peeling part of the first iteration so the actual loop body has the conditional branch at the bottom). So you do some of the first iteration and maybe skip the loop entirely, then fall into the loop. Sometimes you also need some code after the loop to finish the last iteration.
Sometimes loop rotation is extra useful if the last iteration is a special case, e.g. a store you need to skip. This lets you implement a while(1) {... ; if(x)break; ...; } loop as a do-while, or put one of the conditions of a multiple-condition loop at the bottom.
Some of these optimizations are related to or enable software pipelining, e.g. loading something for the next iteration. (OoO exec on x86 makes SW pipelining not very important these days but it's still useful for in-order cores like many ARM. And unrolling with multiple accumulators is still very valuable for hiding loop-carried FP latency in a reduction loop like a dot product or sum of an array.)
do{}while() is the canonical / idiomatic structure for loops in asm on all architectures, get used to it. IDK if there's a name for it; I would say such a loop has a "do while structure". If you want names, you could call the while() structure "crappy unoptimized code" or "written by a novice". :P Loop-branch at the bottom is universal, and not even worth mentioning as a Loop Optimization. You always do that.
This pattern is so widely used that on CPUs that use static branch prediction for branches without an entry in the branch-predictor caches, unknown forward conditional branches are predicted not-taken, unknown backwards branches are predicted taken (because they're probably loop branches). See Static branch prediction on newer Intel processors on Matt Godbolt's blog, and Agner Fog's branch-prediction chapter at the start of his microarch PDF.
This answer ended up using x86 examples for everything, but much of this applies across the board for all architectures. I wouldn't be surprised if other superscalar / out-of-order implementations (like some ARM, or POWER) also have limited branch-instruction throughput whether they're taken or not. But fewer instructions inside the loop is nearly universal when all you have is a conditional branch at the bottom, and no unconditional branch.
If the loop might need to run zero times, compilers more often put a test-and-branch outside the loop to skip it, instead of jumping to the loop condition at the bottom. (i.e. if the compiler can't prove the loop condition is always true on the first iteration).
BTW, this paper calls transforming while() to if(){ do{}while; } an "inversion", but loop inversion usually means inverting a nested loop. (e.g. if the source loops over a row-major multi-dimensional array in the wrong order, a clever compiler could change for(i) for(j) a[j][i]++; into for(j) for(i) a[j][i]++; if it can prove it's correct.) But I guess you can look at the if() as a zero-or-one iteration loop. Fun fact, compiler devs teaching their compilers how to invert a loop (to allow auto-vectorization) for a (very) specific case is why SPECint2006's libquantum benchmark is "broken". Most compilers can't invert loops in the general case, just ones that look almost exactly like the one in SPECint2006...
You can help the compiler make more compact asm (fewer instructions outside the loop) by writing do{}while() loops in C when you know the caller isn't allowed to pass size=0 or whatever else guarantees a loop runs at least once.
(Actually 0 or negative for signed loop bounds. Signed vs. unsigned loop counters is a tricky optimization issue, especially if you choose a narrower type than pointers; check your compiler's asm output to make sure it isn't sign-extending a narrow loop counter inside the loop every time you use it as an array index. But note that signed can actually be helpful, because the compiler can assume that i++ <= bound will eventually become false, because signed overflow is UB but unsigned isn't; so with unsigned, while(i++ <= bound) is infinite if bound = UINT_MAX.) I don't have a blanket recommendation for when to use signed vs. unsigned; size_t is often a good choice for looping over arrays, but if you want to avoid the x86-64 REX prefixes in the loop overhead (for a trivial saving in code size) while also convincing the compiler not to waste any instructions zero- or sign-extending, it can be tricky.
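To make that concrete, a hypothetical example where the contract guarantees at least one iteration:
#include <cstddef>

// The caller guarantees n != 0, so writing do{}while lets the compiler emit the
// loop with no entry check or jump and a single macro-fused cmp/branch at the bottom.
void scale(float* a, std::size_t n, float s) {
    std::size_t i = 0;
    do {
        a[i] *= s;
    } while (++i < n);
}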
I can't see a huge performance boost
Here's an example where that optimization gives a speedup of 2x on Intel CPUs before Haswell, because P6 and SnB/IvB can only run branches on port 5, including not-taken conditional branches.
Required background knowledge for this static performance analysis: Agner Fog's microarch guide (read the Sandybridge section). Also read his Optimizing Assembly guide, it's excellent. (Occasionally outdated in places, though.) See also other x86 performance links in the x86 tag wiki. See also Can x86's MOV really be "free"? Why can't I reproduce this at all? for some static analysis backed up by experiments with perf counters, and some explanation of fused vs. unfused domain uops.
You could also use Intel's IACA software (Intel Architecture Code Analyzer) to do static analysis on these loops.
; sum(int []) using SSE2 PADDD (dword elements)
; edi = pointer, esi = end_pointer.
; scalar cleanup / unaligned handling / horizontal sum of XMM0 not shown.
; NASM syntax
ALIGN 16 ; not required for max performance for tiny loops on most CPUs
.looptop: ; while (edi<end_pointer) {
cmp edi, esi ; 32-bit code so this can macro-fuse on Core2
jae .done ; 1 uop, port5 only (macro-fused with cmp)
paddd xmm0, [edi] ; 1 micro-fused uop, p1/p5 + a load port
add edi, 16 ; 1 uop, p015
jmp .looptop ; 1 uop, p5 only
; Sandybridge/Ivybridge ports each uop can use
.done: ; }
This is 4 total fused-domain uops (with macro-fusion of the cmp/jae), so it can issue from the front-end into the out-of-order core at one iteration per clock. But in the unfused domain there are 4 ALU uops and Intel pre-Haswell only has 3 ALU ports.
More importantly, port5 pressure is the bottleneck: This loop can execute at only one iteration per 2 cycles because cmp/jae and jmp both need to run on port5. Other uops stealing port5 could reduce practical throughput somewhat below that.
Writing the loop idiomatically for asm, we get:
ALIGN 16
.looptop: ; do {
paddd xmm0, [edi] ; 1 micro-fused uop, p1/p5 + a load port
add edi, 16 ; 1 uop, p015
cmp edi, esi ; 1 uop, port5 only (macro-fused with jb)
jb .looptop ; } while(edi < end_pointer);
Notice right away, independent of everything else, that this is one fewer instruction in the loop. This loop structure is at least slightly better on everything from simple non-pipelined 8086 through classic RISC (like early MIPS), especially for long-running loops (assuming they don't bottleneck on memory bandwidth).
Core2 and later should run this at one iteration per clock, twice as fast as the while(){}-structured loop, if memory isn't a bottleneck (i.e. assuming L1D hits, or at least L2 actually; this is only SSE2 16-bytes per clock).
This is only 3 fused-domain uops, so can issue at better than one per clock on anything since Core2, or just one per clock if issue groups always end with a taken branch.
But the important part is that port5 pressure is vastly reduced: only cmp/jb needs it. The other uops will probably be scheduled to port5 some of the time and steal cycles from loop-branch throughput, but this will be a few % instead of a factor of 2. See How are x86 uops scheduled, exactly?.
Most CPUs that normally have a taken-branch throughput of one per 2 cycles can still execute tiny loops at 1 per clock. There are some exceptions, though. (I forget which CPUs can't run tight loops at 1 per clock; maybe Bulldozer-family? Or maybe just some low-power CPUs like VIA Nano.) Sandybridge and Core2 can definitely run tight loops at one per clock. They even have loop buffers; Core2 has a loop buffer after instruction-length decode but before regular decode. Nehalem and later recycle uops in the queue that feeds the issue/rename stage. (Except on Skylake with microcode updates; Intel had to disable the loop buffer because of a partial-register merging bug.)
However, there is a loop-carried dependency chain on xmm0: Intel CPUs have 1-cycle latency paddd, so we're right up against that bottleneck, too. add edi, 16 is also 1 cycle latency. On Bulldozer-family, even integer vector ops have 2c latency, so that would bottleneck the loop at 2c per iteration. (AMD since K8 and Intel since SnB can run two loads per clock, so we need to unroll anyway for max throughput.) With floating point, you definitely want to unroll with multiple accumulators. Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators).
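A sketch of that multiple-accumulator idea in C++ (hypothetical reduction; without -ffast-math the compiler keeps the four chains separate exactly as written):
#include <cstddef>

// Four independent FP add chains in flight hide the FP-add latency.
// Assumes n % 4 == 0; a real version needs a cleanup loop for the remainder.
float sum4(const float* a, std::size_t n) {
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (std::size_t i = 0; i < n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}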
If I'd used an indexed addressing mode, like paddd xmm0, [edi + eax], I could have used sub eax, 16 / jnc at the loop condition. SUB/JNC can macro-fuse on Sandybridge-family, but the indexed load would un-laminate on SnB/IvB (but stay fused on Haswell and later, unless you use the AVX form).
; index relative to the end of the array, with an index counting up towards zero
add rdi, rsi ; edi = end_pointer
xor eax, eax
sub eax, esi ; eax = -length, so [rdi+rax] = first element
.looptop: ; do {
paddd xmm0, [rdi + rax]
add eax, 16
jl .looptop ; } while(idx+=16 < 0); // or JNC still works
(It's usually better to unroll some to hide the overhead of pointer increments instead of using indexed addressing modes, especially for stores, partly because indexed stores can't use the port7 store AGU on Haswell+.)
On Core2/Nehalem add/jl don't macro-fuse, so this is 3 fused-domain uops even in 64-bit mode, without depending on macro-fusion. Same for AMD K8/K10/Bulldozer-family/Ryzen: no fusion of the loop condition, but PADDD with a memory operand is 1 m-op / uop.
On SnB, paddd un-laminates from the load, but add/jl macro-fuse, so again 3 fused-domain uops. (But in the unfused domain, only 2 ALU uops + 1 load, so probably fewer resource conflicts reducing throughput of the loop.)
On HSW and later, this is 2 fused-domain uops because an indexed load can stay micro-fused with PADDD, and add/jl macro-fuses. (Predicted-taken branches run on port 6, so there are never resource conflicts.)
Of course, the loops can only run at best 1 iteration per clock because of taken branch throughput limits even for tiny loops. This indexing trick is potentially useful if you had something else to do inside the loop, too.
But all of these loops had no unrolling
Yes, that exaggerates the effect of loop overhead. But gcc doesn't unroll by default even at -O3 (unless it decides to fully unroll). It only unrolls with profile-guided optimization to let it know which loops are hot. (-fprofile-use). You can enable -funroll-all-loops, but I'd only recommend doing that on a per-file basis for a compilation unit you know has one of your hot loops that needs it. Or maybe even on a per-function basis with an __attribute__, if there is one for optimization options like that.
So this is highly relevant for compiler-generated code. (But clang does default to unrolling tiny loops by 4, or small loops by 2, and extremely importantly, using multiple accumulators to hide latency.)
Benefits with very low iteration count:
Consider what happens when the loop body should run once or twice: There's a lot more jumping with anything other than do{}while.
For do{}while, execution is a straight-line with no taken branches and one not-taken branch at the bottom. This is excellent.
For an if() { do{}while; } that might run the loop zero times, it's two not-taken branches. That's still very good. (Not-taken is slightly cheaper for the front-end than taken when both are correctly predicted).
For a jmp-to-the-bottom jmp; do{}while(), it's one taken unconditional branch, one taken loop condition, and then the loop branch is not-taken. This is kinda clunky but modern branch predictors are very good...
For a while(){} structure, this is one not-taken loop exit, one taken jmp at the bottom, then one taken loop-exit branch at the top.
With more iterations, each loop structure does one more taken branch. while(){} also does one more not-taken branch per iteration, so it quickly becomes obviously worse.
The latter two loop structures have more jumping around for small trip counts.
Jumping to the bottom also has a disadvantage for non-tiny loops that the bottom of the loop might be cold in L1I cache if it hasn't run for a while. Code fetch / prefetch is good at bringing code to the front-end in a straight line, but if prediction didn't predict the branch early enough, you might have a code miss for the jump to the bottom. Also, parallel decode will probably have (or could have) decoded some of the top of the loop while decoding the jmp to the bottom.
Conditionally jumping over a do{}while loop avoids all that: you only jump forwards into code that hasn't been run yet in cases where the code you're jumping over shouldn't run at all. It often predicts very well because a lot of code never actually takes 0 trips through the loop. (i.e. it could have been a do{}while, the compiler just didn't manage to prove it.)
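In C terms, that guarded shape is just the earlier do{}while sketch wrapped in an if (hypothetical function):
#include <cstddef>

// The compiler's if() { do{}while; } shape for a loop that might run zero times:
// one forward branch (not taken whenever the loop runs at all), then a single
// backwards branch at the bottom.
void scale_checked(float* a, std::size_t n, float s) {
    if (n != 0) {
        std::size_t i = 0;
        do {
            a[i] *= s;
        } while (++i < n);
    }
}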
Jumping to the bottom also means the core can't start working on the real loop body until after the front-end chases two taken branches.
There are cases with complicated loop conditions where it's easiest to write it this way, and the performance impact is small, but compilers often avoid it.
Loops with multiple exit conditions:
Consider a memchr loop, or a strchr loop: they have to stop at the end of the buffer (based on a count) or the end of an implicit-length string (0 byte). But they also have to break out of the loop if they find a match before the end.
So you'll often see a structure like
do {
if () break;
blah blah;
} while(condition);
Or just two conditions near the bottom. Ideally you can test multiple logical conditions with the same actual instruction (e.g. 5 < x && x < 25 using sub eax, 5 / cmp eax, 20 / ja .outside_range, unsigned compare trick for range checking, or combine that with an OR to check for alphabetic characters of either case in 4 instructions) but sometimes you can't and just need to use an if()break style loop-exit branch as well as a normal backwards taken branch.
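As a sketch, that sub/cmp/ja range check corresponds to this idiom (shown for the inclusive range 5..25 that the instruction sequence above tests):
// Unsigned-compare range trick: one subtraction plus one unsigned compare.
// Values below 5 wrap to huge unsigned numbers and fail the <= test.
bool in_range_5_to_25(int x) {
    return static_cast<unsigned>(x) - 5u <= 20u;
}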
Further reading:
Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” for good ways to look at compiler output (e.g. what kind of inputs give interesting output, and a primer on reading x86 asm for beginners). related: How to remove "noise" from GCC/clang assembly output?
Modern Microprocessors: A 90-Minute Guide! A detailed look at superscalar pipelined CPUs, mostly architecture neutral. Very good. Explains instruction-level parallelism and stuff like that.
Agner Fog's x86 optimization guide and microarch pdf. This will take you from being able to write (or understand) correct x86 asm to being able to write efficient asm (or see what the compiler should have done).
other links in the x86 tag wiki, including Intel's optimization manuals. Also several of my answers (linked in the tag wiki) have things that Agner missed in his testing on more recent microarchitectures (like un-lamination of micro-fused indexed addressing modes on SnB, and partial register stuff on Haswell+).
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators): how to use multiple accumulators to hide latency of a reduction loop (like an FP dot product).
Lecture 7: Loop Transformations (also on archive.org). Lots of cool stuff that compilers do to loops, using C syntax to describe the asm.
https://en.wikipedia.org/wiki/Loop_optimization
Sort of off topic:
Memory bandwidth is almost always important, but it's not widely known that a single core on most modern x86 CPUs can't saturate DRAM, and not even close on many-core Xeons where single-threaded bandwidth is worse than on a quad-core with dual channel memory controllers.
How much of ‘What Every Programmer Should Know About Memory’ is still valid? (my answer has commentary on what's changed and what's still relevant in Ulrich Drepper's well-known excellent article.)

Dependency chain analysis

From Agner Fog's "Optimizing Assembly" guide, Section 12.7: a loop example. One of the paragraphs discussing the example code:
[...] Analysis for Pentium M: ... 13 uops at 3 per clock = one iteration per 4.33c retirement time.
There is a dependency chain in the loop. The latencies are: 2 for memory read, 5 for multiplication, 3 for subtraction, and 3 for memory write, which totals 13 clock cycles. This is three times as much as the retirement time but it is not a loop-carried dependence because the results from each iteration are saved to memory and not reused in the next iteration. The out-of-order execution mechanism and pipelining makes it possible that each calculation can start before the preceding calculation is finished. The only loop-carried dependency chain is add eax,16 which has a latency of only 1.
## Example 12.6b. DAXPY algorithm, 32-bit mode
[...] ; not shown: initialize some regs before the loop
L1:
movapd xmm1, [esi+eax] ; X[i], X[i+1]
mulpd xmm1, xmm2 ; X[i] * DA, X[i+1] * DA
movapd xmm0, [edi+eax] ; Y[i], Y[i+1]
subpd xmm0, xmm1 ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
movapd [edi+eax], xmm0 ; Store result
add eax, 16 ; Add size of two elements to index
cmp eax, ecx ; Compare with n*8
jl L1 ; Loop back
I cannot understand why the dependency chain doesn't increase a whole throughput. I know that it is only important to find the worst bottleneck. The worst bottleneck identified before considering dependency chains was fused-domain uop throughput, at 4.33 cycles per iteration. I cannot understand why the dependency chain isn't a bigger bottleneck than that.
I see that the author explains that it is connected with out-of-order execution and pipelining, but I cannot see it. I mean, only the multiplication has a latency of 5 cycles, so that is the only value greater than the 4.33-cycle retirement time.
I also cannot understand why the author doesn't care about the dependency here:
add eax, 16 -> cmp eax, ecx -> jl L1
After all, addition must be executed before cmp and cmp must be executed before jl.
PS: later paragraphs identify the biggest bottleneck for Pentium M as decode, limiting it to one iteration per 6c, because 128b vector ops decode to two uops each. See Agner Fog's guide for the rest of the analysis, and analysis + tuning for Core2, FMA4 Bulldozer, and Sandybridge.
The mul isn't part of a loop-carried dependency chain, so there can be mulpd insns from multiple iterations in flight at once. The latency of a single instruction isn't the issue here at all, it's the dependency chain. Each iteration has a separate 13c dependency chain of load, mulpd, subpd, store. Out-of-order execution is what allows uops from multiple iterations to be in flight at once.
The cmp / jl in each iteration depend on the add from that iteration, but the add in the next iteration doesn't depend on the cmp. Speculative execution and branch prediction mean that control dependencies (conditional branches and indirect jumps/calls) are not part of data dependency chains. This is why instructions from one iteration can start running before the jl from the preceding iteration retires.
By comparison, cmov is a data dependency instead of a control dependency, so branchless loops tend to have loop-carried dependency chains. This tends to be slower than branching if the branch predicts well.
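A classic sketch of that difference (hypothetical function): a conditional-move running max makes the accumulator a loop-carried data dependency, while a branchy version that predicts well hides the update behind speculation.
#include <cstddef>

// Compiled branchlessly (cmov), `m` depends on the previous iteration's `m`
// every iteration, so the loop is latency-bound on that chain. Compiled as a
// rarely-taken, well-predicted branch, the update is a control dependency and
// out-of-order execution is not serialized by it.
int running_max(const int* a, std::size_t n, int m) {
    for (std::size_t i = 0; i < n; ++i)
        m = (a[i] > m) ? a[i] : m;   // compilers often use cmov for this ternary
    return m;
}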
Each loop iteration has a separate cmp/jl dependency chain, just like the FP dependency chain.
I cannot understand why dependency chain doesn't increase a whole throughput.
I have no idea what this sentence means. I think I was able to figure out all your other mixed up words and phrasing. (e.g. "chain dependency" instead of "dependency chain".) Have a look at my edits to your question; some of them might help your understanding, too.
