How to determine CPE: Cycles Per Element - performance

How do I determine the CPE of a program?
For example, I have this assembly code for a loop:
# inner4: data_t = float
# udata in %rbx, vdata in %rax, limit in %rcx,
# i in %rdx, sum in %xmm1
.L87: # loop:
movss (%rbx,%rdx,4), %xmm0 # Get udata[i]
mulss (%rax,%rdx,4), %xmm0 # Multiply by vdata[i]
addss %xmm0, %xmm1 # Add to sum
addq $1, %rdx # Increment i
cmpq %rcx, %rdx # Compare i:limit
jl .L87 # If <, goto loop
I have to find the lower bound of the CPE determined by the critical path, using the data type float. I believe the critical path refers to the slowest possible path, and would thus be the one where the program has to execute the mulss instruction, because that takes the most clock cycles.
However, there doesn't seem to be any clear way to determine the CPE. If one instruction takes two clock cycles, and another takes one, can the latter start after the first clock cycle of the former? Any help would be appreciated. Thanks

If you want to know how long it takes, you should measure it. Execute the loop about 10^10 times, take the time it needs, and multiply by the clock frequency to get the total count of cycles; divide by 10^10 to get the number of clock cycles per loop iteration.
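For example, a minimal C harness for that measurement could look like this (a sketch: the dot kernel mirrors the loop above, and CPU_HZ is an assumption you must replace with your own core clock, with frequency scaling disabled):

#include <stdio.h>
#include <time.h>

#define N 4096
#define REPS 1000000          /* enough total iterations to swamp timing overhead */
#define CPU_HZ 3.4e9          /* assumption: replace with your measured core clock */

static float udata[N], vdata[N];

static float dot(const float *u, const float *v, long n) {
    float sum = 0.0f;
    for (long i = 0; i < n; i++)
        sum += u[i] * v[i];   /* the movss/mulss/addss loop from the question */
    return sum;
}

int main(void) {
    struct timespec t0, t1;
    volatile float sink;      /* keeps the compiler from discarding the work */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long r = 0; r < REPS; r++)
        sink = dot(udata, vdata, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf("CPE = %.2f\n", sec * CPU_HZ / ((double)REPS * N));
    return 0;
}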
A theoretical prediction of the execution time will almost never be correct (and most of the time too low), because there are numerous effects that determine the speed:
Pipelining (there can be easily about 20 stages in the pipeline)
Superscalar execution (up to 5 instructions in parallel, cmp and jl may be fused)
Decoding to µOps and reordering
The latencies of Caches or Memory
The throughput of the instructions (are there enough execution ports free)
The latencies of the instructions
Bank conflicts, aliasing issues and more esoteric stuff
Depending on the CPU, and provided the memory accesses all hit the L1 cache, I believe the loop should need at least 3 clock cycles per iteration, because the longest loop-carried dependency chain (the addss accumulating into %xmm1) has a latency of about 3 cycles. On an older CPU with slower mulss or addss instructions the time needed increases.
If you are actually interested in speeding up the code, and not just in theoretical observations, you should vectorize it. You can increase the performance by a factor of 4-8 with something like
.L87: # loop:
vmovaps (%rbx,%rdx,4), %ymm0 # Get udata[i]..udata[i+7]
vmulps (%rax,%rdx,4), %ymm0, %ymm0 # Multiply by vdata[i]..vdata[i+7]
vaddps %ymm0, %ymm1, %ymm1 # Add to sum
addq $8, %rdx # Increment i
cmpq %rcx, %rdx # Compare i:limit
jl .L87 # If <, goto loop
You need to horizontally add all 8 elements after the loop, and of course make sure the arrays are 32-byte aligned and the loop count is divisible by 8; see the intrinsics sketch below.
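In C, the vectorized loop plus that final horizontal add might look like the following (a rough intrinsics sketch, assuming AVX, 32-byte-aligned arrays, and n divisible by 8; compile with -mavx):

#include <immintrin.h>

float dot_avx(const float *u, const float *v, long n) {
    __m256 acc = _mm256_setzero_ps();
    for (long i = 0; i < n; i += 8) {
        __m256 x = _mm256_load_ps(u + i);          /* udata[i]..udata[i+7] */
        acc = _mm256_add_ps(acc, _mm256_mul_ps(x, _mm256_load_ps(v + i)));
    }
    /* horizontal add of the 8 partial sums */
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1)); /* 8 -> 4 */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));               /* 4 -> 2 */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));        /* 2 -> 1 */
    return _mm_cvtss_f32(s);
}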

If you're running on an Intel CPU, you can find good documentation on instruction latency and throughput for various CPUs: see the Intel® 64 and IA-32 Architectures Optimization Reference Manual.

Related

Interpreting Absurdly-Low Measured Latency in Careful Profile (Superscalarity Effects?)

I've written some code for profiling small functions. At the high level it:
Sets the thread affinity to only one core and the thread priority to maximum.
Computes statistics from doing the following 100 times:
Estimate the latency of a function that does nothing.
Estimate the latency of the test function.
Subtract the first from the second to remove the cost of doing function-call overhead, thereby roughly getting the cost of the test function's contents.
To estimate the latency of a function, it:
Invalidates caches (this is difficult to actually do in user-mode, but I allocate and write a buffer the size of the L3 cache, which should maybe help).
Yields the thread, so that the profile loop has as-few-as-possible context switches.
Gets the current time from a std::chrono::high_resolution_clock (which seems to compile to system_clock).
Runs the profile loop 100,000,000 times, calling the tested function within.
Gets the current time from a std::chrono::high_resolution_clock and subtracts to get latency.
Because individual instructions matter at this level, at all points we have to write very careful code to ensure that the compiler doesn't elide, inline, cache, or otherwise treat the functions differently. I have manually validated the generated assembly in various test cases, including the one I present below.
I am getting extremely low (sub-nanosecond) latencies reported in some cases. I have tried everything I can think of to account for this, but cannot find an error.
I am looking for an explanation accounting for this behavior. Why are my profiled functions taking so little time?
Let's take the example of computing a square root for float.
The function signature is float(*)(float), and the empty function is trivial:
empty_function(float):
ret
Let's compute the square root by using the sqrtss instruction, and by the multiplication-by-reciprocal-square-root hack. I.e., the tested functions are:
sqrt_sseinstr(float):
sqrtss xmm0, xmm0
ret
sqrt_rcpsseinstr(float):
movaps xmm1, xmm0
rsqrtss xmm1, xmm0
mulss xmm0, xmm1
ret
Here's the profile loop. Again, this same code is called with the empty function and with the test functions:
double profile(float):
...
mov rbp,rdi
push rbx
mov ebx, 0x5f5e100
call 1c20 <invalidate_caches()>
call 1110 <sched_yield()>
call 1050 <std::chrono::high_resolution_clock::now()>
mov r12, rax
xchg ax, ax
15b0:
movss xmm0,DWORD PTR [rip+0xba4]
call rbp
sub rbx, 0x1
jne 15b0 <double profile(float)+0x20>
call 1050 <std::chrono::high_resolution_clock::now()>
...
The timing result for sqrt_sseinstr(float) on my Intel 990X is 3.60±0.13 nanoseconds. At this processor's rated 3.46 GHz, that works out to be 12.45±0.44 cycles. This seems pretty spot-on, given that the docs say the latency of sqrtss is around 13 cycles (it's not listed for this processor's Nehalem architecture, but it seems likely to also be around 13 cycles).
The timing result for sqrt_rcpsseinstr(float) is stranger: 0.01±0.07 nanoseconds (or 0.02±0.24 cycles). This is flatly implausible unless another effect is going on.
I thought perhaps the processor is able to hide the latency of the tested function somewhat or perfectly because the tested function uses different instruction ports (i.e. superscalarity is hiding something)? I tried to analyze this by hand, but didn't get very far because I didn't really know what I was doing.
(Note: I cleaned up some of the assembly notation for your convenience. An unedited objdump of the whole program, which includes several other variants, is here, and I am temporarily hosting the binary here (x86-64 SSE2+, Linux).)
The question, again: Why are some profiled functions producing implausibly small values? If it is a higher-order effect, explain?
The problem is with the basic approach of subtracting out the "latency"1 of an empty function, as described:
Estimate the latency of a function that does nothing.
Estimate the latency of the test function.
Subtract the first from the second to remove the cost of doing function-call overhead, thereby roughly getting the cost of the test
function's contents.
The built-in assumption is that the cost of calling a function is X, and if the latency of the work done in the function is Y, then the total cost will be something like X + Y.
This isn't generally true for any two blocks of work, and it especially isn't true when one of them is "calling a function". A more sophisticated view would be that the total time will be somewhere between max(X, Y) and X + Y - but even this is often wrong depending on the details. Still, it's enough of a refinement to explain what is going on here: the cost of the function is not additive with the work being done in the function: they happen in parallel.
The cost of an empty function call is something like 4 to 5 cycles on modern Intel, probably bottlenecked on the front-end throughput for the two taken branches, and possibly by branch and return predictor latency.
However, when you add additional work to an empty function, it generally won't compete for the same resources, and its instructions won't depend on the "output" of the call (i.e., the work will form a separate dependency chain), except perhaps in rare cases where the stack pointer is manipulated and the stack engine doesn't remove the dependency.
So essentially the function will take the greater of the time needed for the function call mechanics, or the actual work done by the function. This approximation isn't exact, because some types of work may actually add to the overhead of the function call (e.g., if there are enough instructions for the front end to get through before getting to the ret, the total time may increase on top of the 4-5 cycle empty function time, even if the total work is less than that) - but it's a good first order approximation.
Your first function takes enough time that the actual work dominates the execution time. The second function is much faster, however, enabling it to "hide under" the existing time taken by the call/ret mechanics.
The solution is simple: duplicate the work within the function N times, so that the work always dominates. N=10 or N=50 or something like that is fine. You have to decide whether you want to test latency, in which case the output of one copy of the work should feed into the next, or throughput, in which case it shouldn't.
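To make that concrete, here is roughly what the two variants look like in C, using sqrtf as a stand-in for the tested work (hypothetical names, N=10 copies):

#include <math.h>

/* latency: each copy feeds the next, so the dependency chain dominates */
float test_latency(float x) {
    for (int k = 0; k < 10; k++)
        x = sqrtf(x);          /* copy k+1 must wait for copy k */
    return x;
}

/* throughput: independent copies that out-of-order execution can overlap */
float test_throughput(float x) {
    float s = 0.0f;
    for (int k = 0; k < 10; k++)
        s += sqrtf(x + k);     /* independent inputs; only the cheap add is serial */
    return s;
}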
On the other hand, if you actually want to test the cost of the function call + work, e.g., because that's how you'll be using it in real life, it is likely the results you have gotten are already close to correct: stuff really can be "incrementally free" when it hides behind a function call.
1 I'm putting "latency" in quotes here because it isn't clear whether we should be talking about the latency of call/ret or the throughput. call and ret don't have any explicit outputs (and ret has no input), so it doesn't participate in a classic register-based dependency chain - but it might make sense to think of latency if you consider other hidden architectural components like the instruction pointer. In either case latency of throughput mostly points down to the same thing because all call and ret on a thread operate on the same state, so it doesn't make sense to have say "independent" vs "dependent" call chains.
Your benchmarking approach is fundamentally wrong, and your "careful code" is bogus.
First, emptying the cache is bogus. Not only will it quickly be repopulated with the required data, but the examples you have posted also have very little memory interaction (only the cache accesses by call/ret, and a load we'll get to).
Second, yielding before the benchmarking loop is bogus. You iterate 100000000 times, which even on a reasonably fast modern processor will take longer than typical scheduling clock interrupts on a stock operating system. If, on the other hand, you disable scheduling clock interrupts, then yielding before the benchmark doesn't do anything.
Now that the useless incidental complexity is out of the way, about the fundamental misunderstanding of modern CPUs:
You expect loop_time_gross/loop_count to be the time spent in each loop iteration. This is wrong. Modern CPUs do not execute instructions one after the other, sequentially. Modern CPUs pipeline, predict branches, execute multiple instructions in parallel, and (on reasonably fast CPUs) out of order.
So after the first handful of iterations of the benchmarking loop, all branches are perfectly predicted for the next almost 100000000 iterations. This enables the CPU to speculate. Effectively, the conditional branch in the benchmarking loop goes away, as does most of the cost of the indirect call. In effect, the CPU can unroll the loop:
movss xmm0, number
movaps xmm1, xmm0
rsqrtss xmm1, xmm0
mulss xmm0, xmm1
movss xmm0, number
movaps xmm1, xmm0
rsqrtss xmm1, xmm0
mulss xmm0, xmm1
movss xmm0, number
movaps xmm1, xmm0
rsqrtss xmm1, xmm0
mulss xmm0, xmm1
...
or, for the other loop
movss xmm0, number
sqrtss xmm0, xmm0
movss xmm0, number
sqrtss xmm0, xmm0
movss xmm0, number
sqrtss xmm0, xmm0
...
Notably, the load of number is always from the same address (and thus quickly cached), and it overwrites the just-computed value, breaking the dependency chain.
To be fair, the
call rbp
sub rbx, 0x1
jne 15b0 <double profile(float)+0x20>
are still executed, but the only resources they take from the floating-point code are decode/micro-op cache and execution ports. Notably, while the integer loop code has a loop-carried dependency chain (ensuring a minimum execution time), the floating-point code does not carry a dependency on it. Furthermore, the floating-point code consists of many mutually independent short dependency chains.
Where you expect the CPU to execute instructions sequentially, the CPU can instead execute them in parallel.
A small look at https://agner.org/optimize/instruction_tables.pdf reveals why this parallel execution doesn't work for sqrtss on Nehalem:
instruction: SQRTSS/PS
latency: 7-18
reciprocal throughput: 7-18
i.e., the instruction cannot be pipelined and only runs on one execution port.
In contrast, for movaps, rsqrtss, mulss:
instruction: MOVAPS/D
latency: 1
reciprocal throughput: 1
instruction: RSQRTSS
latency: 3
reciprocal throughput: 2
instruction: MULSS
latency: 4
reciprocal throughput: 1
The maximum reciprocal throughput in the dependency chain is 2, so you can expect the code to finish one dependency chain every 2 cycles in the steady state. At this point, the execution time of the floating-point part of the benchmarking loop is less than or equal to the loop overhead, and overlapped with it, so your naive approach of subtracting the loop overhead leads to nonsensical results.
If you wanted to do this properly, you would ensure that separate loop iterations are dependent on each other, for example by changing your benchmarking loop to
float x = INITIAL_VALUE;
for (i = 0; i < 100000000; i++)
x = benchmarked_function(x);
Obviously you will not benchmark the same input this way, unless INITIAL_VALUE is a fixed point of benchmarked_function(). However, you can arrange for it to be a fixed point of an expanded function by computing float diff = INITIAL_VALUE - benchmarked_function(INITIAL_VALUE); and then making the loop
float x = INITIAL_VALUE;
for (i = 0; i < 100000000; i++)
x = diff + benchmarked_function(x);
with relatively minor overhead, though you should then ensure that floating-point errors do not accumulate to significantly change the value passed to benchmarked_function().
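Putting the pieces together, a self-contained sketch of the serialized benchmark (using sqrtf as a stand-in for benchmarked_function, and 1.0f, which happens to be a fixed point of sqrtf, as INITIAL_VALUE):

#include <math.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

int main(void) {
    const float initial = 1.0f;             /* fixed point: sqrtf(1.0f) == 1.0f */
    float diff = initial - sqrtf(initial);  /* 0.0f here, nonzero in general */
    float x = initial;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        x = diff + sqrtf(x);                /* each call depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / ITERS;
    printf("%.2f ns/call (x=%f)\n", ns, x); /* print x so the chain can't be elided */
    return 0;
}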

Assembly - How to score a CPU instruction by latency and throughput

I'm looking for a formula or method to measure how fast an instruction is; more specifically, to give each instruction a "score" in CPU cycles.
Let's take the following assembly program as an example,
nop
mov eax,dword ptr [rbp+34h]
inc eax
mov dword ptr [rbp+34h],eax
and the following Intel Skylake information:
mov r,m : Throughput=0.5 Latency=2
mov m,r : Throughput=1 Latency=2
nop : Throughput=0.25 Latency=none
inc : Throughput=0.25 Latency=1
I know that the order of the instructions in the program matters here, but I'm looking to create something general that doesn't need to be "accurate to the single cycle".
Does anyone have any idea how I can do that?
There is no formula you can apply; you have to measure.
The same instruction on different versions of the same uarch family can have different performance. e.g. mulps:
Sandybridge 1c / 5c throughput/latency.
HSW 0.5 / 5. BDW 0.5 / 3 (faster multiply path in the FMA unit? FMA is still 5c).
SKL 0.5 / 4 (lower latency FMA, too). SKL runs addps on the FMA unit as well, dropping the dedicated FP multiply unit so add latency is higher, but throughput is higher.
There's no way you could predict any of this without measuring, or knowing some microarchitectural details. We expect FP math ops won't be single-cycle latency, because they're much more complicated than integer ops. (So if they were single cycle, the clock speed would be set too low for integer ops.)
You measure by repeating the instruction many times in an unrolled loop. Or fully unrolled with no looping, but then you defeat the uop-cache and can get front-end bottlenecks. (e.g. for decoding 10-byte mov r64, imm64)
https://uops.info/ has already automated this testing for every form of every (unprivileged) instruction, and you can even click on any table entry to see what test loops they used. e.g. Skylake xchg r32, eax latency testing (https://uops.info/html-lat/SKL/XCHG_R32_EAX-Measurements.html) from each input operand to each output. (2 cycle latency from EAX -> R8D, but 1 cycle latency from R8D -> EAX.) So we can guess that the 3 uops include copying EAX to an internal temporary, but moving directly from the other operand to EAX.
https://uops.info/ is the current best source of test data; when it and Agner's tables disagree, my own measurements and/or other sources have always confirmed uops.info's testing was accurate. And they don't try to make up a latency number for 2 halves of a round-trip like movd xmm0,eax and back, they show you the range of possible latencies assuming the rest of the chain was the minimum plausible.
Agner Fog creates his instruction tables (which you appear to be reading) by timing large non-looping blocks of code that repeat an instruction. https://agner.org/optimize/. The intro section of his instruction-tables explains briefly how he measures, and his microarch guide explains more details of how different x86 microarchitectures work internally. Unfortunately there are occasional typos or copy/paste errors in his hand-edited tables.
http://instlatx64.atw.hu/ also has results of experimental measurements. I think they use a similar technique of a large block of the same instruction repeated, maybe small enough to fit in the uop cache. But they don't use perf counters to measure what execution port each instruction needs, so their throughput numbers don't help you figure out which instructions compete with which other instructions.
These latter two sources have been around for longer than uops.info, and cover some older CPUs, especially older AMD.
To measure latency yourself, you make the output of each instruction an input for the next.
mov ecx, 10000000
inc_latency:
inc eax
inc eax
inc eax
inc eax
inc eax
inc eax
inc eax
sub ecx,1 ; avoid partial-flag false dep for P4
jnz inc_latency ; dec or sub/jnz macro-fuses into 1 uop on Intel SnB-family
This dependency chain of 7 inc instructions will bottleneck the loop at 1 iteration per 7 * inc_latency cycles. Using perf counters for core clock cycles (not RDTSC cycles), you can easily measure the time for all the iterations to 1 part in 10k, and with more care probably even more precisely than that. The repeat count of 10000000 hides start/stop overhead of whatever timing you use.
I normally put a loop like this in a Linux static executable that just makes a sys_exit(0) system call directly (with a syscall instruction), and time the whole executable with perf stat ./testloop to get time and a cycle count. (See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example.)
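If you'd rather not write the whole thing in assembly, a rough C stand-in using GNU extended asm works too (a sketch, not the raw sys_exit version; build with gcc -O2 -static and run under perf stat ./a.out):

/* the 7-inc latency chain from above, wrapped in a minimal C program */
int main(void) {
    unsigned iters = 10000000;
    unsigned acc = 0;
    asm volatile(
        "1:\n\t"
        "inc %[a]\n\t"
        "inc %[a]\n\t"
        "inc %[a]\n\t"
        "inc %[a]\n\t"
        "inc %[a]\n\t"
        "inc %[a]\n\t"
        "inc %[a]\n\t"
        "sub $1, %[c]\n\t"
        "jnz 1b"
        : [a] "+r"(acc), [c] "+r"(iters));
    return acc & 0xff;   /* make the exit status depend on the work */
}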
Another example is Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths, with the added complication of using lfence to drain the out-of-order execution window for two dep chains.
To measure throughput, you use separate registers, and/or include an xor-zeroing occasionally to break dep chains and let out-of-order exec overlap things. Don't forget to also use perf counters to see which ports it can run on, so you can tell which other instructions it will compete with. (e.g. FMA (p01) and shuffles (p5) don't compete at all for back-end resources on Haswell/Skylake, only for front-end throughput.) Don't forget to measure front-end uop counts, too: some instructions decode to multiply uops.
How many different dependency chains do we need to avoid a bottleneck? Well we know the latency (measure it first), and we know the max possible throughput (number of execution ports, or front-end throughput.)
For example, if FP multiply had 0.25c throughput (4 per clock), we could keep 20 in flight at once on Haswell (5c latency). That's more than we have registers, so we could just use all 16 and discover that in fact the throughput is only 0.5c. But if it had turned out that 16 registers was a bottleneck, we could add xorps xmm0,xmm0 occasionally and let out-of-order execution overlap some blocks.
More is normally better; having just barely enough to hide latency can slow down with imperfect scheduling. If we wanted to go nuts measuring inc, we'd do this:
mov ecx, 10000000
inc_latency:
%rep 10 ;; source-level repeat of a block, no runtime branching
inc eax
inc ebx
; not ecx, we're using it as a loop counter
inc edx
inc esi
inc edi
inc ebp
inc r8d
inc r9d
inc r10d
inc r11d
inc r12d
inc r13d
inc r14d
inc r15d
%endrep
sub ecx,1 ; break partial-flag false dep for P4
jnz inc_latency ; dec/jnz macro-fuses into 1 uop on Intel SnB-family
If we were worried about partial-flag false dependencies or flag-merging effects, we might experiment with mixing in an xor eax,eax somewhere to let OoO exec overlap more than just when sub wrote all flags. (See INC instruction vs ADD 1: Does it matter?)
There's a similar problem for measuring throughput and latency of shl r32, cl on Sandybridge-family: the flag dependency chain isn't normally relevant for a computation, but putting shl back-to-back creates a dependency through FLAGS as well as through the register. (Or for throughput, there isn't even a register dep).
I posted about this on Agner Fog's blog: https://www.agner.org/optimize/blog/read.php?i=415#860. I mixed shl edx,cl in with four add edx,1 instructions, to see what incremental slowdown adding one more instruction had, where the FLAGS dependency was a non-issue. On SKL, it only slows down by an extra 1.23 cycles on average, so the true latency cost of that shl was only ~1.23 cycles, not 2. (It's not a whole number or just 1 because of resource conflicts to run the flag-merging uops of the shl, I guess. BMI2 shlx edx, edx, ecx would be exactly 1c because it's only a single uop.)
Related: for static performance analysis of whole blocks of code (containing different instructions), see What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?. (It's using the word "latency" for the end-to-end latency of a whole computation, but actually asking about things small enough for OoO exec to overlap different parts, so instruction latency and throughput both matter.)
The Latency=2 numbers for load/store appear to be from Agner Fog's instruction tables (https://agner.org/optimize/). They unfortunately aren't accurate for a chain of mov rax, [rax]. You'll find that's 4c latency if you measure it by putting that in a loop.
Agner splits up load/store latency into something that makes the total store/reload latency come out correct, but for some reason he doesn't make the load part equal to the L1d load-use latency when it comes from cache instead of the store buffer. (But also note that if the load feeds an ALU instruction instead of another load, the latency is 5c. So the simple addressing-mode fast-path only helps for pure pointer-chasing.)
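To reproduce that 4c number yourself, the C analogue of a mov rax,[rax] chain is a pointer chase in which each load's address comes from the previous load (a sketch; with -O2 the loop body should compile to a single dependent load):

#include <stdint.h>

/* each iteration's load address depends on the previous load's result,
   so the loop runs at the L1d load-use latency per iteration */
uintptr_t chase(uintptr_t *p, long iters) {
    for (long i = 0; i < iters; i++)
        p = (uintptr_t *)*p;
    return (uintptr_t)p;
}

/* usage: a cell that points at itself keeps every load in L1d:
   uintptr_t cell = (uintptr_t)&cell;  chase(&cell, 100000000L); */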

x86 - instruction interleaving to avoid cpu stall

GCC 6, Intel Core 2 Duo.
Compilation flags: "-march=native -O3" (-S)
I was compiling a simple program and asked for the assembly output:
Code
movq 8(%rsi), %rdi
call _atoi
movq 16(%rbp), %rdi
movl %eax, %ebx
call _atof
pxor %xmm1, %xmm1
movl $1, %eax <- this instruction is my problem
cvtsi2sd %ebx, %xmm1
leaq LC0(%rip), %rdi
addsd %xmm1, %xmm0
call _printf
addq $8, %rsp
Execution
read/convert an integer variable, then read/convert a double value and add them.
The problem
I perfectly understand that one (and the compiler even more so) has to avoid CPU stalls as much as possible.
I've marked the offending instruction in the code section above.
To me, with CPU reordering and different execution contexts, this interleaved instruction is useless.
My rationale is: the chances that we stall are very high anyway, and the CPU will have to wait for pxor to write xmm1 before being able to reuse it in the next instruction. Adding an instruction just fills the CPU decoder for nothing. The CPU HAS to wait anyway. So why not leave it alone for one instruction?
Moving the pxor before the atof call seems not possible, as atof may use xmm1.
Question
Is this a bug, legacy junk (from when CPUs were not able to reorder), or something else?
Thanks
EDIT:
I admit my question was not clear: can this instruction be safely removed without performance consequences?
The x86-64 ABI requires that calls to varargs functions (like printf) set %al = the count of floating-point args passed in xmm registers. In this case, you're passing one double, so the ABI requires %al = 1. (Fun fact: C's promotion rules make it impossible to pass a float to a vararg function. This is why there are no printf conversion specifiers for float, only double.)
mov $1, %eax avoids false dependencies on the rest of eax, (compared to mov $1, %al), so gcc prefers spending extra instruction bytes on that, even though it's tuning for Core2 (which renames partial registers).
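For context, the compiled source was presumably something along these lines (a guess reconstructed from the asm; argument checking omitted):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int a = atoi(argv[1]);
    double b = atof(argv[2]);
    /* a + b is one double in xmm0 at the call, so the ABI wants %al = 1 */
    printf("%f\n", a + b);
    return 0;
}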
Previous answer, before it was clarified that the question was why the mov is done at all, not about its ordering.
IIRC, gcc doesn't do much instruction scheduling for x86, because it's assuming out-of-order execution. I tried to google that, but didn't find the quote from a gcc developer that I seem to remember reading (maybe in a gcc bug report comment).
Anyway, it looks ok to me, unless you're tuning for in-order Atom or P5. If you are, use gcc -O3 -march=atom (which implies -mtune=atom). But anyway, you're clearly not doing that, because you used -march=native on a C2Duo, which is a 4-wide out-of-order design with a fairly large scheduler.
To me, with CPU reordering and different execution contexts, this interleaved instruction is useless.
I have no idea what you think the problem is, or what ordering you think would be better, so I'll just explain why it looks good.
I didn't take the time to edit this down to a short answer, so you might prefer to just read Agner Fog's microarch pdf for details of the Core2 pipeline, and skim this answer. See also other links from the x86 tag wiki.
...
call _atof
# xmm0 is probably still not ready when the following instructions issue
pxor %xmm1, %xmm1 # no inputs, so can run any time after being issued.
gcc uses pxor because cvtsi2sd is badly designed, giving it a false dependency on the previous value of the vector register. Note how the upper half of the vector register keeps its old value. Intel probably designed it this way because the original SSE cvtsi2ss was first implemented on Pentium III, where 128b vectors were handled as two halves. Zeroing the rest of the register (including the upper half) instead of merging probably would have taken an extra uop on PIII.
This short-sighted design choice saddled the architecture with the choice between an extra dependency-breaking instruction, or a false dependency. A false dep might not matter at all, or might be a big slowdown if the register used by one function happened to be used for a very long FP dependency chain in another function (maybe including a cache miss).
On Intel SnB-family CPUs, xor-zeroing is handled at register-rename time, so the uop never needs to execute on an execution port; it's already completed as soon as it issues into the ROB. This is true for integer and vector registers.
On other CPUs, the pxor will need an execution port, but has no input dependencies so it can execute any time there's a free ALU port, after it issues.
movl $1, %eax # no input dependencies, can execute any time.
This instruction could be placed anywhere after call atof and before call printf.
cvtsi2sd %ebx, %xmm1 # no false dependency thanks to pxor.
This is a 2 uop instruction on Core2 (Merom and Penryn), according to Agner Fog's tables. That's weird because cvtsi2ss is 1 uop. (They're both 2 uops in SnB; presumably one uop to move data between integer and vector, and another for the conversion).
Putting this insn earlier would be good, potentially issue it a cycle earlier, since it's part of the longest dependency chain here. (The integer stuff is all simple and trivial). However, printf has to parse the format string before it will decide to look at xmm0, so the FP instructions aren't actually on the critical path.
It can't go ahead of pxor, and call / pxor / cvtsi2sd would mean pxor would decode by itself that cycle. Decoding will start with the instruction after the call, after the ret in the called function has been decoded (and the return-address predictor predicts the jump back to the insn after the call). Multi-uop instructions have to be the first instruction in a block, so having pxor and mov imm32 decode that cycle means less of a decode bottleneck.
leaq LC0(%rip), %rdi # 1 uop
addsd %xmm1, %xmm0 # 1 uop
call _printf # 3 uop insn
cvtsi2sd/lea/addsd can all decode in the same cycle, which is optimal. If the mov imm32 was after the cvt, it could decode in the same cycle as well (since pre-SnB decoders can handle up to 4-1-1-1), but it couldn't have issued as soon.
If decoding was only barely keeping up with issue, that would mean pxor would issue by itself (because no other instructions were decoded yet). Then cvtsi2sd/mov imm/lea (4 uops), then addsd / call (4 uops). (addsd decoded with the previous issue group; core2 has a short queue between decode and issue to help absorb decode bubbles like this, and make it useful to be able to decode up to 7 uops in a cycle.)
That's not appreciably different from the current issue pattern in a decode-bottleneck situation: (pxor / mov imm) / (cvtsi2sd/lea/addsd) / (call printf)
If decode isn't the bottleneck, I'm not sure if Core2 can issue a ret or jmp in the same cycle as uops that follow the jump. In SnB-family CPUs, an unconditional jump always ends an issue group. e.g. a 3-uop loop issues ABC, ABC, ABC, not ABCA, BCAB, CABC.
Assuming the instructions after the ret issue with a group not including the ret, we'd have
(pxor/mov imm/cvtsi2sd), (lea / addsd / 2 of call's 3 uops) / (last call uop)
So the cvtsi2sd still issues in the first cycle after returning from atof, which means it can get started executing right away. Even on Core2, where pxor takes an execution unit, the first of the 2 uops from cvtsi2sd can probably execute in the same cycle as pxor. It's probably only the 2nd uop that has an input dependency on the dst register.
(mov imm / pxor / cvtsi2sd) would be equivalent, and so would the slower-to-decode (pxor / cvtsi2sd / mov imm), or getting the lea executed before mov imm.

Dependency chain analysis

From Agner Fog's "Optimizing Assembly" guide, Section 12.7: a loop example. One of the paragraphs discussing the example code:
[...] Analysis for Pentium M: ... 13 uops at 3 per clock = one iteration per 4.33c retirement time.
There is a dependency chain in the loop. The latencies are: 2 for memory read, 5 for multiplication, 3 for subtraction, and 3 for memory write, which totals 13 clock cycles. This is three times as much as the retirement time, but it is not a loop-carried dependence because the results from each iteration are saved to memory and not reused in the next iteration. The out-of-order execution mechanism and pipelining make it possible for each calculation to start before the preceding calculation is finished. The only loop-carried dependency chain is add eax,16, which has a latency of only 1.
## Example 12.6b. DAXPY algorithm, 32-bit mode
[...] ; not shown: initialize some regs before the loop
L1:
movapd xmm1, [esi+eax] ; X[i], X[i+1]
mulpd xmm1, xmm2 ; X[i] * DA, X[i+1] * DA
movapd xmm0, [edi+eax] ; Y[i], Y[i+1]
subpd xmm0, xmm1 ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
movapd [edi+eax], xmm0 ; Store result
add eax, 16 ; Add size of two elements to index
cmp eax, ecx ; Compare with n*8
jl L1 ; Loop back
I cannot understand why the dependency chain doesn't limit the overall throughput. I know that it is only important to find the worst bottleneck. The worst bottleneck identified before considering dependency chains was fused-domain uop throughput, at 4.33 cycles per iteration. I cannot understand why the dependency chain isn't a bigger bottleneck than that.
I see that the author explains that it is connected with out-of-order execution and pipelining, but I cannot see it. I mean, only the multiplication has a latency of 5 cycles, so only this value is greater than 4 cycles.
I also cannot understand why the author doesn't care about the dependency here:
add eax, 16 -> cmp eax, ecx -> jl L1
After all, addition must be executed before cmp and cmp must be executed before jl.
PS: later paragraphs identify the biggest bottleneck for Pentium M as decode, limiting it to one iteration per 6c, because 128b vector ops decode to two uops each. See Agner Fog's guide for the rest of the analysis, and analysis + tuning for Core2, FMA4 Bulldozer, and Sandybridge.
The mul isn't part of a loop-carried dependency chain, so there can be mulpd insns from multiple iterations in flight at once. The latency of a single instruction isn't the issue here at all; what matters is the dependency chain. Each iteration has a separate 13c dependency chain of load, mulpd, subpd, store. Out-of-order execution is what allows uops from multiple iterations to be in flight at once.
The cmp / jl in each iteration depend on the add from that iteration, but the add in the next iteration doesn't depend on the cmp. Speculative execution and branch prediction mean that control dependencies (conditional branches and indirect jumps/calls) are not part of data dependency chains. This is why instructions from one iteration can start running before the jl from the preceding iteration retires.
By comparison, cmov is a data dependency instead of a control dependency, so branchless loops tend to have loop-carried dependency chains. This tends to be slower than branching if the branch predicts well.
Each loop iteration has a separate cmp/jl dependency chain, just like the FP dependency chain.
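A conditional-minimum loop is a simple illustration of the cmov-vs-branch point (a sketch; whether the compiler actually emits cmov or a branch depends on its heuristics):

/* branchless: if this compiles to cmov, m's new value depends on the
   compare every iteration, so load -> cmp -> cmov is loop-carried data */
int min_branchless(const int *a, long n) {
    int m = a[0];
    for (long i = 1; i < n; i++)
        m = a[i] < m ? a[i] : m;
    return m;
}

/* branchy: a well-predicted branch is a control dependency, so m's
   dependency chain stays short and iterations overlap freely */
int min_branchy(const int *a, long n) {
    int m = a[0];
    for (long i = 1; i < n; i++)
        if (a[i] < m) m = a[i];
    return m;
}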
I cannot understand why dependency chain doesn't increase a whole throughput.
I have no idea what this sentence means. I think I was able to figure out all your other mixed up words and phrasing. (e.g. "chain dependency" instead of "dependency chain".) Have a look at my edits to your question; some of them might help your understanding, too.

Performance of modern processor

Being executed on a modern processor (AMD Phenom II 1090T), how many clock ticks does the following code consume, more likely: 3 or 11?
label: mov (%rsi), %rax
adc %rax, (%rdx)
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
jnz label
The problem is that when I execute many iterations of this code, the results hover near either 3 or 11 ticks per iteration from time to time, and I can't decide "who is who".
UPD
According to the table of instruction latencies (PDF), my piece of code takes at least 10 clock cycles on the AMD K10 microarchitecture. Therefore, the impossible 3 ticks per iteration must be caused by bugs in measurement.
SOLVED
#Atom noticed that clock frequency isn't constant in modern processors. When I disabled three BIOS options (Core Performance Boost, AMD C1E Support and AMD K8 Cool&Quiet Control), the consumption of my "six instructions" stabilized at 3 clock ticks :-)
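The same pitfall applies to any ticks-based conversion: on CPUs with an invariant TSC, __rdtsc counts at a fixed reference rate, not at the boosted or throttled core clock. A sketch of such a measurement (the add chain is just a stand-in for the six-instruction loop; x86intrin.h is GCC/Clang-specific):

#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    const long iters = 100000000L;
    unsigned long long x = 1;
    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < iters; i++)
        asm volatile("add $1, %0" : "+r"(x));  /* serial stand-in chain */
    unsigned long long t1 = __rdtsc();
    /* TSC ticks per iteration equal core cycles only when boost and
       power-saving states are disabled, as discovered above */
    printf("%.2f TSC ticks per iteration\n", (double)(t1 - t0) / iters);
    return 0;
}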
I won't try to answer with certainty how many cycles (3 or 10) it will take to run each iteration, but I'll explain how it might be possible to get 3 cycles per iteration.
(Note that this is for processors in general and I make no references specific to AMD processors.)
Key Concepts:
Out of Order Execution
Register Renaming
Most modern (non-embedded) processors today are both superscalar and out-of-order. Not only can they execute multiple (independent) instructions in parallel, but they can also re-order instructions to break dependencies and such.
Let's break down your example:
label:
mov (%rsi), %rax
adc %rax, (%rdx)
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
jnz label
The first thing to notice is that the last 3 instructions before the branch are all independent:
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
So it's possible for a processor to execute all 3 of these in parallel.
Another thing is this:
adc %rax, (%rdx)
lea 8(%rdx), %rdx
There seems to be a dependency on rdx that prevents the two from running in parallel. But in reality, this is a false dependency, because the second instruction doesn't actually depend on the output of the first instruction. Modern processors are able to rename the rdx register to allow these two instructions to be re-ordered or executed in parallel.
Same applies to the rsi register between:
mov (%rsi), %rax
lea 8(%rsi), %rsi
So in the end, 3 cycles is (potentially) achievable as follows (this is just one of several possible orderings):
1: mov (%rsi), %rax lea 8(%rdx), %rdx lea 8(%rsi), %rsi
2: adc %rax, (%rdx) dec %ecx
3: jnz label
*Of course, I'm over-simplifying things. In reality the latencies are probably longer and there's overlap between different iterations of the loop.
In any case, this could explain how it's possible to get 3 cycles. As for why you sometimes get 10 cycles, there could be a ton of reasons for that: branch misprediction, some random pipeline bubble...
At Intel, Dr. David Levinthal's "Performance Analysis Guide" investigates the answers to such questions in great detail.
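For reference, the loop in the question is one word-at-a-time pass of a multi-precision add; in C it corresponds roughly to this sketch using the _addcarry_u64 intrinsic (GCC/Clang/MSVC, via immintrin.h):

#include <immintrin.h>

/* dst[i] += src[i] with the carry propagated across words, matching
   mov (%rsi),%rax / adc %rax,(%rdx) / advance pointers / dec-jnz */
void bignum_add(unsigned long long *dst, const unsigned long long *src,
                unsigned count) {
    unsigned char carry = 0;
    for (unsigned i = 0; i < count; i++)
        carry = _addcarry_u64(carry, dst[i], src[i], &dst[i]);
}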
