Why is LOOP faster than DEC,JNZ on 8086?

Why is LOOP faster than DEC,JNZ on 8086? - performance

My professor claimed that LOOP is faster on 8086 because only one instruction is fetched instead of two, like in dec cx, jnz. So I think we are saving time by avoiding the extra fetch and decode per iteration.
But earlier in the lecture, he also mentioned that LOOP does the same stuff as DEC, JNZ under the hood, and I presume that its decoding should also be more complex, so the speed difference should kind of balance out. Then, why is the LOOP instruction faster? I went through this post, and the answers there pertain to processors more modern than 8086, although one of the answers (and the page it links) does point out that on 8088 (closely related to 8086), LOOP is faster.
Later, the professor used the same reasoning to explain why rep string operations might be faster than LOOP + individual movement instructions, but since I was not entirely convinced with the previous approach, I asked this question here.

It's not decode that's the problem, it's usually fetch on 8086.
Starting two separate instruction-decode operations probably is more expensive than just fetching more microcode for one loop instruction. I'd guess that's what accounts for the numbers in the table below that don't include code-fetch bottlenecks.
Equally or more importantly, 8086 is often bottlenecked by memory access, including code-fetch. (8088 almost always is, breathing through a straw with it's 8-bit bus, unlike 8086's 16-bit bus).
dec cx is 1 byte, jnz rel8 is 2 bytes.
So 3 bytes total, vs. 2 for loop rel8.
8086 performance can be approximated by counting memory accesses and multiply by four, since its 6-byte instruction prefetch buffer allows it to overlap code-fetch with decode and execution of other instructions. (Except for very slow instructions like mul that would let the buffer fill up after at most three 2-byte fetches.)
See also Increasing Efficiency of binary -> gray code for 8086 for an example of optimizing something for 8086, with links to more resources like tables of instruction timings.
https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html has instruction timings for 8086 (taken from Intel manuals I think, as cited in njuffa's answer), but those are only execution, when fetch isn't a bottleneck. (i.e. just decoding from the prefetch buffer.)
Decode / execute timings, not including fetch:
DEC Decrement
operand bytes 8088 186 286 386 486 Pentium
r8 2 3 3 2 2 1 1 UV
r16 1 3 3 2 2 1 1 UV
r32 1 3 3 2 2 1 1 UV
mem 2+d(0,2) 23+EA 15 7 6 3 3 UV
Jcc Jump on condition code
operand bytes 8088 186 286 386 486 Pentium
near8 2 4/16 4/13 3/7+m 3/7+m 1/3 1 PV
near16 3 - - - 3/7+m 1/3 1 PV
LOOP Loop control with CX counter
operand bytes 8088 186 286 386 486 Pentium
short 2 5/17 5/15 4/8+m 11+m 6/7 5/6 NP
So even ignoring code-fetch differences:
dec + taken jnz takes 3 + 16 = 19 cycles to decode / exec on 8086 / 8088.
taken loop takes 17 cycles to decode / exec on 8086 / 8088.
(Taken branches are slow on 8086, and discard the prefetch buffer; there's no branch prediction. IDK if those timings include any of that penalty, since they apparently don't for other instructions and non-taken branches.)
8088/8086 are not pipelined except for the code-prefetch buffer. Finishing execution of one instruction and starting decode / exec of the next take it some time; even the cheapest instructions (like mov reg,reg / shift / rotate / stc/std / etc.) take 2 cycles. Bizarrely more than nop (3 cycles).

I presume that its decoding should also be more complex
There's no reason that the decoding is more complex for the loop instruction.  This instruction has to do multiple things, but decoding is not at issue — it should decode as easily as JMP, since there's just the opcode and the one operand, the branch target, like JMP.
Saving one instruction's fetch & decode probably accounts for the speed improvement, since in execution they are effectively equivalent.

Looking at the "8086/8088 User's Manual: Programmer's and Hardware Reference" (Intel 1989) confirms that LOOP is marginally faster than the combination DEC CX; JNZ. DEC takes 3 clock cycles, JNZ takes 4 (not taken) or 16 (taken) cycles. So the combination requires 7 or 19 cycles. LOOP on the other hand requires 5 cycles (not taken) or 17 cycles (taken), for a saving of 2 cycles.
I do not see anything in the manual that describes why LOOP is faster. The faster instruction fetch due to the reduced number of opcode bytes seems like a reasonable hypothesis.
According to the "80286 and 80287 Programmer's Reference Manual" (Intel 1987), LOOP still has a slight advantage over the discrete replacement, in that it requires 8 cycles when taken and 4 cycles when not taken, while the combo requires 1 cycle more in both cases (DEC 2 cycles; JNZ 7 or 3 cycles).
The 8086 microcode has been disassembled, so one could theoretically take a look at the internal sequence of operations for both of these cases to establish exactly why LOOP is faster, if one is so inclined.

Related

Is there a penalty when base+offset is in a different page than the base?

The execution times for these three snippets:
pageboundary: dq (pageboundary + 8)
...
mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx - 8]
sub ecx, 1
jnz .loop
And this:
pageboundary: dq (pageboundary - 8)
...
mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx + 8]
sub ecx, 1
jnz .loop
And this:
pageboundary: dq (pageboundary - 4096)
...
mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx + 4096]
sub ecx, 1
jnz .loop
Are, on a 4770K, roughly 5 cycles per iteration for the first snippet and roughly 9 cycles per iteration for the second snippet, then 5 cycles for the third snippet. They both access the exact same address, which is 4K-aligned. In the second snippet, only the address calculation crosses the page boundary: rdx and rdx + 8 don't belong to the same page, the load is still aligned. With a large offset it's back to 5 cycles again.
How does this effect work in general?
Routing the result from the load through an ALU instruction like this:
.loop:
mov rdx, [rdx + 8]
or rdx, 0
sub ecx, 1
jnz .loop
Makes it take 6 cycles per iteration, which makes sense as 5+1. Reg+8 should be a special fast load and AFAIK take 4 cycles, so even in this case there seems to be some penalty, but only 1 cycle.
A test like this was used in response to some of the comments:
.loop:
lfence
; or rdx, 0
mov rdx, [rdx + 8]
; or rdx, 0
; uncomment one of the ORs
lfence
sub ecx, 1
jnz .loop
Putting the or before the mov makes the loop faster than without any or, putting the or after the mov makes it a cycle slower.

Optimization rule: in pointer-connected data structures like linked-lists / trees, put the next or left/right pointers in the first 16 bytes of the object. malloc typically returns 16-byte aligned blocks (alignof(maxalign_t)), so this will ensure the linking pointers are in the same page as the start of the object.
Any other way of ensuring that important struct members are in the same page as the start of the object will also work.
Sandybridge-family normally has 5 cycle L1d load-use latency, but there's a special case for pointer-chasing with small positive displacements with base+disp addressing modes.
Sandybridge-family has 4 cycle load-use latency for [reg + 0..2047] addressing modes, when the base reg is the result of a mov load, not an ALU instruction. Or a penalty if reg+disp is in a different page than reg.
Based on these test results on Haswell and Skylake (and probably original SnB but we don't know), it appears that all of the following conditions must be true:
base reg comes from another load. (A rough heuristic for pointer-chasing, and usually means that load latency is probably part of a dep chain). If objects are usually allocated not crossing a page boundary, then this is a good heuristic. (The HW can apparently detect which execution unit the input is being forwarded from.)
Addressing mode is [reg] or [reg+disp8/disp32]. (Or an indexed load with an xor-zeroed index register! Usually not practically useful, but might provide some insight into the issue/rename stage transforming load uops.)
displacement < 2048. i.e. all bits above bit 11 are zero (a condition HW can check without a full integer adder/comparator.)
(Skylake but not Haswell/Broadwell): the last load wasn't a retried-fastpath. (So base = result of a 4 or 5 cycle load, it will attempt the fast path. But base = result of a 10 cycle retried load, it won't. The penalty on SKL seems to be 10, vs. 9 on HSW).
I don't know if it's the last load attempted on that load port that matters, or if it's actually what happened to the load that produced that input. Perhaps experiments chasing two dep chains in parallel could shed some light; I've only tried one pointer chasing dep chain with a mix of page-changing and non-page-changing displacements.
If all those things are true, the load port speculates that the final effective address will be in the same page as the base register. This is a useful optimization in real cases when load-use latency forms a loop-carried dep chain, like for a linked list or binary tree.
microarchitectural explanation (my best guess at explaining the result, not from anything Intel published):
It seems that indexing the L1dTLB is on the critical path for L1d load latency. Starting that 1 cycle early (without waiting for the output of an adder to calculate the final address) shaves a cycle off the full process of indexing L1d using the low 12 bits of the address, then comparing the 8 tags in that set against the high bits of the physical address produced by the TLB. (Intel's L1d is VIPT 8-way 32kiB, so it has no aliasing problems because the index bits all come from the low 12 bits of the address: the offset within a page which is the same in both the virtual and physical address. i.e. the low 12 bits translate for free from virt to phys.)
Since we don't find an effect for crossing 64-byte boundaries, we know the load port is adding the displacement before indexing the cache.
As Hadi suggests, it seems likely that if there's carry-out from bit 11, the load port lets the wrong-TLB load complete and then redoes it using the normal path. (On HSW, the total load latency = 9. On SKL the total load latency can be 7.5 or 10).
Aborting right away and retrying on the next cycle (to make it 5 or 6 cycles instead of 9) would in theory be possible, but remember that the load ports are pipelined with 1 per clock throughput. The scheduler is expecting to be able to send another uop to the load port in the next cycle, and Sandybridge-family standardizes latencies for everything of 5 cycles and shorter. (There are no 2-cycle instructions).
I didn't test if 2M hugepages help, but probably not. I think the TLB hardware is simple enough that it couldn't recognize that a 1-page-higher index would still pick the same entry. So it probably does the slow retry any time the displacement crosses a 4k boundary, even if that's in the same hugepage. (Page-split loads work this way: if the data actually crosses a 4k boundary (e.g. 8-byte load from page-4), you pay the page-split penalty not just the cache-line split penalty, regardless of hugepages)
Intel's optimization manual documents this special case in section 2.4.5.2 L1 DCache (in the Sandybridge section), but doesn't mention any different-page limitation, or the fact that it's only for pointer-chasing, and doesn't happen when there's an ALU instruction in the dep chain.
(Sandybridge)
Table 2-21. Effect of Addressing Modes on Load Latency
-----------------------------------------------------------------------
Data Type | Base + Offset > 2048 | Base + Offset < 2048
| Base + Index [+ Offset] |
----------------------+--------------------------+----------------------
Integer | 5 | 4
MMX, SSE, 128-bit AVX | 6 | 5
X87 | 7 | 6
256-bit AVX | 7 | 7
(remember, 256-bit loads on SnB take 2 cycles in the load port, unlike on HSW/SKL)
The text around this table also doesn't mention the limitations that exist on Haswell/Skylake, and may also exist on SnB (I don't know).
Maybe Sandybridge doesn't have those limitations and Intel didn't document the Haswell regression, or else Intel just didn't document the limitations in the first place. The table is pretty definite about that addressing mode always being 4c latency with offset = 0..2047.
#Harold's experiment of putting an ALU instruction as part of the load/use pointer-chasing dependency chain confirms that it's this effect that's causing the slowdown: an ALU insn decreased the total latency, effectively giving an instruction like and rdx, rdx negative incremental latency when added to the mov rdx, [rdx-8] dep chain in this specific page-crossing case.
Previous guesses in this answer included the suggestion that using the load result in an ALU vs. another load was what determined the latency. That would be super weird and require looking into the future. That was a wrong interpretation on my part of the effect of adding an ALU instruction into the loop. (I hadn't known about the 9-cycle effect on page crossing, and was thinking that the HW mechanism was a forwarding fast-path for the result inside the load port. That would make sense.)
We can prove that it's the source of the base reg input that matters, not the destination of the load result: Store the same address at 2 separate locations, before and after a page boundary. Create a dep chain of ALU => load => load, and check that it's the 2nd load that's vulnerable to this slowdown / able to benefit from the speedup with a simple addressing mode.
%define off 16
lea rdi, [buf+4096 - 16]
mov [rdi], rdi
mov [rdi+off], rdi
mov ebp, 100000000
.loop:
and rdi, rdi
mov rdi, [rdi] ; base comes from AND
mov rdi, [rdi+off] ; base comes from a load
dec ebp
jnz .loop
... sys_exit_group(0)
section .bss
align 4096
buf: resb 4096*2
Timed with Linux perf on SKL i7-6700k.
off = 8, the speculation is correct and we get total latency = 10 cycles = 1 + 5 + 4. (10 cycles per iteration).
off = 16, the [rdi+off] load is slow, and we get 16 cycles / iter = 1 + 5 + 10. (The penalty seems to be higher on SKL than HSW)
With the load order reversed (doing the [rdi+off] load first), it's always 10c regardless of off=8 or off=16, so we've proved that mov rdi, [rdi+off] doesn't attempt the speculative fast-path if its input is from an ALU instruction.
Without the and, and off=8, we get the expected 8c per iter: both use the fast path. (#harold confirms HSW also gets 8 here).
Without the and, and off=16, we get 15c per iter: 5+10. The mov rdi, [rdi+16] attempts the fast path and fails, taking 10c. Then mov rdi, [rdi] doesn't attempt the fast-path because its input failed. (#harold's HSW takes 13 here: 4 + 9. So that confirms HSW does attempt the fast-path even if the last fast-path failed, and that the fast-path fail penalty really is only 9 on HSW vs. 10 on SKL)
It's unfortunate that SKL doesn't realize that [base] with no displacement can always safely use the fast path.
On SKL, with just mov rdi, [rdi+16] in the loop, the average latency is 7.5 cycles. Based on tests with other mixes, I think it alternates between 5c and 10c: after a 5c load that didn't attempt the fast path, the next one does attempt it and fails, taking 10c. That makes the next load use the safe 5c path.
Adding a zeroed index register actually speeds it up in this case where we know the fast-path is always going to fail. Or using no base register, like [nosplit off + rdi*1], which NASM assembles to 48 8b 3c 3d 10 00 00 00 mov rdi,QWORD PTR [rdi*1+0x10]. Notice that this requires a disp32, so it's bad for code size.
Also beware that indexed addressing modes for micro-fused memory operands are un-laminated in some cases, while base+disp modes aren't. But if you're using pure loads (like mov or vbroadcastss), there's nothing inherently wrong with an indexed addressing mode. Using an extra zeroed register isn't great, though.
On Ice Lake, this special 4 cycle fast path for pointer chasing loads is gone: GP register loads that hit in L1 now generally take 5 cycles, with no difference based on the presence of indexing or the size of the offset.

I've conducted a sufficient number of experiments on Haswell to determine exactly when memory loads are issued speculatively before the effective address is fully calculated. These results also confirm Peter's guess.
I've varied the following parameters:
The offset from pageboundary. The offset used is the same in the definition of pageboundary and the load instruction.
The sign of the offset is either + or -. The sign used in the definition is always the opposite of the one used in the load instruction.
The alignment of pageboundary within the executable binary.
In all of the following graphs, the Y axis represents the load latency in core cycles. The X axis represents the configuration in the form NS1S2, where N is the offset, S1 is the sign of the offset used in the definition, and S2 is the sign used in the load instruction.
The following graph shows that loads are issued before calculating the effective address only when the offset is positive or zero. Note that for all of the offsets between 0-15, the base address and the effective address used in the load instruction are both within the same 4K page.
The next graph shows the point where this pattern changes. The change occurs at offset 213, which is the smallest offset where the base address and the effective address used in the load instruction are both within different 4K pages.
Another important observation that can be made from the previous two graphs is that even if the base address points to a different cache set than the effective address, no penalty is incurred. So it seems that the cache set is opened after calculating the effective address. This indicates that the L1 DTLB hit latency is 2 cycles (that is, it takes 2 cycles for the L1D to receive the tag), but it takes only 1 cycle to open the cache's data array set and the cache's tag array set (which occurs in parallel).
The next graph shows what happens when pageboundary is aligned on a 4K page boundary. In this case, any offset that is not zero will make the base and effective addresses reside within different pages. For example, if the base address of pageboundary is 4096, then the base address of pageboundary used in the load instruction is 4096 - offset, which is obviously in a different 4K page for any non-zero offset.
The next graph shows that the pattern changes again starting from offset 2048. At this point, loads are never issued before calculating the effective address.
This analysis can be confirmed by measuring the number of uops dispatched to the load ports 2 and 3. The total number of retired load uops is 1 billion (equal to the number of iterations). However, when the measured load latency is 9 cycles, the number of load uops dispatched to each of the two ports is 1 billion. Also when the load latency is 5 or 4 cycles, the number of load uops dispatched to each of the two ports is 0.5 billion. So something like this would be happening:
The load unit checks whether the offset is non-negative and smaller than 2048. In that case, it will issue a data load request using the base address. It will also begin calculating the effective address.
In the next cycle, the effective address calculation is completed. If it turns out that the load is to a different 4K page, the load unit waits until the issued load completes and then it discards the results and replays the load. Either way, it supplies the data cache with the set index and line offset.
In the next cycle, the tag comparison is performed and the data is forwarded to the load buffer. (I'm not sure whether the address-speculative load will be aborted in the case of a miss in the L1D or the DTLB.)
In the next cycle, the load buffer receives the data from the cache. If it's supposed to discard the data, it's discarded and it tells the dispatcher to replay the load with address speculation disabled for it. Otherwise, the data is written back. If a following instruction requires the data for its address calculation, it will receive the data in the next cycle (so it will be dispatched in the next cycle if all of its other operands are ready).
These steps explain the observed 4, 5, and 9 cycle latencies.
It might happen that the target page is a hugepage. The only way for the load unit to know whether the base address and the effective address point to the same page when using hugepages is to have the TLB supply the load unit with the size of the page being accessed. Then the load unit has to check whether the effective address is within that page. In modern processors, on a TLB miss, dedicated page-walk hardware is used. In this case, I think that the load unit will not supply the cache set index and cache line offset to the data cache and will use the actual effective address to access the TLB. This requires enabling the page-walk hardware to distinguish between loads with speculative addresses and other loads. Only if that other access missed the TLB will the page walk take place. Now if the target page turned out to be a hugepage and it's a hit in the TLB, it might be possible to inform the load unit that the size of the page is larger than 4K or maybe even of the exact size of the page. The load unit can then make a better decision regarding whether the load should be replayed. However, this logic should take no more than the time for the (potentially wrong) data to reach the load buffer allocated for the load. I think this time is only one cycle.

Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths

I was playing with the code in this answer, slightly modifying it:
BITS 64
GLOBAL _start
SECTION .text
_start:
mov ecx, 1000000
.loop:
;T is a symbol defined with the CLI (-DT=...)
TIMES T imul eax, eax
lfence
TIMES T imul edx, edx
dec ecx
jnz .loop
mov eax, 60 ;sys_exit
xor edi, edi
syscall
Without the lfence I the results I get are consistent with the static analysis in that answer.
When I introduce a single lfence I'd expect the CPU to execute the imul edx, edx sequence of the k-th iteration in parallel with the imul eax, eax sequence of the next (k+1-th) iteration.
Something like this (calling A the imul eax, eax sequence and D the imul edx, edx one):
|
| A
| D A
| D A
| D A
| ...
| D A
| D
|
V time
Taking more or less the same number of cycles but for one unpaired parallel execution.
When I measure the number of cycles, for the original and modified version, with taskset -c 2 ocperf.py stat -r 5 -e cycles:u '-x ' ./main-$T for T in the range below I get
T Cycles:u Cycles:u Delta
lfence no lfence
10 42047564 30039060 12008504
15 58561018 45058832 13502186
20 75096403 60078056 15018347
25 91397069 75116661 16280408
30 108032041 90103844 17928197
35 124663013 105155678 19507335
40 140145764 120146110 19999654
45 156721111 135158434 21562677
50 172001996 150181473 21820523
55 191229173 165196260 26032913
60 221881438 180170249 41711189
65 250983063 195306576 55676487
70 281102683 210255704 70846979
75 312319626 225314892 87004734
80 339836648 240320162 99516486
85 372344426 255358484 116985942
90 401630332 270320076 131310256
95 431465386 285955731 145509655
100 460786274 305050719 155735555
How can the values of Cycles:u lfence be explained?
I would have expected them to be similar to those of Cycles:u no lfence since a single lfence should prevent only the first iteration from being executed in parallel for the two blocks.
I don't think it's due to the lfence overhead as I believe that should be constant for all Ts.
I'd like to fix what's wrong with my forma mentis when dealing with the static analysis of code.
Supporting repository with source files.

I think you're measuring accurately, and the explanation is microarchitectural, not any kind of measurement error.
I think your results for mid to low T support the conclusion that lfence stops the front-end from even issuing past the lfence until all earlier instructions retire, rather than having all the uops from both chains already issued and just waiting for lfence to flip a switch and let multiplies from each chain start to dispatch on alternating cycles.
(port1 would get edx,eax,empty,edx,eax,empty,... for Skylake's 3c latency / 1c throughput multiplier right away, if lfence didn't block the front-end, and overhead wouldn't scale with T.)
You're losing imul throughput when only uops from the first chain are in the scheduler because the front-end hasn't chewed through the imul edx,edx and loop branch yet. And for the same number of cycles at the end of the window when the pipeline is mostly drained and only uops from the 2nd chain are left.
The overhead delta looks linear up to about T=60. I didn't run the numbers, but the slope up to there looks reasonable for T * 0.25 clocks to issue the first chain vs. 3c-latency execution bottleneck. i.e. delta growing maybe 1/12th as fast as total no-lfence cycles.
So (given the lfence overhead I measured below), with T<60:
no_lfence cycles/iter ~= 3T # OoO exec finds all the parallelism
lfence cycles/iter ~= 3T + T/4 + 9.3 # lfence constant + front-end delay
delta ~= T/4 + 9.3
#Margaret reports that T/4 is a better fit than 2*T / 4, but I would have expected T/4 at both the start and end, for a total of 2T/4 slope of the delta.
After about T=60, delta grows much more quickly (but still linearly), with a slope about equal to the total no-lfence cycles, thus about 3c per T. I think at that point, the scheduler (Reservation Station) size is limiting the out-of-order window. You probably tested on a Haswell or Sandybridge/IvyBridge, (which have a 60-entry or 54-entry scheduler respectively. Skylake's is 97 entry (but not fully unified; IIRC BeeOnRope's testing showed that not all the entries could be used for any type of uop. Some were specific to load and/or store, for example.)
The RS tracks un-executed uops. Each RS entry holds 1 unfused-domain uop that's waiting for its inputs to be ready, and its execution port, before it can dispatch and leave the RS1.
After an lfence, the front-end issues at 4 per clock while the back-end executes at 1 per 3 clocks, issuing 60 uops in ~15 cycles, during which time only 5 imul instructions from the edx chain have executed. (There's no load or store micro-fusion here, so every fused-domain uop from the front-end is still only 1 unfused-domain uop in the RS2.)
For large T the RS quickly fills up, at which point the front-end can only make progress at the speed of the back-end. (For small T, we hit the next iteration's lfence before that happens, and that's what stalls the front-end). When T > RS_size, the back-end can't see any of the uops from the eax imul chain until enough back-end progress through the edx chain has made room in the RS. At that point, one imul from each chain can dispatch every 3 cycles, instead of just the 1st or 2nd chain.
Remember from the first section that time spent just after lfence only executing the first chain = time just before lfence executing only the second chain. That applies here, too.
We get some of this effect even with no lfence, for T > RS_size, but there's opportunity for overlap on both sides of a long chain. The ROB is at least twice the size of the RS, so the out-of-order window when not stalled by lfence should be able to keep both chains in flight constantly even when T is somewhat larger than the scheduler capacity. (Remember that uops leave the RS as soon as they've executed. I'm not sure if that means they have to finish executing and forward their result, or merely start executing, but that's a minor difference here for short ALU instructions. Once they're done, only the ROB is holding onto them until they retire, in program order.)
The ROB and register-file shouldn't be limiting the out-of-order window size (http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/) in this hypothetical situation, or in your real situation. They should both be plenty big.
Blocking the front-end is an implementation detail of lfence on Intel's uarches. The manual only says that later instructions can't execute. That wording would allow the front-end to issue/rename them all into the scheduler (Reservation Station) and ROB while lfence is still waiting, as long as none are dispatched to an execution unit.
So a weaker lfence would maybe have flat overhead up to T=RS_size, then the same slope as you see now for T>60. (And the constant part of the overhead might be lower.)
Note that guarantees about speculative execution of conditional/indirect branches after lfence apply to execution, not (as far as I know) to code-fetch. Merely triggering code-fetch is not (AFAIK) useful to a Spectre or Meltdown attack. Possibly a timing side-channel to detect how it decodes could tell you something about the fetched code...
I think AMD's LFENCE is at least as strong on actual AMD CPUs, when the relevant MSR is enabled. (Is LFENCE serializing on AMD processors?).
Extra lfence overhead:
Your results are interesting, but it doesn't surprise me at all that there's significant constant overhead from lfence itself (for small T), as well as the component that scales with T.
Remember that lfence doesn't allow later instructions to start until earlier instructions have retired. This is probably at least a couple cycles / pipeline-stages later than when their results are ready for bypass-fowarding to other execution units (i.e. the normal latency).
So for small T, it's definitely significant that you add extra latency into the chain by requiring the result to not only be ready, but also written back to the register file.
It probably takes an extra cycle or so for lfence to allow the issue/rename stage to start operating again after detecting retirement of the last instruction before it. The issue/rename process takes multiple stages (cycles), and maybe lfence blocks at the start of this, instead of in the very last step before uops are added into the OoO part of the core.
Even back-to-back lfence itself has 4 cycle throughput on SnB-family, according to Agner Fog's testing. Agner Fog reports 2 fused-domain uops (no unfused), but on Skylake I measure it at 6 fused-domain (still no unfused) if I only have 1 lfence. But with more lfence back-to-back, it's fewer uops! Down to ~2 uops per lfence with many back-to-back, which is how Agner measures.
lfence/dec/jnz (a tight loop with no work) runs at 1 iteration per ~10 cycles on SKL, so that might give us an idea of the real extra latency that lfence adds to the dep chains even without the front-end and RS-full bottlenecks.
Measuring lfence overhead with only one dep chain, OoO exec being irrelevant:
.loop:
;mfence ; mfence here: ~62.3c (with no lfence)
lfence ; lfence here: ~39.3c
times 10 imul eax,eax ; with no lfence: 30.0c
; lfence ; lfence here: ~39.6c
dec ecx
jnz .loop
Without lfence, runs at the expected 30.0c per iter. With lfence, runs at ~39.3c per iter, so lfence effectively added ~9.3c of "extra latency" to the critical path dep chain. (And 6 extra fused-domain uops).
With lfence after the imul chain, right before the loop-branch, it's slightly slower. But not a whole cycle slower, so that would indicate that the front-end is issuing the loop-branch + and imul in a single issue-group after lfence allows execution to resume. That being the case, IDK why it's slower. It's not from branch misses.
Getting the behaviour you were expecting:
Interleave the chains in program order, like #BeeOnRope suggests in comments, doesn't require out-of-order execution to exploit the ILP, so it's pretty trivial:
.loop:
lfence ; at the top of the loop is the lowest-overhead place.
%rep T
imul eax,eax
imul edx,edx
%endrep
dec ecx
jnz .loop
You could put pairs of short times 8 imul chains inside a %rep to let OoO exec have an easy time.
Footnote 1: How the front-end / RS / ROB interact
My mental model is that the issue/rename/allocate stages in the front-end add new uops to both the RS and the ROB at the same time.
Uops leave the RS after executing, but stay in the ROB until in-order retirement. The ROB can be large because it's never scanned out-of-order to find the first-ready uop, only scanned in-order to check if the oldest uop(s) have finished executing and thus are ready to retire.
(I assume the ROB is physically a circular buffer with start/end indices, not a queue which actually copies uops to the right every cycle. But just think of it as a queue / list with a fixed max size, where the front-end adds uops at the front, and the retirement logic retires/commits uops from the end as long as they're fully executed, up to some per-cycle per-hyperthread retirement limit which is not usually a bottleneck. Skylake did increase it for better Hyperthreading, maybe to 8 per clock per logical thread. Perhaps retirement also means freeing physical registers which helps HT, because the ROB itself is statically partitioned when both threads are active. That's why retirement limits are per logical thread.)
Uops like nop, xor eax,eax, or lfence, which are handled in the front-end (don't need any execution units on any ports) are added only to the ROB, in an already-executed state. (A ROB entry presumably has a bit that marks it as ready to retire vs. still waiting for execution to complete. This is the state I'm talking about. For uops that did need an execution port, I assume the ROB bit is set via a completion port from the execution unit. And that the same completion-port signal frees its RS entry.)
Uops stay in the ROB from issue to retirement.
Uops stay in the RS from issue to execution. The RS can replay uops in a few cases, e.g. for the other half of a cache-line-split load, or if it was dispatched in anticipation of load data arriving, but in fact it didn't. (Cache miss or other conflicts like Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?) Or when a load port speculates that it can bypass the AGU before starting a TLB lookup to shorten pointer-chasing latency with small offsets - Is there a penalty when base+offset is in a different page than the base?
So we know that the RS can't remove a uop right as it dispatches, because it might need to be replayed. (Can happen even to non-load uops that consume load data.) But any speculation that needs replays is short-range, not through a chain of uops, so once a result comes out the other end of an execution unit, the uop can be removed from the RS. Probably this is part of what a completion port does, along with putting the result on the bypass forwarding network.
Footnote 2: How many RS entries does a micro-fused uop take?
TL:DR: P6-family: RS is fused, SnB-family: RS is unfused.
A micro-fused uop is issued to two separate RS entries in Sandybridge-family, but only 1 ROB entry. (Assuming it isn't un-laminated before issue, see section 2.3.5 for HSW or section 2.4.2.4 for SnB of Intel's optimization manual, and Micro fusion and addressing modes. Sandybridge-family's more compact uop format can't represent indexed addressing modes in the ROB in all cases.)
The load can dispatch independently, ahead of the other operand for the ALU uop being ready. (Or for micro-fused stores, either of the store-address or store-data uops can dispatch when its input is ready, without waiting for both.)
I used the two-dep-chain method from the question to experimentally test this on Skylake (RS size = 97), with micro-fused or edi, [rdi] vs. mov+or, and another dep chain in rsi. (Full test code, NASM syntax on Godbolt)
; loop body
%rep T
%if FUSE
or edi, [rdi] ; static buffers are in the low 32 bits of address space, in non-PIE
%else
mov eax, [rdi]
or edi, eax
%endif
%endrep
%rep T
%if FUSE
or esi, [rsi]
%else
mov eax, [rsi]
or esi, eax
%endif
%endrep
Looking at uops_executed.thread (unfused-domain) per cycle (or per second which perf calculates for us), we can see a throughput number that doesn't depend on separate vs. folded loads.
With small T (T=30), all the ILP can be exploited, and we get ~0.67 uops per clock with or without micro-fusion. (I'm ignoring the small bias of 1 extra uop per loop iteration from dec/jnz. It's negligible compared to the effect we'd see if micro-fused uops only used 1 RS entry)
Remember that load+or is 2 uops, and we have 2 dep chains in flight, so this is 4/6, because or edi, [rdi] has 6 cycle latency. (Not 5, which is surprising, see below.)
At T=60, we still have about 0.66 unfused uops executed per clock for FUSE=0, and 0.64 for FUSE=1. We can still find basically all the ILP, but it's just barely starting to dip, as the two dep chains are 120 uops long (vs. a RS size of 97).
At T=120, we have 0.45 unfused uops per clock for FUSE=0, and 0.44 for FUSE=1. We're definitely past the knee here, but still finding some of the ILP.
If a micro-fused uop took only 1 RS entry, FUSE=1 T=120 should be about the same speed as FUSE=0 T=60, but that's not the case. Instead, FUSE=0 or 1 makes nearly no difference at any T. (Including larger ones like T=200: FUSE=0: 0.395 uops/clock, FUSE=1: 0.391 uops/clock). We'd have to go to very large T before we start for the time with 1 dep-chain in flight to totally dominate the time with 2 in flight, and get down to 0.33 uops / clock (2/6).
Oddity: We have such a small but still measurable difference in throughput for fused vs. unfused, with separate mov loads being faster.
Other oddities: the total uops_executed.thread is slightly lower for FUSE=0 at any given T. Like 2,418,826,591 vs. 2,419,020,155 for T=60. This difference was repeatable down to +- 60k out of 2.4G, plenty precise enough. FUSE=1 is slower in total clock cycles, but most of the difference comes from lower uops per clock, not from more uops.
Simple addressing modes like [rdi] are supposed to only have 4 cycle latency, so load + ALU should be only 5 cycle. But I measure 6 cycle latency for the load-use latency of or rdi, [rdi], or with a separate MOV-load, or with any other ALU instruction I can never get the load part to be 4c.
A complex addressing mode like [rdi + rbx + 2064] has the same latency when there's an ALU instruction in the dep chain, so it appears that Intel's 4c latency for simple addressing modes only applies when a load is forwarding to the base register of another load (with up to a +0..2047 displacement and no index).
Pointer-chasing is common enough that this is a useful optimization, but we need to think of it as a special load-load forwarding fast-path, not as a general data ready sooner for use by ALU instructions.
P6-family is different: an RS entry holds a fused-domain uop.
#Hadi found an Intel patent from 2002, where Figure 12 shows the RS in the fused domain.
Experimental testing on a Conroe (first gen Core2Duo, E6600) shows that there's a large difference between FUSE=0 and FUSE=1 for T=50. (The RS size is 32 entries).
T=50 FUSE=1: total time of 2.346G cycles (0.44IPC)
T=50 FUSE=0: total time of 3.272G cycles (0.62IPC = 0.31 load+OR per clock). (perf / ocperf.py doesn't have events for uops_executed on uarches before Nehalem or so, and I don't have oprofile installed on that machine.)
T=24 there's a negligible difference between FUSE=0 and FUSE=1, around 0.47 IPC vs 0.9 IPC (~0.45 load+OR per clock).
T=24 is still over 96 bytes of code in the loop, too big for Core 2's 64-byte (pre-decode) loop buffer, so it's not faster because of fitting in a loop buffer. Without a uop-cache, we have to be worried about the front-end, but I think we're fine because I'm exclusively using 2-byte single-uop instructions that should easily decode at 4 fused-domain uops per clock.

I'll present an analysis for the case where T = 1 for both codes (with and without lfence). You can then extend this for other values of T. You can refer to Figure 2.4 of the Intel Optimization Manual for a visual.
Because there is only a single easily predicted branch, the frontend will only stall if the backend stalled. The frontend is 4-wide in Haswell, which means up to 4 fused uops can be issued from the IDQ (instruction decode queue, which is just a queue that holds in-order fused-domain uops, also called the uop queue) to the reservation station (RS) entires of the scheduler. Each imul is decoded into a single uop that cannot be fused. The instructions dec ecx and jnz .loop get macrofused in the frontend to a single uop. One of the differences between microfusion and macrofusion is that when the scheduler dispatches a macrofused uop (that are not microfused) to the execution unit it's assigned to, it gets dispatched as a single uop. In contrast, a microfused uop needs to be split into its constituent uops, each of which must be separately dispatched to an execution unit. (However, splitting microfused uops happens on entrance to the RS, not on dispatch, see Footnote 2 in #Peter's answer). lfence is decoded into 6 uops. Recognizing microfusion only matters in the backend, and in this case, there is no microfusion in the loop.
Since the loop branch is easily predictable and since the number of iterations is relatively large, we can just assume without compromising accuracy that the allocator will always be able to allocate 4 uops per cycle. In other words, the scheduler will receive 4 uops per cycle. Since there is no micorfusion, each uop will be dispatched as a single uop.
imul can only be executed by the Slow Int execution unit (see Figure 2.4). This means that the only choice for executing the imul uops is to dispatch them to port 1. In Haswell, the Slow Int is nicely pipelined so that a single imul can be dispatched per cycle. But it takes three cycles for the result of the multiplication be available for any instruction that requires (the writeback stage is the third cycle from the dispatch stage of the pipeline). So for each dependence chain, at most one imul can be dispatched per 3 cycles.
Becausedec/jnz is predicted taken, the only execution unit that can execute it is Primary Branch on port 6.
So at any given cycle, as long as the RS has space, it will receive 4 uops. But what kind of uops? Let's examine the loop without lfence:
imul eax, eax
imul edx, edx
dec ecx/jnz .loop (macrofused)
There are two possibilities:
Two imuls from the same iteration, one imul from a neighboring iteration, and one dec/jnz from one of those two iterations.
One dec/jnz from one iteration, two imuls from the next iteration, and one dec/jnz from the same iteration.
So at the beginning of any cycle, the RS will receive at least one dec/jnz and at least one imul from each chain. At the same time, in the same cycle and from those uops that are already there in the RS, the scheduler will do one of two actions:
Dispatch the oldest dec/jnz to port 6 and dispatch the oldest imul that is ready to port 1. That's a total of 2 uops.
Because the Slow Int has a latency of 3 cycles but there are only two chains, for each cycle of 3 cycles, no imul in the RS will be ready for execution. However, there is always at least one dec/jnz in the RS. So the scheduler can dispatch that. That's a total of 1 uop.
Now we can calculate the expected number of uops in the RS, XN, at the end of any given cycle N:
XN = XN-1 + (the number of uops to be allocated in the RS at the beginning of cycle N) - (the expected number of uops that will be dispatched at the beginning of cycle N)
= XN-1 + 4 - ((0+1)*1/3 + (1+1)*2/3)
= XN-1 + 12/3 - 5/3
= XN-1 + 7/3 for all N > 0
The initial condition for the recurrence is X0 = 4. This is a simple recurrence that can be solved by unfolding XN-1.
XN = 4 + 2.3 * N for all N >= 0
The RS in Haswell has 60 entries. We can determine the first cycle in which the RS is expected to become full:
60 = 4 + 7/3 * N
N = 56/2.3 = 24.3
So at the end of cycle 24.3, the RS is expected to be full. This means that at the beginning of cycle 25.3, the RS will not be able to receive any new uops. Now the number of iterations, I, under consideration determines how you should proceed with the analysis. Since a dependency chain will require at least 3*I cycles to execute, it takes about 8.1 iterations to reach cycle 24.3. So if the number of iterations is larger than 8.1, which is the case here, you need to analyze what happens after cycle 24.3.
The scheduler dispatches instructions at the following rates every cycle (as discussed above):
1
2
2
1
2
2
1
2
.
.
But the allocator will not allocate any uops in the RS unless there are at least 4 available entries. Otherwise, it will not waste power on issuing uops at a sub-optimal throughput. However, it is only at the beginning of every 4th cycle are there at least 4 free entries in the RS. So starting from cycle 24.3, the allocator is expected to get stalled 3 out of every 4 cycles.
Another important observation for the code being analyzed is that it never happens that there are more than 4 uops that can be dispatched, which means that the average number of uops that leave their execution units per cycle is not larger than 4. At most 4 uops can be retired from the ReOrder Buffer (ROB). This means that the ROB can never be on the critical path. In other words, performance is determined by the dispatch throughput.
We can calculate the IPC (instructions per cycles) fairly easily now. The ROB entries look something like this:
imul eax, eax - N
imul edx, edx - N + 1
dec ecx/jnz .loop - M
imul eax, eax - N + 3
imul edx, edx - N + 4
dec ecx/jnz .loop - M + 1
The column to the right shows the cycles in which the instruction can be retired. Retirement happens in order and is bounded by the latency of the critical path. Here each dependency chain have the same path length and so both constitute two equal critical paths of length 3 cycles. So every 3 cycles, 4 instructions can be retired. So the IPC is 4/3 = 1.3 and the CPI is 3/4 = 0.75. This is much smaller than the theoretical optimal IPC of 4 (even without considering micro- and macro-fusion). Because retirement happens in-order, the retirement behavior will be the same.
We can check our analysis using both perf and IACA. I'll discuss perf. I've a Haswell CPU.
perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-nolfence
Performance counter stats for './main-1-nolfence' (10 runs):
30,01,556 cycles:u ( +- 0.00% )
40,00,005 instructions:u # 1.33 insns per cycle ( +- 0.00% )
0 RESOURCE_STALLS.ROB
23,42,246 UOPS_ISSUED.ANY ( +- 0.26% )
22,49,892 RESOURCE_STALLS.RS ( +- 0.00% )
0.001061681 seconds time elapsed ( +- 0.48% )
There are 1 million iterations each takes about 3 cycles. Each iteration contains 4 instructions and the IPC is 1.33.RESOURCE_STALLS.ROB shows the number of cycles in which the allocator was stalled due to a full ROB. This of course never happens. UOPS_ISSUED.ANY can be used to count the number of uops issued to the RS and the number of cycles in which the allocator was stalled (no specific reason). The first is straightforward (not shown in the perf output); 1 million * 3 = 3 million + small noise. The latter is much more interesting. It shows that about 73% of all time the allocator stalled due to a full RS, which matches our analysis. RESOURCE_STALLS.RS counts the number of cycles in which the allocator was stalled due to a full RS. This is close to UOPS_ISSUED.ANY because the allocator does not stall for any other reason (although the difference could be proportional to the number of iterations for some reason, I'll have to see the results for T>1).
The analysis of the code without lfence can be extended to determine what happens if an lfence was added between the two imuls. Let's check out the perf results first (IACA unfortunately does not support lfence):
perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-lfence
Performance counter stats for './main-1-lfence' (10 runs):
1,32,55,451 cycles:u ( +- 0.01% )
50,00,007 instructions:u # 0.38 insns per cycle ( +- 0.00% )
0 RESOURCE_STALLS.ROB
1,03,84,640 UOPS_ISSUED.ANY ( +- 0.04% )
0 RESOURCE_STALLS.RS
0.004163500 seconds time elapsed ( +- 0.41% )
Observe that the number of cycles has increased by about 10 million, or 10 cycles per iteration. The number of cycles does not tell us much. The number of retired instruction has increased by a million, which is expected. We already know that the lfence will not make instruction complete any faster, so RESOURCE_STALLS.ROB should not change. UOPS_ISSUED.ANY and RESOURCE_STALLS.RS are particularly interesting. In this output, UOPS_ISSUED.ANY counts cycles, not uops. The number of uops can also be counted (using cpu/event=0x0E,umask=0x1,name=UOPS_ISSUED.ANY/u instead of cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u) and has increased by 6 uops per iteration (no fusion). This means that an lfence that was placed between two imuls was decoded into 6 uops. The one million dollar question is now what these uops do and how they move around in the pipe.
RESOURCE_STALLS.RS is zero. What does that mean? This indicates that the allocator, when it sees an lfence in the IDQ, it stops allocating until all current uops in the ROB retire. In other words, the allocator will not allocate entries in the RS past an lfence until the lfence retires. Since the loop body contains only 3 other uops, the 60-entry RS will never be full. In fact, it will be always almost empty.
The IDQ in reality is not a single simple queue. It consists of multiple hardware structures that can operate in parallel. The number of uops an lfence requires depends on the exact design of the IDQ. The allocator, which also consists of many different hardware structures, when it see there is an lfence uops at the front of any of the structures of the IDQ, it suspends allocation from that structure until the ROB is empty. So different uops are usd with different hardware structures.
UOPS_ISSUED.ANY shows that the allocator is not issuing any uops for about 9-10 cycles per iteration. What is happening here? Well, one of the uses of lfence is that it can tell us how much time it takes to retire an instruction and allocate the next instruction. The following assembly code can be used to do that:
TIMES T lfence
The performance event counters will not work well for small values of T. For sufficiently large T, and by measuring UOPS_ISSUED.ANY, we can determine that it takes about 4 cycles to retire each lfence. That's because UOPS_ISSUED.ANY will be incremented about 4 times every 5 cycles. So after every 4 cycles, the allocator issues another lfence (it doesn't stall), then it waits for another 4 cycles, and so on. That said, instructions that produce results may require 1 or few more cycle to retire depending on the instruction. IACA always assume that it takes 5 cycles to retire an instruction.
Our loop looks like this:
imul eax, eax
lfence
imul edx, edx
dec ecx
jnz .loop
At any cycle at the lfence boundary, the ROB will contain the following instructions starting from the top of the ROB (the oldest instruction):
imul edx, edx - N
dec ecx/jnz .loop - N
imul eax, eax - N+1
Where N denotes the cycle number at which the corresponding instruction was dispatched. The last instruction that is going to complete (reach the writeback stage) is imul eax, eax. and this happens at cycle N+4. The allocator stall cycle count will be incremented during cycles, N+1, N+2, N+3, and N+4. However it will about 5 more cycles until imul eax, eax retires. In addition, after it retires, the allocator needs to clean up the lfence uops from the IDQ and allocate the next group of instructions before they can be dispatched in the next cycle. The perf output tells us that it takes about 13 cycles per iteration and that the allocator stalls (because of the lfence) for 10 out of these 13 cycles.
The graph from the question shows only the number of cycles for up to T=100. However, there is another (final) knee at this point. So it would be better to plot the cycles for up to T=120 to see the full pattern.

Is performance reduced when executing loops whose uop count is not a multiple of processor width?

I'm wondering how loops of various sizes perform on recent x86 processors, as a function of number of uops.
Here's a quote from Peter Cordes who raised the issue of non-multiple-of-4 counts in another question:
I also found that the uop bandwidth out of the loop buffer isn't a
constant 4 per cycle, if the loop isn't a multiple of 4 uops. (i.e.
it's abc, abc, ...; not abca, bcab, ...). Agner Fog's microarch doc
unfortunately wasn't clear on this limitation of the loop buffer.
The issue is about whether loops need to be a multiple of N uops to execute at maximum uop throughput, where N is the width of the processor. (i.e., 4 for recent Intel processors). There are a lot of complicating factors when talking about "width" and count uops, but I mostly want to ignore those. In particular, assume no micro or macro-fusion.
Peter gives the following example of a loop with 7 uops in its body:
A 7-uop loop will issue groups of 4|3|4|3|... I haven't tested larger
loops (that don't fit in the loop buffer) to see if it's possible for
the first instruction from the next iteration to issue in the same
group as the taken-branch to it, but I assume not.
More generally, the claim is that each iteration of a loop with x uops in its body will take at least ceil(x / 4) iterations, rather than simply x / 4.
Is this true for some or all recent x86-compatible processors?

I did some investigation with Linux perf to help answer this on my Skylake i7-6700HQ box, and Haswell results have been kindly provided by another user. The analysis below applies to Skylake, but it is followed by a comparison versus Haswell.
Other architectures may vary0, and to help sort it all out I welcome additional results. The source is available).
This question mostly deals with the front end, since on recent architectures it is the front end which imposes the hard limit of four fused-domain uops per cycle.
Summary of Rules for Loop Performance
First, I'll summarize the results in terms of a few "performance rules" to keep in mind when dealing with small loops. There are plenty of other performance rules as well - these are complementary to them (i.e., you probably don't break another rule to just to satisfy these ones). These rules apply most directly to Haswell and later architectures - see the other answer for an overview of the differences on earlier architectures.
First, count the number of macro-fused uops in your loop. You can use Agner's instruction tables to look this up directly for every instruction, except that an ALU uop and immediately follow branch will usually fuse together into a single uop. Then based on this count:
If the count is a multiple of 4, you're good: these loops execute optimally.
If the count is even and less than 32, you're good, except if it's 10 in which case you should unroll to another even number if you can.
For odd numbers you should try to unroll to an even number less than 32 or a multiple of 4, if you can.
For loops larger than 32 uops but less than 64, you might want to unroll if it isn't already a multiple of 4: with more than 64 uops you'll get efficient performance at any value on Sklyake and almost all values on Haswell (with a few deviations, possibly alignment related). The inefficiencies for these loops are still relatively small: the values to avoid most are 4N + 1 counts, followed by 4N + 2 counts.
Summary of Findings
For code served out of the uop cache, there are no apparent multiple-of-4 effects. Loops of any number of uops can be executed at a throughput of 4 fused-domain uops per cycle.
For code processed by the legacy decoders, the opposite is true: loop execution time is limited to integral number of cycles, and hence loops that are not a multiple of 4 uops cannot achieve 4 uops/cycle, as they waste some issue/execution slots.
For code issued from the loop stream detector (LSD), the situation is a mix of the two and is explained in more detail below. In general, loops less than 32 uops and with an even number of uops execute optimally, while odd-sized loops do not, and larger loops require a multiple-of-4 uop count to execute optimally.
What Intel Says
Intel actually has a note on this in their optimization manual, details in the other answer.
Details
As anyone well-versed recent x86-64 architectures knows, at any point the fetch and decode portion of the front end may be working in one several different modes, depending on the code size and other factors. As it turns out, these different modes all have different behaviors with respect to loop sizing. I'll cover them separately follow.
Legacy Decoder
The legacy decoder1 is the full machine-code-to-uops decoder that is used2 when the code doesn't fit in the uop caching mechanisms (LSD or DSB). The primary reason this would occur is if the code working set is larger than the uop cache (approximately ~1500 uops in the ideal case, less in practice). For this test though, we'll take advantage of the fact that the legacy decoder will also be used if an aligned 32-byte chunk contains more than 18 instructions3.
To test the legacy decoder behavior, we use a loop that looks like this:
short_nop:
mov rax, 100_000_000
ALIGN 32
.top:
dec rax
nop
...
jnz .top
ret
Basically, a trivial loop that counts down until rax is zero. All instructions are a single uop4 and the number of nop instructions is varied (at the location shown as ...) to test different sizes of loops (so a 4-uop loop will have 2 nops, plus the two loop control instructions). There is no macro-fusion as we always separate the dec and jnz with at least one nop, and also no micro-fusion. Finally, there is no memory access at (outside of the implied icache access).
Note that this loop is very dense - about 1 byte per instruction (since the nop instructions are 1 byte each) - so we'll trigger the > 18 instructions in a 32B chunk condition as soon as hit 19 instructions in the loop. Based on examining the perf performance counters lsd.uops and idq.mite_uops that's exactly what we see: essentially 100% of the instructions come out of the LSD5 up until and including the 18 uop loop, but at 19 uops and up, 100% come from the legacy decoder.
In any case, here are the cycles/iteration for all loop sizes from 3 to 99 uops6:
The blue points are the loops that fit in the LSD, and show somewhat complex behavior. We'll look at these later.
The red points (starting at 19 uops/iteration), are handled by the legacy decoder, and show a very predictable pattern:
All loops with N uops take exactly ceiling(N/4) iterations
So, for the legacy decoder at least, Peter's observation holds exactly on Skylake: loops with a multiple of 4 uops may execute at an IPC of 4, but any other number of uops will waste 1, 2 or 3 execution slots (for loops with 4N+3, 4N+2, 4N+1 instructions, respectively).
It is not clear to me why this happens. Although it may seem obvious if you consider that decoding happens in contiguous 16B chunks, and so at a decoding rate of 4 uops/cycle loops not a multiple of 4 would always have some trailing (wasted) slots in the cycle the jnz instruction is encountered. However, the actual fetch & decode unit is composed of predecode and decode phases, with a queue in-between. The predecode phase actually has a throughput of 6 instructions, but only decodes to the end of the 16-byte boundary on each cycle. This seems to imply that the bubble that occurs at the end of the loop could be absorbed by the predecoder -> decode queue since the predecoder has an average throughput higher than 4.
So I can't fully explain this based on my understanding of how the predecoder works. It may be that there is some additional limitation in decoding or pre-decoding that prevents non-integral cycle counts. For example, perhaps the legacy decoders cannot decode instructions on both sides of a jump even if the instructions after the jump are available in the predecoded queue. Perhaps it is related to the need to handle macro-fusion.
The test above shows the behavior where the top of the loop is aligned on a 32-byte boundary. Below is the same graph, but with an added series that shows the effect when the top of loop is moved 2 bytes up (i.e, now misaligned at a 32N + 30 boundary):
Most loop sizes now suffer a 1 or 2 cycle penalty. The 1 penalty case makes sense when you consider decoding 16B boundaries and 4-instructions per cycle decoding, and the 2 cycle penalty cases occurs for loops where for some reason the DSB is used for 1 instruction in the loop (probably the dec instruction which appears in its own 32-byte chunk), and some DSB<->MITE switching penalties are incurred.
In some cases, the misalignment doesn't hurt when it ends up better aligning the end of the loop. I tested the misalignment and it persists in the same way up to 200 uop loops. If you take the description of the predecoders at face value, it would seem that, as above, they should be able to hide a fetch bubble for misalignment, but it doesn't happen (perhaps the queue is not big enough).
DSB (Uop Cache)
The uop cache (Intel likes to call it the DSB) is able to cache most loops of moderate amount of instructions. In a typical program, you'd hope that most of your instructions are served out of this cache7.
We can repeat the test above, but now serving uops out of the uop cache. This is a simple matter of increasing the size of our nops to 2 bytes, so we no longer hit the 18-instruction limit. We use the 2-byte nop xchg ax, ax in our loop:
long_nop_test:
mov rax, iters
ALIGN 32
.top:
dec eax
xchg ax, ax ; this is a 2-byte nop
...
xchg ax, ax
jnz .top
ret
Here, there results are very straightforward. For all tested loop sizes delivered out of the DSB, the number of cycles required was N/4 - i.e., the loops executed at the maximum theoretical throughput, even if they didn't have a multiple of 4 uops. So in general, on Skylake, moderately sized loops served out of the DSB shouldn't need to worry about ensuring the uop count meets some particular multiple.
Here's a graph out to 1,000 uop loops. If you squint, you can see the sub-optimal behavior before 64-uops (when the loop is in the LSD). After that, it's a straight shot, 4 IPC the whole way to 1,000 uops (with a blip around 900 that was probably due to load on my box):
Next we look at performance for loops that are small enough to fit in the uop cache.
LSD (Loop steam detector)
Important note: Intel has apparently disabled the LSD on Skylake (SKL150 erratum) and Kaby Lake (KBL095, KBW095 erratum) chips via a microcode update and on Skylake-X out of the box, due to a bug related to the interaction between hyperthreading and the LSD. For those chips, the graph below will likely not have the interesting region up to 64 uops; rather, it will just look the same as the region after 64 uops.
The loop stream detector can cache small loops of up to 64 uops (on Skylake). In Intel's recent documentation it is positioned more as a power-saving mechanism than a performance feature - although there are certainly no performance downsides mentioned to using the LSD.
Running this for the loop sizes that should fit in the LSD, we get the following cycles/iteration behavior:
The red line here is the % of uops which are delivered from the LSD. It flatlines at 100% for all loop sizes from 5 to 56 uops.
For the 3 and 4 uop loops, we have the unusual behavior that 16% and 25% of the uops, respectively, are delivered from the legacy decoder. Huh? Luckily, it doesn't seem to affect the loop throughput as both cases achieve the maximum throughput of 1 loop/cycle - despite the fact that one could expect some MITE<->LSD transition penalties.
Between loop sizes of 57 and 62 uops, the number of uops delivered from LSD exhibits some weird behavior - approximately 70% of the uops are delivered from the LSD, and the rest from the DSB. Skylake nominally has a 64-uop LSD, so this is some kind of transition right before the LSD size is exceeded - perhaps there is some kind of internal alignment within the IDQ (on which the LSD is implemented) that causes only partial hits to the LSD in this phase. This phase is short and, performance-wise, seems mostly to be a linear combination of the full-in-LSD performance which precedes it, and the fully-in-DSB performance which follows it.
Let's look at the main body of results between 5 and 56 uops. We see three distinct regions:
Loops from 3 to 10 uops: Here, the behavior is complex. It is the only region where we see cycle counts that can't be explained by static behavior over a single loop iteration8. The range is short enough that it's hard to say if there is a pattern. Loops of 4, 6 and 8 uops all execute optimally, in N/4 cycles (that's the same pattern as the next region).
A loop of 10 uops, on the other hand, executes in 2.66 cycles per iteration, making it the only even loop size that doesn't execute optimally until you get to loop sizes of 34 uops or above (other than the outlier at 26). That corresponds to something like a repeated uop/cycle execution rate of 4, 4, 4, 3. For a loop of 5 uops, you get 1.33 cycles per iteration, very close but not the same as the ideal of 1.25. That corresponds to an execution rate of 4, 4, 4, 4, 3.
These results are hard to explain. The results are repeatable from run to run, and robust to changes such as swapping out the nop for an instruction that actually does something like mov ecx, 123. It might be something to do with the limit of 1 taken branch every 2 cycles, which applies to all loops except those that are "very small". It might be that the uops occasionally line up such that this limitation kicks in, leading to an extra cycle. Once you get to 12 uops or above, this never occurs since you are always taking at least three cycles per iteration.
Loops from 11 to 32-uops: We see a stair-step pattern, but with a period of two. Basically all loops with an even number of uops perform optimally - i.e., taking exactly N/4 cycles. Loops with odd number of uops waste one "issue slot", and take the same number of cycles as a loop with one more uops (i.e., a 17 uop loop takes the same 4.5 cycles as an 18 uop loop). So here we have behavior better than ceiling(N/4) for many uop counts, and we have the first evidence that Skylake at least can execute loops in a non-integral number of cycles.
The only outliers are N=25 and N=26, which both take about 1.5% longer than expected. It's small but reproducible, and robust to moving the function around in the file. That's too small to be explained by a per-iteration effect, unless it has a giant period, so it's probably something else.
The overall behavior here is exactly consistent (outside of the 25/26 anomaly) with the hardware unrolling the loop by a factor of 2.
Loops from 33 to ~64 uops: We see a stair-step pattern again, but with a period of 4, and worse average performance than the up-to 32 uop case. The behavior is exactly ceiling(N/4) - that is, the same as the legacy decoder case. So for loops of 32 to 64 uops, the LSD provides no apparent benefit over the legacy decoders, in terms of front end throughput for this particular limitation. Of course, there are many other ways the LSD is better - it avoids many of the potential decoding bottlenecks that occur for more complex or longer instructions, and it saves power, etc.
All of this is quite surprising, because it means that loops delivered from the uop cache generally perform better in the front end than loops delivered from the LSD, despite the LSD usually being positioned as a strictly better source of uops than the DSB (e.g., as part of advice to try to keep loops small enough to fit in the LSD).
Here's another way to look at the same data - in terms of the efficiency loss for a given uop count, versus the theoretical maximum throughput of 4 uops per cycle. A 10% efficiency hit means you only have 90% of the throughput that you'd calculate from the simple N/4 formula.
The overall behavior here is consistent with the hardware not doing any unrolling, which makes sense since a loop of more than 32 uops cannot be unrolled at all in a buffer of 64 uops.
The three regions discussed above are colored differently, and at least competing effects are visible:
Everything else being equal, the larger the number of uops involved, the lower the efficiency hit. The hit is a fixed cost only once per iteration, so larger loops pay a smaller relative cost.
There is a large jump in inefficiency when you cross to into the 33+ uop region: both the size of the throughput loss increases, and the number of affected uop counts doubles.
The first region is somewhat chaotic, and 7 uops is the worst overall uop count.
Alignment
The DSB and LSD analysis above is for loop entries aligned to a 32-byte boundary, but the unaligned case doesn't seem to suffer in either case: there isn't a material difference from the aligned case (other than perhaps some small variation for less than 10 uops that I didn't investigate further).
Here's the unaligned results for 32N-2 and 32N+2 (i.e., the loop top 2 bytes before and after the 32B boundary):
The ideal N/4 line is also shown for reference.
Haswell
Next next take a look at the prior microarchitecture: Haswell. The numbers here have been graciously provided by user Iwillnotexist Idonotexist.
LSD + Legacy Decode Pipeline
First, the results from the "dense code" test which tests the LSD (for small uop counts) and the legacy pipeline (for larger uop counts, since the loop "busts out" of the DSB due to instruction density.
Immediately we see a difference already in terms of when each architecture delivers uops from the LSD for a dense loop. Below we compare Skylake and Haswell for short loops of dense code (1 byte per instruction).
As described above, the Skylake loop stops being delivered from the LSD at exactly 19 uops, as expected from the 18-uop per 32-byte region of code limit. Haswell, on the other hand, seems to stop delivering reliably from the LSD for the 16-uop and 17-uop loops as well. I don't have any explanation for this. There is also a difference in the 3-uop case: oddly both processors only deliver some of the their uops out of the LSD in the 3 and 4 uop cases, but the exact amount is the same for 4 uops, and different from 3.
We mostly care about the actual performance though, right? So let's look at the cycles/iteration for the 32-byte aligned dense code case:
This is the same data as show above for Skylake (the misaligned series has been removed), with Haswell plotted alongside. Immediately you notice that the pattern is similar for Haswell, but not the same. As above, there are two regions here:
Legacy Decode
The loops larger than ~16-18 uops (the uncertainty is described above) are delivered from the legacy decoders. The pattern for Haswell is somewhat different from Skylake.
For the range from 19-30 uops they are identical, but after that Haswell breaks the pattern. Skylake took ceil(N/4) cycles for loops delivered from the legacy decoders. Haswell, on the other hand, seems to take something like ceil((N+1)/4) + ceil((N+2)/12) - ceil((N+1)/12). OK, that's messy (shorter form, anyone?) - but basically it means that while Skylake executes loops with 4*N cycles optimally (i.e,. at 4-uops/cycle), such loops are (locally) usually the least optimal count (at least locally) - it takes one more cycle to execute such loops than Skylake. So you are actually best off with loops of 4N-1 uops on Haswell, except that the 25% of such loops that are also of the form 16-1N (31, 47, 63, etc) take one additional cycle. It's starting to sound like a leap year calculation - but the pattern is probably best understood visually above.
I don't think this pattern is intrinsic to uop dispatch on Haswell, so we shouldn't read to much into it. It seems to be explained by
0000000000455a80 <short_nop_aligned35.top>:
16B cycle
1 1 455a80: ff c8 dec eax
1 1 455a82: 90 nop
1 1 455a83: 90 nop
1 1 455a84: 90 nop
1 2 455a85: 90 nop
1 2 455a86: 90 nop
1 2 455a87: 90 nop
1 2 455a88: 90 nop
1 3 455a89: 90 nop
1 3 455a8a: 90 nop
1 3 455a8b: 90 nop
1 3 455a8c: 90 nop
1 4 455a8d: 90 nop
1 4 455a8e: 90 nop
1 4 455a8f: 90 nop
2 5 455a90: 90 nop
2 5 455a91: 90 nop
2 5 455a92: 90 nop
2 5 455a93: 90 nop
2 6 455a94: 90 nop
2 6 455a95: 90 nop
2 6 455a96: 90 nop
2 6 455a97: 90 nop
2 7 455a98: 90 nop
2 7 455a99: 90 nop
2 7 455a9a: 90 nop
2 7 455a9b: 90 nop
2 8 455a9c: 90 nop
2 8 455a9d: 90 nop
2 8 455a9e: 90 nop
2 8 455a9f: 90 nop
3 9 455aa0: 90 nop
3 9 455aa1: 90 nop
3 9 455aa2: 90 nop
3 9 455aa3: 75 db jne 455a80 <short_nop_aligned35.top>
Here I've noted the 16B decode chunk (1-3) each instruction appears in, and the cycle in which it will be decoded. The rule is basically that up to the next 4 instructions are decoded, as long as they fall in the current 16B chunk. Otherwise they have to wait until the next cycle. For N=35, we see that there is a loss of 1 decode slot in cycle 4 (only 3 instruction are left in the 16B chunk), but that otherwise the loop lines up very well with the 16B boundaries and even the last cycle (9) can decode 4 instructions.
Here's a truncated look at N=36, which is identical except for the end of the loop:
0000000000455b20 <short_nop_aligned36.top>:
16B cycle
1 1 455a80: ff c8 dec eax
1 1 455b20: ff c8 dec eax
1 1 455b22: 90 nop
... [29 lines omitted] ...
2 8 455b3f: 90 nop
3 9 455b40: 90 nop
3 9 455b41: 90 nop
3 9 455b42: 90 nop
3 9 455b43: 90 nop
3 10 455b44: 75 da jne 455b20 <short_nop_aligned36.top>
There are now 5 instructions to decode in the 3rd and final 16B chunk, so one additional cycle is needed. Basically 35 instructions, for this particular pattern of instructions happens to line up better with the 16B bit boundaries and saves one cycle when decoding. This doesn't mean that N=35 is better than N=36 in general! Different instructions will have different numbers of bytes and will line up differently. A similar alignment issue explains also the additional cycle that is required every 16 bytes:
16B cycle
...
2 7 45581b: 90 nop
2 8 45581c: 90 nop
2 8 45581d: 90 nop
2 8 45581e: 90 nop
3 8 45581f: 75 df jne 455800 <short_nop_aligned31.top>
Here the final jne has slipped into the next 16B chunk (if an instruction spans a 16B boundary it is effectively in the latter chunk), causing an extra cycle loss. This occurs only every 16 bytes.
So the Haswell legacy decoder results are explained perfectly by a legacy decoder that behaves as described, for example, in Agner Fog's microarchitecture doc. In fact, it also seems to explain Skylake results if you assume Skylake can decode 5 instructions per cycle (delivering up to 5 uops)9. Assuming it can, the asymptotic legacy decode throughput on this code for Skylake is still 4-uops, since a block of 16 nops decodes 5-5-5-1, versus 4-4-4-4 on Haswell, so you only get benefits at the edges: in the N=36 case above, for example, Skylake can decode all remaining 5 instructions, versus 4-1 for Haswell, saving a cycle.
The upshot is that it seems to be that the legacy decoder behavior can be understood in a fairly straightforward manner, and the main optimization advice is to continue to massage code so that it falls "smartly" into the 16B aligned chunks (perhaps that's NP-hard like bin packing?).
DSB (and LSD again)
Next let's take a look at the scenario where the code is served out of the LSD or DSB - by using the "long nop" test which avoids breaking the 18-uop per 32B chunk limit, and so stays in the DSB.
Haswell vs Skylake:
Note the LSD behavior - here Haswell stops serving out of the LSD at exactly 57 uops, which is completely consistent with the published size of the LSD of 57 uops. There is no weird "transition period" like we see on Skylake. Haswell also has the weird behavior for 3 and 4 uops where only ~0% and ~40% of the uops, respectively, come from the LSD.
Performance-wise, Haswell is normally in-line with Skylake with a few deviations, e.g., around 65, 77 and 97 uops where it rounds up to the next cycle, whereas Skylake is always able to sustain 4 uops/cycle even when that's results in a non-integer number of cycles. The slight deviation from expected at 25 and 26 uops has disappeared. Perhaps the 6-uop delivery rate of Skylake helps it avoid uop-cache alignment issues that Haswell suffers with its 4-uop delivery rate.
Other Architectures
Results for the following additional architectures were kindly provided by user Andreas Abel, but we'll have to use another answer for further analysis as we are at the character limit here.
Help Needed
Although results for many platforms have been kindly offered by the community, I'm still interested in results on chips older than Nehalem, and newer than Coffee Lake (in particular, Cannon Lake, which is a new uarch). The code to generate these results is public. Also, the results above are available in .ods format in GitHub as well.
0 In particular, the legacy decoder maximum throughput apparently increased from 4 to 5 uops in Skylake, and the maximum throughput for the uop cache increased from 4 to 6. Both of those could impact the results described here.
1 Intel actually like to call the legacy decoder the MITE (Micro-instruction Translation Engine), perhaps because it's a faux-pas to actually tag any part of your architecture with the legacy connotation.
2 Technically there is another, even slower, source of uops - the MS (microcode sequencing engine), which is used to implement any instruction with more than 4 uops, but we ignore this here since none of our loops contain microcoded instructions.
3 This works because any aligned 32-byte chunk can use at most 3-ways in its uop cache slot, and each slot holds up to 6 uops. So if you use more than 3 * 6 = 18 uops in a 32B chunk, the code can't be stored in the uop cache at all. It's probably rare to encounter this condition in practice, since the code needs to be very dense (less than 2 bytes per instruction) to trigger this.
4 The nop instructions decode to one uop, but don't are eliminated prior to execution (i.e., they don't use an execution port) - but still take up space in the front end and so count against the various limits that we are interested in.
5 The LSD is the loop stream detector, which caches small loops of up to 64 (Skylake) uops directly in the IDQ. On earlier architectures it can hold 28 uops (both logical cores active) or 56 uops (one logical core active).
6 We can't easily fit a 2 uop loop in this pattern, since that would mean zero nop instructions, meaning the dec and jnz instructions would macro-fuse, with a corresponding change in the uop count. Just take my word that all loops with 4 or less uops execute at best at 1 cycle/iteration.
7 For fun, I just ran perf stat against a short run of Firefox where I opened a tab and clicked around on a few Stack Overflow questions. For instructions delivered, I got 46% from DSB, 50% from legacy decoder and 4% for LSD. This shows that at least for big, branchy code like a browser the DSB still can't capture the large majority of the code (lucky the legacy decoders aren't too bad).
8 By this, I mean that all the other cycle counts can be explained by simply by taking an "effective" integral loop cost in uops (which might be higher than the actual size is uops) and dividing by 4. For these very short loops, this doesn't work - you can't get to 1.333 cycles per iteration by dividing any integer by 4. Said another way, in all other regions the costs have the form N/4 for some integer N.
9 In fact we know that Skylake can deliver 5 uops per cycle from the legacy decoder, but we don't know if those 5 uops can come from 5 different instructions, or only 4 or less. That is, we expect that Skylake can decode in the pattern 2-1-1-1, but I'm not sure if it can decode in the pattern 1-1-1-1-1. The above results give some evidence that it can indeed decode 1-1-1-1-1.

This is a follow-on to the original answer, to analyze the behavior for five additional architectures, based on test results provided by Andreas Abel:
Nehalem
Sandy Bridge
Ivy Bridge
Broadwell
Coffee Lake
We take a quick look at the results on these architectures in addition to Skylake and Haswell. It only needs to be a "quick" look since all the architectures except Nehalem follow one of the existing patterns discussed above.
First, the short nop case which exercises the legacy decoder (for loops that don't fit in the LSD) and the LSD. Here is the cycles/iteration for this scenario, for all 7 architectures.
Figure 2.1: All architectures dense nop performance:
This graph is really busy (click for a larger view) and a bit hard to read since the results for many architectures lie on top of each other, but I tried to ensure that a dedicated reader can track the line for any architecture.
First, let's discuss the big outlier: Nehalem. All of the other architectures have a slope that roughly follows the 4 uops/cycle line, but Nehalem is at almost exactly 3 uops per cycle, so quickly falls behind all of the other architectures. Outside of the initial LSD region, the line is also totally smooth, without the "stair step" appearance seen in the other architectures.
This is entirely consistent with Nehalem having a uop retirement limit of 3 uops/cycle. This is the bottleneck for uops outside of the LSD: they all execute at about exactly 3 uops per cycle, bottlenecked on retirement. The front-end isn't the bottleneck, so the exact uop count and decoding arrangement doens't matter and so the stair-step is absent.
Other than Nehalem, the other architectures, except Broadwell split fairly cleanly into groups: Haswell-like or Skylake-like. That is, all of Sandy Bridge, Ivy Bridge and Haswell behave like Haswell, for loops greater than about 15 uops (Haswell behavior is discussed in the other answer). Even though they are different micro-architectures, they behave largely the same since their legacy decoding capabilities are the same. Below about 15 uops we see Haswell as somewhat faster for any uop count not a multiple of 4. Perhaps it gets an additional unrolling in the LSD due to a larger LSD, or there are other "small loop" optimizations. For Sandy Bridge and Ivy Bridge, this means that small loops should definitely target a uop count which is a multiple of 4.
Coffee Lake behaves similarly to Skylake1. This makes sense, since the micro-architecture is the same. Coffee Lake appears better than Skylake below about 16 uops, but this is just an effect of Coffee Lake's disabled LSD by default. Skylake was tested with an enabled LSD, before Intel disabled it via microcode update due to a security issue. Coffee Lake was released after this issue was known, so had the LSD disabled out-of-the-box. So for this test, Coffee Lake is using either the DSB (for loops below about 18 uops, which can still fit in the DSB) or the legacy decoder (for the remainder of the loops), which leads to better results for small uop count loops where the LSD imposes an overhead (interesting, for larger loops, the LSD and the legacy decoder happen to impose exactly the same overhead, for very different reasons).
Finally, we take a look at 2-byte NOPs, which aren't dense enough to prevent the use of the DSB (so this case is more reflective of typical code).
Figure 2.1: 2-byte nop performance:
Again, the result is along the same lines as the earlier chart. Nehalem is still the outlier bottlenecked at 3 uops per cycle. For the range up to about 60ish uops, all architectures other than Coffee Lake are using the LSD, and we see that Sandy Bridge and Ivy Bridge perform a bit worse here, rounding up to the next cycle and so only achieving the maximum throughput of 4 uops/cycle if the number of uops in the loop is a multiple of 4. Above 32 uops the "unrolling" feature of Haswell and new uarchs dosn't have any effect, so everything is roughly tied.
Sandy Bridge actually has a few uop ranges (e.g., from 36 through 44 uops) where it performs better than the newer architectures. This seems to occur because not all loops are detected by the LSD and in these ranges the loops are served from the DSB instead. Since the DSB is generally faster, so is Sandy Bridge in these cases.
What Intel Says
You can actually find a section specifically dealing with this topic in the Intel Optimization Manual, section 3.4.2.5, as pointed out by Andreas Abel in the comments. There, Intel says:
The LSD holds micro-ops that construct small “infinite” loops.
Micro-ops from the LSD are allocated in the out-of-order engine. The
loop in the LSD ends with a taken branch to the beginning of the loop.
The taken branch at the end of the loop is always the last micro-op
allocated in the cycle. The instruction at the beginning of the loop
is always allocated at the next cycle. If the code performance is
bound by front end bandwidth, unused allocation slots result in a
bubble in allocation, and can cause performance degrada- tion.
Allocation bandwidth in Intel microarchitecture code name Sandy Bridge
is four micro-ops per cycle. Performance is best, when the number of
micro-ops in the LSD result in the least number of unused allo- cation
slots. You can use loop unrolling to control the number of micro-ops
that are in the LSD.
They go on to show an example where unrolling a loop by a factor of two doesn't help performance due to LSD "rounding", but unrolling by three works. The example is a big confusing since it actually mixes two effects since unrolling more also reduces the loop overhead and hence the number of uops per iteration. A more interesting example would have been where unrolling the loop fewer times led to an increase in performance due to LSD rounding effects.
This section seems to accurately describe the behavior in Sandy Bridge and Ivy Bridge. The results above show that both of these architectures do as described, and you lose 1, 2 or 3 uop execution slots for loops with 4N+3, 4N+2, or 4N+1 uops respectively.
It hasn't been updated with the new performance for Haswell and later however. As described in the other answer, performance has improved from the simple model described above and the behavior is more complex.
1 There is a weird outlier at 16 uops where Coffee Lake performs worse than all the other architectures, even Nehalem (a regression of about 50%), but maybe this measurement noise?

TL;DR: For tight loops consisting of exactly 7 uops it results in inefficient retirement bandwidth utilization. Consider manual loop unrolling so the loop will consist of 12 uops
I recently faced retirement bandwidth degradation with loops consisting of 7 uops. After doing some research by myself quick googling leads me to this topic. And here are my 2 cents applying to Kaby Lake i7-8550U CPU:
As #BeeOnRope noted, LSD is turned off on chips like KbL i7-8550U.
Consider the following NASM macro
;rdi = 1L << 31
%macro nops 1
align 32:
%%loop:
times %1 nop
dec rdi
ja %%loop
%endmacro
Here is how the "average retirement rate" uops_retired.retire_slots/uops_retired.total_cycle looks like:
The thing to notice here is the retirement degradation when the loop consists of 7 uops. It results in 3.5 uops being retired per cycle.
The average idq delivery rate idq.all_dsb_cycles_any_uops / idq.dsb_cycles looks as
For loops of 7 uops it results in 3.5 uops being delivered to the idq per cycle. Judging by only this counter it is impossible to conclude whether uops cache delivers 4|3 or 6|1 groups.
For loops consisting of 6 uops it results in an efficient utilization of uops cache bandwidth - 6 uops/c. When IDQ gets overflowed the uops cache stays idle until it can deliver 6 uops again.
To check how the uops cache stays idle let's compare idq.all_dsb_cycles_any_uops and cycles
The number of cycles uops are delivered to the idq is equal to the number of total cycles for loops of 7 uops. By contrast the counters are noticeably different for the loop of 6 uops.
The key counters to check is idq_uops_not_delivered.*
As can be seen for the loop of 7 uops we have that the Renamer takes 4|3 groups which results in inefficient retirement bandwidth utilization.

Why isn't MOVNTI slower, in a loop storing repeatedly to the same address?

section .text
%define n 100000
_start:
xor rcx, rcx
jmp .cond
.begin:
movnti [array], eax
.cond:
add rcx, 1
cmp rcx, n
jl .begin
section .data
array times 81920 db "A"
According to perf it runs at 1.82 instructions per cycle. I cannot understand why it's so fast. After all, it has to be stored in memory (RAM) so it should be slow.
P.S Is there any loop-carried-dependency?
EDIT
section .text
%define n 100000
_start:
xor rcx, rcx
jmp .cond
.begin:
movnti [array+rcx], eax
.cond:
add rcx, 1
cmp rcx, n
jl .begin
section .data
array times n dq 0
Now, the iteration take 5 cycle per iteration. Why? After all, there is still no loop-carried-dependency.

movnti can apparently sustain a throughput of one per clock when writing to the same address repeatedly.
I think movnti keeps writing into the same fill buffer, and it's not getting flushed very often because there are no other loads or stores happening. (That link is about copying from WC video memory with SSE4.1 NT loads, as well as storing to normal memory with NT stores.)
So the NT write-combining fill-buffer acts like a cache for multiple overlapping NT stores to the same address, and writes are actually hitting in the fill buffer instead of going to DRAM each time.
DDR DRAM only supports burst-transfer commands. If every movnti produced a 4B write that actually was visible to the memory chips, there'd be no way it could run that fast. The memory controller either has to read/modify/write, or do an interrupted burst transfer, since there is no non-burst write command. See also Ulrich Drepper's What Every Programmer Should Know About Memory.
We can further prove this is the case by running the test on multiple cores at once. Since they don't slow each other down at all, we can be sure that the writes are only infrequently making it out of the CPU cores and competing for memory cycles.
The reason your experiment doesn't show your loop running at 4 instruction per clock (one cycle per iteration) is that you used such a tiny repeat count. 100k cycles barely accounts for the startup overhead (which perf's timing includes).
For example, on a Core2 E6600 (Merom/Conroe) with dual channel DDR2 533MHz, the total time including all process startup / exit stuff is 0.113846 ms. That's only 266,007 cycles.
A more reasonable microbenchmark shows one iteration (one movnti) per cycle:
global _start
_start:
xor ecx,ecx
.begin:
movnti [array], eax
dec ecx
jnz .begin ; 2^32 iterations
mov eax, 60 ; __NR_exit
xor edi,edi
syscall ; exit(0)
section .bss
array resb 81920
(asm-link is a script I wrote)
$ asm-link movnti-same-address.asm
+ yasm -felf64 -Worphan-labels -gdwarf2 movnti-same-address.asm
+ ld -o movnti-same-address movnti-same-address.o
$ perf stat -e task-clock,cycles,instructions ./movnti-same-address
Performance counter stats for './movnti-same-address':
1835.056710 task-clock (msec) # 0.995 CPUs utilized
4,398,731,563 cycles # 2.397 GHz
12,891,491,495 instructions # 2.93 insns per cycle
1.843642514 seconds time elapsed
Running in parallel:
$ time ./movnti-same-address; time ./movnti-same-address & time ./movnti-same-address &
real 0m1.844s / user 0m1.828s # running alone
[1] 12523
[2] 12524
peter#tesla:~/src/SO$
real 0m1.855s / user 0m1.824s # running together
real 0m1.984s / user 0m1.808s
# output compacted by hand to save space
I expect perfect SMP scaling (except with hyperthreading), up to any number of cores. e.g. on a 10-core Xeon, 10 copies of this test could run at the same time (on separate physical cores), and each one would finish in the same time as if it was running alone. (Single-core turbo vs. multi-core turbo will also be a factor, though, if you measure wall-clock time instead of cycle counts.)
zx485's uop count nicely explains why the loop isn't bottlenecked by the frontend or unfused-domain execution resources.
However, this disproves his theory about the ratio of CPU to memory clocks having anything to do with it. Interesting coincidence, though, that the OP chose a count that happened to make the final total IPC work out that way.
P.S Is there any loop-carried-dependency?
Yes, the loop counter. (1 cycle). BTW, you could have saved an insn by counting down towards zero with dec / jg instead of counting up and having to use a cmp.
The write-after-write memory dependency isn't a "true" dependency in the normal sense, but it is something the CPU has to keep track of. The CPU doesn't "notice" that the same value is written repeatedly, so it has to make sure the last write is the one that "counts".
This is called an architectural hazard. I think the term still applies when talking about memory, rather than registers.

The result is plausible. Your loop code consists of the follwing instuctions. According to Agner Fog's instruction tables, these have the following timings:
Instruction regs fused unfused ports Latency Reciprocal Throughput
---------------------------------------------------------------------------------------------------------------------------
MOVNTI m,r 2 2 p23 p4 ~400 1
ADD r,r/i 1 1 p0156 1 0.25
CMP r,r/i 1 1 p0156 1 0.25
Jcc short 1 1 p6 1 1-2 if predicted that the jump is taken
Fused CMP+Jcc short 1 1 p6 1 1-2 if predicted that the jump is taken
So
MOVNTI consumes 2 uOps, 1 in port 2 or 3 and one in port 4
ADD consumes 1 uOps in port 0 or 1 or 5 or 6
CMP and Jcc macro-fuse to the last line in the table resulting in a consumption of 1 uOp
Because neither ADD nor CMP+Jcc depend on the result of MOVNTI they can be executed (nearly) in parallel on recent architectures, for example using the ports 1,2,4,6. The worst case would be a latency of 1 between ADD and CMP+Jcc.
This is most likely a design error in your code: you're essentially writing to the same address [array] a 100000 times, because you do not adjust the address.
The repeated writes can even go to the L1-cache under the condition that
The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region.
but it doesn't look like this and won't make for a great difference, anyway, because even if writing to memory, the memory speed will be the limiting factor.
For example, if you have a 3GHz CPU and 1600MHz DDR3-RAM this will result in 3/1.6 = 1.875 CPU cycles per memory cycle. This seems plausible.

GT200 Single Precision Peak Performance

I was trying to verify the single precision peak performance of a reference GT200 card.
From http://www.realworldtech.com/gt200/9/, we have two facts about GT200 –
The latency of the fastest operation for an SP core is 4 cycles.
The SFU takes 4 cycles too to finish an operation.
Now, each SM has a total of 8 SPs and 2 SFUs, with each SFU having 4 FP multiply units and these SPs and SFUs can work at the same time as they are on two different ports as explained in their SM level diagrams. Each SP can perform MAD operation.
So, we are looking at 8 MAD operations and 8 MUL operations per 4 SP cycles. This gives us 16 + 8 = 24 operations per 4 SP clock cycles as MAD counts as 2 operations. Since 2 SP clock cycle counts as one shader clock, we have 24/2 = 12 operations per shader clock.
For a reference GT200 card, shader clock = 1296 MHz/s.
Thus, the single precision peak performance must be = 1296 MHz/s * 30 SM * 12 operations per shader clock = 466.560 GFLOPS
This is exactly half of the GFLOPS as reported in the specs. So where am I going wrong?
Edit: After Robert’s pointer to the CUDA Programming Guide that says 8MADs/shader clock can be performed in a GT200 SM, I would have to question how latency and throughput relate to each other in this particular SM.
There is a latency of one OP / 4 SP cycles (as pointed out earlier), thus one MAD every 4 SP cycles, right? We have 8 SPs, so it becomes 8 MADs for every 4 SP cycles in an SM.
Since 2 SP cycles form one shader cycle, so we are left with => 8MADs per 2 shader clock cycles
=> 4 MADs per shader clock.
This doesn’t match with the 8MADs/shader clock from the Programming Guide.
So, what am I doing wrong again?

Latency and throughput are not the same thing.
A cc 1.x SM can retire 8 single precision floating point MAD operations on every clock cycle.
This is the correct formula:
1296 MHz(cycle/s) * 30 SM * (8 SP/SM * 2 flop/cycle per SP + 2 SFU/SM * 4 FPU/SFU * 1 flop/cycle per FPU)
= 622080 Mflop/s + 311040 Mflop/s = 933 GFlop/s single precision
From here
EDIT: The 4-cycle latency you're referring to is the latency of a warp (i.e. 32 threads) MAD instruction, as issued to the SM, not the latency of a single MAD operation on a single SP. The FPU in each SP can generate one MAD result per clock, and there are 8 SP's in one SM, so each SM can generate 8 MAD results per clock. Since a warp (32 threads) MAD instruction requires 32 MAD results, it requires 4 total clocks to complete the warp instruction, as issued to the SPs in the SM.
The FPU in the SP can generate one new MAD result per clock. From the standpoint of instruction issue, the fundamental unit is the warp. Therefore a warp MAD instruction requires 4 clocks to complete.
EDIT2: Responding to question below.
Preface: The FPUs in the SFU are not independently schedulable. They only come into play when an instruction is scheduled to the SFUs. There are 4 FPU per SFU, and an SFU instruction requires 16 cycles (since there are 2 SFU/SM) to complete for a warp. If all 4 FPU in both SFUs were fully utilized, that would be 128 (16x4x2) flops produced during the computation of the SFU instruction, in 16 cycles. This is added to the 256 (16x2x8) total flops that could be generated by the "regular" MAD FPUs in the SM during the same time (16 cycles).
Your question seems to be interpreting the observed benchmark result and this statement in the text:
Table III also shows that the throughput for single-precision
floating point multiplication is 11.2 ops/clock, which means
that multiplication can be issued to both the SP and SFU
units. This suggests that each SFU unit is capable of doing
2 multiplications per cycle, twice the throughput of other
(more complex) instructions that map to this unit.
as an indication of either the throughput of the FPUs in the SFU or else the number of FPUs in the SFU. However you are conflating benchmark data with a theoretical number. The SFU has 4 FPU, but this does not mean that all 4 are independently schedulable for arbitrary arithmetic or instruction streams. Seeing all 4 FPU take on a new floating point instruction in a given cycle may require a specific instruction sequence that the authors haven't used.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio