Why is my loop much faster when it is contained in one cache line?

Why is my loop much faster when it is contained in one cache line? - performance

When I run this little assembly program on my Ryzen 9 3900X:
_start: xor rax, rax
xor rcx, rcx
loop0: add rax, 1
mov rdx, rax
and rdx, 1
add rcx, rdx
cmp rcx, 1000000000
jne loop0
It completes in 450 ms if all the instructions between loop0 and up to and including the jne, are contained entirely in one cacheline. That is, if:
round((address of loop0)/64) == round((address of end of jne-instruction)/64)
However, if the above condition does not hold, the loop takes 900 ms instead.
I've made a repo with the code https://github.com/avl/strange_performance_repro .
Why is the inner loop much slower in some specific cases?
Edit: Removed a claim with a conclusion from a mistake in testing.

Your issue lies in the variable cost of the jne instruction.
First of all, to understand the impact of the effect, we need to analyze the full loop itself. The architecture of the Ryzen 9 3900X is Zen2. We can retrieve information about this on the AMD website or also WikiChip.
This architecture have 4 ALU and 3 AGU. This roughly means it can execute up to 4 instructions like add/and/cmp per cycle.
Here is the cost of each instruction of the loop (based on the Agner Fog instruction table for Zen1):
# Reciprocal throughput
loop0: add rax, 1 # 0.25
mov rdx, rax # 0.2
and rdx, 1 # 0.25
add rcx, rdx # 0.25
cmp rcx, 1000000000 # 0.25 | Fused 0.5-2 (2 if jumping)
jne loop0 # 0.5-2 |
As you can see, the 4 first computing instructions of the loop can be executed in ~1 cycle. The 2 last instructions can be merged by your processor into a faster one.
Your main issue is that this last jne instruction could be quite slow compared to the rest of the loop. So you are very likely to measure only the overhead of this instruction. From this point, things start to be complicated.
Engineer and researcher worked hard the last decades to reduce the cost of such instructions at (almost) any price. Nowadays, processors (like the Ryzen 9 3900X) use out-of-order instruction scheduling to execute the dependent instructions required by the jne instruction as soon as possible. Most processors can also predict the address of the next instruction to execute after the jne and fetch new instructions (eg. the one of the next loop iteration) while the other instruction of the current loop iteration are being executed.
Performing the fetch as soon as possible is important to prevent any stall in the execution pipeline of the processor (especially in your case).
From the AMD document "Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors", we can read:
2.8.3 Loop Alignment:
For the processor loop alignment is not usually a significant issue. However, for hot loops, some further knowledge of trade-offs can be helpful. Since the processor can read an aligned 64-byte fetch block every cycle, aligning the end of the loop to the last byte of a 64-byte cache line is the best thing to do, if possible.
2.8.1.1 Next Address Logic
The next-address logic determines addresses for instruction fetch. [...]. When branches are identified, the next-address logic is redirected by the branch target and branch direction prediction hardware to generate a non-sequential fetch block address. The processor facilities that are designed to predict the next instruction to be executed following a branch are detailed in the following sections.
Thus, performing a conditional branching to instructions located in another cache line introduces an additional latency overhead due to a fetch of the Op Cache (an instruction cache faster than the L1) not required if the whole loop fit in 1 cache line. Indeed, if a loop is crossing a cache line, 2 line-cache fetches are required, which takes no less than 2 cycles. If the whole loop is fitting in a cache-line, only one 1 line-cache fetch is needed, which take only 1 cycle. As a result, since your loop iterations are very fast, paying 1 additional cycle introduces a significant slowdown. But how much?
You say that the program takes about 450 ms.
Since the Ryzen 9 3900X turbo frequency is 4.6 GHz and your loop is executed 2e9 times, the number cycles per loop iteration is 1.035. Note that this is better than what we can expected before (I guess this processor is able to rename registers, ignore the mov instruction, execute the jne instruction in parallel in only 1 cycle while other instructions of the loop are perfectly pipelined; which is astounding). This also shows that paying an additional fetch overhead of 1 cycle will double the number of cycles needed to execute each loop iteration and so the overall execution time.
If you do not want to pay this overhead, consider unrolling your loop to significantly reduce the number of conditional branches and non-sequential fetches.
This problem can occur on other architectures such as on Intel Skylake. Indeed, the same loop on a i5-9600KF takes 0.70s with loop alignment and 0.90s without (also due to the additional 1 cycle fetch latency). With a 8x unrolling, the result is 0.53s (whatever the alignment).

Related

Assembly - How to score a CPU instruction by latency and throughput

I'm looking for a type of a formula / way to measure how fast an instruction is, or more specific to give a "score" each of the instruction by CPU cycles.
Let's take the follow assembly program for an example,
nop
mov eax,dword ptr [rbp+34h]
inc eax
mov dword ptr [rbp+34h],eax
and the following Intel Skylake information:
mov r,m : Throughput=0.5 Latency=2
mov m,r
: Throughput=1 Latency=2
nop : Throughput=0.25 Latency=non
inc : Throughput=0.25 Latency=1
I know that the order of the instructions in the program are matter in here but
I'm looking to create something general that not need to be "accurate to the single cycle"
any one have any idea how can I do that?

There is no formula you can apply; you have to measure.
The same instruction on different versions of the same uarch family can have different performance. e.g. mulps:
Sandybridge 1c / 5c throughput/latency.
HSW 0.5 / 5. BDW 0.5 / 3 (faster multiply path in the FMA unit? FMA is still 5c).
SKL 0.5 / 4 (lower latency FMA, too). SKL runs addps on the FMA unit as well, dropping the dedicated FP multiply unit so add latency is higher, but throughput is higher.
There's no way you could predict any of this without measuring, or knowing some microarchitectural details. We expect FP math ops won't be single-cycle latency, because they're much more complicated than integer ops. (So if they were single cycle, the clock speed is set too low for integer ops.)
You measure by repeating the instruction many times in an unrolled loop. Or fully unrolled with no looping, but then you defeat the uop-cache and can get front-end bottlenecks. (e.g. for decoding 10-byte mov r64, imm64)
https://uops.info/ has already automated this testing for every form of every (unprivileged) instruction, and you can even click on any table entry to see what test loops they used. e.g. Skylake xchg r32, eax latency testing (https://uops.info/html-lat/SKL/XCHG_R32_EAX-Measurements.html) from each input operand to each output. (2 cycle latency from EAX -> R8D, but 1 cycle latency from R8D -> EAX.) So we can guess that the 3 uops include copying EAX to an internal temporary, but moving directly from the other operand to EAX.
https://uops.info/ is the current best source of test data; when it and Agner's tables disagree, my own measurements and/or other sources have always confirmed uops.info's testing was accurate. And they don't try to make up a latency number for 2 halves of a round-trip like movd xmm0,eax and back, they show you the range of possible latencies assuming the rest of the chain was the minimum plausible.
Agner Fog creates his instruction tables (which you appear to be reading) by timing large non-looping blocks of code that repeat an instruction. https://agner.org/optimize/. The intro section of his instruction-tables explains briefly how he measures, and his microarch guide explains more details of how different x86 microarchitectures work internally. Unfortunately there are occasional typos or copy/paste errors in his hand-edited tables.
http://instlatx64.atw.hu/ also has results of experimental measurements. I think they use a similar technique of a large block of the same instruction repeated, maybe small enough to fit in the uop cache. But they don't use perf counters to measure what execution port each instruction needs, so their throughput numbers don't help you figure out which instructions compete with which other instructions.
These latter two sources have been around for longer than uops.info, and cover some older CPUs, especially older AMD.
To measure latency yourself, you make the output of each instruction an input for the next.
mov ecx, 10000000
inc_latency:
inc eax
inc eax
inc eax
inc eax
inc eax
inc eax
sub ecx,1 ; avoid partial-flag false dep for P4
jnz inc_latency ; dec or sub/jnz macro-fuses into 1 uop on Intel SnB-family
This dependency chain of 7 inc instructions will bottleneck the loop at 1 iteration per 7 * inc_latency cycles. Using perf counters for core clock cycles (not RDTSC cycles), you can easily measure the time for all the iterations to 1 part in 10k, and with more care probably even more precisely than that. The repeat count of 10000000 hides start/stop overhead of whatever timing you use.
I normally put a loop like this in a Linux static executable that just makes a sys_exit(0) system call directly (with a syscall) instruction, and time the whole executable with perf stat ./testloop to get time and a cycle count. (See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example).
Another example is Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths, with the added complication of using lfence to drain the out-of-order execution window for two dep chains.
To measure throughput, you use separate registers, and/or include an xor-zeroing occasionally to break dep chains and let out-of-order exec overlap things. Don't forget to also use perf counters to see which ports it can run on, so you can tell which other instructions it will compete with. (e.g. FMA (p01) and shuffles (p5) don't compete at all for back-end resources on Haswell/Skylake, only for front-end throughput.) Don't forget to measure front-end uop counts, too: some instructions decode to multiply uops.
How many different dependency chains do we need to avoid a bottleneck? Well we know the latency (measure it first), and we know the max possible throughput (number of execution ports, or front-end throughput.)
For example, if FP multiply had 0.25c throughput (4 per clock), we could keep 20 in flight at once on Haswell (5c latency). That's more than we have registers, so we could just use all 16 and discover that in fact the throughput is only 0.5c. But if it had turned out that 16 registers was a bottleneck, we could add xorps xmm0,xmm0 occasionally and let out-of-order execution overlap some blocks.
More is normally better; having just barely enough to hide latency can slow down with imperfect scheduling. If we wanted to go nuts measuring inc, we'd do this:
mov ecx, 10000000
inc_latency:
%rep 10 ;; source-level repeat of a block, no runtime branching
inc eax
inc ebx
; not ecx, we're using it as a loop counter
inc edx
inc esi
inc edi
inc ebp
inc r8d
inc r9d
inc r10d
inc r11d
inc r12d
inc r13d
inc r14d
inc r15d
%endrep
sub ecx,1 ; break partial-flag false dep for P4
jnz inc_latency ; dec/jnz macro-fuses into 1 uop on Intel SnB-family
If we were worried about partial-flag false dependencies or flag-merging effects, we might experiment with mixing in an xor eax,eax somewhere to let OoO exec overlap more than just when sub wrote all flags. (See INC instruction vs ADD 1: Does it matter?)
There's a similar problem for measuring throughput and latency of shl r32, cl on Sandybridge-family: the flag dependency chain isn't normally relevant for a computation, but putting shl back-to-back creates a dependency through FLAGS as well as through the register. (Or for throughput, there isn't even a register dep).
I posted about this on Agner Fog's blog: https://www.agner.org/optimize/blog/read.php?i=415#860. I mixed shl edx,cl in with four add edx,1 instructions, to see what incremental slowdown adding one more instruction had, where the FLAGS dependency was a non-issue. On SKL, it only slows down by an extra 1.23 cycles on average, so the true latency cost of that shl was only ~1.23 cycles, not 2. (It's not a whole number or just 1 because of resource conflicts to run the flag-merging uops of the shl, I guess. BMI2 shlx edx, edx, ecx would be exactly 1c because it's only a single uop.)
Related: for static performance analysis of whole blocks of code (containing different instructions), see What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?. (It's using the word "latency" for the end-to-end latency of a whole computation, but actually asking about things small enough for OoO exec to overlap different parts, so instruction latency and throughput both matter.)
The Latency=2 numbers for load/store appear to be from Agner Fog's instruction tables (https://agner.org/optimize/). They unfortunately aren't accurate for a chain of mov rax, [rax]. You'll find that's 4c
latency if you measure it by putting that in a loop.
Agner splits up load/store latency into something that makes the total store/reload latency come out correct, but for some reason he doesn't make the load part equal to the L1d load-use latency when it comes from cache instead of the store buffer. (But also note that if the load feeds an ALU instruction instead of another load, the latency is 5c. So the simple addressing-mode fast-path only helps for pure pointer-chasing.)

Is there a penalty when base+offset is in a different page than the base?

The execution times for these three snippets:
pageboundary: dq (pageboundary + 8)
...
mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx - 8]
sub ecx, 1
jnz .loop
And this:
pageboundary: dq (pageboundary - 8)
...
mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx + 8]
sub ecx, 1
jnz .loop
And this:
pageboundary: dq (pageboundary - 4096)
...
mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx + 4096]
sub ecx, 1
jnz .loop
Are, on a 4770K, roughly 5 cycles per iteration for the first snippet and roughly 9 cycles per iteration for the second snippet, then 5 cycles for the third snippet. They both access the exact same address, which is 4K-aligned. In the second snippet, only the address calculation crosses the page boundary: rdx and rdx + 8 don't belong to the same page, the load is still aligned. With a large offset it's back to 5 cycles again.
How does this effect work in general?
Routing the result from the load through an ALU instruction like this:
.loop:
mov rdx, [rdx + 8]
or rdx, 0
sub ecx, 1
jnz .loop
Makes it take 6 cycles per iteration, which makes sense as 5+1. Reg+8 should be a special fast load and AFAIK take 4 cycles, so even in this case there seems to be some penalty, but only 1 cycle.
A test like this was used in response to some of the comments:
.loop:
lfence
; or rdx, 0
mov rdx, [rdx + 8]
; or rdx, 0
; uncomment one of the ORs
lfence
sub ecx, 1
jnz .loop
Putting the or before the mov makes the loop faster than without any or, putting the or after the mov makes it a cycle slower.

Optimization rule: in pointer-connected data structures like linked-lists / trees, put the next or left/right pointers in the first 16 bytes of the object. malloc typically returns 16-byte aligned blocks (alignof(maxalign_t)), so this will ensure the linking pointers are in the same page as the start of the object.
Any other way of ensuring that important struct members are in the same page as the start of the object will also work.
Sandybridge-family normally has 5 cycle L1d load-use latency, but there's a special case for pointer-chasing with small positive displacements with base+disp addressing modes.
Sandybridge-family has 4 cycle load-use latency for [reg + 0..2047] addressing modes, when the base reg is the result of a mov load, not an ALU instruction. Or a penalty if reg+disp is in a different page than reg.
Based on these test results on Haswell and Skylake (and probably original SnB but we don't know), it appears that all of the following conditions must be true:
base reg comes from another load. (A rough heuristic for pointer-chasing, and usually means that load latency is probably part of a dep chain). If objects are usually allocated not crossing a page boundary, then this is a good heuristic. (The HW can apparently detect which execution unit the input is being forwarded from.)
Addressing mode is [reg] or [reg+disp8/disp32]. (Or an indexed load with an xor-zeroed index register! Usually not practically useful, but might provide some insight into the issue/rename stage transforming load uops.)
displacement < 2048. i.e. all bits above bit 11 are zero (a condition HW can check without a full integer adder/comparator.)
(Skylake but not Haswell/Broadwell): the last load wasn't a retried-fastpath. (So base = result of a 4 or 5 cycle load, it will attempt the fast path. But base = result of a 10 cycle retried load, it won't. The penalty on SKL seems to be 10, vs. 9 on HSW).
I don't know if it's the last load attempted on that load port that matters, or if it's actually what happened to the load that produced that input. Perhaps experiments chasing two dep chains in parallel could shed some light; I've only tried one pointer chasing dep chain with a mix of page-changing and non-page-changing displacements.
If all those things are true, the load port speculates that the final effective address will be in the same page as the base register. This is a useful optimization in real cases when load-use latency forms a loop-carried dep chain, like for a linked list or binary tree.
microarchitectural explanation (my best guess at explaining the result, not from anything Intel published):
It seems that indexing the L1dTLB is on the critical path for L1d load latency. Starting that 1 cycle early (without waiting for the output of an adder to calculate the final address) shaves a cycle off the full process of indexing L1d using the low 12 bits of the address, then comparing the 8 tags in that set against the high bits of the physical address produced by the TLB. (Intel's L1d is VIPT 8-way 32kiB, so it has no aliasing problems because the index bits all come from the low 12 bits of the address: the offset within a page which is the same in both the virtual and physical address. i.e. the low 12 bits translate for free from virt to phys.)
Since we don't find an effect for crossing 64-byte boundaries, we know the load port is adding the displacement before indexing the cache.
As Hadi suggests, it seems likely that if there's carry-out from bit 11, the load port lets the wrong-TLB load complete and then redoes it using the normal path. (On HSW, the total load latency = 9. On SKL the total load latency can be 7.5 or 10).
Aborting right away and retrying on the next cycle (to make it 5 or 6 cycles instead of 9) would in theory be possible, but remember that the load ports are pipelined with 1 per clock throughput. The scheduler is expecting to be able to send another uop to the load port in the next cycle, and Sandybridge-family standardizes latencies for everything of 5 cycles and shorter. (There are no 2-cycle instructions).
I didn't test if 2M hugepages help, but probably not. I think the TLB hardware is simple enough that it couldn't recognize that a 1-page-higher index would still pick the same entry. So it probably does the slow retry any time the displacement crosses a 4k boundary, even if that's in the same hugepage. (Page-split loads work this way: if the data actually crosses a 4k boundary (e.g. 8-byte load from page-4), you pay the page-split penalty not just the cache-line split penalty, regardless of hugepages)
Intel's optimization manual documents this special case in section 2.4.5.2 L1 DCache (in the Sandybridge section), but doesn't mention any different-page limitation, or the fact that it's only for pointer-chasing, and doesn't happen when there's an ALU instruction in the dep chain.
(Sandybridge)
Table 2-21. Effect of Addressing Modes on Load Latency
-----------------------------------------------------------------------
Data Type | Base + Offset > 2048 | Base + Offset < 2048
| Base + Index [+ Offset] |
----------------------+--------------------------+----------------------
Integer | 5 | 4
MMX, SSE, 128-bit AVX | 6 | 5
X87 | 7 | 6
256-bit AVX | 7 | 7
(remember, 256-bit loads on SnB take 2 cycles in the load port, unlike on HSW/SKL)
The text around this table also doesn't mention the limitations that exist on Haswell/Skylake, and may also exist on SnB (I don't know).
Maybe Sandybridge doesn't have those limitations and Intel didn't document the Haswell regression, or else Intel just didn't document the limitations in the first place. The table is pretty definite about that addressing mode always being 4c latency with offset = 0..2047.
#Harold's experiment of putting an ALU instruction as part of the load/use pointer-chasing dependency chain confirms that it's this effect that's causing the slowdown: an ALU insn decreased the total latency, effectively giving an instruction like and rdx, rdx negative incremental latency when added to the mov rdx, [rdx-8] dep chain in this specific page-crossing case.
Previous guesses in this answer included the suggestion that using the load result in an ALU vs. another load was what determined the latency. That would be super weird and require looking into the future. That was a wrong interpretation on my part of the effect of adding an ALU instruction into the loop. (I hadn't known about the 9-cycle effect on page crossing, and was thinking that the HW mechanism was a forwarding fast-path for the result inside the load port. That would make sense.)
We can prove that it's the source of the base reg input that matters, not the destination of the load result: Store the same address at 2 separate locations, before and after a page boundary. Create a dep chain of ALU => load => load, and check that it's the 2nd load that's vulnerable to this slowdown / able to benefit from the speedup with a simple addressing mode.
%define off 16
lea rdi, [buf+4096 - 16]
mov [rdi], rdi
mov [rdi+off], rdi
mov ebp, 100000000
.loop:
and rdi, rdi
mov rdi, [rdi] ; base comes from AND
mov rdi, [rdi+off] ; base comes from a load
dec ebp
jnz .loop
... sys_exit_group(0)
section .bss
align 4096
buf: resb 4096*2
Timed with Linux perf on SKL i7-6700k.
off = 8, the speculation is correct and we get total latency = 10 cycles = 1 + 5 + 4. (10 cycles per iteration).
off = 16, the [rdi+off] load is slow, and we get 16 cycles / iter = 1 + 5 + 10. (The penalty seems to be higher on SKL than HSW)
With the load order reversed (doing the [rdi+off] load first), it's always 10c regardless of off=8 or off=16, so we've proved that mov rdi, [rdi+off] doesn't attempt the speculative fast-path if its input is from an ALU instruction.
Without the and, and off=8, we get the expected 8c per iter: both use the fast path. (#harold confirms HSW also gets 8 here).
Without the and, and off=16, we get 15c per iter: 5+10. The mov rdi, [rdi+16] attempts the fast path and fails, taking 10c. Then mov rdi, [rdi] doesn't attempt the fast-path because its input failed. (#harold's HSW takes 13 here: 4 + 9. So that confirms HSW does attempt the fast-path even if the last fast-path failed, and that the fast-path fail penalty really is only 9 on HSW vs. 10 on SKL)
It's unfortunate that SKL doesn't realize that [base] with no displacement can always safely use the fast path.
On SKL, with just mov rdi, [rdi+16] in the loop, the average latency is 7.5 cycles. Based on tests with other mixes, I think it alternates between 5c and 10c: after a 5c load that didn't attempt the fast path, the next one does attempt it and fails, taking 10c. That makes the next load use the safe 5c path.
Adding a zeroed index register actually speeds it up in this case where we know the fast-path is always going to fail. Or using no base register, like [nosplit off + rdi*1], which NASM assembles to 48 8b 3c 3d 10 00 00 00 mov rdi,QWORD PTR [rdi*1+0x10]. Notice that this requires a disp32, so it's bad for code size.
Also beware that indexed addressing modes for micro-fused memory operands are un-laminated in some cases, while base+disp modes aren't. But if you're using pure loads (like mov or vbroadcastss), there's nothing inherently wrong with an indexed addressing mode. Using an extra zeroed register isn't great, though.
On Ice Lake, this special 4 cycle fast path for pointer chasing loads is gone: GP register loads that hit in L1 now generally take 5 cycles, with no difference based on the presence of indexing or the size of the offset.

I've conducted a sufficient number of experiments on Haswell to determine exactly when memory loads are issued speculatively before the effective address is fully calculated. These results also confirm Peter's guess.
I've varied the following parameters:
The offset from pageboundary. The offset used is the same in the definition of pageboundary and the load instruction.
The sign of the offset is either + or -. The sign used in the definition is always the opposite of the one used in the load instruction.
The alignment of pageboundary within the executable binary.
In all of the following graphs, the Y axis represents the load latency in core cycles. The X axis represents the configuration in the form NS1S2, where N is the offset, S1 is the sign of the offset used in the definition, and S2 is the sign used in the load instruction.
The following graph shows that loads are issued before calculating the effective address only when the offset is positive or zero. Note that for all of the offsets between 0-15, the base address and the effective address used in the load instruction are both within the same 4K page.
The next graph shows the point where this pattern changes. The change occurs at offset 213, which is the smallest offset where the base address and the effective address used in the load instruction are both within different 4K pages.
Another important observation that can be made from the previous two graphs is that even if the base address points to a different cache set than the effective address, no penalty is incurred. So it seems that the cache set is opened after calculating the effective address. This indicates that the L1 DTLB hit latency is 2 cycles (that is, it takes 2 cycles for the L1D to receive the tag), but it takes only 1 cycle to open the cache's data array set and the cache's tag array set (which occurs in parallel).
The next graph shows what happens when pageboundary is aligned on a 4K page boundary. In this case, any offset that is not zero will make the base and effective addresses reside within different pages. For example, if the base address of pageboundary is 4096, then the base address of pageboundary used in the load instruction is 4096 - offset, which is obviously in a different 4K page for any non-zero offset.
The next graph shows that the pattern changes again starting from offset 2048. At this point, loads are never issued before calculating the effective address.
This analysis can be confirmed by measuring the number of uops dispatched to the load ports 2 and 3. The total number of retired load uops is 1 billion (equal to the number of iterations). However, when the measured load latency is 9 cycles, the number of load uops dispatched to each of the two ports is 1 billion. Also when the load latency is 5 or 4 cycles, the number of load uops dispatched to each of the two ports is 0.5 billion. So something like this would be happening:
The load unit checks whether the offset is non-negative and smaller than 2048. In that case, it will issue a data load request using the base address. It will also begin calculating the effective address.
In the next cycle, the effective address calculation is completed. If it turns out that the load is to a different 4K page, the load unit waits until the issued load completes and then it discards the results and replays the load. Either way, it supplies the data cache with the set index and line offset.
In the next cycle, the tag comparison is performed and the data is forwarded to the load buffer. (I'm not sure whether the address-speculative load will be aborted in the case of a miss in the L1D or the DTLB.)
In the next cycle, the load buffer receives the data from the cache. If it's supposed to discard the data, it's discarded and it tells the dispatcher to replay the load with address speculation disabled for it. Otherwise, the data is written back. If a following instruction requires the data for its address calculation, it will receive the data in the next cycle (so it will be dispatched in the next cycle if all of its other operands are ready).
These steps explain the observed 4, 5, and 9 cycle latencies.
It might happen that the target page is a hugepage. The only way for the load unit to know whether the base address and the effective address point to the same page when using hugepages is to have the TLB supply the load unit with the size of the page being accessed. Then the load unit has to check whether the effective address is within that page. In modern processors, on a TLB miss, dedicated page-walk hardware is used. In this case, I think that the load unit will not supply the cache set index and cache line offset to the data cache and will use the actual effective address to access the TLB. This requires enabling the page-walk hardware to distinguish between loads with speculative addresses and other loads. Only if that other access missed the TLB will the page walk take place. Now if the target page turned out to be a hugepage and it's a hit in the TLB, it might be possible to inform the load unit that the size of the page is larger than 4K or maybe even of the exact size of the page. The load unit can then make a better decision regarding whether the load should be replayed. However, this logic should take no more than the time for the (potentially wrong) data to reach the load buffer allocated for the load. I think this time is only one cycle.

Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths

I was playing with the code in this answer, slightly modifying it:
BITS 64
GLOBAL _start
SECTION .text
_start:
mov ecx, 1000000
.loop:
;T is a symbol defined with the CLI (-DT=...)
TIMES T imul eax, eax
lfence
TIMES T imul edx, edx
dec ecx
jnz .loop
mov eax, 60 ;sys_exit
xor edi, edi
syscall
Without the lfence I the results I get are consistent with the static analysis in that answer.
When I introduce a single lfence I'd expect the CPU to execute the imul edx, edx sequence of the k-th iteration in parallel with the imul eax, eax sequence of the next (k+1-th) iteration.
Something like this (calling A the imul eax, eax sequence and D the imul edx, edx one):
|
| A
| D A
| D A
| D A
| ...
| D A
| D
|
V time
Taking more or less the same number of cycles but for one unpaired parallel execution.
When I measure the number of cycles, for the original and modified version, with taskset -c 2 ocperf.py stat -r 5 -e cycles:u '-x ' ./main-$T for T in the range below I get
T Cycles:u Cycles:u Delta
lfence no lfence
10 42047564 30039060 12008504
15 58561018 45058832 13502186
20 75096403 60078056 15018347
25 91397069 75116661 16280408
30 108032041 90103844 17928197
35 124663013 105155678 19507335
40 140145764 120146110 19999654
45 156721111 135158434 21562677
50 172001996 150181473 21820523
55 191229173 165196260 26032913
60 221881438 180170249 41711189
65 250983063 195306576 55676487
70 281102683 210255704 70846979
75 312319626 225314892 87004734
80 339836648 240320162 99516486
85 372344426 255358484 116985942
90 401630332 270320076 131310256
95 431465386 285955731 145509655
100 460786274 305050719 155735555
How can the values of Cycles:u lfence be explained?
I would have expected them to be similar to those of Cycles:u no lfence since a single lfence should prevent only the first iteration from being executed in parallel for the two blocks.
I don't think it's due to the lfence overhead as I believe that should be constant for all Ts.
I'd like to fix what's wrong with my forma mentis when dealing with the static analysis of code.
Supporting repository with source files.

I think you're measuring accurately, and the explanation is microarchitectural, not any kind of measurement error.
I think your results for mid to low T support the conclusion that lfence stops the front-end from even issuing past the lfence until all earlier instructions retire, rather than having all the uops from both chains already issued and just waiting for lfence to flip a switch and let multiplies from each chain start to dispatch on alternating cycles.
(port1 would get edx,eax,empty,edx,eax,empty,... for Skylake's 3c latency / 1c throughput multiplier right away, if lfence didn't block the front-end, and overhead wouldn't scale with T.)
You're losing imul throughput when only uops from the first chain are in the scheduler because the front-end hasn't chewed through the imul edx,edx and loop branch yet. And for the same number of cycles at the end of the window when the pipeline is mostly drained and only uops from the 2nd chain are left.
The overhead delta looks linear up to about T=60. I didn't run the numbers, but the slope up to there looks reasonable for T * 0.25 clocks to issue the first chain vs. 3c-latency execution bottleneck. i.e. delta growing maybe 1/12th as fast as total no-lfence cycles.
So (given the lfence overhead I measured below), with T<60:
no_lfence cycles/iter ~= 3T # OoO exec finds all the parallelism
lfence cycles/iter ~= 3T + T/4 + 9.3 # lfence constant + front-end delay
delta ~= T/4 + 9.3
#Margaret reports that T/4 is a better fit than 2*T / 4, but I would have expected T/4 at both the start and end, for a total of 2T/4 slope of the delta.
After about T=60, delta grows much more quickly (but still linearly), with a slope about equal to the total no-lfence cycles, thus about 3c per T. I think at that point, the scheduler (Reservation Station) size is limiting the out-of-order window. You probably tested on a Haswell or Sandybridge/IvyBridge, (which have a 60-entry or 54-entry scheduler respectively. Skylake's is 97 entry (but not fully unified; IIRC BeeOnRope's testing showed that not all the entries could be used for any type of uop. Some were specific to load and/or store, for example.)
The RS tracks un-executed uops. Each RS entry holds 1 unfused-domain uop that's waiting for its inputs to be ready, and its execution port, before it can dispatch and leave the RS1.
After an lfence, the front-end issues at 4 per clock while the back-end executes at 1 per 3 clocks, issuing 60 uops in ~15 cycles, during which time only 5 imul instructions from the edx chain have executed. (There's no load or store micro-fusion here, so every fused-domain uop from the front-end is still only 1 unfused-domain uop in the RS2.)
For large T the RS quickly fills up, at which point the front-end can only make progress at the speed of the back-end. (For small T, we hit the next iteration's lfence before that happens, and that's what stalls the front-end). When T > RS_size, the back-end can't see any of the uops from the eax imul chain until enough back-end progress through the edx chain has made room in the RS. At that point, one imul from each chain can dispatch every 3 cycles, instead of just the 1st or 2nd chain.
Remember from the first section that time spent just after lfence only executing the first chain = time just before lfence executing only the second chain. That applies here, too.
We get some of this effect even with no lfence, for T > RS_size, but there's opportunity for overlap on both sides of a long chain. The ROB is at least twice the size of the RS, so the out-of-order window when not stalled by lfence should be able to keep both chains in flight constantly even when T is somewhat larger than the scheduler capacity. (Remember that uops leave the RS as soon as they've executed. I'm not sure if that means they have to finish executing and forward their result, or merely start executing, but that's a minor difference here for short ALU instructions. Once they're done, only the ROB is holding onto them until they retire, in program order.)
The ROB and register-file shouldn't be limiting the out-of-order window size (http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/) in this hypothetical situation, or in your real situation. They should both be plenty big.
Blocking the front-end is an implementation detail of lfence on Intel's uarches. The manual only says that later instructions can't execute. That wording would allow the front-end to issue/rename them all into the scheduler (Reservation Station) and ROB while lfence is still waiting, as long as none are dispatched to an execution unit.
So a weaker lfence would maybe have flat overhead up to T=RS_size, then the same slope as you see now for T>60. (And the constant part of the overhead might be lower.)
Note that guarantees about speculative execution of conditional/indirect branches after lfence apply to execution, not (as far as I know) to code-fetch. Merely triggering code-fetch is not (AFAIK) useful to a Spectre or Meltdown attack. Possibly a timing side-channel to detect how it decodes could tell you something about the fetched code...
I think AMD's LFENCE is at least as strong on actual AMD CPUs, when the relevant MSR is enabled. (Is LFENCE serializing on AMD processors?).
Extra lfence overhead:
Your results are interesting, but it doesn't surprise me at all that there's significant constant overhead from lfence itself (for small T), as well as the component that scales with T.
Remember that lfence doesn't allow later instructions to start until earlier instructions have retired. This is probably at least a couple cycles / pipeline-stages later than when their results are ready for bypass-fowarding to other execution units (i.e. the normal latency).
So for small T, it's definitely significant that you add extra latency into the chain by requiring the result to not only be ready, but also written back to the register file.
It probably takes an extra cycle or so for lfence to allow the issue/rename stage to start operating again after detecting retirement of the last instruction before it. The issue/rename process takes multiple stages (cycles), and maybe lfence blocks at the start of this, instead of in the very last step before uops are added into the OoO part of the core.
Even back-to-back lfence itself has 4 cycle throughput on SnB-family, according to Agner Fog's testing. Agner Fog reports 2 fused-domain uops (no unfused), but on Skylake I measure it at 6 fused-domain (still no unfused) if I only have 1 lfence. But with more lfence back-to-back, it's fewer uops! Down to ~2 uops per lfence with many back-to-back, which is how Agner measures.
lfence/dec/jnz (a tight loop with no work) runs at 1 iteration per ~10 cycles on SKL, so that might give us an idea of the real extra latency that lfence adds to the dep chains even without the front-end and RS-full bottlenecks.
Measuring lfence overhead with only one dep chain, OoO exec being irrelevant:
.loop:
;mfence ; mfence here: ~62.3c (with no lfence)
lfence ; lfence here: ~39.3c
times 10 imul eax,eax ; with no lfence: 30.0c
; lfence ; lfence here: ~39.6c
dec ecx
jnz .loop
Without lfence, runs at the expected 30.0c per iter. With lfence, runs at ~39.3c per iter, so lfence effectively added ~9.3c of "extra latency" to the critical path dep chain. (And 6 extra fused-domain uops).
With lfence after the imul chain, right before the loop-branch, it's slightly slower. But not a whole cycle slower, so that would indicate that the front-end is issuing the loop-branch + and imul in a single issue-group after lfence allows execution to resume. That being the case, IDK why it's slower. It's not from branch misses.
Getting the behaviour you were expecting:
Interleave the chains in program order, like #BeeOnRope suggests in comments, doesn't require out-of-order execution to exploit the ILP, so it's pretty trivial:
.loop:
lfence ; at the top of the loop is the lowest-overhead place.
%rep T
imul eax,eax
imul edx,edx
%endrep
dec ecx
jnz .loop
You could put pairs of short times 8 imul chains inside a %rep to let OoO exec have an easy time.
Footnote 1: How the front-end / RS / ROB interact
My mental model is that the issue/rename/allocate stages in the front-end add new uops to both the RS and the ROB at the same time.
Uops leave the RS after executing, but stay in the ROB until in-order retirement. The ROB can be large because it's never scanned out-of-order to find the first-ready uop, only scanned in-order to check if the oldest uop(s) have finished executing and thus are ready to retire.
(I assume the ROB is physically a circular buffer with start/end indices, not a queue which actually copies uops to the right every cycle. But just think of it as a queue / list with a fixed max size, where the front-end adds uops at the front, and the retirement logic retires/commits uops from the end as long as they're fully executed, up to some per-cycle per-hyperthread retirement limit which is not usually a bottleneck. Skylake did increase it for better Hyperthreading, maybe to 8 per clock per logical thread. Perhaps retirement also means freeing physical registers which helps HT, because the ROB itself is statically partitioned when both threads are active. That's why retirement limits are per logical thread.)
Uops like nop, xor eax,eax, or lfence, which are handled in the front-end (don't need any execution units on any ports) are added only to the ROB, in an already-executed state. (A ROB entry presumably has a bit that marks it as ready to retire vs. still waiting for execution to complete. This is the state I'm talking about. For uops that did need an execution port, I assume the ROB bit is set via a completion port from the execution unit. And that the same completion-port signal frees its RS entry.)
Uops stay in the ROB from issue to retirement.
Uops stay in the RS from issue to execution. The RS can replay uops in a few cases, e.g. for the other half of a cache-line-split load, or if it was dispatched in anticipation of load data arriving, but in fact it didn't. (Cache miss or other conflicts like Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?) Or when a load port speculates that it can bypass the AGU before starting a TLB lookup to shorten pointer-chasing latency with small offsets - Is there a penalty when base+offset is in a different page than the base?
So we know that the RS can't remove a uop right as it dispatches, because it might need to be replayed. (Can happen even to non-load uops that consume load data.) But any speculation that needs replays is short-range, not through a chain of uops, so once a result comes out the other end of an execution unit, the uop can be removed from the RS. Probably this is part of what a completion port does, along with putting the result on the bypass forwarding network.
Footnote 2: How many RS entries does a micro-fused uop take?
TL:DR: P6-family: RS is fused, SnB-family: RS is unfused.
A micro-fused uop is issued to two separate RS entries in Sandybridge-family, but only 1 ROB entry. (Assuming it isn't un-laminated before issue, see section 2.3.5 for HSW or section 2.4.2.4 for SnB of Intel's optimization manual, and Micro fusion and addressing modes. Sandybridge-family's more compact uop format can't represent indexed addressing modes in the ROB in all cases.)
The load can dispatch independently, ahead of the other operand for the ALU uop being ready. (Or for micro-fused stores, either of the store-address or store-data uops can dispatch when its input is ready, without waiting for both.)
I used the two-dep-chain method from the question to experimentally test this on Skylake (RS size = 97), with micro-fused or edi, [rdi] vs. mov+or, and another dep chain in rsi. (Full test code, NASM syntax on Godbolt)
; loop body
%rep T
%if FUSE
or edi, [rdi] ; static buffers are in the low 32 bits of address space, in non-PIE
%else
mov eax, [rdi]
or edi, eax
%endif
%endrep
%rep T
%if FUSE
or esi, [rsi]
%else
mov eax, [rsi]
or esi, eax
%endif
%endrep
Looking at uops_executed.thread (unfused-domain) per cycle (or per second which perf calculates for us), we can see a throughput number that doesn't depend on separate vs. folded loads.
With small T (T=30), all the ILP can be exploited, and we get ~0.67 uops per clock with or without micro-fusion. (I'm ignoring the small bias of 1 extra uop per loop iteration from dec/jnz. It's negligible compared to the effect we'd see if micro-fused uops only used 1 RS entry)
Remember that load+or is 2 uops, and we have 2 dep chains in flight, so this is 4/6, because or edi, [rdi] has 6 cycle latency. (Not 5, which is surprising, see below.)
At T=60, we still have about 0.66 unfused uops executed per clock for FUSE=0, and 0.64 for FUSE=1. We can still find basically all the ILP, but it's just barely starting to dip, as the two dep chains are 120 uops long (vs. a RS size of 97).
At T=120, we have 0.45 unfused uops per clock for FUSE=0, and 0.44 for FUSE=1. We're definitely past the knee here, but still finding some of the ILP.
If a micro-fused uop took only 1 RS entry, FUSE=1 T=120 should be about the same speed as FUSE=0 T=60, but that's not the case. Instead, FUSE=0 or 1 makes nearly no difference at any T. (Including larger ones like T=200: FUSE=0: 0.395 uops/clock, FUSE=1: 0.391 uops/clock). We'd have to go to very large T before we start for the time with 1 dep-chain in flight to totally dominate the time with 2 in flight, and get down to 0.33 uops / clock (2/6).
Oddity: We have such a small but still measurable difference in throughput for fused vs. unfused, with separate mov loads being faster.
Other oddities: the total uops_executed.thread is slightly lower for FUSE=0 at any given T. Like 2,418,826,591 vs. 2,419,020,155 for T=60. This difference was repeatable down to +- 60k out of 2.4G, plenty precise enough. FUSE=1 is slower in total clock cycles, but most of the difference comes from lower uops per clock, not from more uops.
Simple addressing modes like [rdi] are supposed to only have 4 cycle latency, so load + ALU should be only 5 cycle. But I measure 6 cycle latency for the load-use latency of or rdi, [rdi], or with a separate MOV-load, or with any other ALU instruction I can never get the load part to be 4c.
A complex addressing mode like [rdi + rbx + 2064] has the same latency when there's an ALU instruction in the dep chain, so it appears that Intel's 4c latency for simple addressing modes only applies when a load is forwarding to the base register of another load (with up to a +0..2047 displacement and no index).
Pointer-chasing is common enough that this is a useful optimization, but we need to think of it as a special load-load forwarding fast-path, not as a general data ready sooner for use by ALU instructions.
P6-family is different: an RS entry holds a fused-domain uop.
#Hadi found an Intel patent from 2002, where Figure 12 shows the RS in the fused domain.
Experimental testing on a Conroe (first gen Core2Duo, E6600) shows that there's a large difference between FUSE=0 and FUSE=1 for T=50. (The RS size is 32 entries).
T=50 FUSE=1: total time of 2.346G cycles (0.44IPC)
T=50 FUSE=0: total time of 3.272G cycles (0.62IPC = 0.31 load+OR per clock). (perf / ocperf.py doesn't have events for uops_executed on uarches before Nehalem or so, and I don't have oprofile installed on that machine.)
T=24 there's a negligible difference between FUSE=0 and FUSE=1, around 0.47 IPC vs 0.9 IPC (~0.45 load+OR per clock).
T=24 is still over 96 bytes of code in the loop, too big for Core 2's 64-byte (pre-decode) loop buffer, so it's not faster because of fitting in a loop buffer. Without a uop-cache, we have to be worried about the front-end, but I think we're fine because I'm exclusively using 2-byte single-uop instructions that should easily decode at 4 fused-domain uops per clock.

I'll present an analysis for the case where T = 1 for both codes (with and without lfence). You can then extend this for other values of T. You can refer to Figure 2.4 of the Intel Optimization Manual for a visual.
Because there is only a single easily predicted branch, the frontend will only stall if the backend stalled. The frontend is 4-wide in Haswell, which means up to 4 fused uops can be issued from the IDQ (instruction decode queue, which is just a queue that holds in-order fused-domain uops, also called the uop queue) to the reservation station (RS) entires of the scheduler. Each imul is decoded into a single uop that cannot be fused. The instructions dec ecx and jnz .loop get macrofused in the frontend to a single uop. One of the differences between microfusion and macrofusion is that when the scheduler dispatches a macrofused uop (that are not microfused) to the execution unit it's assigned to, it gets dispatched as a single uop. In contrast, a microfused uop needs to be split into its constituent uops, each of which must be separately dispatched to an execution unit. (However, splitting microfused uops happens on entrance to the RS, not on dispatch, see Footnote 2 in #Peter's answer). lfence is decoded into 6 uops. Recognizing microfusion only matters in the backend, and in this case, there is no microfusion in the loop.
Since the loop branch is easily predictable and since the number of iterations is relatively large, we can just assume without compromising accuracy that the allocator will always be able to allocate 4 uops per cycle. In other words, the scheduler will receive 4 uops per cycle. Since there is no micorfusion, each uop will be dispatched as a single uop.
imul can only be executed by the Slow Int execution unit (see Figure 2.4). This means that the only choice for executing the imul uops is to dispatch them to port 1. In Haswell, the Slow Int is nicely pipelined so that a single imul can be dispatched per cycle. But it takes three cycles for the result of the multiplication be available for any instruction that requires (the writeback stage is the third cycle from the dispatch stage of the pipeline). So for each dependence chain, at most one imul can be dispatched per 3 cycles.
Becausedec/jnz is predicted taken, the only execution unit that can execute it is Primary Branch on port 6.
So at any given cycle, as long as the RS has space, it will receive 4 uops. But what kind of uops? Let's examine the loop without lfence:
imul eax, eax
imul edx, edx
dec ecx/jnz .loop (macrofused)
There are two possibilities:
Two imuls from the same iteration, one imul from a neighboring iteration, and one dec/jnz from one of those two iterations.
One dec/jnz from one iteration, two imuls from the next iteration, and one dec/jnz from the same iteration.
So at the beginning of any cycle, the RS will receive at least one dec/jnz and at least one imul from each chain. At the same time, in the same cycle and from those uops that are already there in the RS, the scheduler will do one of two actions:
Dispatch the oldest dec/jnz to port 6 and dispatch the oldest imul that is ready to port 1. That's a total of 2 uops.
Because the Slow Int has a latency of 3 cycles but there are only two chains, for each cycle of 3 cycles, no imul in the RS will be ready for execution. However, there is always at least one dec/jnz in the RS. So the scheduler can dispatch that. That's a total of 1 uop.
Now we can calculate the expected number of uops in the RS, XN, at the end of any given cycle N:
XN = XN-1 + (the number of uops to be allocated in the RS at the beginning of cycle N) - (the expected number of uops that will be dispatched at the beginning of cycle N)
= XN-1 + 4 - ((0+1)*1/3 + (1+1)*2/3)
= XN-1 + 12/3 - 5/3
= XN-1 + 7/3 for all N > 0
The initial condition for the recurrence is X0 = 4. This is a simple recurrence that can be solved by unfolding XN-1.
XN = 4 + 2.3 * N for all N >= 0
The RS in Haswell has 60 entries. We can determine the first cycle in which the RS is expected to become full:
60 = 4 + 7/3 * N
N = 56/2.3 = 24.3
So at the end of cycle 24.3, the RS is expected to be full. This means that at the beginning of cycle 25.3, the RS will not be able to receive any new uops. Now the number of iterations, I, under consideration determines how you should proceed with the analysis. Since a dependency chain will require at least 3*I cycles to execute, it takes about 8.1 iterations to reach cycle 24.3. So if the number of iterations is larger than 8.1, which is the case here, you need to analyze what happens after cycle 24.3.
The scheduler dispatches instructions at the following rates every cycle (as discussed above):
1
2
2
1
2
2
1
2
.
.
But the allocator will not allocate any uops in the RS unless there are at least 4 available entries. Otherwise, it will not waste power on issuing uops at a sub-optimal throughput. However, it is only at the beginning of every 4th cycle are there at least 4 free entries in the RS. So starting from cycle 24.3, the allocator is expected to get stalled 3 out of every 4 cycles.
Another important observation for the code being analyzed is that it never happens that there are more than 4 uops that can be dispatched, which means that the average number of uops that leave their execution units per cycle is not larger than 4. At most 4 uops can be retired from the ReOrder Buffer (ROB). This means that the ROB can never be on the critical path. In other words, performance is determined by the dispatch throughput.
We can calculate the IPC (instructions per cycles) fairly easily now. The ROB entries look something like this:
imul eax, eax - N
imul edx, edx - N + 1
dec ecx/jnz .loop - M
imul eax, eax - N + 3
imul edx, edx - N + 4
dec ecx/jnz .loop - M + 1
The column to the right shows the cycles in which the instruction can be retired. Retirement happens in order and is bounded by the latency of the critical path. Here each dependency chain have the same path length and so both constitute two equal critical paths of length 3 cycles. So every 3 cycles, 4 instructions can be retired. So the IPC is 4/3 = 1.3 and the CPI is 3/4 = 0.75. This is much smaller than the theoretical optimal IPC of 4 (even without considering micro- and macro-fusion). Because retirement happens in-order, the retirement behavior will be the same.
We can check our analysis using both perf and IACA. I'll discuss perf. I've a Haswell CPU.
perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-nolfence
Performance counter stats for './main-1-nolfence' (10 runs):
30,01,556 cycles:u ( +- 0.00% )
40,00,005 instructions:u # 1.33 insns per cycle ( +- 0.00% )
0 RESOURCE_STALLS.ROB
23,42,246 UOPS_ISSUED.ANY ( +- 0.26% )
22,49,892 RESOURCE_STALLS.RS ( +- 0.00% )
0.001061681 seconds time elapsed ( +- 0.48% )
There are 1 million iterations each takes about 3 cycles. Each iteration contains 4 instructions and the IPC is 1.33.RESOURCE_STALLS.ROB shows the number of cycles in which the allocator was stalled due to a full ROB. This of course never happens. UOPS_ISSUED.ANY can be used to count the number of uops issued to the RS and the number of cycles in which the allocator was stalled (no specific reason). The first is straightforward (not shown in the perf output); 1 million * 3 = 3 million + small noise. The latter is much more interesting. It shows that about 73% of all time the allocator stalled due to a full RS, which matches our analysis. RESOURCE_STALLS.RS counts the number of cycles in which the allocator was stalled due to a full RS. This is close to UOPS_ISSUED.ANY because the allocator does not stall for any other reason (although the difference could be proportional to the number of iterations for some reason, I'll have to see the results for T>1).
The analysis of the code without lfence can be extended to determine what happens if an lfence was added between the two imuls. Let's check out the perf results first (IACA unfortunately does not support lfence):
perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-lfence
Performance counter stats for './main-1-lfence' (10 runs):
1,32,55,451 cycles:u ( +- 0.01% )
50,00,007 instructions:u # 0.38 insns per cycle ( +- 0.00% )
0 RESOURCE_STALLS.ROB
1,03,84,640 UOPS_ISSUED.ANY ( +- 0.04% )
0 RESOURCE_STALLS.RS
0.004163500 seconds time elapsed ( +- 0.41% )
Observe that the number of cycles has increased by about 10 million, or 10 cycles per iteration. The number of cycles does not tell us much. The number of retired instruction has increased by a million, which is expected. We already know that the lfence will not make instruction complete any faster, so RESOURCE_STALLS.ROB should not change. UOPS_ISSUED.ANY and RESOURCE_STALLS.RS are particularly interesting. In this output, UOPS_ISSUED.ANY counts cycles, not uops. The number of uops can also be counted (using cpu/event=0x0E,umask=0x1,name=UOPS_ISSUED.ANY/u instead of cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u) and has increased by 6 uops per iteration (no fusion). This means that an lfence that was placed between two imuls was decoded into 6 uops. The one million dollar question is now what these uops do and how they move around in the pipe.
RESOURCE_STALLS.RS is zero. What does that mean? This indicates that the allocator, when it sees an lfence in the IDQ, it stops allocating until all current uops in the ROB retire. In other words, the allocator will not allocate entries in the RS past an lfence until the lfence retires. Since the loop body contains only 3 other uops, the 60-entry RS will never be full. In fact, it will be always almost empty.
The IDQ in reality is not a single simple queue. It consists of multiple hardware structures that can operate in parallel. The number of uops an lfence requires depends on the exact design of the IDQ. The allocator, which also consists of many different hardware structures, when it see there is an lfence uops at the front of any of the structures of the IDQ, it suspends allocation from that structure until the ROB is empty. So different uops are usd with different hardware structures.
UOPS_ISSUED.ANY shows that the allocator is not issuing any uops for about 9-10 cycles per iteration. What is happening here? Well, one of the uses of lfence is that it can tell us how much time it takes to retire an instruction and allocate the next instruction. The following assembly code can be used to do that:
TIMES T lfence
The performance event counters will not work well for small values of T. For sufficiently large T, and by measuring UOPS_ISSUED.ANY, we can determine that it takes about 4 cycles to retire each lfence. That's because UOPS_ISSUED.ANY will be incremented about 4 times every 5 cycles. So after every 4 cycles, the allocator issues another lfence (it doesn't stall), then it waits for another 4 cycles, and so on. That said, instructions that produce results may require 1 or few more cycle to retire depending on the instruction. IACA always assume that it takes 5 cycles to retire an instruction.
Our loop looks like this:
imul eax, eax
lfence
imul edx, edx
dec ecx
jnz .loop
At any cycle at the lfence boundary, the ROB will contain the following instructions starting from the top of the ROB (the oldest instruction):
imul edx, edx - N
dec ecx/jnz .loop - N
imul eax, eax - N+1
Where N denotes the cycle number at which the corresponding instruction was dispatched. The last instruction that is going to complete (reach the writeback stage) is imul eax, eax. and this happens at cycle N+4. The allocator stall cycle count will be incremented during cycles, N+1, N+2, N+3, and N+4. However it will about 5 more cycles until imul eax, eax retires. In addition, after it retires, the allocator needs to clean up the lfence uops from the IDQ and allocate the next group of instructions before they can be dispatched in the next cycle. The perf output tells us that it takes about 13 cycles per iteration and that the allocator stalls (because of the lfence) for 10 out of these 13 cycles.
The graph from the question shows only the number of cycles for up to T=100. However, there is another (final) knee at this point. So it would be better to plot the cycles for up to T=120 to see the full pattern.

Deterministic execution time on x86-64

Is there an x64 instruction(s) that takes a fixed amount of time, regardless of the micro-architectural state such as caches, branch predictors, etc.?
For instance, if a hypothetical add or increment instruction always takes n cycles, then I can implement a timer in my program by performing that add instruction multiple times. Perhaps an increment instruction with register operands may work, but it's not clear to me whether Intel's spec guarantees that it would take deterministic number of cycles. Note that I am not interested in current time, but only a primitive / instruction sequence that takes a fixed number of cycles.
Assume that I have a way to force atomic execution i.e. no context switches during timer's execution i.e. only my program gets to run.
On a related note, I also cannot use system services to keep track of time, because I am working in a setting where my program is a user-level program running on an untrusted OS.

The x86 ISA documents don't guarantee anything about what takes a certain amount of cycles. The ISA allows things like Transmeta's Crusoe that JIT-compiled x86 instructions to an internal VLIW instruction set. It could conceivably do optimizations between adjacent instructions.
The best you can do is write something that will work on as many known microarchitectures as possible. I'm not aware of any x86-64 microarchitectures that are "weird" like Transmeta, only the usual superscalar decode-to-uops designs like Intel and AMD use.
Simple integer ALU instructions like ADD are almost all 1c latency, and tiny loops that don't touch memory are almost totally unaffected anything, and are very predictable. If they run a lot of iterations, they're also almost totally unaffected by anything to do with the impact of surrounding code on the out-of-order core, and recover very quickly from disruptions like timer interrupts.
On nearly every Intel microarchitecture, this loop will run at one iteration per clock:
mov ecx, 1234567 ; or use a 64-bit register for higher counts.
ALIGN 16
.loop:
sub ecx, 1 ; not dec because of Pentium 4.
jnz .loop
Agner Fog's microarch guide and instruction tables say that VIA Nano3000 has a taken-branch throughput of one per 3 cycles, so this loop would only run at one iteration per 3 clocks there. AMD Bulldozer-family and Jaguar similarly have a max throughput of one taken JCC per 2 clocks.
See also other performance links in the x86 tag wiki.
If you want a more power-efficient loop, you could use PAUSE in the loop, but it waits ~100 cycles on Skylake, up from ~5 cycles on previous microarchitectures. (You can make cycle-accurate predictions for more complicated loops that don't touch memory, but that depends on microarchitectural details.)
You could make a more reliable loop that's less likely to have different bottlenecks on different CPUs by making a longer dependency chain within each iteration. Since each instruction depends on the previous, it can still only run at one instruction per cycle (not counting the branch), drastically the branches per cycle.
# one add/sub per clock, limited by latency
# should run one iteration per 6 cycles on every CPU listed in Agner Fog's tables
# And should be the same on all future CPUs unless they do magic inter-instruction optimizations.
# Or it could be slower on CPUs that always have a bubble on taken branches, but it seems unlikely anyone would design one.
ALIGN 16
.loop:
add ecx, 1
sub ecx, 1 ; net result ecx+0
add ecx, 1
sub ecx, 1 ; net result ecx+0
add ecx, 1
sub ecx, 2 ; net result ecx-1
jnz .loop
Unrolling like this ensures that front-end effects are not a bottleneck. It gives the frontend decoders plenty of time to queue up the 6 add/sub insns and the jcc before the next branch.
Using add/sub instead of dec/inc avoids a partial-flag false dependency on Pentium 4. (Although I don't think that would be an issue anyway.)
Pentium4's double-clocked ALUs can each run two ADDs per clock, but the latency is still one cycle. i.e. apparently it can't forward a result internally to chew through this dependency chain twice as fast as any other CPU.
And yes, Prescott P4 is an x86-64 CPU, so we can't quite ignore P4 if we need a general purpose answer.

Why isn't MOVNTI slower, in a loop storing repeatedly to the same address?

section .text
%define n 100000
_start:
xor rcx, rcx
jmp .cond
.begin:
movnti [array], eax
.cond:
add rcx, 1
cmp rcx, n
jl .begin
section .data
array times 81920 db "A"
According to perf it runs at 1.82 instructions per cycle. I cannot understand why it's so fast. After all, it has to be stored in memory (RAM) so it should be slow.
P.S Is there any loop-carried-dependency?
EDIT
section .text
%define n 100000
_start:
xor rcx, rcx
jmp .cond
.begin:
movnti [array+rcx], eax
.cond:
add rcx, 1
cmp rcx, n
jl .begin
section .data
array times n dq 0
Now, the iteration take 5 cycle per iteration. Why? After all, there is still no loop-carried-dependency.

movnti can apparently sustain a throughput of one per clock when writing to the same address repeatedly.
I think movnti keeps writing into the same fill buffer, and it's not getting flushed very often because there are no other loads or stores happening. (That link is about copying from WC video memory with SSE4.1 NT loads, as well as storing to normal memory with NT stores.)
So the NT write-combining fill-buffer acts like a cache for multiple overlapping NT stores to the same address, and writes are actually hitting in the fill buffer instead of going to DRAM each time.
DDR DRAM only supports burst-transfer commands. If every movnti produced a 4B write that actually was visible to the memory chips, there'd be no way it could run that fast. The memory controller either has to read/modify/write, or do an interrupted burst transfer, since there is no non-burst write command. See also Ulrich Drepper's What Every Programmer Should Know About Memory.
We can further prove this is the case by running the test on multiple cores at once. Since they don't slow each other down at all, we can be sure that the writes are only infrequently making it out of the CPU cores and competing for memory cycles.
The reason your experiment doesn't show your loop running at 4 instruction per clock (one cycle per iteration) is that you used such a tiny repeat count. 100k cycles barely accounts for the startup overhead (which perf's timing includes).
For example, on a Core2 E6600 (Merom/Conroe) with dual channel DDR2 533MHz, the total time including all process startup / exit stuff is 0.113846 ms. That's only 266,007 cycles.
A more reasonable microbenchmark shows one iteration (one movnti) per cycle:
global _start
_start:
xor ecx,ecx
.begin:
movnti [array], eax
dec ecx
jnz .begin ; 2^32 iterations
mov eax, 60 ; __NR_exit
xor edi,edi
syscall ; exit(0)
section .bss
array resb 81920
(asm-link is a script I wrote)
$ asm-link movnti-same-address.asm
+ yasm -felf64 -Worphan-labels -gdwarf2 movnti-same-address.asm
+ ld -o movnti-same-address movnti-same-address.o
$ perf stat -e task-clock,cycles,instructions ./movnti-same-address
Performance counter stats for './movnti-same-address':
1835.056710 task-clock (msec) # 0.995 CPUs utilized
4,398,731,563 cycles # 2.397 GHz
12,891,491,495 instructions # 2.93 insns per cycle
1.843642514 seconds time elapsed
Running in parallel:
$ time ./movnti-same-address; time ./movnti-same-address & time ./movnti-same-address &
real 0m1.844s / user 0m1.828s # running alone
[1] 12523
[2] 12524
peter#tesla:~/src/SO$
real 0m1.855s / user 0m1.824s # running together
real 0m1.984s / user 0m1.808s
# output compacted by hand to save space
I expect perfect SMP scaling (except with hyperthreading), up to any number of cores. e.g. on a 10-core Xeon, 10 copies of this test could run at the same time (on separate physical cores), and each one would finish in the same time as if it was running alone. (Single-core turbo vs. multi-core turbo will also be a factor, though, if you measure wall-clock time instead of cycle counts.)
zx485's uop count nicely explains why the loop isn't bottlenecked by the frontend or unfused-domain execution resources.
However, this disproves his theory about the ratio of CPU to memory clocks having anything to do with it. Interesting coincidence, though, that the OP chose a count that happened to make the final total IPC work out that way.
P.S Is there any loop-carried-dependency?
Yes, the loop counter. (1 cycle). BTW, you could have saved an insn by counting down towards zero with dec / jg instead of counting up and having to use a cmp.
The write-after-write memory dependency isn't a "true" dependency in the normal sense, but it is something the CPU has to keep track of. The CPU doesn't "notice" that the same value is written repeatedly, so it has to make sure the last write is the one that "counts".
This is called an architectural hazard. I think the term still applies when talking about memory, rather than registers.

The result is plausible. Your loop code consists of the follwing instuctions. According to Agner Fog's instruction tables, these have the following timings:
Instruction regs fused unfused ports Latency Reciprocal Throughput
---------------------------------------------------------------------------------------------------------------------------
MOVNTI m,r 2 2 p23 p4 ~400 1
ADD r,r/i 1 1 p0156 1 0.25
CMP r,r/i 1 1 p0156 1 0.25
Jcc short 1 1 p6 1 1-2 if predicted that the jump is taken
Fused CMP+Jcc short 1 1 p6 1 1-2 if predicted that the jump is taken
So
MOVNTI consumes 2 uOps, 1 in port 2 or 3 and one in port 4
ADD consumes 1 uOps in port 0 or 1 or 5 or 6
CMP and Jcc macro-fuse to the last line in the table resulting in a consumption of 1 uOp
Because neither ADD nor CMP+Jcc depend on the result of MOVNTI they can be executed (nearly) in parallel on recent architectures, for example using the ports 1,2,4,6. The worst case would be a latency of 1 between ADD and CMP+Jcc.
This is most likely a design error in your code: you're essentially writing to the same address [array] a 100000 times, because you do not adjust the address.
The repeated writes can even go to the L1-cache under the condition that
The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region.
but it doesn't look like this and won't make for a great difference, anyway, because even if writing to memory, the memory speed will be the limiting factor.
For example, if you have a 3GHz CPU and 1600MHz DDR3-RAM this will result in 3/1.6 = 1.875 CPU cycles per memory cycle. This seems plausible.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio