How are x86 uops scheduled, exactly?

How are x86 uops scheduled, exactly? - performance

Modern x86 CPUs break down the incoming instruction stream into micro-operations (uops1) and then schedule these uops out-of-order as their inputs become ready. While the basic idea is clear, I'd like to know the specific details of how ready instructions are scheduled, since it impacts micro-optimization decisions.
For example, take the following toy loop2:
top:
lea eax, [ecx + 5]
popcnt eax, eax
add edi, eax
dec ecx
jnz top
this basically implements the loop (with the following correspondence: eax -> total, c -> ecx):
do {
total += popcnt(c + 5);
} while (--c > 0);
I'm familiar with the process of optimizing any small loop by looking at the uop breakdown, dependency chain latencies and so on. In the loop above we have only one carried dependency chain: dec ecx. The first three instructions of the loop (lea, popcnt, add) are part of a dependency chain that starts fresh each loop.
The final dec and jne are fused. So we have a total of 4 fused-domain uops, and one only loop-carried dependency chain with a latency of 1 cycle. So based on that criteria, it seems that the loop can execute at 1 cycle/iteration.
However, we should look at the port pressure too:
The lea can execute on ports 1 and 5
The popcnt can execute on port 1
The add can execute on port 0, 1, 5 and 6
The predicted-taken jnz executes on port 6
So to get to 1 cycle / iteration, you pretty much need the following to happen:
The popcnt must execute on port 1 (the only port it can execute on)
The lea must execute on port 5 (and never on port 1)
The add must execute on port 0, and never on any of other three ports it can execute on
The jnz can only execute on port 6 anyway
That's a lot of conditions! If instructions just got scheduled randomly, you could get a much worse throughput. For example, 75% the add would go to port 1, 5 or 6, which would delay the popcnt, lea or jnz by one cycle. Similarly for the lea which can go to 2 ports, one shared with popcnt.
IACA on the other hand reports a result very close to optimal, 1.05 cycles per iteration:
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - l.o
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 1.05 Cycles Throughput Bottleneck: FrontEnd, Port0, Port1, Port5
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 1.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 1.0 | 0.9 | 0.0 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
# - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | | | | 1.0 | | | CP | lea eax, ptr [ecx+0x5]
| 1 | | 1.0 | | | | | | | CP | popcnt eax, eax
| 1 | 0.1 | | | | | 0.1 | 0.9 | | CP | add edi, eax
| 1 | 0.9 | | | | | | 0.1 | | CP | dec ecx
| 0F | | | | | | | | | | jnz 0xfffffffffffffff4
It pretty much reflects the necessary "ideal" scheduling I mentioned above, with a small deviation: it shows the add stealing port 5 from the lea on 1 out of 10 cycles. It also doesn't know that the fused branch is going to go to port 6 since it is predicted taken, so it puts most of the uops for the branch on port 0, and most of the uops for the add on port 6, rather than the other way around.
It's not clear if the extra 0.05 cycles that IACA reports over the optimal is the result of some deep, accurate analysis, or a less insightful consequence of the algorithm it uses, e.g., analyzing the loop over a fixed number of cycles, or just a bug or whatever. The same goes for the 0.1 fraction of a uop that it thinks will go to the non-ideal port. It is also not clear if one explains the other - I would think that mis-assigning a port 1 out of 10 times would cause a cycle count of 11/10 = 1.1 cycles per iteration, but I haven't worked out the actual downstream results - maybe the impact is less on average. Or it could just be rounding (0.05 == 0.1 to 1 decimal place).
So how do modern x86 CPUs actually schedule? In particular:
When multiple uops are ready in the reservation station, in what order are they scheduled to ports?
When a uop can go to multiple ports (like the add and lea in the example above), how is it decided which port is chosen?
If any of the answers involve a concept like oldest to choose among uops, how is it defined? Age since it was delivered to the RS? Age since it became ready? How are ties broken? Does program order ever come into it?
Results on Skylake
Let's measure some actual results on Skylake to check which answers explain the experimental evidence, so here are some real-world measured results (from perf) on my Skylake box. Confusingly, I'm going switch to using imul for my "only executes on one port" instruction, since it has many variants, including 3-argument versions that allow you to use different registers for the source(s) and destination. This is very handy when trying to construct dependency chains. It also avoids the whole "incorrect dependency on destination" that popcnt has.
Independent Instructions
Let's start by looking at the simple (?) case that the instructions are relatively independent - without any dependency chains other than trivial ones like the loop counter.
Here's a 4 uop loop (only 3 executed uops) with mild pressure. All instructions are independent (don't share any sources or destinations). The add could in principle steal the p1 needed by the imul or p6 needed by the dec:
Example 1
instr p0 p1 p5 p6
xor (elim)
imul X
add X X X X
dec X
top:
xor r9, r9
add r8, rdx
imul rax, rbx, 5
dec esi
jnz top
The results is that this executes with perfect scheduling at 1.00 cycles / iteration:
560,709,974 uops_dispatched_port_port_0 ( +- 0.38% )
1,000,026,608 uops_dispatched_port_port_1 ( +- 0.00% )
439,324,609 uops_dispatched_port_port_5 ( +- 0.49% )
1,000,041,224 uops_dispatched_port_port_6 ( +- 0.00% )
5,000,000,110 instructions:u # 5.00 insns per cycle ( +- 0.00% )
1,000,281,902 cycles:u
( +- 0.00% )
As expected, p1 and p6 are fully utilized by the imul and dec/jnz respectively, and then the add issues roughly half and half between the remaining available ports. Note roughly - the actual ratio is 56% and 44%, and this ratio is pretty stable across runs (note the +- 0.49% variation). If I adjust the loop alignment, the split changes (53/46 for 32B alignment, more like 57/42 for 32B+4 alignment). Now, we if change nothing except the position of imul in the loop:
Example 2
top:
imul rax, rbx, 5
xor r9, r9
add r8, rdx
dec esi
jnz top
Then suddenly the p0/p5 split is exactly 50%/50%, with 0.00% variation:
500,025,758 uops_dispatched_port_port_0 ( +- 0.00% )
1,000,044,901 uops_dispatched_port_port_1 ( +- 0.00% )
500,038,070 uops_dispatched_port_port_5 ( +- 0.00% )
1,000,066,733 uops_dispatched_port_port_6 ( +- 0.00% )
5,000,000,439 instructions:u # 5.00 insns per cycle ( +- 0.00% )
1,000,439,396 cycles:u ( +- 0.01% )
So that's already interesting, but it's hard to tell what's going on. Perhaps the exact behavior depends on the initial conditions at loop entry and is sensitive to ordering within the loop (e.g., because counters are used). This example shows that something more than "random" or "stupid" scheduling is going on. In particular, if you just eliminate the imul instruction from the loop, you get the following:
Example 3
330,214,329 uops_dispatched_port_port_0 ( +- 0.40% )
314,012,342 uops_dispatched_port_port_1 ( +- 1.77% )
355,817,739 uops_dispatched_port_port_5 ( +- 1.21% )
1,000,034,653 uops_dispatched_port_port_6 ( +- 0.00% )
4,000,000,160 instructions:u # 4.00 insns per cycle ( +- 0.00% )
1,000,235,522 cycles:u ( +- 0.00% )
Here, the add is now roughly evenly distributed among p0, p1 and p5 - so the presence of the imul did affect the add scheduling: it wasn't just a consequence of some "avoid port 1" rule.
Note here that total port pressure is only 3 uops/cycle, since the xor is a zeroing idiom and is eliminated in the renamer. Let's try with the max pressure of 4 uops. I expect whatever mechanism kicked in above to able to perfectly schedule this also. We only change xor r9, r9 to xor r9, r10, so it is no longer a zeroing idiom. We get the following results:
Example 4
top:
xor r9, r10
add r8, rdx
imul rax, rbx, 5
dec esi
jnz top
488,245,238 uops_dispatched_port_port_0 ( +- 0.50% )
1,241,118,197 uops_dispatched_port_port_1 ( +- 0.03% )
1,027,345,180 uops_dispatched_port_port_5 ( +- 0.28% )
1,243,743,312 uops_dispatched_port_port_6 ( +- 0.04% )
5,000,000,711 instructions:u # 2.66 insns per cycle ( +- 0.00% )
1,880,606,080 cycles:u ( +- 0.08% )
Oops! Rather than evenly scheduling everything across p0156, the scheduler has underused p0 (it's only executing something ~49% of cycles), and hence p1 and p6 are oversubcribed because they are executing both their required ops of imul and dec/jnz. This behavior, I think is consistent with a counter-based pressure indicator as hayesti indicated in their answer, and with uops being assigned to a port at issue-time, not at execution time as both
hayesti and Peter Cordes mentioned. That behavior3 makes the execute the oldest ready uops rule not nearly as effective. If uops weren't bound to execution ports at issue, but rather at execution, then this "oldest" rule would fix the problem above after one iteration - once one imul and one dec/jnz got held back for a single iteration, they will always be older than the competing xor and add instructions, so should always get scheduled first. One thing I am learning though, is that if ports are assigned at issue time, this rule doesn't help because the ports are pre-determined at issue time. I guess it still helps a bit in favoring instructions which are part of long dependecy chains (since these will tend to fall behind), but it's not the cure-all I thought it was.
That also seems to be a explain the results above: p0 gets assigned more pressure than it really has because the dec/jnz combo can in theory execute on p06. In fact because the branch is predicted taken it only ever goes to p6, but perhaps that info can't feed into the pressure balancing algorithm, so the counters tend to see equal pressure on p016, meaning that the add and the xor get spread around differently than optimal.
Probably we can test this, by unrolling the loop a bit so the jnz is less of a factor...
1 OK, it is properly written μops, but that kills search-ability and to actually type the "μ" character I'm usually resorting to copy-pasting the character from a webpage.
2 I had originally used imul instead of popcnt in the loop, but, unbelievably, _IACA doesn't support it_!
3 Please note that I'm not suggesting this is a poor design or anything - there are probably very good hardware reasons why the scheduler cannot easily make all its decisions at execution time.

Your questions are tough for a couple of reasons:
The answer depends a lot on the microarchitecture of the processor
which can vary significantly from generation to generation.
These are fine-grained details which Intel doesn't generally release to the public.
Nevertheless, I'll try to answer...
When multiple uops are ready in the reservation station, in what order are they scheduled to ports?
It should be the oldest [see below], but your mileage may vary. The P6 microarchitecture (used in the Pentium Pro, 2 & 3) used a reservation station with five schedulers (one per execution port); the schedulers used a priority pointer as a place to start scanning for ready uops to dispatch. It was only pseudo FIFO so it's entirely possible that the oldest ready instruction was not always scheduled. In the NetBurst microarchitecture (used in Pentium 4), they ditched the unified reservation station and used two uop queues instead. These were proper collapsing priority queues so the schedulers were guaranteed to get the oldest ready instruction. The Core architecture returned to a reservation station and I would hazard an educated guess that they used the collapsing priority queue, but I can't find a source to confirm this. If somebody has a definitive answer, I'm all ears.
When a uop can go to multiple ports (like the add and lea in the example above), how is it decided which port is chosen?
That's tricky to know. The best I could find is a patent from Intel describing such a mechanism. Essentially, they keep a counter for each port that has redundant functional units. When the uops leave the front end to the reservation station, they are assigned a dispatch port. If it has to decide between multiple redundant execution units, the counters are used to distribute the work evenly. Counters are incremented and decremented as uops enter and leave the reservation station respectively.
Naturally this is just a heuristic and does not guarantee a perfect conflict-free schedule, however, I could still see it working with your toy example. The instructions which can only go to one port would ultimately influence the scheduler to dispatch the "less restricted" uops to other ports.
In any case, the presence of a patent doesn't necessarily imply that the idea was adopted (although that said, one of the authors was also a tech lead of the Pentium 4, so who knows?)
If any of the answers involve a concept like oldest to choose among uops, how is it defined? Age since it was delivered to the RS? Age since it became ready? How are ties broken? Does program order ever come into it?
Since uops are inserted into the reservation station in order, oldest here does indeed refer to time it entered the reservation station, i.e. oldest in program order.
By the way, I would take those IACA results with a grain of salt as they may not reflect the nuances of the real hardware. On Haswell, there is a hardware counter called uops_executed_port that can tell you how many cycles in your thread were uops issues to ports 0-7. Maybe you could leverage these to get a better understanding of your program?

Here's what I found on Skylake, coming at it from the angle that uops are assigned to ports at issue time (i.e., when they are issued to the RS), not at dispatch time (i.e., at the moment they are sent to execute). Before I had understood that the port decision was made at dispatch time.
I did a variety of tests which tried to isolate sequences of add operations that can go to p0156 and imul operations which go only to port 0. A typical test goes something like this:
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
... many more mov instructions
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1
add r9, 1
add r8, 1
add ecx, 1
add edx, 1
add r9, 1
add r8, 1
add ecx, 1
add edx, 1
add r9, 1
add r8, 1
add ecx, 1
add edx, 1
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
... many more mov instructions
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
Basically there is a long lead-in of mov eax, [edi] instructions, which only issue on p23 and hence don't clog up the ports used by the instructions (I could have also used nop instructions, but the test would be a bit different since nop don't issue to the RS). This is followed by the "payload" section, here composed of 4 imul and 12 add, and then a lead-out section of more dummy mov instructions.
First, let's take a look at the patent that hayesti linked above, and which he describes the basic idea about: counters for each port that track the total number of uops assigned to the port, which are used to load balance the port assignments. Take a look at this table included in the patent description:
This table is used to pick between p0 or p1 for the 3-uops in an issue group for the 3-wide architecture discussed in the patent. Note that the behavior depends on the position of the uop in the group, and that there are 4 rules1 based on the count, which spread the uops around in a logical way. In particular, the count needs to be at +/- 2 or greater before the whole group gets assigned the under-used port.
Let's see if we can observe the "position in issue group" matters behavior on Sklake. We use a payload of a single add like:
add edx, 1 ; position 0
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
... and we slide it around inside the 4 instruction chuck like:
mov eax, [edi]
add edx, 1 ; position 1
mov eax, [edi]
mov eax, [edi]
... and so on, testing all four positions within the issue group2. This shows the following, when the RS is full (of mov instructions) but with no port pressure of any of the relevant ports:
The first add instructions go to p5 or p6, with the port selected usually alternating as the instruction is slow down (i.e., add instructions in even positions go to p5 and in odd positions go to p6).
The second add instruction also goes to p56 - whichever of the two the first one didn't go to.
After that further add instructions start to be balanced around p0156, with p5 and p6 usually ahead but with things fairly even overall (i.e., the gap between p56 and the other two ports doesn't grow).
Next, I took a look at what happens if load up p1 with imul operations, then first in a bunch of add operations:
imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1
add r9, 1
add r8, 1
add ecx, 1
add edx, 1
add r9, 1
add r8, 1
add ecx, 1
add edx, 1
add r9, 1
add r8, 1
add ecx, 1
add edx, 1
The results show that the scheduler handles this well - all of the imul got to scheduled to p1 (as expected), and then none of the subsequent add instructions went to p1, being spread around p056 instead. So here the scheduling is working well.
Of course, when the situation is reversed, and the series of imul comes after the adds, p1 is loaded up with its share of adds before the imuls hit. That's a result of port assignment happening in-order at issue time, since is no mechanism to "look ahead" and see the imul when scheduling the adds.
Overall the scheduler looks to do a good job in these test cases.
It doesn't explain what happens in smaller, tighter loops like the following:
sub r9, 1
sub r10, 1
imul ebx, edx, 1
dec ecx
jnz top
Just like Example 4 in my question, this loop only fills p0 on ~30% of cycles, despite there being two sub instructions that should be able to go to p0 on every cycle. p1 and p6 are oversubscribed, each executing 1.24 uops for every iteration (1 is ideal). I wasn't able to triangulate the difference between the examples that work well at the top of this answer with the bad loops - but there are still many ideas to try.
I did note that examples without instruction latency differences don't seem to suffer from this issue. For example, here's another 4-uop loop with "complex" port pressure:
top:
sub r8, 1
ror r11, 2
bswap eax
dec ecx
jnz top
The uop map is as follows:
instr p0 p1 p5 p6
sub X X X X
ror X X
bswap X X
dec/jnz X
So the sub must always go to p15, shared with bswap if things are to work out. They do:
Performance counter stats for './sched-test2' (2 runs):
999,709,142 uops_dispatched_port_port_0 ( +- 0.00% )
999,675,324 uops_dispatched_port_port_1 ( +- 0.00% )
999,772,564 uops_dispatched_port_port_5 ( +- 0.00% )
1,000,991,020 uops_dispatched_port_port_6 ( +- 0.00% )
4,000,238,468 uops_issued_any ( +- 0.00% )
5,000,000,117 instructions:u # 4.99 insns per cycle ( +- 0.00% )
1,001,268,722 cycles:u ( +- 0.00% )
So it seems that the issue may be related to instruction latencies (certainly, there are other differences between the examples). That's something that came up in this similar question.
1The table has 5 rules, but the rule for 0 and -1 counts are identical.
2Of course, I can't be sure where the issue groups start and end, but regardless we test four different positions as we slide down four instructions (but the labels could be wrong). I'm also not sure the issue group max size is 4 - earlier parts of the pipeline are wider - but I believe it is and some testing seemed to show it was (loops with a multiple of 4 uops showed consistent scheduling behavior). In any case, the conclusions hold with different scheduling group sizes.

Section 2.12 of Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures[^1] explains how port are assigned, though it fails to explain example 4 in the question description. I also failed to figure out what role Latency plays in the port assignment.
Previous work [19, 25, 26] has identified the ports that the µops of individual instructions can use. For µops that can use more than one port, it was, however, previously unknown how the actual port is chosen by the processor. We reverse-engineered the port assignment algorithm using microbenchmarks. In the following, we describe our findings for CPUs with eight ports; such CPUs are currently most widely used.
The ports are assigned when the µops are issued by the renamer to the scheduler. In a single cycle, up to four µops can be issued. In the following, we will call the position of a µop within a cycle an issue slot; e.g., the oldest instruction issued in a cycle would occupy issue slot 0.
The port that a µop is assigned depends on its issue slot and on the ports assigned to µops that have not been executed and were issued in a previous cycle.
In the following, we will only consider µops that can use more than one port. For a given µop m, let $P_{min}$ be the port to which the fewest non-executed µops have been assigned to from among the ports that m can use. Let $P_{min'}$ be the port with the second smallest usage so far. If there is a tie among the ports with the smallest (or second smallest, respectively) usage, let $P_{min}$ (or $P_{min'}$) be the port with the highest port number from among these ports (the reason for this choice is probably that ports with higher numbers are connected to fewer functional units). If the difference between $P_{min}$ and $P_{min'}$ is greater or equal to 3, we set $P_{min'}$ to $P_{min}$.
The µops in issue slots 0 and 2 are assigned to port $P_{min}$ The µops in issue slots 1 and 3 are assigned to port $P_{min'}$.
A special case is µops that can use port 2 and port 3. These ports are used by µops that handle memory accesses, and both ports are connected to the same types of functional units. For such µops, the port assignment algorithm alternates between port 2 and port 3.
I tried to find out whether $P_{min}$ and $P_{min'}$ are shared between threads (Hyper-Threading), namely whether one thread can affect the port assignment of another one in the same core.
Just split the code used in BeeOnRope's answer into two threads.
thread1:
.loop:
imul rax, rbx, 5
jmp .loop
thread2:
mov esi,1000000000
.top:
bswap eax
dec esi
jnz .top
jmp thread2
Where instructions bswap can be executed on ports 1 and 5, and imul r64, R64, i on port 1. If counters were shared between threads, you would see bswap executed on port 5 and imul executed on port 1.
The experiment was recorded as follows, where ports P0 and P5 on thread 1 and p0 on thread 2 should have recorded a small amount of non-user data, but without hindering the conclusion. It can be seen from the data that the bswap instruction of thread 2 is executed alternately between ports P1 and P5 without giving up P1.
port
thread 1 active cycles
thread 2 active cycles
P0
63,088,967
68,022,708
P1
180,219,013,832
95,742,764,738
P5
63,994,200
96,291,124,547
P6
180,330,835,515
192,048,880,421
total
180,998,504,099
192,774,759,297
Therefore, the counters are not shared between threads.
This conclusion does not conflict with SMotherSpectre[^2], which uses time as the side channel. (For example, thread 2 waits longer on port 1 to use port 1.)
Executing instructions that occupy a specific port and measuring their timing enables inference about other instructions executing on the same port. We first choose two instructions, each scheduled on a single, distinct, execution port. One thread runs and times a long sequence of single µop instructions scheduled on port a, while simultaneously the other thread runs a long sequence of instructions scheduled on port b. We expect that, if a = b, contention occurs and the measured execution time is longer compared to the a ≠ b case.
[^1]: Abel, Andreas, and Jan Reineke. "Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures." arXiv preprint arXiv:2107.14210 (2021).
[^2]: Bhattacharyya, Atri, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. “SMoTherSpectre: Exploiting Speculative Execution through Port Contention.” Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, November 6, 2019, 785–800. https://doi.org/10.1145/3319535.3363194.

Related

Why is my loop much faster when it is contained in one cache line?

When I run this little assembly program on my Ryzen 9 3900X:
_start: xor rax, rax
xor rcx, rcx
loop0: add rax, 1
mov rdx, rax
and rdx, 1
add rcx, rdx
cmp rcx, 1000000000
jne loop0
It completes in 450 ms if all the instructions between loop0 and up to and including the jne, are contained entirely in one cacheline. That is, if:
round((address of loop0)/64) == round((address of end of jne-instruction)/64)
However, if the above condition does not hold, the loop takes 900 ms instead.
I've made a repo with the code https://github.com/avl/strange_performance_repro .
Why is the inner loop much slower in some specific cases?
Edit: Removed a claim with a conclusion from a mistake in testing.

Your issue lies in the variable cost of the jne instruction.
First of all, to understand the impact of the effect, we need to analyze the full loop itself. The architecture of the Ryzen 9 3900X is Zen2. We can retrieve information about this on the AMD website or also WikiChip.
This architecture have 4 ALU and 3 AGU. This roughly means it can execute up to 4 instructions like add/and/cmp per cycle.
Here is the cost of each instruction of the loop (based on the Agner Fog instruction table for Zen1):
# Reciprocal throughput
loop0: add rax, 1 # 0.25
mov rdx, rax # 0.2
and rdx, 1 # 0.25
add rcx, rdx # 0.25
cmp rcx, 1000000000 # 0.25 | Fused 0.5-2 (2 if jumping)
jne loop0 # 0.5-2 |
As you can see, the 4 first computing instructions of the loop can be executed in ~1 cycle. The 2 last instructions can be merged by your processor into a faster one.
Your main issue is that this last jne instruction could be quite slow compared to the rest of the loop. So you are very likely to measure only the overhead of this instruction. From this point, things start to be complicated.
Engineer and researcher worked hard the last decades to reduce the cost of such instructions at (almost) any price. Nowadays, processors (like the Ryzen 9 3900X) use out-of-order instruction scheduling to execute the dependent instructions required by the jne instruction as soon as possible. Most processors can also predict the address of the next instruction to execute after the jne and fetch new instructions (eg. the one of the next loop iteration) while the other instruction of the current loop iteration are being executed.
Performing the fetch as soon as possible is important to prevent any stall in the execution pipeline of the processor (especially in your case).
From the AMD document "Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors", we can read:
2.8.3 Loop Alignment:
For the processor loop alignment is not usually a significant issue. However, for hot loops, some further knowledge of trade-offs can be helpful. Since the processor can read an aligned 64-byte fetch block every cycle, aligning the end of the loop to the last byte of a 64-byte cache line is the best thing to do, if possible.
2.8.1.1 Next Address Logic
The next-address logic determines addresses for instruction fetch. [...]. When branches are identified, the next-address logic is redirected by the branch target and branch direction prediction hardware to generate a non-sequential fetch block address. The processor facilities that are designed to predict the next instruction to be executed following a branch are detailed in the following sections.
Thus, performing a conditional branching to instructions located in another cache line introduces an additional latency overhead due to a fetch of the Op Cache (an instruction cache faster than the L1) not required if the whole loop fit in 1 cache line. Indeed, if a loop is crossing a cache line, 2 line-cache fetches are required, which takes no less than 2 cycles. If the whole loop is fitting in a cache-line, only one 1 line-cache fetch is needed, which take only 1 cycle. As a result, since your loop iterations are very fast, paying 1 additional cycle introduces a significant slowdown. But how much?
You say that the program takes about 450 ms.
Since the Ryzen 9 3900X turbo frequency is 4.6 GHz and your loop is executed 2e9 times, the number cycles per loop iteration is 1.035. Note that this is better than what we can expected before (I guess this processor is able to rename registers, ignore the mov instruction, execute the jne instruction in parallel in only 1 cycle while other instructions of the loop are perfectly pipelined; which is astounding). This also shows that paying an additional fetch overhead of 1 cycle will double the number of cycles needed to execute each loop iteration and so the overall execution time.
If you do not want to pay this overhead, consider unrolling your loop to significantly reduce the number of conditional branches and non-sequential fetches.
This problem can occur on other architectures such as on Intel Skylake. Indeed, the same loop on a i5-9600KF takes 0.70s with loop alignment and 0.90s without (also due to the additional 1 cycle fetch latency). With a 8x unrolling, the result is 0.53s (whatever the alignment).

Why jnz counts no cycle?

I found in online resource that IvyBridge has 3 ALU. So I write a small program to test:
global _start
_start:
mov rcx, 10000000
.for_loop: ; do {
inc rax
inc rbx
dec rcx
jnz .for_loop ; } while (--rcx)
xor rdi, rdi
mov rax, 60 ; _exit(0)
syscall
I compile and run it with perf:
$ nasm -felf64 cycle.asm && ld cycle.o && sudo perf stat ./a.out
The output shows:
10,491,664 cycles
which seems to make sense at the first glance, because there are 3 independent instructions (2 inc and 1 dec) that uses ALU in the loop, so they count 1 cycle together.
But what I don't understand is why the whole loop only has 1 cycle? jnz depends on the result of dec rcx, it should counts 1 cycle, so that the whole loop is 2 cycle. I would expect the output to be close to 20,000,000 cycles.
I also tried to change the second inc from inc rbx to inc rax, which makes it dependent on the first inc. The result does becomes close to 20,000,000 cycles, which shows that dependency will delay an instruction so that they can't run at the same time. So why jnz is special?
What I'm missing here?

First of all, dec/jnz will macro-fuse into a single uop on Intel Sandybridge-family. You could defeat that by putting a non-flag-setting instruction between the dec and jnz.
.for_loop: ; do {
inc rax
dec rcx
lea rbx, [rbx+1] ; doesn't touch flags, defeats macro-fusion
jnz .for_loop ; } while (--rcx)
This will still run at 1 iter per cycle on Haswell and later and Ryzen because they have 4 integer execution ports to keep up with 4 uops per iteration. (Your loop with macro-fusion is only 3 fused-domain uops on Intel CPUs, so SnB/IvB can run it at 1 per clock, too.)
See Agner Fog's optimization guide and especially his microarch guide. Also other links in https://stackoverflow.com/tags/x86/info.
Control dependencies are hidden by branch prediction + speculative execution, unlike data dependencies.
Out-of-order execution and branch prediction + speculative execution hide the "latency" of the control dependency. i.e. the next iteration can start running before the CPU verifies that jnz should really be taken.
So each jnz has an input dependency on the previous dec rcx before it can verify the prediction, but later instructions don't have to wait for it to be checked before they can execute. In-order retirement makes sure that mis-speculation is caught before anything can "see" it happen (except for microarchitectural effects leading to the Spectre attack...)
10M iterations is not a lot. I'd normally use at least 100M for something that runs at only 1c per iter. Having a simple microbenchmark run for 0.1 to 1 second is normally good to get very high precision and hide startup overhead.
And BTW, you don't need sudo perf if you set kernel.perf_event_paranoid = 0 with sysctl. It's almost certainly better to do that than to use sudo all the time.

Why dependency in a loop iteration can't be executed together with the previous one

I use this code to test the impact of dependency in a loop iteration on IvyBridge:
global _start
_start:
mov rcx, 1000000000
.for_loop:
inc rax ; uop A
inc rax ; uop B
dec rcx ; uop C
jnz .for_loop
xor rdi, rdi
mov rax, 60 ; _exit(0)
syscall
Since dec and jnz will be macro-fused to a single uop, there are 3 uops in my loop, they are labeled in the comments.
uop B depends on uop A, so I think the execution would be like this:
A C
B A C ; the previous B and current A can be in the same cycle
B A C
...
B A C
B
Therefore the loop can be executed 1 cycle per iter.
However, the perf tool shows:
2,009,704,779 cycles
1,008,054,984 stalled-cycles-frontend # 50.16% frontend cycles idl
So it's 2 cycle per iter, and there are 50% frontend cycle idle.
What caused the frontend 50% idle? Why the hypothetical execution diagram can't be realized?

B and A form a loop-carried dependency chain. A in the next iteration can't run until it has the result of B in the previous.
Any given B can never run in the same cycle as an A: what input would the later one use, if the earlier one hasn't produced a result yet?
This chain is 2 cycles long (per iteration), because the latency of inc is 1 cycle. This creates a latency bottleneck in the back-end that out-of-order execution can't hide. (Except for very low iteration counts where it can overlap it with code after the loop).
Just like if you fully unrolled a huge chain of times 102400 inc eax, there's no instruction-level parallelism for the CPU to find between a chain of instructions that each depend on the previous.
The macro-fused dec rcx/jnz uop is independent of the RAX chain, and is a shorter chain (only 1 cycle per iteration, being only 1 dec&branch uop with 1c latency). So it can run in parallel with B or A uops.
See my answer on another question for more about the concept of instruction-level parallelism and dependency chains, and how CPUs exploit that parallelism to run instruction in parallel when they're independent.
Agner Fog's microarch PDF shows this with examples in an early chapter: Chapter 2: Out-of-order execution (All processors except P1,
PMMX).
If you started a new 2-cycle dep chain every iteration, it would run as you expect. A new chain forking off every iteration would expose instruction-level parallelism for the CPU to keep A and B from different iterations in flight at the same time.
.for_loop:
xor eax,eax ; dependency-breaking for RAX
inc rax ; uop A
inc rax ; uop B
dec rcx ; uop C
jnz .for_loop
Sandybridge-family handles xor-zeroing without an execution unit, so this is still only 3 unfused-domain uops in the loop, so IvyBridge has enough ALU execution ports to run all 3 in a single cycle. This also maxes out the front-end at 4 fused-domain uops per clock.
Or if you changed A to start a new dep chain in RAX with any instruction that unconditionally overwrites RAX without depending on the result of the inc, you'd be fine.
lea rax, [rdx + rdx] ; no dependency on B from last iter
inc rax ; uop B
Except for a couple instructions with an unfortunate output dependency: Why does breaking the "output dependency" of LZCNT matter?
popcnt rax, rdx ; false dependency on RAX, 3 cycle latency
inc rax ; uop B
On Intel CPUs, only popcnt, and lzcnt/tzcnt have an output dependency for no reason. It's because they use the same execution unit as bsf/bsr, which leave the destination unmodified if the input is zero, on Intel and AMD CPUs. Intel still only documents it on paper as undefined if the input is zero for BSF/BSR, but they build hardware that implements stronger guarantees. (AMD does even document this BSF/BSR behaviour.) Anyway, so Intel's BSF/BSR are like CMOV, and need the destination as an input in case the source reg is 0. popcnt, (and lzcnt/tzcnt on pre-Skylake) suffer from this, too.
If you made the loop more than 5 fused-domain uops, SnB/IvB could issue it at best 1 per 2 cycles from the front-end. Haswell and later "unroll" in the loop buffer or something so a 5 uop loop can run at ~1.25 c per iteration, but SnB/IvB don't. Is performance reduced when executing loops whose uop count is not a multiple of processor width?
The front-end issue/rename stage is 4 fused-domain uops wide in Intel CPUs since Core 2.

Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths

I was playing with the code in this answer, slightly modifying it:
BITS 64
GLOBAL _start
SECTION .text
_start:
mov ecx, 1000000
.loop:
;T is a symbol defined with the CLI (-DT=...)
TIMES T imul eax, eax
lfence
TIMES T imul edx, edx
dec ecx
jnz .loop
mov eax, 60 ;sys_exit
xor edi, edi
syscall
Without the lfence I the results I get are consistent with the static analysis in that answer.
When I introduce a single lfence I'd expect the CPU to execute the imul edx, edx sequence of the k-th iteration in parallel with the imul eax, eax sequence of the next (k+1-th) iteration.
Something like this (calling A the imul eax, eax sequence and D the imul edx, edx one):
|
| A
| D A
| D A
| D A
| ...
| D A
| D
|
V time
Taking more or less the same number of cycles but for one unpaired parallel execution.
When I measure the number of cycles, for the original and modified version, with taskset -c 2 ocperf.py stat -r 5 -e cycles:u '-x ' ./main-$T for T in the range below I get
T Cycles:u Cycles:u Delta
lfence no lfence
10 42047564 30039060 12008504
15 58561018 45058832 13502186
20 75096403 60078056 15018347
25 91397069 75116661 16280408
30 108032041 90103844 17928197
35 124663013 105155678 19507335
40 140145764 120146110 19999654
45 156721111 135158434 21562677
50 172001996 150181473 21820523
55 191229173 165196260 26032913
60 221881438 180170249 41711189
65 250983063 195306576 55676487
70 281102683 210255704 70846979
75 312319626 225314892 87004734
80 339836648 240320162 99516486
85 372344426 255358484 116985942
90 401630332 270320076 131310256
95 431465386 285955731 145509655
100 460786274 305050719 155735555
How can the values of Cycles:u lfence be explained?
I would have expected them to be similar to those of Cycles:u no lfence since a single lfence should prevent only the first iteration from being executed in parallel for the two blocks.
I don't think it's due to the lfence overhead as I believe that should be constant for all Ts.
I'd like to fix what's wrong with my forma mentis when dealing with the static analysis of code.
Supporting repository with source files.

I think you're measuring accurately, and the explanation is microarchitectural, not any kind of measurement error.
I think your results for mid to low T support the conclusion that lfence stops the front-end from even issuing past the lfence until all earlier instructions retire, rather than having all the uops from both chains already issued and just waiting for lfence to flip a switch and let multiplies from each chain start to dispatch on alternating cycles.
(port1 would get edx,eax,empty,edx,eax,empty,... for Skylake's 3c latency / 1c throughput multiplier right away, if lfence didn't block the front-end, and overhead wouldn't scale with T.)
You're losing imul throughput when only uops from the first chain are in the scheduler because the front-end hasn't chewed through the imul edx,edx and loop branch yet. And for the same number of cycles at the end of the window when the pipeline is mostly drained and only uops from the 2nd chain are left.
The overhead delta looks linear up to about T=60. I didn't run the numbers, but the slope up to there looks reasonable for T * 0.25 clocks to issue the first chain vs. 3c-latency execution bottleneck. i.e. delta growing maybe 1/12th as fast as total no-lfence cycles.
So (given the lfence overhead I measured below), with T<60:
no_lfence cycles/iter ~= 3T # OoO exec finds all the parallelism
lfence cycles/iter ~= 3T + T/4 + 9.3 # lfence constant + front-end delay
delta ~= T/4 + 9.3
#Margaret reports that T/4 is a better fit than 2*T / 4, but I would have expected T/4 at both the start and end, for a total of 2T/4 slope of the delta.
After about T=60, delta grows much more quickly (but still linearly), with a slope about equal to the total no-lfence cycles, thus about 3c per T. I think at that point, the scheduler (Reservation Station) size is limiting the out-of-order window. You probably tested on a Haswell or Sandybridge/IvyBridge, (which have a 60-entry or 54-entry scheduler respectively. Skylake's is 97 entry (but not fully unified; IIRC BeeOnRope's testing showed that not all the entries could be used for any type of uop. Some were specific to load and/or store, for example.)
The RS tracks un-executed uops. Each RS entry holds 1 unfused-domain uop that's waiting for its inputs to be ready, and its execution port, before it can dispatch and leave the RS1.
After an lfence, the front-end issues at 4 per clock while the back-end executes at 1 per 3 clocks, issuing 60 uops in ~15 cycles, during which time only 5 imul instructions from the edx chain have executed. (There's no load or store micro-fusion here, so every fused-domain uop from the front-end is still only 1 unfused-domain uop in the RS2.)
For large T the RS quickly fills up, at which point the front-end can only make progress at the speed of the back-end. (For small T, we hit the next iteration's lfence before that happens, and that's what stalls the front-end). When T > RS_size, the back-end can't see any of the uops from the eax imul chain until enough back-end progress through the edx chain has made room in the RS. At that point, one imul from each chain can dispatch every 3 cycles, instead of just the 1st or 2nd chain.
Remember from the first section that time spent just after lfence only executing the first chain = time just before lfence executing only the second chain. That applies here, too.
We get some of this effect even with no lfence, for T > RS_size, but there's opportunity for overlap on both sides of a long chain. The ROB is at least twice the size of the RS, so the out-of-order window when not stalled by lfence should be able to keep both chains in flight constantly even when T is somewhat larger than the scheduler capacity. (Remember that uops leave the RS as soon as they've executed. I'm not sure if that means they have to finish executing and forward their result, or merely start executing, but that's a minor difference here for short ALU instructions. Once they're done, only the ROB is holding onto them until they retire, in program order.)
The ROB and register-file shouldn't be limiting the out-of-order window size (http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/) in this hypothetical situation, or in your real situation. They should both be plenty big.
Blocking the front-end is an implementation detail of lfence on Intel's uarches. The manual only says that later instructions can't execute. That wording would allow the front-end to issue/rename them all into the scheduler (Reservation Station) and ROB while lfence is still waiting, as long as none are dispatched to an execution unit.
So a weaker lfence would maybe have flat overhead up to T=RS_size, then the same slope as you see now for T>60. (And the constant part of the overhead might be lower.)
Note that guarantees about speculative execution of conditional/indirect branches after lfence apply to execution, not (as far as I know) to code-fetch. Merely triggering code-fetch is not (AFAIK) useful to a Spectre or Meltdown attack. Possibly a timing side-channel to detect how it decodes could tell you something about the fetched code...
I think AMD's LFENCE is at least as strong on actual AMD CPUs, when the relevant MSR is enabled. (Is LFENCE serializing on AMD processors?).
Extra lfence overhead:
Your results are interesting, but it doesn't surprise me at all that there's significant constant overhead from lfence itself (for small T), as well as the component that scales with T.
Remember that lfence doesn't allow later instructions to start until earlier instructions have retired. This is probably at least a couple cycles / pipeline-stages later than when their results are ready for bypass-fowarding to other execution units (i.e. the normal latency).
So for small T, it's definitely significant that you add extra latency into the chain by requiring the result to not only be ready, but also written back to the register file.
It probably takes an extra cycle or so for lfence to allow the issue/rename stage to start operating again after detecting retirement of the last instruction before it. The issue/rename process takes multiple stages (cycles), and maybe lfence blocks at the start of this, instead of in the very last step before uops are added into the OoO part of the core.
Even back-to-back lfence itself has 4 cycle throughput on SnB-family, according to Agner Fog's testing. Agner Fog reports 2 fused-domain uops (no unfused), but on Skylake I measure it at 6 fused-domain (still no unfused) if I only have 1 lfence. But with more lfence back-to-back, it's fewer uops! Down to ~2 uops per lfence with many back-to-back, which is how Agner measures.
lfence/dec/jnz (a tight loop with no work) runs at 1 iteration per ~10 cycles on SKL, so that might give us an idea of the real extra latency that lfence adds to the dep chains even without the front-end and RS-full bottlenecks.
Measuring lfence overhead with only one dep chain, OoO exec being irrelevant:
.loop:
;mfence ; mfence here: ~62.3c (with no lfence)
lfence ; lfence here: ~39.3c
times 10 imul eax,eax ; with no lfence: 30.0c
; lfence ; lfence here: ~39.6c
dec ecx
jnz .loop
Without lfence, runs at the expected 30.0c per iter. With lfence, runs at ~39.3c per iter, so lfence effectively added ~9.3c of "extra latency" to the critical path dep chain. (And 6 extra fused-domain uops).
With lfence after the imul chain, right before the loop-branch, it's slightly slower. But not a whole cycle slower, so that would indicate that the front-end is issuing the loop-branch + and imul in a single issue-group after lfence allows execution to resume. That being the case, IDK why it's slower. It's not from branch misses.
Getting the behaviour you were expecting:
Interleave the chains in program order, like #BeeOnRope suggests in comments, doesn't require out-of-order execution to exploit the ILP, so it's pretty trivial:
.loop:
lfence ; at the top of the loop is the lowest-overhead place.
%rep T
imul eax,eax
imul edx,edx
%endrep
dec ecx
jnz .loop
You could put pairs of short times 8 imul chains inside a %rep to let OoO exec have an easy time.
Footnote 1: How the front-end / RS / ROB interact
My mental model is that the issue/rename/allocate stages in the front-end add new uops to both the RS and the ROB at the same time.
Uops leave the RS after executing, but stay in the ROB until in-order retirement. The ROB can be large because it's never scanned out-of-order to find the first-ready uop, only scanned in-order to check if the oldest uop(s) have finished executing and thus are ready to retire.
(I assume the ROB is physically a circular buffer with start/end indices, not a queue which actually copies uops to the right every cycle. But just think of it as a queue / list with a fixed max size, where the front-end adds uops at the front, and the retirement logic retires/commits uops from the end as long as they're fully executed, up to some per-cycle per-hyperthread retirement limit which is not usually a bottleneck. Skylake did increase it for better Hyperthreading, maybe to 8 per clock per logical thread. Perhaps retirement also means freeing physical registers which helps HT, because the ROB itself is statically partitioned when both threads are active. That's why retirement limits are per logical thread.)
Uops like nop, xor eax,eax, or lfence, which are handled in the front-end (don't need any execution units on any ports) are added only to the ROB, in an already-executed state. (A ROB entry presumably has a bit that marks it as ready to retire vs. still waiting for execution to complete. This is the state I'm talking about. For uops that did need an execution port, I assume the ROB bit is set via a completion port from the execution unit. And that the same completion-port signal frees its RS entry.)
Uops stay in the ROB from issue to retirement.
Uops stay in the RS from issue to execution. The RS can replay uops in a few cases, e.g. for the other half of a cache-line-split load, or if it was dispatched in anticipation of load data arriving, but in fact it didn't. (Cache miss or other conflicts like Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?) Or when a load port speculates that it can bypass the AGU before starting a TLB lookup to shorten pointer-chasing latency with small offsets - Is there a penalty when base+offset is in a different page than the base?
So we know that the RS can't remove a uop right as it dispatches, because it might need to be replayed. (Can happen even to non-load uops that consume load data.) But any speculation that needs replays is short-range, not through a chain of uops, so once a result comes out the other end of an execution unit, the uop can be removed from the RS. Probably this is part of what a completion port does, along with putting the result on the bypass forwarding network.
Footnote 2: How many RS entries does a micro-fused uop take?
TL:DR: P6-family: RS is fused, SnB-family: RS is unfused.
A micro-fused uop is issued to two separate RS entries in Sandybridge-family, but only 1 ROB entry. (Assuming it isn't un-laminated before issue, see section 2.3.5 for HSW or section 2.4.2.4 for SnB of Intel's optimization manual, and Micro fusion and addressing modes. Sandybridge-family's more compact uop format can't represent indexed addressing modes in the ROB in all cases.)
The load can dispatch independently, ahead of the other operand for the ALU uop being ready. (Or for micro-fused stores, either of the store-address or store-data uops can dispatch when its input is ready, without waiting for both.)
I used the two-dep-chain method from the question to experimentally test this on Skylake (RS size = 97), with micro-fused or edi, [rdi] vs. mov+or, and another dep chain in rsi. (Full test code, NASM syntax on Godbolt)
; loop body
%rep T
%if FUSE
or edi, [rdi] ; static buffers are in the low 32 bits of address space, in non-PIE
%else
mov eax, [rdi]
or edi, eax
%endif
%endrep
%rep T
%if FUSE
or esi, [rsi]
%else
mov eax, [rsi]
or esi, eax
%endif
%endrep
Looking at uops_executed.thread (unfused-domain) per cycle (or per second which perf calculates for us), we can see a throughput number that doesn't depend on separate vs. folded loads.
With small T (T=30), all the ILP can be exploited, and we get ~0.67 uops per clock with or without micro-fusion. (I'm ignoring the small bias of 1 extra uop per loop iteration from dec/jnz. It's negligible compared to the effect we'd see if micro-fused uops only used 1 RS entry)
Remember that load+or is 2 uops, and we have 2 dep chains in flight, so this is 4/6, because or edi, [rdi] has 6 cycle latency. (Not 5, which is surprising, see below.)
At T=60, we still have about 0.66 unfused uops executed per clock for FUSE=0, and 0.64 for FUSE=1. We can still find basically all the ILP, but it's just barely starting to dip, as the two dep chains are 120 uops long (vs. a RS size of 97).
At T=120, we have 0.45 unfused uops per clock for FUSE=0, and 0.44 for FUSE=1. We're definitely past the knee here, but still finding some of the ILP.
If a micro-fused uop took only 1 RS entry, FUSE=1 T=120 should be about the same speed as FUSE=0 T=60, but that's not the case. Instead, FUSE=0 or 1 makes nearly no difference at any T. (Including larger ones like T=200: FUSE=0: 0.395 uops/clock, FUSE=1: 0.391 uops/clock). We'd have to go to very large T before we start for the time with 1 dep-chain in flight to totally dominate the time with 2 in flight, and get down to 0.33 uops / clock (2/6).
Oddity: We have such a small but still measurable difference in throughput for fused vs. unfused, with separate mov loads being faster.
Other oddities: the total uops_executed.thread is slightly lower for FUSE=0 at any given T. Like 2,418,826,591 vs. 2,419,020,155 for T=60. This difference was repeatable down to +- 60k out of 2.4G, plenty precise enough. FUSE=1 is slower in total clock cycles, but most of the difference comes from lower uops per clock, not from more uops.
Simple addressing modes like [rdi] are supposed to only have 4 cycle latency, so load + ALU should be only 5 cycle. But I measure 6 cycle latency for the load-use latency of or rdi, [rdi], or with a separate MOV-load, or with any other ALU instruction I can never get the load part to be 4c.
A complex addressing mode like [rdi + rbx + 2064] has the same latency when there's an ALU instruction in the dep chain, so it appears that Intel's 4c latency for simple addressing modes only applies when a load is forwarding to the base register of another load (with up to a +0..2047 displacement and no index).
Pointer-chasing is common enough that this is a useful optimization, but we need to think of it as a special load-load forwarding fast-path, not as a general data ready sooner for use by ALU instructions.
P6-family is different: an RS entry holds a fused-domain uop.
#Hadi found an Intel patent from 2002, where Figure 12 shows the RS in the fused domain.
Experimental testing on a Conroe (first gen Core2Duo, E6600) shows that there's a large difference between FUSE=0 and FUSE=1 for T=50. (The RS size is 32 entries).
T=50 FUSE=1: total time of 2.346G cycles (0.44IPC)
T=50 FUSE=0: total time of 3.272G cycles (0.62IPC = 0.31 load+OR per clock). (perf / ocperf.py doesn't have events for uops_executed on uarches before Nehalem or so, and I don't have oprofile installed on that machine.)
T=24 there's a negligible difference between FUSE=0 and FUSE=1, around 0.47 IPC vs 0.9 IPC (~0.45 load+OR per clock).
T=24 is still over 96 bytes of code in the loop, too big for Core 2's 64-byte (pre-decode) loop buffer, so it's not faster because of fitting in a loop buffer. Without a uop-cache, we have to be worried about the front-end, but I think we're fine because I'm exclusively using 2-byte single-uop instructions that should easily decode at 4 fused-domain uops per clock.

I'll present an analysis for the case where T = 1 for both codes (with and without lfence). You can then extend this for other values of T. You can refer to Figure 2.4 of the Intel Optimization Manual for a visual.
Because there is only a single easily predicted branch, the frontend will only stall if the backend stalled. The frontend is 4-wide in Haswell, which means up to 4 fused uops can be issued from the IDQ (instruction decode queue, which is just a queue that holds in-order fused-domain uops, also called the uop queue) to the reservation station (RS) entires of the scheduler. Each imul is decoded into a single uop that cannot be fused. The instructions dec ecx and jnz .loop get macrofused in the frontend to a single uop. One of the differences between microfusion and macrofusion is that when the scheduler dispatches a macrofused uop (that are not microfused) to the execution unit it's assigned to, it gets dispatched as a single uop. In contrast, a microfused uop needs to be split into its constituent uops, each of which must be separately dispatched to an execution unit. (However, splitting microfused uops happens on entrance to the RS, not on dispatch, see Footnote 2 in #Peter's answer). lfence is decoded into 6 uops. Recognizing microfusion only matters in the backend, and in this case, there is no microfusion in the loop.
Since the loop branch is easily predictable and since the number of iterations is relatively large, we can just assume without compromising accuracy that the allocator will always be able to allocate 4 uops per cycle. In other words, the scheduler will receive 4 uops per cycle. Since there is no micorfusion, each uop will be dispatched as a single uop.
imul can only be executed by the Slow Int execution unit (see Figure 2.4). This means that the only choice for executing the imul uops is to dispatch them to port 1. In Haswell, the Slow Int is nicely pipelined so that a single imul can be dispatched per cycle. But it takes three cycles for the result of the multiplication be available for any instruction that requires (the writeback stage is the third cycle from the dispatch stage of the pipeline). So for each dependence chain, at most one imul can be dispatched per 3 cycles.
Becausedec/jnz is predicted taken, the only execution unit that can execute it is Primary Branch on port 6.
So at any given cycle, as long as the RS has space, it will receive 4 uops. But what kind of uops? Let's examine the loop without lfence:
imul eax, eax
imul edx, edx
dec ecx/jnz .loop (macrofused)
There are two possibilities:
Two imuls from the same iteration, one imul from a neighboring iteration, and one dec/jnz from one of those two iterations.
One dec/jnz from one iteration, two imuls from the next iteration, and one dec/jnz from the same iteration.
So at the beginning of any cycle, the RS will receive at least one dec/jnz and at least one imul from each chain. At the same time, in the same cycle and from those uops that are already there in the RS, the scheduler will do one of two actions:
Dispatch the oldest dec/jnz to port 6 and dispatch the oldest imul that is ready to port 1. That's a total of 2 uops.
Because the Slow Int has a latency of 3 cycles but there are only two chains, for each cycle of 3 cycles, no imul in the RS will be ready for execution. However, there is always at least one dec/jnz in the RS. So the scheduler can dispatch that. That's a total of 1 uop.
Now we can calculate the expected number of uops in the RS, XN, at the end of any given cycle N:
XN = XN-1 + (the number of uops to be allocated in the RS at the beginning of cycle N) - (the expected number of uops that will be dispatched at the beginning of cycle N)
= XN-1 + 4 - ((0+1)*1/3 + (1+1)*2/3)
= XN-1 + 12/3 - 5/3
= XN-1 + 7/3 for all N > 0
The initial condition for the recurrence is X0 = 4. This is a simple recurrence that can be solved by unfolding XN-1.
XN = 4 + 2.3 * N for all N >= 0
The RS in Haswell has 60 entries. We can determine the first cycle in which the RS is expected to become full:
60 = 4 + 7/3 * N
N = 56/2.3 = 24.3
So at the end of cycle 24.3, the RS is expected to be full. This means that at the beginning of cycle 25.3, the RS will not be able to receive any new uops. Now the number of iterations, I, under consideration determines how you should proceed with the analysis. Since a dependency chain will require at least 3*I cycles to execute, it takes about 8.1 iterations to reach cycle 24.3. So if the number of iterations is larger than 8.1, which is the case here, you need to analyze what happens after cycle 24.3.
The scheduler dispatches instructions at the following rates every cycle (as discussed above):
1
2
2
1
2
2
1
2
.
.
But the allocator will not allocate any uops in the RS unless there are at least 4 available entries. Otherwise, it will not waste power on issuing uops at a sub-optimal throughput. However, it is only at the beginning of every 4th cycle are there at least 4 free entries in the RS. So starting from cycle 24.3, the allocator is expected to get stalled 3 out of every 4 cycles.
Another important observation for the code being analyzed is that it never happens that there are more than 4 uops that can be dispatched, which means that the average number of uops that leave their execution units per cycle is not larger than 4. At most 4 uops can be retired from the ReOrder Buffer (ROB). This means that the ROB can never be on the critical path. In other words, performance is determined by the dispatch throughput.
We can calculate the IPC (instructions per cycles) fairly easily now. The ROB entries look something like this:
imul eax, eax - N
imul edx, edx - N + 1
dec ecx/jnz .loop - M
imul eax, eax - N + 3
imul edx, edx - N + 4
dec ecx/jnz .loop - M + 1
The column to the right shows the cycles in which the instruction can be retired. Retirement happens in order and is bounded by the latency of the critical path. Here each dependency chain have the same path length and so both constitute two equal critical paths of length 3 cycles. So every 3 cycles, 4 instructions can be retired. So the IPC is 4/3 = 1.3 and the CPI is 3/4 = 0.75. This is much smaller than the theoretical optimal IPC of 4 (even without considering micro- and macro-fusion). Because retirement happens in-order, the retirement behavior will be the same.
We can check our analysis using both perf and IACA. I'll discuss perf. I've a Haswell CPU.
perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-nolfence
Performance counter stats for './main-1-nolfence' (10 runs):
30,01,556 cycles:u ( +- 0.00% )
40,00,005 instructions:u # 1.33 insns per cycle ( +- 0.00% )
0 RESOURCE_STALLS.ROB
23,42,246 UOPS_ISSUED.ANY ( +- 0.26% )
22,49,892 RESOURCE_STALLS.RS ( +- 0.00% )
0.001061681 seconds time elapsed ( +- 0.48% )
There are 1 million iterations each takes about 3 cycles. Each iteration contains 4 instructions and the IPC is 1.33.RESOURCE_STALLS.ROB shows the number of cycles in which the allocator was stalled due to a full ROB. This of course never happens. UOPS_ISSUED.ANY can be used to count the number of uops issued to the RS and the number of cycles in which the allocator was stalled (no specific reason). The first is straightforward (not shown in the perf output); 1 million * 3 = 3 million + small noise. The latter is much more interesting. It shows that about 73% of all time the allocator stalled due to a full RS, which matches our analysis. RESOURCE_STALLS.RS counts the number of cycles in which the allocator was stalled due to a full RS. This is close to UOPS_ISSUED.ANY because the allocator does not stall for any other reason (although the difference could be proportional to the number of iterations for some reason, I'll have to see the results for T>1).
The analysis of the code without lfence can be extended to determine what happens if an lfence was added between the two imuls. Let's check out the perf results first (IACA unfortunately does not support lfence):
perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-lfence
Performance counter stats for './main-1-lfence' (10 runs):
1,32,55,451 cycles:u ( +- 0.01% )
50,00,007 instructions:u # 0.38 insns per cycle ( +- 0.00% )
0 RESOURCE_STALLS.ROB
1,03,84,640 UOPS_ISSUED.ANY ( +- 0.04% )
0 RESOURCE_STALLS.RS
0.004163500 seconds time elapsed ( +- 0.41% )
Observe that the number of cycles has increased by about 10 million, or 10 cycles per iteration. The number of cycles does not tell us much. The number of retired instruction has increased by a million, which is expected. We already know that the lfence will not make instruction complete any faster, so RESOURCE_STALLS.ROB should not change. UOPS_ISSUED.ANY and RESOURCE_STALLS.RS are particularly interesting. In this output, UOPS_ISSUED.ANY counts cycles, not uops. The number of uops can also be counted (using cpu/event=0x0E,umask=0x1,name=UOPS_ISSUED.ANY/u instead of cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u) and has increased by 6 uops per iteration (no fusion). This means that an lfence that was placed between two imuls was decoded into 6 uops. The one million dollar question is now what these uops do and how they move around in the pipe.
RESOURCE_STALLS.RS is zero. What does that mean? This indicates that the allocator, when it sees an lfence in the IDQ, it stops allocating until all current uops in the ROB retire. In other words, the allocator will not allocate entries in the RS past an lfence until the lfence retires. Since the loop body contains only 3 other uops, the 60-entry RS will never be full. In fact, it will be always almost empty.
The IDQ in reality is not a single simple queue. It consists of multiple hardware structures that can operate in parallel. The number of uops an lfence requires depends on the exact design of the IDQ. The allocator, which also consists of many different hardware structures, when it see there is an lfence uops at the front of any of the structures of the IDQ, it suspends allocation from that structure until the ROB is empty. So different uops are usd with different hardware structures.
UOPS_ISSUED.ANY shows that the allocator is not issuing any uops for about 9-10 cycles per iteration. What is happening here? Well, one of the uses of lfence is that it can tell us how much time it takes to retire an instruction and allocate the next instruction. The following assembly code can be used to do that:
TIMES T lfence
The performance event counters will not work well for small values of T. For sufficiently large T, and by measuring UOPS_ISSUED.ANY, we can determine that it takes about 4 cycles to retire each lfence. That's because UOPS_ISSUED.ANY will be incremented about 4 times every 5 cycles. So after every 4 cycles, the allocator issues another lfence (it doesn't stall), then it waits for another 4 cycles, and so on. That said, instructions that produce results may require 1 or few more cycle to retire depending on the instruction. IACA always assume that it takes 5 cycles to retire an instruction.
Our loop looks like this:
imul eax, eax
lfence
imul edx, edx
dec ecx
jnz .loop
At any cycle at the lfence boundary, the ROB will contain the following instructions starting from the top of the ROB (the oldest instruction):
imul edx, edx - N
dec ecx/jnz .loop - N
imul eax, eax - N+1
Where N denotes the cycle number at which the corresponding instruction was dispatched. The last instruction that is going to complete (reach the writeback stage) is imul eax, eax. and this happens at cycle N+4. The allocator stall cycle count will be incremented during cycles, N+1, N+2, N+3, and N+4. However it will about 5 more cycles until imul eax, eax retires. In addition, after it retires, the allocator needs to clean up the lfence uops from the IDQ and allocate the next group of instructions before they can be dispatched in the next cycle. The perf output tells us that it takes about 13 cycles per iteration and that the allocator stalls (because of the lfence) for 10 out of these 13 cycles.
The graph from the question shows only the number of cycles for up to T=100. However, there is another (final) knee at this point. So it would be better to plot the cycles for up to T=120 to see the full pattern.

Why isn't MOVNTI slower, in a loop storing repeatedly to the same address?

section .text
%define n 100000
_start:
xor rcx, rcx
jmp .cond
.begin:
movnti [array], eax
.cond:
add rcx, 1
cmp rcx, n
jl .begin
section .data
array times 81920 db "A"
According to perf it runs at 1.82 instructions per cycle. I cannot understand why it's so fast. After all, it has to be stored in memory (RAM) so it should be slow.
P.S Is there any loop-carried-dependency?
EDIT
section .text
%define n 100000
_start:
xor rcx, rcx
jmp .cond
.begin:
movnti [array+rcx], eax
.cond:
add rcx, 1
cmp rcx, n
jl .begin
section .data
array times n dq 0
Now, the iteration take 5 cycle per iteration. Why? After all, there is still no loop-carried-dependency.

movnti can apparently sustain a throughput of one per clock when writing to the same address repeatedly.
I think movnti keeps writing into the same fill buffer, and it's not getting flushed very often because there are no other loads or stores happening. (That link is about copying from WC video memory with SSE4.1 NT loads, as well as storing to normal memory with NT stores.)
So the NT write-combining fill-buffer acts like a cache for multiple overlapping NT stores to the same address, and writes are actually hitting in the fill buffer instead of going to DRAM each time.
DDR DRAM only supports burst-transfer commands. If every movnti produced a 4B write that actually was visible to the memory chips, there'd be no way it could run that fast. The memory controller either has to read/modify/write, or do an interrupted burst transfer, since there is no non-burst write command. See also Ulrich Drepper's What Every Programmer Should Know About Memory.
We can further prove this is the case by running the test on multiple cores at once. Since they don't slow each other down at all, we can be sure that the writes are only infrequently making it out of the CPU cores and competing for memory cycles.
The reason your experiment doesn't show your loop running at 4 instruction per clock (one cycle per iteration) is that you used such a tiny repeat count. 100k cycles barely accounts for the startup overhead (which perf's timing includes).
For example, on a Core2 E6600 (Merom/Conroe) with dual channel DDR2 533MHz, the total time including all process startup / exit stuff is 0.113846 ms. That's only 266,007 cycles.
A more reasonable microbenchmark shows one iteration (one movnti) per cycle:
global _start
_start:
xor ecx,ecx
.begin:
movnti [array], eax
dec ecx
jnz .begin ; 2^32 iterations
mov eax, 60 ; __NR_exit
xor edi,edi
syscall ; exit(0)
section .bss
array resb 81920
(asm-link is a script I wrote)
$ asm-link movnti-same-address.asm
+ yasm -felf64 -Worphan-labels -gdwarf2 movnti-same-address.asm
+ ld -o movnti-same-address movnti-same-address.o
$ perf stat -e task-clock,cycles,instructions ./movnti-same-address
Performance counter stats for './movnti-same-address':
1835.056710 task-clock (msec) # 0.995 CPUs utilized
4,398,731,563 cycles # 2.397 GHz
12,891,491,495 instructions # 2.93 insns per cycle
1.843642514 seconds time elapsed
Running in parallel:
$ time ./movnti-same-address; time ./movnti-same-address & time ./movnti-same-address &
real 0m1.844s / user 0m1.828s # running alone
[1] 12523
[2] 12524
peter#tesla:~/src/SO$
real 0m1.855s / user 0m1.824s # running together
real 0m1.984s / user 0m1.808s
# output compacted by hand to save space
I expect perfect SMP scaling (except with hyperthreading), up to any number of cores. e.g. on a 10-core Xeon, 10 copies of this test could run at the same time (on separate physical cores), and each one would finish in the same time as if it was running alone. (Single-core turbo vs. multi-core turbo will also be a factor, though, if you measure wall-clock time instead of cycle counts.)
zx485's uop count nicely explains why the loop isn't bottlenecked by the frontend or unfused-domain execution resources.
However, this disproves his theory about the ratio of CPU to memory clocks having anything to do with it. Interesting coincidence, though, that the OP chose a count that happened to make the final total IPC work out that way.
P.S Is there any loop-carried-dependency?
Yes, the loop counter. (1 cycle). BTW, you could have saved an insn by counting down towards zero with dec / jg instead of counting up and having to use a cmp.
The write-after-write memory dependency isn't a "true" dependency in the normal sense, but it is something the CPU has to keep track of. The CPU doesn't "notice" that the same value is written repeatedly, so it has to make sure the last write is the one that "counts".
This is called an architectural hazard. I think the term still applies when talking about memory, rather than registers.

The result is plausible. Your loop code consists of the follwing instuctions. According to Agner Fog's instruction tables, these have the following timings:
Instruction regs fused unfused ports Latency Reciprocal Throughput
---------------------------------------------------------------------------------------------------------------------------
MOVNTI m,r 2 2 p23 p4 ~400 1
ADD r,r/i 1 1 p0156 1 0.25
CMP r,r/i 1 1 p0156 1 0.25
Jcc short 1 1 p6 1 1-2 if predicted that the jump is taken
Fused CMP+Jcc short 1 1 p6 1 1-2 if predicted that the jump is taken
So
MOVNTI consumes 2 uOps, 1 in port 2 or 3 and one in port 4
ADD consumes 1 uOps in port 0 or 1 or 5 or 6
CMP and Jcc macro-fuse to the last line in the table resulting in a consumption of 1 uOp
Because neither ADD nor CMP+Jcc depend on the result of MOVNTI they can be executed (nearly) in parallel on recent architectures, for example using the ports 1,2,4,6. The worst case would be a latency of 1 between ADD and CMP+Jcc.
This is most likely a design error in your code: you're essentially writing to the same address [array] a 100000 times, because you do not adjust the address.
The repeated writes can even go to the L1-cache under the condition that
The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region.
but it doesn't look like this and won't make for a great difference, anyway, because even if writing to memory, the memory speed will be the limiting factor.
For example, if you have a 3GHz CPU and 1600MHz DDR3-RAM this will result in 3/1.6 = 1.875 CPU cycles per memory cycle. This seems plausible.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio