What's the meaning of "under" in phrase "hit-under-miss"?

What's the meaning of "under" in phrase "hit-under-miss"? - caching

ARM website contains an explanation of a feature called "HUM" (hit-under-missing).
It seems that "under" can be interpreted as "after" or "follow", meaning that the previous access is a miss, and the subsequence access is a hit. Is this understanding correct? And if, I am wondering is there special context for using the word "under" (i.e., use spatial-relationship instead of time-relationship).

The word "under", in this case, probably means "during" or "while", as in "hit during (under the circumstances of recovering from a previous) miss".
The intent is to let the load/store unit continue accessing memory freely while it waits for a line fill (transfer from cache-capable memory to the cache) to recover from a previous miss. There is no need to stall while there are no subsequent misses:
Advancing time =======================>
hit hit miss/line-fill ------> carry-on
hit hit hit hit >
However, a stall will happen if a miss occurs while a line-fill is currently being performed. At that point, no further action will take place until the first line-fill completes, at which point the stall will "un-stall" and the next line-fill will start (alongside any non-misses):
Advancing time ================================================>
miss/line-fill --------> carry-on
hit miss/stall ...> line-fill ----> carry on
miss/stall ...> line-fill ---> carry on
hit hit ----->
Perhaps a more graphical way of viewing this is in the following simplified block diagram:
+--------+
+-> | dcache | <---+ linefile
| +--------+ | (via amba)
V +----------+
+------------+ | cachable |
| load/store | | memory |
| unit | +----------+
+------------+
^ +--------------+
| | non-cachable |
+-------> | memory |
+--------------+
Once a line-fill starts, there is no stall provided the load/store unit only reads:
cachable memory that's already in the dcache; or
non-cachable memory.

Prior to ARMv6, the whole pipeline stalled if a cache-miss occured.
HUM means that independent instructions keep being executed even if a cache-miss occurs.
1:
ldr r0, [r10], #4
lsl r1, r1, #2
subs r9, r9, #1
add r1, r0, r1
bgt 1b
no HUM: the next instruction lsl gets executed AFTER the cache-miss penalty (dozens of cycles)
HUM: the first dependent instructionadd has to wait. Prior instructions lsl and subs get executed regardlessly.

Related

Why does using MFENCE with store instruction block prefetching in L1 cache?

I have an object of 64 byte in size:
typedef struct _object{
int value;
char pad[60];
} object;
in main I am initializing array of object:
volatile object * array;
int arr_size = 1000000;
array = (object *) malloc(arr_size * sizeof(object));
for(int i=0; i < arr_size; i++){
array[i].value = 1;
_mm_clflush(&array[i]);
}
_mm_mfence();
Then loop again through each element. This is the loop I am counting events for:
int tmp;
for(int i=0; i < arr_size-105; i++){
array[i].value = 2;
//tmp = array[i].value;
_mm_mfence();
}
having mfence does not make any sense here but I was tying something else and accidentally found that if I have store operation, without mfence I get half million of RFO requests (measured by papi L2_RQSTS.ALL_RFO event), which means that another half million was L1 hit, prefetched before demand. However including mfence results in 1 million RFO requests, giving RFO_HITs, that means that cache line is only prefetched in L2, not in L1 cache anymore.
Besides the fact that Intel documentation somehow indicates otherwise: "data can be brought into the caches speculatively just before, during, or after the execution of an MFENCE instruction." I checked with load operations. without mfence I get up to 2000 L1 hit, whereas with mfence, I have up to 1 million L1 hit (measured with papi MEM_LOAD_RETIRED.L1_HIT event). The cache lines are prefetched in L1 for load instruction.
So it should not be the case that including mfence blocks prefetching. Both the store and load operations take almost the same time - without mfence 5-6 msec, with mfence 20 msec. I went through other questions regarding mfence but it's not mentioned what is expected behavior for it with prefetching and I don't see good enough reason or explanation why it would block prefetching in L1 cache with only store operations. Or I might be missing something for mfence description?
I am testing on Skylake miroarchitecture, however checked with Broadwell and got the same result.

It's not L1 prefetching that causes the counter values you see: the effect remains even if you disable the L1 prefetchers. In fact, the effect remains if you disable all prefetchers except the L2 streamer:
wrmsr -a 0x1a4 "$((2#1110))"
If you do disable the L2 streamer, however, the counts are as you'd expect: you see roughly 1,000,000 L2.RFO_MISS and L2.RFO_ALL even without the mfence.
First, it is important to note that the L2_RQSTS.RFO_* events count do not count RFO events originating from the L2 streamer. You can see the details here, but basically the umask for each of the 0x24 RFO events are:
name umask
RFO_MISS 0x22
RFO_HIT 0x42
ALL_RFO 0xE2
Note that none of the umask values have the 0x10 bit which indicates that events which originate from the L2 streamer should be tracked.
It seems like what happens is that when the L2 streamer is active, many of the events that you might expect to be assigned to one of those events are instead "eaten" by the L2 prefetcher events instead. What likely happens is that the L2 prefetcher is running ahead of the request stream, and when the demand RFO comes in from L1, it finds a request already in progress from the L2 prefetcher. This only increments again the umask |= 0x10 version of the event (indeed I get 2,000,000 total references when including that bit), which means that RFO_MISS and RFO_HIT and RFO_ALL will miss it.
It's somewhat analogous to the "fb_hit" scenario, where L1 loads neither miss nor hit exactly, but hit an in-progress load - but the complication here is the load was initiated by the L2 prefetcher.
The mfence just slows everything down enough that the L2 prefetcher almost always has time to bring the line all the way to L2, giving an RFO_HIT count.
I don't think the L1 prefetchers are involved here at all (shown by the fact that this works the same if you turn them off): as far as I know L1 prefetchers don't interact with stores, only loads.
Here are some useful perf commands you can use to see the difference in including the "L2 streamer origin" bit. Here's w/o the L2 streamer events:
perf stat --delay=1000 -e cpu/event=0x24,umask=0xef,name=l2_rqsts_references/,cpu/event=0x24,umask=0xe2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xc2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x22,name=l2_rqsts_rfo_miss/
and with them included:
perf stat --delay=1000 -e cpu/event=0x24,umask=0xff,name=l2_rqsts_references/,cpu/event=0x24,umask=0xf2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xd2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x32,name=l2_rqsts_rfo_miss/
I ran these against this code (with the sleep(1) lining up with the --delay=1000 command passed to perf to exclude the init code):
#include <time.h>
#include <immintrin.h>
#include <stdio.h>
#include <unistd.h>
typedef struct _object{
int value;
char pad[60];
} object;
int main() {
volatile object * array;
int arr_size = 1000000;
array = (object *) malloc(arr_size * sizeof(object));
for(int i=0; i < arr_size; i++){
array[i].value = 1;
_mm_clflush((const void*)&array[i]);
}
_mm_mfence();
sleep(1);
// printf("Starting main loop after %zu ms\n", (size_t)clock() * 1000u / CLOCKS_PER_SEC);
int tmp;
for(int i=0; i < arr_size-105; i++){
array[i].value = 2;
//tmp = array[i].value;
// _mm_mfence();
}
}

Regarding the case with store operations, I have run the same loop on a Haswell processor in four different configurations:
MFENCE + E: There is an MFENCE instruction after the store. All hardware prefetchers are enabled.
E : There is no MFENCE. All hardware prefetchers are enabled.
MFENCE + D: There is an MFENCE instruction after the store. All hardware prefetchers are disabled.
D : There is no MFENCE. All hardware prefetchers are disabled.
The results are shown below, which are normalized by the number of stores (each store is to a different cache line). They are very deterministic across multiple runs.
| MFENCE + E | E | MFENCE + D | D |
L2_RQSTS.ALL_RFO | 0.90 | 0.62 | 1.00 | 1.00 |
L2_RQSTS.RFO_HIT | 0.80 | 0.12 | 0.00 | 0.00 |
L2_RQSTS.RFO_MISS | 0.10 | 0.50 | 1.00 | 1.00 |
OFFCORE_REQUESTS.DEMAND_RFO | 0.20 | 0.88 | 1.00 | 1.00 |
PF_L3_RFO | 0.00 | 0.00 | 0.00 | 0.00 |
PF_RFO | 0.80 | 0.16 | 0.00 | 0.00 |
DMND_RFO | 0.19 | 0.84 | 1.00 | 1.00 |
The first four events are core events and the last three events are off-core response events:
L2_RQSTS.ALL_RFO: Occurs for each RFO request to the L2. This includes RFO requests from stores that have retired or otherwise, and RFO requests from PREFETCHW. For the cases where the hardware prefetchers are enabled, the event count is less than what is expected, which is a normalized one. One can think of two possible reasons for this: (1) somehow some of the RFOs hit in the L1, and (2) the event is undercounted. We'll try to figure out which is it by examining the counts of the other events and recalling what we know about the L1D prefetchers.
L2_RQSTS.RFO_HIT and L2_RQSTS.RFO_MISS: Occur for an RFO that hits or misses in the L2, respectively. In all configurations, the sum of the counts of these events is exactly equal to L2_RQSTS.ALL_RFO.
OFFCORE_REQUESTS.DEMAND_RFO: The documentation of this event suggests that it should be the same as L2_RQSTS.RFO_MISS. However, observe that the sum of OFFCORE_REQUESTS.DEMAND_RFO and L2_RQSTS.RFO_HIT is actually equal to one. Thus, it's possible that L2_RQSTS.RFO_MISS undercounts (and so L2_RQSTS.ALL_RFO does too). In fact, this is the most likely explanation because the Intel optimization manual (and other Intel documents) say that only the L2 streamer prefetcher can track stores. The Intel performance counter manual mentions "L1D RFO prefetches" in the description of L2_RQSTS.ALL_RFO. These prefetches probably refer to RFOs from stores that have not retired yet (see the last section of the answer to Why are the user-mode L1 store miss events only counted when there is a store initialization loop?).
PF_L3_RFO: Occurs when an RFO from the L2 streamer prefetcher is triggered and the target cache structure is the L3 only. All counts of this event are zero.
PF_RFO: Occurs when an RFO from the L2 streamer prefetcher is triggered and the target cache structure is the L2 and possibly the L3 (if the L3 is inclusive, then the line will also be filled into the L3 as well). The count of this event is close to L2_RQSTS.RFO_HIT. In the MFENCE + E case, it seems that 100% of the RFOs have completed on time (before the demand RFO has reached the L2). In the E case, 25% of prefetches did not complete on time or the wrong lines were prefetched. The reason why the number of RFO hits in the L2 is larger in the MFENCE + E case compared to the E case is that the MFENCE instruction delays later RFOs, thereby keeping most of the L2's super queue entries available for the L2 streamer prefetcher. So MFENCE really enables the L2 streamer prefetcher to perform better. Without it, there would be many in-flight demand RFOs at the L2, leaving a small number of super queue entries for prefetching.
DMND_RFO: The same as OFFCORE_REQUESTS.DEMAND_RFO, but it looks like it may undercount a little.
I checked with load operations. without mfence I get up to 2000 L1
hit, whereas with mfence, I have up to 1 million L1 hit (measured with
papi MEM_LOAD_RETIRED.L1_HIT event). The cache lines are prefetched in
L1 for load instruction.
Regarding the case with load operations, in my experience, MFENCE (or any other fence instruction) has no impact on the behavior of the hardware prefetchers. The true count of the MEM_LOAD_RETIRED.L1_HIT event here is actually very small (< 2000). Most of the events being counted are from MFENCE itself, not the loads. MFENCE (and SFENCE) require sending a fence request all the way to the memory controller to ensure that all pending stores have reached the global observation point. A fence request is not counted as an RFO event, but it may get counted as multiple events, including L1_HIT. For more information on this and similar observations, see my blog post: An Introduction to the Cache Hit and Miss Performance Monitoring Events.

vgetmantps vs andpd instructions for getting the mantissa of float

For skylakex (agner fog's instruction tables):
+-----------------------+-------------+-------------------+---------------------+----------------+---------+-----------------------+
| Instruction | Operands | µops fused domain | µops unfused domain | µops each port | Latency | Reciprocal throughput |
+-----------------------+-------------+-------------------+---------------------+----------------+---------+-----------------------+
| VGETMANTPS/PD | v,v,v | 1 | 1 | p01/05 | 4 | 0.5-1 |
| AND/ANDN/OR/ XORPS/PD | x,x / y,y,y | 1 | 1 | p015 | 1 | 0.33 |
+-----------------------+-------------+-------------------+---------------------+----------------+---------+-----------------------+
Does that mean using a bitmask and logical and to get the mantissa of a float is faster than using the vgetmantps instruction?
How much is the latency for transferring the number from float to int and back to float?

For implementing log(x), you want the mantissa and exponent as float, and vgetmantps / vgetexpps are perfect for it. Efficient implementation of log2(__m256d) in AVX2. This is what those instructions are for, and do speed up a fast approximation to log(2). (Plus it can normalize the significand to -0.5 .. +0.5 instead of 1 .. 2.0 or other neat ranges to create input for a polynomial approximation to log(x+1) or whatever. See its docs.)
If you only want the mantissa as an integer, then sure AND away the other bits and you're done in one instruction.
(But remember that for a NaN, the mantissa is the NaN payload, so if you need to do something different for NaN then you need to check the exponent.)
How much is the latency for transferring the number from float to int and back to float?
You already have Agner Fog's instruction tables (https://agner.org/optimize/). On Skylake (SKL and SKX) VCVT(T) PS2DQ is 4c latency for the FMA ports, and so is the other direction.
Or are you asking about bypass latency for using the output of an FP instruction like andps as the input to an integer instruction?
Agner Fog's microarch PDF has some info about bypass latency for sending data between vec-int and fp domains, but not many specifics.
Skylake's bypass latency is weird: unlike on previous uarches, it depends what port the instruction actually picked. andps has no bypass latency between FP instructions if it runs on port 5, but if it runs on p0 or p1, it has an extra 1c of latency.
See Intel's optimization manual for a table of domain-crossing latencies broken down by domain+execution-port.
(And just to be extra weird, this bypass-delay latency affects that register forever, even after it has definitely written back to a physical register and isn't being forwarded over the bypass network. vpaddd xmm0, xmm1, xmm2 has 2c latency for both inputs if either input came from vmulps. But some shuffles and other instructions work in either domain. It was a while since I experimented with this, and I didn't check my notes, so this example might not be exactly right, but it's something like this.)
(Intel's optimization manual doesn't mention this permanent effect which lasts until you overwrite the architectural register with a new value. So be careful about creating FP constants ahead of a loop with integer instructions.)

How are x86 uops scheduled, exactly?

Modern x86 CPUs break down the incoming instruction stream into micro-operations (uops1) and then schedule these uops out-of-order as their inputs become ready. While the basic idea is clear, I'd like to know the specific details of how ready instructions are scheduled, since it impacts micro-optimization decisions.
For example, take the following toy loop2:
top:
lea eax, [ecx + 5]
popcnt eax, eax
add edi, eax
dec ecx
jnz top
this basically implements the loop (with the following correspondence: eax -> total, c -> ecx):
do {
total += popcnt(c + 5);
} while (--c > 0);
I'm familiar with the process of optimizing any small loop by looking at the uop breakdown, dependency chain latencies and so on. In the loop above we have only one carried dependency chain: dec ecx. The first three instructions of the loop (lea, popcnt, add) are part of a dependency chain that starts fresh each loop.
The final dec and jne are fused. So we have a total of 4 fused-domain uops, and one only loop-carried dependency chain with a latency of 1 cycle. So based on that criteria, it seems that the loop can execute at 1 cycle/iteration.
However, we should look at the port pressure too:
The lea can execute on ports 1 and 5
The popcnt can execute on port 1
The add can execute on port 0, 1, 5 and 6
The predicted-taken jnz executes on port 6
So to get to 1 cycle / iteration, you pretty much need the following to happen:
The popcnt must execute on port 1 (the only port it can execute on)
The lea must execute on port 5 (and never on port 1)
The add must execute on port 0, and never on any of other three ports it can execute on
The jnz can only execute on port 6 anyway
That's a lot of conditions! If instructions just got scheduled randomly, you could get a much worse throughput. For example, 75% the add would go to port 1, 5 or 6, which would delay the popcnt, lea or jnz by one cycle. Similarly for the lea which can go to 2 ports, one shared with popcnt.
IACA on the other hand reports a result very close to optimal, 1.05 cycles per iteration:
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - l.o
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 1.05 Cycles Throughput Bottleneck: FrontEnd, Port0, Port1, Port5
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 1.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 1.0 | 0.9 | 0.0 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
# - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | | | | 1.0 | | | CP | lea eax, ptr [ecx+0x5]
| 1 | | 1.0 | | | | | | | CP | popcnt eax, eax
| 1 | 0.1 | | | | | 0.1 | 0.9 | | CP | add edi, eax
| 1 | 0.9 | | | | | | 0.1 | | CP | dec ecx
| 0F | | | | | | | | | | jnz 0xfffffffffffffff4
It pretty much reflects the necessary "ideal" scheduling I mentioned above, with a small deviation: it shows the add stealing port 5 from the lea on 1 out of 10 cycles. It also doesn't know that the fused branch is going to go to port 6 since it is predicted taken, so it puts most of the uops for the branch on port 0, and most of the uops for the add on port 6, rather than the other way around.
It's not clear if the extra 0.05 cycles that IACA reports over the optimal is the result of some deep, accurate analysis, or a less insightful consequence of the algorithm it uses, e.g., analyzing the loop over a fixed number of cycles, or just a bug or whatever. The same goes for the 0.1 fraction of a uop that it thinks will go to the non-ideal port. It is also not clear if one explains the other - I would think that mis-assigning a port 1 out of 10 times would cause a cycle count of 11/10 = 1.1 cycles per iteration, but I haven't worked out the actual downstream results - maybe the impact is less on average. Or it could just be rounding (0.05 == 0.1 to 1 decimal place).
So how do modern x86 CPUs actually schedule? In particular:
When multiple uops are ready in the reservation station, in what order are they scheduled to ports?
When a uop can go to multiple ports (like the add and lea in the example above), how is it decided which port is chosen?
If any of the answers involve a concept like oldest to choose among uops, how is it defined? Age since it was delivered to the RS? Age since it became ready? How are ties broken? Does program order ever come into it?
Results on Skylake
Let's measure some actual results on Skylake to check which answers explain the experimental evidence, so here are some real-world measured results (from perf) on my Skylake box. Confusingly, I'm going switch to using imul for my "only executes on one port" instruction, since it has many variants, including 3-argument versions that allow you to use different registers for the source(s) and destination. This is very handy when trying to construct dependency chains. It also avoids the whole "incorrect dependency on destination" that popcnt has.
Independent Instructions
Let's start by looking at the simple (?) case that the instructions are relatively independent - without any dependency chains other than trivial ones like the loop counter.
Here's a 4 uop loop (only 3 executed uops) with mild pressure. All instructions are independent (don't share any sources or destinations). The add could in principle steal the p1 needed by the imul or p6 needed by the dec:
Example 1
instr p0 p1 p5 p6
xor (elim)
imul X
add X X X X
dec X
top:
xor r9, r9
add r8, rdx
imul rax, rbx, 5
dec esi
jnz top
The results is that this executes with perfect scheduling at 1.00 cycles / iteration:
560,709,974 uops_dispatched_port_port_0 ( +- 0.38% )
1,000,026,608 uops_dispatched_port_port_1 ( +- 0.00% )
439,324,609 uops_dispatched_port_port_5 ( +- 0.49% )
1,000,041,224 uops_dispatched_port_port_6 ( +- 0.00% )
5,000,000,110 instructions:u # 5.00 insns per cycle ( +- 0.00% )
1,000,281,902 cycles:u
( +- 0.00% )
As expected, p1 and p6 are fully utilized by the imul and dec/jnz respectively, and then the add issues roughly half and half between the remaining available ports. Note roughly - the actual ratio is 56% and 44%, and this ratio is pretty stable across runs (note the +- 0.49% variation). If I adjust the loop alignment, the split changes (53/46 for 32B alignment, more like 57/42 for 32B+4 alignment). Now, we if change nothing except the position of imul in the loop:
Example 2
top:
imul rax, rbx, 5
xor r9, r9
add r8, rdx
dec esi
jnz top
Then suddenly the p0/p5 split is exactly 50%/50%, with 0.00% variation:
500,025,758 uops_dispatched_port_port_0 ( +- 0.00% )
1,000,044,901 uops_dispatched_port_port_1 ( +- 0.00% )
500,038,070 uops_dispatched_port_port_5 ( +- 0.00% )
1,000,066,733 uops_dispatched_port_port_6 ( +- 0.00% )
5,000,000,439 instructions:u # 5.00 insns per cycle ( +- 0.00% )
1,000,439,396 cycles:u ( +- 0.01% )
So that's already interesting, but it's hard to tell what's going on. Perhaps the exact behavior depends on the initial conditions at loop entry and is sensitive to ordering within the loop (e.g., because counters are used). This example shows that something more than "random" or "stupid" scheduling is going on. In particular, if you just eliminate the imul instruction from the loop, you get the following:
Example 3
330,214,329 uops_dispatched_port_port_0 ( +- 0.40% )
314,012,342 uops_dispatched_port_port_1 ( +- 1.77% )
355,817,739 uops_dispatched_port_port_5 ( +- 1.21% )
1,000,034,653 uops_dispatched_port_port_6 ( +- 0.00% )
4,000,000,160 instructions:u # 4.00 insns per cycle ( +- 0.00% )
1,000,235,522 cycles:u ( +- 0.00% )
Here, the add is now roughly evenly distributed among p0, p1 and p5 - so the presence of the imul did affect the add scheduling: it wasn't just a consequence of some "avoid port 1" rule.
Note here that total port pressure is only 3 uops/cycle, since the xor is a zeroing idiom and is eliminated in the renamer. Let's try with the max pressure of 4 uops. I expect whatever mechanism kicked in above to able to perfectly schedule this also. We only change xor r9, r9 to xor r9, r10, so it is no longer a zeroing idiom. We get the following results:
Example 4
top:
xor r9, r10
add r8, rdx
imul rax, rbx, 5
dec esi
jnz top
488,245,238 uops_dispatched_port_port_0 ( +- 0.50% )
1,241,118,197 uops_dispatched_port_port_1 ( +- 0.03% )
1,027,345,180 uops_dispatched_port_port_5 ( +- 0.28% )
1,243,743,312 uops_dispatched_port_port_6 ( +- 0.04% )
5,000,000,711 instructions:u # 2.66 insns per cycle ( +- 0.00% )
1,880,606,080 cycles:u ( +- 0.08% )
Oops! Rather than evenly scheduling everything across p0156, the scheduler has underused p0 (it's only executing something ~49% of cycles), and hence p1 and p6 are oversubcribed because they are executing both their required ops of imul and dec/jnz. This behavior, I think is consistent with a counter-based pressure indicator as hayesti indicated in their answer, and with uops being assigned to a port at issue-time, not at execution time as both
hayesti and Peter Cordes mentioned. That behavior3 makes the execute the oldest ready uops rule not nearly as effective. If uops weren't bound to execution ports at issue, but rather at execution, then this "oldest" rule would fix the problem above after one iteration - once one imul and one dec/jnz got held back for a single iteration, they will always be older than the competing xor and add instructions, so should always get scheduled first. One thing I am learning though, is that if ports are assigned at issue time, this rule doesn't help because the ports are pre-determined at issue time. I guess it still helps a bit in favoring instructions which are part of long dependecy chains (since these will tend to fall behind), but it's not the cure-all I thought it was.
That also seems to be a explain the results above: p0 gets assigned more pressure than it really has because the dec/jnz combo can in theory execute on p06. In fact because the branch is predicted taken it only ever goes to p6, but perhaps that info can't feed into the pressure balancing algorithm, so the counters tend to see equal pressure on p016, meaning that the add and the xor get spread around differently than optimal.
Probably we can test this, by unrolling the loop a bit so the jnz is less of a factor...
1 OK, it is properly written μops, but that kills search-ability and to actually type the "μ" character I'm usually resorting to copy-pasting the character from a webpage.
2 I had originally used imul instead of popcnt in the loop, but, unbelievably, _IACA doesn't support it_!
3 Please note that I'm not suggesting this is a poor design or anything - there are probably very good hardware reasons why the scheduler cannot easily make all its decisions at execution time.

Your questions are tough for a couple of reasons:
The answer depends a lot on the microarchitecture of the processor
which can vary significantly from generation to generation.
These are fine-grained details which Intel doesn't generally release to the public.
Nevertheless, I'll try to answer...
When multiple uops are ready in the reservation station, in what order are they scheduled to ports?
It should be the oldest [see below], but your mileage may vary. The P6 microarchitecture (used in the Pentium Pro, 2 & 3) used a reservation station with five schedulers (one per execution port); the schedulers used a priority pointer as a place to start scanning for ready uops to dispatch. It was only pseudo FIFO so it's entirely possible that the oldest ready instruction was not always scheduled. In the NetBurst microarchitecture (used in Pentium 4), they ditched the unified reservation station and used two uop queues instead. These were proper collapsing priority queues so the schedulers were guaranteed to get the oldest ready instruction. The Core architecture returned to a reservation station and I would hazard an educated guess that they used the collapsing priority queue, but I can't find a source to confirm this. If somebody has a definitive answer, I'm all ears.
When a uop can go to multiple ports (like the add and lea in the example above), how is it decided which port is chosen?
That's tricky to know. The best I could find is a patent from Intel describing such a mechanism. Essentially, they keep a counter for each port that has redundant functional units. When the uops leave the front end to the reservation station, they are assigned a dispatch port. If it has to decide between multiple redundant execution units, the counters are used to distribute the work evenly. Counters are incremented and decremented as uops enter and leave the reservation station respectively.
Naturally this is just a heuristic and does not guarantee a perfect conflict-free schedule, however, I could still see it working with your toy example. The instructions which can only go to one port would ultimately influence the scheduler to dispatch the "less restricted" uops to other ports.
In any case, the presence of a patent doesn't necessarily imply that the idea was adopted (although that said, one of the authors was also a tech lead of the Pentium 4, so who knows?)
If any of the answers involve a concept like oldest to choose among uops, how is it defined? Age since it was delivered to the RS? Age since it became ready? How are ties broken? Does program order ever come into it?
Since uops are inserted into the reservation station in order, oldest here does indeed refer to time it entered the reservation station, i.e. oldest in program order.
By the way, I would take those IACA results with a grain of salt as they may not reflect the nuances of the real hardware. On Haswell, there is a hardware counter called uops_executed_port that can tell you how many cycles in your thread were uops issues to ports 0-7. Maybe you could leverage these to get a better understanding of your program?

Here's what I found on Skylake, coming at it from the angle that uops are assigned to ports at issue time (i.e., when they are issued to the RS), not at dispatch time (i.e., at the moment they are sent to execute). Before I had understood that the port decision was made at dispatch time.
I did a variety of tests which tried to isolate sequences of add operations that can go to p0156 and imul operations which go only to port 0. A typical test goes something like this:
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
... many more mov instructions
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1
add r9, 1
add r8, 1
add ecx, 1
add edx, 1
add r9, 1
add r8, 1
add ecx, 1
add edx, 1
add r9, 1
add r8, 1
add ecx, 1
add edx, 1
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
... many more mov instructions
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
Basically there is a long lead-in of mov eax, [edi] instructions, which only issue on p23 and hence don't clog up the ports used by the instructions (I could have also used nop instructions, but the test would be a bit different since nop don't issue to the RS). This is followed by the "payload" section, here composed of 4 imul and 12 add, and then a lead-out section of more dummy mov instructions.
First, let's take a look at the patent that hayesti linked above, and which he describes the basic idea about: counters for each port that track the total number of uops assigned to the port, which are used to load balance the port assignments. Take a look at this table included in the patent description:
This table is used to pick between p0 or p1 for the 3-uops in an issue group for the 3-wide architecture discussed in the patent. Note that the behavior depends on the position of the uop in the group, and that there are 4 rules1 based on the count, which spread the uops around in a logical way. In particular, the count needs to be at +/- 2 or greater before the whole group gets assigned the under-used port.
Let's see if we can observe the "position in issue group" matters behavior on Sklake. We use a payload of a single add like:
add edx, 1 ; position 0
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
... and we slide it around inside the 4 instruction chuck like:
mov eax, [edi]
add edx, 1 ; position 1
mov eax, [edi]
mov eax, [edi]
... and so on, testing all four positions within the issue group2. This shows the following, when the RS is full (of mov instructions) but with no port pressure of any of the relevant ports:
The first add instructions go to p5 or p6, with the port selected usually alternating as the instruction is slow down (i.e., add instructions in even positions go to p5 and in odd positions go to p6).
The second add instruction also goes to p56 - whichever of the two the first one didn't go to.
After that further add instructions start to be balanced around p0156, with p5 and p6 usually ahead but with things fairly even overall (i.e., the gap between p56 and the other two ports doesn't grow).
Next, I took a look at what happens if load up p1 with imul operations, then first in a bunch of add operations:
imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1
add r9, 1
add r8, 1
add ecx, 1
add edx, 1
add r9, 1
add r8, 1
add ecx, 1
add edx, 1
add r9, 1
add r8, 1
add ecx, 1
add edx, 1
The results show that the scheduler handles this well - all of the imul got to scheduled to p1 (as expected), and then none of the subsequent add instructions went to p1, being spread around p056 instead. So here the scheduling is working well.
Of course, when the situation is reversed, and the series of imul comes after the adds, p1 is loaded up with its share of adds before the imuls hit. That's a result of port assignment happening in-order at issue time, since is no mechanism to "look ahead" and see the imul when scheduling the adds.
Overall the scheduler looks to do a good job in these test cases.
It doesn't explain what happens in smaller, tighter loops like the following:
sub r9, 1
sub r10, 1
imul ebx, edx, 1
dec ecx
jnz top
Just like Example 4 in my question, this loop only fills p0 on ~30% of cycles, despite there being two sub instructions that should be able to go to p0 on every cycle. p1 and p6 are oversubscribed, each executing 1.24 uops for every iteration (1 is ideal). I wasn't able to triangulate the difference between the examples that work well at the top of this answer with the bad loops - but there are still many ideas to try.
I did note that examples without instruction latency differences don't seem to suffer from this issue. For example, here's another 4-uop loop with "complex" port pressure:
top:
sub r8, 1
ror r11, 2
bswap eax
dec ecx
jnz top
The uop map is as follows:
instr p0 p1 p5 p6
sub X X X X
ror X X
bswap X X
dec/jnz X
So the sub must always go to p15, shared with bswap if things are to work out. They do:
Performance counter stats for './sched-test2' (2 runs):
999,709,142 uops_dispatched_port_port_0 ( +- 0.00% )
999,675,324 uops_dispatched_port_port_1 ( +- 0.00% )
999,772,564 uops_dispatched_port_port_5 ( +- 0.00% )
1,000,991,020 uops_dispatched_port_port_6 ( +- 0.00% )
4,000,238,468 uops_issued_any ( +- 0.00% )
5,000,000,117 instructions:u # 4.99 insns per cycle ( +- 0.00% )
1,001,268,722 cycles:u ( +- 0.00% )
So it seems that the issue may be related to instruction latencies (certainly, there are other differences between the examples). That's something that came up in this similar question.
1The table has 5 rules, but the rule for 0 and -1 counts are identical.
2Of course, I can't be sure where the issue groups start and end, but regardless we test four different positions as we slide down four instructions (but the labels could be wrong). I'm also not sure the issue group max size is 4 - earlier parts of the pipeline are wider - but I believe it is and some testing seemed to show it was (loops with a multiple of 4 uops showed consistent scheduling behavior). In any case, the conclusions hold with different scheduling group sizes.

Section 2.12 of Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures[^1] explains how port are assigned, though it fails to explain example 4 in the question description. I also failed to figure out what role Latency plays in the port assignment.
Previous work [19, 25, 26] has identified the ports that the µops of individual instructions can use. For µops that can use more than one port, it was, however, previously unknown how the actual port is chosen by the processor. We reverse-engineered the port assignment algorithm using microbenchmarks. In the following, we describe our findings for CPUs with eight ports; such CPUs are currently most widely used.
The ports are assigned when the µops are issued by the renamer to the scheduler. In a single cycle, up to four µops can be issued. In the following, we will call the position of a µop within a cycle an issue slot; e.g., the oldest instruction issued in a cycle would occupy issue slot 0.
The port that a µop is assigned depends on its issue slot and on the ports assigned to µops that have not been executed and were issued in a previous cycle.
In the following, we will only consider µops that can use more than one port. For a given µop m, let $P_{min}$ be the port to which the fewest non-executed µops have been assigned to from among the ports that m can use. Let $P_{min'}$ be the port with the second smallest usage so far. If there is a tie among the ports with the smallest (or second smallest, respectively) usage, let $P_{min}$ (or $P_{min'}$) be the port with the highest port number from among these ports (the reason for this choice is probably that ports with higher numbers are connected to fewer functional units). If the difference between $P_{min}$ and $P_{min'}$ is greater or equal to 3, we set $P_{min'}$ to $P_{min}$.
The µops in issue slots 0 and 2 are assigned to port $P_{min}$ The µops in issue slots 1 and 3 are assigned to port $P_{min'}$.
A special case is µops that can use port 2 and port 3. These ports are used by µops that handle memory accesses, and both ports are connected to the same types of functional units. For such µops, the port assignment algorithm alternates between port 2 and port 3.
I tried to find out whether $P_{min}$ and $P_{min'}$ are shared between threads (Hyper-Threading), namely whether one thread can affect the port assignment of another one in the same core.
Just split the code used in BeeOnRope's answer into two threads.
thread1:
.loop:
imul rax, rbx, 5
jmp .loop
thread2:
mov esi,1000000000
.top:
bswap eax
dec esi
jnz .top
jmp thread2
Where instructions bswap can be executed on ports 1 and 5, and imul r64, R64, i on port 1. If counters were shared between threads, you would see bswap executed on port 5 and imul executed on port 1.
The experiment was recorded as follows, where ports P0 and P5 on thread 1 and p0 on thread 2 should have recorded a small amount of non-user data, but without hindering the conclusion. It can be seen from the data that the bswap instruction of thread 2 is executed alternately between ports P1 and P5 without giving up P1.
port
thread 1 active cycles
thread 2 active cycles
P0
63,088,967
68,022,708
P1
180,219,013,832
95,742,764,738
P5
63,994,200
96,291,124,547
P6
180,330,835,515
192,048,880,421
total
180,998,504,099
192,774,759,297
Therefore, the counters are not shared between threads.
This conclusion does not conflict with SMotherSpectre[^2], which uses time as the side channel. (For example, thread 2 waits longer on port 1 to use port 1.)
Executing instructions that occupy a specific port and measuring their timing enables inference about other instructions executing on the same port. We first choose two instructions, each scheduled on a single, distinct, execution port. One thread runs and times a long sequence of single µop instructions scheduled on port a, while simultaneously the other thread runs a long sequence of instructions scheduled on port b. We expect that, if a = b, contention occurs and the measured execution time is longer compared to the a ≠ b case.
[^1]: Abel, Andreas, and Jan Reineke. "Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures." arXiv preprint arXiv:2107.14210 (2021).
[^2]: Bhattacharyya, Atri, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. “SMoTherSpectre: Exploiting Speculative Execution through Port Contention.” Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, November 6, 2019, 785–800. https://doi.org/10.1145/3319535.3363194.

Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs

This is related, but not the same, as this question: Performance optimisations of x86-64 assembly - Alignment and branch prediction and is slightly related to my previous question: Unsigned 64-bit to double conversion: why this algorithm from g++
The following is a not real-world test case. This primality testing algorithm is not sensible. I suspect any real-world algorithm would never execute such a small inner-loop quite so many times (num is a prime of size about 2**50). In C++11:
using nt = unsigned long long;
bool is_prime_float(nt num)
{
for (nt n=2; n<=sqrt(num); ++n) {
if ( (num%n)==0 ) { return false; }
}
return true;
}
Then g++ -std=c++11 -O3 -S produces the following, with RCX containing n and XMM6 containing sqrt(num). See my previous post for the remaining code (which is never executed in this example, as RCX never becomes large enough to be treated as a signed negative).
jmp .L20
.p2align 4,,10
.L37:
pxor %xmm0, %xmm0
cvtsi2sdq %rcx, %xmm0
ucomisd %xmm0, %xmm6
jb .L36 // Exit the loop
.L20:
xorl %edx, %edx
movq %rbx, %rax
divq %rcx
testq %rdx, %rdx
je .L30 // Failed divisibility test
addq $1, %rcx
jns .L37
// Further code to deal with case when ucomisd can't be used
I time this using std::chrono::steady_clock. I kept getting weird performance changes: from just adding or deleting other code. I eventually tracked this down to an alignment issue. The command .p2align 4,,10 tried to align to a 2**4=16 byte boundary, but only uses at most 10 bytes of padding to do so, I guess to balance between alignment and code size.
I wrote a Python script to replace .p2align 4,,10 by a manually controlled number of nop instructions. The following scatter plot shows the fastest 15 of 20 runs, time in seconds, number of bytes padding at the x-axis:
From objdump with no padding, the pxor instruction will occur at offset 0x402f5f. Running on a laptop, Sandybridge i5-3210m, turboboost disabled, I found that
For 0 byte padding, slow performance (0.42 secs)
For 1-4 bytes padding (offset 0x402f60 to 0x402f63) get slightly better (0.41s, visible on the plot).
For 5-20 bytes padding (offset 0x402f64 to 0x402f73) get fast performance (0.37s)
For 21-32 bytes padding (offset 0x402f74 to 0x402f7f) slow performance (0.42 secs)
Then cycles on a 32 byte sample
So a 16-byte alignment doesn't give the best performance-- it puts us in the slightly better (or just less variation, from the scatter plot) region. Alignment of 32 plus 4 to 19 gives the best performance.
Why am I seeing this performance difference? Why does this seem to violate the rule of aligning branch targets to a 16-byte boundary (see e.g. the Intel optimisation manual)
I don't see any branch-prediction problems. Could this be a uop cache quirk??
By changing the C++ algorithm to cache sqrt(num) in an 64-bit integer and then make the loop purely integer based, I remove the problem-- alignment now makes no difference at all.

Here's what I found on Skylake for the same loop. All the code to reproduce my tests on your hardware is on github.
I observe three different performance levels based on alignment, whereas the OP only really saw 2 primary ones. The levels are very distinct and repeatable2:
We see three distinct performance levels here (the pattern repeats starting from offset 32), which we'll call regions 1, 2 and 3, from left to right (region 2 is split into two parts straddling region 3). The fastest region (1) is from offset 0 to 8, the middle (2) region is from 9-18 and 28-31, and the slowest (3) is from 19-27. The difference between each region is close to or exactly 1 cycle/iteration.
Based on the performance counters, the fastest region is very different from the other two:
All the instructions are delivered from the legacy decoder, not from the DSB1.
There are exactly 2 decoder <-> microcode switches (idq_ms_switches) for every iteration of the loop.
On the hand, the two slower regions are fairly similar:
All the instructions are delivered from the DSB (uop cache), and not from the legacy decoder.
There are exactly 3 decoder <-> microcode switches per iteration of the loop.
The transition from the fastest to the middle region, as the offset changes from 8 to 9, corresponds exactly to when the loop starts fitting in the uop buffer, because of alignment issues. You count this out in exactly the same way as Peter did in his answer:
Offset 8:
LSD? <_start.L37>:
ab 1 4000a8: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000ac: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000b1: 66 0f 2e f0 ucomisd xmm6,xmm0
ab 1 4000b5: 72 21 jb 4000d8 <_start.L36>
ab 2 4000b7: 31 d2 xor edx,edx
ab 2 4000b9: 48 89 d8 mov rax,rbx
ab 3 4000bc: 48 f7 f1 div rcx
!!!! 4000bf: 48 85 d2 test rdx,rdx
4000c2: 74 0d je 4000d1 <_start.L30>
4000c4: 48 83 c1 01 add rcx,0x1
4000c8: 79 de jns 4000a8 <_start.L37>
In the first column I've annotated how the uops for each instruction end up in the uop cache. "ab 1" means they go in the set associated with address like ...???a? or ...???b? (each set covers 32 bytes, aka 0x20), while 1 means way 1 (out of a max of 3).
At the point !!! this busts out of the uop cache because the test instruction has no where to go, all the 3 ways are used up.
Let's look at offset 9 on the other hand:
00000000004000a9 <_start.L37>:
ab 1 4000a9: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000ad: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000b2: 66 0f 2e f0 ucomisd xmm6,xmm0
ab 1 4000b6: 72 21 jb 4000d9 <_start.L36>
ab 2 4000b8: 31 d2 xor edx,edx
ab 2 4000ba: 48 89 d8 mov rax,rbx
ab 3 4000bd: 48 f7 f1 div rcx
cd 1 4000c0: 48 85 d2 test rdx,rdx
cd 1 4000c3: 74 0d je 4000d2 <_start.L30>
cd 1 4000c5: 48 83 c1 01 add rcx,0x1
cd 1 4000c9: 79 de jns 4000a9 <_start.L37>
Now there is no problem! The test instruction has slipped into the next 32B line (the cd line), so everything fits in the uop cache.
So that explains why stuff changes between the MITE and DSB at that point. It doesn't, however, explain why the MITE path is faster. I tried some simpler tests with div in a loop, and you can reproduce this with simpler loops without any of the floating point stuff. It's weird and sensitive to random other stuff you put in the loop.
For example this loop also executes faster out of the legacy decoder than the DSB:
ALIGN 32
<add some nops here to swtich between DSB and MITE>
.top:
add r8, r9
xor eax, eax
div rbx
xor edx, edx
times 5 add eax, eax
dec rcx
jnz .top
In that loop, adding the pointless add r8, r9 instruction, which doesn't really interact with the rest of the loop, sped things up for the MITE version (but not the DSB version).
So I think the difference between region 1 an region 2 and 3 is due to the former executing out of the legacy decoder (which, oddly, makes it faster).
Let's also take a look at the offset 18 to offset 19 transition (where region2 ends and 3 starts):
Offset 18:
00000000004000b2 <_start.L37>:
ab 1 4000b2: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000b6: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000bb: 66 0f 2e f0 ucomisd xmm6,xmm0
ab 1 4000bf: 72 21 jb 4000e2 <_start.L36>
cd 1 4000c1: 31 d2 xor edx,edx
cd 1 4000c3: 48 89 d8 mov rax,rbx
cd 2 4000c6: 48 f7 f1 div rcx
cd 3 4000c9: 48 85 d2 test rdx,rdx
cd 3 4000cc: 74 0d je 4000db <_start.L30>
cd 3 4000ce: 48 83 c1 01 add rcx,0x1
cd 3 4000d2: 79 de jns 4000b2 <_start.L37>
Offset 19:
00000000004000b3 <_start.L37>:
ab 1 4000b3: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000b7: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000bc: 66 0f 2e f0 ucomisd xmm6,xmm0
cd 1 4000c0: 72 21 jb 4000e3 <_start.L36>
cd 1 4000c2: 31 d2 xor edx,edx
cd 1 4000c4: 48 89 d8 mov rax,rbx
cd 2 4000c7: 48 f7 f1 div rcx
cd 3 4000ca: 48 85 d2 test rdx,rdx
cd 3 4000cd: 74 0d je 4000dc <_start.L30>
cd 3 4000cf: 48 83 c1 01 add rcx,0x1
cd 3 4000d3: 79 de jns 4000b3 <_start.L37>
The only difference I see here is that the first 4 instructions in the offset 18 case fit into the ab cache line, but only 3 in the offset 19 case. If we hypothesize that the DSB can only deliver uops to the IDQ from one cache set, this means that at some point one uop may be issued and executed a cycle earlier in the offset 18 scenario than in the 19 scenario (imagine, for example, that the IDQ is empty). Depending on exactly what port that uop goes to in the context of the surrounding uop flow, that may delay the loop by one cycle. Indeed, the difference between region 2 and 3 is ~1 cycle (within the margin of error).
So I think we can say that the difference between 2 and 3 is likely due to uop cache alignment - region 2 has a slightly better alignment than 3, in terms of issuing one additional uop one cycle earlier.
Some addition notes on things I checked that didn't pan out as being a possible cause of the slowdowns:
Despite the DSB modes (regions 2 and 3) having 3 microcode switches versus the 2 of the MITE path (region 1), that doesn't seem to directly cause the slowdown. In particular, simpler loops with div execute in identical cycle counts, but still show 3 and 2 switches for DSB and MITE paths respectively. So that's normal and doesn't directly imply the slowdown.
Both paths execute essentially identical number of uops and, in particular, have identical number of uops generated by the microcode sequencer. So it's not like there is more overall work being done in the different regions.
There wasn't really an difference in cache misses (very low, as expected) at various levels, branch mispredictions (essentially zero3), or any other types of penalties or unusual conditions I checked.
What did bear fruit is looking at the pattern of execution unit usage across the various regions. Here's a look at the distribution of uops executed per cycle and some stall metrics:
+----------------------------+----------+----------+----------+
| | Region 1 | Region 2 | Region 3 |
+----------------------------+----------+----------+----------+
| cycles: | 7.7e8 | 8.0e8 | 8.3e8 |
| uops_executed_stall_cycles | 18% | 24% | 23% |
| exe_activity_1_ports_util | 31% | 22% | 27% |
| exe_activity_2_ports_util | 29% | 31% | 28% |
| exe_activity_3_ports_util | 12% | 19% | 19% |
| exe_activity_4_ports_util | 10% | 4% | 3% |
+----------------------------+----------+----------+----------+
I sampled a few different offset values and the results were consistent within each region, yet between the regions you have quite different results. In particular, in region 1, you have fewer stall cycles (cycles where no uop is executed). You also have significant variation in the non-stall cycles, although no clear "better" or "worse" trend is evident. For example, region 1 has many more cycles (10% vs 3% or 4%) with 4 uops executed, but the other regions largely make up for it with more cycles with 3 uops executed, and few cycles with 1 uop executed.
The difference in UPC4 that the execution distribution above implies fully explains the difference in performance (this is probably a tautology since we already confirmed the uop count is the same between them).
Let's see what toplev.py has to say about it ... (results omitted).
Well, toplev suggests that the primary bottleneck is the front-end (50+%). I don't think you can trust this because the way it calculates FE-bound seems broken in the case of long strings of micro-coded instructions. FE-bound is based on frontend_retired.latency_ge_8, which is defined as:
Retired instructions that are fetched after an interval where the
front-end delivered no uops for a period of 8 cycles which was not
interrupted by a back-end stall. (Supports PEBS)
Normally that makes sense. You are counting instructions which were delayed because the frontend wasn't delivering cycles. The "not interrupted by a back-end stall" condition ensures that this doesn't trigger when the front-end isn't delivering uops simply because is the backend is not able to accept them (e.g,. when the RS is full because the backend is performing some low-throuput instructions).
It kind of seems for div instructions - even a simple loop with pretty much just one div shows:
FE Frontend_Bound: 57.59 % [100.00%]
BAD Bad_Speculation: 0.01 %below [100.00%]
BE Backend_Bound: 0.11 %below [100.00%]
RET Retiring: 42.28 %below [100.00%]
That is, the only bottleneck is the front-end ("retiring" is not a bottleneck, it represents the useful work). Clearly, such a loop is trivially handled by the front-end and is instead limited by the backend's ability to chew threw all the uops generated by the div operation. Toplev might get this really wrong because (1) it may be that the uops delivered by the microcode sequencer aren't counted in the frontend_retired.latency... counters, so that every div operation causes that event to count all the subsequent instructions (even though the CPU was busy during that period - there was no real stall), or (2) the microcode sequencer might deliver all its ups essentially "up front", slamming ~36 uops into the IDQ, at which point it doesn't deliver any more until the div is finished, or something like that.
Still, we can look at the lower levels of toplev for hints:
The main difference toplev calls out between the regions 1 and region 2 and 3 is the increased penalty of ms_switches for the latter two regions (since they incur 3 every iteration vs 2 for the legacy path. Internally, toplev estimates a 2-cycle penalty in the frontend for such switches. Of course, whether these penalties actually slow anything down depends in a complex way on the instruction queue and other factors. As mentioned above, a simple loop with div doesn't show any difference between the DSB and MITE paths, a loop with additional instructions does. So it could be that the extra switch bubble is absorbed in simpler loops (where the backend processing of all the uops generated by the div is the main factor), but once you add some other work in the loop, the switches become a factor at least for the transition period between the div and non-div` work.
So I guess my conclusion is that the way the div instruction interacts with the rest of the frontend uop flow, and backend execution, isn't completely well understood. We know it involves a flood of uops, delivered both from the MITE/DSB (seems like 4 uops per div) and from the microcode sequencer (seems like ~32 uops per div, although it changes with different input values to the div op) - but we don't know what those uops are (we can see their port distribution though). All that makes the behavior fairly opaque, but I think it is probably down to either the MS switches bottlnecking the front-end, or slight differences in the uop delivery flow resulting in different scheduling decisions which end up making the MITE order master.
1 Of course, most of the uops are not delivered from the legacy decoder or DSB at all, but by the microcode sequencer (ms). So we loosely talk about instructions delivered, not uops.
2 Note that the x axis here is "offset bytes from 32B alignment". That is, 0 means the top of the loop (label .L37) is aligned to a 32B boundary, and 5 means the loop starts five bytes below a 32B boundary (using nop for padding) and so on. So my padding bytes and offset are the same. The OP used a different meaning for offset, if I understand it correctly: his 1 byte of padding resulted in a 0 offset. So you would subtract 1 from the OPs padding values to get my offset values.
3 In fact, the branch prediction rate for a typical test with prime=1000000000000037 was ~99.999997%, reflecting only 3 mispredicted branches in the entire run (likely on the first pass through the loop, and the last iteration).
4 UPC, i.e., uops per cycle - a measure closely related to IPC for similar programs, and one that is a bit more precise when we are looking in detail at uop flows. In this case, we already know the uop counts are the same for all variations of alignment, so UPC and IPC will be directly proportional.

I don't have a specific answer, just a few different hypotheses that I'm unable to test (lack of hardware). I thought I'd found something conclusive, but I had the alignment off by one (because the question counts padding from 0x5F, not from an aligned boundary). Anyway, hopefully it's useful to post this anyway to describe the factors that are probably at play here.
The question also doesn't specify the encoding of the branches (short (2B) or near (6B)). This leaves too many possibilities to look at and theorize about exactly which instruction crossing a 32B boundary or not is causing the issue.
I think it's either a matter of the loop fitting in the uop cache or not, or else it's a matter of alignment mattering for whether it decodes fast with the legacy decoders.
Obviously that asm loop could be improved a lot (e.g. by hoisting the floating-point out of it, not to mention using a different algorithm entirely), but that's not the question. We just want to know why alignment matters for this exact loop.
You might expect that a loop that bottlenecks on division wouldn't bottleneck on the front-end or be affected by alignment, because division is slow and the loop runs very few instructions per clock. That's true, but 64-bit DIV is micro-coded as 35-57 micro-ops (uops) on IvyBridge, so it turns out there can be front-end issues.
The two main ways alignment can matter are:
Front-end bottlenecks (in the fetch/decode stages), leading to bubbles in keeping the out-of-order core supplied with work to do.
Branch prediction: if two branches have the same address modulo some large power of 2, they can alias each other in the branch prediction hardware. Code alignment in one object file is affecting the performance of a function in another object file
scratches the surface of this issue, but much has been written about it.
I suspect this is a purely front-end issue, not branch prediction, since the code spends all its time in this loop, and isn't running other branches that might alias with the ones here.
Your Intel IvyBridge CPU is a die-shrink of SandyBridge. It has a few changes (like mov-elimination, and ERMSB), but the front-end is similar between SnB/IvB/Haswell. Agner Fog's microarch pdf has enough details to analyze what should happen when the CPU runs this code. See also David Kanter's SandyBridge writeup for a block diagram of the fetch/decode stages, but he splits the fetch/decode from the uop cache, microcode, and decoded-uop queue. At the end, there's a full block diagram of a whole core. His Haswell article has a block diagram including the whole front-end, up to the decoded-uop queue that feeds the issue stage. (IvyBridge, like Haswell, has a 56 uop queue / loopback buffer when not using Hyperthreading. Sandybridge statically partitions them into 2x28 uop queues even when HT is disabled.)
Image copied from David Kanter's also-excellent Haswell write-up, where he includes the decoders and uop-cache in one diagram.
Let's look at how the uop cache will probably cache this loop, once things settle down. (i.e. assuming that the loop entry with a jmp to the middle of the loop doesn't have any serious long-term effect on how the loop sits in the uop cache).
According to Intel's optimization manual (2.3.2.2 Decoded ICache):
All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have
their EIPs within the same aligned 32-byte region. (I think this means an instruction that extends past the boundary goes in the uop cache for the block containing its start, rather than end. Spanning instructions have to go somewhere, and the branch target address that would run the instruction is the start of the insn, so it's most useful to put it in a line for that block).
A multi micro-op instruction cannot be split across Ways.
An instruction which turns on the MSROM consumes an entire Way. (i.e. any instruction that takes more than 4 uops (for the reg,reg form) is microcoded. For example, DPPD is not micro-coded (4 uops), but DPPS is (6 uops). DPPD with a memory operand that can't micro-fuse would be 5 total uops, but still wouldn't need to turn on the microcode sequencer (not tested).
Up to two branches are allowed per Way.
A pair of macro-fused instructions is kept as one micro-op.
David Kanter's SnB writeup has some more great details about the uop cache.
Let's see how the actual code will go into the uop cache
# let's consider the case where this is 32B-aligned, so it runs in 0.41s
# i.e. this is at 0x402f60, instead of 0 like this objdump -Mintel -d output on a .o
# branch displacements are all 00, and I forgot to put in dummy labels, so they're using the rel32 encoding not rel8.
0000000000000000 <.text>:
0: 66 0f ef c0 pxor xmm0,xmm0 # 1 uop
4: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx # 2 uops
9: 66 0f 2e f0 ucomisd xmm6,xmm0 # 2 uops
d: 0f 82 00 00 00 00 jb 0x13 # 1 uop (end of one uop cache line of 6 uops)
13: 31 d2 xor edx,edx # 1 uop
15: 48 89 d8 mov rax,rbx # 1 uop (end of a uop cache line: next insn doesn't fit)
18: 48 f7 f1 div rcx # microcoded: fills a whole uop cache line. (And generates 35-57 uops)
1b: 48 85 d2 test rdx,rdx ### PROBLEM!! only 3 uop cache lines can map to the same 32-byte block of x86 instructions.
# So the whole block has to be re-decoded by the legacy decoders every time, because it doesn't fit in the uop-cache
1e: 0f 84 00 00 00 00 je 0x24 ## spans a 32B boundary, so I think it goes with TEST in the line that includes the first byte. Should actually macro-fuse.
24: 48 83 c1 01 add rcx,0x1 # 1 uop
28: 79 d6 jns 0x0 # 1 uop
So with 32B alignment for the start of the loop, it has to run from the legacy decoders, which is potentially slower than running from the uop cache. There could even be some overhead in switching from uop cache to legacy decoders.
#Iwill's testing (see comments on the question) reveals that any microcoded instruction prevents a loop from running from the loopback buffer. See comments on the question. (LSD = Loop Stream Detector = loop buffer; physically the same structure as the IDQ (instruction decode queue). DSB = Decode Stream Buffer = the uop cache. MITE = legacy decoders.)
Busting the uop cache will hurt performance even if the loop is small enough to run from the LSD (28 uops minimum, or 56 without hyperthreading on IvB and Haswell).
Intel's optimization manual (section 2.3.2.4) says the LSD requirements include
All micro-ops are also resident in the Decoded ICache.
So this explains why microcode doesn't qualify: in that case the uop-cache only holds a pointer into to microcode, not the uops themselves. Also note that this means that busting the uop cache for any other reason (e.g. lots of single-byte NOP instructions) means a loop can't run from the LSD.
With the minimum padding to go fast, according to the OP's testing.
# branch displacements are still 32-bit, except the loop branch.
# This may not be accurate, since the question didn't give raw instruction dumps.
# the version with short jumps looks even more unlikely
0000000000000000 <loop_start-0x64>:
...
5c: 00 00 add BYTE PTR [rax],al
5e: 90 nop
5f: 90 nop
60: 90 nop # 4NOPs of padding is just enough to bust the uop cache before (instead of after) div, if they have to go in the uop cache.
# But that makes little sense, because looking backward should be impossible (insn start ambiguity), and we jump into the loop so the NOPs don't even run once.
61: 90 nop
62: 90 nop
63: 90 nop
0000000000000064 <loop_start>: #uops #decode in cycle A..E
64: 66 0f ef c0 pxor xmm0,xmm0 #1 A
68: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx #2 B
6d: 66 0f 2e f0 ucomisd xmm6,xmm0 #2 C (crosses 16B boundary)
71: 0f 82 db 00 00 00 jb 152 #1 C
77: 31 d2 xor edx,edx #1 C
79: 48 89 d8 mov rax,rbx #1 C
7c: 48 f7 f1 div rcx #line D
# 64B boundary after the REX in next insn
7f: 48 85 d2 test rdx,rdx #1 E
82: 74 06 je 8a <loop_start+0x26>#1 E
84: 48 83 c1 01 add rcx,0x1 #1 E
88: 79 da jns 64 <loop_start>#1 E
The REX prefix of test rdx,rdx is in the same block as the DIV, so this should bust the uop cache. One more byte of padding would put it into the next 32B block, which would make perfect sense. Perhaps the OP's results are wrong, or perhaps prefixes don't count, and it's the position of the opcode byte that matters. Perhaps that matters, or perhaps a macro-fused test+branch is pulled to the next block?
Macro-fusion does happen across the 64B L1I-cache line boundary, since it doesn't fall on the boundary between instructions.
Macro fusion does not happen if the first instruction ends on byte 63 of a cache line, and the second instruction is a conditional branch that starts at byte 0 of the next cache line. -- Intel's optimization manual, 2.3.2.1
Or maybe with a short encoding for one jump or the other, things are different?
Or maybe busting the uop cache has nothing to do with it, and that's fine as long as it decodes fast, which this alignment makes happen. This amount of padding just barely puts the end of UCOMISD into a new 16B block, so maybe that actually improves efficiency by letting it decode with the other instructions in the next aligned 16B block. However, I'm not sure that a 16B pre-decode (instruction-length finding) or 32B decode block have to be aligned.
I also wondered if the CPU ends up switching from uop cache to legacy decode frequently. That can be worse than running from legacy decode all the time.
Switching from the decoders to the uop cache or vice versa takes a cycle, according to Agner Fog's microarch guide. Intel says:
When micro-ops cannot be stored in the Decoded ICache due to these restrictions, they are delivered from the legacy decode pipeline. Once micro-ops are delivered from the legacy pipeline, fetching micro-
ops from the Decoded ICache can resume only after the next branch micro-op. Frequent switches can incur a penalty.
The source that I assembled + disassembled:
.skip 0x5e
nop
# this is 0x5F
#nop # OP needed 1B of padding to reach a 32B boundary
.skip 5, 0x90
.globl loop_start
loop_start:
.L37:
pxor %xmm0, %xmm0
cvtsi2sdq %rcx, %xmm0
ucomisd %xmm0, %xmm6
jb .Loop_exit // Exit the loop
.L20:
xorl %edx, %edx
movq %rbx, %rax
divq %rcx
testq %rdx, %rdx
je .Lnot_prime // Failed divisibility test
addq $1, %rcx
jns .L37
.skip 200 # comment this to make the jumps rel8 instead of rel32
.Lnot_prime:
.Loop_exit:

From what I can see in your algorithm, there is certainly not much you can do to improve it.
The problem you are hitting is probably not so much the branch to an aligned position, although that can still help, you're current problem is much more likely the pipeline mechanism.
When you write two instructions one after another such as:
mov %eax, %ebx
add 1, %ebx
In order to execute the second instruction, the first one has to be complete. For that reason compilers tend to mix instructions. Say you need to set %ecx to zero, you could do this:
mov %eax, %ebx
xor %ecx, %ecx
add 1, %ebx
In this case, the mov and the xor can both be executed in parallel. This makes things go faster... The number of instructions that can be handled in parallel vary very much between processors (Xeons are generally better at that).
The branch adds another parameter where the best processors may start executing both sides of the branch (the true and the false...) simultaneously. But really most processors will make a guess and hope they are right.
Finally, it is obvious that converting the sqrt() result to an integer will make things a lot faster since you will avoid all that non-sense with SSE2 code which is definitively slower if used only for a conversion + compare when those two instructions could be done with integers.
Now... you are probably still wondering why the alignment does not matter with the integers. The fact is that if your code fits in the L1 instruction cache, then the alignment is not important. If you lose the L1 cache, then it has to reload the code and that's where the alignment becomes quite important since on each loop it could otherwise be loading useless code (possibly 15 bytes of useless code...) and memory access is still dead slow.

The performance difference can be explained by the different ways the instruction encoding mechanism "sees" the instructions. A CPU reads the instructions in chunks (was on core2 16 byte I believe) and it tries to give the different superscalar units microops. If the instructions are on boundaries or ordered unlikely the units in one core can starve quite easily.

Why Load/Store from buffer takes one whole clock cycle?

In this video: http://www.infoq.com/presentations/click-crash-course-modern-hardware
At 28:00, he starts to explain the following example:
ld rax <- [rbx+16] //Clock 0 Starts
add rbx,16
cmp rax,0
jeq null_chk
st [rbx-16] <- rcx //Clock 1 Starts
ld rcx <- [rdx+0] //Clock 2 Starts. Why clock 1 only does one op?
ld rax <- [rax+8] //Clock 3 Starts. Why?
And at 29:48, he says 2 cache misses happen in the example. Why 2?

No, he says that there are 2 cache misses running in parallel.
Out of the 4 memory accesses here (data memory, but let's ignore code/page walks, etc):
First ld of address [rbx+16] can be dispatched and sent to memory.
The store can not be performed until retirement when it's known to be non-bogus for sure (he mentioned the branch possibly being mis-predicted - that can be known at execution time, which is before retirement, but there could be other cases he didn't mention, like faults or exceptions). In addition, the store address needs to wait for the calculation of add rbx, 16, but that shouldn't take too long. However, the execution and eventually retirement of the branch that's holding the store depends on the result of the rax compare & jmp, which in turn depends on the first load - assumed here to go to memory (in other words - don't hold your breath).
The 2nd load (from [rdx+0]) is independent and can be dispatched - that's the 2nd load done in parallel.
The 3rd load can't be dispatched - its address depends on rax, which just like the cmp+jmp has to wait for the first load to finish. This means we can't do anything with this load, we don't even know the actual address. Think for e.g. what happens in HW if the load return, writes to rax, and it turns out that [rax+8] equals [rbx-16]? You'll have to forward the data from the store.
So, when executing the code, we can right away (withing 1-2 cycles, depends on the decode and address calculation time) send 2 loads - sections #1 and #3. The other memory operations will have to wait. The crucial element for performance here is to launch your loads as early as possible, and in parallel whenever possible.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio