Bring code into the L1 instruction cache without executing it

Bring code into the L1 instruction cache without executing it - performance

Let's say I have a function that I plan to execute as part of a benchmark. I want to bring this code into the L1 instruction cache prior to executing since I don't want to measure the cost of I$ misses as part of the benchmark.
The obvious way to do this is to simply execute the code at least once before the benchmark, hence "warming it up" and bringing it into the L1 instruction cache and possibly the uop cache, etc.
What are my alternatives in the case I don't want to execute the code (e.g., because I want the various predictors which key off of instruction addresses to be cold)?

In Granite Rapids and later, PREFETCHIT0 [rip+rel32] to prefetch code into "all levels" of cache, or prefetchit1 to prefetch into all levels except L1i. These instructions are a NOP with an addressing-mode other than RIP-relative, or on CPUs that don't support them. (Perhaps they also prime iTLB or even uop cache, or at least could on paper, in which case this isn't what you want.) The docs in Intel's "future extensions" manual as of 2022 Dec recommends that the target address be the start of some instruction.
Note that this Q&A is about priming things for a microbenchmark. Not things that would be worth doing to improve overall performance. For that, probably just best-effort prefetch into L2 cache (the inner-most unified cache) with prefetcht1 SW prefetch, also priming the dTLB in a way that possibly helps the iTLB (although it will evict a possibly-useful dTLB entry).
Or not, if the L2TLB is a victim cache. see X86 prefetching optimizations: "computed goto" threaded code for more discussion.
Caveat: some of this answer is Intel-centric. If I just say "the uop cache", I'm talking about Sandybridge-family. I know Ryzen has a uop-cache too, but I haven't read much of anything about its internals, and only know some basics about Ryzen from reading Agner Fog's microarch guide.
You can at least prefetch into L2 with software prefetch, but that doesn't even necessarily help with iTLB hits. (The 2nd-level TLB is a victim cache, I think, so a dTLB miss might not populate anything that the iTLB checks.)
But this doesn't help with L1I$ misses, or getting the target code decoded into the uop cache.
If there is a way to do what you want, it's going to be with some kind of trick. x86 has no "standard" way to do this; no code-prefetch instruction. Darek Mihoka wrote about code-prefetch as part of a CPU-emulator interpreter loop: The Common CPU Interpreter Loop Revisited: the Nostradamus Distributor back in 2008 when P4 and Core2 were the hot CPU microarchitectures to tune for.
But of course that's a different case: the goal is sustained performance of indirect branches, not priming things for a benchmark. It doesn't matter if you spend a lot of time achieving the microarchitectural state you want outside the timed portion of a micro-benchmark.
Speaking of which, modern branch predictors aren't just "cold", they always contain some prediction based on aliasing1. This may or may not be important.
Prefetch the first / last lines (and maybe others) with call to a ret
I think instruction fetch / prefetch normally continues past an ordinary ret or jmp, because it can't be detected until decode. So you could just call a function that ends at the end of the previous cache line. (Make sure they're in the same page, so an iTLB miss doesn't block prefetch.)
ret after a call will predict reliably if no other call did anything to the return-address predictor stack, except in rare cases if an interrupt happened between the call and ret, and the kernel code had a deep enough call tree to push the prediction for this ret out of the RSB (return-stack-buffer). Or if Spectre mitigation flushed it intentionally on a context switch.
; make sure this is in the same *page* as function_under_test, to prime the iTLB
ALIGN 64
; could put the label here, but probably better not
60 bytes of (long) NOPs or whatever padding
prime_Icache_first_line:
ret
; jmp back_to_benchmark_startup ; alternative if JMP is handled differently than RET.
lfence ; prevent any speculative execution after RET, in case it or JMP aren't detected as soon as they decode
;;; cache-line boundary here
function_under_test:
...
prime_Icache_last_line: ; label the last RET in function_under_test
ret ; this will prime the "there's a ret here" predictor (correctly)
benchmark_startup:
call prime_Icache_first_line
call prime_Icache_first_line ; IDK if calling twice could possibly help in case prefetch didn't get far the first time? But now the CPU has "seen" the RET and may not fetch past it.
call prime_Icache_last_line ; definitely only need to call once; it's in the right line
lfence
rdtsc
.timed_loop:
call function_under_test
...
jnz .time_loop
We can even extend this technique to more than 2 cache lines by calling to any 0xC3 (ret) byte inside the body of function_under_test. But as #BeeOnRope points out, that's dangerous because it may prime branch prediction with "there's a ret here" causing a mispredict you otherwise wouldn't have had when calling function_under_test for real.
Early in the front-end, branch prediction is needed based on fetch-block address (which block to fetch after this one), not on individual branches inside each block, so this could be a problem even if the ret byte was part of another instruction.
But if this idea is viable, then you can look for a 0xc3 byte as part of an instruction in the cache line, or at worst add a 3-byte NOP r/m32 (0f 1f c3 nop ebx,eax). c3 as a ModR/M encodes a reg,reg instruction (with ebx and eax as operands), so it doesn't have to be hidden in a disp8 to avoid making the NOP even longer, and it's easy to find in short instructions: e.g. 89 c3 mov ebx,eax, or use the other opcode so the same modrm byte gives you mov eax,ebx. Or 83 c3 01 add ebx,0x1, or many other instructions with e/rbx, bl (and r/eax or al).
With a REX prefix, those can be you have a choice of rbx / r11 (and rax/r8 for the /r field if applicable). It's likely you can choose (or modify for this microbenchmark) your register allocation to produce an instruction using the relevant registers to produce a c3 byte without any overhead at all, especially if you can use a custom calling convention (at least for testing purposes) so you can clobber rbx if you weren't already saving/restoring it.
I found these by searching for (space)c3(space) in the output of objdump -d /bin/bash, just to pick a random not-small executable full of compiler-generated code.
Evil hack: end the cache line before with the start of a multi-byte instruction.
; at the end of a cache line
prefetch_Icache_first_line:
db 0xe9 ; the opcode for 5-byte jmp rel32
function_under_test:
... normal code ; first 4 bytes will be treated as a rel32 when decoding starts at prefetch_I...
ret
; then at function_under_test+4 + rel32:
;org whatever (that's not how ORG works in NASM, so actually you'll need a linker script or something to put code here)
prefetch_Icache_branch_target:
jmp back_to_test_startup
So it jumps to a virtual address which depends on the instruction bytes of function_under_test. Map that page and put code in it that jumps back to your benchmark-prep code. The destination has to be within 2GiB, so (in 64-bit code) it's always possible to choose a virtual address for function_under_test that makes the destination a valid user-space virtual address. Actually, for many rel32 values, it's possible to choose the address of function_under_test to keep both it and the target within the low 2GiB of virtual address space, (and definitely 3GiB) and thus valid 32-bit user-space addresses even under a 32-bit kernel.
Or less crazy, using the end of a ret imm16 to consume a byte or two, just requiring a fixup of RSP after return (and treating whatever is temporarily below RSP as a "red zone" if you don't reserve extra space):
; at the end of a cache line
prefetch_Icache_first_line:
db 0xc2 ; the opcode for 3-byte ret imm16
; db 0x00 ; optional: one byte of the immediate at least keeps RSP aligned
; But x86 is little-endian, so the last byte is most significant
;; Cache-line boundary here
function_under_test:
... normal code ; first 2 bytes will be added to RSP when decoding starts at prefetch_Icache_first_line
ret
prefetch_caller:
push rbp
mov rbp, rsp ; save old RSP
;sub rsp, 65536 ; reserve space in case of the max RSP+imm16.
call prefetch_Icache_first_line
;;; UNSAFE HERE in case of signal handler if you didn't SUB.
mov rsp, rbp ; restore RSP; if no signal handlers installed, probably nothing could step on stack memory
...
pop rbp
ret
Using sub rsp, 65536 before calling to the ret imm16 makes it safe even if there might be a signal handler (or interrupt handler in kernel code, if your kernel stack is big enough, otherwise look at the actual byte and see how much will really be added to RSP). It means that call's push/store will probably miss in data cache, and maybe even cause a pagefault to grow the stack. That does happen before fetching the ret imm16, so that won't evict the L1I$ line we wanted to prime.
This whole idea is probably unnecessary; I think the above method can reliably prefetch the first line of a function anyway, and this only works for the first line. (Unless you put a 0xe9 or 0xc2 in the last byte of each relevant cache line, e.g. as part of a NOP if necessary.)
But this does give you a way to non-speculatively do code-fetch from from the cache line you want without architecturally executing any instructions in it. Hopefully a direct jmp is detected before any later instructions execute, and probably without any others even decoding, except ones that decoded in the same block. (And an unconditional jmp always ends a uop-cache line on Intel). i.e. the mispredict penalty is all in the front-end from re-steering the fetch pipeline as soon as decode detects the jmp. I hope ret is like this too, in cases where the return-predictor stack is not empty.
A jmp r/m64 would let you control the destination just by putting the address in the right register or memory. (Figure out what register or memory addressing mode the first byte(s) of function_under_test encode, and put an address there). The opcode is FF /4, so you can only use a register addressing mode if the first byte works as a ModRM that has /r = 4 and mode=11b. But you could put the first 2 bytes of the jmp r/m64 in the previous line, so the extra bytes form the SIB (and disp8 or disp32). Whatever those are, you can set up register contents such that the jump-target address will be loaded from somewhere convenient.
But the key problem with a jmp r/m64 is that default-prediction for an indirect branch can fall through and speculatively execute function_under_test, affecting the branch-prediction entries for those branches. You could have bogus values in registers so you prime branch prediction incorrectly, but that's still different from not touching them at all.
How does this overlapping-instructions hack to consume bytes from the target cache line affect the uop cache?
I think (based on previous experimental evidence) Intel's uop cache puts instructions in the uop-cache line that corresponds to their start address, in cases where they cross a 32 or 64-byte boundary. So when the real execution of function_under_test begins, it will simply miss in the uop-cache because no uop-cache line is caching the instruction-start-address range that includes the first byte of function_under_test. i.e. the overlapping decode is probably not even noticed when it's split across an L1I$ boundary this way.
It is normally a problem for the uop cache to have the same bytes decode as parts of different instructions, but I'm optimistic that we wouldn't have a penalty in this case. (I haven't double-checked that for this case. I'm mostly assuming that lines record which range of start-addresses they cache, and not the whole range of x86 instruction bytes they're caching.)
Create mis-speculation to fetch arbitrary lines, but block exec with lfence
Spectre / Meltdown exploits and mitigation strategies provide some interesting ideas: you could maybe trigger a mispredict that fetches at least the start of the code you want, but maybe doesn't speculate into it.
lfence blocks speculative execution, but (AFAIK) not instruction prefetch / fetch / decode.
I think (and hope) the front-end will follow direct relative jumps on its own, even after lfence, so we can use jmp target_cache_line in the shadow of a mispredict + lfence to fetch and decode but not execute the target function.
If lfence works by blocking the issue stage until the reservation station (OoO scheduler) is empty, then an Intel CPU should probably decode past lfence until the IDQ is full (64 uops on Skylake). There are further buffers before other stages (fetch -> instruction-length-decode, and between that and decode), so fetch can run ahead of that. Presumably there's a HW prefetcher that runs ahead of where actual fetch is reading from, so it's pretty plausible to get several cache lines into the target function in the shadow of a single mispredict, especially if you introduce delays before the mispredict can be detected.
We can use the same return-address frobbing as a retpoline to reliably trigger a mispredict to jmp rel32 which sends fetch into the target function. (I'm pretty sure a re-steer of the front-end can happen in the shadow of speculative execution without waiting to confirm correct speculation, because that would make every re-steer serializing.)
function_under_test:
...
some_line: ; not necessarily the first cache line
...
ret
;;; This goes in the same page as the test function,
;;; so we don't iTLB-miss when trying to send the front-end there
ret_frob:
xorps xmm0,xmm0
movq xmm1, rax
;; The point of this LFENCE is to make sure the RS / ROB are empty so the front-end can run ahead in a burst.
;; so the sqrtpd delay chain isn't gradually issued.
lfence
;; alternatively, load the return address from the stack and create a data dependency on it, e.g. and eax,0
;; create a many-cycle dependency-chain before the RET misprediction can be detected
times 10 sqrtpd xmm0,xmm0 ; high latency, single uop
orps xmm0, xmm1 ; target address with data-dep on the sqrtpd chain
movq [rsp], xmm0 ; overwrite return address
; movd [rsp], xmm0 ; store-forwarding stall: do this *as well* as the movq
ret ; mis-speculate to the lfence/jmp some_line
; but architecturally jump back to the address we got in RAX
prefetch_some_line:
lea rax, [rel back_to_bench_startup]
; or pop rax or load it into xmm1 directly,
; so this block can be CALLed as a function instead of jumped to
call ret_frob
; speculative execution goes here, but architecturally never reached
lfence ; speculative *execution* stops here, fetch continues
jmp some_line
I'm not sure the lfence in ret_frob is needed. But it does make it easier to reason about what the front-end is doing relative to the back-end. After the lfence, the return address has a data dependency on the chain of 10x sqrtpd. (10x 15 to 16 cycle latency on Skylake, for example). But the 10x sqrtpd + orps + movq only take 3 cycles to issue (on 4-wide CPUs), leaving at least 148 cycles + store-forwarding latency before ret can read the return address back from the stack and discover that the return-stack prediction was wrong.
This should be plenty of time for the front-end to follow the jmp some_line and load that line into L1I$, and probably load several lines after that. It should also get some of them decoded into the uop cache.
You need a separate call / lfence / jmp block for each target line (because the target address has to be hard-coded into a direct jump for the front-end to follow it without the back-end executing anything), but they can all share the same ret_frob block.
If you left out the lfence, you could use the above retpoline-like technique to trigger speculative execution into the function. This would let you jump to any target branch in the target function with whatever args you like in registers, so you can mis-prime branch prediction however you like.
Footnote 1:
Modern branch predictors aren't just "cold", they contain predictions
from whatever aliased the target virtual addresses in the various branch-prediction data structures. (At least on Intel where SnB-family pretty definitely uses TAGE prediction.)
So you should decide whether you want to specifically anti-prime the branch predictors by (speculatively) executing the branches in your function with bogus data in registers / flags, or whether your micro-benchmarking environment resembles the surrounding conditions of the real program closely enough.
If your target function has enough branching in a very specific complex pattern (like a branchy sort function over 10 integers), then presumably only that exact input can train the branch predictor well, so any initial state other than a specially-warmed-up state is probably fine.
You may not want the uop-cache primed at all, to look for cold-execution effects in general (like decode), so that might rule out any speculative fetch / decode, not just speculative execution. Or maybe speculative decode is ok if you then run some uop-cache-polluting long-NOPs or times 800 xor eax,eax (2-byte instructions -> 16 per 32-byte block uses up all 3 entries that SnB-family uop caches allow without running out of room and not being able to fit in the uop cache at all). But not so many that you evict L1I$ as well.
Even speculative decode without execute will prime the front-end branch prediction that knows where branches are ahead of decode, though. I think that a ret (or jmp rel32) at the end of the previous cache line

Map the same physical page to two different virtual addresses.
L1I$ is physically addressed. (VIPT but with all the index bits from below the page offset, so effectively PIPT).
Branch-prediction and uop caches are virtually addressed, so with the right choice of virtual addresses, a warm-up run of the function at the alternate virtual address will prime L1I, but not branch prediction or uop caches. (This only works if branch aliasing happens modulo something larger than 4096 bytes, because the position within the page is the same for both mappings.)
Prime the iTLB by calling to a ret in the same page as the test function, but outside it.
After setting this up, no modification of the page tables are required between the warm-up run and the timing run. This is why you use two mappings of the same page instead of remapping a single mapping.
Margaret Bloom suggests that CPUs vulnerable to Meltdown might speculatively fetch instructions from a no-exec page if you jump there (in the shadow of a mispredict so it doesn't actually fault), but that would then require changing the page table, and thus a system call which is expensive and might evict that line of L1I. But if it doesn't pollute the iTLB, you could then re-populate the iTLB entry with a mispredicted branch anywhere into the same page as the function. Or just a call to a dummy ret outside the function in the same page.
None of this will let you get the uop cache warmed up, though, because it's virtually addressed. OTOH, in real life, if branch predictors are cold then probably the uop cache will also be cold.

One approach that could work for small functions would be to execute some code which appears on the same cache line(s) as your target function, which will bring in the entire cache line.
For example, you could organize your code as follows:
ALIGN 64
function_under_test:
; some code, less than 64 bytes
dummy:
ret
and then call the dummy function prior to calling function_under_test - if dummy starts on the same cache line as the target function, it would bring the entire cache line into L1I. This works for functions of 63 bytes or less1.
This can probably be extended to functions up to ~126 bytes or so by using this trick both at before2 and after the target function. You could extend it to arbitrarily sized functions by inserting dummy functions on every cache line and having the target code jump over them, but this comes at a cost of inserting the otherwise-unnecessary jumps your code under test, and requires careful control over the code size so that the dummy functions are placed correctly.
You need fine control over function alignment and placement to achieve this: assembler is probably the easiest, but you can also probably do it with C or C++ in combination with compiler-specific attributes.
1 You could even reuse the ret in the function_under_test itself to support slightly longer functions (e.g., those whose ret starts within 64 bytes of the start).
2 You'd have to be more careful about the dummy function appearing before the code under test: the processor might fetch instructions past the ret and it might (?) even execute them. A ud2 after the dummy ret is likely to block further fetch (but you might want fetch if populating the uop cache is important).

Related

Cost of instruction jumps in assembly

I've had always been curious about the cost of jumps in assembly.
cmp ecx, edx
je SOME_LOCATION # What's the cost of this jump?
Does it need to do a search in a lookup table for each jumps or how does it work?

No, a jump doesn’t do a search. The assembler resolves the label to an address, which in most cases is then converted to an offset from the current instruction. The address or offset is encoded in the instruction. At run time, the processor loads the address into the IP register or adds the offset to the current value of the IP register (along with all the other effects discussed by #Brendan).
There is a type of jump instruction that can be used to get the destination from a table. The jump instruction reads the address from a memory location. (The instruction specifies a single location, so there still is no “search”.) This instruction could look something like this:
jmp table[eax*4]
where eax is the index of the entry in the table containing the address to jump to.

Originally (e.g. 8086) the cost of a jump wasn't much different to the cost of a mov.
Later CPUs added caches, which meant some jumps were faster (because the code they jump to is in the cache) and some jumps were slower (because the code they jump to isn't in the cache).
Even later CPUs added "out of order" execution, where conditional branches (e.g. je SOME_LOCATION) would have to wait until the flags from "previous instructions that happen to be executed in parallel" became known.
This means that a sequence like
mov esi, edi
cmp ecx, edx
je SOME_LOCATION
can be slower than rearranging it to
cmp ecx, edx
mov esi, edi
je SOME_LOCATION
to increase the chance that the flags would be known.
Even later CPUs added speculative execution. In this case, for conditional branches the CPU just takes a guess at where it will branch to before it actually knows (e.g. before the flags are known), and if it guesses wrong it'll just pretend that it didn't execute the wrong instructions. More specifically, the speculatively executed instructions are tagged at the start of the pipeline and held at the end of the pipeline (at retirement) until the CPU knows if they can be committed to visible state or if they have to be discarded.
After that things just got more complicated, with fancier methods of doing branch prediction, additional "branch target" buffers, etc.
Far jumps that change the code segment are more expensive. In real mode it's not so bad because the CPU mostly only does "CS.base = value * 16" when CS is changed. For protected mode it's a table lookup (to find GDT or LDT entry), decoding the entry, deciding what to do based on what kind of entry it is, then a pile of protection checks. For long mode it's vaguely similar. All of this adds more uncertainty (e.g. with the table entry be in cache?).
On top of all of this there's things like TLB misses. For example, jmp [indirectAddress] can cause a TLB miss at indirectAddress then a TLB miss at the stack top then a TLB miss at the new instruction pointer; where each TLB miss can cost a few hundred cycles.
Mostly; the cost of a jump can be anything from 0 cycles (for a correctly predicted jump) to maybe 1000 cycles; depending on which CPU it is, what kind of jump, what is in caches, what branch prediction predicts, cache/TLB misses, how fast/slow RAM is, and anything I may have forgotten.

Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?

I'm doing micro-optimization on a performance critical part of my code and came across the sequence of instructions (in AT&T syntax):
add %rax, %rbx
mov %rdx, %rax
mov %rbx, %rdx
I thought I finally had a use case for xchg which would allow me to shave an instruction and write:
add %rbx, %rax
xchg %rax, %rdx
However, to my dimay I found from Agner Fog's instruction tables, that xchg is a 3 micro-op instruction with a 2 cycle latency on Sandy Bridge, Ivy Bridge, Broadwell, Haswell and even Skylake. 3 whole micro-ops and 2 cycles of latency! The 3 micro-ops throws off my 4-1-1-1 cadence and the 2 cycle latency makes it worse than the original in the best case since the last 2 instructions in the original might execute in parallel.
Now... I get that the CPU might be breaking the instruction into micro-ops that are equivalent to:
mov %rax, %tmp
mov %rdx, %rax
mov %tmp, %rdx
where tmp is an anonymous internal register and I suppose the last two micro-ops could be run in parallel so the latency is 2 cycles.
Given that register renaming occurs on these micro-architectures, though, it doesn't make sense to me that this is done this way. Why wouldn't the register renamer just swap the labels? In theory, this would have a latency of only 1 cycle (possibly 0?) and could be represented as a single micro-op so it would be much cheaper.

Supporting efficient xchg is non-trivial, and presumably not worth the extra complexity it would require in various parts of the CPU. A real CPU's microarchitecture is much more complicated than the mental model that you can use while optimizing software for it. For example, speculative execution makes everything more complicated, because it has to be able to roll back to the point where an exception occurred.
Making fxch efficient was important for x87 performance because the stack nature of x87 makes it (or alternatives like fld st(2)) hard to avoid. Compiler-generated FP code (for targets without SSE support) really does use fxch a significant amount. It seems that fast fxch was done because it was important, not because it's easy. Intel Haswell even dropped support for single-uop fxch. It's still zero-latency, but decodes to 2 uops on HSW and later (up from 1 in P5, and PPro through IvyBridge).
xchg is usually easy to avoid. In most cases, you can just unroll a loop so it's ok that the same value is now in a different register. e.g. Fibonacci with add rax, rdx / add rdx, rax instead of add rax, rdx / xchg rax, rdx. Compilers generally don't use xchg reg,reg, and usually hand-written asm doesn't either. (This chicken/egg problem is pretty similar to loop being slow (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?). loop would have been very useful for for adc loops on Core2/Nehalem where an adc + dec/jnz loop causes partial-flag stalls.)
Since xchg is still slow-ish on previous CPUs, compilers wouldn't start using it with -mtune=generic for several years. Unlike fxch or mov-elimination, a design-change to support fast xchg wouldn't help the CPU run most existing code faster, and would only enable performance gains over the current design in rare cases where it's actually a useful peephole optimization.
Integer registers are complicated by partial-register stuff, unlike x87
There are 4 operand sizes of xchg, 3 of which use the same opcode with REX or operand-size prefixes. (xchg r8,r8 is a separate opcode, so it's probably easier to make the decoders decode it differently from the others). The decoders already have to recognize xchg with a memory operand as special, because of the implicit lock prefix, but it's probably less decoder complexity (transistor-count + power) if the reg-reg forms all decode to the same number of uops for different operand sizes.
Making some r,r forms decode to a single uop would be even more complexity, because single-uop instructions have to be handled by the "simple" decoders as well as the complex decoder. So they would all need to be able to parse xchg and decide whether it was a single uop or multi-uop form.
AMD and Intel CPUs behave somewhat similarly from a programmer's perspective, but there are many signs that the internal implementation is vastly different. For example, Intel mov-elimination only works some of the time, limited by some kind of microarchitectural resources, but AMD CPUs that do mov-elimination do it 100% of the time (e.g. Bulldozer for the low lane of vector regs).
See Intel's optimization manual, Example 3-23. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions, where they discuss overwriting the zero-latency-movzx result right away to free up the internal resource sooner. (I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work significantly more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx instructions.)
I don't know exactly what needs tracking in a limited-size table(?) for mov-elimination. Probably it's related to needing to free register-file entries as soon as possible when they're no longer needed, because Physical Register File size limits rather than ROB size can be the bottleneck for the out-of-order window size. Swapping around indices might make this harder.
xor-zeroing is eliminated 100% of the time on Intel Sandybridge-family; it's assumed that this works by renaming to a physical zero register, and this register never needs to be freed.
If xchg used the same mechanism that mov-elimination does, it also could probably only work some of the time. It would need to decode to enough uops to work in cases where it isn't handled at rename. (Or else the issue/rename stage would have to insert extra uops when an xchg will take more than 1 uop, like it does when un-laminating micro-fused uops with indexed addressing modes that can't stay micro-fused in the ROB, or when inserting merging uops for flags or high-8 partial registers. But that's a significant complication that would only be worth doing if xchg was a common and important instruction.)
Note that xchg r32,r32 has to zero-extend both results to 64 bits, so it can't be a simple swap of RAT (Register Alias Table) entries. It would be more like truncating both registers in-place. And note that Intel CPUs never eliminate mov same,same. It does already need to support mov r32,r32 and movzx r32, r8 with no execution port, so presumably it has some bits that indicate that rax = al or something. (And yes, Intel HSW/SKL do that, not just Ivybridge, despite what Agner's microarch guide says.)
We know P6 and SnB had upper-zeroed bits like this, because xor eax,eax before setz al avoids a partial-register stall when reading eax. HSW/SKL never rename al separately in the first place, only ah. It may not be a coincidence that partial-register renaming (other than AH) seems to have been dropped in the same uarch that introduced mov-elimination (Ivybridge). Still, setting that bit for 2 registers at once would be a special case that required special support.
xchg r64,r64 could maybe just swap the RAT entries, but decoding that differently from the r32 case is yet another complication. It might still need to trigger partial-register merging for both inputs, but add r64,r64 needs to do that, too.
Also note that an Intel uop (other than fxch) only ever produces one register result (plus flags). Not touching flags doesn't "free up" an output slot; For example mulx r64,r64,r64 still takes 2 uops to produce 2 integer outputs on HSW/SKL, even though all the "work" is done in the multiply unit on port 1, same as with mul r64 which does produce a flag result.)
Even if it is as simple as "swap the RAT entries", building a RAT that supports writing more than one entry per uop is a complication. What to do when renaming 4 xchg uops in a single issue group? It seems to me like it would make the logic significantly more complicated. Remember that this has to be built out of logic gates / transistors. Even if you say "handle that special case with a trap to microcode", you have to build the whole pipeline to support the possibility that that pipeline stage could take that kind of exception.
Single-uop fxch requires support for swapping RAT entries (or some other mechanism) in the FP RAT (fRAT), but it's a separate block of hardware from the integer RAT (iRAT). Leaving out that complication in the iRAT seems reasonable even if you have it in the fRAT (pre-Haswell).
Issue/rename complexity is definitely an issue for power consumption, though. Note that Skylake widened a lot of the front-end (legacy decode and uop cache fetch), and retirement, but kept the 4-wide issue/rename limit. SKL also added replicated execution units on more port in the back-end, so issue bandwidth is a bottleneck even more of the time, especially in code with a mix of loads, stores, and ALU.
The RAT (or the integer register file, IDK) may even have limited read ports, since there seem to be some front-end bottlenecks in issuing/renaming many 3-input uops like add rax, [rcx+rdx]. I posted some microbenchmarks (this and the follow-up post) showing Skylake being faster than Haswell when reading lots of registers, e.g. with micro-fusion of indexed addressing modes. Or maybe the bottleneck there was really some other microarchitectural limit.
But how does 1-uop fxch work? IDK how it's done in Sandybridge / Ivybridge. In P6-family CPUs, an extra remapping table exists basically to support FXCH. That might only be needed because P6 uses a Retirement Register File with 1 entry per "logical" register, instead of a physical register file (PRF). As you say, you'd expect it to be simpler when even "cold" register values are just a pointer to a PRF entry. (Source: US patent 5,499,352: Floating point register alias table FXCH and retirement floating point register array (describes Intel's P6 uarch).
One main reason the rfRAT array 802 is included within the present invention fRAT logic is a direct result of the manner in which the present invention implements the FXCH instruction.
(Thanks Andy Glew (#krazyglew), I hadn't thought of looking up patents to find out about CPU internals.) It's pretty heavy going, but may provide some insight into the bookkeeping needed for speculative execution.
Interesting tidbit: the patent describes integer as well, and mentions that there are some "hidden" logical registers which are reserved for use by microcode. (Intel's 3-uop xchg almost certain uses one of these as a temporary.)
We might be able to get some insight from looking at what AMD does.
Interestingly, AMD has 2-uop xchg r,r in K10, Bulldozer-family, Bobcat/Jaguar, and Ryzen. (But Jaguar xchg r8,r8 is 3 uops. Maybe to support the xchg ah,al corner case without a special uop for swapping the low 16 of a single reg).
Presumably both uops read the old values of the input architectural registers before the first one updates the RAT. IDK exactly how this works, since they aren't necessarily issued/renamed in the same cycle (but they are at least contiguous in the uop flow, so at worst the 2nd uop is the first uop in the next cycle). I have no idea if Haswell's 2-uop fxch works similarly, or if they're doing something else.
Ryzen is a new architecture designed after mov-elimination was "invented", so presumably they take advantage of it wherever possible. (Bulldozer-family renames vector moves (but only for the low 128b lane of YMM vectors); Ryzen is the first AMD architecture to do it for GP regs too.) xchg r32,r32 and r64,r64 are zero-latency (renamed), but still 2 uops each. (r8 and r16 need an execution unit, because they merge with the old value instead of zero-extending or copying the entire reg, but are still only 2 uops).
Ryzen's fxch is 1 uop. AMD (like Intel) probably isn't spending a lot of transistors on making x87 fast (e.g. fmul is only 1 per clock and on the same port as fadd), so presumably they were able to do this without a lot of extra support. Their micro-coded x87 instructions (like fyl2x) are faster than on recent Intel CPUs, so maybe Intel cares even less (at least about the microcoded x87 instruction).
Maybe AMD could have made xchg r64,r64 a single uop too, more easily than Intel. Maybe even xchg r32,r32 could be single uop, since like Intel it needs to support mov r32,r32 zero-extension with no execution port, so maybe it could just set whatever "upper 32 zeroed" bit exists to support that. Ryzen doesn't eliminate movzx r32, r8 at rename, so presumably there's only an upper32-zero bit, not bits for other widths.
What Intel might be able to do cheaply if they wanted to:
It's possible that Intel could support 2-uop xchg r,r the way Ryzen does (zero latency for the r32,r32 and r64,r64 forms, or 1c for the r8,r8 and r16,r16 forms) without too much extra complexity in critical parts of the core, like the issue/rename and retirement stages that manage the Register Alias Table (RAT). But maybe not, if they can't have 2 uops read the "old" value of a register when the first uop writes it.
Stuff like xchg ah,al is definitely a extra complication, since Intel CPUs don't rename partial registers separately anymore, except AH/BH/CH/DH.
xchg latency in practice on current hardware
Your guess about how it might work internally is good. It almost certainly uses one of the internal temporary registers (accessible only to microcode). Your guess about how they can reorder is too limited, though.
In fact, one direction has 2c latency and the other direction has ~1c latency.
00000000004000e0 <_start.loop>:
4000e0: 48 87 d1 xchg rcx,rdx # slow version
4000e3: 48 83 c1 01 add rcx,0x1
4000e7: 48 83 c1 01 add rcx,0x1
4000eb: 48 87 ca xchg rdx,rcx
4000ee: 48 83 c2 01 add rdx,0x1
4000f2: 48 83 c2 01 add rdx,0x1
4000f6: ff cd dec ebp
4000f8: 7f e6 jg 4000e0 <_start.loop>
This loop runs in ~8.06 cycles per iteration on Skylake. Reversing the xchg operands makes it run in ~6.23c cycles per iteration (measured with perf stat on Linux). uops issued/executed counters are equal, so no elimination happened. It looks like the dst <- src direction is the slow one, since putting the add uops on that dependency chain makes things slower than when they're on the dst -> src dependency chain.
If you ever want to use xchg reg,reg on the critical path (code-size reasons?), do it with the dst -> src direction on the critical path, because that's only about 1c latency.
Other side-topics from comments and the question
The 3 micro-ops throws off my 4-1-1-1 cadence
Sandybridge-family decoders are different from Core2/Nehalem. They can produce up to 4 uops total, not 7, so the patterns are 1-1-1-1, 2-1-1, 3-1, or 4.
Also beware that if the last uop is one that can macro-fuse, they will hang onto it until the next decode cycle in case the first instruction in the next block is a jcc. (This is a win when code runs multiple times from the uop cache for each time it's decoded. And that's still usually 3 uops per clock decode throughput.)
Skylake has an extra "simple" decoder so it can do 1-1-1-1-1 up to 4-1 I guess, but > 4 uops for one instruction still requires the microcode ROM. Skylake beefed up the uop cache, too, and can often bottleneck on the 4 fused-domain uops per clock issue/rename throughput limit if the back-end (or branch misses) aren't a bottleneck first.
I'm literally searching for ~1% speed bumps so hand optimization has been working out on the main loop code. Unfortunately that's ~18kB of code so I'm not even trying to consider the uop cache anymore.
That seems kinda crazy, unless you're mostly limiting yourself to asm-level optimization in shorter loops inside your main loop. Any inner loops within the main loop will still run from the uop cache, and that should probably be where you're spending most of your time optimizing. Compilers usually do a good-enough job that it's not practical for a human to do much over a large scale. Try to write your C or C++ in such a way that the compiler can do a good job with it, of course, but looking for tiny peephole optimizations like this over 18kB of code seems like going down the rabbit hole.
Use perf counters like idq.dsb_uops vs. uops_issued.any to see how many of your total uops came from the uop cache (DSB = Decoded Stream Buffer or something). Intel's optimization manual has some suggestions for other perf counters to look at for code that doesn't fit in the uop cache, such as DSB2MITE_SWITCHES.PENALTY_CYCLES. (MITE is the legacy-decode path). Search the pdf for DSB to find a few places it's mentioned.
Perf counters will help you find spots with potential problems, e.g. regions with higher than average uops_issued.stall_cycles could benefit from finding ways to expose more ILP if there are any, or from solving a front-end problem, or from reducing branch-mispredicts.
As discussed in comments, a single uop produces at most 1 register result
As an aside, with a mul %rbx, do you really get %rdx and %rax all at once or does the ROB technically have access to the lower part of the result one cycle earlier than the higher part? Or is it like the "mul" uop goes into the multiplication unit and then the multiplication unit issues two uops straight into the ROB to write the result at the end?
Terminology: the multiply result doesn't go into the ROB. It goes over the forwarding network to whatever other uops read it, and goes into the PRF.
The mul %rbx instruction decodes to 2 uops in the decoders. They don't even have to issue in the same cycle, let alone execute in the same cycle.
However, Agner Fog's instruction tables only list a single latency number. It turns out that 3 cycles is the latency from both inputs to RAX. The minimum latency for RDX is 4c, according to InstlatX64 testing on both Haswell and Skylake-X.
From this, I conclude that the 2nd uop is dependent on the first, and exists to write the high half of the result to an architectural register. The port1 uop produces a full 128b multiply result.
I don't know where the high-half result lives until the p6 uop reads it. Perhaps there's some sort of internal queue between the multiply execution unit and hardware connected to port 6. By scheduling the p6 uop with a dependency on the low-half result, that might arrange for the p6 uops from multiple in-flight mul instructions to run in the correct order. But then instead of actually using that dummy low-half input, the uop would take the high half result from the queue output in an execution unit that's connected to port 6 and return that as the result. (This is pure guess work, but I think it's plausible as one possible internal implementation. See comments for some earlier ideas).
Interestingly, according to Agner Fog's instruction tables, on Haswell the two uops for mul r64 go to ports 1 and 6. mul r32 is 3 uops, and runs on p1 + p0156. Agner doesn't say whether that's really 2p1 + p0156 or p1 + 2p0156 like he does for some other insns. (However, he says that mulx r32,r32,r32 runs on p1 + 2p056 (note that p056 doesn't include p1).)
Even more strangely, he says that Skylake runs mulx r64,r64,r64 on p1 p5 but mul r64 on p1 p6. If that's accurate and not a typo (which is a possibility), it pretty much rules out the possibility that the extra uop is an upper-half multiplier.

Why would introducing useless MOV instructions speed up a tight loop in x86_64 assembly?

Background:
While optimizing some Pascal code with embedded assembly language, I noticed an unnecessary MOV instruction, and removed it.
To my surprise, removing the un-necessary instruction caused my program to slow down.
I found that adding arbitrary, useless MOV instructions increased performance even further.
The effect is erratic, and changes based on execution order: the same junk instructions transposed up or down by a single line produce a slowdown.
I understand that the CPU does all kinds of optimizations and streamlining, but, this seems more like black magic.
The data:
A version of my code conditionally compiles three junk operations in the middle of a loop that runs 2**20==1048576 times. (The surrounding program just calculates SHA-256 hashes).
The results on my rather old machine (Intel(R) Core(TM)2 CPU 6400 # 2.13 GHz):
avg time (ms) with -dJUNKOPS: 1822.84 ms
avg time (ms) without: 1836.44 ms
The programs were run 25 times in a loop, with the run order changing randomly each time.
Excerpt:
{$asmmode intel}
procedure example_junkop_in_sha256;
var s1, t2 : uint32;
begin
// Here are parts of the SHA-256 algorithm, in Pascal:
// s0 {r10d} := ror(a, 2) xor ror(a, 13) xor ror(a, 22)
// s1 {r11d} := ror(e, 6) xor ror(e, 11) xor ror(e, 25)
// Here is how I translated them (side by side to show symmetry):
asm
MOV r8d, a ; MOV r9d, e
ROR r8d, 2 ; ROR r9d, 6
MOV r10d, r8d ; MOV r11d, r9d
ROR r8d, 11 {13 total} ; ROR r9d, 5 {11 total}
XOR r10d, r8d ; XOR r11d, r9d
ROR r8d, 9 {22 total} ; ROR r9d, 14 {25 total}
XOR r10d, r8d ; XOR r11d, r9d
// Here is the extraneous operation that I removed, causing a speedup
// s1 is the uint32 variable declared at the start of the Pascal code.
//
// I had cleaned up the code, so I no longer needed this variable, and
// could just leave the value sitting in the r11d register until I needed
// it again later.
//
// Since copying to RAM seemed like a waste, I removed the instruction,
// only to discover that the code ran slower without it.
{$IFDEF JUNKOPS}
MOV s1, r11d
{$ENDIF}
// The next part of the code just moves on to another part of SHA-256,
// maj { r12d } := (a and b) xor (a and c) xor (b and c)
mov r8d, a
mov r9d, b
mov r13d, r9d // Set aside a copy of b
and r9d, r8d
mov r12d, c
and r8d, r12d { a and c }
xor r9d, r8d
and r12d, r13d { c and b }
xor r12d, r9d
// Copying the calculated value to the same s1 variable is another speedup.
// As far as I can tell, it doesn't actually matter what register is copied,
// but moving this line up or down makes a huge difference.
{$IFDEF JUNKOPS}
MOV s1, r9d // after mov r12d, c
{$ENDIF}
// And here is where the two calculated values above are actually used:
// T2 {r12d} := S0 {r10d} + Maj {r12d};
ADD r12d, r10d
MOV T2, r12d
end
end;
Try it yourself:
The code is online at GitHub if you want to try it out yourself.
My questions:
Why would uselessly copying a register's contents to RAM ever increase performance?
Why would the same useless instruction provide a speedup on some lines, and a slowdown on others?
Is this behavior something that could be exploited predictably by a compiler?

The most likely cause of the speed improvement is that:
inserting a MOV shifts the subsequent instructions to different memory addresses
one of those moved instructions was an important conditional branch
that branch was being incorrectly predicted due to aliasing in the branch prediction table
moving the branch eliminated the alias and allowed the branch to be predicted correctly
Your Core2 doesn't keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. One disadvantage of global branch prediction is that the history is diluted by irrelevant information if the different conditional jumps are uncorrelated.
This little branch prediction tutorial shows how branch prediction buffers work. The cache buffer is indexed by the lower portion of the address of the branch instruction. This works well unless two important uncorrelated branches share the same lower bits. In that case, you end-up with aliasing which causes many mispredicted branches (which stalls the instruction pipeline and slowing your program).
If you want to understand how branch mispredictions affect performance, take a look at this excellent answer: https://stackoverflow.com/a/11227902/1001643
Compilers typically don't have enough information to know which branches will alias and whether those aliases will be significant. However, that information can be determined at runtime with tools such as Cachegrind and VTune.

You may want to read http://research.google.com/pubs/pub37077.html
TL;DR: randomly inserting nop instructions in programs can easily increase performance by 5% or more, and no, compilers cannot easily exploit this. It's usually a combination of branch predictor and cache behaviour, but it can just as well be e.g. a reservation station stall (even in case there are no dependency chains that are broken or obvious resource over-subscriptions whatsoever).

I believe in modern CPUs the assembly instructions, while being the last visible layer to a programmer for providing execution instructions to a CPU, actually are several layers from actual execution by the CPU.
Modern CPUs are RISC/CISC hybrids that translate CISC x86 instructions into internal instructions that are more RISC in behavior. Additionally there are out-of-order execution analyzers, branch predictors, Intel's "micro-ops fusion" that try to group instructions into larger batches of simultaneous work (kind of like the VLIW/Itanium titanic). There are even cache boundaries that could make the code run faster for god-knows-why if it's bigger (maybe the cache controller slots it more intelligently, or keeps it around longer).
CISC has always had an assembly-to-microcode translation layer, but the point is that with modern CPUs things are much much much more complicated. With all the extra transistor real estate in modern semiconductor fabrication plants, CPUs can probably apply several optimization approaches in parallel and then select the one at the end that provides the best speedup. The extra instructions may be biasing the CPU to use one optimization path that is better than others.
The effect of the extra instructions probably depends on the CPU model / generation / manufacturer, and isn't likely to be predictable. Optimizing assembly language this way would require execution against many CPU architecture generations, perhaps using CPU-specific execution paths, and would only be desirable for really really important code sections, although if you're doing assembly, you probably already know that.

Preparing the cache
Move operations to memory can prepare the cache and make subsequent move operations faster. A CPU usually have two load units and one store units. A load unit can read from memory into a register (one read per cycle), a store unit stores from register to memory. There are also other units that do operations between registers. All the units work in parallel. So, on each cycle, we may do several operations at once, but no more than two loads, one store, and several register operations. Usually it is up to 4 simple operations with plain registers, up to 3 simple operations with XMM/YMM registers and a 1-2 complex operations with any kind of registers. Your code has lots of operations with registers, so one dummy memory store operation is free (since there are more than 4 register operations anyway), but it prepares memory cache for the subsequent store operation. To find out how memory stores work, please refer to the Intel 64 and IA-32 Architectures Optimization Reference Manual.
Breaking the false dependencies
Although this does not exactly refer to your case, but sometimes using 32-bit mov operations under the 64-bit processor (as in your case) are used to clear the higher bits (32-63) and break the dependency chains.
It is well known that under x86-64, using 32-bit operands clears the higher bits of the 64-bit register. Pleas read the relevant section - 3.4.1.1 - of The Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 1:
32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register
So, the mov instructions, that may seem useless at the first sight, clear the higher bits of the appropriate registers. What it gives to us? It breaks dependency chains and allows the instructions to execute in parallel, in random order, by the Out-of-Order algorithm implemented internally by CPUs since Pentium Pro in 1995.
A Quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.8:
Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core micro-architecture, a number of instructions can help clear execution dependency when software uses these instruction to clear register content to zero. Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For
moves, this can be accomplished with 32-bit moves or by using MOVZX.
Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
The MOVZX and MOV with 32-bit operands for x64 are equivalent - they all break dependency chains.
That's why your code executes faster. If there are no dependencies, the CPU can internally rename the registers, even though at the first sight it may seem that the second instruction modifies a register used by the first instruction, and the two cannot execute in parallel. But due to register renaming they can.
Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.
I think you now see that it is too obvious.

Instruction Level Profiling: The Meaning of the Instruction Pointer?

When profiling code at the the assembly instruction level, what does the position of the instruction pointer really mean given that modern CPUs don't execute instructions serially or in-order? For example, assume the following x64 assembly code:
mov RAX, [RBX]; // Assume a cache miss here.
mov RSI, [RBX + RCX]; // Another cache miss.
xor R8, R8;
add RDX, RAX; // Dependent on the load into RAX.
add RDI, RSI; // Dependent on the load into RSI.
Which instruction will the instruction pointer spend most of its time on? I can think of good arguments for all of them:
mov RAX, [RBX] is taking probably 100s of cycles because it's a cache miss.
mov RSI, [RBX + RCX] also takes 100s of cycles, but probably executes in parallel with the previous instruction. What does it even mean for the instruction pointer to be on one or the other of these?
xor R8, R8 probably executes out-of-order and finishes before the memory loads finish, but the instruction pointer might stay here until all previous instructions are also finished.
add RDX, RAX generates a pipeline stall because it's the instruction where the value of RAX is actually used after a slow cache-miss load into it.
add RDI, RSI also stalls because it's dependent on the load into RSI.

CPUs maintains a fiction that there are only the architectural registers (RAX, RBX, etc) and there is a specific instruction pointer (IP). Programmers and compilers target this fiction.
Yet as you noted, modern CPUs don't execute serially or in-order. Until you the programmer / user request the IP, it is like Quantum Physics, the IP is a wave of instructions being executed; all so that the processor can run the program as fast as possible. When you request the current IP (for example, via a debugger breakpoint or profiler interrupt), then the processor must recreate the fiction that you expect so it collapses this wave form (all "in flight" instructions), gathers the register values back into architectural names, and builds a context for executing the debugger routine, etc.
In this context, there is an IP that indicates the instruction where the processor should resume execution. During the out-of-order execution, this instruction was the oldest instruction yet to complete, even though at the time of the interrupt the processor was perhaps fetching instructions well past that point.
For example, perhaps the interrupt indicates mov RSI, [RBX + RCX]; as the IP, but the xor had already executed and completed; however, when the processor would resume execution after the interrupt, it will re-execute the xor.

It's a good question, but in the kind of performance tuning I do, it doesn't matter.
It doesn't really matter because what you're looking for is speed-bugs.
These are things that the code is doing that take clock time and that could be done better or not at all. Examples:
- Spending I/O time looking in DLLs for resources that don't, actually, need to be looked for.
- Spending time in memory-allocation routines making and freeing objects that could simply be re-used.
- Re-calculating things in functions that could be memo-ized.
... this is just a few off the top of my head
Your biggest enemy is a self-congratulatory tendency to say "I wouldn't consciously write any bugs. Why would I?" Of course, you know that's why you test software. But the same goes for speed-bugs, and if you don't know how to find those you assume there are none, which is a way of saying "My code has no possible speedups, except maybe a profiler can show me how to shave a few cycles."
In my half-century experience, there is no code that, as first written, contains no speed-bugs. What's more, there's an enormous multiplier effect, where every speed-bug you remove makes the remaining ones more obvious. As a contrived example, suppose bug A accounts for 90% of clock time, and bug B accounts for 9%. If you only fix B, big deal - the code is 11% faster. If you only fix A, that's good - it's 10x faster. But if you fix both, that's really good - it's 100x faster. Fixing A made B big.
So the thing you need most in performance tuning is to find the speed-bugs, and not miss any. When you've done all that, then you can get down to cycle-shaving.

Why does this code enable me to detect a debugger?

Why the following assembly code is an anti-debugging tool?
l1:
call l3
l2:
;some code
l3:
mov al, 0c3h
mov edi, offset l3
or ecx, -1
rep stosb
I know that C3h is RETN and I know that stobs writes the value in al as opcode according to the offset in edi and it is done for ecx times because of rep.
I am also aware the fact that stobs and stosw will run if they were pre-fetched on intel architecture as their original format.
If we run the program in debugged mode the pre-fetch is irrelevant and the l2 label will run (because it is single-step) otherwise if there is no debugger it will be ping-pong between l1 and l3 am I right?

When program is debugged (i.e. single step) prefetch queue is flushed at each step (when interrupt occurs). However, when executed normally that will not happen to rep stosb. Older processors didn't flushed it even when there was memory write to the cached area, in order to support self-modifying code that was changed except rep movs and rep stosb. (IIRC it was eventually fixed in i7 processors.)
That's why if there is a debugger (single step) code will execute correctly and when rep stosb is replaced by ret l2 will be executed. When there is no debugger rep stosb will continue, since ecx is the biggest possible it will eventually write somewhere it is not supposed to write and an exception will occur.
This anti-debugging technique is described in this paper.

The only thing that a debugger does here is add time delay. That may be the key to how this works. The Intel (and I assume the AMD) manual explicitly say that self-modifying code is not gauranteed "to work" unless the program signals the CPU that the cache line containg the modified instruction has changed. This is to make the prefetch logic cheap; the chip designers don't want to have hardware that continually tests that every byte of an instruction cacheline is still valid.
So I assume what happens with the debuggers is l1 calls l3, which stores a return after the rep stosb, and the return gets executed becaued of long delays induced by the debugger in single stepping, forcing the cachecline containng l3 to be refetched after changed.
Without the debugger, I'd guess the instruction (not shown) after the stosb gets executed. If it were a jump to "no debugger" then the success of the jump would demonstrate that no single-stepping debugger was being used.
If I found this code in an application, I would refuse to run it.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio