I'm trying to understand the rdpmc instruction. As such I have the following asm code:
segment .text
global _start
_start:
xor eax, eax
mov ebx, 10
.loop:
dec ebx
jnz .loop
mov ecx, 1<<30
; calling rdpmc with ecx = (1<<30) gives number of retired instructions
rdpmc
; but only if you do a bizarre incantation: (Why u do dis Intel?)
shl rdx, 32
or rax, rdx
mov rdi, rax ; return number of instructions retired.
mov eax, 60
syscall
(The implementation is a translation of rdpmc_instructions().)
I count that this code should execute 2*ebx+3 instructions before hitting the rdpmc instruction, so I expect (in this case) that I should get a return status of 23.
If I run perf stat -e instructions:u ./a.out on this binary, perf tells me that I've executed 30 instructions, which looks about right. But if I execute the binary, I get a return status of 58, or 0; it's not deterministic.
What have I done wrong here?
The fixed counters don't count all the time, only when software has enabled them. Normally (the kernel side of) perf does this, along with resetting them to zero before starting a program.
The fixed counters (like the programmable counters) have bits that control whether they count in user, kernel, or user+kernel (i.e. always). I assume Linux's perf kernel code leaves them set to count neither when nothing is using them.
If you want to use raw RDPMC yourself, you need to either program / enable the counters (by setting the corresponding bits in the IA32_PERF_GLOBAL_CTRL and IA32_FIXED_CTR_CTRL MSRs), or get perf to do it for you by still running your program under perf. e.g. perf stat ./a.out
If you use perf stat -e instructions:u ./a.out ; echo $?, the fixed counter will actually be zeroed before entering your code, so you get consistent results from using rdpmc once. Otherwise, e.g. with the default -e instructions (not :u), you don't know the initial value of the counter. You can fix that by taking a delta: read the counter once at the start, then once after your loop.
The exit status is only 8 bits wide, so this little hack to avoid printf or write() only works for very small counts.
It also means it's pointless to construct the full 64-bit rdpmc result: the high 32 bits of the inputs don't affect the low 8 bits of a sub result, because carry propagates only from low to high. In general, unless you expect counts > 2^32, just use the EAX result. Even if the raw 64-bit counter wrapped around during the interval you measured, your subtraction result will still be a correct small integer in a 32-bit register.
Here's a version simplified even more than in your question. Also note the operands are indented so they stay at a consistent column even for mnemonics longer than 3 letters.
segment .text
global _start
_start:
    mov     ecx, 1<<30      ; fixed counter: instructions
    rdpmc
    mov     edi, eax        ; start
    mov     edx, 10
.loop:
    dec     edx
    jnz     .loop
    rdpmc                   ; ecx = same counter as before
    sub     eax, edi        ; end - start
    mov     edi, eax
    mov     eax, 231
    syscall                 ; sys_exit_group(rdpmc). sys_exit isn't wrong, but glibc uses exit_group.
Running this under perf stat ./a.out or perf stat -e instructions:u ./a.out, we always get 23 from echo $? (instructions:u shows 30, which is 1 more than the actual number of instructions this program runs, including syscall)
23 instructions is exactly the number of instructions strictly after the first rdpmc, but including the 2nd rdpmc.
If we comment out the first rdpmc and run it under perf stat -e instructions:u, we consistently get 26 as the exit status, and 29 from perf. rdpmc is the 24th instruction to be executed. (And RAX starts out initialized to zero because this is a Linux static executable, so the dynamic linker didn't run before _start). I wonder if the sysret in the kernel gets counted as a "user" instruction.
But with the first rdpmc commented out, running under perf stat -e instructions (not :u) gives arbitrary values as the starting value of the counter isn't fixed. So we're just taking (some arbitrary starting point + 26) mod 256 as the exit status.
But note that RDPMC is not a serializing instruction, and can execute out of order. In general you maybe need lfence, or (as John McCalpin suggests in the thread you linked) giving ECX a false dependency on the results of instructions you care about. e.g. and ecx, 0 / or ecx, 1<<30 works, because unlike xor-zeroing, and ecx,0 is not dependency-breaking.
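Concretely, the ordering trick looks something like this (a minimal sketch, assuming the timed work leaves its result in EAX; any register of your choosing works the same way):
    mov  ecx, eax          ; make ECX data-dependent on the work's result
    and  ecx, 0            ; ECX = 0, but not dependency-breaking, so the dep chain survives
    or   ecx, 1 << 30      ; ECX = 1<<30: fixed counter, instructions retired
    rdpmc                  ; reads ECX, so it can't execute until the work feeding EAX is done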
Nothing weird happens in this program because the front-end is the only bottleneck, so all the instructions execute basically as soon as they're issued. Also, the rdpmc is right after the loop, so probably a branch mispredict of the loop-exit branch prevents it from being issued into the OoO back-end before the loop finishes.
PS for future readers: one way to enable user-space RDPMC on Linux without any custom modules beyond what perf requires is documented in perf_event_open(2):
echo 2 | sudo tee /sys/devices/cpu/rdpmc # enable RDPMC always, not just when a perf event is open
The first step is to ensure that the performance counters you want to use are enabled in the IA32_PERF_GLOBAL_CTRL MSR register, whose layout is shown in Figure 18-8 of the Intel Manual Volume 3 (January 2019). You can easily do this by loading the MSR kernel module (sudo modprobe msr) and executing the following command:
sudo rdmsr -a 0x38F
The value 0x38F is the address of the IA32_PERF_GLOBAL_CTRL MSR register and the -a option specifies that the read should be performed on all logical cores. By default, this should print 7000000ff (when HT is disabled) or 70000000f (when HT is enabled) for all logical cores. For the INST_RETIRED.ANY fixed-function performance counter, the bit at index 32 is the one that enables it, so it should be 1. The value 7000000ff indicates that all three of the fixed-function counters and all eight of the programmable counters are enabled.
The IA32_PERF_GLOBAL_CTRL register has one enable bit for each performance counter per logical core. Each programmable performance counter also has its own dedicated control register, and there is a single control register for all of the fixed-function counters. In particular, the control register for the INST_RETIRED.ANY fixed-function performance counter is IA32_FIXED_CTR_CTRL, whose layout is shown in Figure 18-7 of the Intel Manual Volume 3. There are 12 defined bits in the register; the first 4 bits control the behavior of the first fixed-function counter, i.e., INST_RETIRED.ANY (the order is shown in Table 19-2). Before modifying the register, you should first check how it got initialized by the OS by executing:
sudo rdmsr -a 0x38D
It should print 0xb0, by default. This indicates that the second fixed-function counter (unhalted core cycles) is enabled and configured to count in both supervisor mode and user mode. To enable INST_RETIRED.ANY and configure it to count only user mode events while keeping the unhalted core cycles counter as is, execute the following command:
sudo wrmsr -a 0x38D 0xb2
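For reference, here is my reading of how that value decomposes according to the Figure 18-7 layout (4 control bits per fixed counter, counter 0 in bits 3:0), written as NASM-style equates:
    %define FIXED0_USR     (1 << 1)                ; ctr 0 (INST_RETIRED.ANY): count ring 3 only
    %define FIXED1_OS_USR  ((1 << 4) | (1 << 5))   ; ctr 1 (unhalted core cycles): ring 0 + ring 3
    %define FIXED1_PMI     (1 << 7)                ; ctr 1: interrupt on overflow (part of the existing 0xb0)
    %define FIXED_CTR_CTRL (FIXED1_PMI | FIXED1_OS_USR | FIXED0_USR)   ; = 0xb2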
Once this command is executed, the events are counted immediately. You can check this by reading the first fixed-function counter IA32_PERF_FIXED_CTR0 (see Table 19-2):
sudo rdmsr -a 0x309
You can execute that command multiple times and see how the counts on each core are changing. Unfortunately, this means that by the time your program is run, the current value in IA32_PERF_FIXED_CTR0 will be basically some random value. You can try to reset the counter by executing:
sudo wrmsr -a 0x309 0
But the fundamental problem remains; you cannot instantaneously reset the counter and run your program. As suggested in #Peter's answer, the right way to use any performance counter is to wrap the region of interest between rdpmc instructions and take the difference.
The MSR kernel module is very convenient because the only way to access MSR registers is in kernel mode. However, there is an alternative to wrapping the code between rdpmc instructions. You can write your own kernel module and place your code in the kernel module immediately after the instruction that enables the counter. You can even disable interrupts. Typically, this level of accuracy is not worth the effort.
You can use the -p option instead of -a to specify a particular logical core. However, you'll have to make sure that the program is run on the same core with taskset -c 3 ./a.out to run on core #3, for example.
Related
What's the most efficient way to set and also to clear the zero flag (ZF) in x86-64?
Methods that work without the need for a register with a known value, or without any free registers at all are preferred, but if a better method is available when those or other assumptions are true it is also worth mentioning.
ZF=0
This is harder. cmp between any two regs known to be not equal. Or cmp reg,imm with any value some reg couldn't possibly have. e.g. cmp reg,1 with any known-zero register.
In general test reg,reg is good with any known-non-0 register value, e.g. a pointer.
test rsp, rsp is probably a good choice, or even test esp, esp to save a byte will work except if your stack is in the unusual location of spanning a 4G boundary.
I don't see a way to create ZF=0 in one instruction without a false dependency on some input reg. xor eax,eax / inc eax or dec will do the trick in 2 uops if you don't mind destroying a register, breaking false dependencies. (not doesn't set FLAGS, and neg will just do 0-0 = 0.)
or eax, -1 doesn't need any pre-condition for the register value. (False dependency, but not a true dependency so you can pick any register even if it might be zero.) It doesn't have to be -1, it's not gaining you anything so if you can make it something useful so much the better.
or eax,-1 FLAG results: ZF=0 PF=1 SF=1 CF=0 OF=0 (AF=undefined).
If you need to do this in a loop, you can obviously set up for it outside the loop, if you can dedicate a register to being non-zero for use with test.
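For example, a fragment of what that might look like (assuming r8 is free to dedicate to this):
    mov  r8d, 1            ; any known-non-zero value, set once outside the loop
.loop:
    ; ... loop body ...
    test r8, r8            ; ZF=0 (and CF=OF=0) without modifying any register
    ; ... whatever wanted ZF=0 ...
    dec  ecx
    jnz  .loop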
ZF=1
Least destructive: cmp eax,eax - but has a false dependency (I assume) and needs a back-end uop: not a zeroing idiom. RSP doesn't usually change much so cmp esp, esp could be a good choice. (Unless that forces a stack-sync uop).
Most efficient: xor-zeroing (like xor eax,eax using any free register) is definitely the most efficient way on SnB-family (same cost as a 2-byte nop, or 3-byte if it needs a REX because you want to zero one of r8d..r15d): 1 front-end uop, zero back-end uops on SnB-family, and the FLAGS result is ready in the same cycle it issues. (Relevant only in case the front-end was stalled, or some other case where a uop depending on it issues in the same cycle and there aren't any older uops in the RS with ready inputs, otherwise such uops would have priority for whichever execution port.)
Flag results: ZF=1 PF=1 SF=0 CF=0 OF=0 (AF=undefined). (Or use sub eax,eax to get well-defined AF=0. In practice modern CPUs pick AF=0 for xor-zeroing, too, so they can decode both zeroing idioms the same way. Silvermont only recognizes 32-bit operand-size xor as a zeroing idiom, not sub.)
xor-zero is very cheap on all other uarches as well, of course: no input dependencies, and doesn't need any pre-existing register value. (And thus doesn't contribute to P6-family register-read stalls). So it will be at worst tied with anything else you could do on any other uarch (where it does require an execution unit.)
(On early P6-family, before Pentium M, xor-zeroing does not break dependencies; it only triggers the special al=eax state that avoids partial-register stuff. But none of those CPUs are x86-64, all 32-bit only.)
It's pretty common to want a zeroed register for something anyway, e.g. as a sub destination for 0 - x to copy-and-negate, so take advantage of it by putting the xor-zeroing where you need it to also create a useful FLAG condition.
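A contrived sketch of what I mean (the cmovz and the negate stand in for whatever your code actually needs):
    xor   ecx, ecx         ; ZF=1 (and CF=OF=SF=0), and ECX = 0
    cmovz ebx, edi         ; consume the ZF=1 condition we just created
    sub   ecx, esi         ; reuse the zeroed ECX: ecx = 0 - esi = -esi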
Interesting but probably not useful: test al, 0 is 2 bytes long. But so is cmp esp,esp.
As #prl suggested, cmp same,same with any register will work without disturbing a value. I suspect this is not special-cased as dependency breaking the way sub same,same is on some CPUs, so pick a "cold" register. Again 2 or 3 bytes, 1 uop. It can micro-fuse with a JCC, but that would be dumb (unless the JCC is also a branch target from some other condition?)
Flag results: same as xor-zeroing.
Downsides:
(probably) false dependency
on P6-family can contribute to a register-read stall, so pick a cold register you're already reading in nearby instructions.
needs a back-end execution unit on SnB-family
Just for fun, other as-cheap alternatives include test al, 0. 2 bytes for AL, 3 or 4 bytes for any other 8-bit register. (REX) + opcode + modrm + imm8. The original register value doesn't matter because an imm8 of zero guarantees that reg & 0 = 0.
If you happen to have a 1 or -1 in a register you can destroy, 32-bit mode inc or dec would set ZF in only 1 byte. But in x86-64 that's at least 2 bytes. Nothing comes to mind for a 1-byte instruction in 64-bit mode that's actually efficient and sets FLAGS.
ZF=!CF
sbb same,same can set ZF=!CF (leaving CF unmodified), and setting the reg to 0 (CF=0) or -1 (CF=1). On AMD since Bulldozer (BD-family and Zen-family), this has no dependency on the GP register, only CF. But on other uarches it's not special cased and there is a false dep on the reg. And it's 2 uops on Intel before Broadwell.
ZF=!bool(integer register)
To set ZF=!integer_reg, obviously the normal test reg,reg is your best bet. (Better than and reg,reg or or reg,reg, unless you're intentionally rewriting the register to avoid P6 register-read stalls.)
ZF=1 if the register value is zero, so it's like C's logical inverse operator.
ZF=!ZF
Perhaps setz al / test al, al. No single instruction: I don't think any read ZF and write FLAGS. setz materializes ZF in a register, then test is just ZF = !reg.
Other FLAGS conditions:
How to read and write x86 flags registers directly?
One instruction to clear PF (Parity Flag) -- get odd number of bits in result register (not possible without pre-existing register values for test or cmp).
How can I set or clear overflow flag in x86 assembly? (e.g. for the start of an ADOX chain.)
pushf/pop rax is not terrible, but writing flags with popf is very slow (e.g. 1/20c throughput on SKL). It's microcoded because flags like IF also live in EFLAGS, and there isn't a condition-codes-only version or a special fast-path for user-space. (Or maybe 20c is the fast path.)
lahf (FLAGS->AH) / sahf (AH->FLAGS) can be useful but miss OF.
CF has clc/stc/cmc instructions. (clc is as efficient as xor-zeroing on SnB-family.)
The least intrusive way to manipulate any(i) of the lower 8 bits of Flags is to use the classic LAHF/SAHF instructions which bring them to/from AH, on which any bit operation can be applied.
(i) Just bits 7 (SF), 6 (ZF), 4 (AF), 2 (PF), and 0 (CF)
Turning off ZF
LAHF ; Load lower 8 bit from Flags into AH
AND AH,010111111b ; Clear bit for ZF
SAHF ; Store AH back to Flags
Turning on ZF
LAHF ; Load AH from FLAGS
OR AH,001000000b ; Set bit for ZF
SAHF ; Store AH back to Flags
Of course any CMP (E)AX,(E)AX will set ZF faster and with less code; the point of this is to leave other FLAGS unmodified, as in How to read and write x86 flags registers directly? and how to change flags manually (in assembly code) for 8086?
CAVEAT for early AMD64 - LAHF in long mode is an extension
Some very early x86-64 CPUs lack LAHF/SAHF in 64-bit mode, most notably all
AMD Athlon 64, Opteron and Turion 64 before revision D (March 2005) and
Intel before Pentium 4 stepping G1 (December 2005).
The instruction was originally removed from the AMD64 instruction set, but later reintroduced. Luckily that happened before x86-64 became a common sight, so only a few early high-end CPUs are affected, and even fewer survive today. More so as these are the CPUs that are not able to run Windows 10, or any 64-bit Windows before Windows 10 (see this answer at SuperUser.SE).
If you really expect that someone might try to run that software on a more than 17-year-old high-end CPU, this can be checked for by executing CPUID with EAX=80000001h and testing bit 0 of ECX (the LAHF/SAHF-in-long-mode feature flag).
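A sketch of that check (CPUID leaf 80000001h; the feature flag is bit 0 of ECX):
    mov  eax, 0x80000001
    cpuid                  ; clobbers EAX/EBX/ECX/EDX
    test ecx, 1            ; bit 0 = LAHF/SAHF available in 64-bit mode
    jz   .no_lahf_sahf     ; hypothetical fallback path for those early CPUs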
Assuming you don’t need to preserve the values of the other flags,
cmp eax, eax
I found in online resource that IvyBridge has 3 ALU. So I write a small program to test:
global _start
_start:
mov rcx, 10000000
.for_loop: ; do {
inc rax
inc rbx
dec rcx
jnz .for_loop ; } while (--rcx)
xor rdi, rdi
mov rax, 60 ; _exit(0)
syscall
I compile and run it with perf:
$ nasm -felf64 cycle.asm && ld cycle.o && sudo perf stat ./a.out
The output shows:
10,491,664 cycles
which seems to make sense at first glance, because there are 3 independent instructions (2 inc and 1 dec) that use an ALU in the loop, so they can all execute in the same cycle.
But what I don't understand is why the whole loop takes only 1 cycle per iteration. jnz depends on the result of dec rcx, so it should take another cycle, making the whole loop 2 cycles per iteration. I would expect the output to be close to 20,000,000 cycles.
I also tried changing the second inc from inc rbx to inc rax, which makes it dependent on the first inc. The result does become close to 20,000,000 cycles, which shows that a dependency delays an instruction so that the two can't run in the same cycle. So why is jnz special?
What am I missing here?
First of all, dec/jnz will macro-fuse into a single uop on Intel Sandybridge-family. You could defeat that by putting a non-flag-setting instruction between the dec and jnz.
.for_loop: ; do {
inc rax
dec rcx
lea rbx, [rbx+1] ; doesn't touch flags, defeats macro-fusion
jnz .for_loop ; } while (--rcx)
This will still run at 1 iter per cycle on Haswell and later and Ryzen because they have 4 integer execution ports to keep up with 4 uops per iteration. (Your loop with macro-fusion is only 3 fused-domain uops on Intel CPUs, so SnB/IvB can run it at 1 per clock, too.)
See Agner Fog's optimization guide and especially his microarch guide. Also other links in https://stackoverflow.com/tags/x86/info.
Control dependencies are hidden by branch prediction + speculative execution, unlike data dependencies.
Out-of-order execution and branch prediction + speculative execution hide the "latency" of the control dependency. i.e. the next iteration can start running before the CPU verifies that jnz should really be taken.
So each jnz has an input dependency on the previous dec rcx before it can verify the prediction, but later instructions don't have to wait for it to be checked before they can execute. In-order retirement makes sure that mis-speculation is caught before anything can "see" it happen (except for microarchitectural effects leading to the Spectre attack...)
10M iterations is not a lot. I'd normally use at least 100M for something that runs at only 1c per iter. Having a simple microbenchmark run for 0.1 to 1 second is normally good to get very high precision and hide startup overhead.
And BTW, you don't need sudo perf if you set kernel.perf_event_paranoid = 0 with sysctl. It's almost certainly better to do that than to use sudo all the time.
Let's say I have a function that I plan to execute as part of a benchmark. I want to bring this code into the L1 instruction cache prior to executing since I don't want to measure the cost of I$ misses as part of the benchmark.
The obvious way to do this is to simply execute the code at least once before the benchmark, hence "warming it up" and bringing it into the L1 instruction cache and possibly the uop cache, etc.
What are my alternatives in the case I don't want to execute the code (e.g., because I want the various predictors which key off of instruction addresses to be cold)?
In Granite Rapids and later, PREFETCHIT0 [rip+rel32] prefetches code into "all levels" of cache, or prefetchit1 prefetches into all levels except L1i. These instructions are a NOP with an addressing-mode other than RIP-relative, or on CPUs that don't support them. (Perhaps they also prime iTLB or even uop cache, or at least could on paper, in which case this isn't what you want.) The docs in Intel's "future extensions" manual as of 2022 Dec recommend that the target address be the start of some instruction.
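In NASM-style syntax that would look something like this (assuming an assembler new enough to know the mnemonics; on CPUs without the feature they just execute as NOPs):
    prefetchit0 [rel function_under_test]   ; code prefetch into all cache levels, including L1i
    prefetchit1 [rel function_under_test]   ; code prefetch into all levels except L1i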
Note that this Q&A is about priming things for a microbenchmark. Not things that would be worth doing to improve overall performance. For that, probably just best-effort prefetch into L2 cache (the inner-most unified cache) with prefetcht1 SW prefetch, also priming the dTLB in a way that possibly helps the iTLB (although it will evict a possibly-useful dTLB entry).
Or not, if the L2TLB is a victim cache. see X86 prefetching optimizations: "computed goto" threaded code for more discussion.
Caveat: some of this answer is Intel-centric. If I just say "the uop cache", I'm talking about Sandybridge-family. I know Ryzen has a uop-cache too, but I haven't read much of anything about its internals, and only know some basics about Ryzen from reading Agner Fog's microarch guide.
You can at least prefetch into L2 with software prefetch, but that doesn't even necessarily help with iTLB hits. (The 2nd-level TLB is a victim cache, I think, so a dTLB miss might not populate anything that the iTLB checks.)
But this doesn't help with L1I$ misses, or getting the target code decoded into the uop cache.
If there is a way to do what you want, it's going to be with some kind of trick. x86 has no "standard" way to do this; no code-prefetch instruction. Darek Mihoka wrote about code-prefetch as part of a CPU-emulator interpreter loop: The Common CPU Interpreter Loop Revisited: the Nostradamus Distributor back in 2008 when P4 and Core2 were the hot CPU microarchitectures to tune for.
But of course that's a different case: the goal is sustained performance of indirect branches, not priming things for a benchmark. It doesn't matter if you spend a lot of time achieving the microarchitectural state you want outside the timed portion of a micro-benchmark.
Speaking of which, modern branch predictors aren't just "cold", they always contain some prediction based on aliasing1. This may or may not be important.
Prefetch the first / last lines (and maybe others) with call to a ret
I think instruction fetch / prefetch normally continues past an ordinary ret or jmp, because it can't be detected until decode. So you could just call a function that ends at the end of the previous cache line. (Make sure they're in the same page, so an iTLB miss doesn't block prefetch.)
ret after a call will predict reliably if no other call did anything to the return-address predictor stack, except in rare cases if an interrupt happened between the call and ret, and the kernel code had a deep enough call tree to push the prediction for this ret out of the RSB (return-stack-buffer). Or if Spectre mitigation flushed it intentionally on a context switch.
; make sure this is in the same *page* as function_under_test, to prime the iTLB
ALIGN 64
; could put the label here, but probably better not
; 60 bytes of (long) NOPs or whatever padding
prime_Icache_first_line:
ret
; jmp back_to_benchmark_startup ; alternative if JMP is handled differently than RET.
lfence ; prevent any speculative execution after RET, in case it or JMP aren't detected as soon as they decode
;;; cache-line boundary here
function_under_test:
...
prime_Icache_last_line: ; label the last RET in function_under_test
ret ; this will prime the "there's a ret here" predictor (correctly)
benchmark_startup:
call prime_Icache_first_line
call prime_Icache_first_line ; IDK if calling twice could possibly help in case prefetch didn't get far the first time? But now the CPU has "seen" the RET and may not fetch past it.
call prime_Icache_last_line ; definitely only need to call once; it's in the right line
lfence
rdtsc
.timed_loop:
call function_under_test
...
jnz .timed_loop
We can even extend this technique to more than 2 cache lines by calling to any 0xC3 (ret) byte inside the body of function_under_test. But as #BeeOnRope points out, that's dangerous because it may prime branch prediction with "there's a ret here" causing a mispredict you otherwise wouldn't have had when calling function_under_test for real.
Early in the front-end, branch prediction is needed based on fetch-block address (which block to fetch after this one), not on individual branches inside each block, so this could be a problem even if the ret byte was part of another instruction.
But if this idea is viable, then you can look for a 0xc3 byte as part of an instruction in the cache line, or at worst add a 3-byte NOP r/m32 (0f 1f c3 nop ebx,eax). c3 as a ModR/M encodes a reg,reg instruction (with ebx and eax as operands), so it doesn't have to be hidden in a disp8 to avoid making the NOP even longer, and it's easy to find in short instructions: e.g. 89 c3 mov ebx,eax, or use the other opcode so the same modrm byte gives you mov eax,ebx. Or 83 c3 01 add ebx,0x1, or many other instructions with e/rbx, bl (and r/eax or al).
With a REX prefix, you have a choice of rbx / r11 (and rax/r8 for the /r field if applicable). It's likely you can choose (or modify for this microbenchmark) your register allocation to produce an instruction using the relevant registers to produce a c3 byte without any overhead at all, especially if you can use a custom calling convention (at least for testing purposes) so you can clobber rbx if you weren't already saving/restoring it.
I found these by searching for (space)c3(space) in the output of objdump -d /bin/bash, just to pick a random not-small executable full of compiler-generated code.
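The mechanics look something like this sketch (hypothetical labels; the caveat above about priming the "there's a ret here" predictor still applies):
some_insn_in_function:
    add  ebx, byte 1               ; forces the 83 C3 01 encoding: the byte at +1 is 0xC3 (ret)
    ...
prime_that_line:
    call some_insn_in_function + 1 ; executes just a RET, but fetches the whole I-cache line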
Evil hack: end the cache line before with the start of a multi-byte instruction.
; at the end of a cache line
prefetch_Icache_first_line:
db 0xe9 ; the opcode for 5-byte jmp rel32
function_under_test:
... normal code ; first 4 bytes will be treated as a rel32 when decoding starts at prefetch_I...
ret
; then at function_under_test+4 + rel32:
;org whatever (that's not how ORG works in NASM, so actually you'll need a linker script or something to put code here)
prefetch_Icache_branch_target:
jmp back_to_test_startup
So it jumps to a virtual address which depends on the instruction bytes of function_under_test. Map that page and put code in it that jumps back to your benchmark-prep code. The destination has to be within 2GiB, so (in 64-bit code) it's always possible to choose a virtual address for function_under_test that makes the destination a valid user-space virtual address. Actually, for many rel32 values, it's possible to choose the address of function_under_test to keep both it and the target within the low 2GiB of virtual address space, (and definitely 3GiB) and thus valid 32-bit user-space addresses even under a 32-bit kernel.
Or less crazy, using the end of a ret imm16 to consume a byte or two, just requiring a fixup of RSP after return (and treating whatever is temporarily below RSP as a "red zone" if you don't reserve extra space):
; at the end of a cache line
prefetch_Icache_first_line:
db 0xc2 ; the opcode for 3-byte ret imm16
; db 0x00 ; optional: one byte of the immediate at least keeps RSP aligned
; But x86 is little-endian, so the last byte is most significant
;; Cache-line boundary here
function_under_test:
... normal code ; first 2 bytes will be added to RSP when decoding starts at prefetch_Icache_first_line
ret
prefetch_caller:
push rbp
mov rbp, rsp ; save old RSP
;sub rsp, 65536 ; reserve space in case of the max RSP+imm16.
call prefetch_Icache_first_line
;;; UNSAFE HERE in case of signal handler if you didn't SUB.
mov rsp, rbp ; restore RSP; if no signal handlers installed, probably nothing could step on stack memory
...
pop rbp
ret
Using sub rsp, 65536 before calling to the ret imm16 makes it safe even if there might be a signal handler (or interrupt handler in kernel code, if your kernel stack is big enough, otherwise look at the actual byte and see how much will really be added to RSP). It means that call's push/store will probably miss in data cache, and maybe even cause a pagefault to grow the stack. That does happen before fetching the ret imm16, so that won't evict the L1I$ line we wanted to prime.
This whole idea is probably unnecessary; I think the above method can reliably prefetch the first line of a function anyway, and this only works for the first line. (Unless you put a 0xe9 or 0xc2 in the last byte of each relevant cache line, e.g. as part of a NOP if necessary.)
But this does give you a way to non-speculatively do code-fetch from the cache line you want without architecturally executing any instructions in it. Hopefully a direct jmp is detected before any later instructions execute, and probably without any others even decoding, except ones that decoded in the same block. (And an unconditional jmp always ends a uop-cache line on Intel). i.e. the mispredict penalty is all in the front-end from re-steering the fetch pipeline as soon as decode detects the jmp. I hope ret is like this too, in cases where the return-predictor stack is not empty.
A jmp r/m64 would let you control the destination just by putting the address in the right register or memory. (Figure out what register or memory addressing mode the first byte(s) of function_under_test encode, and put an address there). The opcode is FF /4, so you can only use a register addressing mode if the first byte works as a ModRM that has /r = 4 and mode=11b. But you could put the first 2 bytes of the jmp r/m64 in the previous line, so the extra bytes form the SIB (and disp8 or disp32). Whatever those are, you can set up register contents such that the jump-target address will be loaded from somewhere convenient.
But the key problem with a jmp r/m64 is that default-prediction for an indirect branch can fall through and speculatively execute function_under_test, affecting the branch-prediction entries for those branches. You could have bogus values in registers so you prime branch prediction incorrectly, but that's still different from not touching them at all.
How does this overlapping-instructions hack to consume bytes from the target cache line affect the uop cache?
I think (based on previous experimental evidence) Intel's uop cache puts instructions in the uop-cache line that corresponds to their start address, in cases where they cross a 32 or 64-byte boundary. So when the real execution of function_under_test begins, it will simply miss in the uop-cache because no uop-cache line is caching the instruction-start-address range that includes the first byte of function_under_test. i.e. the overlapping decode is probably not even noticed when it's split across an L1I$ boundary this way.
It is normally a problem for the uop cache to have the same bytes decode as parts of different instructions, but I'm optimistic that we wouldn't have a penalty in this case. (I haven't double-checked that for this case. I'm mostly assuming that lines record which range of start-addresses they cache, and not the whole range of x86 instruction bytes they're caching.)
Create mis-speculation to fetch arbitrary lines, but block exec with lfence
Spectre / Meltdown exploits and mitigation strategies provide some interesting ideas: you could maybe trigger a mispredict that fetches at least the start of the code you want, but maybe doesn't speculate into it.
lfence blocks speculative execution, but (AFAIK) not instruction prefetch / fetch / decode.
I think (and hope) the front-end will follow direct relative jumps on its own, even after lfence, so we can use jmp target_cache_line in the shadow of a mispredict + lfence to fetch and decode but not execute the target function.
If lfence works by blocking the issue stage until the reservation station (OoO scheduler) is empty, then an Intel CPU should probably decode past lfence until the IDQ is full (64 uops on Skylake). There are further buffers before other stages (fetch -> instruction-length-decode, and between that and decode), so fetch can run ahead of that. Presumably there's a HW prefetcher that runs ahead of where actual fetch is reading from, so it's pretty plausible to get several cache lines into the target function in the shadow of a single mispredict, especially if you introduce delays before the mispredict can be detected.
We can use the same return-address frobbing as a retpoline to reliably trigger a mispredict to jmp rel32 which sends fetch into the target function. (I'm pretty sure a re-steer of the front-end can happen in the shadow of speculative execution without waiting to confirm correct speculation, because that would make every re-steer serializing.)
function_under_test:
...
some_line: ; not necessarily the first cache line
...
ret
;;; This goes in the same page as the test function,
;;; so we don't iTLB-miss when trying to send the front-end there
ret_frob:
xorps xmm0,xmm0
movq xmm1, rax
;; The point of this LFENCE is to make sure the RS / ROB are empty so the front-end can run ahead in a burst.
;; so the sqrtpd delay chain isn't gradually issued.
lfence
;; alternatively, load the return address from the stack and create a data dependency on it, e.g. and eax,0
;; create a many-cycle dependency-chain before the RET misprediction can be detected
times 10 sqrtpd xmm0,xmm0 ; high latency, single uop
orps xmm0, xmm1 ; target address with data-dep on the sqrtpd chain
movq [rsp], xmm0 ; overwrite return address
; movd [rsp], xmm0 ; store-forwarding stall: do this *as well* as the movq
ret ; mis-speculate to the lfence/jmp some_line
; but architecturally jump back to the address we got in RAX
prefetch_some_line:
lea rax, [rel back_to_bench_startup]
; or pop rax or load it into xmm1 directly,
; so this block can be CALLed as a function instead of jumped to
call ret_frob
; speculative execution goes here, but architecturally never reached
lfence ; speculative *execution* stops here, fetch continues
jmp some_line
I'm not sure the lfence in ret_frob is needed. But it does make it easier to reason about what the front-end is doing relative to the back-end. After the lfence, the return address has a data dependency on the chain of 10x sqrtpd. (10x 15 to 16 cycle latency on Skylake, for example). But the 10x sqrtpd + orps + movq only take 3 cycles to issue (on 4-wide CPUs), leaving at least 148 cycles + store-forwarding latency before ret can read the return address back from the stack and discover that the return-stack prediction was wrong.
This should be plenty of time for the front-end to follow the jmp some_line and load that line into L1I$, and probably load several lines after that. It should also get some of them decoded into the uop cache.
You need a separate call / lfence / jmp block for each target line (because the target address has to be hard-coded into a direct jump for the front-end to follow it without the back-end executing anything), but they can all share the same ret_frob block.
If you left out the lfence, you could use the above retpoline-like technique to trigger speculative execution into the function. This would let you jump to any target branch in the target function with whatever args you like in registers, so you can mis-prime branch prediction however you like.
Footnote 1:
Modern branch predictors aren't just "cold", they contain predictions
from whatever aliased the target virtual addresses in the various branch-prediction data structures. (At least on Intel where SnB-family pretty definitely uses TAGE prediction.)
So you should decide whether you want to specifically anti-prime the branch predictors by (speculatively) executing the branches in your function with bogus data in registers / flags, or whether your micro-benchmarking environment resembles the surrounding conditions of the real program closely enough.
If your target function has enough branching in a very specific complex pattern (like a branchy sort function over 10 integers), then presumably only that exact input can train the branch predictor well, so any initial state other than a specially-warmed-up state is probably fine.
You may not want the uop-cache primed at all, to look for cold-execution effects in general (like decode), so that might rule out any speculative fetch / decode, not just speculative execution. Or maybe speculative decode is ok if you then pollute the uop cache by running some long-NOPs or times 800 xor eax,eax (2-byte instructions, so 16 per 32-byte block, which fills all 3 uop-cache ways that SnB-family allows per block without overflowing and becoming unable to fit in the uop cache at all). But not so many that you evict L1I$ as well.
Even speculative decode without execute will prime the front-end branch prediction that knows where branches are ahead of decode, though. I'd guess a ret (or jmp rel32) at the end of the previous cache line is relatively harmless for that, since it's outside the function under test.
Map the same physical page to two different virtual addresses.
L1I$ is physically addressed. (VIPT but with all the index bits from below the page offset, so effectively PIPT).
Branch-prediction and uop caches are virtually addressed, so with the right choice of virtual addresses, a warm-up run of the function at the alternate virtual address will prime L1I, but not branch prediction or uop caches. (This only works if branch aliasing happens modulo something larger than 4096 bytes, because the position within the page is the same for both mappings.)
Prime the iTLB by calling to a ret in the same page as the test function, but outside it.
After setting this up, no modification of the page tables are required between the warm-up run and the timing run. This is why you use two mappings of the same page instead of remapping a single mapping.
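A hypothetical sketch of the double mapping using raw Linux syscalls, assuming the code under test sits at offset 0 of a file whose fd you've put in r12 (a private read+exec mapping of the same file shares the physical page as long as nothing writes it):
%define SYS_mmap    9
%define PROT_RX     0x5            ; PROT_READ | PROT_EXEC
%define MAP_PRIVATE 0x2
map_code_page:                     ; call twice: warm up through one mapping, time through the other
    mov  eax, SYS_mmap
    xor  edi, edi                  ; addr = NULL: let the kernel pick (or pass a hint to control aliasing)
    mov  esi, 4096                 ; one page
    mov  edx, PROT_RX
    mov  r10d, MAP_PRIVATE
    mov  r8, r12                   ; fd of the file containing the code
    xor  r9d, r9d                  ; offset 0
    syscall                        ; rax = mapping address (or -errno)
    ret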
Margaret Bloom suggests that CPUs vulnerable to Meltdown might speculatively fetch instructions from a no-exec page if you jump there (in the shadow of a mispredict so it doesn't actually fault), but that would then require changing the page table, and thus a system call which is expensive and might evict that line of L1I. But if it doesn't pollute the iTLB, you could then re-populate the iTLB entry with a mispredicted branch anywhere into the same page as the function. Or just a call to a dummy ret outside the function in the same page.
None of this will let you get the uop cache warmed up, though, because it's virtually addressed. OTOH, in real life, if branch predictors are cold then probably the uop cache will also be cold.
One approach that could work for small functions would be to execute some code which appears on the same cache line(s) as your target function, which will bring in the entire cache line.
For example, you could organize your code as follows:
ALIGN 64
function_under_test:
; some code, less than 64 bytes
dummy:
ret
and then call the dummy function prior to calling function_under_test - if dummy starts on the same cache line as the target function, it would bring the entire cache line into L1I. This works for functions of 63 bytes or less1.
This can probably be extended to functions up to ~126 bytes or so by using this trick both before2 and after the target function. You could extend it to arbitrarily sized functions by inserting dummy functions on every cache line and having the target code jump over them, but this comes at the cost of inserting otherwise-unnecessary jumps into your code under test, and requires careful control over the code size so that the dummy functions are placed correctly.
You need fine control over function alignment and placement to achieve this: assembler is probably the easiest, but you can also probably do it with C or C++ in combination with compiler-specific attributes.
1 You could even reuse the ret in the function_under_test itself to support slightly longer functions (e.g., those whose ret starts within 64 bytes of the start).
2 You'd have to be more careful about the dummy function appearing before the code under test: the processor might fetch instructions past the ret and it might (?) even execute them. A ud2 after the dummy ret is likely to block further fetch (but you might want fetch if populating the uop cache is important).
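A sketch of that footnote-2 layout (dummy function before the code under test, with ud2 to discourage fetch/decode from running on into it during the warm-up call):
ALIGN 64
dummy_before:
    ret                    ; call this once to pull the shared cache line into L1I
    ud2                    ; likely stops further fetch/decode past the dummy's ret
function_under_test:
    ; ... code under test, sharing dummy_before's cache line ...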
It is a known fact that x86-64 instructions do not support 64-bit immediate values (except for mov). Hence, when migrating code from 32 to 64 bits, an instruction like this:
cmp rax, addr32
cannot be replaced with the following:
cmp rax, addr64
Under these circumstances, I'm considering two alternatives: (a) using a scratch register for loading the constant or (b) using rip-relative addressing. The two approaches look like this:
mov r11, addr64 ; scratch register
cmp rax, r11
ptr64: dq addr64
...
cmp rax, [rel ptr64] ; encoded as cmp rax, [rip+offset]
I wrote a very simple loop to compare the performance of both approaches (which I paste below). While (b) uses an indirect pointer, (a) has the immediate encoded in the instruction (which could lead to a worse usage of i-cache). Surprisingly, I found that (b) ran ~10% faster than (a). Is this result something to be expected in more common real-world code?
true: dq 0xFFFF0000FFFF0000
false: dq 0xAAAABBBBAAAABBBB
main:
or rax, 1 ; rax is odd and constant "true" is even
mov rcx, 0x1
shl rcx, 30
branch:
mov r11, 0xFFFF0000FFFF0000 ; not present in (b)
cmp rax, r11 ; vs cmp rax, [rel true]
je next
add rax, 2
loop branch
next:
mov rax, 0
ret
Surprisingly, I found that (b) ran ~10% faster than (a)
You probably tested on a CPU other than AMD Bulldozer-family or Ryzen, which have a fast loop instruction. On other CPUs, loop is very slow, mostly on purpose for historical reasons, so you bottleneck on it. e.g. 7 uops, one per 5c throughput on Haswell.
mov r64, imm64 is bad for uop cache throughput because of the large immediate taking 2 slots in Intel's uop cache. (See the Sandybridge uop cache section in Agner Fog's microarch pdf), and Which is faster, imm64 or m64 for x86-64? where I listed the details.
Even apart from that, it's not too surprising that 1 extra uop in the loop makes it run slower. You're probably not on an AMD CPU (with single-uop / 1 per 2 clock loop), because the extra mov in such a tiny loop would make more than 10% difference. Or no difference at all, since it's just 3 vs. 4 uops per 2 clocks, if it's correct that even tiny loops are limited to one jump per 2 clocks.
On Intel, loop is 7 uops, one per 5 clocks throughput on most CPUs, so the 4-per-clock issue/rename bottleneck won't be what you're hitting. loop is micro-coded, so the front-end can't run from the loop buffer. (And Skylake CPUs have their LSD disabled by a microcode update to fix the partial-register erratum anyway.) So the mov r64,imm64 uop has to be re-read from the uop cache every time through the loop.
A load that hits in cache has very good throughput (2 loads per clock, and in this case micro-fusion means no extra uops to use a memory operand instead of register for cmp). So the main penalty in using a constant from memory is the extra cache footprint and cache misses, but your microbenchmark won't reveal that at all. It also has no other pressure on the load ports.
In the general case:
If possible, use a RIP-relative lea to generate 64-bit address constants.
e.g. lea rax, [rel addr64]. Yes, this takes an extra instruction to get the constant into a register. (BTW, just use default rel. You can use [abs fs:0] if you need it.)
You can avoid the extra instruction if you build position-dependent code with the default (small) code model, so static addresses fit in the low 32 bits of virtual address space and can be used as immediates. (Actually low 2GiB, so sign or zero extending both work). See 32-bit absolute addresses no longer allowed in x86-64 Linux? if gcc complains about absolute addressing; -pie is enabled by default on most distros. This of course doesn't work in Linux shared libraries, which only support text relocations for 64-bit addresses. But you should avoid relocations whenever possible by using lea to make position-independent code.
Most integer build-time constants fit in 32 bits, so you can use cmp r64, imm32 or cmp r32, imm32 even in PIC code.
If you do need a 64-bit non-address constant, try to hoist the mov r64, imm64 out of a loop. Your cmp loop would have been fine if the mov wasn't inside the loop. x86-64 has enough registers that you (or the compiler) can usually avoid reloads inside inner-most loops in integer code.
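i.e. something like this sketch of your loop with the constant hoisted (still using loop for comparability, even though dec rcx / jnz would be faster on Intel):
    mov  rcx, 1
    shl  rcx, 30                   ; iteration count, as in your code
    mov  r11, 0xFFFF0000FFFF0000   ; load the 64-bit constant once, outside the loop
branch:
    cmp  rax, r11
    je   next
    add  rax, 2
    loop branch
next: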
I've got some simple 32bit code which computes the product of an array of 32bit integers. The inner loop looks like this:
##loop:
mov esi,[ebx]
mov [esp],esi
imul eax,[esp]
add ebx, 4
dec edx
jnz ##loop
What I'm trying to understand is why the above code is 6% faster than these two versions of the code, which does not perform the redundant memory round-trip:
##loop:
mov esi,[ebx]
imul eax,esi
add ebx, 4
dec edx
jnz ##loop
and
##loop:
imul eax,[ebx]
add ebx, 4
dec edx
jnz ##loop
The two latter pieces of code execute in virtually the same time, and as mentioned both are 6% slower than the first piece (165ms vs 155ms, 200 million elements).
I've tried manually aligning the jump target to a 16 byte boundary, but it makes no difference.
I'm running this on an Intel i7 4770k, Windows 10 x64.
Note: I know the code could be improved by doing all sorts of optimizations, however I'm only interested in the performance difference between the above pieces of code.
I suspect but can't be sure that you are preventing a stall on a data dependency:
The code looks like this:
##loop:
mov esi,[ebx] # (1)Load the memory location to esi reg
(mov [esp],esi) # (1)optionally store the location on the stack
imul eax,[esp] # (3) Perform the multiplication
add ebx, 4 # (1) Add 4
dec edx # (1)decrement counter
jnz ##loop # (0**) loop
Those numbers in brackets are the latencies of the instructions ... that jump is 0 if the branch predictor guesses correctly (which since it will mostly loop it will most of the time).
So: while the multiplication is still going (3 cycles) we get back to the top of the loop after 2 and try to load into the register again, and have to stall. Or we could do a store ... which we can do at the same time as our multiplication and then not stall at all.
What about the dummy store, you ask? Why does that work? Notice you are storing to memory the critical value that we are using in the multiply. Thus the processor can use this value which is being stored in memory and clobber the register.
So why can't the processor do this anyway? The processor can't produce more memory accesses than you ask it to or it could interfere with multi-processor programs (imagine that cache line that you are writing to is shared and you have to invalidate it on other CPUs every loop by writing to it ... ouch!).
All of this is pure speculation, but it seems to match all the evidence (your code and my knowledge of the intel architecture ... and x86 assembly). Hopefully someone can point out if I have something wrong.