Why is a conditional move not vulnerable to Branch Prediction Failure? - performance

After reading this post (answer on StackOverflow) (at the optimization section), I was wondering why conditional moves are not vulnerable for Branch Prediction Failure. I found on an article on cond moves here (PDF by AMD). Also there, they claim the performance advantage of cond. moves. But why is this? I don't see it. At the moment that that ASM-instruction is evaluated, the result of the preceding CMP instruction is not known yet.

Mis-predicted branches are expensive
A modern processor generally executes between one and three instructions each cycle if things go well (if it does not stall waiting for data dependencies for these instructions to arrive from previous instructions or from memory).
The statement above holds surprisingly well for tight loops, but this shouldn't blind you to one additional dependency that can prevent an instruction to be executed when its cycle comes:
for an instruction to be executed, the processor must have started to fetch and decode it 15-20 cycles before.
What should the processor do when it encounters a branch? Fetching and decoding both targets does not scale (if more branches follow, an exponential number of paths would have to be fetched in parallel). So the processor only fetches and decodes one of the two branches, speculatively.
This is why mis-predicted branches are expensive: they cost the 15-20 cycles that are usually invisible because of an efficient instruction pipeline.
Conditional move is never very expensive
Conditional move does not require prediction, so it can never have this penalty. It has data dependencies, same as ordinary instructions. In fact, a conditional move has more data dependencies than ordinary instructions, because the data dependencies include both “condition true” and “condition false” cases. After an instruction that conditionally moves r1 to r2, the contents of r2 seem to depend on both the previous value of r2 and on r1. A well-predicted conditional branch allows the processor to infer more accurate dependencies. But data dependencies typically take one-two cycles to arrive, if they need time to arrive at all.
Note that a conditional move from memory to register would sometimes be a dangerous bet: if the condition is such that the value read from memory is not assigned to the register, you have waited on memory for nothing. But the conditional move instructions offered in instruction sets are typically register to register, preventing this mistake on the part of the programmer.

It is all about the instruction pipeline. Remember, modern CPUs run their instructions in a pipeline, which yields a significant performance boost when the execution flow is predictable by the CPU.
cmov
add eax, ebx
cmp eax, 0x10
cmovne ebx, ecx
add eax, ecx
At the moment that that ASM-instruction is evaluated, the result of the preceding CMP instruction is not known yet.
Perhaps, but the CPU still knows that the instruction following the cmov will be executed right after, regardless of the result from the cmp and cmov instruction. The next instruction may thus safely be fetched/decoded ahead of time, which is not the case with branches.
The next instruction could even execute before the cmov does (in my example this would be safe)
branch
add eax, ebx
cmp eax, 0x10
je .skip
mov ebx, ecx
.skip:
add eax, ecx
In this case, when the CPU's decoder sees je .skip it will have to choose whether to continue prefetching/decoding instructions either 1) from the next instruction, or 2) from the jump target. The CPU will guess that this forward conditional branch won't happen, so the next instruction mov ebx, ecx will go into the pipeline.
A couple of cycles later, the je .skip is executed and the branch is taken. Oh crap! Our pipeline now holds some random junk that should never be executed. The CPU has to flush all its cached instructions and start fresh from .skip:.
That is the performance penalty of mispredicted branches, which can never happen with cmov since it doesn't alter the execution flow.

Indeed the result may not yet be known, but if other circumstances permit (in particular, the dependency chain) the cpu can reorder and execute instructions following the cmov. Since there is no branching involved, those instructions need to be evaluated in any case.
Consider this example:
cmoveq edx, eax
add ecx, ebx
mov eax, [ecx]
The two instructions following the cmov do not depend on the result of the cmov, so they can be executed even while the cmov itself is pending (this is called out of order execution). Even if they can't be executed, they can still be fetched and decoded.
A branching version could be:
jne skip
mov edx, eax
skip:
add ecx, ebx
mov eax, [ecx]
The problem here is that control flow is changing and the cpu isn't clever enough to see that it could just "insert" the skipped mov instruction if the branch was mispredicted as taken - instead it throws away everything it did after the branch, and restarts from scratch. This is where the penalty comes from.

You should read these. With Fog+Intel, just search for CMOV.
Linus Torvald's critique of CMOV circa 2007
Agner Fog's comparison of microarchitectures
Intel® 64 and IA-32 Architectures Optimization Reference Manual
Short answer, correct predictions are 'free' while conditional branch mispredicts can cost 14-20 cycles on Haswell. However, CMOV is never free. Still I think CMOV is a LOT better now than when Torvalds ranted. There is no single one correct for all time on all processors ever answer.

I have this illustration from [Peter Puschner et al.] slide which explains how it transforms into single path code, and speedup the execution.

Related

Cost of instruction jumps in assembly

I've had always been curious about the cost of jumps in assembly.
cmp ecx, edx
je SOME_LOCATION # What's the cost of this jump?
Does it need to do a search in a lookup table for each jumps or how does it work?
No, a jump doesn’t do a search. The assembler resolves the label to an address, which in most cases is then converted to an offset from the current instruction. The address or offset is encoded in the instruction. At run time, the processor loads the address into the IP register or adds the offset to the current value of the IP register (along with all the other effects discussed by #Brendan).
There is a type of jump instruction that can be used to get the destination from a table. The jump instruction reads the address from a memory location. (The instruction specifies a single location, so there still is no “search”.) This instruction could look something like this:
jmp table[eax*4]
where eax is the index of the entry in the table containing the address to jump to.
Originally (e.g. 8086) the cost of a jump wasn't much different to the cost of a mov.
Later CPUs added caches, which meant some jumps were faster (because the code they jump to is in the cache) and some jumps were slower (because the code they jump to isn't in the cache).
Even later CPUs added "out of order" execution, where conditional branches (e.g. je SOME_LOCATION) would have to wait until the flags from "previous instructions that happen to be executed in parallel" became known.
This means that a sequence like
mov esi, edi
cmp ecx, edx
je SOME_LOCATION
can be slower than rearranging it to
cmp ecx, edx
mov esi, edi
je SOME_LOCATION
to increase the chance that the flags would be known.
Even later CPUs added speculative execution. In this case, for conditional branches the CPU just takes a guess at where it will branch to before it actually knows (e.g. before the flags are known), and if it guesses wrong it'll just pretend that it didn't execute the wrong instructions. More specifically, the speculatively executed instructions are tagged at the start of the pipeline and held at the end of the pipeline (at retirement) until the CPU knows if they can be committed to visible state or if they have to be discarded.
After that things just got more complicated, with fancier methods of doing branch prediction, additional "branch target" buffers, etc.
Far jumps that change the code segment are more expensive. In real mode it's not so bad because the CPU mostly only does "CS.base = value * 16" when CS is changed. For protected mode it's a table lookup (to find GDT or LDT entry), decoding the entry, deciding what to do based on what kind of entry it is, then a pile of protection checks. For long mode it's vaguely similar. All of this adds more uncertainty (e.g. with the table entry be in cache?).
On top of all of this there's things like TLB misses. For example, jmp [indirectAddress] can cause a TLB miss at indirectAddress then a TLB miss at the stack top then a TLB miss at the new instruction pointer; where each TLB miss can cost a few hundred cycles.
Mostly; the cost of a jump can be anything from 0 cycles (for a correctly predicted jump) to maybe 1000 cycles; depending on which CPU it is, what kind of jump, what is in caches, what branch prediction predicts, cache/TLB misses, how fast/slow RAM is, and anything I may have forgotten.

Bring code into the L1 instruction cache without executing it

Let's say I have a function that I plan to execute as part of a benchmark. I want to bring this code into the L1 instruction cache prior to executing since I don't want to measure the cost of I$ misses as part of the benchmark.
The obvious way to do this is to simply execute the code at least once before the benchmark, hence "warming it up" and bringing it into the L1 instruction cache and possibly the uop cache, etc.
What are my alternatives in the case I don't want to execute the code (e.g., because I want the various predictors which key off of instruction addresses to be cold)?
In Granite Rapids and later, PREFETCHIT0 [rip+rel32] to prefetch code into "all levels" of cache, or prefetchit1 to prefetch into all levels except L1i. These instructions are a NOP with an addressing-mode other than RIP-relative, or on CPUs that don't support them. (Perhaps they also prime iTLB or even uop cache, or at least could on paper, in which case this isn't what you want.) The docs in Intel's "future extensions" manual as of 2022 Dec recommends that the target address be the start of some instruction.
Note that this Q&A is about priming things for a microbenchmark. Not things that would be worth doing to improve overall performance. For that, probably just best-effort prefetch into L2 cache (the inner-most unified cache) with prefetcht1 SW prefetch, also priming the dTLB in a way that possibly helps the iTLB (although it will evict a possibly-useful dTLB entry).
Or not, if the L2TLB is a victim cache. see X86 prefetching optimizations: "computed goto" threaded code for more discussion.
Caveat: some of this answer is Intel-centric. If I just say "the uop cache", I'm talking about Sandybridge-family. I know Ryzen has a uop-cache too, but I haven't read much of anything about its internals, and only know some basics about Ryzen from reading Agner Fog's microarch guide.
You can at least prefetch into L2 with software prefetch, but that doesn't even necessarily help with iTLB hits. (The 2nd-level TLB is a victim cache, I think, so a dTLB miss might not populate anything that the iTLB checks.)
But this doesn't help with L1I$ misses, or getting the target code decoded into the uop cache.
If there is a way to do what you want, it's going to be with some kind of trick. x86 has no "standard" way to do this; no code-prefetch instruction. Darek Mihoka wrote about code-prefetch as part of a CPU-emulator interpreter loop: The Common CPU Interpreter Loop Revisited: the Nostradamus Distributor back in 2008 when P4 and Core2 were the hot CPU microarchitectures to tune for.
But of course that's a different case: the goal is sustained performance of indirect branches, not priming things for a benchmark. It doesn't matter if you spend a lot of time achieving the microarchitectural state you want outside the timed portion of a micro-benchmark.
Speaking of which, modern branch predictors aren't just "cold", they always contain some prediction based on aliasing1. This may or may not be important.
Prefetch the first / last lines (and maybe others) with call to a ret
I think instruction fetch / prefetch normally continues past an ordinary ret or jmp, because it can't be detected until decode. So you could just call a function that ends at the end of the previous cache line. (Make sure they're in the same page, so an iTLB miss doesn't block prefetch.)
ret after a call will predict reliably if no other call did anything to the return-address predictor stack, except in rare cases if an interrupt happened between the call and ret, and the kernel code had a deep enough call tree to push the prediction for this ret out of the RSB (return-stack-buffer). Or if Spectre mitigation flushed it intentionally on a context switch.
; make sure this is in the same *page* as function_under_test, to prime the iTLB
ALIGN 64
; could put the label here, but probably better not
60 bytes of (long) NOPs or whatever padding
prime_Icache_first_line:
ret
; jmp back_to_benchmark_startup ; alternative if JMP is handled differently than RET.
lfence ; prevent any speculative execution after RET, in case it or JMP aren't detected as soon as they decode
;;; cache-line boundary here
function_under_test:
...
prime_Icache_last_line: ; label the last RET in function_under_test
ret ; this will prime the "there's a ret here" predictor (correctly)
benchmark_startup:
call prime_Icache_first_line
call prime_Icache_first_line ; IDK if calling twice could possibly help in case prefetch didn't get far the first time? But now the CPU has "seen" the RET and may not fetch past it.
call prime_Icache_last_line ; definitely only need to call once; it's in the right line
lfence
rdtsc
.timed_loop:
call function_under_test
...
jnz .time_loop
We can even extend this technique to more than 2 cache lines by calling to any 0xC3 (ret) byte inside the body of function_under_test. But as #BeeOnRope points out, that's dangerous because it may prime branch prediction with "there's a ret here" causing a mispredict you otherwise wouldn't have had when calling function_under_test for real.
Early in the front-end, branch prediction is needed based on fetch-block address (which block to fetch after this one), not on individual branches inside each block, so this could be a problem even if the ret byte was part of another instruction.
But if this idea is viable, then you can look for a 0xc3 byte as part of an instruction in the cache line, or at worst add a 3-byte NOP r/m32 (0f 1f c3 nop ebx,eax). c3 as a ModR/M encodes a reg,reg instruction (with ebx and eax as operands), so it doesn't have to be hidden in a disp8 to avoid making the NOP even longer, and it's easy to find in short instructions: e.g. 89 c3 mov ebx,eax, or use the other opcode so the same modrm byte gives you mov eax,ebx. Or 83 c3 01 add ebx,0x1, or many other instructions with e/rbx, bl (and r/eax or al).
With a REX prefix, those can be you have a choice of rbx / r11 (and rax/r8 for the /r field if applicable). It's likely you can choose (or modify for this microbenchmark) your register allocation to produce an instruction using the relevant registers to produce a c3 byte without any overhead at all, especially if you can use a custom calling convention (at least for testing purposes) so you can clobber rbx if you weren't already saving/restoring it.
I found these by searching for (space)c3(space) in the output of objdump -d /bin/bash, just to pick a random not-small executable full of compiler-generated code.
Evil hack: end the cache line before with the start of a multi-byte instruction.
; at the end of a cache line
prefetch_Icache_first_line:
db 0xe9 ; the opcode for 5-byte jmp rel32
function_under_test:
... normal code ; first 4 bytes will be treated as a rel32 when decoding starts at prefetch_I...
ret
; then at function_under_test+4 + rel32:
;org whatever (that's not how ORG works in NASM, so actually you'll need a linker script or something to put code here)
prefetch_Icache_branch_target:
jmp back_to_test_startup
So it jumps to a virtual address which depends on the instruction bytes of function_under_test. Map that page and put code in it that jumps back to your benchmark-prep code. The destination has to be within 2GiB, so (in 64-bit code) it's always possible to choose a virtual address for function_under_test that makes the destination a valid user-space virtual address. Actually, for many rel32 values, it's possible to choose the address of function_under_test to keep both it and the target within the low 2GiB of virtual address space, (and definitely 3GiB) and thus valid 32-bit user-space addresses even under a 32-bit kernel.
Or less crazy, using the end of a ret imm16 to consume a byte or two, just requiring a fixup of RSP after return (and treating whatever is temporarily below RSP as a "red zone" if you don't reserve extra space):
; at the end of a cache line
prefetch_Icache_first_line:
db 0xc2 ; the opcode for 3-byte ret imm16
; db 0x00 ; optional: one byte of the immediate at least keeps RSP aligned
; But x86 is little-endian, so the last byte is most significant
;; Cache-line boundary here
function_under_test:
... normal code ; first 2 bytes will be added to RSP when decoding starts at prefetch_Icache_first_line
ret
prefetch_caller:
push rbp
mov rbp, rsp ; save old RSP
;sub rsp, 65536 ; reserve space in case of the max RSP+imm16.
call prefetch_Icache_first_line
;;; UNSAFE HERE in case of signal handler if you didn't SUB.
mov rsp, rbp ; restore RSP; if no signal handlers installed, probably nothing could step on stack memory
...
pop rbp
ret
Using sub rsp, 65536 before calling to the ret imm16 makes it safe even if there might be a signal handler (or interrupt handler in kernel code, if your kernel stack is big enough, otherwise look at the actual byte and see how much will really be added to RSP). It means that call's push/store will probably miss in data cache, and maybe even cause a pagefault to grow the stack. That does happen before fetching the ret imm16, so that won't evict the L1I$ line we wanted to prime.
This whole idea is probably unnecessary; I think the above method can reliably prefetch the first line of a function anyway, and this only works for the first line. (Unless you put a 0xe9 or 0xc2 in the last byte of each relevant cache line, e.g. as part of a NOP if necessary.)
But this does give you a way to non-speculatively do code-fetch from from the cache line you want without architecturally executing any instructions in it. Hopefully a direct jmp is detected before any later instructions execute, and probably without any others even decoding, except ones that decoded in the same block. (And an unconditional jmp always ends a uop-cache line on Intel). i.e. the mispredict penalty is all in the front-end from re-steering the fetch pipeline as soon as decode detects the jmp. I hope ret is like this too, in cases where the return-predictor stack is not empty.
A jmp r/m64 would let you control the destination just by putting the address in the right register or memory. (Figure out what register or memory addressing mode the first byte(s) of function_under_test encode, and put an address there). The opcode is FF /4, so you can only use a register addressing mode if the first byte works as a ModRM that has /r = 4 and mode=11b. But you could put the first 2 bytes of the jmp r/m64 in the previous line, so the extra bytes form the SIB (and disp8 or disp32). Whatever those are, you can set up register contents such that the jump-target address will be loaded from somewhere convenient.
But the key problem with a jmp r/m64 is that default-prediction for an indirect branch can fall through and speculatively execute function_under_test, affecting the branch-prediction entries for those branches. You could have bogus values in registers so you prime branch prediction incorrectly, but that's still different from not touching them at all.
How does this overlapping-instructions hack to consume bytes from the target cache line affect the uop cache?
I think (based on previous experimental evidence) Intel's uop cache puts instructions in the uop-cache line that corresponds to their start address, in cases where they cross a 32 or 64-byte boundary. So when the real execution of function_under_test begins, it will simply miss in the uop-cache because no uop-cache line is caching the instruction-start-address range that includes the first byte of function_under_test. i.e. the overlapping decode is probably not even noticed when it's split across an L1I$ boundary this way.
It is normally a problem for the uop cache to have the same bytes decode as parts of different instructions, but I'm optimistic that we wouldn't have a penalty in this case. (I haven't double-checked that for this case. I'm mostly assuming that lines record which range of start-addresses they cache, and not the whole range of x86 instruction bytes they're caching.)
Create mis-speculation to fetch arbitrary lines, but block exec with lfence
Spectre / Meltdown exploits and mitigation strategies provide some interesting ideas: you could maybe trigger a mispredict that fetches at least the start of the code you want, but maybe doesn't speculate into it.
lfence blocks speculative execution, but (AFAIK) not instruction prefetch / fetch / decode.
I think (and hope) the front-end will follow direct relative jumps on its own, even after lfence, so we can use jmp target_cache_line in the shadow of a mispredict + lfence to fetch and decode but not execute the target function.
If lfence works by blocking the issue stage until the reservation station (OoO scheduler) is empty, then an Intel CPU should probably decode past lfence until the IDQ is full (64 uops on Skylake). There are further buffers before other stages (fetch -> instruction-length-decode, and between that and decode), so fetch can run ahead of that. Presumably there's a HW prefetcher that runs ahead of where actual fetch is reading from, so it's pretty plausible to get several cache lines into the target function in the shadow of a single mispredict, especially if you introduce delays before the mispredict can be detected.
We can use the same return-address frobbing as a retpoline to reliably trigger a mispredict to jmp rel32 which sends fetch into the target function. (I'm pretty sure a re-steer of the front-end can happen in the shadow of speculative execution without waiting to confirm correct speculation, because that would make every re-steer serializing.)
function_under_test:
...
some_line: ; not necessarily the first cache line
...
ret
;;; This goes in the same page as the test function,
;;; so we don't iTLB-miss when trying to send the front-end there
ret_frob:
xorps xmm0,xmm0
movq xmm1, rax
;; The point of this LFENCE is to make sure the RS / ROB are empty so the front-end can run ahead in a burst.
;; so the sqrtpd delay chain isn't gradually issued.
lfence
;; alternatively, load the return address from the stack and create a data dependency on it, e.g. and eax,0
;; create a many-cycle dependency-chain before the RET misprediction can be detected
times 10 sqrtpd xmm0,xmm0 ; high latency, single uop
orps xmm0, xmm1 ; target address with data-dep on the sqrtpd chain
movq [rsp], xmm0 ; overwrite return address
; movd [rsp], xmm0 ; store-forwarding stall: do this *as well* as the movq
ret ; mis-speculate to the lfence/jmp some_line
; but architecturally jump back to the address we got in RAX
prefetch_some_line:
lea rax, [rel back_to_bench_startup]
; or pop rax or load it into xmm1 directly,
; so this block can be CALLed as a function instead of jumped to
call ret_frob
; speculative execution goes here, but architecturally never reached
lfence ; speculative *execution* stops here, fetch continues
jmp some_line
I'm not sure the lfence in ret_frob is needed. But it does make it easier to reason about what the front-end is doing relative to the back-end. After the lfence, the return address has a data dependency on the chain of 10x sqrtpd. (10x 15 to 16 cycle latency on Skylake, for example). But the 10x sqrtpd + orps + movq only take 3 cycles to issue (on 4-wide CPUs), leaving at least 148 cycles + store-forwarding latency before ret can read the return address back from the stack and discover that the return-stack prediction was wrong.
This should be plenty of time for the front-end to follow the jmp some_line and load that line into L1I$, and probably load several lines after that. It should also get some of them decoded into the uop cache.
You need a separate call / lfence / jmp block for each target line (because the target address has to be hard-coded into a direct jump for the front-end to follow it without the back-end executing anything), but they can all share the same ret_frob block.
If you left out the lfence, you could use the above retpoline-like technique to trigger speculative execution into the function. This would let you jump to any target branch in the target function with whatever args you like in registers, so you can mis-prime branch prediction however you like.
Footnote 1:
Modern branch predictors aren't just "cold", they contain predictions
from whatever aliased the target virtual addresses in the various branch-prediction data structures. (At least on Intel where SnB-family pretty definitely uses TAGE prediction.)
So you should decide whether you want to specifically anti-prime the branch predictors by (speculatively) executing the branches in your function with bogus data in registers / flags, or whether your micro-benchmarking environment resembles the surrounding conditions of the real program closely enough.
If your target function has enough branching in a very specific complex pattern (like a branchy sort function over 10 integers), then presumably only that exact input can train the branch predictor well, so any initial state other than a specially-warmed-up state is probably fine.
You may not want the uop-cache primed at all, to look for cold-execution effects in general (like decode), so that might rule out any speculative fetch / decode, not just speculative execution. Or maybe speculative decode is ok if you then run some uop-cache-polluting long-NOPs or times 800 xor eax,eax (2-byte instructions -> 16 per 32-byte block uses up all 3 entries that SnB-family uop caches allow without running out of room and not being able to fit in the uop cache at all). But not so many that you evict L1I$ as well.
Even speculative decode without execute will prime the front-end branch prediction that knows where branches are ahead of decode, though. I think that a ret (or jmp rel32) at the end of the previous cache line
Map the same physical page to two different virtual addresses.
L1I$ is physically addressed. (VIPT but with all the index bits from below the page offset, so effectively PIPT).
Branch-prediction and uop caches are virtually addressed, so with the right choice of virtual addresses, a warm-up run of the function at the alternate virtual address will prime L1I, but not branch prediction or uop caches. (This only works if branch aliasing happens modulo something larger than 4096 bytes, because the position within the page is the same for both mappings.)
Prime the iTLB by calling to a ret in the same page as the test function, but outside it.
After setting this up, no modification of the page tables are required between the warm-up run and the timing run. This is why you use two mappings of the same page instead of remapping a single mapping.
Margaret Bloom suggests that CPUs vulnerable to Meltdown might speculatively fetch instructions from a no-exec page if you jump there (in the shadow of a mispredict so it doesn't actually fault), but that would then require changing the page table, and thus a system call which is expensive and might evict that line of L1I. But if it doesn't pollute the iTLB, you could then re-populate the iTLB entry with a mispredicted branch anywhere into the same page as the function. Or just a call to a dummy ret outside the function in the same page.
None of this will let you get the uop cache warmed up, though, because it's virtually addressed. OTOH, in real life, if branch predictors are cold then probably the uop cache will also be cold.
One approach that could work for small functions would be to execute some code which appears on the same cache line(s) as your target function, which will bring in the entire cache line.
For example, you could organize your code as follows:
ALIGN 64
function_under_test:
; some code, less than 64 bytes
dummy:
ret
and then call the dummy function prior to calling function_under_test - if dummy starts on the same cache line as the target function, it would bring the entire cache line into L1I. This works for functions of 63 bytes or less1.
This can probably be extended to functions up to ~126 bytes or so by using this trick both at before2 and after the target function. You could extend it to arbitrarily sized functions by inserting dummy functions on every cache line and having the target code jump over them, but this comes at a cost of inserting the otherwise-unnecessary jumps your code under test, and requires careful control over the code size so that the dummy functions are placed correctly.
You need fine control over function alignment and placement to achieve this: assembler is probably the easiest, but you can also probably do it with C or C++ in combination with compiler-specific attributes.
1 You could even reuse the ret in the function_under_test itself to support slightly longer functions (e.g., those whose ret starts within 64 bytes of the start).
2 You'd have to be more careful about the dummy function appearing before the code under test: the processor might fetch instructions past the ret and it might (?) even execute them. A ud2 after the dummy ret is likely to block further fetch (but you might want fetch if populating the uop cache is important).

In x86 assembly, is it better to use two separate registers for imul?

I am wondering, mostly out of curiosity, if using the same register for an operation is better than using two. What would be better, considering performance and/or other concerns?
mov %rbx, %rcx
imul %rcx, %rcx
or
mov %rbx, %rcx
imul %rbx, %rcx
Any tips for how to benchmark this, or resources where I could read about this type of thing would be appreciated, as I am new to assembly.
resources where I could read about this type of thing
See Agner Fog's microarch pdf, and his optimizing assembly guide. Also other links in the x86 tag wiki (e.g. Intel's optimization manual).
The interesting option you didn't mention is:
mov %rbx, %rcx
imul %rbx, %rbx # doesn'y have to wait for mov to execute
# old value of %rbx is still available in %rcx
If the imul is on the critical path, and mov has non-zero latency (like on AMD CPUs, and Intel before IvyBridge), this is potentially better. The result of imul will be ready one cycle earlier, because has no dependency on the result of the mov.
If, however, the old value is on the critical path and the squared value isn't, then this is worse because it adds a mov to the critical path.
Of course, it also means you have to keep track of the fact that your old variable is now live in a different register, and the old register has the squared value. If this is a problem in a loop, unroll it so you can end up with things where the top of the loop is expecting them. If you wanted this to be easy, you'd use a compiler instead of optimizing asm by hand.
However, Intel P6-family CPUs (PPro/PII to Nehalem) have limited register-read ports, so it can be better to favour reading registers that you just wrote. If the %rbx wasn't written in the last couple cycles, it will have to be read from the permanent register file when the mov and imul uops go through the rename&issue stage (the RAT).
If they don't issue as part of the same group of 4, then they would each need to read %rbx separately. Since the register file in Core2/Nehalem only has 3 read ports, issue groups (quartets, as Agner Fog calls them) stall until all their not-recently-written input register values are read from the register file (at 3 per cycle, or 2 on Core2 is none of the 3 regs are index regs in an addressing mode).
For the full details, see Agner Fog's microarch pdf section 8.8. The Core2 section refers back to the PPro section. PPro has a 3-wide pipeline, so in that section Agner talks about triplets, not quartets.
If mov and imul issue together, they both share the same read of %rbx. There's a 3 in 4 chance of this happening on Core2/Nehalem.
Choosing just between the sequences you mention the first one has a clear (but usually small) advantage over the second for Intel P6-family CPUs. There's no difference for other CPUs, AFAIK, so the choice is obvious.
mov %rbx, %rcx
imul %rcx, %rcx # uses only the recently-written rcx; can't contribute to register-read stalls
worst of both worlds:
mov %rbx, %rcx
imul %rbx, %rcx # can't execute until after the mov, but still reads a potentially-old register
If you're going to depend on a recently-written register, you might as well use only recently-written registers.
Intel Sandybridge-family uses a physical register file (like AMD Bulldozer-family), and doesn't have register-read stalls.
Ivybridge (2nd gen Sandybridge) and later also handle mov reg,reg at register rename time, with zero latency and no execution unit. This means it doesn't matter whether you imul rbx or rcx as far as critical path length.
However, AMD Bulldozer-family can only handle xmm register moves in its rename stage; integer register moves still have 1c latency.
It's potentially still worth caring about which dependency chain the mov is part of, if latency is a limiting factor in the cycles per iteration of a loop.
how to benchmark this
I think you could put together a microbenchmark that has a register read stall on Core2 with imul %rbx, %rcx, but not with imul %rcx, %rcx. However, that would require some trial and error to get the mov and imul to issue in different groups, and unless you're feeling really creative, probably some artificial-looking surrounding code that exists only to read lots of registers. (e.g. lea (%rsi, %rdi, 1), %eax, or even add (%rsi, %rdi, 1), %eax (which has to read all three registers, and does micro-fuse on core2/nehalem so it only takes 1 uop slot in an issue group. (It doesn't micro-fuse on SnB-family)).
On a modern processor, using one register for both source and destination and using two different registers will never make any difference to performance. The reason for this is partly due to register renaming which, if there were a difference in performance would solve it by changing one of the registers to a different one and modify your subsequent instructions to use the new register (your processor actually has more registers than the instruction set has a way of refering to them so that it can do stuff like this). It is also because of the nature of a pipelined processor's implementation -- the contents of source registers are read at one pipeline stage and are then written at another later stage, which makes it difficult or impossible for register usage for a single instruction to cause any kind of interaction like the one you're worrying about.
More problematic is if an instruction refers to a value produced in its previous instruction, but even that is solved (usually) by out-of-order execution.

per clock perf. - can I use different registers for same instruction?

Can I use say four general purpose registers say r8,r9,r10,r11 each with MOV instruction for independent operations and be in impression that CPU is doing all those instructions in a single clock ?
I want to know because according to Agner Fog's Instruction Table, it says reciprocal throughput of MOV instruction is 0.25. It means CPU should be able to execute 4 MOV operations per cycle. Or I misinterpreted that all ??
I am a noob and have been learning Assembly in MASM since two months (mainly for learning debugging stuffs how registers works and it is really fun).
Edit, just re-read your question, and you're asking about different registers. I'll leave in my original answer; let's pretend your question wasn't just the most trivial case. :P
Yes, even without register renaming, these instructions can all execute (on separate execution units) in the same cycle because they're completely independent of each other.
mov eax, 1
mov ebx, ecx
mov edx, [mem]
xor esi,esi ;xor-zero: doesn't even use an execution unit on SnB-family
This is the easiest case for superscalar execution. If eax/rax was the destination for all four instructions, register-renaming would still allow all four instructions to execute in parallel.
Out-of-order execution allows four nearby instructions from separate dependency chains to execute at the same time, even if they weren't decoded or issued in the same clock cycle. And they probably won't retire in the same cycle either, if there are instructions between them. (The x86 ISA guarantees precise exceptions, like most other ISAs (ARM/PPC/etc.). All current designs accomplish with in-order retirement. So if a memory op segfaults, the program will stop at exactly that instruction, not just "well, there was a segfault somewhere recently, but we can't tell you where". (That would be non-precise exceptions).)
Superscalar in-order designs like Atom, or P5 (original Pentium) can still take advantage of the parallelism in these four independent instructions, but not in many other cases.
In a hand-crafted loop, it's common for a SnB-family CPU to be able to sustain well over 3 fused-domain uops per cycle. (It's also very easy to write loops that run at less than one fused-domain uop per cycle, due to latency, to say nothing of cache misses or branch mispredicts.)
Yes, multiple writes to the same architectural register can execute in parallel. Register renaming is not a bottleneck on Intel or AMD designs.
To understand and make full use of Agner Fog's tables, you have to read his microarch guide, or at least his "optimizing assembly" guide. See also good stuff at the x86 wiki.
As Agner Fog's microarch pdf points out (section 9.8 about Intel SnB/IvB):
Register renaming is controlled by the register alias table (RAT) and
the reorder buffer (ROB), shown in figure 6.1. The μops from the
decoders and the stack engine go to the RAT via a queue and then to
the ROB-read and the reservation station. The RAT can handle 4 μops
per clock cycle. The RAT can rename four registers per clock cycle,
and it can even rename the same register four times in one clock
cycle.
read-modify-write is another story (destination of an add instruction). A read-modify-write of an architectural register is (part of) a dependency chain, while an unconditional mov or an xor-zeroing starts a new dep chain. (Same for the output of certain other instructions like lea which don't read their destination).
Those register writes still rename the architectural register to a new physical register as well. This is how CPUs handle cases like
mov eax, 1 ; start of a dep chain
mov [mem+rax+rcx], eax
inc eax ; eax renamed again
The store needs the value of eax from before the inc. It gets it because when it checks the RAT, the architectural eax is still pointing to the same physical register that the mov eax,1 wrote. The inc can't just modify that same physical register because it doesn't know what if anything is not done yet with the previous value of eax.

Instruction Level Profiling: The Meaning of the Instruction Pointer?

When profiling code at the the assembly instruction level, what does the position of the instruction pointer really mean given that modern CPUs don't execute instructions serially or in-order? For example, assume the following x64 assembly code:
mov RAX, [RBX]; // Assume a cache miss here.
mov RSI, [RBX + RCX]; // Another cache miss.
xor R8, R8;
add RDX, RAX; // Dependent on the load into RAX.
add RDI, RSI; // Dependent on the load into RSI.
Which instruction will the instruction pointer spend most of its time on? I can think of good arguments for all of them:
mov RAX, [RBX] is taking probably 100s of cycles because it's a cache miss.
mov RSI, [RBX + RCX] also takes 100s of cycles, but probably executes in parallel with the previous instruction. What does it even mean for the instruction pointer to be on one or the other of these?
xor R8, R8 probably executes out-of-order and finishes before the memory loads finish, but the instruction pointer might stay here until all previous instructions are also finished.
add RDX, RAX generates a pipeline stall because it's the instruction where the value of RAX is actually used after a slow cache-miss load into it.
add RDI, RSI also stalls because it's dependent on the load into RSI.
CPUs maintains a fiction that there are only the architectural registers (RAX, RBX, etc) and there is a specific instruction pointer (IP). Programmers and compilers target this fiction.
Yet as you noted, modern CPUs don't execute serially or in-order. Until you the programmer / user request the IP, it is like Quantum Physics, the IP is a wave of instructions being executed; all so that the processor can run the program as fast as possible. When you request the current IP (for example, via a debugger breakpoint or profiler interrupt), then the processor must recreate the fiction that you expect so it collapses this wave form (all "in flight" instructions), gathers the register values back into architectural names, and builds a context for executing the debugger routine, etc.
In this context, there is an IP that indicates the instruction where the processor should resume execution. During the out-of-order execution, this instruction was the oldest instruction yet to complete, even though at the time of the interrupt the processor was perhaps fetching instructions well past that point.
For example, perhaps the interrupt indicates mov RSI, [RBX + RCX]; as the IP, but the xor had already executed and completed; however, when the processor would resume execution after the interrupt, it will re-execute the xor.
It's a good question, but in the kind of performance tuning I do, it doesn't matter.
It doesn't really matter because what you're looking for is speed-bugs.
These are things that the code is doing that take clock time and that could be done better or not at all. Examples:
- Spending I/O time looking in DLLs for resources that don't, actually, need to be looked for.
- Spending time in memory-allocation routines making and freeing objects that could simply be re-used.
- Re-calculating things in functions that could be memo-ized.
... this is just a few off the top of my head
Your biggest enemy is a self-congratulatory tendency to say "I wouldn't consciously write any bugs. Why would I?" Of course, you know that's why you test software. But the same goes for speed-bugs, and if you don't know how to find those you assume there are none, which is a way of saying "My code has no possible speedups, except maybe a profiler can show me how to shave a few cycles."
In my half-century experience, there is no code that, as first written, contains no speed-bugs. What's more, there's an enormous multiplier effect, where every speed-bug you remove makes the remaining ones more obvious. As a contrived example, suppose bug A accounts for 90% of clock time, and bug B accounts for 9%. If you only fix B, big deal - the code is 11% faster. If you only fix A, that's good - it's 10x faster. But if you fix both, that's really good - it's 100x faster. Fixing A made B big.
So the thing you need most in performance tuning is to find the speed-bugs, and not miss any. When you've done all that, then you can get down to cycle-shaving.

Resources