What's the reason for padding executable sections with "long NOPs"? - gcc

I found that x86-64 programs (at least those compiled using GCC) have functions start by default at addresses aligned to multiples of 16 bytes and that the padding is done by NOP instructions with as many prefixes as could fit to optimally fill the space. For example,
(...)
447454: c3 retq
447455: 90 nop
447456: 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:0x0(%rax,%rax,1)
0000000000447460 <__libc_csu_fini>:
447460: f3 c3 repz retq
What's the advantage over filling the space with regular single-byte NOPs, like observed here or here?

There's no downside, so why not? It makes the disassembly easier to read for humans, because you don't have a huge amount of lines separating functions.
GCC (the actual compiler part that transforms C to assembly) uses the same .p2align directive to ask the assembler to insert padding whether it's inside a function to align branch targets, or whether it's between functions to align function entry points.
GCC could emit .p2align 4,0x90 to ask the assembler to fill with single-byte 0x90 NOPs in cases where the NOPs won't be executed, but like I said, there's no reason to bother doing that instead of .p2align 4 (pad out to the next 2^4 boundary with the default choice of filler).
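For example, a small hand-written GAS sketch (mine, not actual GCC output) of the two choices:
.text
foo:
ret
.p2align 4          # pad to a 16-byte boundary; in a code section the assembler's default fill is long NOPs
bar:
ret
.p2align 4, 0x90    # explicitly fill with single-byte 0x90 NOPs instead
baz:
ret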
If the end of the function is an indirect branch (a tail-call with jmp [rax] or something), speculative execution could run into these NOP instructions. Decoding many short NOPs could overflow the uop cache on Intel SnB-family, which can only cache up to 3 uop-cache lines of up-to-6 uops each per 32-byte block of x86 machine code (http://agner.org/optimize/ microarch pdf). Long NOPs are potentially better for that.
IDK how Pentium4's trace cache builder behaved; maybe it was useful for that, too? Again, fewer longer NOP instructions are less likely to trigger anything weird in the front-end of a CPU before it figures out that the NOPs aren't executed.
MSVC pads with int3 between functions, IIRC, which will stop speculative execution. That's not a bad idea.
This is guesswork; it's probably not a real factor in performance; if it still mattered on modern CPUs, all compilers would probably avoid short NOPs between functions, but as one of your links showed, not all do.
Some CPUs, like AMD K8/K10 and Bulldozer-family, mark instruction-lengths in L1I cache. Agner Fog says that bandwidth from L2 to L1I is low on K8/K10, and guesses that it may be from adding extra pre-decode information. IDK if this takes longer when there are lots of small instructions? It would have to know where to start decoding, because the middle of an instruction can span a cache-line boundary. IDK how that works.
BTW, these instructions might be decoded as part of a group containing a normal ret, but I don't think there's anything to worry about either way in that case.
Decoding happens in 2 stages in some CPUs: first, instruction-length decoding, which finds blocks of up-to-16 bytes containing up-to-6 instructions (e.g. on Intel P6-family / Sandybridge-family). Then it feeds those blocks to the decoders.
With correct branch prediction for the ret, even nasty stuff like LCP stalls after the ret don't seem to hurt.
Anyway, I don't think this difference is significant. Decoded NOP instructions after a RET should be cancelled before they go anywhere, because the RET is an unconditional branch. It probably makes no difference whether the instruction-length decoder finds many single-byte instructions vs. some prefixes but not the end of an instruction before the end of a 16-byte window.

Related

Does a Length-Changing Prefix (LCP) incur a stall on a simple x86_64 instruction?

Consider a simple instruction like
mov RCX, RDI # 48 89 f9
The 48 is the REX prefix for x86_64. It is not an LCP. But consider adding an LCP (for alignment purposes):
.byte 0x67
mov RCX, RDI # 67 48 89 f9
67 is an address size prefix which in this case is for an instruction without addresses. This instruction also has no immediates, and it doesn't use the F7 opcode (False LCP stalls; F7 would be TEST, NOT, NEG, MUL, IMUL, DIV + IDIV). Assume that it doesn't cross a 16-byte boundary either. Those are the LCP stall cases mentioned in Intel's Optimization Reference Manual.
Would this instruction incur an LCP stall (on Skylake, Haswell, ...) ? What about two LCPs?
My daily driver is a MacBook. So I don't have access to VTune and I can't look at the ILD_STALL event. Is there any other way to know?
TL:DR: 67h is safe here on all CPUs. In 64-bit mode (footnote 1), 67h can only LCP-stall with addr32 movabs load/store of the accumulator (AL/AX/EAX/RAX) from/to a moffs 32-bit absolute address (vs. the normal 64-bit absolute for that special opcode).
67h isn't length-changing with normal instructions that use a ModRM, even with a disp32 component in the addressing mode, because 32 and 64-bit address-size use identical ModRM formats. That 67h-LCP-stallable form of mov is special and doesn't use a modrm addressing mode.
(It also almost certainly won't have some other meaning in future CPUs, like being part of a longer opcode the way rep is; see footnote 3.)
A Length Changing Prefix is when the opcode(+modrm) would imply a different length in bytes for the non-prefixes part of the instruction's machine code, if you ignored prefixes. I.e. it changes the length of the rest of the instruction. (Parallel length-finding is hard, and done separately from full decode: Later insns in a 16-byte block don't even have known start points. So this min(16-byte, 6-instruction) stage needs to look at as few bits as possible after prefixes, for the normal fast case to work. This is the stage where LCP stalls can happen.)
Usually only with an actual imm16 / imm32 opcode, e.g. 66h is length-changing in add cx, 1234, but not add cx, 12: after prefixes or in the appropriate mode, add r/m16, imm8 and add r/m32, imm8 are both opcode + modrm + imm8, 3 bytes regardless, (https://www.felixcloutier.com/x86/add). Pre-decode hardware can find the right length by just skipping prefixes, not modifying interpretation of later opcode+modrm based on what it saw, unlike when 66h means the opcode implies 2 immediate bytes instead of 4. Assemblers will always pick the imm8 encoding when possible because it's shorter (or equal length for the no-modrm add ax, imm16 special case).
(Note that REX.W=1 is length-changing for mov r64, imm64 vs. mov r32, imm32, but all hardware handles that relatively common instruction efficiently so only 66h and 67h can ever actually LCP-stall.)
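To make the imm8 vs. imm16 point concrete, here are the encodings (NASM source; the byte comments are what objdump shows in 64-bit mode):
add ecx, 1234     ; 81 c1 d2 04 00 00    opcode + modrm + imm32
add cx,  1234     ; 66 81 c1 d2 04       66h shrinks imm32 to imm16: length-changing
add ecx, 12       ; 83 c1 0c             opcode + modrm + imm8
add cx,  12       ; 66 83 c1 0c          imm8 either way, so this 66h is not length-changing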
SnB-family doesn't have any "false" LCP stalls (footnote 2) for prefixes that can be length-changing for this opcode but not for this particular instruction, for either 66h or 67h. So F7 is a non-issue on SnB, unlike on Core2 and Nehalem. (Earlier P6-family Intel CPUs didn't support 64-bit mode.) Atom/Silvermont don't have LCP penalties at all, nor do AMD or Via CPUs.
Agner Fog's microarch guide covers this well, and explains things clearly. Search for "length-changing prefixes". (This answer is an attempt to put those pieces together with some reminders about how x86 instruction encoding works, etc.)
Footnote 1: 67h increases length-finding difficulty more in non-64-bit modes:
In 64-bit mode, 67h changes from 64 to 32-bit address size, both of which use disp0 / 8 / 32 (0, 1, or 4 bytes of immediate displacement as part of the instruction), and which use the same ModRM + optional SIB encoding for normal addressing modes. RIP+rel32 repurposes the shorter (no SIB) encoding of 32-bit mode's two redundant ways to encode [disp32], so length decoding is unaffected. Note that REX was already designed not to be length-changing (except for mov r64, imm64), by burdening R13 and R12 in the same ways as RBP and RSP as ModRM "escape codes" to signal no base reg, or presence of a SIB byte, respectively.
In 16 and 32-bit modes, 67h switches to 32 or 16-bit address size. Not only are [x + disp32] vs. [x + disp16] different lengths after the ModRM byte (just like immediates for the operand-size prefix), but also 16-bit address-size can't signal a SIB byte. Why don't x86 16-bit addressing modes have a scale factor, while the 32-bit version has it? So the same bits in the mode and /rm fields can imply different lengths.
Footnote 2: "False" LCP stalls
This need (see footnote 1) to sometimes look differently at ModRM even to find the length is presumably why Intel CPUs before Sandybridge have "false" LCP stalls in 16/32-bit modes on 67h prefixes on any instruction with a ModRM, even when they aren't length-changing (e.g. register addressing mode). Instead of optimistically length-finding and checking somehow, Core2/Nehalem just punt if they see addr32 + most opcodes when not in 64-bit mode.
Fortunately there's basically zero reason to ever use it in 32-bit code so this mostly only matters for 16-bit code that uses 32-bit registers without switching to protected mode. Or code using 67h for padding like you're doing, except in 32-bit mode. .byte 0x67 / mov ecx, edi would be a problem for Core 2 / Nehalem. (I didn't check earlier 32-bit-only P6 family CPUs. They're a lot more obsolete than Nehalem.)
False LCP stalls for 67h never happen in 64-bit mode; as discussed above that's the easy case, and the length pre-decoders already have to know what mode they're in, so fortunately there's no downside to using it for padding. Unlike rep (which could become part of some future opcode), 67h is extremely likely to be safely ignored for instructions where it can apply to some form of the same opcode, even if there isn't actually a memory operand for this one.
Sandybridge-family doesn't ever have any false LCP stalls, removing both the 16/32-bit mode address-size (67h) case and the all-modes 66 F7 cases (which need a look at ModRM to disambiguate instructions like neg di or mul di from test di, imm16).
SnB-family also removes some 66h true-LCP stalls, e.g. from mov-immediate like mov word ptr [rdi], 0 which is actually useful.
Footnote 3: forward compat of using 67h for padding
When 67h applies to the opcode in general (i.e. it can use a memory operand), it's very unlikely that it will mean something else for the same opcode with a modrm that just happens to encode reg,reg operands. So this is safe for What methods can be used to efficiently extend instruction length on modern x86?.
In fact, "relaxing" a 6-byte call [RIP+rel32] to a 5-byte call rel32 is done by GNU binutils by padding the call rel32 with a 67h address-size prefix, even though that's never meaningful for E8 call rel32. (This happens when linking code compiled with -fno-plt, which uses call [RIP + foo#gotpcrel] for any foo that's not found in the current compilation unit and doesn't have "hidden" visibility.)
But that's not a good precedent: at this point it's too widespread for CPU vendors to want to break that specific prefix+opcode combo (like for What does `rep ret` mean?), but some homebrewed thing in your program like 67h cdq would not get the same treatment from vendors.
The rules, for Sandybridge-family CPUs
edited/condensed from Agner's microarch PDF, these cases can LCP-stall, taking an extra 2 to 3 cycles in pre-decode (if they miss in the uop cache).
Any ALU op with an imm16 that would be imm32 without a 66h. (Except mov-immediate).
Remember that mov and test don't have imm8 forms for wider operand-size so prefer test al, 1, or imm32 if necessary. Or sometimes even test ah, imm8 if you want to test bits in the top half of AX, although beware of 1 cycle of extra latency for reading AH after writing the full reg on HSW and later. GCC uses this trick but should maybe start being careful with it, perhaps sometimes using bt reg, imm8 when feeding a setcc or cmovcc (which can't macro-fuse with test like JCC can).
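For instance, some possible ways to test bit 10 of DI/EDI without an imm16 (my examples, with encodings for 64-bit mode):
test di,  0x0400    ; 66 f7 c7 00 04       imm16 after 66h: true LCP, can stall pre-decode
test edi, 0x0400    ; f7 c7 00 04 00 00    imm32: one byte longer, but no LCP
bt   edi, 10        ; 0f ba e7 0a          sets only CF; fine when feeding setcc or cmovcc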
67h with movabs moffs (A0/A1/A2/A3 opcodes in 64-bit mode, and probably also in 16 or 32-bit mode). Confirmed by my testing with perf counters for ild_stall.lcp on Skylake when LLVM was deciding whether to optimize mov al, [0x123456] to use 67 A0 4-byte-address or a normal opcode + modrm + sib + disp32 (to get absolute instead of rip-relative). That refers to an old version of Agner's guide; he updated soon after I sent him my test results.
If one of the instructions NEG, NOT, DIV, IDIV, MUL and IMUL with a single operand has a 16-bit operand and there is a 16-bytes boundary between the opcode byte and the mod-reg-rm byte. These instructions have a bogus length-changing prefix because these instructions have the same opcode as the TEST instruction with a 16-bit immediate operand [...]
No penalty on SnB-family for div cx or whatever, regardless of alignment.
The address size prefix (67H) will always cause a delay in 16-bit and 32-bit mode on any instruction that has a mod/reg/rm byte even if it doesn't change the length of the instruction.
SnB-family removed this penalty, making address-size prefixes usable as padding if you're careful.
Or to summarize another way:
SnB-family has no false LCP stalls.
SnB-family has LCP stalls on every 66h and 67h true LCP except for:
mov r/m16, imm16 and the mov r16, imm16 no-modrm version.
67h address size interaction with ModRM (in 16/32-bit modes).
(That excludes the no-modrm absolute address load/store of AL/AX/EAX/RAX forms; they can still LCP-stall, presumably even in 32-bit mode, like in 64-bit.)
Length-changing REX doesn't stall (on any CPU).
Some examples
(This part ignores the false LCP stalls that some CPUs have in some non-length-changing cases which turn out not to matter here, but perhaps that's why you were worried about 67h for mov reg,reg.)
In your case, the rest of the instruction bytes, starting after the 67, decode as a 3-byte instruction whether the current address-size is 32 or 64. Same even with addressing modes like mov eax, [e/rsi + 1024] (reg+disp32) or addr32 mov edx, [RIP + rel32].
In 16 and 32-bit modes, 67h switches between 16 and 32-bit address size. [x + disp32] vs. [x + disp16] are different lengths after the ModRM byte, but also non-16-bit address-size can signal a SIB byte depending on the R/M field. But in 64-bit mode, 32 and 64-bit address-size both use [x + disp32], and the same ModRM->SIB or not encoding.
There is only one case where a 67h address-size prefix is length-changing in 64-bit mode: movabs load/store with 8-byte vs. 4-byte absolute addresses, and yes it does LCP-stall Intel CPUs. (I posted test results on https://bugs.llvm.org/show_bug.cgi?id=34733#c3)
For example, addr32 movabs [0x123456], al
.intel_syntax noprefix
addr32 mov [0x123456], cl # non-AL to make movabs impossible
mov [0x123456], al # GAS picks normal absolute [disp32]
addr32 mov [0x123456], al # GAS picks A2 movabs since addr32 makes that the shortest choice, same as NASM does.
movabs [0x123456], al # 64-bit absolute address
Note that GAS (fortunately) doesn't choose to use an addr32 prefix on its own, even with as -Os (gcc -Wa,-Os).
$ gcc -c foo.s
$ objdump -drwC -Mintel foo.o
...
0: 67 88 0c 25 56 34 12 00 mov BYTE PTR ds:0x123456,cl
8: 88 04 25 56 34 12 00 mov BYTE PTR ds:0x123456,al # same encoding after the 67
f: 67 a2 56 34 12 00 addr32 mov ds:0x123456,al
15: a2 56 34 12 00 00 00 00 00 movabs ds:0x123456,al # different length for same opcode
As you can see from the last 2 instructions, using the a2 mov moffs, al opcode, with a 67 the rest of the instruction is a different length for the same opcode.
This does LCP-stall on Skylake, so it's only fast when running from the uop cache.
Of course the more common source of LCP stalls is with the 66 prefix and an imm16 (instead of imm32). Like add ax, 1234, as in this random test where I wanted to see if jumping over the LCP-stalling instruction could avoid the problem: Label in %rep section in NASM. But not cases like add ax, 12 which will use add r/m16, imm8 (which is the same length after the 66 prefix as add r/m32, imm8).
Also, Sandybridge-family reportedly avoid LCP stalls for mov-immediate with 16-bit immediate.
Related:
Another example of working around add r/m16, imm16: add 1 byte immediate value to a 2 bytes memory location
x86 assembly 16 bit vs 8 bit immediate operand encoding - choose add r/m16, imm8 instead of the also-3-byte add ax, imm16 form.
Sign or Zero Extension of address in 64bit mode for MOV moffs32? - how address-size interacts with the moffs forms of movabs. (The kind that can LCP-stall)
What methods can be used to efficiently extend instruction length on modern x86? - the general case of what you're doing.
Tuning advice and uarch details:
Usually don't try to save space with addr32 mov [0x123456], al, except maybe when it's a choice between saving 1 byte or using 15 bytes of padding including actual NOPs inside a loop. (more tuning advice below)
One LCP stall usually won't be a disaster with a uop cache, especially if length-decoding probably isn't a front-end bottleneck here (although it often can be if the front-end is a bottleneck at all). Hard to test a single instance in one function by micro-benchmarking, though; only a real full-app benchmark will accurately reflect when the code can run from the uop cache (what Intel perf counters call the DSB), bypassing legacy decode (MITE).
There are queues between stages in modern CPUs that can at least partly absorb stalls https://www.realworldtech.com/haswell-cpu/2/ (more so than in PPro/PIII), and SnB-family has shorter LCP-stalls than Core2/Nehalem. (But other reasons for pre-decode slowness already dip into their capacity, and after an I-cache miss they may all be empty.)
When prefixes aren't length-changing, the pre-decode pipeline stage that finds instruction boundaries (before steering chunks of bytes to actual complex/simple decoders or doing actual decoding) will find the correct instruction-length / end by skipping all prefixes and then looking at just the opcode (and modrm if applicable).
This pre-decode length-finding is where LCP stalls happen, so fun fact: even Core 2's pre-decode loop buffer can hide LCP stalls in subsequent iterations because it locks down up to 64 bytes / 18 insns of x86 machine code after finding instruction boundaries, using the decode queue (pre-decode output) as a buffer.
In later CPUs, the LSD and uop cache are post decode, so unless something defeats the uop cache (like the pesky JCC-erratum mitigation or simply having too many uops for the uop cache in a 32-byte aligned block of x86 machine code), loops only pay the LCP-stall cost on the first iteration, if they weren't already hot.
I'd say generally work around LCP stalls if you can do so cheaply, especially for code that usually runs "cold". Or if you can just use 32-bit operand-size and avoid partial-register shenanigans, costing usually only a byte of code-size and no extra instructions or uops. Or if you'd have multiple LCP stalls in a row, e.g. from naively using 16-bit immediates, that would be too many bubbles for buffers to hide so you'd have a real problem and it's worth spending extra instructions. (e.g. mov eax, imm32 / add [mem], ax, or movzx load / add r32,imm32 / store, or whatever.)
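A sketch of the kind of substitution meant here, for adding a 16-bit immediate to memory (encodings for 64-bit mode):
add word [rdi], 1234    ; 66 81 07 d2 04    66h + imm16: true LCP
mov eax, 1234           ; b8 d2 04 00 00    imm32, no LCP
add [rdi], ax           ; 66 01 07          no immediate, so this 66h isn't length-changing
; or keep the arithmetic at 32-bit operand-size and only store 16 bits (flags come from the 32-bit add):
movzx eax, word [rdi]   ; 0f b7 07
add   eax, 1234         ; 05 d2 04 00 00
mov   [rdi], ax         ; 66 89 07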
Padding to end 16-byte fetch blocks at instruction boundaries: not needed
(This is separate from aligning the start of a fetch block at a branch target, which is also sometimes unnecessary given the uop cache.)
Wikichip's section on Skylake pre-decode incorrectly implies that a partial instruction left at the end of a block has to pre-decode on its own, rather than along with the next 16-byte group that contains the end of the instruction. It seems to be paraphrased from Agner Fog's text, with some changes and additions that make it wrong:
[from wikichip...] As with previous microarchitectures, the pre-decoder has a throughput of 6 macro-ops per cycle or until all 16 bytes are consumed, whichever happens first. Note that the predecoder will not load a new 16-byte block until the previous block has been fully exhausted. For example, suppose a new chunk was loaded, resulting in 7 instructions. In the first cycle, 6 instructions will be processed and a whole second cycle will be wasted for that last instruction. This will produce the much lower throughput of 3.5 instructions per cycle which is considerably less than optimal.
[this part is paraphrased from Agner Fog's Core2/Nehalem section, with the word "fully" having been added]
Likewise, if the 16-byte block resulted in just 4 instructions with 1 byte of the 5th instruction received, the first 4 instructions will be processed in the first cycle and a second cycle will be required for the last instruction. This will produce an average throughput of 2.5 instructions per cycle.
[nothing like this appears in the current version of Agner's guide, IDK where this misinformation came from. Perhaps made up based on a misunderstanding of what Agner said, but without testing.]
Fortunately no. The rest of the instruction is in the next fetch block, so reality makes a lot more sense: the leftover bytes are prepended to the next 16-byte block.
(Starting a new 16-byte pre-decode block starting with this instruction would also have been plausible, but my testing rules that out: 2.82 IPC with a repeating 5,6,6 byte = 17-byte pattern. If it only ever looked at 16 bytes and left the partial 5 or 6-byte instruction to be the start of the next block, that would give us 2 IPC.)
A repeating pattern of 3x 5 byte instructions unrolled many times (a NASM %rep 2500 or GAS .rept 2500 block, so 7.5k instructions in ~36kiB) runs at 3.19 IPC, pre-decoding and decoding at ~16 bytes per cycle. (16 bytes/cycle) / (5 bytes/insn) = 3.2 instructions per cycle theoretical.
(If wikichip were right, it would predict close to 2 IPC in a 3-1 pattern, which is of course unreasonably low and would not be an acceptable design for Intel for long runs of long or medium-length instructions when running from legacy decode. 2 IPC is so much narrower than the 4-wide pipeline that it would not be ok even for legacy decode. Intel learned from P4 that running at least decently well from legacy decode is important, even when your CPU caches decoded uops. That's why SnB's uop cache can be so small, only ~1.5k uops. A lot smaller than P4's trace cache, but P4's problem was trying to replace L1i with a trace cache, and having weak decoders. (Also the fact it was a trace cache, so it cached the same code multiple times.))
These perf differences are large enough that you can verify it on your Mac, using a plenty-large repeat count so you don't need perf counters to verify uop-cache misses. (Remember that L1i is inclusive of uop cache, so loops that don't fit in L1i will also evict themselves from uop cache.) Anyway, just measuring total time and knowing the approximate max-turbo that you'll hit is sufficient for a sanity check like this.
Getting better than the theoretical-max that wikichip predicts, even after startup overhead and conservative frequency estimates, will completely rule out that behaviour even on a machine where you don't have perf counters.
$ nasm -felf64 && ld # 3x 5 bytes, repeated 2.5k times
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_retired.retire_slots,uops_executed.thread,idq.dsb_uops -r2 ./testloop
Performance counter stats for './testloop' (2 runs):
604.16 msec task-clock # 1.000 CPUs utilized ( +- 0.02% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
1 page-faults # 0.002 K/sec
2,354,699,144 cycles # 3.897 GHz ( +- 0.02% )
7,502,000,195 instructions # 3.19 insn per cycle ( +- 0.00% )
7,506,746,328 uops_issued.any # 12425.167 M/sec ( +- 0.00% )
7,506,686,463 uops_retired.retire_slots # 12425.068 M/sec ( +- 0.00% )
7,506,726,076 uops_executed.thread # 12425.134 M/sec ( +- 0.00% )
0 idq.dsb_uops # 0.000 K/sec
0.6044392 +- 0.0000998 seconds time elapsed ( +- 0.02% )
(and from another run):
7,501,076,096 idq.mite_uops # 12402.209 M/sec ( +- 0.00% )
No clue why idq.mite_uops:u is not equal to issued or retired. There's nothing to un-laminate, and no stack-sync uops should be necessary, so IDK where the extra issued+retired uops could be coming from. The excess is consistent across runs, and proportional to the %rep count I think.
With other patterns like 5-5-6 (16 bytes) and 5-6-6 (17 bytes), I get similar results.
I do sometimes measure a slight difference depending on whether the 16-byte groups are aligned to an absolute 16-byte boundary or not (putting a nop at the top of the loop misaligns them). But that seems to only happen with larger repeat counts. At %rep 2500 (39kiB total size), I still get 2.99 IPC (just under one 16-byte group per cycle), with 0 DSB uops, regardless of aligned vs. misaligned.
I still get 2.99 IPC at %rep 5000, but I see a difference at %rep 10000: 2.95 IPC misaligned vs. 2.99 IPC aligned. That largest %rep count is ~156kiB and still fits in the 256k L2 cache, so IDK why anything would be different from half that size. (Both are much larger than the 32k L1i.) I think earlier I was seeing a difference at 5k, but I can't repro it now. Maybe that was with 17-byte groups.
The actual loop runs 1000000 times in a static executable under _start, with a raw syscall to _exit, so perf counters (and time) for the whole process is basically just the loop. (especially with perf --all-user to only count user-space.)
; complete Linux program
default rel
%use smartalign
alignmode p6, 64
global _start
_start:
mov ebp, 1000000
align 64
.loop:
%ifdef MISALIGN
nop
%endif
%rep 2500
mov eax, 12345 ; 5 bytes.
mov ecx, 123456 ; 5 bytes. Use r8d for 6 bytes
mov edx, 1234567 ; 5 bytes. Use r9d for 6 bytes
%endrep
dec ebp
jnz .loop
.end:
xor edi,edi
mov eax,231 ; __NR_exit_group from /usr/include/asm/unistd_64.h
syscall ; sys_exit_group(0)

Timing of executing 8 bit and 64 bit instructions on 64 bit x64/Amd64 processors

Is there any execution timing difference between 8 bit and 64 bit instructions on a 64 bit x64/Amd64 processor, when those instructions are similar/same except for bit width?
Is there a way to find real processor timing of executing these 2 tiny assembly functions?
-Thanks.
; 64 bit instructions
add64:
mov $0x1, %rax
add $0x2, %rax
ret
; 8 bit instructions
add8:
mov $0x1, %al
add $0x2, %al
ret
Yes, there's a difference. mov $0x1, %al has a false dependency on the old value of RAX on most CPUs, including everything newer than Sandybridge. It's a 2-input 1-output instruction; from the CPU's point of view it's like add $1, %al as far as scheduling it independently or not relative to other uses of RAX. Only writing a 32 or 64-bit register starts a new dependency chain.
This means the AL return value of your add8 function might not be ready until after a cache miss for some independent work the caller happened to be doing in EAX before the call, but the RAX result of add64 could be ready right away for out-of-order execution to get started on later instructions in the caller that use the return value. (Assuming their other inputs are also ready.)
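A hedged illustration of that (assuming a CPU that doesn't rename AL separately from RAX, i.e. Haswell and later as discussed in the links below):
mov  (%rdi), %rax    # suppose this load misses in cache
mov  $0x1, %al       # merges into RAX, so it can't complete until the load does
# vs.
mov  $0x1, %eax      # writes the whole register: new dependency chain, doesn't wait for anything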
Why doesn't GCC use partial registers? and
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
and What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? - Important background for understanding performance on modern OoO exec CPUs.
Their code-size also differs: Both the 8-bit instructions are 2 bytes long. (Thanks to the AL, imm8 short-form encoding; add $1, %dl would be 3 bytes). The RAX instructions are 7 and 4 bytes long. This matters for L1i cache footprint (and on a large scale, for how many bytes have to get paged in from disk). On a small scale, how many instructions can fit into a 16 or 32-byte fetch block if the CPU is doing legacy decode because the code wasn't already hot in the uop cache. Also code-alignment of later instructions is affected by varying lengths of previous instructions, sometimes affecting which branches alias each other.
https://agner.org/optimize/ explains the details of the pipelines of various x86 microarchitectures, including front-end decoding effects that can make instruction-length matter beyond just code density in the I-cache / uop-cache.
Generally 32-bit operand-size is the most efficient (for performance, and pretty good for code-size). 32 and 8 are the operand-sizes that x86-64 can use without extra prefixes, but in practice with 8-bit operand-size you need more instructions or longer instructions to avoid stalls and badness, because 8-bit writes don't zero-extend the full register. The advantages of using 32bit registers/instructions in x86-64.
A few instructions are actually slower in the ALUs for 64-bit operand-size, not just front-end effects. That includes div on most CPUs, and imul on some older CPUs. Also popcnt and bswap. e.g. Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux
Note that mov $0x1, %rax will assemble to 7 bytes with GAS, unless you use as -O2 (not the same as gcc -O2, see this for examples) to get it to optimize to mov $1, %eax, which has exactly the same architectural effects but is shorter (no REX or ModRM byte). Some assemblers do that optimization by default, but GAS doesn't. Why NASM on Linux changes registers in x86_64 assembly has more about why this optimization is safe and good, and why you should do it yourself in the source, especially if your assembler doesn't do it for you.
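For reference, the encodings behind those byte counts (standard x86-64 encodings, shown with the bytes objdump would print):
mov  $0x1, %al      # b0 01                   2 bytes (AL,imm8 short form)
add  $0x2, %al      # 04 02                   2 bytes (AL,imm8 short form)
mov  $0x1, %rax     # 48 c7 c0 01 00 00 00    7 bytes (REX.W + sign-extended imm32)
add  $0x2, %rax     # 48 83 c0 02             4 bytes (REX.W + imm8)
mov  $0x1, %eax     # b8 01 00 00 00          5 bytes; same architectural effect as the 7-byte mov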
But other than the false dep and code-size, they're the same for the back-end of the CPU: all those instructions are single-uop and can run on any scalar-integer ALU execution port1. (https://uops.info/ has automated test results for every form of every unprivileged instruction).
Footnote 1: Excavator (last-gen Bulldozer-family) can also run mov $imm, %reg on 2 more ports (AGU) for 32 and 64-bit operand-size. But merging a new low-8 or low-16 into a full register needs an ALU port. So mov $1, %rax has 4/clock throughput on Excavator, but mov $1, %al only has 2/clock throughput. (And of course only if you use a few different destination registers, not actually AL repeatedly; that would be a latency bottleneck of 1/clock because of the false dependency from writing a partial register on that microarchitecture.)
Previous Bulldozer-family CPUs starting with Piledriver can run mov reg, reg (for r32 or r64) on EX0, EX1, AGU0, AGU1, while most ALU instructions including mov $imm, %reg can only run on EX0/1. Further extending the AGU port's capabilities to also handle mov-immediate was a new feature in Excavator.
Fortunately Bulldozer was obsoleted by AMD's much better Zen architecture which has 4 full scalar integer ALU ports / execution units. (And a wider front end and a uop cache, good caches, and generally doesn't suck in a lot of the ways that Bulldozer sucked.)
Is there a way to measure it?
Yes, but generally not in a function you call with call. Instead, put it in an unrolled loop so you can run it lots of times with minimal other instructions. It's especially useful to look at CPU performance counter results to find front-end / back-end uop counts, as well as just the overall time for your loop.
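A minimal GAS sketch of what I mean (my own, not from any linked answer); the back-to-back add through RAX measures latency, and breaking the dependency (e.g. using different destination registers) would measure throughput instead:
.globl _start
_start:
    mov  $100000000, %ebp    # outer iterations: enough to swamp startup overhead
1:
    .rept 100                # unroll so the dec/jnz loop overhead is negligible
    add  $0x2, %rax          # serial dependency through RAX
    .endr
    dec  %ebp
    jnz  1b
    xor  %edi, %edi
    mov  $231, %eax          # __NR_exit_group
    syscall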
You can construct your loop to measure latency or throughput; see RDTSCP in NASM always returns the same value (timing a single instruction). Also:
Assembly - How to score a CPU instruction by latency and throughput
Idiomatic way of performance evaluation?
Can x86's MOV really be "free"? Why can't I reproduce this at all? is a good specific example of constructing a microbenchmark to measure / prove something specific.
Generally you don't need to measure yourself (although it's good to understand how, that helps you know what the measurements really mean). People have already done that for most CPU microarchitectures. You can predict performance for a specific CPU for some loops (if you can assume no stalls or cache misses) based on analyzing the instructions. Often that can predict performance fairly accurately, but medium-length dependency chains that OoO exec can only partially hide makes it too hard to accurately predict or account for every cycle.
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? has links to lots of good details, and stuff about CPU internals.
How many CPU cycles are needed for each assembly instruction? (you can't add up a cycle count for each instruction; front-end and back-end throughput, and latency, could each be the bottleneck for a loop.)

Bring code into the L1 instruction cache without executing it

Let's say I have a function that I plan to execute as part of a benchmark. I want to bring this code into the L1 instruction cache prior to executing since I don't want to measure the cost of I$ misses as part of the benchmark.
The obvious way to do this is to simply execute the code at least once before the benchmark, hence "warming it up" and bringing it into the L1 instruction cache and possibly the uop cache, etc.
What are my alternatives in the case I don't want to execute the code (e.g., because I want the various predictors which key off of instruction addresses to be cold)?
In Granite Rapids and later, PREFETCHIT0 [rip+rel32] to prefetch code into "all levels" of cache, or prefetchit1 to prefetch into all levels except L1i. These instructions are a NOP with an addressing-mode other than RIP-relative, or on CPUs that don't support them. (Perhaps they also prime iTLB or even uop cache, or at least could on paper, in which case this isn't what you want.) The docs in Intel's "future extensions" manual as of 2022 Dec recommends that the target address be the start of some instruction.
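A hedged sketch of what using it might look like (NASM, assuming an assembler new enough to know the mnemonic; some_function is a placeholder label for the code you want prefetched):
default rel                     ; RIP-relative addressing is required, otherwise it's just a NOP
prefetchit0 [some_function]     ; prefetch into all cache levels including L1i (Granite Rapids and later)
prefetchit1 [some_function]     ; prefetch into all levels except L1i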
Note that this Q&A is about priming things for a microbenchmark. Not things that would be worth doing to improve overall performance. For that, probably just best-effort prefetch into L2 cache (the inner-most unified cache) with prefetcht1 SW prefetch, also priming the dTLB in a way that possibly helps the iTLB (although it will evict a possibly-useful dTLB entry).
Or not, if the L2TLB is a victim cache. see X86 prefetching optimizations: "computed goto" threaded code for more discussion.
Caveat: some of this answer is Intel-centric. If I just say "the uop cache", I'm talking about Sandybridge-family. I know Ryzen has a uop-cache too, but I haven't read much of anything about its internals, and only know some basics about Ryzen from reading Agner Fog's microarch guide.
You can at least prefetch into L2 with software prefetch, but that doesn't even necessarily help with iTLB hits. (The 2nd-level TLB is a victim cache, I think, so a dTLB miss might not populate anything that the iTLB checks.)
But this doesn't help with L1I$ misses, or getting the target code decoded into the uop cache.
If there is a way to do what you want, it's going to be with some kind of trick. x86 has no "standard" way to do this; no code-prefetch instruction. Darek Mihoka wrote about code-prefetch as part of a CPU-emulator interpreter loop: The Common CPU Interpreter Loop Revisited: the Nostradamus Distributor back in 2008 when P4 and Core2 were the hot CPU microarchitectures to tune for.
But of course that's a different case: the goal is sustained performance of indirect branches, not priming things for a benchmark. It doesn't matter if you spend a lot of time achieving the microarchitectural state you want outside the timed portion of a micro-benchmark.
Speaking of which, modern branch predictors aren't just "cold", they always contain some prediction based on aliasing (footnote 1). This may or may not be important.
Prefetch the first / last lines (and maybe others) with call to a ret
I think instruction fetch / prefetch normally continues past an ordinary ret or jmp, because it can't be detected until decode. So you could just call a function that ends at the end of the previous cache line. (Make sure they're in the same page, so an iTLB miss doesn't block prefetch.)
ret after a call will predict reliably if no other call did anything to the return-address predictor stack, except in rare cases if an interrupt happened between the call and ret, and the kernel code had a deep enough call tree to push the prediction for this ret out of the RSB (return-stack-buffer). Or if Spectre mitigation flushed it intentionally on a context switch.
; make sure this is in the same *page* as function_under_test, to prime the iTLB
ALIGN 64
; could put the label here, but probably better not
60 bytes of (long) NOPs or whatever padding
prime_Icache_first_line:
ret
; jmp back_to_benchmark_startup ; alternative if JMP is handled differently than RET.
lfence ; prevent any speculative execution after RET, in case it or JMP aren't detected as soon as they decode
;;; cache-line boundary here
function_under_test:
...
prime_Icache_last_line: ; label the last RET in function_under_test
ret ; this will prime the "there's a ret here" predictor (correctly)
benchmark_startup:
call prime_Icache_first_line
call prime_Icache_first_line ; IDK if calling twice could possibly help in case prefetch didn't get far the first time? But now the CPU has "seen" the RET and may not fetch past it.
call prime_Icache_last_line ; definitely only need to call once; it's in the right line
lfence
rdtsc
.timed_loop:
call function_under_test
...
jnz .timed_loop
We can even extend this technique to more than 2 cache lines by calling to any 0xC3 (ret) byte inside the body of function_under_test. But as @BeeOnRope points out, that's dangerous because it may prime branch prediction with "there's a ret here", causing a mispredict you otherwise wouldn't have had when calling function_under_test for real.
Early in the front-end, branch prediction is needed based on fetch-block address (which block to fetch after this one), not on individual branches inside each block, so this could be a problem even if the ret byte was part of another instruction.
But if this idea is viable, then you can look for a 0xc3 byte as part of an instruction in the cache line, or at worst add a 3-byte NOP r/m32 (0f 1f c3 nop ebx,eax). c3 as a ModR/M encodes a reg,reg instruction (with ebx and eax as operands), so it doesn't have to be hidden in a disp8 to avoid making the NOP even longer, and it's easy to find in short instructions: e.g. 89 c3 mov ebx,eax, or use the other opcode so the same modrm byte gives you mov eax,ebx. Or 83 c3 01 add ebx,0x1, or many other instructions with e/rbx, bl (and r/eax or al).
With a REX prefix, you have a choice of rbx / r11 (and rax / r8 for the /r field, if applicable). It's likely you can choose (or modify for this microbenchmark) your register allocation to produce an instruction using the relevant registers to produce a c3 byte without any overhead at all, especially if you can use a custom calling convention (at least for testing purposes) so you can clobber rbx if you weren't already saving/restoring it.
I found these by searching for (space)c3(space) in the output of objdump -d /bin/bash, just to pick a random not-small executable full of compiler-generated code.
Evil hack: end the previous cache line with the start of a multi-byte instruction.
; at the end of a cache line
prefetch_Icache_first_line:
db 0xe9 ; the opcode for 5-byte jmp rel32
function_under_test:
... normal code ; first 4 bytes will be treated as a rel32 when decoding starts at prefetch_I...
ret
; then at function_under_test+4 + rel32:
;org whatever (that's not how ORG works in NASM, so actually you'll need a linker script or something to put code here)
prefetch_Icache_branch_target:
jmp back_to_test_startup
So it jumps to a virtual address which depends on the instruction bytes of function_under_test. Map that page and put code in it that jumps back to your benchmark-prep code. The destination has to be within 2GiB, so (in 64-bit code) it's always possible to choose a virtual address for function_under_test that makes the destination a valid user-space virtual address. Actually, for many rel32 values, it's possible to choose the address of function_under_test to keep both it and the target within the low 2GiB of virtual address space, (and definitely 3GiB) and thus valid 32-bit user-space addresses even under a 32-bit kernel.
Or less crazy, using the end of a ret imm16 to consume a byte or two, just requiring a fixup of RSP after return (and treating whatever is temporarily below RSP as a "red zone" if you don't reserve extra space):
; at the end of a cache line
prefetch_Icache_first_line:
db 0xc2 ; the opcode for 3-byte ret imm16
; db 0x00 ; optional: one byte of the immediate at least keeps RSP aligned
; But x86 is little-endian, so the last byte is most significant
;; Cache-line boundary here
function_under_test:
... normal code ; first 2 bytes will be added to RSP when decoding starts at prefetch_Icache_first_line
ret
prefetch_caller:
push rbp
mov rbp, rsp ; save old RSP
;sub rsp, 65536 ; reserve space in case of the max RSP+imm16.
call prefetch_Icache_first_line
;;; UNSAFE HERE in case of signal handler if you didn't SUB.
mov rsp, rbp ; restore RSP; if no signal handlers installed, probably nothing could step on stack memory
...
pop rbp
ret
Using sub rsp, 65536 before calling to the ret imm16 makes it safe even if there might be a signal handler (or interrupt handler in kernel code, if your kernel stack is big enough, otherwise look at the actual byte and see how much will really be added to RSP). It means that call's push/store will probably miss in data cache, and maybe even cause a pagefault to grow the stack. That does happen before fetching the ret imm16, so that won't evict the L1I$ line we wanted to prime.
This whole idea is probably unnecessary; I think the above method can reliably prefetch the first line of a function anyway, and this only works for the first line. (Unless you put a 0xe9 or 0xc2 in the last byte of each relevant cache line, e.g. as part of a NOP if necessary.)
But this does give you a way to non-speculatively do code-fetch from the cache line you want without architecturally executing any instructions in it. Hopefully a direct jmp is detected before any later instructions execute, and probably without any others even decoding, except ones that decoded in the same block. (And an unconditional jmp always ends a uop-cache line on Intel.) i.e. the mispredict penalty is all in the front-end, from re-steering the fetch pipeline as soon as decode detects the jmp. I hope ret is like this too, in cases where the return-predictor stack is not empty.
A jmp r/m64 would let you control the destination just by putting the address in the right register or memory. (Figure out what register or memory addressing mode the first byte(s) of function_under_test encode, and put an address there). The opcode is FF /4, so you can only use a register addressing mode if the first byte works as a ModRM that has /r = 4 and mode=11b. But you could put the first 2 bytes of the jmp r/m64 in the previous line, so the extra bytes form the SIB (and disp8 or disp32). Whatever those are, you can set up register contents such that the jump-target address will be loaded from somewhere convenient.
But the key problem with a jmp r/m64 is that default-prediction for an indirect branch can fall through and speculatively execute function_under_test, affecting the branch-prediction entries for those branches. You could have bogus values in registers so you prime branch prediction incorrectly, but that's still different from not touching them at all.
How does this overlapping-instructions hack to consume bytes from the target cache line affect the uop cache?
I think (based on previous experimental evidence) Intel's uop cache puts instructions in the uop-cache line that corresponds to their start address, in cases where they cross a 32 or 64-byte boundary. So when the real execution of function_under_test begins, it will simply miss in the uop-cache because no uop-cache line is caching the instruction-start-address range that includes the first byte of function_under_test. i.e. the overlapping decode is probably not even noticed when it's split across an L1I$ boundary this way.
It is normally a problem for the uop cache to have the same bytes decode as parts of different instructions, but I'm optimistic that we wouldn't have a penalty in this case. (I haven't double-checked that for this case. I'm mostly assuming that lines record which range of start-addresses they cache, and not the whole range of x86 instruction bytes they're caching.)
Create mis-speculation to fetch arbitrary lines, but block exec with lfence
Spectre / Meltdown exploits and mitigation strategies provide some interesting ideas: you could maybe trigger a mispredict that fetches at least the start of the code you want, but maybe doesn't speculate into it.
lfence blocks speculative execution, but (AFAIK) not instruction prefetch / fetch / decode.
I think (and hope) the front-end will follow direct relative jumps on its own, even after lfence, so we can use jmp target_cache_line in the shadow of a mispredict + lfence to fetch and decode but not execute the target function.
If lfence works by blocking the issue stage until the reservation station (OoO scheduler) is empty, then an Intel CPU should probably decode past lfence until the IDQ is full (64 uops on Skylake). There are further buffers before other stages (fetch -> instruction-length-decode, and between that and decode), so fetch can run ahead of that. Presumably there's a HW prefetcher that runs ahead of where actual fetch is reading from, so it's pretty plausible to get several cache lines into the target function in the shadow of a single mispredict, especially if you introduce delays before the mispredict can be detected.
We can use the same return-address frobbing as a retpoline to reliably trigger a mispredict to jmp rel32 which sends fetch into the target function. (I'm pretty sure a re-steer of the front-end can happen in the shadow of speculative execution without waiting to confirm correct speculation, because that would make every re-steer serializing.)
function_under_test:
...
some_line: ; not necessarily the first cache line
...
ret
;;; This goes in the same page as the test function,
;;; so we don't iTLB-miss when trying to send the front-end there
ret_frob:
xorps xmm0,xmm0
movq xmm1, rax
;; The point of this LFENCE is to make sure the RS / ROB are empty so the front-end can run ahead in a burst.
;; so the sqrtpd delay chain isn't gradually issued.
lfence
;; alternatively, load the return address from the stack and create a data dependency on it, e.g. and eax,0
;; create a many-cycle dependency-chain before the RET misprediction can be detected
times 10 sqrtpd xmm0,xmm0 ; high latency, single uop
orps xmm0, xmm1 ; target address with data-dep on the sqrtpd chain
movq [rsp], xmm0 ; overwrite return address
; movd [rsp], xmm0 ; store-forwarding stall: do this *as well* as the movq
ret ; mis-speculate to the lfence/jmp some_line
; but architecturally jump back to the address we got in RAX
prefetch_some_line:
lea rax, [rel back_to_bench_startup]
; or pop rax or load it into xmm1 directly,
; so this block can be CALLed as a function instead of jumped to
call ret_frob
; speculative execution goes here, but architecturally never reached
lfence ; speculative *execution* stops here, fetch continues
jmp some_line
I'm not sure the lfence in ret_frob is needed. But it does make it easier to reason about what the front-end is doing relative to the back-end. After the lfence, the return address has a data dependency on the chain of 10x sqrtpd. (10x 15 to 16 cycle latency on Skylake, for example). But the 10x sqrtpd + orps + movq only take 3 cycles to issue (on 4-wide CPUs), leaving at least 148 cycles + store-forwarding latency before ret can read the return address back from the stack and discover that the return-stack prediction was wrong.
This should be plenty of time for the front-end to follow the jmp some_line and load that line into L1I$, and probably load several lines after that. It should also get some of them decoded into the uop cache.
You need a separate call / lfence / jmp block for each target line (because the target address has to be hard-coded into a direct jump for the front-end to follow it without the back-end executing anything), but they can all share the same ret_frob block.
If you left out the lfence, you could use the above retpoline-like technique to trigger speculative execution into the function. This would let you jump to any target branch in the target function with whatever args you like in registers, so you can mis-prime branch prediction however you like.
Footnote 1:
Modern branch predictors aren't just "cold", they contain predictions from whatever aliased the target virtual addresses in the various branch-prediction data structures. (At least on Intel, where SnB-family pretty definitely uses TAGE prediction.)
So you should decide whether you want to specifically anti-prime the branch predictors by (speculatively) executing the branches in your function with bogus data in registers / flags, or whether your micro-benchmarking environment resembles the surrounding conditions of the real program closely enough.
If your target function has enough branching in a very specific complex pattern (like a branchy sort function over 10 integers), then presumably only that exact input can train the branch predictor well, so any initial state other than a specially-warmed-up state is probably fine.
You may not want the uop-cache primed at all, to look for cold-execution effects in general (like decode), so that might rule out any speculative fetch / decode, not just speculative execution. Or maybe speculative decode is ok if you then run some uop-cache-polluting long-NOPs or times 800 xor eax,eax (2-byte instructions -> 16 per 32-byte block uses up all 3 entries that SnB-family uop caches allow without running out of room and not being able to fit in the uop cache at all). But not so many that you evict L1I$ as well.
Even speculative decode without execute will prime the front-end branch prediction that knows where branches are ahead of decode, though. I think that a ret (or jmp rel32) at the end of the previous cache line
Map the same physical page to two different virtual addresses.
L1I$ is physically addressed. (VIPT but with all the index bits from below the page offset, so effectively PIPT).
Branch-prediction and uop caches are virtually addressed, so with the right choice of virtual addresses, a warm-up run of the function at the alternate virtual address will prime L1I, but not branch prediction or uop caches. (This only works if branch aliasing happens modulo something larger than 4096 bytes, because the position within the page is the same for both mappings.)
Prime the iTLB by calling to a ret in the same page as the test function, but outside it.
After setting this up, no modification of the page tables are required between the warm-up run and the timing run. This is why you use two mappings of the same page instead of remapping a single mapping.
Margaret Bloom suggests that CPUs vulnerable to Meltdown might speculatively fetch instructions from a no-exec page if you jump there (in the shadow of a mispredict so it doesn't actually fault), but that would then require changing the page table, and thus a system call which is expensive and might evict that line of L1I. But if it doesn't pollute the iTLB, you could then re-populate the iTLB entry with a mispredicted branch anywhere into the same page as the function. Or just a call to a dummy ret outside the function in the same page.
None of this will let you get the uop cache warmed up, though, because it's virtually addressed. OTOH, in real life, if branch predictors are cold then probably the uop cache will also be cold.
One approach that could work for small functions would be to execute some code which appears on the same cache line(s) as your target function, which will bring in the entire cache line.
For example, you could organize your code as follows:
ALIGN 64
function_under_test:
; some code, less than 64 bytes
dummy:
ret
and then call the dummy function prior to calling function_under_test - if dummy starts on the same cache line as the target function, it would bring the entire cache line into L1I. This works for functions of 63 bytes or less (footnote 1).
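A possible warm-up sequence for that layout (my sketch, not part of the original suggestion):
call dummy                   ; executes only the ret, but brings the whole 64-byte line
                             ; (including function_under_test) into L1I
lfence                       ; optional: keep speculation from running further ahead than you want
; ... then start the timed region ...
call function_under_test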
This can probably be extended to functions up to ~126 bytes or so by using this trick both before (footnote 2) and after the target function. You could extend it to arbitrarily sized functions by inserting dummy functions on every cache line and having the target code jump over them, but this comes at a cost of inserting otherwise-unnecessary jumps into your code under test, and requires careful control over the code size so that the dummy functions are placed correctly.
You need fine control over function alignment and placement to achieve this: assembler is probably the easiest, but you can also probably do it with C or C++ in combination with compiler-specific attributes.
1 You could even reuse the ret in the function_under_test itself to support slightly longer functions (e.g., those whose ret starts within 64 bytes of the start).
2 You'd have to be more careful about the dummy function appearing before the code under test: the processor might fetch instructions past the ret and it might (?) even execute them. A ud2 after the dummy ret is likely to block further fetch (but you might want fetch if populating the uop cache is important).

Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?

I'm doing micro-optimization on a performance critical part of my code and came across the sequence of instructions (in AT&T syntax):
add %rax, %rbx
mov %rdx, %rax
mov %rbx, %rdx
I thought I finally had a use case for xchg which would allow me to shave an instruction and write:
add %rbx, %rax
xchg %rax, %rdx
However, to my dismay I found from Agner Fog's instruction tables that xchg is a 3 micro-op instruction with a 2 cycle latency on Sandy Bridge, Ivy Bridge, Broadwell, Haswell and even Skylake. 3 whole micro-ops and 2 cycles of latency! The 3 micro-ops throw off my 4-1-1-1 cadence, and the 2 cycle latency makes it worse than the original in the best case, since the last 2 instructions in the original might execute in parallel.
Now... I get that the CPU might be breaking the instruction into micro-ops that are equivalent to:
mov %rax, %tmp
mov %rdx, %rax
mov %tmp, %rdx
where tmp is an anonymous internal register and I suppose the last two micro-ops could be run in parallel so the latency is 2 cycles.
Given that register renaming occurs on these micro-architectures, though, it doesn't make sense to me that this is done this way. Why wouldn't the register renamer just swap the labels? In theory, this would have a latency of only 1 cycle (possibly 0?) and could be represented as a single micro-op so it would be much cheaper.
Supporting efficient xchg is non-trivial, and presumably not worth the extra complexity it would require in various parts of the CPU. A real CPU's microarchitecture is much more complicated than the mental model that you can use while optimizing software for it. For example, speculative execution makes everything more complicated, because it has to be able to roll back to the point where an exception occurred.
Making fxch efficient was important for x87 performance because the stack nature of x87 makes it (or alternatives like fld st(2)) hard to avoid. Compiler-generated FP code (for targets without SSE support) really does use fxch a significant amount. It seems that fast fxch was done because it was important, not because it's easy. Intel Haswell even dropped support for single-uop fxch. It's still zero-latency, but decodes to 2 uops on HSW and later (up from 1 in P5, and PPro through IvyBridge).
xchg is usually easy to avoid. In most cases, you can just unroll a loop so it's ok that the same value is now in a different register. e.g. Fibonacci with add rax, rdx / add rdx, rax instead of add rax, rdx / xchg rax, rdx. Compilers generally don't use xchg reg,reg, and usually hand-written asm doesn't either. (This chicken/egg problem is pretty similar to loop being slow (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?). loop would have been very useful for adc loops on Core2/Nehalem, where an adc + dec/jnz loop causes partial-flag stalls.)
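For example, a sketch of that unrolled-by-2 Fibonacci loop (with ecx as an assumed iteration counter):
fib_loop:               ; RAX and RDX hold consecutive Fibonacci numbers
    add rax, rdx        ; a += b
    add rdx, rax        ; b += a  (the two values just alternate registers)
    dec ecx
    jnz fib_loop        ; two Fibonacci steps per iteration, never needs xchg or mov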
Since xchg is still slow-ish on previous CPUs, compilers wouldn't start using it with -mtune=generic for several years. Unlike fxch or mov-elimination, a design-change to support fast xchg wouldn't help the CPU run most existing code faster, and would only enable performance gains over the current design in rare cases where it's actually a useful peephole optimization.
Integer registers are complicated by partial-register stuff, unlike x87
There are 4 operand sizes of xchg, 3 of which use the same opcode with REX or operand-size prefixes. (xchg r8,r8 is a separate opcode, so it's probably easier to make the decoders decode it differently from the others). The decoders already have to recognize xchg with a memory operand as special, because of the implicit lock prefix, but it's probably less decoder complexity (transistor-count + power) if the reg-reg forms all decode to the same number of uops for different operand sizes.
Making some r,r forms decode to a single uop would be even more complexity, because single-uop instructions have to be handled by the "simple" decoders as well as the complex decoder. So they would all need to be able to parse xchg and decide whether it was a single uop or multi-uop form.
AMD and Intel CPUs behave somewhat similarly from a programmer's perspective, but there are many signs that the internal implementation is vastly different. For example, Intel mov-elimination only works some of the time, limited by some kind of microarchitectural resources, but AMD CPUs that do mov-elimination do it 100% of the time (e.g. Bulldozer for the low lane of vector regs).
See Intel's optimization manual, Example 3-23. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions, where they discuss overwriting the zero-latency-movzx result right away to free up the internal resource sooner. (I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work significantly more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx instructions.)
I don't know exactly what needs tracking in a limited-size table(?) for mov-elimination. Probably it's related to needing to free register-file entries as soon as possible when they're no longer needed, because Physical Register File size limits rather than ROB size can be the bottleneck for the out-of-order window size. Swapping around indices might make this harder.
xor-zeroing is eliminated 100% of the time on Intel Sandybridge-family; it's assumed that this works by renaming to a physical zero register, and this register never needs to be freed.
If xchg used the same mechanism that mov-elimination does, it also could probably only work some of the time. It would need to decode to enough uops to work in cases where it isn't handled at rename. (Or else the issue/rename stage would have to insert extra uops when an xchg will take more than 1 uop, like it does when un-laminating micro-fused uops with indexed addressing modes that can't stay micro-fused in the ROB, or when inserting merging uops for flags or high-8 partial registers. But that's a significant complication that would only be worth doing if xchg was a common and important instruction.)
Note that xchg r32,r32 has to zero-extend both results to 64 bits, so it can't be a simple swap of RAT (Register Alias Table) entries. It would be more like truncating both registers in-place. And note that Intel CPUs never eliminate mov same,same. It does already need to support mov r32,r32 and movzx r32, r8 with no execution port, so presumably it has some bits that indicate that rax = al or something. (And yes, Intel HSW/SKL do that, not just Ivybridge, despite what Agner's microarch guide says.)
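For example (values chosen arbitrarily to make the truncation visible):
mov  rax, 0x1111111122222222
mov  rdx, 0x3333333344444444
xchg eax, edx            # swaps only the low 32 bits, zero-extending both results:
                         # rax = 0x0000000044444444, rdx = 0x0000000022222222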
We know P6 and SnB had upper-zeroed bits like this, because xor eax,eax before setz al avoids a partial-register stall when reading eax. HSW/SKL never rename al separately in the first place, only ah. It may not be a coincidence that partial-register renaming (other than AH) seems to have been dropped in the same uarch that introduced mov-elimination (Ivybridge). Still, setting that bit for 2 registers at once would be a special case that required special support.
xchg r64,r64 could maybe just swap the RAT entries, but decoding that differently from the r32 case is yet another complication. It might still need to trigger partial-register merging for both inputs, but add r64,r64 needs to do that, too.
Also note that an Intel uop (other than fxch) only ever produces one register result (plus flags). Not touching flags doesn't "free up" an output slot; for example, mulx r64,r64,r64 still takes 2 uops to produce 2 integer outputs on HSW/SKL, even though all the "work" is done in the multiply unit on port 1, same as with mul r64 which does produce a flag result.
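(For reference, a minimal mulx usage sketch, since its operand order is easy to forget; register choices are arbitrary:)
# BMI2 mulx: two explicit destinations, one explicit source, RDX as the implicit
# second source, and no flag output -- yet still 2 uops because of the 2 register results
mov  rdx, rsi            # the implicit multiplicand goes in RDX
mulx rcx, rax, rdi       # rcx = high 64 bits, rax = low 64 bits of rdx * rdi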
Even if it is as simple as "swap the RAT entries", building a RAT that supports writing more than one entry per uop is a complication. What to do when renaming 4 xchg uops in a single issue group? It seems to me like it would make the logic significantly more complicated. Remember that this has to be built out of logic gates / transistors. Even if you say "handle that special case with a trap to microcode", you have to build the whole pipeline to support the possibility that that pipeline stage could take that kind of exception.
Single-uop fxch requires support for swapping RAT entries (or some other mechanism) in the FP RAT (fRAT), but it's a separate block of hardware from the integer RAT (iRAT). Leaving out that complication in the iRAT seems reasonable even if you have it in the fRAT (pre-Haswell).
Issue/rename complexity is definitely an issue for power consumption, though. Note that Skylake widened a lot of the front-end (legacy decode and uop cache fetch), and retirement, but kept the 4-wide issue/rename limit. SKL also added replicated execution units on more ports in the back-end, so issue bandwidth is a bottleneck even more of the time, especially in code with a mix of loads, stores, and ALU.
The RAT (or the integer register file, IDK) may even have limited read ports, since there seem to be some front-end bottlenecks in issuing/renaming many 3-input uops like add rax, [rcx+rdx]. I posted some microbenchmarks (this and the follow-up post) showing Skylake being faster than Haswell when reading lots of registers, e.g. with micro-fusion of indexed addressing modes. Or maybe the bottleneck there was really some other microarchitectural limit.
But how does 1-uop fxch work? IDK how it's done in Sandybridge / Ivybridge. In P6-family CPUs, an extra remapping table exists basically to support FXCH. That might only be needed because P6 uses a Retirement Register File with 1 entry per "logical" register, instead of a physical register file (PRF). As you say, you'd expect it to be simpler when even "cold" register values are just a pointer to a PRF entry. (Source: US patent 5,499,352, "Floating point register alias table FXCH and retirement floating point register array", which describes Intel's P6 uarch.)
One main reason the rfRAT array 802 is included within the present invention fRAT logic is a direct result of the manner in which the present invention implements the FXCH instruction.
(Thanks Andy Glew (#krazyglew), I hadn't thought of looking up patents to find out about CPU internals.) It's pretty heavy going, but may provide some insight into the bookkeeping needed for speculative execution.
Interesting tidbit: the patent describes integer as well, and mentions that there are some "hidden" logical registers which are reserved for use by microcode. (Intel's 3-uop xchg almost certainly uses one of these as a temporary.)
We might be able to get some insight from looking at what AMD does.
Interestingly, AMD has 2-uop xchg r,r in K10, Bulldozer-family, Bobcat/Jaguar, and Ryzen. (But Jaguar xchg r8,r8 is 3 uops. Maybe to support the xchg ah,al corner case without a special uop for swapping the low 16 of a single reg).
Presumably both uops read the old values of the input architectural registers before the first one updates the RAT. IDK exactly how this works, since they aren't necessarily issued/renamed in the same cycle (but they are at least contiguous in the uop flow, so at worst the 2nd uop is the first uop in the next cycle). I have no idea if Haswell's 2-uop fxch works similarly, or if they're doing something else.
Ryzen is a new architecture designed after mov-elimination was "invented", so presumably they take advantage of it wherever possible. (Bulldozer-family renames vector moves (but only for the low 128b lane of YMM vectors); Ryzen is the first AMD architecture to do it for GP regs too.) xchg r32,r32 and r64,r64 are zero-latency (renamed), but still 2 uops each. (r8 and r16 need an execution unit, because they merge with the old value instead of zero-extending or copying the entire reg, but are still only 2 uops).
Ryzen's fxch is 1 uop. AMD (like Intel) probably isn't spending a lot of transistors on making x87 fast (e.g. fmul is only 1 per clock and on the same port as fadd), so presumably they were able to do this without a lot of extra support. Their micro-coded x87 instructions (like fyl2x) are faster than on recent Intel CPUs, so maybe Intel cares even less (at least about the microcoded x87 instruction).
Maybe AMD could have made xchg r64,r64 a single uop too, more easily than Intel. Maybe even xchg r32,r32 could be single uop, since like Intel it needs to support mov r32,r32 zero-extension with no execution port, so maybe it could just set whatever "upper 32 zeroed" bit exists to support that. Ryzen doesn't eliminate movzx r32, r8 at rename, so presumably there's only an upper32-zero bit, not bits for other widths.
What Intel might be able to do cheaply if they wanted to:
It's possible that Intel could support 2-uop xchg r,r the way Ryzen does (zero latency for the r32,r32 and r64,r64 forms, or 1c for the r8,r8 and r16,r16 forms) without too much extra complexity in critical parts of the core, like the issue/rename and retirement stages that manage the Register Alias Table (RAT). But maybe not, if they can't have 2 uops read the "old" value of a register when the first uop writes it.
Stuff like xchg ah,al is definitely an extra complication, since Intel CPUs don't rename partial registers separately anymore, except AH/BH/CH/DH.
xchg latency in practice on current hardware
Your guess about how it might work internally is good. It almost certainly uses one of the internal temporary registers (accessible only to microcode). Your guess about how they can reorder is too limited, though.
In fact, one direction has 2c latency and the other direction has ~1c latency.
00000000004000e0 <_start.loop>:
4000e0: 48 87 d1 xchg rcx,rdx # slow version
4000e3: 48 83 c1 01 add rcx,0x1
4000e7: 48 83 c1 01 add rcx,0x1
4000eb: 48 87 ca xchg rdx,rcx
4000ee: 48 83 c2 01 add rdx,0x1
4000f2: 48 83 c2 01 add rdx,0x1
4000f6: ff cd dec ebp
4000f8: 7f e6 jg 4000e0 <_start.loop>
This loop runs in ~8.06 cycles per iteration on Skylake. Reversing the xchg operands makes it run in ~6.23 cycles per iteration (measured with perf stat on Linux). uops issued/executed counters are equal, so no elimination happened. It looks like the dst <- src direction is the slow one, since putting the add uops on that dependency chain makes things slower than when they're on the dst -> src dependency chain.
If you ever want to use xchg reg,reg on the critical path (code-size reasons?), do it with the dst -> src direction on the critical path, because that's only about 1c latency.
Other side-topics from comments and the question
The 3 micro-ops throws off my 4-1-1-1 cadence
Sandybridge-family decoders are different from Core2/Nehalem. They can produce up to 4 uops total, not 7, so the patterns are 1-1-1-1, 2-1-1, 3-1, or 4.
Also beware that if the last uop is one that can macro-fuse, they will hang onto it until the next decode cycle in case the first instruction in the next block is a jcc. (This is a win when code runs multiple times from the uop cache for each time it's decoded. And that's still usually 3 uops per clock decode throughput.)
Skylake has an extra "simple" decoder so it can do 1-1-1-1-1 up to 4-1 I guess, but > 4 uops for one instruction still requires the microcode ROM. Skylake beefed up the uop cache, too, and can often bottleneck on the 4 fused-domain uops per clock issue/rename throughput limit if the back-end (or branch misses) aren't a bottleneck first.
I'm literally searching for ~1% speed bumps so hand optimization has been working out on the main loop code. Unfortunately that's ~18kB of code so I'm not even trying to consider the uop cache anymore.
That seems kinda crazy, unless you're mostly limiting yourself to asm-level optimization in shorter loops inside your main loop. Any inner loops within the main loop will still run from the uop cache, and that should probably be where you're spending most of your time optimizing. Compilers usually do a good-enough job that it's not practical for a human to do much over a large scale. Try to write your C or C++ in such a way that the compiler can do a good job with it, of course, but looking for tiny peephole optimizations like this over 18kB of code seems like going down the rabbit hole.
Use perf counters like idq.dsb_uops vs. uops_issued.any to see how many of your total uops came from the uop cache (DSB = Decoded Stream Buffer or something). Intel's optimization manual has some suggestions for other perf counters to look at for code that doesn't fit in the uop cache, such as DSB2MITE_SWITCHES.PENALTY_CYCLES. (MITE is the legacy-decode path). Search the pdf for DSB to find a few places it's mentioned.
Perf counters will help you find spots with potential problems, e.g. regions with higher than average uops_issued.stall_cycles could benefit from finding ways to expose more ILP if there are any, or from solving a front-end problem, or from reducing branch-mispredicts.
As discussed in comments, a single uop produces at most 1 register result
As an aside, with a mul %rbx, do you really get %rdx and %rax all at once or does the ROB technically have access to the lower part of the result one cycle earlier than the higher part? Or is it like the "mul" uop goes into the multiplication unit and then the multiplication unit issues two uops straight into the ROB to write the result at the end?
Terminology: the multiply result doesn't go into the ROB. It goes over the forwarding network to whatever other uops read it, and goes into the PRF.
The mul %rbx instruction decodes to 2 uops in the decoders. They don't even have to issue in the same cycle, let alone execute in the same cycle.
However, Agner Fog's instruction tables only list a single latency number. It turns out that 3 cycles is the latency from both inputs to RAX. The minimum latency for RDX is 4c, according to InstlatX64 testing on both Haswell and Skylake-X.
From this, I conclude that the 2nd uop is dependent on the first, and exists to write the high half of the result to an architectural register. The port1 uop produces a full 128b multiply result.
I don't know where the high-half result lives until the p6 uop reads it. Perhaps there's some sort of internal queue between the multiply execution unit and hardware connected to port 6. By scheduling the p6 uop with a dependency on the low-half result, that might arrange for the p6 uops from multiple in-flight mul instructions to run in the correct order. But then instead of actually using that dummy low-half input, the uop would take the high half result from the queue output in an execution unit that's connected to port 6 and return that as the result. (This is pure guess work, but I think it's plausible as one possible internal implementation. See comments for some earlier ideas).
Interestingly, according to Agner Fog's instruction tables, on Haswell the two uops for mul r64 go to ports 1 and 6. mul r32 is 3 uops, and runs on p1 + p0156. Agner doesn't say whether that's really 2p1 + p0156 or p1 + 2p0156 like he does for some other insns. (However, he says that mulx r32,r32,r32 runs on p1 + 2p056 (note that p056 doesn't include p1).)
Even more strangely, he says that Skylake runs mulx r64,r64,r64 on p1 p5 but mul r64 on p1 p6. If that's accurate and not a typo (which is a possibility), it pretty much rules out the possibility that the extra uop is an upper-half multiplier.
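The kind of dependency-chain loops that separate those two latencies look roughly like this (a sketch of the method, not InstLatX64's actual code; assume ecx holds an iteration count and is re-initialized before each loop):
# latency from RAX input to RAX output: each mul's low-half result is the
# next iteration's implicit RAX input (~3 cycles/iteration on Skylake)
    mov  rbx, 1              # multiply by 1 so the value stays constant
rax_chain:
    mul  rbx                 # rdx:rax = rax * rbx
    dec  ecx
    jnz  rax_chain

# latency to RDX: route the loop-carried dependency through the high-half
# result instead (~4 cycles/iteration; the mov may be eliminated at rename)
rdx_chain:
    mul  rbx
    mov  rax, rdx
    dec  ecx
    jnz  rdx_chain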

INC instruction vs ADD 1: Does it matter?

From Ira Baxter's answer on Why do the INC and DEC instructions not affect the Carry Flag (CF)?
Mostly, I stay away from INC and DEC now, because they do partial condition code updates, and this can cause funny stalls in the pipeline, and ADD/SUB don't. So where it doesn't matter (most places), I use ADD/SUB to avoid the stalls. I use INC/DEC only when keeping the code small matters, e.g., fitting in a cache line where the size of one or two instructions makes enough difference to matter. This is probably pointless nano[literally!]-optimization, but I'm pretty old-school in my coding habits.
And I would like to ask why it can cause stalls in the pipeline while add doesn't. After all, both ADD and INC update the flags register. The only difference is that INC doesn't update CF. But why does that matter?
Update: the Efficiency cores on Alder Lake are Gracemont, and run inc reg as a single uop, but at only 1/clock, vs. 4/clock for add reg, 1 (https://uops.info/). This may be a false dependency on FLAGS like P4 had; the uops.info tests didn't try adding a dep-breaking instruction. Other than the TL:DR, I haven't updated other parts of this answer.
TL:DR/advice for modern CPUs: Probably use add; Intel Alder Lake's E-cores are relevant for "generic" tuning and seem to run inc slowly.
Other than Alder Lake E-cores and the earlier Silvermont-family, use inc except with a memory destination; that's fine on mainstream Intel or any AMD. (e.g. like gcc -mtune=core2, -mtune=haswell, or -mtune=znver1). inc mem costs an extra uop vs. add on Intel P6 / SnB-family; the load can't micro-fuse.
If you care about Silvermont-family (including KNL in Xeon Phi, and some netbooks, chromebooks, and NAS servers), probably avoid inc. add 1 only costs 1 extra byte in 64-bit code, or 2 in 32-bit code. But it's not a performance disaster (just locally 1 extra ALU port used, not creating false dependencies or big stalls), so if you don't care much about SMont then don't worry about it.
Writing CF instead of leaving it unmodified can potentially be useful with other surrounding code that might benefit from CF dep-breaking, e.g. shifts. See below.
If you want to inc/dec without touching any flags, lea eax, [rax+1] runs efficiently and has the same code-size as add eax, 1. (Usually on fewer possible execution ports than add/inc, though, so add/inc are better when destroying FLAGS is not a problem. https://agner.org/optimize/)
On modern CPUs, add is never slower than inc (except for indirect code-size / decode effects), but usually it's not faster either, so you should prefer inc for code-size reasons. Especially if this choice is repeated many times in the same binary (e.g. if you are a compiler-writer).
inc saves 1 byte in 64-bit mode, or 2 bytes in 32-bit mode (opcodes 0x40..0x4F are the short-form inc r32/dec r32 encodings there, re-purposed as REX prefixes in x86-64). This makes a small percentage difference in total code size. This helps instruction-cache hit rates, iTLB hit rate, and number of pages that have to be loaded from disk.
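For concreteness, the encodings in question (shown disassembly-style, 64-bit mode unless noted):
ff c0          inc eax            # 2 bytes in 64-bit mode
83 c0 01       add eax, 1         # 3 bytes: opcode + modrm + imm8
8d 40 01       lea eax, [rax+1]   # 3 bytes, and doesn't touch FLAGS
40             inc eax            # 1 byte, 32-bit mode only (re-used as a REX prefix in 64-bit mode)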
Advantages of inc:
code-size directly
Not using an immediate can have uop-cache effects on Sandybridge-family, which could offset the better micro-fusion of add. (See Agner Fog's table 9.1 in the Sandybridge section of his microarch guide.) Perf counters can easily measure issue-stage uops, but it's harder to measure how things pack into the uop cache and uop-cache read bandwidth effects.
Leaving CF unmodified is an advantage in some cases, on CPUs where you can read CF after inc without a stall. (Not on Nehalem and earlier.)
There is one exception among modern CPUs: Silvermont/Goldmont/Knight's Landing decodes inc/dec efficiently as 1 uop, but expands to 2 in the allocate/rename (aka issue) stage. The extra uop merges partial flags. inc throughput is only 1 per clock, vs. 0.5c (or 0.33c Goldmont) for independent add r32, imm8 because of the dep chain created by the flag-merging uops.
Unlike P4, the register result doesn't have a false-dep on flags (see below), so out-of-order execution takes the flag-merging off the latency critical path when nothing uses the flag result. (But the OOO window is much smaller than mainstream CPUs like Haswell or Ryzen.) Running inc as 2 separate uops is probably a win for Silvermont in most cases; most x86 instructions write all the flags without reading them, breaking these flag dependency chains.
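A sketch of the kind of throughput test that shows this (register choices arbitrary; replacing the incs with add reg,1 is the comparison case):
# per the description above, on Silvermont each inc's flag-merging uop depends on the
# previous flag state, so even these independent incs run at ~1/clock; the add version doesn't
inc_tput:
    inc eax
    inc ebx
    inc esi
    inc edi
    dec ecx              # dec has the same partial-flag behaviour as inc
    jnz inc_tput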
SMont/KNL has a queue between decode and allocate/rename (See Intel's optimization manual, figure 16-2) so expanding to 2 uops during issue can fill bubbles from decode stalls (on instructions like one-operand mul, or pshufb, which produce more than 1 uop from the decoder and cause a 3-7 cycle stall for microcode). Or on Silvermont, just an instruction with more than 3 prefixes (including escape bytes and mandatory prefixes), e.g. REX + any SSSE3 or SSE4 instruction. But note that there is a ~28 uop loop buffer, so small loops don't suffer from these decode stalls.
inc/dec aren't the only instructions that decode as 1 but issue as 2: push/pop, call/ret, and lea with 3 components do this too. So do KNL's AVX512 gather instructions. Source: Intel's optimization manual, 17.1.2 Out-of-Order Engine (KNL). It's only a small throughput penalty (and sometimes not even that if anything else is a bigger bottleneck), so it's generally fine to still use inc for "generic" tuning.
Intel's optimization manual still recommends add 1 over inc in general, to avoid risks of partial-flag stalls. But since Intel's compiler doesn't do that by default, it's not too likely that future CPUs will make inc slow in all cases, like P4 did.
Clang 5.0 and Intel's ICC 17 (on Godbolt) do use inc when optimizing for speed (-O3), not just for size. -mtune=pentium4 makes them avoid inc/dec, but the default -mtune=generic doesn't put much weight on P4.
ICC17 -xMIC-AVX512 (equivalent to gcc's -march=knl) does avoid inc, which is probably a good bet in general for Silvermont / KNL. But it's not usually a performance disaster to use inc, so it's probably still appropriate for "generic" tuning to use inc/dec in most code, especially when the flag result isn't part of the critical path.
Other than Silvermont, this is mostly-stale optimization advice left over from Pentium4. On modern CPUs, there's only a problem if you actually read a flag that wasn't written by the last insn that wrote any flags. e.g. in BigInteger adc loops. (And in that case, you need to preserve CF so using add would break your code.)
add writes all the condition-flag bits in the EFLAGS register. Register-renaming makes write-only easy for out-of-order execution: see write-after-write and write-after-read hazards. add eax, 1 and add ecx, 1 can execute in parallel because they are fully independent of each other. (Even Pentium4 renames the condition flag bits separate from the rest of EFLAGS, since even add leaves the interrupts-enabled and many other bits unmodified.)
On P4, inc and dec depend on the previous value of the all the flags, so they can't execute in parallel with each other or preceding flag-setting instructions. (e.g. add eax, [mem] / inc ecx makes the inc wait until after the add, even if the add's load misses in cache.) This is called a false dependency. Partial-flag writes work by reading the old value of the flags, updating the bits other than CF, then writing the full flags.
All other out-of-order x86 CPUs (including AMD's) rename different parts of flags separately, so internally they do a write-only update to all the flags except CF. (source: Agner Fog's microarchitecture guide). Only a few instructions, like adc or cmc, truly read and then write flags. But also shl r, cl (see below).
Cases where add dest, 1 is preferable to inc dest, at least for Intel P6/SnB uarch families:
Memory-destination: add [rdi], 1 can micro-fuse the store and the load+add on Intel Core2 and SnB-family, so it's 2 fused-domain uops / 4 unfused-domain uops.
inc [rdi] can only micro-fuse the store, so it's 3F / 4U.
According to Agner Fog's tables, AMD and Silvermont run memory-dest inc and add the same, as a single macro-op / uop.
But beware of uop-cache effects with add [label], 1 which needs a 32-bit address and an 8-bit immediate for the same uop.
Before a variable-count shift/rotate to break the dependency on flags and avoid partial-flag merging: shl reg, cl has an input dependency on the flags, because of unfortunate CISC history: it has to leave them unmodified if the shift count is 0.
On Intel SnB-family, variable-count shifts are 3 uops (up from 1 on Core2/Nehalem). AFAICT, two of the uops read/write flags, and an independent uop reads reg and cl, and writes reg. It's a weird case of having better latency (1c + inevitable resource conflicts) than throughput (1.5c), and only being able to achieve max throughput if mixed with instructions that break dependencies on flags. (I posted more about this on Agner Fog's forum). Use BMI2 shlx when possible; it's 1 uop and the count can be in any register.
Anyway, inc (writing flags but leaving CF unmodified) before variable-count shl leaves it with a false dependency on whatever wrote CF last, and on SnB/IvB can require an extra uop to merge flags.
Core2/Nehalem manage to avoid even the false dep on flags: Merom runs a loop of 6 independent shl reg,cl instructions at nearly two shifts per clock, same performance with cl=0 or cl=13. Anything better than 1 per clock proves there's no input-dependency on flags.
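That test loop was roughly of this shape (a reconstruction from the description, not the exact code):
# 6 independent variable-count shifts per iteration: if shl reg,cl had a true input
# dependency on FLAGS, each shift would wait for the previous one's flag output and
# throughput couldn't exceed 1 per clock
shl_loop:
    shl eax, cl
    shl ebx, cl
    shl edx, cl
    shl esi, cl
    shl edi, cl
    shl r8d, cl
    dec r9d
    jnz shl_loop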
I tried loops with shl edx, 2 and shl edx, 0 (immediate-count shifts), but didn't see a speed difference between dec and sub on Core2, HSW, or SKL. I don't know about AMD.
Update: The nice shift performance on Intel P6-family comes at the cost of a large performance pothole which you need to avoid: when an instruction depends on the flag-result of a shift instruction, the front end stalls until the instruction is retired. (Source: Intel's optimization manual, Section 3.5.2.6: Partial Flag Register Stalls.) So shr eax, 2 / jnz is pretty catastrophic for performance on Intel pre-Sandybridge, I guess! Use shr eax, 2 / test eax,eax / jnz if you care about Nehalem and earlier. Intel's example makes it clear this applies to immediate-count shifts, not just count=cl.
In processors based on Intel Core microarchitecture [this means Core 2 and later], shift immediate by 1 is handled by special hardware such that it does not experience partial flag stall.
Intel actually means the special opcode with no immediate, which shifts by an implicit 1. I think there is a performance difference between the two ways of encoding shr eax,1: the short encoding (the original 8086 opcode D1 /5) produces a write-only (partial) flag result, while the longer encoding (C1 /5 with an imm8 of 1) apparently doesn't have its immediate checked for 0 until execution time, without the flag output being tracked as write-only in the out-of-order machinery.
Since looping over bits is common, but looping over every 2nd bit (or any other stride) is very uncommon, this seems like a reasonable design choice. This explains why compilers like to test the result of a shift instead of directly using flag results from shr.
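The two encodings, disassembly-style (assemblers normally pick the short form when you write a count of 1):
d1 e8          shr eax, 1         # short form: opcode D1 /5, implicit count of 1
c1 e8 01       shr eax, 1         # long form: opcode C1 /5 with an explicit imm8 = 1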
Update: for variable count shifts on SnB-family, Intel's optimization manual says:
3.5.1.6 Variable Bit Count Rotation and Shift
In Intel microarchitecture code name Sandy Bridge, the "ROL/ROR/SHL/SHR reg, cl" instruction has three micro-ops. When the flag result is not needed, one of these micro-ops may be discarded, providing better performance in many common usages. When these instructions update partial flag results that are subsequently used, the full three micro-ops flow must go through the execution and retirement pipeline, experiencing slower performance. In Intel microarchitecture code name Ivy Bridge, executing the full three micro-ops flow to use the updated partial flag result has additional delay.
Consider the looped sequence below:
loop:
shl eax, cl
add ebx, eax
dec edx ; DEC does not update carry, causing SHL to execute slower three micro-ops flow
jnz loop
The DEC instruction does not modify the carry flag. Consequently, the SHL EAX, CL instruction needs to execute the three micro-ops flow in subsequent iterations. The SUB instruction will update all flags. So replacing DEC with SUB will allow SHL EAX, CL to execute the two micro-ops flow.
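In other words, Intel's suggested fix is simply (label name mine):
shift_loop:
    shl eax, cl
    add ebx, eax
    sub edx, 1        ; unlike DEC, SUB also writes CF, so SHL can use the faster two micro-op flow
    jnz shift_loop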
Terminology
Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because they never need to be merged. It has false dependencies instead.
Several answers / comments mix up the terminology. They describe a false dependency, but then call it a partial-flag stall. It's a slowdown which happens because of writing only some of the flags, but the term "partial-flag stall" is what happens on pre-SnB Intel hardware when partial-flag writes have to be merged. Intel SnB-family CPUs insert an extra uop to merge flags without stalling. Nehalem and earlier stall for ~7 cycles. I'm not sure how big the penalty is on AMD CPUs.
(Note that partial-register penalties are not always the same as partial-flags, see below).
Partial flag stall on Intel P6-family CPUs:
bigint_loop:
adc eax, [array_end + rcx*4] # partial-flag stall when adc reads CF
inc rcx # rcx counts up from negative values towards zero
# test rcx,rcx # eliminate partial-flag stalls by writing all flags, or better use add rcx,1
jnz bigint_loop
# this loop doesn't do anything useful; it's not normally useful to loop the carry-out back to the carry-in for the same accumulator.
# Note that `test` will change the input to the next adc, and so would replacing inc with add 1
In other cases, e.g. a partial flag write followed by a full flag write, or a read of only flags written by inc, is fine. On SnB-family CPUs, inc/dec can even macro-fuse with a jcc, the same as add/sub.
After P4, Intel mostly gave up on trying to get people to re-compile with -mtune=pentium4 or modify hand-written asm as much to avoid serious bottlenecks. (Tuning for a specific microarchitecture will always be a thing, but P4 was unusual in deprecating so many things that used to be fast on previous CPUs, and thus were common in existing binaries.) P4 wanted people to use a RISC-like subset of the x86, and also had branch-prediction hints as prefixes for JCC instructions. (It also had other serious problems, like the trace cache that just wasn't good enough, and weak decoders that meant bad performance on trace-cache misses. Not to mention the whole philosophy of clocking very high ran into the power-density wall.)
When Intel abandoned P4 (NetBurst uarch), they returned to P6-family designs (Pentium-M / Core2 / Nehalem) which inherited their partial-flag / partial-reg handling from earlier P6-family CPUs (PPro to PIII) which pre-dated the NetBurst mis-step. (Not everything about P4 was inherently bad, and some of the ideas re-appeared in Sandybridge, but overall NetBurst is widely considered a mistake.) Some very-CISC instructions are still slower than the multi-instruction alternatives, e.g. enter, loop, or bt [mem], reg (because the value of reg affects which memory address is used), but these were all slow in older CPUs so compilers already avoided them.
Pentium-M even improved hardware support for partial-regs (lower merging penalties). In Sandybridge, Intel kept partial-flag and partial-reg renaming and made it much more efficient when merging is needed (merging uop inserted with no or minimal stall). SnB made major internal changes and is considered a new uarch family, even though it inherits a lot from Nehalem, and some ideas from P4. (But note that SnB's decoded-uop cache is not a trace cache, though, so it's a very different solution to the decoder throughput/power problem that NetBurst's trace cache tried to solve.)
For example, inc al and inc ah can run in parallel on P6/SnB-family CPUs, but reading eax afterwards requires merging.
PPro/PIII stall for 5-6 cycles when reading the full reg. Core2/Nehalem stall for only 2 or 3 cycles while inserting a merging uop for partial regs, but partial flags are still a longer stall.
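A minimal example of that pattern:
inc  al             # AL renamed separately from the full register on P6/SnB-family
inc  ah             # independent of the write to AL, so they can execute in parallel
mov  edx, eax       # reading the full register needs AL and AH merged back into EAX:
                    #   a multi-cycle stall on PPro..Nehalem, just a merging uop on SnB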
SnB inserts a merging uop without stalling, like for flags. Intel's optimization guide says that for merging AH/BH/CH/DH into the wider reg, inserting the merging uop takes an entire issue/rename cycle during which no other uops can be allocated. But for low8/low16, the merging uop is "part of the flow", so it apparently doesn't cause additional front-end throughput penalties beyond taking up one of the 4 slots in an issue/rename cycle.
In IvyBridge (or at least Haswell), Intel dropped partial-register renaming for low8 and low16 registers, keeping it only for high8 registers (AH/BH/CH/DH). Reading high8 registers has extra latency. Also, setcc al has a false dependency on the old value of rax, unlike in Nehalem and earlier (and probably Sandybridge). See this HSW/SKL partial-register performance Q&A for the details.
(I've previously claimed that Haswell could merge AH with no uop, but that's not true and not what Agner Fog's guide says. I skimmed too quickly and unfortunately repeated my wrong understanding in lots of comments and other posts.)
AMD CPUs, and Intel Silvermont, don't rename partial regs (other than flags), so mov al, [mem] has a false dependency on the old value of eax. (The upside is no partial-reg merging slowdowns when reading the full reg later.)
Normally, the only time add instead of inc will make your code faster on AMD or mainstream Intel is when your code actually depends on the doesn't-touch-CF behaviour of inc. i.e. usually add only helps when it would break your code, but note the shl case mentioned above, where the instruction reads flags but usually your code doesn't care about that, so it's a false dependency.
If you do actually want to leave CF unmodified, pre-SnB-family CPUs have serious problems with partial-flag stalls, but on SnB-family the overhead of having the CPU merge the partial flags is very low, so it can be best to keep using inc or dec as part of a loop condition when targeting those CPUs, with some unrolling. (For details, see the BigInteger adc Q&A I linked earlier). It can be useful to use lea to do arithmetic without affecting flags at all, if you don't need to branch on the result.
Skylake doesn't have partial-flag merging costs
Update: Skylake doesn't have partial-flag merging uops at all: CF is just a separate register from the rest of FLAGS. Instructions that need both parts (like cmovbe) read both inputs separately. That makes cmovbe a 2-uop instruction, but most other cmovcc instructions 1-uop on Skylake. See What is a Partial Flag Stall?.
adc only reads CF so it can be single-uop on Skylake with no interaction at all with an inc or dec in the same loop.
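So on Skylake a plain BigInteger add loop can keep inc as the loop counter with no merging cost at all. A sketch (register assignments and setup are assumptions for illustration: rcx pre-loaded with a negative index counting up to zero, CF cleared before entry):
# rsi, rdx = source arrays, rdi = destination array
add_loop:
    mov rax, [rsi + rcx*8]
    adc rax, [rdx + rcx*8]     # reads and writes CF; the adc ALU operation is 1 uop on Skylake (see above)
    mov [rdi + rcx*8], rax
    inc rcx                    # leaves CF alone; no flag-merging uop needed on Skylake
    jnz add_loop               # uses ZF from the inc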
(TODO: rewrite earlier parts of this answer.)
Depending on the CPU implementation of the instructions, a partial register update may cause a stall. According to Agner Fog's optimization guide, page 62,
For historical reasons, the INC and DEC instructions leave the carry flag unchanged, while the other arithmetic flags are written to. This causes a false dependence on the previous value of the flags and costs an extra μop. To avoid these problems, it is recommended that you always use ADD and SUB instead of INC and DEC. For example, INC EAX should be replaced by ADD EAX,1.
See also page 83 on "Partial flags stalls" and page 100 on "Partial flags stall".

Resources