Which is generally faster to test for zero in x86 ASM: "TEST EAX, EAX" versus "TEST AL, AL"? - performance

Which is generally faster to test the byte in AL for zero / non-zero?
TEST EAX, EAX
TEST AL, AL
Assume a previous "MOVZX EAX, BYTE PTR [ESP+4]" instruction loaded a byte parameter with zero-extension to the remainder of EAX, preventing the combine-value penalty that I already know about.
So AL=EAX and there are no partial-register penalties for reading EAX.
Intuitively just examining AL might let you think it's faster, but I'm betting there are more penalty issues to consider for byte access of a >32-bit register.
Any info/details appreciated, thanks!

Code-size is equal, and so is performance on all x86 CPUs AFAIK.
Intel CPUs (with partial-register renaming) definitely don't have a penalty for reading AL after writing EAX. Other CPUs also have no penalty for reading low-byte registers.
Reading AH would have a penalty on Intel CPUs, like some extra latency. (How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent)
In general 32-bit operand-size and 8-bit operand size (with low-8 not high-8) are equal speed except for the false-dependencies or later partial-register reading penalties of writing an 8-bit register. Since TEST only reads registers, this can't be a problem. Even add al, bl is fine: the instruction already had an input dependency on both registers, and on Sandybridge-family a RMW to the low byte of a register doesn't rename it separately. (Haswell and later don't rename low-byte registers separately anyway).
Pick whichever operand-size you like. 8-bit and 32-bit are basically equal. The choice is just a matter of human readability. If you're going to work with the value as a 32-bit integer later, then go 32-bit. If it's logically still an 8-bit value and you were only using movzx as the x86 equivalent of ARM ldrb or MIPS lbu, then using 8-bit makes sense.
There are code-size advantages to instructions like cmp al, imm which can use the no-modrm short-form encoding. cmp al, 0 is still worse than test al,al on some old CPUs (Core 2), where cmp/jcc macro-fusion is less flexible than test/jcc macro-fusion. (Test whether a register is zero with CMP reg,0 vs OR reg,reg?)
There is one difference between these instructions: test al,al sets SF according to the high bit of AL (which can be non-zero). test eax,eax will always clear SF. If you only care about ZF then that makes no difference, but if you have a use for the high bit in SF for a later branch or cmovcc/setcc then you can avoid doing a 2nd test.
Other ways to test a byte in memory:
If you're consuming the flag result with setcc or cmovcc, not a jcc branch, then macro-fusion doesn't matter in the discussion below.
If you also need the actual value in a register later, movzx/test/jcc is almost certainly best. Otherwise you can consider a memory-destination compare.
cmp [mem], immediate can micro-fuse into a load+cmp uop on Intel, as long as the addressing mode is not RIP-relative. (On Sandybridge-family, indexed addressing modes will un-laminate even on Haswell and later: See Micro fusion and addressing modes). Agner Fog doesn't mention whether AMD has this limitation for fusing cmp/jcc with a memory operand.
;;; no downside for setcc or cmovcc, only with JCC on Intel
;;; unknown on AMD
cmp byte [esp+4], 0 ; micro-fuses into load+cmp with this addressing mode
jnz ... ; breaks macro-fusion on SnB-family
I don't have an AMD CPU to test whether Ryzen or any other AMD still fuses cmp/jcc when the cmp is mem, immediate. Modern AMD CPUs do in general do cmp/jcc and test/jcc fusion. (But not add/sub/and/jcc fusion like SnB-family).
cmp mem,imm / jcc (vs. movzx/test+jcc):
smaller code-size in bytes
same number of front-end / fused-domain uops (2) on mainstream Intel. This would be 3 front-end uops if micro-fusion of the cmp+load wasn't possible, e.g. with a RIP-relative addressing mode + immediate. Or on Sandybridge-family with an indexed addressing mode, it would unlaminate to 3 uops after decode but before issuing into the back-end.
Advantage: this is still 2 on Silvermont/Goldmont / KNL or very old CPUs without macro-fusion. The main advantage of movzx/test/jcc over this is macro-fusion, so it falls behind on CPUs where that doesn't happen.
3 back-end uops (unfused domain = execution ports and space in the scheduler aka RS) because cmp-immediate can't macro-fuse with a JCC on Intel Sandybridge-family CPUs (tested on Skylake). The uops are load, cmp, and a separate branch uop. (vs. 2 for movzx / test+jcc). Back-end uops usually aren't a bottleneck directly, but if the load isn't ready for a while it takes up more space in the RS, limiting how much further past this out-of-order execution can see.
cmp [mem], reg / jcc can macro + micro-fuse into a single compare+branch uop so it's excellent. If you need a zeroed register for anything later in your function, do xor-zero it first and use it for a single-uop compare+branch on memory.
movzx eax, [esp+4] ; 1 uop (load-port only on Intel and Ryzen)
test al,al ; fuses with jcc
jnz ... ; 1 uop
This is still 2 uops for the front-end but only 2 for the back-end as well. The test/jcc macro-fuse together. It costs more code-size, though.
If you aren't branching but instead using the FLAGS result for cmovcc or setcc, using cmp mem, imm has no downside. It can micro-fuse as long as you don't use a RIP-relative addressing mode (which always blocks micro-fusion when there's also an immediate), or an indexed addressing mode.

Related

Does a Length-Changing Prefix (LCP) incur a stall on a simple x86_64 instruction?

Consider a simple instruction like
mov RCX, RDI # 48 89 f9
The 48 is the REX prefix for x86_64. It is not an LCP. But consider adding an LCP (for alignment purposes):
.byte 0x67
mov RCX, RDI # 67 48 89 f9
67 is an address size prefix which in this case is for an instruction without addresses. This instruction also has no immediates, and it doesn't use the F7 opcode (False LCP stalls; F7 would be TEST, NOT, NEG, MUL, IMUL, DIV + IDIV). Assume that it doesn't cross a 16-byte boundary either. Those are the LCP stall cases mentioned in Intel's Optimization Reference Manual.
Would this instruction incur an LCP stall (on Skylake, Haswell, ...) ? What about two LCPs?
My daily driver is a MacBook. So I don't have access to VTune and I can't look at the ILD_STALL event. Is there any other way to know?
TL:DR: 67h is safe here on all CPUs. In 64-bit mode1, 67h can only LCP-stall with addr32 movabs load/store of the accumulator (AL/AX/EAX/RAX) from/to a moffs 32-bit absolute address (vs. the normal 64-bit absolute for that special opcode).
67h isn't length-changing with normal instructions that use a ModRM, even with a disp32 component in the addressing mode, because 32 and 64-bit address-size use identical ModRM formats. That 67h-LCP-stallable form of mov is special and doesn't use a modrm addressing mode.
(It also almost certainly won't have some other meaning in future CPUs, like being part of longer opcode the way rep is3.)
A Length Changing Prefix is when the opcode(+modrm) would imply a different length in bytes for the non-prefixes part of the instruction's machine code, if you ignored prefixes. I.e. it changes the length of the rest of the instruction. (Parallel length-finding is hard, and done separately from full decode: Later insns in a 16-byte block don't even have known start points. So this min(16-byte, 6-instruction) stage needs to look at as few bits as possible after prefixes, for the normal fast case to work. This is the stage where LCP stalls can happen.)
Usually only with an actual imm16 / imm32 opcode, e.g. 66h is length-changing in add cx, 1234, but not add cx, 12: after prefixes or in the appropriate mode, add r/m16, imm8 and add r/m32, imm8 are both opcode + modrm + imm8, 3 bytes regardless, (https://www.felixcloutier.com/x86/add). Pre-decode hardware can find the right length by just skipping prefixes, not modifying interpretation of later opcode+modrm based on what it saw, unlike when 66h means the opcode implies 2 immediate bytes instead of 4. Assemblers will always pick the imm8 encoding when possible because it's shorter (or equal length for the no-modrm add ax, imm16 special case).
(Note that REX.W=1 is length-changing for mov r64, imm64 vs. mov r32, imm32, but all hardware handles that relatively common instruction efficiently so only 66h and 67h can ever actually LCP-stall.)
SnB-family doesn't have any false2 LCP stalls for prefixes that can be length-changing for this opcode but not this this particular instruction, for either 66h or 67h. So F7 is a non-issue on SnB, unlike Core2 and Nehalem. (Earlier P6-family Intel CPUs didn't support 64-bit mode.) Atom/Silvermont don't have LCP penalties at all, nor do AMD or Via CPUs.
Agner Fog's microarch guide covers this well, and explains things clearly. Search for "length-changing prefixes". (This answer is an attempt to put those pieces together with some reminders about how x86 instruction encoding works, etc.)
Footnote 1: 67h increases length-finding difficulty more in non-64-bit modes:
In 64-bit mode, 67h changes from 64 to 32-bit address size, both of which use disp0 / 8 / 32 (0, 1, or 4 bytes of immediate displacement as part of the instruction), and which use the same ModRM + optional SIB encoding for normal addressing modes. RIP+rel32 repurposes the shorter (no SIB) encoding of 32-bit mode's two redundant ways to encode [disp32], so length decoding is unaffected. Note that REX was already designed not to be length-changing (except for mov r64, imm64), by burdening R13 and R12 in the same ways as RBP and RSP as ModRM "escape codes" to signal no base reg, or presence of a SIB byte, respectively.
In 16 and 32-bit modes, 67h switches to 32 or 16-bit address size. Not only are [x + disp32] vs. [x + disp16] different lengths after the ModRM byte (just like immediates for the operand-size prefix), but also 16-bit address-size can't signal a SIB byte. Why don't x86 16-bit addressing modes have a scale factor, while the 32-bit version has it? So the same bits in the mode and /rm fields can imply different lengths.
Footnote 2: "False" LCP stalls
This need (see footnote 1) to sometimes look differently at ModRM even to find the length is presumably why Intel CPUs before Sandybridge have "false" LCP stalls in 16/32-bit modes on 67h prefixes on any instruction with a ModRM, even when they aren't length-changing (e.g. register addressing mode). Instead of optimistically length-finding and checking somehow, a Core2/Nehalem just punts if they see addr32 + most opcodes, if they're not in 64-bit mode.
Fortunately there's basically zero reason to ever use it in 32-bit code so this mostly only matters for 16-bit code that uses 32-bit registers without switching to protected mode. Or code using 67h for padding like you're doing, except in 32-bit mode. .byte 0x67 / mov ecx, edi would be a problem for Core 2 / Nehalem. (I didn't check earlier 32-bit-only P6 family CPUs. They're a lot more obsolete than Nehalem.)
False LCP stalls for 67h never happen in 64-bit mode; as discussed above that's the easy case, and the length pre-decoders already have to know what mode they're in, so fortunately there's no downside to using it for padding. Unlike rep (which could become part of some future opcode), 67h is extremely likely to be safely ignored for instructions where it can apply to some form of the same opcode, even if there isn't actually a memory operand for this one.
Sandybridge-family doesn't ever have any false LCP stalls, removing both the 16/32-bit mode address-size (67h) and the all-modes 66 F7 cases (which needs to look at ModRM to disambiguate instructions like neg di or mul di from test di, imm16.)
SnB-family also removes some 66h true-LCP stalls, e.g. from mov-immediate like mov word ptr [rdi], 0 which is actually useful.
Footnote 3: forward compat of using 67h for padding
When 67h applies to the opcode in general (i.e. it can use a memory operand), it's very unlikely that it will mean something else for the same opcode with a modrm that just happens to encode reg,reg operands. So this is safe for What methods can be used to efficiently extend instruction length on modern x86?.
In fact, "relaxing" a 6-byte call [RIP+rel32] to a 5-byte call rel32 is done by GNU binutils by padding the call rel32 with a 67h address-size prefix, even though that's never meaningful for E8 call rel32. (This happens when linking code compiled with -fno-plt, which uses call [RIP + foo#gotpcrel] for any foo that's not found in the current compilation unit and doesn't have "hidden" visibility.)
But that's not a good precedent: at this point it's too widespread for CPU vendors to want to break that specific prefix+opcode combo (like for What does `rep ret` mean?), but some homebrewed thing in your program like 67h cdq would not get the same treatment from vendors.
The rules, for Sandybridge-family CPUs
edited/condensed from Agner's microarch PDF, these cases can LCP-stall, taking an extra 2 to 3 cycles in pre-decode (if they miss in the uop cache).
Any ALU op with an imm16 that would be imm32 without a 66h. (Except mov-immediate).
Remember that mov and test don't have imm8 forms for wider operand-size so prefer test al, 1, or imm32 if necessary. Or sometimes even test ah, imm8 if you want to test bits in the top half of AX, although beware of 1 cycle of extra latency for reading AH after writing the full reg on HSW and later. GCC uses this trick but should maybe start being careful with it, perhaps sometimes using bt reg, imm8 when feeding a setcc or cmovcc (which can't macro-fuse with test like JCC can).
67h with movabs moffs (A0/A1/A2/A3 opcodes in 64-bit mode, and probably also in 16 or 32-bit mode). Confirmed by my testing with perf counters for ild_stall.lcp on Skylake when LLVM was deciding whether to optimize mov al, [0x123456] to use 67 A0 4-byte-address or a normal opcode + modrm + sib + disp32 (to get absolute instead of rip-relative). That refers to an old version of Agner's guide; he updated soon after I sent him my test results.
If one of the instructions NEG, NOT, DIV, IDIV, MUL and IMUL with a single operand
has a 16-bit operand and there is a 16-bytes boundary between the opcode byte and
the mod-reg-rm byte. These instructions have a bogus length-changing prefix
because these instructions have the same opcode as the TEST instruction with a 16-
bit immediate operand [...]
No penalty on SnB-family for div cx or whatever, regardless of alignment.
The address size prefix (67H) will always cause a delay in 16-bit and 32-bit mode on any
instruction that has a mod/reg/rm byte even if it doesn't change the length of the instruction.
SnB-family removed this penalty, making address-size prefixes usable as padding if you're careful.
Or to summarize another way:
SnB-family has no false LCP stalls.
SnB-family has LCP stalls on every 66h and 67h true LCP except for:
mov r/m16, imm16 and the mov r16, imm16 no-modrm version.
67h address size interaction with ModRM (in 16/32-bit modes).
(That excludes the no-modrm absolute address load/store of AL/AX/EAX/RAX forms- they can still LCP-stall, presumably even in 32-bit mode, like in 64-bit.)
Length-changing REX doesn't stall (on any CPU).
Some examples
(This part ignores the false LCP stalls that some CPUs have in some non-length-changing cases which turn out not to matter here, but perhaps that's why you were worried about 67h for mov reg,reg.)
In your case, the rest of the instruction bytes, starting after the 67, decode as a 3-byte instruction whether the current address-size is 32 or 64. Same even with addressing modes like mov eax, [e/rsi + 1024] (reg+disp32) or addr32 mov edx, [RIP + rel32].
In 16 and 32-bit modes, 67h switches between 16 and 32-bit address size. [x + disp32] vs. [x + disp16] are different lengths after the ModRM byte, but also non-16-bit address-size can signal a SIB byte depending on the R/M field. But in 64-bit mode, 32 and 64-bit address-size both use [x + disp32], and the same ModRM->SIB or not encoding.
There is only one case where a 67h address-size prefix is length-changing in 64-bit mode: movabs load/store with 8-byte vs. 4-byte absolute addresses, and yes it does LCP-stall Intel CPUs. (I posted test results on https://bugs.llvm.org/show_bug.cgi?id=34733#c3)
For example, addr32 movabs [0x123456], al
.intel_syntax noprefix
addr32 mov [0x123456], cl # non-AL to make movabs impossible
mov [0x123456], al # GAS picks normal absolute [disp32]
addr32 mov [0x123456], al # GAS picks A2 movabs since addr32 makes that the shortest choice, same as NASM does.
movabs [0x123456], al # 64-bit absolute address
Note that GAS (fortunately) doesn't choose to use an addr32 prefix on its own, even with as -Os (gcc -Wa,-Os).
$ gcc -c foo.s
$ objdump -drwC -Mintel foo.o
...
0: 67 88 0c 25 56 34 12 00 mov BYTE PTR ds:0x123456,cl
8: 88 04 25 56 34 12 00 mov BYTE PTR ds:0x123456,al # same encoding after the 67
f: 67 a2 56 34 12 00 addr32 mov ds:0x123456,al
15: a2 56 34 12 00 00 00 00 00 movabs ds:0x123456,al # different length for same opcode
As you can see from the last 2 instructions, using the a2 mov moffs, al opcode, with a 67 the rest of the instruction is a different length for the same opcode.
This does LCP-stall on Skylake, so it's only fast when running from the uop cache.
Of course the more common source of LCP stalls is with the 66 prefix and an imm16 (instead of imm32). Like add ax, 1234, as in this random test where I wanted to see if jumping over the LCP-stalling instruction could avoid the problem: Label in %rep section in NASM. But not cases like add ax, 12 which will use add r/m16, imm8 (which is the same length after the 66 prefix as add r/m32, imm8).
Also, Sandybridge-family reportedly avoid LCP stalls for mov-immediate with 16-bit immediate.
Related:
Another example of working around add r/m16, imm16: add 1 byte immediate value to a 2 bytes memory location
x86 assembly 16 bit vs 8 bit immediate operand encoding - choose add r/m16, imm8 instead of the also-3-byte add ax, imm16 form.
Sign or Zero Extension of address in 64bit mode for MOV moffs32? - how address-size interacts with the moffs forms of movabs. (The kind that can LCP-stall)
What methods can be used to efficiently extend instruction length on modern x86? - the general case of what you're doing.
Tuning advice and uarch details:
Usually don't try to save space with addr32 mov [0x123456], al, except maybe when it's a choice between saving 1 byte or using 15 bytes of padding including actual NOPs inside a loop. (more tuning advice below)
One LCP stall usually won't be a disaster with a uop cache, especially if length-decoding probably isn't a front-end bottleneck here (although it often can be if the front-end is a bottleneck at all). Hard to test a single instance in one function by micro-benchmarking, though; only a real full-app benchmark will accurately reflect when the code can run from the uop cache (what Intel perf counters call the DSB), bypassing legacy decode (MITE).
There are queues between stages in modern CPUs that can at least partly absorb stalls https://www.realworldtech.com/haswell-cpu/2/ (moreso than in PPro/PIII), and SnB-family has shorter LCP-stalls than Core2/Nehalem. (But other reasons for pre-decode slowness already dips into their capacity, and after an I-cache miss they may all be empty.)
When prefixes aren't length-changing, the pre-decode pipeline stage that finds instruction boundaries (before steering chunks of bytes to actual complex/simple decoders or doing actual decoding) will find the correct instruction-length / end by skipping all prefixes and then looking at just the opcode (and modrm if applicable).
This pre-decode length-finding is where LCP stalls happen, so fun fact: even Core 2's pre-decode loop buffer can hide LCP stalls in subsequent iterations because it locks down up to 64 bytes / 18 insns of x86 machine code after finding instruction boundaries, using the decode queue (pre-decode output) as a buffer.
In later CPUs, the LSD and uop cache are post decode, so unless something defeats the uop cache (like the pesky JCC-erratum mitigation or simply having too many uops for the uop cache in a 32-byte aligned block of x86 machine code), loops only pay the LCP-stall cost on the first iteration, if they weren't already hot.
I'd say generally work around LCP stalls if you can do so cheaply, especially for code that usually runs "cold". Or if you can just use 32-bit operand-size and avoid partial-register shenanigans, costing usually only a byte of code-size and no extra instructions or uops. Or if you'd have multiple LCP stalls in a row, e.g. from naively using 16-bit immediates, that would be too many bubbles for buffers to hide so you'd have a real problem and it's worth spending extra instructions. (e.g. mov eax, imm32 / add [mem], ax, or movzx load / add r32,imm32 / store, or whatever.)
Padding to end 16-byte fetch blocks at instruction boundaries: not needed
(This is separate from aligning the start of a fetch block at a branch target, which is also sometimes unnecessary given the uop cache.)
Wikichip's section on Skylake pre-decode is incorrectly implies that a partial instruction left at the end of a block has to pre-decode on its own, rather than along with the next 16-byte group that contains the end of the instruction. It seems to be paraphrased from Agner Fog's text, with some changes and additions that make it wrong:
[from wikichip...] As with previous microarchitectures, the pre-decoder has a throughput of 6 macro-ops per cycle or until all 16 bytes are consumed, whichever happens first. Note that the predecoder will not load a new 16-byte block until the previous block has been fully exhausted. For example, suppose a new chunk was loaded, resulting in 7 instructions. In the first cycle, 6 instructions will be processed and a whole second cycle will be wasted for that last instruction. This will produce the much lower throughput of 3.5 instructions per cycle which is considerably less than optimal.
[this part is paraphrased from Agner Fog's Core2/Nehalem section, with the word "fully" having been added"]
Likewise, if the 16-byte block resulted in just 4 instructions with 1 byte of the 5th instruction received, the first 4 instructions will be processed in the first cycle and a second cycle will be required for the last instruction. This will produce an average throughput of 2.5 instructions per cycle.
[nothing like this appears in the current version of Agner's guide, IDK where this misinformation came from. Perhaps made up based on a misunderstanding of what Agner said, but without testing.]
Fortunately no. The rest of the instruction is in the next fetch block, so reality makes a lot more sense: the leftover bytes are prepended to the next 16-byte block.
(Starting a new 16-byte pre-decode block starting with this instruction would also have been plausible, but my testing rules that out: 2.82 IPC with a repeating 5,6,6 byte = 17-byte pattern. If it only ever looked at 16 bytes and left the partial 5 or 6-byte instruction to be the start of the next block, that would give us 2 IPC.)
A repeating pattern of 3x 5 byte instructions unrolled many times (a NASM %rep 2500 or GAS .rept 2500 block, so 7.5k instructions in ~36kiB) runs at 3.19 IPC, pre-decoding and decoding at ~16 bytes per cycle. (16 bytes/cycle) / (5 bytes/insn) = 3.2 instructions per cycle theoretical.
(If wikichip was right, it would predict close to 2 IPC in a 3-1 pattern, which is of course unreasonably low and would be not be an acceptable design for Intel for long runs of long or medium-length when running from legacy decode. 2 IPC is so much narrower than the 4-wide pipeline that it would not be ok even for legacy decode. Intel learned from P4 that running at least decently well from legacy decode is important, even when your CPU caches decoded uops. That's why SnB's uop cache can be so small, only ~1.5k uops. A lot smaller than P4's trace cache, but P4's problem was trying to replace L1i with a trace cache, and having weak decoders. (Also the fact it was a trace cache, so it cached the same code multiple times.))
These perf differences are large enough that you can verify it on your Mac, using a plenty-large repeat count so you don't need perf counters to verify uop-cache misses. (Remember that L1i is inclusive of uop cache, so loops that don't fit in L1i will also evict themselves from uop cache.) Anyway, just measuring total time and knowing the approximate max-turbo that you'll hit is sufficient for a sanity check like this.
Getting better than the theoretical-max that wikichip predicts, even after startup overhead and conservative frequency estimates, will completely rule out that behaviour even on a machine where you don't have perf counters.
$ nasm -felf64 && ld # 3x 5 bytes, repeated 2.5k times
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_retired.retire_slots,uops_executed.thread,idq.dsb_uops -r2 ./testloop
Performance counter stats for './testloop' (2 runs):
604.16 msec task-clock # 1.000 CPUs utilized ( +- 0.02% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
1 page-faults # 0.002 K/sec
2,354,699,144 cycles # 3.897 GHz ( +- 0.02% )
7,502,000,195 instructions # 3.19 insn per cycle ( +- 0.00% )
7,506,746,328 uops_issued.any # 12425.167 M/sec ( +- 0.00% )
7,506,686,463 uops_retired.retire_slots # 12425.068 M/sec ( +- 0.00% )
7,506,726,076 uops_executed.thread # 12425.134 M/sec ( +- 0.00% )
0 idq.dsb_uops # 0.000 K/sec
0.6044392 +- 0.0000998 seconds time elapsed ( +- 0.02% )
(and from another run):
7,501,076,096 idq.mite_uops # 12402.209 M/sec ( +- 0.00% )
No clue why idq.mite_uops:u is not equal to issued or retired. There's nothing to un-laminate, and no stack-sync uops should be necessary, so IDK where the extra issued+retired uops could be coming from. The excess is consistent across runs, and proportional to the %rep count I think.
With other pattern like 5-5-6 (16 bytes) and 5-6-6 (17 bytes), I get similar results.
I do sometimes measure a slight difference when the 16-byte groups are misaligned relative to an absolute 16-byte boundary or not (put a nop at the top of the loop). But that seems to only happen with larger repeat counts. %rep 2500 for 39kiB total size, I still get 2.99 IPC (just under one 16-byte group per cycle), with 0 DSB uops, regardless of aligned vs. misaligned.
I still get 2.99IPC at %rep 5000, but I see a diff at %rep 10000: 2.95 IPC misaligned vs. 2.99 IPC aligned. That largest %rep count is ~156kiB and still fits in the 256k L2 cache so IDK why anything would be different from half that size. (They're much larger than 32k Li1). I think earlier I was seeing a different at 5k, but I can't repro it now. Maybe that was with 17-byte groups.
The actual loop runs 1000000 times in a static executable under _start, with a raw syscall to _exit, so perf counters (and time) for the whole process is basically just the loop. (especially with perf --all-user to only count user-space.)
; complete Linux program
default rel
%use smartalign
alignmode p6, 64
global _start
_start:
mov ebp, 1000000
align 64
.loop:
%ifdef MISALIGN
nop
%endif
%rep 2500
mov eax, 12345 ; 5 bytes.
mov ecx, 123456 ; 5 bytes. Use r8d for 6 bytes
mov edx, 1234567 ; 5 bytes. Use r9d for 6 bytes
%endrep
dec ebp
jnz .loop
.end:
xor edi,edi
mov eax,231 ; __NR_exit_group from /usr/include/asm/unistd_64.h
syscall ; sys_exit_group(0)

Are some general purpose registers faster than others?

In x86-64, will certain instructions execute faster if some general purpose registers are preferred over others?
For instance, would mov eax, ecx execute faster than mov r8d, ecx? I can imagine that the latter would need a REX prefix which would make the instruction fetch slower?
What about using rax instead of rcx? What about add or xor? Other operations? Smaller registers like r15b vs al? al vs ah?
AMD vs Intel? Newer processors? Older processors? Combinations of instructions?
Clarification: Should certain general purpose registers be preferred over others, and which ones are they?
In general, architectural registers are all equal, and renamed onto a large array of physical registers.
(Except partial registers can be slower, especially high-byte AH/BH/CH/DH which are slow to read after writing the full register, on Haswell and later. See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent and also Why doesn't GCC use partial registers? for problems when writing 8-bit and 16-bit registers). The rest of this answer is just going to consider 32/64-bit operand-size.)
But some instruction require specific registers, like legacy variable-count shifts (without BMI2 shrx etc) require the count in CL. Division requires the dividend in EDX:EAX (or RDX:RAX for the slower 64-bit version).
Using a call-preserved register like RBX means your function has to spend extra instructions saving/restoring it.
But of course there are perf differences if you need more instructions. So lets assume all else is equal, and just talk about the uops, latency, and code-size of a single instruction just by changing which register is used for one of its operands. TL:DR: the only perf difference is due to instruction-encoding restrictions / differences. Sometimes a different register will allow / require (or get the assembler to pick) a different encoding, which will often be smaller / larger as a special case, and sometimes even executes differently.
Generally smaller code is faster, and packs better in the uop cache and I-cache, so unless you've analyzed a specific case and found a problem, favour the smaller encoding. Often that means keeping a byte value in AL so you can use those special-case instructions, and avoiding RBP / R13 for pointers.
Special cases where a specific encoding is extra slow, not just size
LEA with RBP or R13 as a base can be slower on Intel if the addressing mode didn't already have a +displacement constant.
e.g. lea eax, [rbp + 12] is encodeable as-written, and is just as fast as lea eax, [rcx + 12].
But lea eax, [rbp + rcx*4] can only be encoded in machine code as lea eax, [rbp + rcx*4 + 0] (because of addressing mode escape-code stuff), which is a 3-component LEA, and thus slower on Intel (3 cycle latency on Sandybridge-family instead of 1 cycle, see https://agner.org/optimize/ instruction tables and microarch PDF). On AMD, having a scaled-index would already make it a slow-LEA even with lea eax, [rdx + rcx*4]
Outside of LEA, using RBP / R13 as the base in any addressing mode always requires a disp8/32 byte or dword, but I don't think the actual AGUs are slower for a 3-component addressing mode. So it's just a code-size effect.
Other cases include Which Intel microarchitecture introduced the ADC reg,0 single-uop special case? where the short-form 2-byte encoding for adc al, imm8 is 2 uops even on modern uarches like Skylake, where adc bl, imm8 is 1 uop.
So not only does the adc reg,0 special case not work for adc al,0 on Sandybridge through Haswell, Broadwell and newer forgot (or chose not to) optimize how that encoding decodes to uops. (Of course you could manually encode adc al,0 using the 3-byte Mod/RM encoding, but assemblers will always pick the shortest encoding so adc al,0 will assemble to the short form by default.) Only a problem with byte registers; adc eax,0 will use the opcode ModRM imm8 3-byte encoding, not 5-byte opcode imm32.
For other cases of op al,imm8, the only difference is code-size, which only indirectly matters for performance. (Because of decoding, uop-cache packing, and I-cache misses).
See Tips for golfing in x86/x64 machine code for more about special cases of code-size, like xchg eax, ecx being 1-byte vs. xchg edx, ecx being 2 bytes.
add rsp, 8 can need an extra stack-sync uop if there hasn't been an explicit use of RSP or ESP since the last push/pop/call/ret (along the path of execution of course, not in the static code layout). (What is the stack engine in the Sandybridge microarchitecture?). This is why compilers like clang use a dummy push or pop to reserve / free a single stack slot: Why does this function push RAX to the stack as the first operation?
LEA will be slower with EBP, RBP, or R13 as the base (PDF warning, page 3-22). But generally the answer is No.
Taking a step back, it's important to realize that since the advent of register renaming that architectural registers don't deal with actual, physical registers on most micro-architectures. For example, each Cascade Lake core has a register file of 180 integer and 168 FP registers.
You have stuffed too many questions altogether, however, if I understood the question well, you are confusing the processor architecture with the small but fast Register file, which fills in the speed gap between the processor and memory technologies. The register file is small enough that it can only support one instruction at a time, i.e. the current instruction, and fast enough that it can almost catch up with the processor speed.
I would like to build a short background, the naming conventions of these registers serves two purposes: one, it makes the older versions of the x86 ISA implementations compatible up till now, and two, every name of these registers has a special purpose to it besides its general purpose use. For example, the ECX register is used as a counter to implement loops i.e. instructions like JECXZ and LOOP uses ECX register exclusively. Though you need to watch out for some flags that you would not want to lose.
And now the answer to your question stems from the second purpose. So some registers would seem to be faster because these special registers are hardcoded into the processor and can be accessed much quicker, however, the difference should not be much.
And the second thing that you might know, not all instructions are of the same complexity, especially in x86, the opcode of instructions can be from 1-3 bytes and as more and more functionality is added to the instruction in terms of, prefixes, addressing modes, etc. these instructions start to become slower, So it is not the case that some registers are slower than other, it is just that some registers are encoded into the instruction and therefore those instructions run faster with that combination of register. And if otherwise used, it would seem slower. I hope that helps. Thanks

How does RIP-relative addressing perform compared to mov reg, imm64?

It is known fact that x86-64 instructions do not support 64-bit immediate values (except for mov). Hence, when migrating code from 32 to 64 bits, an instruction like this:
cmp rax, addr32
cannot be replaced with the following:
cmp rax, addr64
Under these circumstances, I'm considering two alternatives: (a) using a scratch register for loading the constant or (b) using rip-relative addressing. The two approaches look like this:
mov r11, addr64 ; scratch register
cmp rax, r11
ptr64: dq addr64
...
cmp rax, [rel ptr64] ; encoded as cmp rax, [rip+offset]
I wrote a very simple loop to compare the performance of both approaches (which I paste below). While (b) uses an indirect pointer, (a) has the the immediate encoded in the instruction (which could lead to a worse usage of i-cache). Surprisingly, I found that (b) run ~10% faster than (a). Is this result something to be expected in more common real-world code?
true: dq 0xFFFF0000FFFF0000
false: dq 0xAAAABBBBAAAABBBB
main:
or rax, 1 ; rax is odd and constant "true" is even
mov rcx, 0x1
shl rcx, 30
branch:
mov r11, 0xFFFF0000FFFF0000 ; not present in (b)
cmp rax, r11 ; vs cmp rax, [rel true]
je next
add rax, 2
loop branch
next:
mov rax, 0
ret
Surprisingly, I found that (b) run ~10% faster than (a)
You probably tested on a CPU other than AMD Bulldozer-family or Ryzen, which have a fast loop instruction. On other CPUs, loop is very slow, mostly on purpose for historical reasons, so you bottleneck on it. e.g. 7 uops, one per 5c throughput on Haswell.
mov r64, imm64 is bad for uop cache throughput because of the large immediate taking 2 slots in Intel's uop cache. (See the Sandybridge uop cache section in Agner Fog's microarch pdf), and Which is faster, imm64 or m64 for x86-64? where I listed the details.
Even apart from that, it's not too surprising that 1 extra uop in the loop makes it run slower. You're probably not on an AMD CPU (with single-uop / 1 per 2 clock loop), because the extra mov in such a tiny loop would make more than 10% difference. Or no difference at all, since it's just 3 vs. 4 uops per 2 clocks, if that's correct that even tiny loop loops are limited to one jump per 2 clocks.
On Intel, loop is 7 uops, one per 5 clocks throughput on most CPUs, so the 4-per-clock issue/rename bottleneck won't be what you're hitting. loop is micro-coded, so the front-end can't run from the loop buffer. (And Skylake CPUs have their LSD disabled by a microcode update to fix the partial-register erratum anyway.) So the mov r64,imm64 uop has to be re-read from the uop cache every time through the loop.
A load that hits in cache has very good throughput (2 loads per clock, and in this case micro-fusion means no extra uops to use a memory operand instead of register for cmp). So the main penalty in using a constant from memory is the extra cache footprint and cache misses, but your microbenchmark won't reveal that at all. It also has no other pressure on the load ports.
In the general case:
If possible, use a RIP-relative lea to generate 64-bit address constants.
e.g. lea rax, [rel addr64]. Yes, this takes an extra instruction to get the constant into a register. (BTW, just use default rel. You can use [abs fs:0] if you need it.
You can avoid the extra instruction if you build position-dependent code with the default (small) code model, so static addresses fit in the low 32 bits of virtual address space and can be used as immediates. (Actually low 2GiB, so sign or zero extending both work). See 32-bit absolute addresses no longer allowed in x86-64 Linux? if gcc complains about absolute addressing; -pie is enabled by default on most distros. This of course doesn't work in Linux shared libraries, which only support text relocations for 64-bit addresses. But you should avoid relocations whenever possible by using lea to make position-indepdendent code.
Most integer build-time constants fit in 32 bits, so you can use cmp r64, imm32 or cmp r32, imm32 even in PIC code.
If you do need a 64-bit non-address constant, try to hoist the mov r64, imm64 out of a loop. Your cmp loop would have been fine if the mov wasn't inside the loop. x86-64 has enough registers that you (or the compiler) can usually avoid reloads inside inner-most loops in integer code.

Are scaled-index addressing modes a good idea?

Consider the following code:
void foo(int* __restrict__ a)
{
int i; int val = 0;
for (i = 0; i < 100; i++) {
val = 2 * i;
a[i] = val;
}
}
This complies (with maximum optimization but no unrolling or vectorization) into...
GCC 7.2:
foo(int*):
xor eax, eax
.L2:
mov DWORD PTR [rdi], eax
add eax, 2
add rdi, 4
cmp eax, 200
jne .L2
rep ret
clang 5.0:
foo(int*): # #foo(int*)
xor eax, eax
.LBB0_1: # =>This Inner Loop Header: Depth=1
mov dword ptr [rdi + 2*rax], eax
add rax, 2
cmp rax, 200
jne .LBB0_1
ret
What are the pros and cons of GCC's vs clang's approach? i.e. an extra variable incremented separately, vs multiplying via a more complex addressing mode?
Notes:
This question also relates to this one with about the same code, but with float's rather than int's.
Yes, take advantage of the power of x86 addressing modes to save uops, in cases where an index doesn't unlaminate into more extra uops than it would cost to do pointer increments.
(In many cases unrolling and using pointer increments is a win because of unlamination on Intel Sandybridge-family, but if you're not unrolling or if you're only using mov loads instead of folding memory operands into ALU ops for micro-fusion, then indexed addressing modes are often break even on some CPUs and a win on others.)
It's essential to read and understand Micro fusion and addressing modes if you want to make optimal choices here. (And note that IACA gets it wrong, and doesn't simulate Haswell and later keeping some uops micro-fused, so you can't even just check your work by having it do static analysis for you.)
Indexed addressing modes are generally cheap. At worst they cost one extra uop for the front-end (on Intel SnB-family CPUs in some situations), and/or prevent a store-address uop from using port7 (which only supports base + displacement addressing modes). See Agner Fog's microarch pdf, and also David Kanter's Haswell write-up, for more about the store-AGU on port7 which Intel added in Haswell.
On Haswell+, if you need your loop to sustain more than 2 memory ops per clock, then avoid indexed stores.
At best they're free other than the code-size cost of the extra byte in the machine-code encoding. (Having an index register requires a SIB (Scale Index Base) byte in the encoding).
More often the only penalty is the 1 extra cycle of load-use latency vs. a simple [base + 0-2047] addressing mode, on Intel Sandybridge-family CPUs.
It's usually only worth using an extra instruction to avoid an indexed addressing mode if you're going to use that addressing mode in multiple instructions. (e.g. load / modify / store).
Scaling the index is free (on modern CPUs at least) if you're already using a 2-register addressing mode. For lea, Agner Fog's table lists AMD Ryzen as having 2c latency and only 2 per clock throughput for lea with scaled-index addressing modes (or 3-component), otherwise 1c latency and 0.25c throughput. e.g. lea rax, [rcx + rdx] is faster than lea rax, [rcx + 2*rdx], but not by enough to be worth using extra instructions instead.) Ryzen also doesn't like a 32-bit destination in 64-bit mode, for some reason. But the worst-case LEA is still not bad at all. And anyway, mostly unrelated to address-mode choice for loads, because most CPUs (other than in-order Atom) run LEA on the ALUs, not the AGUs used for actual loads/stores.
The main question is between one-register unscaled (so it can be a "base" register in the machine-code encoding: [base + idx*scale + disp]) or two-register. Note that for Intel's micro-fusion limitations, [disp32 + idx*scale] (e.g. indexing a static array) is an indexed addressing mode.
Neither function is totally optimal (even without considering unrolling or vectorization), but clang's looks very close.
The only thing clang could do better is save 2 bytes of code size by avoiding the REX prefixes with add eax, 2 and cmp eax, 200. It promoted all the operands to 64-bit because it's using them with pointers and I guess proved that the C loop doesn't need them to wrap, so in asm it uses 64-bit everywhere. This is pointless; 32-bit operations are always at least as fast as 64, and implicit zero-extension is free. But this only costs 2 bytes of code-size, and costs no performance other than indirect front-end effects from that.
You've constructed your loop so the compiler needs to keep a specific value in registers and can't totally transform the problem into just a pointer-increment + compare against an end pointer (which compilers often do when they don't need the loop variable for anything except array indexing).
You also can't transform to counting a negative index up towards zero (which compilers never do, but reduces the loop overhead to a total of 1 macro-fused add + branch uop on Intel CPUs (which can fuse add + jcc, while AMD can only fuse test or cmp / jcc).
Clang has done a good job noticing that it can use 2*var as the array index (in bytes). This is a good optimization for tune=generic. The indexed store will un-laminate on Intel Sandybridge and Ivybridge, but stay micro-fused on Haswell and later. (And on other CPUs, like Nehalem, Silvermont, Ryzen, Jaguar, or whatever, there's no disadvantage.)
gcc's loop has 1 extra uop in the loop. It can still in theory run at 1 store per clock on Core2 / Nehalem, but it's right up against the 4 uops per clock limit. (And actually, Core2 can't macro-fuse the cmp/jcc in 64-bit mode, so it bottlenecks on the front-end).
Indexed addressing (in loads and stores, lea is different still) has some trade-offs, for example
On many µarchs, instructions that use indexed addressing have a slightly longer latency than instruction that don't. But usually throughput is a more important consideration.
On Netburst, stores with a SIB byte generate an extra µop, and therefore may cost throughput as well. The SIB byte causes an extra µop regardless of whether you use it for indexes addressing or not, but indexed addressing always costs the extra µop. It doesn't apply to loads.
On Haswell/Broadwell (and still in Skylake/Kabylake), stores with indexed addressing cannot use port 7 for address generation, instead one of the more general address generation ports will be used, reducing the throughput available for loads.
So for loads it's usually good (or not bad) to use indexed addressing if it saves an add somewhere, unless they are part of a chain of dependent loads. For stores it's more dangerous to use indexed addressing. In the example code it shouldn't make a large difference. Saving the add is not really relevant, ALU instructions wouldn't be the bottleneck. The address generation happening in ports 2 or 3 doesn't matter either since there are no loads.

What is the best way to set a register to zero in x86 assembly: xor, mov or and?

All the following instructions do the same thing: set %eax to zero. Which way is optimal (requiring fewest machine cycles)?
xorl %eax, %eax
mov $0, %eax
andl $0, %eax
TL;DR summary: xor same, same is the best choice for all CPUs. No other method has any advantage over it, and it has at least some advantage over any other method. It's officially recommended by Intel and AMD, and what compilers do. In 64-bit mode, still use xor r32, r32, because writing a 32-bit reg zeros the upper 32. xor r64, r64 is a waste of a byte, because it needs a REX prefix.
Even worse than that, Silvermont only recognizes xor r32,r32 as dep-breaking, not 64-bit operand-size. Thus even when a REX prefix is still required because you're zeroing r8..r15, use xor r10d,r10d, not xor r10,r10.
GP-integer examples:
xor eax, eax ; RAX = 0. Including AL=0 etc.
xor r10d, r10d ; R10 = 0. Still prefer 32-bit operand-size.
xor edx, edx ; RDX = 0
; small code-size alternative: cdq ; zero RDX if EAX is already zero
; SUB-OPTIMAL
xor rax,rax ; waste of a REX prefix, and extra slow on Silvermont
xor r10,r10 ; bad on Silvermont (not dep breaking), same as r10d on other CPUs because a REX prefix is still needed for r10d or r10.
mov eax, 0 ; doesn't touch FLAGS, but not faster and takes more bytes
and eax, 0 ; false dependency. (Microbenchmark experiments might want this)
sub eax, eax ; same as xor on most but not all CPUs; bad on Silvermont for example.
xor cl, cl ; false dep on some CPUs, not a zeroing idiom. Use xor ecx,ecx
mov cl, 0 ; only 2 bytes, and probably better than xor cl,cl *if* you need to leave the rest of ECX/RCX unmodified
Zeroing a vector register is usually best done with pxor xmm, xmm. That's typically what gcc does (even before use with FP instructions).
xorps xmm, xmm can make sense. It's one byte shorter than pxor, but xorps needs execution port 5 on Intel Nehalem, while pxor can run on any port (0/1/5). (Nehalem's 2c bypass delay latency between integer and FP is usually not relevant, because out-of-order execution can typically hide it at the start of a new dependency chain).
On SnB-family microarchitectures, neither flavour of xor-zeroing even needs an execution port. On AMD, and pre-Nehalem P6/Core2 Intel, xorps and pxor are handled the same way (as vector-integer instructions).
Using the AVX version of a 128b vector instruction zeros the upper part of the reg as well, so vpxor xmm, xmm, xmm is a good choice for zeroing YMM(AVX1/AVX2) or ZMM(AVX512), or any future vector extension. vpxor ymm, ymm, ymm doesn't take any extra bytes to encode, though, and runs the same on Intel, but slower on AMD before Zen2 (2 uops). The AVX512 ZMM zeroing would require extra bytes (for the EVEX prefix), so XMM or YMM zeroing should be preferred.
XMM/YMM/ZMM examples
# Good:
xorps xmm0, xmm0 ; smallest code size (for non-AVX)
pxor xmm0, xmm0 ; costs an extra byte, runs on any port on Nehalem.
xorps xmm15, xmm15 ; Needs a REX prefix but that's unavoidable if you need to use high registers without AVX. Code-size is the only penalty.
# Good with AVX:
vpxor xmm0, xmm0, xmm0 ; zeros X/Y/ZMM0
vpxor xmm15, xmm0, xmm0 ; zeros X/Y/ZMM15, still only 2-byte VEX prefix
#sub-optimal AVX
vpxor xmm15, xmm15, xmm15 ; 3-byte VEX prefix because of high source reg
vpxor ymm0, ymm0, ymm0 ; decodes to 2 uops on AMD before Zen2
# Good with AVX512
vpxor xmm15, xmm0, xmm0 ; zero ZMM15 using an AVX1-encoded instruction (2-byte VEX prefix).
vpxord xmm30, xmm30, xmm30 ; EVEX is unavoidable when zeroing zmm16..31, but still prefer XMM or YMM for fewer uops on probable future AMD. May be worth using only high regs to avoid needing vzeroupper in short functions.
# Good with AVX512 *without* AVX512VL (e.g. KNL / Xeon Phi)
vpxord zmm30, zmm30, zmm30 ; Without AVX512VL you have to use a 512-bit instruction.
# sub-optimal with AVX512 (even without AVX512VL)
vpxord zmm0, zmm0, zmm0 ; EVEX prefix (4 bytes), and a 512-bit uop. Use AVX1 vpxor xmm0, xmm0, xmm0 even on KNL to save code size.
See Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm? and
What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?
Semi-related: Fastest way to set __m256 value to all ONE bits and
Set all bits in CPU register to 1 efficiently also covers AVX512 k0..7 mask registers. SSE/AVX vpcmpeqd is dep-breaking on many (although still needs a uop to write the 1s), but AVX512 vpternlogd for ZMM regs isn't even dep-breaking. Inside a loop consider copying from another register instead of re-creating ones with an ALU uop, especially with AVX512.
But zeroing is cheap: xor-zeroing an xmm reg inside a loop is usually as good as copying, except on some AMD CPUs (Bulldozer and Zen) which have mov-elimination for vector regs but still need an ALU uop to write zeros for xor-zeroing.
What's special about zeroing idioms like xor on various uarches
Some CPUs recognize sub same,same as a zeroing idiom like xor, but all CPUs that recognize any zeroing idioms recognize xor. Just use xor so you don't have to worry about which CPU recognizes which zeroing idiom.
xor (being a recognized zeroing idiom, unlike mov reg, 0) has some obvious and some subtle advantages (summary list, then I'll expand on those):
smaller code-size than mov reg,0. (All CPUs)
avoids partial-register penalties for later code. (Intel P6-family and SnB-family).
doesn't use an execution unit, saving power and freeing up execution resources. (Intel SnB-family)
smaller uop (no immediate data) leaves room in the uop cache-line for nearby instructions to borrow if needed. (Intel SnB-family).
doesn't use up entries in the physical register file. (Intel SnB-family (and P4) at least, possibly AMD as well since they use a similar PRF design instead of keeping register state in the ROB like Intel P6-family microarchitectures.)
Smaller machine-code size (2 bytes instead of 5) is always an advantage: Higher code density leads to fewer instruction-cache misses, and better instruction fetch and potentially decode bandwidth.
The benefit of not using an execution unit for xor on Intel SnB-family microarchitectures is minor, but saves power. It's more likely to matter on SnB or IvB, which only have 3 ALU execution ports. Haswell and later have 4 execution ports that can handle integer ALU instructions, including mov r32, imm32, so with perfect decision-making by the scheduler (which doesn't always happen in practice), HSW could still sustain 4 uops per clock even when they all need ALU execution ports.
See my answer on another question about zeroing registers for some more details.
Bruce Dawson's blog post that Michael Petch linked (in a comment on the question) points out that xor is handled at the register-rename stage without needing an execution unit (zero uops in the unfused domain), but missed the fact that it's still one uop in the fused domain. Modern Intel CPUs can issue & retire 4 fused-domain uops per clock. That's where the 4 zeros per clock limit comes from. Increased complexity of the register renaming hardware is only one of the reasons for limiting the width of the design to 4. (Bruce has written some very excellent blog posts, like his series on FP math and x87 / SSE / rounding issues, which I do highly recommend).
On AMD Bulldozer-family CPUs, mov immediate runs on the same EX0/EX1 integer execution ports as xor. mov reg,reg can also run on AGU0/1, but that's only for register copying, not for setting from immediates. So AFAIK, on AMD the only advantage to xor over mov is the shorter encoding. It might also save physical register resources, but I haven't seen any tests.
Recognized zeroing idioms avoid partial-register penalties on Intel CPUs which rename partial registers separately from full registers (P6 & SnB families).
xor will tag the register as having the upper parts zeroed, so xor eax, eax / inc al / inc eax avoids the usual partial-register penalty that pre-IvB CPUs have. Even without xor, IvB only needs a merging uop when the high 8bits (AH) are modified and then the whole register is read, and Haswell even removes that.
From Agner Fog's microarch guide, pg 98 (Pentium M section, referenced by later sections including SnB):
The processor recognizes the XOR of a register with itself as setting
it to zero. A special tag in the register remembers that the high part
of the register is zero so that EAX = AL. This tag is remembered even
in a loop:
; Example 7.9. Partial register problem avoided in loop
xor eax, eax
mov ecx, 100
LL:
mov al, [esi]
mov [edi], eax ; No extra uop
inc esi
add edi, 4
dec ecx
jnz LL
(from pg82): The processor remembers that the upper 24 bits of EAX are zero as long as
you don't get an interrupt, misprediction, or other serializing event.
pg82 of that guide also confirms that mov reg, 0 is not recognized as a zeroing idiom, at least on early P6 designs like PIII or PM. I'd be very surprised if they spent transistors on detecting it on later CPUs.
xor sets flags, which means you have to be careful when testing conditions. Since setcc is unfortunately only available with an 8bit destination, you usually need to take care to avoid partial-register penalties.
It would have been nice if x86-64 repurposed one of the removed opcodes (like AAM) for a 16/32/64 bit setcc r/m, with the predicate encoded in the source-register 3-bit field of the r/m field (the way some other single-operand instructions use them as opcode bits). But they didn't do that, and that wouldn't help for x86-32 anyway.
Ideally, you should use xor / set flags / setcc / read full register:
...
call some_func
xor ecx,ecx ; zero *before* the test
test eax,eax
setnz cl ; cl = (some_func() != 0)
add ebx, ecx ; no partial-register penalty here
This has optimal performance on all CPUs (no stalls, merging uops, or false dependencies).
Things are more complicated when you don't want to xor before a flag-setting instruction. e.g. you want to branch on one condition and then setcc on another condition from the same flags. e.g. cmp/jle, sete, and you either don't have a spare register, or you want to keep the xor out of the not-taken code path altogether.
There are no recognized zeroing idioms that don't affect flags, so the best choice depends on the target microarchitecture. On Core2, inserting a merging uop might cause a 2 or 3 cycle stall. It appears to be cheaper on SnB, but I didn't spend much time trying to measure. Using mov reg, 0 / setcc would have a significant penalty on older Intel CPUs, and still be somewhat worse on newer Intel.
Using setcc / movzx r32, r8 is probably the best alternative for Intel P6 & SnB families, if you can't xor-zero ahead of the flag-setting instruction. That should be better than repeating the test after an xor-zeroing. (Don't even consider sahf / lahf or pushf / popf). IvB can eliminate movzx r32, r8 (i.e. handle it with register-renaming with no execution unit or latency, like xor-zeroing). Haswell and later only eliminate regular mov instructions, so movzx takes an execution unit and has non-zero latency, making test/setcc/movzx worse than xor/test/setcc, but still at least as good as test/mov r,0/setcc (and much better on older CPUs).
Using setcc / movzx with no zeroing first is bad on AMD/P4/Silvermont, because they don't track deps separately for sub-registers. There would be a false dep on the old value of the register. Using mov reg, 0/setcc for zeroing / dependency-breaking is probably the best alternative when xor/test/setcc isn't an option.
Of course, if you don't need setcc's output to be wider than 8 bits, you don't need to zero anything. However, beware of false dependencies on CPUs other than P6 / SnB if you pick a register that was recently part of a long dependency chain. (And beware of causing a partial reg stall or extra uop if you call a function that might save/restore the register you're using part of.)
and with an immediate zero isn't special-cased as independent of the old value on any CPUs I'm aware of, so it doesn't break dependency chains. It has no advantages over xor and many disadvantages.
It's useful only for writing microbenchmarks when you want a dependency as part of a latency test, but want to create a known value by zeroing and adding.
See http://agner.org/optimize/ for microarch details, including which zeroing idioms are recognized as dependency breaking (e.g. sub same,same is on some but not all CPUs, while xor same,same is recognized on all.) mov does break the dependency chain on the old value of the register (regardless of the source value, zero or not, because that's how mov works). xor only breaks dependency chains in the special-case where src and dest are the same register, which is why mov is left out of the list of specially recognized dependency-breakers. (Also, because it's not recognized as a zeroing idiom, with the other benefits that carries.)
Interestingly, the oldest P6 design (PPro through Pentium III) didn't recognize xor-zeroing as a dependency-breaker, only as a zeroing idiom for the purposes of avoiding partial-register stalls, so in some cases it was worth using both mov and then xor-zeroing in that order to break the dep and then zero again + set the internal tag bit that the high bits are zero so EAX=AX=AL.
See Agner Fog's Example 6.17. in his microarch pdf. He says this also applies to P2, P3, and even (early?) PM. A comment on the linked blog post says it was only PPro that had this oversight, but I've tested on Katmai PIII, and #Fanael tested on a Pentium M, and we both found that it didn't break a dependency for a latency-bound imul chain. This confirms Agner Fog's results, unfortunately.
TL:DR:
If it really makes your code nicer or saves instructions, then sure, zero with mov to avoid touching the flags, as long as you don't introduce a performance problem other than code size. Avoiding clobbering flags is the only sensible reason for not using xor, but sometimes you can xor-zero ahead of the thing that sets flags if you have a spare register.
mov-zero ahead of setcc is better for latency than movzx reg32, reg8 after (except on Intel when you can pick different registers), but worse code size.

Resources