How can memory destination BTS be significantly slower than load / BTS reg,reg / store? - performance

In the general case, how can an instruction that can take memory or register operands ever be slower with memory operands than mov + mov -> instruction -> mov + mov?
Based on the throughput and latency found in Agner Fog's instruction tables (looking at Skylake in my case, p. 238),
I see the following numbers for the btr/bts instructions:
instruction  operands  uops fused domain  uops unfused domain  latency  throughput
mov          r,r       1                  1                    0-1      0.25
mov          m,r       1                  2                    2        1
mov          r,m       1                  1                    2        0.5
...
bts/btr      r,r       1                  1                    N/A      0.5
bts/btr      m,r       10                 10                   N/A      5
I don't see how these numbers could possibly be correct. Even in the worst case, where there are no registers to spare and you have to store one in a temporary memory location, it would be faster to:
## hypothetical worst-case microcode that saves/restores a scratch register
mov m,r // + 1 throughput , save a register
mov r,m // + .5 throughput , load BTS destination operand
bts r,r // + 1 throughput , do bts (or btr)
mov m,r // + 1 throughput , store result
mov r,m // + .5 throughput , restore register
As the worst case this has a better throughput than just bts m,r (4 < 5). (Editor's note: adding up throughputs doesn't work when they have different bottlenecks. You need to consider uops and ports; this sequence should be 2c throughput, bottlenecked on 1/clock store throughput.)
And microcode instructions have their own set of registers, so it seems aggressively unlikely this would actually be needed. Can anyone explain why bts (or in general any instruction) could have a worse (higher reciprocal) throughput with memory, register operands than using the worst-case moving policy?
(Editor's note: yes, there are a few hidden temp registers that microcode can use. Something like add [mem], reg does at least logically just load into one of those and then store the result.)

What you're missing is that BT, BTC, BTS and BTR don't work like you described when a memory operand is used. You're assuming the memory versions work the same as the register versions, but that's not quite the case. With the register version, the value of the second operand is taken modulo 64 (or 16 or 32). With the memory version, the value of the second operand is used as is. This means that the actual memory location accessed by the instruction may not be the address given by the memory operand, but one somewhere past it.
For example, ignoring the need to save registers and atomicity, to get the same operation of BTS [rsi + rdi], rax using the register version of BTS you'd need to do something like this:
LEA rbx, [rsi + rdi]       ; effective base address
MOV rcx, rax
SHR rcx, 6                 ; qword index = bit offset / 64 (assuming a non-negative bit offset)
MOV rdx, [rbx + rcx*8]     ; load the qword that actually contains the bit
BTS rdx, rax               ; the register form uses RAX modulo 64
MOV [rbx + rcx*8], rdx     ; store the result back
You can simplify this if you know the value of RAX is less than 64, or if it's a simpler memory operand. Indeed as you've noticed, it may be an advantage in cases like these to use the faster register version over the slower memory version even if it means a few more instructions.
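For instance, if you know the bit index in RAX is below 64, a minimal (non-atomic) replacement sketch, assuming the same [rsi + rdi] destination, is just a load, register BTS, and store:
MOV rdx, [rsi + rdi]    ; load the qword directly; a bit index < 64 stays inside it
BTS rdx, rax            ; register form takes RAX modulo 64, which is a no-op here
MOV [rsi + rdi], rdx    ; store back (not atomic, unlike lock bts [mem], reg)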

Related

How often do the contents of a CPU register change?

Does the data that CPU registers hold change often? The Wikipedia article describes the registers as "a quickly accessible location...of a small amount of fast storage". I'm assuming the memory is fast because a register is accessed and modified often?
Yes, data registers may change on consecutive instructions, which is quite often. Superscalar execution, out-of-order execution, pipelining, register renaming, etc. complicate the analysis, but even on a simple in-order CPU, a register can change as often as once per instruction. A plausible program may have a run of many instructions, all affecting the same register:
int polynom(int num) {
    return num * num + 2 * num + 1;
}
which compiles as:
polynom(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
* mov eax, DWORD PTR [rbp-4]
* imul eax, eax
* mov edx, DWORD PTR [rbp-4]
add edx, edx
* add eax, edx
* add eax, 1
pop rbp
ret
Note the many writes to the eax register, marked with an asterisk. In this little function, five almost-consecutive instructions write to this specific register, meaning that we can expect the program-visible state of eax[1] to change at a rate of over 1 GHz if this code were to be called in a tight loop.
On a more fundamental note, there are some architectural registers that almost always change on every instruction. The most evident of these is the program counter (called PC in many contexts, EIP on x86, RIP on x86_64). Because this register points to the currently executing instruction, it must certainly change with every instruction, barring counterexamples like x86 REP encodings or an instruction that simply jumps to itself.
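For example, an instruction that simply jumps to itself is one of the rare cases where RIP never advances (a trivial, hypothetical snippet):
spin:
jmp spin    ; RIP is reloaded with the address of this same instruction on every iteration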
[1] Again, barring architectural considerations like register renaming, which uses multiple physical registers to implement a single logical, program-visible register.
Since modern CPUs run at GHz clock speeds, CPU registers can change what they are storing hundreds of millions or even billions of times per second.
Since most modern CPUs have on the order of ~128 physical registers, each individual register would still typically change its value millions of times per second when the CPU is doing a lot of work.

Assembly - How to score a CPU instruction by latency and throughput

I'm looking for a type of formula / way to measure how fast an instruction is, or more specifically, to give each instruction a "score" in CPU cycles.
Let's take the following assembly program as an example,
nop
mov eax,dword ptr [rbp+34h]
inc eax
mov dword ptr [rbp+34h],eax
and the following Intel Skylake information:
mov r,m : Throughput=0.5 Latency=2
mov m,r : Throughput=1 Latency=2
nop : Throughput=0.25 Latency=none
inc : Throughput=0.25 Latency=1
I know that the order of the instructions in the program matters here, but
I'm looking to create something general that doesn't need to be "accurate to the single cycle".
Does anyone have an idea how I can do that?
There is no formula you can apply; you have to measure.
The same instruction on different versions of the same uarch family can have different performance. e.g. mulps:
Sandybridge 1c / 5c throughput/latency.
HSW 0.5 / 5. BDW 0.5 / 3 (faster multiply path in the FMA unit? FMA is still 5c).
SKL 0.5 / 4 (lower latency FMA, too). SKL runs addps on the FMA unit as well, dropping the dedicated FP add unit, so add latency is higher (4c instead of 3c) but add throughput is higher (two per clock).
There's no way you could predict any of this without measuring, or knowing some microarchitectural details. We expect FP math ops won't be single-cycle latency, because they're much more complicated than integer ops. (So if they were single cycle, the clock speed is set too low for integer ops.)
You measure by repeating the instruction many times in an unrolled loop. Or fully unrolled with no looping, but then you defeat the uop-cache and can get front-end bottlenecks. (e.g. for decoding 10-byte mov r64, imm64)
https://uops.info/ has already automated this testing for every form of every (unprivileged) instruction, and you can even click on any table entry to see what test loops they used. e.g. Skylake xchg r32, eax latency testing (https://uops.info/html-lat/SKL/XCHG_R32_EAX-Measurements.html) from each input operand to each output. (2 cycle latency from EAX -> R8D, but 1 cycle latency from R8D -> EAX.) So we can guess that the 3 uops include copying EAX to an internal temporary, but moving directly from the other operand to EAX.
https://uops.info/ is the current best source of test data; when it and Agner's tables disagree, my own measurements and/or other sources have always confirmed uops.info's testing was accurate. And they don't try to make up a latency number for 2 halves of a round-trip like movd xmm0,eax and back, they show you the range of possible latencies assuming the rest of the chain was the minimum plausible.
Agner Fog creates his instruction tables (which you appear to be reading) by timing large non-looping blocks of code that repeat an instruction. https://agner.org/optimize/. The intro section of his instruction-tables explains briefly how he measures, and his microarch guide explains more details of how different x86 microarchitectures work internally. Unfortunately there are occasional typos or copy/paste errors in his hand-edited tables.
http://instlatx64.atw.hu/ also has results of experimental measurements. I think they use a similar technique of a large block of the same instruction repeated, maybe small enough to fit in the uop cache. But they don't use perf counters to measure what execution port each instruction needs, so their throughput numbers don't help you figure out which instructions compete with which other instructions.
These latter two sources have been around for longer than uops.info, and cover some older CPUs, especially older AMD.
To measure latency yourself, you make the output of each instruction an input for the next.
mov ecx, 10000000
inc_latency:
inc eax
inc eax
inc eax
inc eax
inc eax
inc eax
sub ecx,1 ; avoid partial-flag false dep for P4
jnz inc_latency ; dec or sub/jnz macro-fuses into 1 uop on Intel SnB-family
This dependency chain of 6 inc instructions will bottleneck the loop at 1 iteration per 6 * inc_latency cycles. Using perf counters for core clock cycles (not RDTSC cycles), you can easily measure the time for all the iterations to 1 part in 10k, and with more care probably even more precisely than that. The repeat count of 10000000 hides start/stop overhead of whatever timing you use.
I normally put a loop like this in a Linux static executable that just makes a sys_exit(0) system call directly (with a syscall instruction), and time the whole executable with perf stat ./testloop to get time and a cycle count. (See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example).
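As a rough illustration (not the exact harness from that answer), a static Linux test program of that shape might look like the following; the loop body is the six-inc chain from above, and the build commands and file names are just assumptions:
; build (assumed): nasm -felf64 testloop.asm && ld -o testloop testloop.o
; run:             perf stat ./testloop
global _start
_start:
mov ecx, 10000000
inc_latency:
inc eax
inc eax
inc eax
inc eax
inc eax
inc eax
sub ecx, 1
jnz inc_latency
mov eax, 60        ; __NR_exit on x86-64 Linux
xor edi, edi       ; exit status 0
syscall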
Another example is Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths, with the added complication of using lfence to drain the out-of-order execution window for two dep chains.
To measure throughput, you use separate registers, and/or include an xor-zeroing occasionally to break dep chains and let out-of-order exec overlap things. Don't forget to also use perf counters to see which ports it can run on, so you can tell which other instructions it will compete with. (e.g. FMA (p01) and shuffles (p5) don't compete at all for back-end resources on Haswell/Skylake, only for front-end throughput.) Don't forget to measure front-end uop counts, too: some instructions decode to multiple uops.
How many different dependency chains do we need to avoid a bottleneck? Well we know the latency (measure it first), and we know the max possible throughput (number of execution ports, or front-end throughput.)
For example, if FP multiply had 0.25c throughput (4 per clock), we could keep 20 in flight at once on Haswell (5c latency). That's more than we have registers, so we could just use all 16 and discover that in fact the throughput is only 0.5c. But if it had turned out that 16 registers was a bottleneck, we could add xorps xmm0,xmm0 occasionally and let out-of-order execution overlap some blocks.
More is normally better; having just barely enough to hide latency can slow down with imperfect scheduling. If we wanted to go nuts measuring inc, we'd do this:
mov ecx, 10000000
inc_latency:
%rep 10 ;; source-level repeat of a block, no runtime branching
inc eax
inc ebx
; not ecx, we're using it as a loop counter
inc edx
inc esi
inc edi
inc ebp
inc r8d
inc r9d
inc r10d
inc r11d
inc r12d
inc r13d
inc r14d
inc r15d
%endrep
sub ecx,1 ; break partial-flag false dep for P4
jnz inc_latency ; dec/jnz macro-fuses into 1 uop on Intel SnB-family
If we were worried about partial-flag false dependencies or flag-merging effects, we might experiment with mixing in an xor eax,eax somewhere to let OoO exec overlap more than just when sub wrote all flags. (See INC instruction vs ADD 1: Does it matter?)
There's a similar problem for measuring throughput and latency of shl r32, cl on Sandybridge-family: the flag dependency chain isn't normally relevant for a computation, but putting shl back-to-back creates a dependency through FLAGS as well as through the register. (Or for throughput, there isn't even a register dep).
I posted about this on Agner Fog's blog: https://www.agner.org/optimize/blog/read.php?i=415#860. I mixed shl edx,cl in with four add edx,1 instructions, to see what incremental slowdown adding one more instruction had, where the FLAGS dependency was a non-issue. On SKL, it only slows down by an extra 1.23 cycles on average, so the true latency cost of that shl was only ~1.23 cycles, not 2. (It's not a whole number or just 1 because of resource conflicts to run the flag-merging uops of the shl, I guess. BMI2 shlx edx, edx, ecx would be exactly 1c because it's only a single uop.)
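A sketch of the kind of dependency chain described there (the loop scaffolding and register choices are assumptions, not taken from the blog post):
mov ecx, 1             ; shift count in CL, kept constant
mov esi, 10000000      ; loop counter (can't reuse ECX, since CL is the shift count)
shl_latency:
shl edx, cl            ; latency under test, plus flag-merging uops on SnB-family
add edx, 1             ; four adds extend the register dep chain by a known amount
add edx, 1
add edx, 1
add edx, 1
sub esi, 1
jnz shl_latency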
Related: for static performance analysis of whole blocks of code (containing different instructions), see What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?. (It's using the word "latency" for the end-to-end latency of a whole computation, but actually asking about things small enough for OoO exec to overlap different parts, so instruction latency and throughput both matter.)
The Latency=2 numbers for load/store appear to be from Agner Fog's instruction tables (https://agner.org/optimize/). They unfortunately aren't accurate for a chain of mov rax, [rax]. You'll find that's 4c latency if you measure it by putting that in a loop.
Agner splits up load/store latency into something that makes the total store/reload latency come out correct, but for some reason he doesn't make the load part equal to the L1d load-use latency when it comes from cache instead of the store buffer. (But also note that if the load feeds an ALU instruction instead of another load, the latency is 5c. So the simple addressing-mode fast-path only helps for pure pointer-chasing.)
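A hedged sketch of measuring that yourself: make a qword point at itself, then chase it in a loop (the setup and labels below are my assumptions, not from the original answer):
; assume RDI points at 8 writable bytes
mov [rdi], rdi         ; make the qword point at itself
mov rax, rdi
mov ecx, 10000000
chase:
mov rax, [rax]         ; ~4 cycle load-use latency on SKL for this simple-addressing pointer chase
dec ecx
jnz chase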

How does RIP-relative addressing perform compared to mov reg, imm64?

It is a known fact that x86-64 instructions do not support 64-bit immediate values (except for mov). Hence, when migrating code from 32 to 64 bits, an instruction like this:
cmp rax, addr32
cannot be replaced with the following:
cmp rax, addr64
Under these circumstances, I'm considering two alternatives: (a) using a scratch register for loading the constant or (b) using rip-relative addressing. The two approaches look like this:
mov r11, addr64 ; scratch register
cmp rax, r11
ptr64: dq addr64
...
cmp rax, [rel ptr64] ; encoded as cmp rax, [rip+offset]
I wrote a very simple loop to compare the performance of both approaches (which I paste below). While (b) uses an indirect pointer, (a) has the immediate encoded in the instruction (which could lead to a worse usage of i-cache). Surprisingly, I found that (b) runs ~10% faster than (a). Is this result something to be expected in more common real-world code?
true: dq 0xFFFF0000FFFF0000
false: dq 0xAAAABBBBAAAABBBB
main:
or rax, 1 ; rax is odd and constant "true" is even
mov rcx, 0x1
shl rcx, 30
branch:
mov r11, 0xFFFF0000FFFF0000 ; not present in (b)
cmp rax, r11 ; vs cmp rax, [rel true]
je next
add rax, 2
loop branch
next:
mov rax, 0
ret
Surprisingly, I found that (b) runs ~10% faster than (a)
You probably tested on a CPU other than AMD Bulldozer-family or Ryzen, which have a fast loop instruction. On other CPUs, loop is very slow, mostly on purpose for historical reasons, so you bottleneck on it. e.g. 7 uops, one per 5c throughput on Haswell.
mov r64, imm64 is bad for uop cache throughput because of the large immediate taking 2 slots in Intel's uop cache. (See the Sandybridge uop cache section in Agner Fog's microarch pdf), and Which is faster, imm64 or m64 for x86-64? where I listed the details.
Even apart from that, it's not too surprising that 1 extra uop in the loop makes it run slower. You're probably not on an AMD CPU (with single-uop / 1 per 2 clock loop), because the extra mov in such a tiny loop would make more than 10% difference. Or no difference at all, since it's just 3 vs. 4 uops per 2 clocks, if it's correct that even tiny loops are limited to one taken jump per 2 clocks.
On Intel, loop is 7 uops, one per 5 clocks throughput on most CPUs, so the 4-per-clock issue/rename bottleneck won't be what you're hitting. loop is micro-coded, so the front-end can't run from the loop buffer. (And Skylake CPUs have their LSD disabled by a microcode update to fix the partial-register erratum anyway.) So the mov r64,imm64 uop has to be re-read from the uop cache every time through the loop.
A load that hits in cache has very good throughput (2 loads per clock, and in this case micro-fusion means no extra uops to use a memory operand instead of register for cmp). So the main penalty in using a constant from memory is the extra cache footprint and cache misses, but your microbenchmark won't reveal that at all. It also has no other pressure on the load ports.
In the general case:
If possible, use a RIP-relative lea to generate 64-bit address constants.
e.g. lea rax, [rel addr64]. Yes, this takes an extra instruction to get the constant into a register. (BTW, just use default rel. You can use [abs fs:0] if you need it.)
You can avoid the extra instruction if you build position-dependent code with the default (small) code model, so static addresses fit in the low 32 bits of virtual address space and can be used as immediates. (Actually low 2GiB, so sign or zero extending both work). See 32-bit absolute addresses no longer allowed in x86-64 Linux? if gcc complains about absolute addressing; -pie is enabled by default on most distros. This of course doesn't work in Linux shared libraries, which only support text relocations for 64-bit addresses. But you should avoid relocations whenever possible by using lea to make position-independent code.
Most integer build-time constants fit in 32 bits, so you can use cmp r64, imm32 or cmp r32, imm32 even in PIC code.
If you do need a 64-bit non-address constant, try to hoist the mov r64, imm64 out of a loop. Your cmp loop would have been fine if the mov wasn't inside the loop. x86-64 has enough registers that you (or the compiler) can usually avoid reloads inside inner-most loops in integer code.
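For example, a hedged rework of the question's benchmark with the constant hoisted (and with the slow loop instruction replaced by dec/jnz, per the discussion above):
main:
or rax, 1                       ; rax stays odd; the constant is even, so je is never taken
mov rcx, 1
shl rcx, 30
mov r11, 0xFFFF0000FFFF0000     ; hoisted: load the 64-bit constant once, outside the loop
branch:
cmp rax, r11                    ; no mov r64, imm64 inside the loop any more
je next
add rax, 2
dec rcx                         ; dec/jnz instead of the microcoded loop instruction
jnz branch
next:
mov rax, 0
ret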

Do complex addressing modes have extra overhead for loads from memory?

Is there a difference in performance between these mov load instructions? Do the more complex addressing modes have extra overhead (latency or throughput) compared to the simple ones?
# AT&T syntax                   # Intel syntax:
movq (%rsi), %rax               mov rax, [rsi]
movq (%rdi, %rsi), %rax         mov rax, [rdi + rsi]
movq (%rdi, %rsi, 4), %rax      mov rax, [rdi + rsi*4]
Yes, there is an overhead for "complex addressing" on recent Intel CPUs. The cost is one additional cycle of latency (e.g., 5 cycles for a normal GP load using complex addressing versus 4 cycles with simple addressing).
Simple addressing is anything of the form [reg + offset] where the immediate offset is between 0 and 2047 inclusive.
Complex addressing is anything other than simple addressing.
In particular, any addressing mode with two registers, like your examples [rdi + rsi] or [rdi + rsi*4], is complex addressing and costs an extra cycle.
There is an exceptional case: if the index register1 is zeroed via a zeroing idiom (like xor edi, edi, but not like mov edi, 0) you don't pay the complex addressing penalty.
1 The index register is the one multiplied by 1, 2, 4 or 8, i.e., rsi in [rdi + rsi*4]. In the case where neither register has an explicit multiplier, like [rdi + rsi], the multiplier is 1 and you'll have to check your assembler to see how to specify which is the index and which is the base. nasm seems to use the second register as the index.
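A sketch of how the extra cycle shows up, using a self-pointing qword and a pointer-chasing dependency chain (the setup and register choices are mine, not from the answer):
mov [rdi], rdi          ; assume RDI points at a writable qword; make it point to itself
mov rax, rdi
mov edx, 0              ; index = 0, but written with MOV, so it does NOT count as a zeroing idiom
mov ecx, 100000000
simple:
mov rax, [rax]          ; simple addressing: ~4 cycles per iteration on recent Intel
dec ecx
jnz simple
mov ecx, 100000000
complex:
mov rax, [rax + rdx]    ; two-register (complex) addressing: ~5 cycles per iteration
dec ecx
jnz complex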
It depends on which specific CPU; mostly "no, there's no extra overhead". However...
Most CPUs have out-of-order cores, which means they perform instructions in whatever order is fastest and not in the order the instructions are given. For this to work, one instruction (e.g. movq (%rdi, %rsi, 4), %rax) can't happen until the things it depends on are finished (e.g. the values in rdi and rsi are known).
For example, these 2 instructions can occur in parallel (because the second instruction doesn't depend on the first):
movq (%rdi), %rdi
movq (%rsi), %rax
And these 2 instructions can't occur in parallel (the second instruction has to wait until the first instruction completes):
movq (%rdi), %rdi
movq (%rdi, %rsi), %rax
Also note that the bottleneck for a piece of code may not be execution. If the bottleneck is instruction fetch then larger instructions will be worse; if the bottleneck is instruction decode then more complex instructions can be worse; if the bottleneck is data cache bandwidth then anything that reads/writes to memory can be worse, etc.
Basically; you can't look at individual instructions in isolation and decide if they're better/worse. You have to look at entire sequences of multiple instructions so that you can know about any dependencies on previous instructions (and their latencies); and you have to know what the bottleneck is (e.g. from performance monitoring tools); and if you know all this then you can make an "educated guess" that's only really useful for a small number of CPUs (because different CPUs have different characteristics).

What is the best way to set a register to zero in x86 assembly: xor, mov or and?

All the following instructions do the same thing: set %eax to zero. Which way is optimal (requiring fewest machine cycles)?
xorl %eax, %eax
mov $0, %eax
andl $0, %eax
TL;DR summary: xor same, same is the best choice for all CPUs. No other method has any advantage over it, and it has at least some advantage over any other method. It's officially recommended by Intel and AMD, and what compilers do. In 64-bit mode, still use xor r32, r32, because writing a 32-bit reg zeros the upper 32. xor r64, r64 is a waste of a byte, because it needs a REX prefix.
Even worse than that, Silvermont only recognizes xor r32,r32 as dep-breaking, not 64-bit operand-size. Thus even when a REX prefix is still required because you're zeroing r8..r15, use xor r10d,r10d, not xor r10,r10.
GP-integer examples:
xor eax, eax ; RAX = 0. Including AL=0 etc.
xor r10d, r10d ; R10 = 0. Still prefer 32-bit operand-size.
xor edx, edx ; RDX = 0
; small code-size alternative: cdq ; zero RDX if EAX is already zero
; SUB-OPTIMAL
xor rax,rax ; waste of a REX prefix, and extra slow on Silvermont
xor r10,r10 ; bad on Silvermont (not dep breaking), same as r10d on other CPUs because a REX prefix is still needed for r10d or r10.
mov eax, 0 ; doesn't touch FLAGS, but not faster and takes more bytes
and eax, 0 ; false dependency. (Microbenchmark experiments might want this)
sub eax, eax ; same as xor on most but not all CPUs; bad on Silvermont for example.
xor cl, cl ; false dep on some CPUs, not a zeroing idiom. Use xor ecx,ecx
mov cl, 0 ; only 2 bytes, and probably better than xor cl,cl *if* you need to leave the rest of ECX/RCX unmodified
Zeroing a vector register is usually best done with pxor xmm, xmm. That's typically what gcc does (even before use with FP instructions).
xorps xmm, xmm can make sense. It's one byte shorter than pxor, but xorps needs execution port 5 on Intel Nehalem, while pxor can run on any port (0/1/5). (Nehalem's 2c bypass delay latency between integer and FP is usually not relevant, because out-of-order execution can typically hide it at the start of a new dependency chain).
On SnB-family microarchitectures, neither flavour of xor-zeroing even needs an execution port. On AMD, and pre-Nehalem P6/Core2 Intel, xorps and pxor are handled the same way (as vector-integer instructions).
Using the AVX version of a 128b vector instruction zeros the upper part of the reg as well, so vpxor xmm, xmm, xmm is a good choice for zeroing YMM(AVX1/AVX2) or ZMM(AVX512), or any future vector extension. vpxor ymm, ymm, ymm doesn't take any extra bytes to encode, though, and runs the same on Intel, but slower on AMD before Zen2 (2 uops). The AVX512 ZMM zeroing would require extra bytes (for the EVEX prefix), so XMM or YMM zeroing should be preferred.
XMM/YMM/ZMM examples
# Good:
xorps xmm0, xmm0 ; smallest code size (for non-AVX)
pxor xmm0, xmm0 ; costs an extra byte, runs on any port on Nehalem.
xorps xmm15, xmm15 ; Needs a REX prefix but that's unavoidable if you need to use high registers without AVX. Code-size is the only penalty.
# Good with AVX:
vpxor xmm0, xmm0, xmm0 ; zeros X/Y/ZMM0
vpxor xmm15, xmm0, xmm0 ; zeros X/Y/ZMM15, still only 2-byte VEX prefix
#sub-optimal AVX
vpxor xmm15, xmm15, xmm15 ; 3-byte VEX prefix because of high source reg
vpxor ymm0, ymm0, ymm0 ; decodes to 2 uops on AMD before Zen2
# Good with AVX512
vpxor xmm15, xmm0, xmm0 ; zero ZMM15 using an AVX1-encoded instruction (2-byte VEX prefix).
vpxord xmm30, xmm30, xmm30 ; EVEX is unavoidable when zeroing zmm16..31, but still prefer XMM or YMM for fewer uops on probable future AMD. May be worth using only high regs to avoid needing vzeroupper in short functions.
# Good with AVX512 *without* AVX512VL (e.g. KNL / Xeon Phi)
vpxord zmm30, zmm30, zmm30 ; Without AVX512VL you have to use a 512-bit instruction.
# sub-optimal with AVX512 (even without AVX512VL)
vpxord zmm0, zmm0, zmm0 ; EVEX prefix (4 bytes), and a 512-bit uop. Use AVX1 vpxor xmm0, xmm0, xmm0 even on KNL to save code size.
See Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm? and
What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?
Semi-related: Fastest way to set __m256 value to all ONE bits and
Set all bits in CPU register to 1 efficiently also covers AVX512 k0..7 mask registers. SSE/AVX vpcmpeqd is dep-breaking on many CPUs (although it still needs a uop to write the 1s), but AVX512 vpternlogd for ZMM regs isn't even dep-breaking. Inside a loop, consider copying from another register instead of re-creating ones with an ALU uop, especially with AVX512.
But zeroing is cheap: xor-zeroing an xmm reg inside a loop is usually as good as copying, except on some AMD CPUs (Bulldozer and Zen) which have mov-elimination for vector regs but still need an ALU uop to write zeros for xor-zeroing.
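A sketch of the copy-instead-of-recreate idea for the all-ones case (register numbers are arbitrary):
vpcmpeqd ymm15, ymm15, ymm15   ; build the all-ones constant once, before the loop (one ALU uop)
; inside the loop:
vmovdqa ymm1, ymm15            ; copy: handled by mov-elimination at rename on IvB+/Zen, no ALU uop
; instead of re-creating it every iteration:
; vpcmpeqd ymm1, ymm1, ymm1    ; dep-breaking on many CPUs, but always costs an ALU uop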
What's special about zeroing idioms like xor on various uarches
Some CPUs recognize sub same,same as a zeroing idiom like xor, but all CPUs that recognize any zeroing idioms recognize xor. Just use xor so you don't have to worry about which CPU recognizes which zeroing idiom.
xor (being a recognized zeroing idiom, unlike mov reg, 0) has some obvious and some subtle advantages (summary list, then I'll expand on those):
smaller code-size than mov reg,0. (All CPUs)
avoids partial-register penalties for later code. (Intel P6-family and SnB-family).
doesn't use an execution unit, saving power and freeing up execution resources. (Intel SnB-family)
smaller uop (no immediate data) leaves room in the uop cache-line for nearby instructions to borrow if needed. (Intel SnB-family).
doesn't use up entries in the physical register file. (Intel SnB-family (and P4) at least, possibly AMD as well since they use a similar PRF design instead of keeping register state in the ROB like Intel P6-family microarchitectures.)
Smaller machine-code size (2 bytes instead of 5) is always an advantage: Higher code density leads to fewer instruction-cache misses, and better instruction fetch and potentially decode bandwidth.
The benefit of not using an execution unit for xor on Intel SnB-family microarchitectures is minor, but saves power. It's more likely to matter on SnB or IvB, which only have 3 ALU execution ports. Haswell and later have 4 execution ports that can handle integer ALU instructions, including mov r32, imm32, so with perfect decision-making by the scheduler (which doesn't always happen in practice), HSW could still sustain 4 uops per clock even when they all need ALU execution ports.
See my answer on another question about zeroing registers for some more details.
Bruce Dawson's blog post that Michael Petch linked (in a comment on the question) points out that xor is handled at the register-rename stage without needing an execution unit (zero uops in the unfused domain), but missed the fact that it's still one uop in the fused domain. Modern Intel CPUs can issue & retire 4 fused-domain uops per clock. That's where the 4 zeros per clock limit comes from. Increased complexity of the register renaming hardware is only one of the reasons for limiting the width of the design to 4. (Bruce has written some very excellent blog posts, like his series on FP math and x87 / SSE / rounding issues, which I do highly recommend).
On AMD Bulldozer-family CPUs, mov immediate runs on the same EX0/EX1 integer execution ports as xor. mov reg,reg can also run on AGU0/1, but that's only for register copying, not for setting from immediates. So AFAIK, on AMD the only advantage to xor over mov is the shorter encoding. It might also save physical register resources, but I haven't seen any tests.
Recognized zeroing idioms avoid partial-register penalties on Intel CPUs which rename partial registers separately from full registers (P6 & SnB families).
xor will tag the register as having the upper parts zeroed, so xor eax, eax / inc al / inc eax avoids the usual partial-register penalty that pre-IvB CPUs have. Even without xor, IvB only needs a merging uop when the high 8bits (AH) are modified and then the whole register is read, and Haswell even removes that.
From Agner Fog's microarch guide, pg 98 (Pentium M section, referenced by later sections including SnB):
The processor recognizes the XOR of a register with itself as setting
it to zero. A special tag in the register remembers that the high part
of the register is zero so that EAX = AL. This tag is remembered even
in a loop:
; Example 7.9. Partial register problem avoided in loop
xor eax, eax
mov ecx, 100
LL:
mov al, [esi]
mov [edi], eax ; No extra uop
inc esi
add edi, 4
dec ecx
jnz LL
(from pg82): The processor remembers that the upper 24 bits of EAX are zero as long as
you don't get an interrupt, misprediction, or other serializing event.
pg82 of that guide also confirms that mov reg, 0 is not recognized as a zeroing idiom, at least on early P6 designs like PIII or PM. I'd be very surprised if they spent transistors on detecting it on later CPUs.
xor sets flags, which means you have to be careful when testing conditions. Since setcc is unfortunately only available with an 8bit destination, you usually need to take care to avoid partial-register penalties.
It would have been nice if x86-64 had repurposed one of the removed opcodes (like AAM) for a 16/32/64-bit setcc r/m, with the predicate encoded in the 3-bit source-register field of the ModRM byte (the way some other single-operand instructions use that field as extra opcode bits). But they didn't do that, and it wouldn't help for x86-32 anyway.
Ideally, you should use xor / set flags / setcc / read full register:
...
call some_func
xor ecx,ecx ; zero *before* the test
test eax,eax
setnz cl ; cl = (some_func() != 0)
add ebx, ecx ; no partial-register penalty here
This has optimal performance on all CPUs (no stalls, merging uops, or false dependencies).
Things are more complicated when you don't want to xor before a flag-setting instruction. e.g. you want to branch on one condition and then setcc on another condition from the same flags. e.g. cmp/jle, sete, and you either don't have a spare register, or you want to keep the xor out of the not-taken code path altogether.
There are no recognized zeroing idioms that don't affect flags, so the best choice depends on the target microarchitecture. On Core2, inserting a merging uop might cause a 2 or 3 cycle stall. It appears to be cheaper on SnB, but I didn't spend much time trying to measure. Using mov reg, 0 / setcc would have a significant penalty on older Intel CPUs, and still be somewhat worse on newer Intel.
Using setcc / movzx r32, r8 is probably the best alternative for Intel P6 & SnB families, if you can't xor-zero ahead of the flag-setting instruction. That should be better than repeating the test after an xor-zeroing. (Don't even consider sahf / lahf or pushf / popf). IvB can eliminate movzx r32, r8 (i.e. handle it with register-renaming with no execution unit or latency, like xor-zeroing). Haswell and later only eliminate regular mov instructions, so movzx takes an execution unit and has non-zero latency, making test/setcc/movzx worse than xor/test/setcc, but still at least as good as test/mov r,0/setcc (and much better on older CPUs).
Using setcc / movzx with no zeroing first is bad on AMD/P4/Silvermont, because they don't track deps separately for sub-registers. There would be a false dep on the old value of the register. Using mov reg, 0/setcc for zeroing / dependency-breaking is probably the best alternative when xor/test/setcc isn't an option.
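A sketch of that setcc / movzx alternative when the flags are already set and there was no chance to xor-zero first (register choices are arbitrary):
test eax, eax          ; flags already set by something you couldn't zero ahead of
setnz cl               ; write only the low byte
movzx ecx, cl          ; zero-extend: eliminated at rename on IvB; one uop / 1c latency on Haswell and later
add ebx, ecx           ; consume the full 32-bit value with no partial-register penalty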
Of course, if you don't need setcc's output to be wider than 8 bits, you don't need to zero anything. However, beware of false dependencies on CPUs other than P6 / SnB if you pick a register that was recently part of a long dependency chain. (And beware of causing a partial reg stall or extra uop if you call a function that might save/restore the register you're using part of.)
and with an immediate zero isn't special-cased as independent of the old value on any CPUs I'm aware of, so it doesn't break dependency chains. It has no advantages over xor and many disadvantages.
It's useful only for writing microbenchmarks when you want a dependency as part of a latency test, but want to create a known value by zeroing and adding.
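A sketch of that microbenchmark use of and-with-zero (the surrounding dependency chain is hypothetical):
imul eax, eax       ; ... tail of some long dependency chain through EAX ...
and eax, 0          ; result is 0, but still *depends* on the old EAX (not a recognized zeroing idiom)
add eax, 123        ; EAX = 123, with its ready-time serialized after the chain above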
See http://agner.org/optimize/ for microarch details, including which zeroing idioms are recognized as dependency breaking (e.g. sub same,same is on some but not all CPUs, while xor same,same is recognized on all.) mov does break the dependency chain on the old value of the register (regardless of the source value, zero or not, because that's how mov works). xor only breaks dependency chains in the special-case where src and dest are the same register, which is why mov is left out of the list of specially recognized dependency-breakers. (Also, because it's not recognized as a zeroing idiom, with the other benefits that carries.)
Interestingly, the oldest P6 design (PPro through Pentium III) didn't recognize xor-zeroing as a dependency-breaker, only as a zeroing idiom for the purposes of avoiding partial-register stalls, so in some cases it was worth using both mov and then xor-zeroing in that order to break the dep and then zero again + set the internal tag bit that the high bits are zero so EAX=AX=AL.
See Agner Fog's Example 6.17. in his microarch pdf. He says this also applies to P2, P3, and even (early?) PM. A comment on the linked blog post says it was only PPro that had this oversight, but I've tested on Katmai PIII, and @Fanael tested on a Pentium M, and we both found that it didn't break a dependency for a latency-bound imul chain. This confirms Agner Fog's results, unfortunately.
TL:DR:
If it really makes your code nicer or saves instructions, then sure, zero with mov to avoid touching the flags, as long as you don't introduce a performance problem other than code size. Avoiding clobbering flags is the only sensible reason for not using xor, but sometimes you can xor-zero ahead of the thing that sets flags if you have a spare register.
mov-zero ahead of setcc is better for latency than movzx reg32, reg8 after (except on Intel when you can pick different registers), but worse code size.

Resources