In x86 assembly, is it better to use two separate registers for imul?

In x86 assembly, is it better to use two separate registers for imul? - performance

I am wondering, mostly out of curiosity, if using the same register for an operation is better than using two. What would be better, considering performance and/or other concerns?
mov %rbx, %rcx
imul %rcx, %rcx
or
mov %rbx, %rcx
imul %rbx, %rcx
Any tips for how to benchmark this, or resources where I could read about this type of thing would be appreciated, as I am new to assembly.

resources where I could read about this type of thing
See Agner Fog's microarch pdf, and his optimizing assembly guide. Also other links in the x86 tag wiki (e.g. Intel's optimization manual).
The interesting option you didn't mention is:
mov %rbx, %rcx
imul %rbx, %rbx # doesn'y have to wait for mov to execute
# old value of %rbx is still available in %rcx
If the imul is on the critical path, and mov has non-zero latency (like on AMD CPUs, and Intel before IvyBridge), this is potentially better. The result of imul will be ready one cycle earlier, because has no dependency on the result of the mov.
If, however, the old value is on the critical path and the squared value isn't, then this is worse because it adds a mov to the critical path.
Of course, it also means you have to keep track of the fact that your old variable is now live in a different register, and the old register has the squared value. If this is a problem in a loop, unroll it so you can end up with things where the top of the loop is expecting them. If you wanted this to be easy, you'd use a compiler instead of optimizing asm by hand.
However, Intel P6-family CPUs (PPro/PII to Nehalem) have limited register-read ports, so it can be better to favour reading registers that you just wrote. If the %rbx wasn't written in the last couple cycles, it will have to be read from the permanent register file when the mov and imul uops go through the rename&issue stage (the RAT).
If they don't issue as part of the same group of 4, then they would each need to read %rbx separately. Since the register file in Core2/Nehalem only has 3 read ports, issue groups (quartets, as Agner Fog calls them) stall until all their not-recently-written input register values are read from the register file (at 3 per cycle, or 2 on Core2 is none of the 3 regs are index regs in an addressing mode).
For the full details, see Agner Fog's microarch pdf section 8.8. The Core2 section refers back to the PPro section. PPro has a 3-wide pipeline, so in that section Agner talks about triplets, not quartets.
If mov and imul issue together, they both share the same read of %rbx. There's a 3 in 4 chance of this happening on Core2/Nehalem.
Choosing just between the sequences you mention the first one has a clear (but usually small) advantage over the second for Intel P6-family CPUs. There's no difference for other CPUs, AFAIK, so the choice is obvious.
mov %rbx, %rcx
imul %rcx, %rcx # uses only the recently-written rcx; can't contribute to register-read stalls
worst of both worlds:
mov %rbx, %rcx
imul %rbx, %rcx # can't execute until after the mov, but still reads a potentially-old register
If you're going to depend on a recently-written register, you might as well use only recently-written registers.
Intel Sandybridge-family uses a physical register file (like AMD Bulldozer-family), and doesn't have register-read stalls.
Ivybridge (2nd gen Sandybridge) and later also handle mov reg,reg at register rename time, with zero latency and no execution unit. This means it doesn't matter whether you imul rbx or rcx as far as critical path length.
However, AMD Bulldozer-family can only handle xmm register moves in its rename stage; integer register moves still have 1c latency.
It's potentially still worth caring about which dependency chain the mov is part of, if latency is a limiting factor in the cycles per iteration of a loop.
how to benchmark this
I think you could put together a microbenchmark that has a register read stall on Core2 with imul %rbx, %rcx, but not with imul %rcx, %rcx. However, that would require some trial and error to get the mov and imul to issue in different groups, and unless you're feeling really creative, probably some artificial-looking surrounding code that exists only to read lots of registers. (e.g. lea (%rsi, %rdi, 1), %eax, or even add (%rsi, %rdi, 1), %eax (which has to read all three registers, and does micro-fuse on core2/nehalem so it only takes 1 uop slot in an issue group. (It doesn't micro-fuse on SnB-family)).

On a modern processor, using one register for both source and destination and using two different registers will never make any difference to performance. The reason for this is partly due to register renaming which, if there were a difference in performance would solve it by changing one of the registers to a different one and modify your subsequent instructions to use the new register (your processor actually has more registers than the instruction set has a way of refering to them so that it can do stuff like this). It is also because of the nature of a pipelined processor's implementation -- the contents of source registers are read at one pipeline stage and are then written at another later stage, which makes it difficult or impossible for register usage for a single instruction to cause any kind of interaction like the one you're worrying about.
More problematic is if an instruction refers to a value produced in its previous instruction, but even that is solved (usually) by out-of-order execution.

Related

Are some general purpose registers faster than others?

In x86-64, will certain instructions execute faster if some general purpose registers are preferred over others?
For instance, would mov eax, ecx execute faster than mov r8d, ecx? I can imagine that the latter would need a REX prefix which would make the instruction fetch slower?
What about using rax instead of rcx? What about add or xor? Other operations? Smaller registers like r15b vs al? al vs ah?
AMD vs Intel? Newer processors? Older processors? Combinations of instructions?
Clarification: Should certain general purpose registers be preferred over others, and which ones are they?

In general, architectural registers are all equal, and renamed onto a large array of physical registers.
(Except partial registers can be slower, especially high-byte AH/BH/CH/DH which are slow to read after writing the full register, on Haswell and later. See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent and also Why doesn't GCC use partial registers? for problems when writing 8-bit and 16-bit registers). The rest of this answer is just going to consider 32/64-bit operand-size.)
But some instruction require specific registers, like legacy variable-count shifts (without BMI2 shrx etc) require the count in CL. Division requires the dividend in EDX:EAX (or RDX:RAX for the slower 64-bit version).
Using a call-preserved register like RBX means your function has to spend extra instructions saving/restoring it.
But of course there are perf differences if you need more instructions. So lets assume all else is equal, and just talk about the uops, latency, and code-size of a single instruction just by changing which register is used for one of its operands. TL:DR: the only perf difference is due to instruction-encoding restrictions / differences. Sometimes a different register will allow / require (or get the assembler to pick) a different encoding, which will often be smaller / larger as a special case, and sometimes even executes differently.
Generally smaller code is faster, and packs better in the uop cache and I-cache, so unless you've analyzed a specific case and found a problem, favour the smaller encoding. Often that means keeping a byte value in AL so you can use those special-case instructions, and avoiding RBP / R13 for pointers.
Special cases where a specific encoding is extra slow, not just size
LEA with RBP or R13 as a base can be slower on Intel if the addressing mode didn't already have a +displacement constant.
e.g. lea eax, [rbp + 12] is encodeable as-written, and is just as fast as lea eax, [rcx + 12].
But lea eax, [rbp + rcx*4] can only be encoded in machine code as lea eax, [rbp + rcx*4 + 0] (because of addressing mode escape-code stuff), which is a 3-component LEA, and thus slower on Intel (3 cycle latency on Sandybridge-family instead of 1 cycle, see https://agner.org/optimize/ instruction tables and microarch PDF). On AMD, having a scaled-index would already make it a slow-LEA even with lea eax, [rdx + rcx*4]
Outside of LEA, using RBP / R13 as the base in any addressing mode always requires a disp8/32 byte or dword, but I don't think the actual AGUs are slower for a 3-component addressing mode. So it's just a code-size effect.
Other cases include Which Intel microarchitecture introduced the ADC reg,0 single-uop special case? where the short-form 2-byte encoding for adc al, imm8 is 2 uops even on modern uarches like Skylake, where adc bl, imm8 is 1 uop.
So not only does the adc reg,0 special case not work for adc al,0 on Sandybridge through Haswell, Broadwell and newer forgot (or chose not to) optimize how that encoding decodes to uops. (Of course you could manually encode adc al,0 using the 3-byte Mod/RM encoding, but assemblers will always pick the shortest encoding so adc al,0 will assemble to the short form by default.) Only a problem with byte registers; adc eax,0 will use the opcode ModRM imm8 3-byte encoding, not 5-byte opcode imm32.
For other cases of op al,imm8, the only difference is code-size, which only indirectly matters for performance. (Because of decoding, uop-cache packing, and I-cache misses).
See Tips for golfing in x86/x64 machine code for more about special cases of code-size, like xchg eax, ecx being 1-byte vs. xchg edx, ecx being 2 bytes.
add rsp, 8 can need an extra stack-sync uop if there hasn't been an explicit use of RSP or ESP since the last push/pop/call/ret (along the path of execution of course, not in the static code layout). (What is the stack engine in the Sandybridge microarchitecture?). This is why compilers like clang use a dummy push or pop to reserve / free a single stack slot: Why does this function push RAX to the stack as the first operation?

LEA will be slower with EBP, RBP, or R13 as the base (PDF warning, page 3-22). But generally the answer is No.
Taking a step back, it's important to realize that since the advent of register renaming that architectural registers don't deal with actual, physical registers on most micro-architectures. For example, each Cascade Lake core has a register file of 180 integer and 168 FP registers.

You have stuffed too many questions altogether, however, if I understood the question well, you are confusing the processor architecture with the small but fast Register file, which fills in the speed gap between the processor and memory technologies. The register file is small enough that it can only support one instruction at a time, i.e. the current instruction, and fast enough that it can almost catch up with the processor speed.
I would like to build a short background, the naming conventions of these registers serves two purposes: one, it makes the older versions of the x86 ISA implementations compatible up till now, and two, every name of these registers has a special purpose to it besides its general purpose use. For example, the ECX register is used as a counter to implement loops i.e. instructions like JECXZ and LOOP uses ECX register exclusively. Though you need to watch out for some flags that you would not want to lose.
And now the answer to your question stems from the second purpose. So some registers would seem to be faster because these special registers are hardcoded into the processor and can be accessed much quicker, however, the difference should not be much.
And the second thing that you might know, not all instructions are of the same complexity, especially in x86, the opcode of instructions can be from 1-3 bytes and as more and more functionality is added to the instruction in terms of, prefixes, addressing modes, etc. these instructions start to become slower, So it is not the case that some registers are slower than other, it is just that some registers are encoded into the instruction and therefore those instructions run faster with that combination of register. And if otherwise used, it would seem slower. I hope that helps. Thanks

x86 - instruction interleaving to avoid cpu stall

Gcc6 - intel core 2 duo.
Compilation flags: "-march=native -O3" (-S)
I was compiling a simple program and asked for the assembly output:
Code
movq 8(%rsi), %rdi
call _atoi
movq 16(%rbp), %rdi
movl %eax, %ebx
call _atof
pxor %xmm1, %xmm1
movl $1, %eax <- this instruction is my problem
cvtsi2sd %ebx, %xmm1
leaq LC0(%rip), %rdi
addsd %xmm1, %xmm0
call _printf
addq $8, %rsp
Execution
read/convert an integer variable, then read/convert a double value and add them.
The problem
I perfectly understand that one (the compiler more so) has to avoid cpu stalls as much as possible.
I've shown the offending instruction in the code section above.
To me, with cpu reordering, and different execution context, this interleaved instruction is useless.
My rationale is: chances that we stall are very high anyway and the cpu will wait for pxor xmm1 to return before being able to reuse it in the next instruction. Adding an instruction will just fill the cpu decoder for nothing. The cpu HAS to wait anyway. So why not leaving it alone for 1 instruction?
Moving the pxor before atof seems not possible as atof may use it.
Question
Is that a bug, a legacy junk (when cpu were not able to reorder) or.. else?
Thanks
EDIT:
I admit my question was not clear: can this instruction be safely removed without performance consequences?

The x86-64 ABI requires that calls to varargs functions (like printf) set %al = the count of floating-point args passed in xmm registers. In this case, you're passing one double, so the ABI requires %al = 1. (Fun fact: C's promotion rules make it impossible to pass a float to a vararg function. This is why there are no printf conversion specifiers for float, only double.)
mov $1, %eax avoids false dependencies on the rest of eax, (compared to mov $1, %al), so gcc prefers spending extra instruction bytes on that, even though it's tuning for Core2 (which renames partial registers).
Previous answer, before it was clarified that the question was why the mov is done all, not about its ordering.
IIRC, gcc doesn't do much instruction scheduling for x86, because it's assuming out-of-order execution. I tried to google that, but didn't find the quote from a gcc developer that I seem to remember reading (maybe in a gcc bug report comment).
Anyway, it looks ok to me, unless you're tuning for in-order Atom or P5. If you are, use gcc -O3 -march=atom (which implies -mtune=atom). But anyway, you're clearly not doing that, because you used -march=native on a C2Duo, which is a 4-wide out-of-order design with a fairly large scheduler.
To me, with cpu reordering, and different execution context, this interleaved instruction is useless.
I have no idea what you think the problem is, or what ordering you think would be better, so I'll just explain why it looks good.
I didn't take the time to edit this down to a short answer, so you might prefer to just read Agner Fog's microarch pdf for details of the Core2 pipeline, and skim this answer. See also other links from the x86 tag wiki.
...
call _atof
# xmm0 is probably still not ready when the following instructions issue
pxor %xmm1, %xmm1 # no inputs, so can run any time after being issued.
gcc uses pxor because cvtsi2sd is badly designed, giving it a false dependency on the previous value of the vector register. Note how the upper half of the vector register keeps its old value. Intel probably designed it this way because the original SSE cvtsi2ss was first implemented on Pentium III, where 128b vectors were handled as two halves. Zeroing the rest of the register (including the upper half) instead of merging probably would have taken an extra uop on PIII.
This short-sighted design choice saddled the architecture with the choice between an extra dependency-breaking instruction, or a false dependency. A false dep might not matter at all, or might be a big slowdown if the register used by one function happened to be used for a very long FP dependency chain in another function (maybe including a cache miss).
On Intel SnB-family CPUs, xor-zeroing is handled at register-rename time, so the uop never needs to execute on an execution port; it's already completed as soon as it issues into the ROB. This is true for integer and vector registers.
On other CPUs, the pxor will need an execution port, but has no input dependencies so it can execute any time there's a free ALU port, after it issues.
movl $1, %eax # no input dependencies, can execute any time.
This instruction could be placed anywhere after call atof and before call printf.
cvtsi2sd %ebx, %xmm1 # no false dependency thanks to pxor.
This is a 2 uop instruction on Core2 (Merom and Penryn), according to Agner Fog's tables. That's weird because cvtsi2ss is 1 uop. (They're both 2 uops in SnB; presumably one uop to move data between integer and vector, and another for the conversion).
Putting this insn earlier would be good, potentially issue it a cycle earlier, since it's part of the longest dependency chain here. (The integer stuff is all simple and trivial). However, printf has to parse the format string before it will decide to look at xmm0, so the FP instructions aren't actually on the critical path.
It can't go ahead of pxor, and call / pxor / cvtsi2sd would mean pxor would decode by itself that cycle. Decoding will start with the instruction after the call, after the ret in the called function has been decoded (and the return-address predictor predicts the jump back to the insn after the call). Multi-uop instructions have to be the first instruction in a block, so having pxor and mov imm32 decode that cycle means less of a decode bottleneck.
leaq LC0(%rip), %rdi # 1 uop
addsd %xmm1, %xmm0 # 1 uop
call _printf # 3 uop insn
cvtsi2sd/lea/addsd can all decode in the same cycle, which is optimal. If the mov imm32 was after the cvt, it could decode in the same cycle as well (since pre-SnB decoders can handle up to 4-1-1-1), but it couldn't have issued as soon.
If decoding was only barely keeping up with issue, that would mean pxor would issue by itself (because no other instructions were decoded yet). Then cvtsi2sd/mov imm/lea (4 uops), then addsd / call (4 uops). (addsd decoded with the previous issue group; core2 has a short queue between decode and issue to help absorb decode bubbles like this, and make it useful to be able to decode up to 7 uops in a cycle.)
That's not appreciably different from the current issue pattern in a decode-bottleneck situation: (pxor / mov imm) / (cvtsi2sd/lea/addsd) / (call printf)
If decode isn't the bottleneck, I'm not sure if Core2 can issue a ret or jmp in the same cycle as uops that follow the jump. In SnB-family CPUs, an unconditional jump always ends an issue group. e.g. a 3-uop loop issues ABC, ABC, ABC, not ABCA, BCAB, CABC.
Assuming the instructions after the ret issue with a group not including the ret, we'd have
(pxor/mov imm/cvtsi2sd), (lea / addsd / 2 of call's 3 uops) / (last call uop)
So the cvtsi2sd still issues in the first cycle after returning from atof, which means it can get started executing right away. Even on Core2, where pxor takes an execution unit, the first of the 2 uops from cvtsi2sd can probably execute in the same cycle as pxor. It's probably only the 2nd uop that has an input dependency on the dst register.
(mov imm / pxor / cvtsi2sd) would be equivalent, and so would the slower-to-decode (pxor / cvtsi2sd / mov imm), or getting the lea executed before mov imm.

per clock perf. - can I use different registers for same instruction?

Can I use say four general purpose registers say r8,r9,r10,r11 each with MOV instruction for independent operations and be in impression that CPU is doing all those instructions in a single clock ?
I want to know because according to Agner Fog's Instruction Table, it says reciprocal throughput of MOV instruction is 0.25. It means CPU should be able to execute 4 MOV operations per cycle. Or I misinterpreted that all ??
I am a noob and have been learning Assembly in MASM since two months (mainly for learning debugging stuffs how registers works and it is really fun).

Edit, just re-read your question, and you're asking about different registers. I'll leave in my original answer; let's pretend your question wasn't just the most trivial case. :P
Yes, even without register renaming, these instructions can all execute (on separate execution units) in the same cycle because they're completely independent of each other.
mov eax, 1
mov ebx, ecx
mov edx, [mem]
xor esi,esi ;xor-zero: doesn't even use an execution unit on SnB-family
This is the easiest case for superscalar execution. If eax/rax was the destination for all four instructions, register-renaming would still allow all four instructions to execute in parallel.
Out-of-order execution allows four nearby instructions from separate dependency chains to execute at the same time, even if they weren't decoded or issued in the same clock cycle. And they probably won't retire in the same cycle either, if there are instructions between them. (The x86 ISA guarantees precise exceptions, like most other ISAs (ARM/PPC/etc.). All current designs accomplish with in-order retirement. So if a memory op segfaults, the program will stop at exactly that instruction, not just "well, there was a segfault somewhere recently, but we can't tell you where". (That would be non-precise exceptions).)
Superscalar in-order designs like Atom, or P5 (original Pentium) can still take advantage of the parallelism in these four independent instructions, but not in many other cases.
In a hand-crafted loop, it's common for a SnB-family CPU to be able to sustain well over 3 fused-domain uops per cycle. (It's also very easy to write loops that run at less than one fused-domain uop per cycle, due to latency, to say nothing of cache misses or branch mispredicts.)
Yes, multiple writes to the same architectural register can execute in parallel. Register renaming is not a bottleneck on Intel or AMD designs.
To understand and make full use of Agner Fog's tables, you have to read his microarch guide, or at least his "optimizing assembly" guide. See also good stuff at the x86 wiki.
As Agner Fog's microarch pdf points out (section 9.8 about Intel SnB/IvB):
Register renaming is controlled by the register alias table (RAT) and
the reorder buffer (ROB), shown in figure 6.1. The μops from the
decoders and the stack engine go to the RAT via a queue and then to
the ROB-read and the reservation station. The RAT can handle 4 μops
per clock cycle. The RAT can rename four registers per clock cycle,
and it can even rename the same register four times in one clock
cycle.
read-modify-write is another story (destination of an add instruction). A read-modify-write of an architectural register is (part of) a dependency chain, while an unconditional mov or an xor-zeroing starts a new dep chain. (Same for the output of certain other instructions like lea which don't read their destination).
Those register writes still rename the architectural register to a new physical register as well. This is how CPUs handle cases like
mov eax, 1 ; start of a dep chain
mov [mem+rax+rcx], eax
inc eax ; eax renamed again
The store needs the value of eax from before the inc. It gets it because when it checks the RAT, the architectural eax is still pointing to the same physical register that the mov eax,1 wrote. The inc can't just modify that same physical register because it doesn't know what if anything is not done yet with the previous value of eax.

What is the best way to set a register to zero in x86 assembly: xor, mov or and?

All the following instructions do the same thing: set %eax to zero. Which way is optimal (requiring fewest machine cycles)?
xorl %eax, %eax
mov $0, %eax
andl $0, %eax

TL;DR summary: xor same, same is the best choice for all CPUs. No other method has any advantage over it, and it has at least some advantage over any other method. It's officially recommended by Intel and AMD, and what compilers do. In 64-bit mode, still use xor r32, r32, because writing a 32-bit reg zeros the upper 32. xor r64, r64 is a waste of a byte, because it needs a REX prefix.
Even worse than that, Silvermont only recognizes xor r32,r32 as dep-breaking, not 64-bit operand-size. Thus even when a REX prefix is still required because you're zeroing r8..r15, use xor r10d,r10d, not xor r10,r10.
GP-integer examples:
xor eax, eax ; RAX = 0. Including AL=0 etc.
xor r10d, r10d ; R10 = 0. Still prefer 32-bit operand-size.
xor edx, edx ; RDX = 0
; small code-size alternative: cdq ; zero RDX if EAX is already zero
; SUB-OPTIMAL
xor rax,rax ; waste of a REX prefix, and extra slow on Silvermont
xor r10,r10 ; bad on Silvermont (not dep breaking), same as r10d on other CPUs because a REX prefix is still needed for r10d or r10.
mov eax, 0 ; doesn't touch FLAGS, but not faster and takes more bytes
and eax, 0 ; false dependency. (Microbenchmark experiments might want this)
sub eax, eax ; same as xor on most but not all CPUs; bad on Silvermont for example.
xor cl, cl ; false dep on some CPUs, not a zeroing idiom. Use xor ecx,ecx
mov cl, 0 ; only 2 bytes, and probably better than xor cl,cl *if* you need to leave the rest of ECX/RCX unmodified
Zeroing a vector register is usually best done with pxor xmm, xmm. That's typically what gcc does (even before use with FP instructions).
xorps xmm, xmm can make sense. It's one byte shorter than pxor, but xorps needs execution port 5 on Intel Nehalem, while pxor can run on any port (0/1/5). (Nehalem's 2c bypass delay latency between integer and FP is usually not relevant, because out-of-order execution can typically hide it at the start of a new dependency chain).
On SnB-family microarchitectures, neither flavour of xor-zeroing even needs an execution port. On AMD, and pre-Nehalem P6/Core2 Intel, xorps and pxor are handled the same way (as vector-integer instructions).
Using the AVX version of a 128b vector instruction zeros the upper part of the reg as well, so vpxor xmm, xmm, xmm is a good choice for zeroing YMM(AVX1/AVX2) or ZMM(AVX512), or any future vector extension. vpxor ymm, ymm, ymm doesn't take any extra bytes to encode, though, and runs the same on Intel, but slower on AMD before Zen2 (2 uops). The AVX512 ZMM zeroing would require extra bytes (for the EVEX prefix), so XMM or YMM zeroing should be preferred.
XMM/YMM/ZMM examples
# Good:
xorps xmm0, xmm0 ; smallest code size (for non-AVX)
pxor xmm0, xmm0 ; costs an extra byte, runs on any port on Nehalem.
xorps xmm15, xmm15 ; Needs a REX prefix but that's unavoidable if you need to use high registers without AVX. Code-size is the only penalty.
# Good with AVX:
vpxor xmm0, xmm0, xmm0 ; zeros X/Y/ZMM0
vpxor xmm15, xmm0, xmm0 ; zeros X/Y/ZMM15, still only 2-byte VEX prefix
#sub-optimal AVX
vpxor xmm15, xmm15, xmm15 ; 3-byte VEX prefix because of high source reg
vpxor ymm0, ymm0, ymm0 ; decodes to 2 uops on AMD before Zen2
# Good with AVX512
vpxor xmm15, xmm0, xmm0 ; zero ZMM15 using an AVX1-encoded instruction (2-byte VEX prefix).
vpxord xmm30, xmm30, xmm30 ; EVEX is unavoidable when zeroing zmm16..31, but still prefer XMM or YMM for fewer uops on probable future AMD. May be worth using only high regs to avoid needing vzeroupper in short functions.
# Good with AVX512 *without* AVX512VL (e.g. KNL / Xeon Phi)
vpxord zmm30, zmm30, zmm30 ; Without AVX512VL you have to use a 512-bit instruction.
# sub-optimal with AVX512 (even without AVX512VL)
vpxord zmm0, zmm0, zmm0 ; EVEX prefix (4 bytes), and a 512-bit uop. Use AVX1 vpxor xmm0, xmm0, xmm0 even on KNL to save code size.
See Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm? and
What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?
Semi-related: Fastest way to set __m256 value to all ONE bits and
Set all bits in CPU register to 1 efficiently also covers AVX512 k0..7 mask registers. SSE/AVX vpcmpeqd is dep-breaking on many (although still needs a uop to write the 1s), but AVX512 vpternlogd for ZMM regs isn't even dep-breaking. Inside a loop consider copying from another register instead of re-creating ones with an ALU uop, especially with AVX512.
But zeroing is cheap: xor-zeroing an xmm reg inside a loop is usually as good as copying, except on some AMD CPUs (Bulldozer and Zen) which have mov-elimination for vector regs but still need an ALU uop to write zeros for xor-zeroing.
What's special about zeroing idioms like xor on various uarches
Some CPUs recognize sub same,same as a zeroing idiom like xor, but all CPUs that recognize any zeroing idioms recognize xor. Just use xor so you don't have to worry about which CPU recognizes which zeroing idiom.
xor (being a recognized zeroing idiom, unlike mov reg, 0) has some obvious and some subtle advantages (summary list, then I'll expand on those):
smaller code-size than mov reg,0. (All CPUs)
avoids partial-register penalties for later code. (Intel P6-family and SnB-family).
doesn't use an execution unit, saving power and freeing up execution resources. (Intel SnB-family)
smaller uop (no immediate data) leaves room in the uop cache-line for nearby instructions to borrow if needed. (Intel SnB-family).
doesn't use up entries in the physical register file. (Intel SnB-family (and P4) at least, possibly AMD as well since they use a similar PRF design instead of keeping register state in the ROB like Intel P6-family microarchitectures.)
Smaller machine-code size (2 bytes instead of 5) is always an advantage: Higher code density leads to fewer instruction-cache misses, and better instruction fetch and potentially decode bandwidth.
The benefit of not using an execution unit for xor on Intel SnB-family microarchitectures is minor, but saves power. It's more likely to matter on SnB or IvB, which only have 3 ALU execution ports. Haswell and later have 4 execution ports that can handle integer ALU instructions, including mov r32, imm32, so with perfect decision-making by the scheduler (which doesn't always happen in practice), HSW could still sustain 4 uops per clock even when they all need ALU execution ports.
See my answer on another question about zeroing registers for some more details.
Bruce Dawson's blog post that Michael Petch linked (in a comment on the question) points out that xor is handled at the register-rename stage without needing an execution unit (zero uops in the unfused domain), but missed the fact that it's still one uop in the fused domain. Modern Intel CPUs can issue & retire 4 fused-domain uops per clock. That's where the 4 zeros per clock limit comes from. Increased complexity of the register renaming hardware is only one of the reasons for limiting the width of the design to 4. (Bruce has written some very excellent blog posts, like his series on FP math and x87 / SSE / rounding issues, which I do highly recommend).
On AMD Bulldozer-family CPUs, mov immediate runs on the same EX0/EX1 integer execution ports as xor. mov reg,reg can also run on AGU0/1, but that's only for register copying, not for setting from immediates. So AFAIK, on AMD the only advantage to xor over mov is the shorter encoding. It might also save physical register resources, but I haven't seen any tests.
Recognized zeroing idioms avoid partial-register penalties on Intel CPUs which rename partial registers separately from full registers (P6 & SnB families).
xor will tag the register as having the upper parts zeroed, so xor eax, eax / inc al / inc eax avoids the usual partial-register penalty that pre-IvB CPUs have. Even without xor, IvB only needs a merging uop when the high 8bits (AH) are modified and then the whole register is read, and Haswell even removes that.
From Agner Fog's microarch guide, pg 98 (Pentium M section, referenced by later sections including SnB):
The processor recognizes the XOR of a register with itself as setting
it to zero. A special tag in the register remembers that the high part
of the register is zero so that EAX = AL. This tag is remembered even
in a loop:
; Example 7.9. Partial register problem avoided in loop
xor eax, eax
mov ecx, 100
LL:
mov al, [esi]
mov [edi], eax ; No extra uop
inc esi
add edi, 4
dec ecx
jnz LL
(from pg82): The processor remembers that the upper 24 bits of EAX are zero as long as
you don't get an interrupt, misprediction, or other serializing event.
pg82 of that guide also confirms that mov reg, 0 is not recognized as a zeroing idiom, at least on early P6 designs like PIII or PM. I'd be very surprised if they spent transistors on detecting it on later CPUs.
xor sets flags, which means you have to be careful when testing conditions. Since setcc is unfortunately only available with an 8bit destination, you usually need to take care to avoid partial-register penalties.
It would have been nice if x86-64 repurposed one of the removed opcodes (like AAM) for a 16/32/64 bit setcc r/m, with the predicate encoded in the source-register 3-bit field of the r/m field (the way some other single-operand instructions use them as opcode bits). But they didn't do that, and that wouldn't help for x86-32 anyway.
Ideally, you should use xor / set flags / setcc / read full register:
...
call some_func
xor ecx,ecx ; zero *before* the test
test eax,eax
setnz cl ; cl = (some_func() != 0)
add ebx, ecx ; no partial-register penalty here
This has optimal performance on all CPUs (no stalls, merging uops, or false dependencies).
Things are more complicated when you don't want to xor before a flag-setting instruction. e.g. you want to branch on one condition and then setcc on another condition from the same flags. e.g. cmp/jle, sete, and you either don't have a spare register, or you want to keep the xor out of the not-taken code path altogether.
There are no recognized zeroing idioms that don't affect flags, so the best choice depends on the target microarchitecture. On Core2, inserting a merging uop might cause a 2 or 3 cycle stall. It appears to be cheaper on SnB, but I didn't spend much time trying to measure. Using mov reg, 0 / setcc would have a significant penalty on older Intel CPUs, and still be somewhat worse on newer Intel.
Using setcc / movzx r32, r8 is probably the best alternative for Intel P6 & SnB families, if you can't xor-zero ahead of the flag-setting instruction. That should be better than repeating the test after an xor-zeroing. (Don't even consider sahf / lahf or pushf / popf). IvB can eliminate movzx r32, r8 (i.e. handle it with register-renaming with no execution unit or latency, like xor-zeroing). Haswell and later only eliminate regular mov instructions, so movzx takes an execution unit and has non-zero latency, making test/setcc/movzx worse than xor/test/setcc, but still at least as good as test/mov r,0/setcc (and much better on older CPUs).
Using setcc / movzx with no zeroing first is bad on AMD/P4/Silvermont, because they don't track deps separately for sub-registers. There would be a false dep on the old value of the register. Using mov reg, 0/setcc for zeroing / dependency-breaking is probably the best alternative when xor/test/setcc isn't an option.
Of course, if you don't need setcc's output to be wider than 8 bits, you don't need to zero anything. However, beware of false dependencies on CPUs other than P6 / SnB if you pick a register that was recently part of a long dependency chain. (And beware of causing a partial reg stall or extra uop if you call a function that might save/restore the register you're using part of.)
and with an immediate zero isn't special-cased as independent of the old value on any CPUs I'm aware of, so it doesn't break dependency chains. It has no advantages over xor and many disadvantages.
It's useful only for writing microbenchmarks when you want a dependency as part of a latency test, but want to create a known value by zeroing and adding.
See http://agner.org/optimize/ for microarch details, including which zeroing idioms are recognized as dependency breaking (e.g. sub same,same is on some but not all CPUs, while xor same,same is recognized on all.) mov does break the dependency chain on the old value of the register (regardless of the source value, zero or not, because that's how mov works). xor only breaks dependency chains in the special-case where src and dest are the same register, which is why mov is left out of the list of specially recognized dependency-breakers. (Also, because it's not recognized as a zeroing idiom, with the other benefits that carries.)
Interestingly, the oldest P6 design (PPro through Pentium III) didn't recognize xor-zeroing as a dependency-breaker, only as a zeroing idiom for the purposes of avoiding partial-register stalls, so in some cases it was worth using both mov and then xor-zeroing in that order to break the dep and then zero again + set the internal tag bit that the high bits are zero so EAX=AX=AL.
See Agner Fog's Example 6.17. in his microarch pdf. He says this also applies to P2, P3, and even (early?) PM. A comment on the linked blog post says it was only PPro that had this oversight, but I've tested on Katmai PIII, and #Fanael tested on a Pentium M, and we both found that it didn't break a dependency for a latency-bound imul chain. This confirms Agner Fog's results, unfortunately.
TL:DR:
If it really makes your code nicer or saves instructions, then sure, zero with mov to avoid touching the flags, as long as you don't introduce a performance problem other than code size. Avoiding clobbering flags is the only sensible reason for not using xor, but sometimes you can xor-zero ahead of the thing that sets flags if you have a spare register.
mov-zero ahead of setcc is better for latency than movzx reg32, reg8 after (except on Intel when you can pick different registers), but worse code size.

Why is a conditional move not vulnerable to Branch Prediction Failure?

After reading this post (answer on StackOverflow) (at the optimization section), I was wondering why conditional moves are not vulnerable for Branch Prediction Failure. I found on an article on cond moves here (PDF by AMD). Also there, they claim the performance advantage of cond. moves. But why is this? I don't see it. At the moment that that ASM-instruction is evaluated, the result of the preceding CMP instruction is not known yet.

Mis-predicted branches are expensive
A modern processor generally executes between one and three instructions each cycle if things go well (if it does not stall waiting for data dependencies for these instructions to arrive from previous instructions or from memory).
The statement above holds surprisingly well for tight loops, but this shouldn't blind you to one additional dependency that can prevent an instruction to be executed when its cycle comes:
for an instruction to be executed, the processor must have started to fetch and decode it 15-20 cycles before.
What should the processor do when it encounters a branch? Fetching and decoding both targets does not scale (if more branches follow, an exponential number of paths would have to be fetched in parallel). So the processor only fetches and decodes one of the two branches, speculatively.
This is why mis-predicted branches are expensive: they cost the 15-20 cycles that are usually invisible because of an efficient instruction pipeline.
Conditional move is never very expensive
Conditional move does not require prediction, so it can never have this penalty. It has data dependencies, same as ordinary instructions. In fact, a conditional move has more data dependencies than ordinary instructions, because the data dependencies include both “condition true” and “condition false” cases. After an instruction that conditionally moves r1 to r2, the contents of r2 seem to depend on both the previous value of r2 and on r1. A well-predicted conditional branch allows the processor to infer more accurate dependencies. But data dependencies typically take one-two cycles to arrive, if they need time to arrive at all.
Note that a conditional move from memory to register would sometimes be a dangerous bet: if the condition is such that the value read from memory is not assigned to the register, you have waited on memory for nothing. But the conditional move instructions offered in instruction sets are typically register to register, preventing this mistake on the part of the programmer.

It is all about the instruction pipeline. Remember, modern CPUs run their instructions in a pipeline, which yields a significant performance boost when the execution flow is predictable by the CPU.
cmov
add eax, ebx
cmp eax, 0x10
cmovne ebx, ecx
add eax, ecx
At the moment that that ASM-instruction is evaluated, the result of the preceding CMP instruction is not known yet.
Perhaps, but the CPU still knows that the instruction following the cmov will be executed right after, regardless of the result from the cmp and cmov instruction. The next instruction may thus safely be fetched/decoded ahead of time, which is not the case with branches.
The next instruction could even execute before the cmov does (in my example this would be safe)
branch
add eax, ebx
cmp eax, 0x10
je .skip
mov ebx, ecx
.skip:
add eax, ecx
In this case, when the CPU's decoder sees je .skip it will have to choose whether to continue prefetching/decoding instructions either 1) from the next instruction, or 2) from the jump target. The CPU will guess that this forward conditional branch won't happen, so the next instruction mov ebx, ecx will go into the pipeline.
A couple of cycles later, the je .skip is executed and the branch is taken. Oh crap! Our pipeline now holds some random junk that should never be executed. The CPU has to flush all its cached instructions and start fresh from .skip:.
That is the performance penalty of mispredicted branches, which can never happen with cmov since it doesn't alter the execution flow.

Indeed the result may not yet be known, but if other circumstances permit (in particular, the dependency chain) the cpu can reorder and execute instructions following the cmov. Since there is no branching involved, those instructions need to be evaluated in any case.
Consider this example:
cmoveq edx, eax
add ecx, ebx
mov eax, [ecx]
The two instructions following the cmov do not depend on the result of the cmov, so they can be executed even while the cmov itself is pending (this is called out of order execution). Even if they can't be executed, they can still be fetched and decoded.
A branching version could be:
jne skip
mov edx, eax
skip:
add ecx, ebx
mov eax, [ecx]
The problem here is that control flow is changing and the cpu isn't clever enough to see that it could just "insert" the skipped mov instruction if the branch was mispredicted as taken - instead it throws away everything it did after the branch, and restarts from scratch. This is where the penalty comes from.

You should read these. With Fog+Intel, just search for CMOV.
Linus Torvald's critique of CMOV circa 2007
Agner Fog's comparison of microarchitectures
Intel® 64 and IA-32 Architectures Optimization Reference Manual
Short answer, correct predictions are 'free' while conditional branch mispredicts can cost 14-20 cycles on Haswell. However, CMOV is never free. Still I think CMOV is a LOT better now than when Torvalds ranted. There is no single one correct for all time on all processors ever answer.

I have this illustration from [Peter Puschner et al.] slide which explains how it transforms into single path code, and speedup the execution.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio