How to enable alignment exceptions for my process on x64? - windows

I'm curious to see if my 64-bit application suffers from alignment faults.
From Windows Data Alignment on IPF, x86, and x64 archive:
In Windows, an application program that generates an alignment fault will raise an exception, EXCEPTION_DATATYPE_MISALIGNMENT.
On the x64 architecture, the alignment exceptions are disabled by default, and the fix-ups are done by the hardware. The application can enable alignment exceptions by setting a couple of register bits, in which case the exceptions will be raised unless the user has the operating system mask the exceptions with SEM_NOALIGNMENTFAULTEXCEPT. (For details, see the AMD Architecture Programmer's Manual Volume 2: System Programming.)
[Ed. emphasis mine]
On the x86 architecture, the operating system does not make the alignment fault visible to the application. On these two platforms, you will also suffer performance degradation on the alignment fault, but it will be significantly less severe than on the Itanium, because the hardware will make the multiple accesses of memory to retrieve the unaligned data.
On the Itanium, by default, the operating system (OS) will make this exception visible to the application, and a termination handler might be useful in these cases. If you do not set up a handler, then your program will hang or crash. In Listing 3, we provide an example that shows how to catch the EXCEPTION_DATATYPE_MISALIGNMENT exception.
Ignoring the direction to consult the AMD Architecture Programmer's Manual, i will instead consult the Intel 64 and IA-32 Architectures Software Developer’s Manual
5.10.5 Checking Alignment
When the CPL is 3, alignment of memory references can be checked by setting the
AM flag in the CR0 register and the AC flag in the EFLAGS register. Unaligned memory
references generate alignment exceptions (#AC). The processor does not generate
alignment exceptions when operating at privilege level 0, 1, or 2. See Table 6-7 for a
description of the alignment requirements when alignment checking is enabled.
Excellent. I'm not sure what that means, but excellent.
Then there's also:
2.5 CONTROL REGISTERS
Control registers (CR0, CR1, CR2, CR3, and CR4; see Figure 2-6) determine operating
mode of the processor and the characteristics of the currently executing task.
These registers are 32 bits in all 32-bit modes and compatibility mode.
In 64-bit mode, control registers are expanded to 64 bits. The MOV CRn instructions
are used to manipulate the register bits. Operand-size prefixes for these instructions
are ignored.
The control registers are summarized below, and each architecturally defined control
field in these control registers are described individually. In Figure 2-6, the width of
the register in 64-bit mode is indicated in parenthesis (except for CR0).
CR0 — Contains system control flags that control operating mode and states of
the processor
AM
Alignment Mask (bit 18 of CR0) — Enables automatic alignment checking
when set; disables alignment checking when clear. Alignment checking is
performed only when the AM flag is set, the AC flag in the EFLAGS register is
set, CPL is 3, and the processor is operating in either protected or virtual-
8086 mode.
I tried
The language i am actually using is Delphi, but pretend it's language agnostic pseudocode:
void UnmaskAlignmentExceptions()
{
asm
mov rax, cr0; //copy CR0 flags into RAX
or rax, 0x20000; //set bit 18 (AM)
mov cr0, rax; //copy flags back
}
The first instruction
mov rax, cr0;
fails with a Privileged Instruction exception.
How to enable alignment exceptions for my process on x64?
PUSHF
I discovered that the x86 has the instruction:
PUSHF, POPF: Push/pop first 16-bits of EFLAGS on/off the stack
PUSHFD, POPFD: Push/pop all 32-bits of EFLAGS on/off the stack
That then led me to the x64 version:
PUSHFQ, POPFQ: Push/pop the RFLAGS quad on/off the stack
(In 64-bit world the EFLAGS are renamed RFLAGS).
So i wrote:
void EnableAlignmentExceptions;
{
asm
PUSHFQ; //Push RFLAGS quadword onto the stack
POP RAX; //Pop them flags into RAX
OR RAX, $20000; //set bit 18 (AC=Alignment Check) of the flags
PUSH RAX; //Push the modified flags back onto the stack
POPFQ; //Pop the stack back into RFLAGS;
}
And it didn't crash or trigger a protection exception. I have no idea if it does what i want it to.
Bonus Reading
How to catch data-alignment faults on x86 (aka SIGBUS on Sparc) (unrelated question; x86 not x64, Ubunutu not Windows, gcc vs not)

Applications running on x64 have access to a flag register (sometimes referred to as EFLAGS). Bit 18 in this register allows applications to get exceptions when alignment errors occur. So in theory, all a program has to do to enable exceptions for alignment errors is modify the flags register.
However
In order for that to actually work, the operating system kernel must set cr0's bit 18 to allow it. And the Windows operating system doesn't do that. Why not? Who knows?
Applications can not set values in the control register. Only the kernel can do this. Device drivers run inside the kernel, so they can set this too.
It is possible to muck about and try to get this to work by creating a device driver, see:
Old New Thing - Disabling the program crash dialog archive
and the comments that follow. Note that this post is over a decade old, so some of the links are dead.
You might also find this comment (and some of the other answers in this question) to be useful:
Larry Osterman - 07-28-2004 2:22 AM
We actually built a version of NT with alignment exceptions turned on for x86 (you can do that as Skywing mentioned).
We quickly turned it off, because of the number of apps that broke :)

As an alternative to AC for finding slowdowns due to unaligned accesses, you can use hardware performance counter events on Intel CPUs for mem_inst_retired.split_loads and mem_inst_retired.split_stores to find loads/stores that split across a cache-line boundary.
perf record -c 10 -e mem_inst_retired.split_stores,mem_inst_retired.split_loads ./a.out should be useful on Linux. -c 10 records a sample every 10 HW events. If your program does a lot of unaligned accesses and you only want to find the real hotspots, leave it at the default. But -c 10 can get useful data even on a tiny binary that calls printf once. Other perf options like -g to record parent functions on each sample work as usual, and could be useful.
On Windows, use whatever tool you prefer for looking at perf counters. VTune is popular.
Modern Intel CPUs (P6 family and newer) have no penalty for misalignment within a cache line. https://agner.org/optimize/. In fact, such loads/stores are even guaranteed to be atomic (up to 8 bytes), on Intel CPUs. So AC is stricter than necessary, but it will help find potentially-risky accesses that could be page-splits or cache-line splits with differently-aligned data.
AMD CPUs may have penalties for crossing a 16-byte boundary within a 64-byte cache line. I'm not familiar with what hardware counters are available there. Beware that profiling on Intel HW won't necessarily find slowdowns that occur on AMD CPUs, if the offending access never crosses a cache line boundary.
See How can I accurately benchmark unaligned access speed on x86_64? for some details on the penalties, including my testing on 4k-split latency and throughput on Skylake.
See also http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ for possible penalties to store-forwarding efficiency for misaligned loads/stores on Intel/AMD.
Running normal binaries with AC set is not always practical. Compiler-generated code might choose to use an unaligned 8-byte load or store to copy multiple struct members, or to store some literal data.
gcc -O3 -mtune=generic (i.e. the default with optimization enabled) assumes that cache-line splits are cheap enough to be worth the risk of using unaligned accesses instead of multiple narrow accesses like the source does. Page-splits got much cheaper in Skylake, down from ~100 to 150 cycles in Haswell to ~10 cycles in Skylake (about the same penalty as CL splits), because apparently Intel found they were less rare than they previously thought.
Many optimized library functions (like memcpy) use unaligned integer accesses. e.g. glibc's memcpy, for a 6-byte copy, would do 2 overlapping 4-byte loads from the start/end of the buffer, then 2 overlapping stores. (It doesn't have a special case for exactly 6 bytes to do a dword + word, just increasing powers of 2). This comment in the source explains its strategies.
So even if your OS would let you enable AC, you might need a special version of libraries to not trigger AC all over the place for stuff like small memcpy.
SIMD
Alignment when looping sequentially over an array really matters for AVX512, where a vector is the same width as a cache line. If your pointers are misaligned, every access is a cache-line split, not just every other with AVX2. Aligned is always better, but for many algorithms with a decent amount of computation mixed with memory access, it only makes a significant difference with AVX512.
(So with AVX1/2, it's often good to just use unaligned loads, instead of always doing extra work to check alignment and go scalar until an alignment boundary. Especially if your data is usually aligned but you want the function to still work marginally slower in case it isn't.)
Scattered misaligned accesses cross a cache line boundary essentially have twice the cache footprint from touching both lines, if the lines aren't otherwise touched.
Checking for 16, 32 or 64 byte alignment with SIMD is simple in asm: just use [v]movdqa alignment-required loads/stores, or legacy-SSE memory source operands for instructions like paddb xmm0, [rdi]. Instead of vmovdqu or VEX-coded memory source operands like vpaddb xmm0, xmm1, [rdi] which let hardware handle the case of misalignment if/when it occurs.
But in C with intrinsics, some compilers (MSVC and ICC) compile alignment-required intrinsics like _mm_load_si128 into [v]movdqu, never using [v]movdqa, so that's annoying if you actually wanted to use alignment-required loads.
Of course, _mm256_load_si256 or 128 can fold into an AVX memory source operand for vpaddb ymm0, ymm1, [rdi] with any compiler including GCC/clang, same for 128-bit any time AVX and optimization are enabled. But store intrinsics that don't get optimized away entirely do get done with vmovdqa / vmovaps, so at least you can verify store alignment.
To verify load alignment with AVX, you can disable optimization so you'll get separate load / spill into __m256i temporary / reload.

This works in 64-bit Intel CPU. May fail in some AMD
pushfq
bts qword ptr [rsp], 12h ; set AC bit of rflags
popfq
It will not work right away in 32-bit CPUs, these will require first a kernel driver to change the AM bit of CR0 and then
pushfd
bts dword ptr [esp], 12h
popfd

Related

Timing of executing 8 bit and 64 bit instructions on 64 bit x64/Amd64 processors

Is there any execution timing difference between 8 it and 64 bit instructions on 64 bit x64/Amd64 processor, when those instructions are similar/same except bit width?
Is there a way to find real processor timing of executing these 2 tiny assembly functions?
-Thanks.
; 64 bit instructions
add64:
mov $0x1, %rax
add $0x2, %rax
ret
; 8 bit instructions
add8:
mov $0x1, %al
add $0x2, %al
ret
Yes, there's a difference. mov $0x1, %al has a false dependency on the old value of RAX on most CPUs, including everything newer than Sandybridge. It's a 2-input 1-output instruction; from the CPU's point of view it's like add $1, %al as far as scheduling it independently or not relative to other uses of RAX. Only writing a 32 or 64-bit register starts a new dependency chain.
This means the AL return value of your add8 function might not be ready until after a cache miss for some independent work the caller happened to be doing in EAX before the call, but the RAX result of add64 could be ready right away for out-of-order execution to get started on later instructions in the caller that use the return value. (Assuming their other inputs are also ready.)
Why doesn't GCC use partial registers? and
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
and What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? - Important background for understanding performance on modern OoO exec CPUs.
Their code-size also differs: Both the 8-bit instructions are 2 bytes long. (Thanks to the AL, imm8 short-form encoding; add $1, %dl would be 3 bytes). The RAX instructions are 7 and 4 bytes long. This matters for L1i cache footprint (and on a large scale, for how many bytes have to get paged in from disk). On a small scale, how many instructions can fit into a 16 or 32-byte fetch block if the CPU is doing legacy decode because the code wasn't already hot in the uop cache. Also code-alignment of later instructions is affected by varying lengths of previous instructions, sometimes affecting which branches alias each other.
https://agner.org/optimize/ explains the details of the pipelines of various x86 microarchitectures, including front-end decoding effects that can make instruction-length matter beyond just code density in the I-cache / uop-cache.
Generally 32-bit operand-size is the most efficient (for performance, and pretty good for code-size). 32 and 8 are the operand-sizes that x86-64 can use without extra prefixes, and in practice with 8-bit to avoid stalls and badness you need more instructions or longer instructions because they don't zero-extend. The advantages of using 32bit registers/instructions in x86-64.
A few instructions are actually slower in the ALUs for 64-bit operand-size, not just front-end effects. That includes div on most CPUs, and imul on some older CPUs. Also popcnt and bswap. e.g. Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux
Note that mov $0x1, %rax will assemble to 7 bytes with GAS, unless you use as -O2 (not the same as gcc -O2, see this for examples) to get it to optimize to mov $1, %eax which exactly the same architectural effects, but is shorter (no REX or ModRM byte). Some assemblers do that optimization by default, but GAS doesn't. Why NASM on Linux changes registers in x86_64 assembly has more about why this optimization is safe and good, and why you should do it yourself in the source especially if your assembler doesn't do it for you.
But other than the false dep and code-size, they're the same for the back-end of the CPU: all those instructions are single-uop and can run on any scalar-integer ALU execution port1. (https://uops.info/ has automated test results for every form of every unprivileged instruction).
Footnote 1: Excavator (last-gen Bulldozer-family) can also run mov $imm, %reg on 2 more ports (AGU) for 32 and 64-bit operand-size. But merging a new low-8 or low-16 into a full register needs an ALU port. So mov $1, %rax has 4/clock throughput on Excavator, but mov $1, %al only has 2/clock throughput. (And of course only if you use a few different destination registers, not actually AL repeatedly; that would be a latency bottleneck of 1/clock because of the false dependency from writing a partial register on that microarchitecture.)
Previous Bulldozer-family CPUs starting with Piledriver can run mov reg, reg (for r32 or r64) on EX0, EX1, AGU0, AGU1, while most ALU instructions including mov $imm, %reg can only run on EX0/1. Further extending the AGU port's capabilities to also handle mov-immediate was a new feature in Excavator.
Fortunately Bulldozer was obsoleted by AMD's much better Zen architecture which has 4 full scalar integer ALU ports / execution units. (And a wider front end and a uop cache, good caches, and generally doesn't suck in a lot of the ways that Bulldozer sucked.)
Is there a way to measure it?
yes, but generally not in a function you call with call. Instead put it in an unrolled loop so you can run it lots of times with minimal other instructions. Especially useful to look at CPU performance counter results to find front-end / back-end uop counts, as well as just the overall time for your loop.
You can construct your loop to measure latency or throughput; see RDTSCP in NASM always returns the same value (timing a single instruction). Also:
Assembly - How to score a CPU instruction by latency and throughput
Idiomatic way of performance evaluation?
Can x86's MOV really be "free"? Why can't I reproduce this at all? is a good specific example of constructing a microbenchmark to measure / prove something specific.
Generally you don't need to measure yourself (although it's good to understand how, that helps you know what the measurements really mean). People have already done that for most CPU microarchitectures. You can predict performance for a specific CPU for some loops (if you can assume no stalls or cache misses) based on analyzing the instructions. Often that can predict performance fairly accurately, but medium-length dependency chains that OoO exec can only partially hide makes it too hard to accurately predict or account for every cycle.
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? has links to lots of good details, and stuff about CPU internals.
How many CPU cycles are needed for each assembly instruction? (you can't add up a cycle count for each instruction; front-end and back-end throughput, and latency, could each be the bottleneck for a loop.)

x86 mfence and C++ memory barrier

I'm checking how the compiler emits instructions for multi-core memory barriers on x86_64. The below code is the one I'm testing using gcc_x86_64_8.3.
std::atomic<bool> flag {false};
int any_value {0};
void set()
{
any_value = 10;
flag.store(true, std::memory_order_release);
}
void get()
{
while (!flag.load(std::memory_order_acquire));
assert(any_value == 10);
}
int main()
{
std::thread a {set};
get();
a.join();
}
When I use std::memory_order_seq_cst, I can see the MFENCE instruction is used with any optimization -O1, -O2, -O3. This instruction makes sure the store buffers are flushed, therefore updating their data in L1D cache (and using MESI protocol to make sure other threads can see effect).
However when I use std::memory_order_release/acquire with no optimizations MFENCE instruction is also used, but the instruction is omitted using -O1, -O2, -O3 optimizations, and not seeing other instructions that flush the buffers.
In the case where MFENCE is not used, what makes sure the store buffer data is committed to cache memory to ensure the memory order semantics?
Below is the assembly code for the get/set functions with -O3, like what we get on the Godbolt compiler explorer:
set():
mov DWORD PTR any_value[rip], 10
mov BYTE PTR flag[rip], 1
ret
.LC0:
.string "/tmp/compiler-explorer-compiler119218-62-hw8j86.n2ft/example.cpp"
.LC1:
.string "any_value == 10"
get():
.L8:
movzx eax, BYTE PTR flag[rip]
test al, al
je .L8
cmp DWORD PTR any_value[rip], 10
jne .L15
ret
.L15:
push rax
mov ecx, OFFSET FLAT:get()::__PRETTY_FUNCTION__
mov edx, 17
mov esi, OFFSET FLAT:.LC0
mov edi, OFFSET FLAT:.LC1
call __assert_fail
The x86 memory ordering model provides #StoreStore and #LoadStore barriers for all store instructions1, which is all what the release semantics require. Also the processor will commit a store instruction as soon as possible; when the store instruction retires, the store becomes the oldest in the store buffer, the core has the target cache line in a writeable coherence state, and a cache port is available to perform the store operation2. So there is no need for an MFENCE instruction. The flag will become visible to the other thread as soon as possible and when it does, any_value is guaranteed to be 10.
On the other hand, sequential consistency also requires #StoreLoad and #LoadLoad barriers. MFENCE is required to provide both3 barriers and so it is used at all optimization levels.
Related: Size of store buffers on Intel hardware? What exactly is a store buffer?.
Footnotes:
(1) There are exceptions that don't apply here. In particular, non-temporal stores and stores to the uncacheable write-combining memory types provide only the #LoadStore barrier. Anyway, these barriers are provided for stores to the write-back memory type on both Intel and AMD processors.
(2) This is in contrast to write-combining stores which are made globally-visible under certain conditions. See Section 11.3.1 of the Intel manual Volume 3.
(3) See the discussion under Peter's answer.
x86's TSO memory model is sequential-consistency + a store buffer, so only seq-cst stores need any special fencing. (Stalling after a store until the store buffer drains, before later loads, is all we need to recover sequential consistency). The weaker acq/rel model is compatible with the StoreLoad reordering caused by a store buffer.
(See discussion in comments re: whether "allowing StoreLoad reordering" is an accurate and sufficient description of what x86 allows. A core always sees its own stores in program order because loads snoop the store buffer, so you could say that store-forwarding also reorders loads of recently-stored data. Except you can't always: Globally Invisible load instructions)
(And BTW, compilers other than gcc use xchg to do a seq-cst store. This is actually more efficient on current CPUs. GCC's mov+mfence might have been cheaper in the past, but is currently usually worse even if you don't care about the old value. See Why does a std::atomic store with sequential consistency use XCHG? for a comparison between GCC's mov+mfence vs. xchg. Also my answer on Which is a better write barrier on x86: lock+addl or xchgl?)
Fun fact: you can achieve sequential consistency by instead fencing seq-cst loads instead of stores. But cheap loads are much more valuable than cheap stores for most use-cases, so everyone uses ABIs where the full barriers go on the stores.
See https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for details of how C++11 atomic ops map to asm instruction sequences for x86, PowerPC, ARMv7, ARMv8, and Itanium. Also When are x86 LFENCE, SFENCE and MFENCE instructions required?
when I use std::memory_order_release/acquire with no optimizations MFENCE instruction is also used
That's because flag.store(true, std::memory_order_release); doesn't inline, because you disabled optimization. That includes inlining of very simple member functions like atomic::store(T, std::memory_order = std::memory_order_seq_cst)
When the ordering parameter to the __atomic_store_n() GCC builtin is a runtime variable (in the atomic::store() header implementation), GCC plays it conservative and promotes it to seq_cst.
It might actually be worth it for gcc to branch over mfence because it's so expensive, but that's not what we get. (But that would make larger code-size for functions with runtime variable order params, and the code path might not be hot. So branching is probably only a good idea in the libatomic implementation, or with profile-guided optimization for rare cases where a function is large enough to not inline but takes a variable order.)

Is x86 32-bit assembly code valid x86 64-bit assembly code?

Is all x86 32-bit assembly code valid x86 64-bit assembly code?
I've wondered whether 32-bit assembly code is a subset of 64-bit assembly code, i.e., every 32-bit assembly code can be run in a 64-bit environment?
I guess the answer is yes, because 64-bit Windows is capable of executing 32-bit programs, but then I've seen that the 64-bit processor supports a 32-bit compatible mode?
If not, please provide a small example of 32-bit assembly code that isn't valid 64-bit assembly code and explain how the 64-bit processor executes the 32-bit assembly code.
A modern x86 CPU has three main operation modes (this description is simplified):
In real mode, the CPU executes 16 bit code with paging and segmentation disabled. Memory addresses in your code refer to phyiscal addresses, the content of segment registers is shifted and added to the address to form an effective address.
In protected mode, the CPU executes 16 bit or 32 bit code depending on the segment selector in the CS (code segment) register. Segmentation is enabled, paging can (and usually is) enabled. Programs can switch between 16 bit and 32 bit code by far jumping to an appropriate segment. The CPU can enter the submode virtual 8086 mode to emulate real mode for individual processes from inside a protected mode operating system.
In long mode, the CPU executes 64 bit code. Segmentation is mostly disabled, paging is enabled. The CPU can enter the sub-mode compatibility mode to execute 16 bit and 32 bit protected mode code from within an operating system written for long mode. Compatibility mode is entered by far-jumping to a CS selector with the appropriate bits set. Virtual 8086 mode is unavailable.
Wikipedia has a nice table of x86-64 operating modes including legacy and real modes, and all 3 sub-modes of long mode. Under a mainstream x86-64 OS, after booting the CPU cores will always all be in long mode, switching between different sub-modes depending on 32 or 64-bit user-space. (Not counting System Management Mode interrupts...)
Now what is the difference between 16 bit, 32 bit, and 64 bit mode?
16-bit and 32-bit mode are basically the same thing except for the following differences:
In 16 bit mode, the default address and operand width is 16 bit. You can change these to 32 bit for a single instruction using the 0x67 and 0x66 prefixes, respectively. In 32 bit mode, it's the other way round.
In 16 bit mode, the instruction pointer is truncated to 16 bit, jumping to addresses higher than 65536 can lead to weird results.
VEX/EVEX encoded instructions (including those of the AVX, AVX2, BMI, BMI2 and AVX512 instruction sets) aren't decoded in real or Virtual 8086 mode (though they are available in 16 bit protected mode).
16 bit mode has fewer addressing modes than 32 bit mode, though it is possible to override to a 32 bit addressing mode on a per-instruction basis if the need arises.
Now, 64 bit mode is a somewhat different. Most instructions behave just like in 32 bit mode with the following differences:
There are eight additional registers named r8, r9, ..., r15. Each register can be used as a byte, word, dword, or qword register. The family of REX prefixes (0x40 to 0x4f) encode whether an operand refers to an old or new register. Eight additional SSE/AVX registers xmm8, xmm9, ..., xmm15 are also available.
you can only push/pop 64 bit and 16 bit quantities (though you shouldn't do the latter), 32 bit quantities cannot be pushed/popped.
The single-byte inc reg and dec reg instructions are unavailable, their instruction space has been repurposed for the REX prefixes. Two-byte inc r/m and dec r/m is still available, so inc reg and dec reg can still be encoded.
A new instruction-pointer relative addressing mode exists, using the shorter of the 2 redundant ways 32-bit mode had to encode a [disp32] absolute address.
The default address width is 64 bit, a 32 bit address width can be selected through the 0x67 prefix. 16 bit addressing is unavailable.
The default operand width is 32 bit. A width of 16 bit can be selected through the 0x66 prefix, a 64 bit width can be selected through an appropriate REX prefix independently of which registers you use.
It is not possible to use ah, bh, ch, and dh in an instruction that requires a REX prefix. A REX prefix causes those register numbers to mean instead the low 8 bits of registers si, di, sp, and bp.
writing to the low 32 bits of a 64 bit register clears the upper 32 bit, avoiding false dependencies for out-of-order exec. (Writing 8 or 16-bit partial registers still merges with the 64-bit old value.)
as segmentation is nonfunctional, segment overrides are meaningless no-ops except for the fs and gs overrides (0x64, 0x65) which serve to support thread-local storage (TLS).
also, many instructions that specifically deal with segmentation are unavailable. These are: push/pop seg (except push/pop fs/gs), arpl, call far (only the 0xff encoding is valid), les, lds, jmp far (only the 0xff encoding is valid),
instructions that deal with decimal arithmetic are unavailable, these are: daa, das, aaa, aas, aam, aad,
additionally, the following instructions are unavailable: bound (rarely used), pusha/popa (not useful with the additional registers), salc (undocumented),
the 0x82 instruction alias for 0x80 is invalid.
on early amd64 CPUs, lahf and sahf are unavailable.
And that's basically all of it!
No, it isn't.
While there is a large amount of overlap, 64-bit assembly code is not a superset of 32-bit assembly code and so 32-bit assembly is not in general valid in 64-bit mode.
This applies both the mnemonic assembly source (which is assembled into binary format by an assembler), as well as the binary machine code format itself.
This question covers in some detail instructions that were removed, but there are also many encoding forms whose meanings were changed.
For example, Jester in the comments gives the example of push eax not being valid in 64-bit code. Based on this reference you can see that the 32-bit push is marked N.E. meaning not encodable. In 64-bit mode, the encoding is used to represent push rax (an 8-byte push) instead. So the same sequence of bytes has a different meaning in 32-bit mode versus 64-bit mode.
In general, you can browse the list of instructions on that site and find many which are listed as invalid or not encodable in 64-bit.
If not, please provide a small example of 32-bit assembly code that
isn't valid 64-bit assembly code and explain how the 64-bit processor
executes the 32-bit assembly code.
As above, push eax is one such example. I think what is missing is that 64-bit CPUs support directly running 32-bit binaries. They don't do it via compatibility between 32-bit and 64-bit instructions at the machine language level, but simply by having a 32-bit mode where the decoders (in particular) interpret the instruction stream as 32-bit x86 rather than x86-64, as well as the so-called long mode for running 64-bit instructions. When such 64-bit chips were first released, it was common to run a 32-bit operating system, which pretty much means the chip is permanently in this mode (never goes into 64-bit mode).
More recently, it is typical to run a 64-bit operating system, which is aware of the modes, and which will put the CPU into 32-bit mode when the user launches a 32-bit process (which are still very common: until very recently my browser was still 32-bit).
All the details and proper terminology for the modes can be found in fuz's answer, which is really the one you should read.

INC instruction vs ADD 1: Does it matter?

From Ira Baxter answer on, Why do the INC and DEC instructions not affect the Carry Flag (CF)?
Mostly, I stay away from INC and DEC now, because they do partial condition code updates, and this can cause funny stalls in the pipeline, and ADD/SUB don't. So where it doesn't matter (most places), I use ADD/SUB to avoid the stalls. I use INC/DEC only when keeping the code small matters, e.g., fitting in a cache line where the size of one or two instructions makes enough difference to matter. This is probably pointless nano[literally!]-optimization, but I'm pretty old-school in my coding habits.
And I would like to ask why it can cause stalls in the pipeline while add doesn't? After all, both ADD and INC updates flag registers. The only difference is that INC doesn't update CF. But why it matters?
Update: the Efficiency cores on Alder Lake are Gracemont, and run inc reg as a single uop, but at only 1/clock, vs. 4/clock for add reg, 1 (https://uops.info/). This may be a false dependency on FLAGS like P4 had; the uops.info tests didn't try adding a dep-breaking instruction. Other than the TL:DR, I haven't updated other parts of this answer.
TL:DR/advice for modern CPUs: Probably use add; Intel Alder Lake's E-cores are relevant for "generic" tuning and seem to run inc slowly.
Other than Alder Lake and earlier Silvermont-family, use inc except with a memory destination; that's fine on mainstream Intel or any AMD. (e.g. like gcc -mtune=core2, -mtune=haswell, or -mtune=znver1). inc mem costs an extra uop vs. add on Intel P6 / SnB-family; the load can't micro-fuse.
If you care about Silvermont-family (including KNL in Xeon Phi, and some netbooks, chromebooks, and NAS servers), probably avoid inc. add 1 only costs 1 extra byte in 64-bit code, or 2 in 32-bit code. But it's not a performance disaster (just locally 1 extra ALU port used, not creating false dependencies or big stalls), so if you don't care much about SMont then don't worry about it.
Writing CF instead of leaving it unmodified can potentially be useful with other surrounding code that might benefit from CF dep-breaking, e.g. shifts. See below.
If you want to inc/dec without touching any flags, lea eax, [rax+1] runs efficiently and has the same code-size as add eax, 1. (Usually on fewer possible execution ports than add/inc, though, so add/inc are better when destroying FLAGS is not a problem. https://agner.org/optimize/)
On modern CPUs, add is never slower than inc (except for indirect code-size / decode effects), but usually it's not faster either, so you should prefer inc for code-size reasons. Especially if this choice is repeated many times in the same binary (e.g. if you are a compiler-writer).
inc saves 1 byte (64-bit mode), or 2 bytes (opcodes 0x40..F inc r32/dec r32 short form in 32-bit mode, re-purposed as the REX prefix for x86-64). This makes a small percentage difference in total code size. This helps instruction-cache hit rates, iTLB hit rate, and number of pages that have to be loaded from disk.
Advantages of inc:
code-size directly
Not using an immediate can have uop-cache effects on Sandybridge-family, which could offset the better micro-fusion of add. (See Agner Fog's table 9.1 in the Sandybridge section of his microarch guide.) Perf counters can easily measure issue-stage uops, but it's harder to measure how things pack into the uop cache and uop-cache read bandwidth effects.
Leaving CF unmodified is an advantage in some cases, on CPUs where you can read CF after inc without a stall. (Not on Nehalem and earlier.)
There is one exception among modern CPUs: Silvermont/Goldmont/Knight's Landing decodes inc/dec efficiently as 1 uop, but expands to 2 in the allocate/rename (aka issue) stage. The extra uop merges partial flags. inc throughput is only 1 per clock, vs. 0.5c (or 0.33c Goldmont) for independent add r32, imm8 because of the dep chain created by the flag-merging uops.
Unlike P4, the register result doesn't have a false-dep on flags (see below), so out-of-order execution takes the flag-merging off the latency critical path when nothing uses the flag result. (But the OOO window is much smaller than mainstream CPUs like Haswell or Ryzen.) Running inc as 2 separate uops is probably a win for Silvermont in most cases; most x86 instructions write all the flags without reading them, breaking these flag dependency chains.
SMont/KNL has a queue between decode and allocate/rename (See Intel's optimization manual, figure 16-2) so expanding to 2 uops during issue can fill bubbles from decode stalls (on instructions like one-operand mul, or pshufb, which produce more than 1 uop from the decoder and cause a 3-7 cycle stall for microcode). Or on Silvermont, just an instruction with more than 3 prefixes (including escape bytes and mandatory prefixes), e.g. REX + any SSSE3 or SSE4 instruction. But note that there is a ~28 uop loop buffer, so small loops don't suffer from these decode stalls.
inc/dec aren't the only instructions that decode as 1 but issue as 2: push/pop, call/ret, and lea with 3 components do this too. So do KNL's AVX512 gather instructions. Source: Intel's optimization manual, 17.1.2 Out-of-Order Engine (KNL). It's only a small throughput penalty (and sometimes not even that if anything else is a bigger bottleneck), so it's generally fine to still use inc for "generic" tuning.
Intel's optimization manual still recommends add 1 over inc in general, to avoid risks of partial-flag stalls. But since Intel's compiler doesn't do that by default, it's not too likely that future CPUs will make inc slow in all cases, like P4 did.
Clang 5.0 and Intel's ICC 17 (on Godbolt) do use inc when optimizing for speed (-O3), not just for size. -mtune=pentium4 makes them avoid inc/dec, but the default -mtune=generic doesn't put much weight on P4.
ICC17 -xMIC-AVX512 (equivalent to gcc's -march=knl) does avoid inc, which is probably a good bet in general for Silvermont / KNL. But it's not usually a performance disaster to use inc, so it's probably still appropriate for "generic" tuning to use inc/dec in most code, especially when the flag result isn't part of the critical path.
Other than Silvermont, this is mostly-stale optimization advice left over from Pentium4. On modern CPUs, there's only a problem if you actually read a flag that wasn't written by the last insn that wrote any flags. e.g. in BigInteger adc loops. (And in that case, you need to preserve CF so using add would break your code.)
add writes all the condition-flag bits in the EFLAGS register. Register-renaming makes write-only easy for out-of-order execution: see write-after-write and write-after-read hazards. add eax, 1 and add ecx, 1 can execute in parallel because they are fully independent of each other. (Even Pentium4 renames the condition flag bits separate from the rest of EFLAGS, since even add leaves the interrupts-enabled and many other bits unmodified.)
On P4, inc and dec depend on the previous value of the all the flags, so they can't execute in parallel with each other or preceding flag-setting instructions. (e.g. add eax, [mem] / inc ecx makes the inc wait until after the add, even if the add's load misses in cache.) This is called a false dependency. Partial-flag writes work by reading the old value of the flags, updating the bits other than CF, then writing the full flags.
All other out-of-order x86 CPUs (including AMD's), rename different parts of flags separately, so internally they do a write-only update to all the flags except CF. (source: Agner Fog's microarchitecture guide). Only a few instructions, like adc or cmc, truly read and then write flags. But also shl r, cl (see below).
Cases where add dest, 1 is preferable to inc dest, at least for Intel P6/SnB uarch families:
Memory-destination: add [rdi], 1 can micro-fuse the store and the load+add on Intel Core2 and SnB-family, so it's 2 fused-domain uops / 4 unfused-domain uops.
inc [rdi] can only micro-fuse the store, so it's 3F / 4U.
According to Agner Fog's tables, AMD and Silvermont run memory-dest inc and add the same, as a single macro-op / uop.
But beware of uop-cache effects with add [label], 1 which needs a 32-bit address and an 8-bit immediate for the same uop.
Before a variable-count shift/rotate to break the dependency on flags and avoid partial-flag merging: shl reg, cl has an input dependency on the flags, because of unfortunate CISC history: it has to leave them unmodified if the shift count is 0.
On Intel SnB-family, variable-count shifts are 3 uops (up from 1 on Core2/Nehalem). AFAICT, two of the uops read/write flags, and an independent uop reads reg and cl, and writes reg. It's a weird case of having better latency (1c + inevitable resource conflicts) than throughput (1.5c), and only being able to achieve max throughput if mixed with instructions that break dependencies on flags. (I posted more about this on Agner Fog's forum). Use BMI2 shlx when possible; it's 1 uop and the count can be in any register.
Anyway, inc (writing flags but leaving CF unmodified) before variable-count shl leaves it with a false dependency on whatever wrote CF last, and on SnB/IvB can require an extra uop to merge flags.
Core2/Nehalem manage to avoid even the false dep on flags: Merom runs a loop of 6 independent shl reg,cl instructions at nearly two shifts per clock, same performance with cl=0 or cl=13. Anything better than 1 per clock proves there's no input-dependency on flags.
I tried loops with shl edx, 2 and shl edx, 0 (immediate-count shifts), but didn't see a speed difference between dec and sub on Core2, HSW, or SKL. I don't know about AMD.
Update: The nice shift performance on Intel P6-family comes at the cost of a large performance pothole which you need to avoid: when an instruction depends on the flag-result of a shift instruction: The front end stalls until the instruction is retired. (Source: Intel's optimization manual, (Section 3.5.2.6: Partial Flag Register Stalls)). So shr eax, 2 / jnz is pretty catastrophic for performance on Intel pre-Sandybridge, I guess! Use shr eax, 2 / test eax,eax / jnz if you care about Nehalem and earlier. Intel's examples makes it clear this applies to immediate-count shifts, not just count=cl.
In processors based on Intel Core microarchitecture [this means Core 2 and later], shift immediate by 1 is handled by special hardware such that it does not experience partial flag stall.
Intel actually means the special opcode with no immediate, which shifts by an implicit 1. I think there is a performance difference between the two ways of encoding shr eax,1, with the short encoding (using the original 8086 opcode D1 /5) producing a write-only (partial) flag result, but the longer encoding (C1 /5, imm8 with an immediate 1) not having its immediate checked for 0 until execution time, but without tracking the flag output in the out-of-order machinery.
Since looping over bits is common, but looping over every 2nd bit (or any other stride) is very uncommon, this seems like a reasonable design choice. This explains why compilers like to test the result of a shift instead of directly using flag results from shr.
Update: for variable count shifts on SnB-family, Intel's optimization manual says:
3.5.1.6 Variable Bit Count Rotation and Shift
In Intel microarchitecture code name Sandy Bridge, The “ROL/ROR/SHL/SHR reg, cl” instruction has three micro-ops. When the flag result is not needed, one of these micro-ops may be discarded, providing
better performance in many common usages. When these instructions update partial flag results that are subsequently used, the full three micro-ops flow must go through the execution and retirement pipeline,
experiencing slower performance. In Intel microarchitecture code name Ivy Bridge, executing the full three micro-ops flow to use the updated partial flag result has additional delay.
Consider the looped sequence below:
loop:
shl eax, cl
add ebx, eax
dec edx ; DEC does not update carry, causing SHL to execute slower three micro-ops flow
jnz loop
The DEC instruction does not modify the carry flag. Consequently, the
SHL EAX, CL instruction needs to execute the three micro-ops flow in
subsequent iterations. The SUB instruction will update all flags. So
replacing DEC with SUB will allow SHL EAX, CL to execute the two
micro-ops flow.
Terminology
Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because they never need to be merged. It has false dependencies instead.
Several answers / comments mix up the terminology. They describe a false dependency, but then call it a partial-flag stall. It's a slowdown which happens because of writing only some of the flags, but the term "partial-flag stall" is what happens on pre-SnB Intel hardware when partial-flag writes have to be merged. Intel SnB-family CPUs insert an extra uop to merge flags without stalling. Nehalem and earlier stall for ~7 cycles. I'm not sure how big the penalty is on AMD CPUs.
(Note that partial-register penalties are not always the same as partial-flags, see below).
### Partial flag stall on Intel P6-family CPUs:
bigint_loop:
adc eax, [array_end + rcx*4] # partial-flag stall when adc reads CF
inc rcx # rcx counts up from negative values towards zero
# test rcx,rcx # eliminate partial-flag stalls by writing all flags, or better use add rcx,1
jnz
# this loop doesn't do anything useful; it's not normally useful to loop the carry-out back to the carry-in for the same accumulator.
# Note that `test` will change the input to the next adc, and so would replacing inc with add 1
In other cases, e.g. a partial flag write followed by a full flag write, or a read of only flags written by inc, is fine. On SnB-family CPUs, inc/dec can even macro-fuse with a jcc, the same as add/sub.
After P4, Intel mostly gave up on trying to get people to re-compile with -mtune=pentium4 or modify hand-written asm as much to avoid serious bottlenecks. (Tuning for a specific microarchitecture will always be a thing, but P4 was unusual in deprecating so many things that used to be fast on previous CPUs, and thus were common in existing binaries.) P4 wanted people to use a RISC-like subset of the x86, and also had branch-prediction hints as prefixes for JCC instructions. (It also had other serious problems, like the trace cache that just wasn't good enough, and weak decoders that meant bad performance on trace-cache misses. Not to mention the whole philosophy of clocking very high ran into the power-density wall.)
When Intel abandoned P4 (NetBurst uarch), they returned to P6-family designs (Pentium-M / Core2 / Nehalem) which inherited their partial-flag / partial-reg handling from earlier P6-family CPUs (PPro to PIII) which pre-dated the netburst mis-step. (Not everything about P4 was inherently bad, and some of the ideas re-appeared in Sandybridge, but overall NetBurst is widely considered a mistake.) Some very-CISC instructions are still slower than the multi-instruction alternatives, e.g. enter, loop, or bt [mem], reg (because the value of reg affects which memory address is used), but these were all slow in older CPUs so compilers already avoided them.
Pentium-M even improved hardware support for partial-regs (lower merging penalties). In Sandybridge, Intel kept partial-flag and partial-reg renaming and made it much more efficient when merging is needed (merging uop inserted with no or minimal stall). SnB made major internal changes and is considered a new uarch family, even though it inherits a lot from Nehalem, and some ideas from P4. (But note that SnB's decoded-uop cache is not a trace cache, though, so it's a very different solution to the decoder throughput/power problem that NetBurst's trace cache tried to solve.)
For example, inc al and inc ah can run in parallel on P6/SnB-family CPUs, but reading eax afterwards requires merging.
PPro/PIII stall for 5-6 cycles when reading the full reg. Core2/Nehalem stall for only 2 or 3 cycles while inserting a merging uop for partial regs, but partial flags are still a longer stall.
SnB inserts a merging uop without stalling, like for flags. Intel's optimization guide says that for merging AH/BH/CH/DH into the wider reg, inserting the merging uop takes an entire issue/rename cycle during which no other uops can be allocated. But for low8/low16, the merging uop is "part of the flow", so it apparently doesn't cause additional front-end throughput penalties beyond taking up one of the 4 slots in an issue/rename cycle.
In IvyBridge (or at least Haswell), Intel dropped partial-register renaming for low8 and low16 registers, keeping it only for high8 registers (AH/BH/CH/DH). Reading high8 registers has extra latency. Also, setcc al has a false dependency on the old value of rax, unlike in Nehalem and earlier (and probably Sandybridge). See this HSW/SKL partial-register performance Q&A for the details.
(I've previously claimed that Haswell could merge AH with no uop, but that's not true and not what Agner Fog's guide says. I skimmed too quickly and unfortunately repeated my wrong understanding in lots of comments and other posts.)
AMD CPUs, and Intel Silvermont, don't rename partial regs (other than flags), so mov al, [mem] has a false dependency on the old value of eax. (The upside is no partial-reg merging slowdowns when reading the full reg later.)
Normally, the only time add instead of inc will make your code faster on AMD or mainstream Intel is when your code actually depends on the doesn't-touch-CF behaviour of inc. i.e. usually add only helps when it would break your code, but note the shl case mentioned above, where the instruction reads flags but usually your code doesn't care about that, so it's a false dependency.
If you do actually want to leave CF unmodified, pre SnB-family CPUs have serious problems with partial-flag stalls, but on SnB-family the overhead of having the CPU merge the partial flags is very low, so it can be best to keep using inc or dec as part of a loop condition when targeting those CPU, with some unrolling. (For details, see the BigInteger adc Q&A I linked earlier). It can be useful to use lea to do arithmetic without affecting flags at all, if you don't need to branch on the result.
Skylake doesn't have partial-flag merging costs
Update: Skylake doesn't have partial-flag merging uops at all: CF is just a separate register from the rest of FLAGS. Instructions that need both parts (like cmovbe) read both inputs separately. That makes cmovbe a 2-uop instruction, but most other cmovcc instructions 1-uop on Skylake. See What is a Partial Flag Stall?.
adc only reads CF so it can be single-uop on Skylake with no interaction at all with an inc or dec in the same loop.
(TODO: rewrite earlier parts of this answer.)
Depending on the CPU implementation of the instructions, a partial register update may cause a stall. According to Agner Fog's optimization guide, page 62,
For historical reasons, the INC and DEC instructions leave the carry flag unchanged, while the other arithmetic flags are written to. This causes a false dependence on the previous value of the flags and costs an extra μop. To avoid these problems, it is recommended that you always use ADD and SUB instead of INC and DEC. For example, INC EAX should be replaced by ADD EAX,1.
See also page 83 on "Partial flags stalls" and page 100 on "Partial flags stall".

Why would introducing useless MOV instructions speed up a tight loop in x86_64 assembly?

Background:
While optimizing some Pascal code with embedded assembly language, I noticed an unnecessary MOV instruction, and removed it.
To my surprise, removing the un-necessary instruction caused my program to slow down.
I found that adding arbitrary, useless MOV instructions increased performance even further.
The effect is erratic, and changes based on execution order: the same junk instructions transposed up or down by a single line produce a slowdown.
I understand that the CPU does all kinds of optimizations and streamlining, but, this seems more like black magic.
The data:
A version of my code conditionally compiles three junk operations in the middle of a loop that runs 2**20==1048576 times. (The surrounding program just calculates SHA-256 hashes).
The results on my rather old machine (Intel(R) Core(TM)2 CPU 6400 # 2.13 GHz):
avg time (ms) with -dJUNKOPS: 1822.84 ms
avg time (ms) without: 1836.44 ms
The programs were run 25 times in a loop, with the run order changing randomly each time.
Excerpt:
{$asmmode intel}
procedure example_junkop_in_sha256;
var s1, t2 : uint32;
begin
// Here are parts of the SHA-256 algorithm, in Pascal:
// s0 {r10d} := ror(a, 2) xor ror(a, 13) xor ror(a, 22)
// s1 {r11d} := ror(e, 6) xor ror(e, 11) xor ror(e, 25)
// Here is how I translated them (side by side to show symmetry):
asm
MOV r8d, a ; MOV r9d, e
ROR r8d, 2 ; ROR r9d, 6
MOV r10d, r8d ; MOV r11d, r9d
ROR r8d, 11 {13 total} ; ROR r9d, 5 {11 total}
XOR r10d, r8d ; XOR r11d, r9d
ROR r8d, 9 {22 total} ; ROR r9d, 14 {25 total}
XOR r10d, r8d ; XOR r11d, r9d
// Here is the extraneous operation that I removed, causing a speedup
// s1 is the uint32 variable declared at the start of the Pascal code.
//
// I had cleaned up the code, so I no longer needed this variable, and
// could just leave the value sitting in the r11d register until I needed
// it again later.
//
// Since copying to RAM seemed like a waste, I removed the instruction,
// only to discover that the code ran slower without it.
{$IFDEF JUNKOPS}
MOV s1, r11d
{$ENDIF}
// The next part of the code just moves on to another part of SHA-256,
// maj { r12d } := (a and b) xor (a and c) xor (b and c)
mov r8d, a
mov r9d, b
mov r13d, r9d // Set aside a copy of b
and r9d, r8d
mov r12d, c
and r8d, r12d { a and c }
xor r9d, r8d
and r12d, r13d { c and b }
xor r12d, r9d
// Copying the calculated value to the same s1 variable is another speedup.
// As far as I can tell, it doesn't actually matter what register is copied,
// but moving this line up or down makes a huge difference.
{$IFDEF JUNKOPS}
MOV s1, r9d // after mov r12d, c
{$ENDIF}
// And here is where the two calculated values above are actually used:
// T2 {r12d} := S0 {r10d} + Maj {r12d};
ADD r12d, r10d
MOV T2, r12d
end
end;
Try it yourself:
The code is online at GitHub if you want to try it out yourself.
My questions:
Why would uselessly copying a register's contents to RAM ever increase performance?
Why would the same useless instruction provide a speedup on some lines, and a slowdown on others?
Is this behavior something that could be exploited predictably by a compiler?
The most likely cause of the speed improvement is that:
inserting a MOV shifts the subsequent instructions to different memory addresses
one of those moved instructions was an important conditional branch
that branch was being incorrectly predicted due to aliasing in the branch prediction table
moving the branch eliminated the alias and allowed the branch to be predicted correctly
Your Core2 doesn't keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. One disadvantage of global branch prediction is that the history is diluted by irrelevant information if the different conditional jumps are uncorrelated.
This little branch prediction tutorial shows how branch prediction buffers work. The cache buffer is indexed by the lower portion of the address of the branch instruction. This works well unless two important uncorrelated branches share the same lower bits. In that case, you end-up with aliasing which causes many mispredicted branches (which stalls the instruction pipeline and slowing your program).
If you want to understand how branch mispredictions affect performance, take a look at this excellent answer: https://stackoverflow.com/a/11227902/1001643
Compilers typically don't have enough information to know which branches will alias and whether those aliases will be significant. However, that information can be determined at runtime with tools such as Cachegrind and VTune.
You may want to read http://research.google.com/pubs/pub37077.html
TL;DR: randomly inserting nop instructions in programs can easily increase performance by 5% or more, and no, compilers cannot easily exploit this. It's usually a combination of branch predictor and cache behaviour, but it can just as well be e.g. a reservation station stall (even in case there are no dependency chains that are broken or obvious resource over-subscriptions whatsoever).
I believe in modern CPUs the assembly instructions, while being the last visible layer to a programmer for providing execution instructions to a CPU, actually are several layers from actual execution by the CPU.
Modern CPUs are RISC/CISC hybrids that translate CISC x86 instructions into internal instructions that are more RISC in behavior. Additionally there are out-of-order execution analyzers, branch predictors, Intel's "micro-ops fusion" that try to group instructions into larger batches of simultaneous work (kind of like the VLIW/Itanium titanic). There are even cache boundaries that could make the code run faster for god-knows-why if it's bigger (maybe the cache controller slots it more intelligently, or keeps it around longer).
CISC has always had an assembly-to-microcode translation layer, but the point is that with modern CPUs things are much much much more complicated. With all the extra transistor real estate in modern semiconductor fabrication plants, CPUs can probably apply several optimization approaches in parallel and then select the one at the end that provides the best speedup. The extra instructions may be biasing the CPU to use one optimization path that is better than others.
The effect of the extra instructions probably depends on the CPU model / generation / manufacturer, and isn't likely to be predictable. Optimizing assembly language this way would require execution against many CPU architecture generations, perhaps using CPU-specific execution paths, and would only be desirable for really really important code sections, although if you're doing assembly, you probably already know that.
Preparing the cache
Move operations to memory can prepare the cache and make subsequent move operations faster. A CPU usually have two load units and one store units. A load unit can read from memory into a register (one read per cycle), a store unit stores from register to memory. There are also other units that do operations between registers. All the units work in parallel. So, on each cycle, we may do several operations at once, but no more than two loads, one store, and several register operations. Usually it is up to 4 simple operations with plain registers, up to 3 simple operations with XMM/YMM registers and a 1-2 complex operations with any kind of registers. Your code has lots of operations with registers, so one dummy memory store operation is free (since there are more than 4 register operations anyway), but it prepares memory cache for the subsequent store operation. To find out how memory stores work, please refer to the Intel 64 and IA-32 Architectures Optimization Reference Manual.
Breaking the false dependencies
Although this does not exactly refer to your case, but sometimes using 32-bit mov operations under the 64-bit processor (as in your case) are used to clear the higher bits (32-63) and break the dependency chains.
It is well known that under x86-64, using 32-bit operands clears the higher bits of the 64-bit register. Pleas read the relevant section - 3.4.1.1 - of The Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 1:
32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register
So, the mov instructions, that may seem useless at the first sight, clear the higher bits of the appropriate registers. What it gives to us? It breaks dependency chains and allows the instructions to execute in parallel, in random order, by the Out-of-Order algorithm implemented internally by CPUs since Pentium Pro in 1995.
A Quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.8:
Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core micro-architecture, a number of instructions can help clear execution dependency when software uses these instruction to clear register content to zero. Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For
moves, this can be accomplished with 32-bit moves or by using MOVZX.
Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
The MOVZX and MOV with 32-bit operands for x64 are equivalent - they all break dependency chains.
That's why your code executes faster. If there are no dependencies, the CPU can internally rename the registers, even though at the first sight it may seem that the second instruction modifies a register used by the first instruction, and the two cannot execute in parallel. But due to register renaming they can.
Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.
I think you now see that it is too obvious.

Resources