Can I atomically increment a 16-bit counter on x86/x86_64?

I want to save memory by converting an existing 32-bit counter to a 16-bit counter. The counter is atomically incremented/decremented. If I make this change:
1. What instructions do I use for atomic_inc(uint16_t x) on x86/x86_64?
2. Is this reliable on multi-processor x86/x86_64 machines?
3. Is there a performance penalty to pay on any of these architectures for doing this?
4. If yes for (3), what's the expected performance penalty?
Thanks for your comments!

Here's one that uses GCC assembly extensions, as an alternative to Steve's Delphi answer:
uint16_t atomic_inc(uint16_t volatile* ptr)
{
uint16_t value(1);
__asm__("lock xadd %w0, %w1" : "+r" (value) : "m" (*ptr));
return ++value;
}
Change the 1 to -1, and the ++ to --, to get a decrement.
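If you can use compiler builtins instead of inline asm, a minimal sketch like this (assuming GCC or Clang; the function name is mine) lets the compiler pick the 16-bit locked encoding itself:

#include <stdint.h>

// The __atomic builtins behind C11/C++11 atomics handle 2-byte operands; on
// x86/x86-64 this compiles to a 16-bit "lock xadd" (or "lock add"/"lock inc"
// when the return value is unused).
uint16_t atomic_inc16(volatile uint16_t* ptr)
{
    return __atomic_add_fetch(ptr, 1, __ATOMIC_SEQ_CST);
}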

Here is a Delphi function that works:
function LockedInc( var Target :WORD ) :WORD;
asm
mov ecx, eax
mov ax, 1
Lock xadd [ecx], ax
Inc eax
end;
I guess you could convert it to whichever language you require.

The simplest way to perform an atomic increase is as follows (this is inline ASM):
asm
lock inc dword ptr Counter;
end;
where Counter is an integer variable (for a 16-bit counter, use inc word ptr instead of dword ptr). This will directly increase Counter in its memory location.
I have tested this with brute force and it works 100%.

To answer the other three questions:
2. Yes, this is reliable in a multiprocessor environment.
3. Yes, there is a performance penalty.
4. See below.
The "lock" prefix locks down the busses, not only for the processor, but for any external hardware, which may want to access the bus via DMA (mass storage, graphics...). So it is slow, typically ~100 clock cycles, but it may be more costly. But if you have "megabytes" of counters, chances are, you will be facing a cache miss, and in this case you will have to wait about ~100 clocks anyway (the memory access time), in case of a page miss, several hundred, so the overhead from lock might not matter.

Related

x86 mfence and C++ memory barrier

I'm checking how the compiler emits instructions for multi-core memory barriers on x86_64. The code below is what I'm testing, using gcc 8.3 for x86_64.
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> flag {false};
int any_value {0};
void set()
{
any_value = 10;
flag.store(true, std::memory_order_release);
}
void get()
{
while (!flag.load(std::memory_order_acquire));
assert(any_value == 10);
}
int main()
{
std::thread a {set};
get();
a.join();
}
When I use std::memory_order_seq_cst, I can see the MFENCE instruction being used at any optimization level (-O1, -O2, -O3). This instruction makes sure the store buffer is flushed, therefore updating its data in the L1D cache (and using the MESI protocol to make sure other threads can see the effect).
However, when I use std::memory_order_release/acquire with no optimizations, the MFENCE instruction is also used, but it is omitted with -O1, -O2, -O3, and I don't see other instructions that flush the buffers.
In the case where MFENCE is not used, what makes sure the store buffer data is committed to cache memory to ensure the memory order semantics?
Below is the assembly code for the get/set functions with -O3, like what we get on the Godbolt compiler explorer:
set():
mov DWORD PTR any_value[rip], 10
mov BYTE PTR flag[rip], 1
ret
.LC0:
.string "/tmp/compiler-explorer-compiler119218-62-hw8j86.n2ft/example.cpp"
.LC1:
.string "any_value == 10"
get():
.L8:
movzx eax, BYTE PTR flag[rip]
test al, al
je .L8
cmp DWORD PTR any_value[rip], 10
jne .L15
ret
.L15:
push rax
mov ecx, OFFSET FLAT:get()::__PRETTY_FUNCTION__
mov edx, 17
mov esi, OFFSET FLAT:.LC0
mov edi, OFFSET FLAT:.LC1
call __assert_fail
The x86 memory ordering model provides #StoreStore and #LoadStore barriers for all store instructions (1), which is all that release semantics require. Also, the processor commits a store as soon as possible: once the store instruction has retired, the store is the oldest in the store buffer, the core has the target cache line in a writeable coherence state, and a cache port is available to perform the store operation (2). So there is no need for an MFENCE instruction. The flag will become visible to the other thread as soon as possible, and when it does, any_value is guaranteed to be 10.
On the other hand, sequential consistency also requires #StoreLoad and #LoadLoad barriers. MFENCE is required to provide both (3) barriers, and so it is used at all optimization levels.
Related: Size of store buffers on Intel hardware? What exactly is a store buffer?.
Footnotes:
(1) There are exceptions that don't apply here. In particular, non-temporal stores and stores to the uncacheable write-combining memory types provide only the #LoadStore barrier. Anyway, these barriers are provided for stores to the write-back memory type on both Intel and AMD processors.
(2) This is in contrast to write-combining stores which are made globally-visible under certain conditions. See Section 11.3.1 of the Intel manual Volume 3.
(3) See the discussion under Peter's answer.
x86's TSO memory model is sequential-consistency + a store buffer, so only seq-cst stores need any special fencing. (Stalling after a store until the store buffer drains, before later loads, is all we need to recover sequential consistency). The weaker acq/rel model is compatible with the StoreLoad reordering caused by a store buffer.
(See discussion in comments re: whether "allowing StoreLoad reordering" is an accurate and sufficient description of what x86 allows. A core always sees its own stores in program order because loads snoop the store buffer, so you could say that store-forwarding also reorders loads of recently-stored data. Except you can't always: Globally Invisible load instructions)
(And BTW, compilers other than gcc use xchg to do a seq-cst store. This is actually more efficient on current CPUs. GCC's mov+mfence might have been cheaper in the past, but is currently usually worse even if you don't care about the old value. See Why does a std::atomic store with sequential consistency use XCHG? for a comparison between GCC's mov+mfence vs. xchg. Also my answer on Which is a better write barrier on x86: lock+addl or xchgl?)
Fun fact: you can achieve sequential consistency by instead fencing seq-cst loads instead of stores. But cheap loads are much more valuable than cheap stores for most use-cases, so everyone uses ABIs where the full barriers go on the stores.
See https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for details of how C++11 atomic ops map to asm instruction sequences for x86, PowerPC, ARMv7, ARMv8, and Itanium. Also When are x86 LFENCE, SFENCE and MFENCE instructions required?
when I use std::memory_order_release/acquire with no optimizations MFENCE instruction is also used
That's because flag.store(true, std::memory_order_release); doesn't inline, because you disabled optimization. That includes inlining of very simple member functions like atomic::store(T, std::memory_order = std::memory_order_seq_cst)
When the ordering parameter to the __atomic_store_n() GCC builtin is a runtime variable (in the atomic::store() header implementation), GCC plays it conservative and promotes it to seq_cst.
It might actually be worth it for gcc to branch over mfence because it's so expensive, but that's not what we get. (But that would make larger code-size for functions with runtime variable order params, and the code path might not be hot. So branching is probably only a good idea in the libatomic implementation, or with profile-guided optimization for rare cases where a function is large enough to not inline but takes a variable order.)
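A minimal sketch of the workaround implied above (my own illustration, assuming GCC/Clang on x86-64): if the builtin is called directly with a compile-time-constant order, there is no runtime order variable to promote to seq_cst, so even at -O0 the release store compiles to a plain mov with no MFENCE.

// hypothetical standalone flag, used only with the builtins below
bool raw_flag = false;

void set_release()
{
    __atomic_store_n(&raw_flag, true, __ATOMIC_RELEASE);   // mov byte ptr [raw_flag], 1
}

bool get_acquire()
{
    return __atomic_load_n(&raw_flag, __ATOMIC_ACQUIRE);   // movzx eax, byte ptr [raw_flag]
}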

Setting and clearing the zero flag in x86

What's the most efficient way to set and also to clear the zero flag (ZF) in x86-64?
Methods that work without the need for a register with a known value, or without any free registers at all are preferred, but if a better method is available when those or other assumptions are true it is also worth mentioning.
ZF=0
This is harder. cmp between any two regs known to be not equal. Or cmp reg,imm with any value some reg couldn't possibly have. e.g. cmp reg,1 with any known-zero register.
In general test reg,reg is good with any known-non-0 register value, e.g. a pointer.
test rsp, rsp is probably a good choice, or even test esp, esp to save a byte; that will work unless your stack is in the unusual position of spanning a 4G boundary.
I don't see a way to create ZF=0 in one instruction without a false dependency on some input reg. xor eax,eax / inc eax or dec will do the trick in 2 uops if you don't mind destroying a register, breaking false dependencies. (not doesn't set FLAGS, and neg will just do 0-0 = 0.)
or eax, -1 doesn't need any pre-condition for the register value. (False dependency, but not a true dependency so you can pick any register even if it might be zero.) It doesn't have to be -1, it's not gaining you anything so if you can make it something useful so much the better.
or eax,-1 FLAG results: ZF=0 PF=1 SF=1 CF=0 OF=0 (AF=undefined).
If you need to do this in a loop, you can obviously set up for it outside the loop, if you can dedicate a register to being non-zero for use with test.
ZF=1
Least destructive: cmp eax,eax - but has a false dependency (I assume) and needs a back-end uop: not a zeroing idiom. RSP doesn't usually change much so cmp esp, esp could be a good choice. (Unless that forces a stack-sync uop).
Most efficient: xor-zeroing (like xor eax,eax using any free register) is definitely the most efficient way on SnB-family (same cost as a 2-byte nop, or 3-byte if it needs a REX because you want to zero one of r8d..r15d): 1 front-end uop, zero back-end uops on SnB-family, and the FLAGS result is ready in the same cycle it issues. (Relevant only in case the front-end was stalled, or some other case where a uop depending on it issues in the same cycle and there aren't any older uops in the RS with ready inputs, otherwise such uops would have priority for whichever execution port.)
Flag results: ZF=1 PF=1 SF=0 CF=0 OF=0 (AF=undefined). (Or use sub eax,eax to get well-defined AF=0. In practice modern CPUs pick AF=0 for xor-zeroing, too, so they can decode both zeroing idioms the same way. Silvermont only recognizes 32-bit operand-size xor as a zeroing idiom, not sub.)
xor-zero is very cheap on all other uarches as well, of course: no input dependencies, and doesn't need any pre-existing register value. (And thus doesn't contribute to P6-family register-read stalls). So it will be at worst tied with anything else you could do on any other uarch (where it does require an execution unit.)
(On early P6-family, before Pentium M, xor-zeroing does not break dependencies; it only triggers the special al=eax state that avoids partial-register stuff. But none of those CPUs are x86-64, all 32-bit only.)
It's pretty common to want a zeroed register for something anyway, e.g. as a sub destination for 0 - x to copy-and-negate, so take advantage of it by putting the xor-zeroing where you need it to also create a useful FLAG condition.
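A small sketch of that idea (mine, not from the answer): the xor both produces ZF=1 and supplies the zero needed for a 0 - x copy-and-negate.

int negate_reusing_zero(int x)
{
    int result;
    asm("xorl %0, %0\n\t"   // result = 0; sets ZF=1 (zeroing idiom)
        "subl %1, %0"       // result = 0 - x
        : "=&r"(result)
        : "r"(x)
        : "cc");
    return result;
}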
Interesting but probably not useful: test al, 0 is 2 bytes long. But so is cmp esp,esp.
As @prl suggested, cmp same,same with any register will work without disturbing a value. I suspect this is not special-cased as dependency-breaking the way sub same,same is on some CPUs, so pick a "cold" register. Again 2 or 3 bytes, 1 uop. It can macro-fuse with a JCC, but that would be dumb (unless the JCC is also a branch target from some other condition?)
Flag results: same as xor-zeroing.
Downsides:
(probably) false dependency
on P6-family can contribute to a register-read stall, so pick a cold register you're already reading in nearby instructions.
needs a back-end execution unit on SnB-family
Just for fun, other as-cheap alternatives include test al, 0. 2 bytes for AL, 3 or 4 bytes for any other 8-bit register. (REX) + opcode + modrm + imm8. The original register value doesn't matter because an imm8 of zero guarantees that reg & 0 = 0.
If you happen to have a 1 or -1 in a register you can destroy, 32-bit mode inc or dec would set ZF in only 1 byte. But in x86-64 that's at least 2 bytes. Nothing comes to mind for a 1-byte instruction in 64-bit mode that's actually efficient and sets FLAGS.
ZF=!CF
sbb same,same can set ZF=!CF (leaving CF unmodified), and setting the reg to 0 (CF=0) or -1 (CF=1). On AMD since Bulldozer (BD-family and Zen-family), this has no dependency on the GP register, only CF. But on other uarches it's not special cased and there is a false dep on the reg. And it's 2 uops on Intel before Broadwell.
ZF=!bool(integer register)
To set ZF=!integer_reg, obviously the normal test reg,reg is your best bet. (Better than and reg,reg or or reg,reg, unless you're intentionally rewriting the register to avoid P6 register-read stalls.)
ZF=1 if the register value is zero, so it's like C's logical inverse operator.
ZF=!ZF
Perhaps setz al / test al, al. No single instruction: I don't think any read ZF and write FLAGS. setz materializes ZF in a register, then test is just ZF = !reg.
Other FLAGS conditions:
How to read and write x86 flags registers directly?
One instruction to clear PF (Parity Flag) -- get odd number of bits in result register (not possible without pre-existing register values for test or cmp).
How can I set or clear overflow flag in x86 assembly? (e.g. for the start of an ADOX chain.)
pushf/pop rax is not terrible, but writing flags with popf is very slow (e.g. 1/20c throughput on SKL). It's microcoded because flags like IF also live in EFLAGS, and there isn't a condition-codes-only version or a special fast-path for user-space. (Or maybe 20c is the fast path.)
lahf (FLAGS->AH) / sahf (AH->FLAGS) can be useful but miss OF.
CF has clc/stc/cmc instructions. (clc is as efficient as xor-zeroing on SnB-family.)
The least intrusive way to manipulate any (i) of the lower 8 bits of FLAGS is to use the classic LAHF/SAHF instructions, which bring them to/from AH, on which any bit operation can be applied.
(i) Just bits 7 (SF), 6 (ZF), 4 (AF), 2 (PF), and 0 (CF)
Turning off ZF
LAHF ; Load lower 8 bit from Flags into AH
AND AH,010111111b ; Clear bit for ZF
SAHF ; Store AH back to Flags
Turning on ZF
LAHF ; Load AH from FLAGS
OR AH,001000000b ; Set bit for ZF
SAHF ; Store AH back to Flags
Of course any CMP (E)AX,(E)AX will set ZF faster and with less code; the point of this is to leave other FLAGS unmodified, as in How to read and write x86 flags registers directly? and how to change flags manually (in assembly code) for 8086?
CAVEAT for early AMD64 - LAHF in long mode is an extension
Some very early x86-64 CPUs do not support LAHF/SAHF in 64-bit mode, most notably
AMD Athlon 64, Opteron and Turion 64 before revision D (March 2005) and
Intel Pentium 4 before stepping G1 (December 2005).
The instruction was originally removed from the AMD64 instruction set, but later reintroduced. Luckily that happened before x86-64 became a common sight, so only a few early high-end CPUs are affected, and even fewer survive today; all the more so because these CPUs cannot run Windows 10, or any 64-bit Windows before Windows 10 (see this answer at SuperUser.SE).
If you really expect someone to run your software on a more than 17-year-old high-end CPU, support can be checked by executing CPUID with EAX=80000001h and testing bit 0 of ECX (the LAHF/SAHF feature flag).
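A quick sketch of that CPUID check (my own illustration, using GCC/Clang's <cpuid.h>; leaf 80000001h, ECX bit 0 reports LAHF/SAHF support in long mode):

#include <cpuid.h>
#include <cstdio>

bool lahf_sahf_supported()
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return false;               // extended leaf not available at all
    return (ecx & 1u) != 0;         // ECX bit 0 = LAHF/SAHF usable in 64-bit mode
}

int main()
{
    std::printf("LAHF/SAHF in 64-bit mode: %s\n", lahf_sahf_supported() ? "yes" : "no");
}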
Assuming you don’t need to preserve the values of the other flags,
cmp eax, eax

Are scaled-index addressing modes a good idea?

Consider the following code:
void foo(int* __restrict__ a)
{
int i; int val = 0;
for (i = 0; i < 100; i++) {
val = 2 * i;
a[i] = val;
}
}
This compiles (with maximum optimization but no unrolling or vectorization) into...
GCC 7.2:
foo(int*):
xor eax, eax
.L2:
mov DWORD PTR [rdi], eax
add eax, 2
add rdi, 4
cmp eax, 200
jne .L2
rep ret
clang 5.0:
foo(int*): # #foo(int*)
xor eax, eax
.LBB0_1: # =>This Inner Loop Header: Depth=1
mov dword ptr [rdi + 2*rax], eax
add rax, 2
cmp rax, 200
jne .LBB0_1
ret
What are the pros and cons of GCC's vs clang's approach? i.e. an extra variable incremented separately, vs multiplying via a more complex addressing mode?
Notes:
This question also relates to this one with about the same code, but with float's rather than int's.
Yes, take advantage of the power of x86 addressing modes to save uops, in cases where an index doesn't unlaminate into more extra uops than it would cost to do pointer increments.
(In many cases unrolling and using pointer increments is a win because of unlamination on Intel Sandybridge-family, but if you're not unrolling or if you're only using mov loads instead of folding memory operands into ALU ops for micro-fusion, then indexed addressing modes are often break even on some CPUs and a win on others.)
It's essential to read and understand Micro fusion and addressing modes if you want to make optimal choices here. (And note that IACA gets it wrong, and doesn't simulate Haswell and later keeping some uops micro-fused, so you can't even just check your work by having it do static analysis for you.)
Indexed addressing modes are generally cheap. At worst they cost one extra uop for the front-end (on Intel SnB-family CPUs in some situations), and/or prevent a store-address uop from using port7 (which only supports base + displacement addressing modes). See Agner Fog's microarch pdf, and also David Kanter's Haswell write-up, for more about the store-AGU on port7 which Intel added in Haswell.
On Haswell+, if you need your loop to sustain more than 2 memory ops per clock, then avoid indexed stores.
At best they're free other than the code-size cost of the extra byte in the machine-code encoding. (Having an index register requires a SIB (Scale Index Base) byte in the encoding).
More often the only penalty is the 1 extra cycle of load-use latency vs. a simple [base + 0-2047] addressing mode, on Intel Sandybridge-family CPUs.
It's usually only worth using an extra instruction to avoid an indexed addressing mode if you're going to use that addressing mode in multiple instructions. (e.g. load / modify / store).
Scaling the index is free (on modern CPUs at least) if you're already using a 2-register addressing mode. (For lea, Agner Fog's table lists AMD Ryzen as having 2c latency and only 2 per clock throughput for lea with scaled-index addressing modes (or 3-component), otherwise 1c latency and 0.25c throughput. e.g. lea rax, [rcx + rdx] is faster than lea rax, [rcx + 2*rdx], but not by enough to be worth using extra instructions instead.) Ryzen also doesn't like a 32-bit destination in 64-bit mode, for some reason. But the worst-case LEA is still not bad at all. And anyway, mostly unrelated to address-mode choice for loads, because most CPUs (other than in-order Atom) run LEA on the ALUs, not the AGUs used for actual loads/stores.
The main question is between one-register unscaled (so it can be a "base" register in the machine-code encoding: [base + idx*scale + disp]) or two-register. Note that for Intel's micro-fusion limitations, [disp32 + idx*scale] (e.g. indexing a static array) is an indexed addressing mode.
Neither function is totally optimal (even without considering unrolling or vectorization), but clang's looks very close.
The only thing clang could do better is save 2 bytes of code size by avoiding the REX prefixes with add eax, 2 and cmp eax, 200. It promoted all the operands to 64-bit because it's using them with pointers and I guess proved that the C loop doesn't need them to wrap, so in asm it uses 64-bit everywhere. This is pointless; 32-bit operations are always at least as fast as 64, and implicit zero-extension is free. But this only costs 2 bytes of code-size, and costs no performance other than indirect front-end effects from that.
You've constructed your loop so the compiler needs to keep a specific value in registers and can't totally transform the problem into just a pointer-increment + compare against an end pointer (which compilers often do when they don't need the loop variable for anything except array indexing).
You also can't transform to counting a negative index up towards zero (which compilers never do, but it reduces the loop overhead to a total of 1 macro-fused add + branch uop on Intel CPUs, which can fuse add + jcc, while AMD can only fuse test or cmp / jcc).
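Hedged sketches of the two transformations described in the previous two paragraphs (illustrative source only, not what either compiler emits for the original function):

// (a) pointer increment + compare against an end pointer: no indexed addressing mode
void foo_ptr(int* __restrict__ a)
{
    int val = 0;
    for (int* p = a, * const end = a + 100; p != end; ++p) {
        *p = val;       // store through the plain pointer
        val += 2;       // the stored value still has to be materialized
    }
}

// (b) negative index counted up towards zero: loop overhead is one add/inc + jne,
//     which can macro-fuse into a single uop on Intel
void foo_negidx(int* __restrict__ a)
{
    int* const end = a + 100;
    for (long i = -100; i != 0; ++i) {
        end[i] = 2 * (int)(i + 100);   // same values stored as in the original loop
    }
}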
Clang has done a good job noticing that it can use 2*var as the array index (in bytes). This is a good optimization for tune=generic. The indexed store will un-laminate on Intel Sandybridge and Ivybridge, but stay micro-fused on Haswell and later. (And on other CPUs, like Nehalem, Silvermont, Ryzen, Jaguar, or whatever, there's no disadvantage.)
gcc's loop has 1 extra uop in the loop. It can still in theory run at 1 store per clock on Core2 / Nehalem, but it's right up against the 4 uops per clock limit. (And actually, Core2 can't macro-fuse the cmp/jcc in 64-bit mode, so it bottlenecks on the front-end).
Indexed addressing (in loads and stores, lea is different still) has some trade-offs, for example
On many µarchs, instructions that use indexed addressing have a slightly longer latency than instruction that don't. But usually throughput is a more important consideration.
On Netburst, stores with a SIB byte generate an extra µop, and therefore may cost throughput as well. The extra µop is caused by the SIB byte itself, so it can also hit non-indexed addressing modes that need one (such as using ESP/RSP as the base register), but indexed addressing always needs a SIB byte and therefore always pays it. It doesn't apply to loads.
On Haswell/Broadwell (and still in Skylake/Kabylake), stores with indexed addressing cannot use port 7 for address generation, instead one of the more general address generation ports will be used, reducing the throughput available for loads.
So for loads it's usually good (or not bad) to use indexed addressing if it saves an add somewhere, unless they are part of a chain of dependent loads. For stores it's more dangerous to use indexed addressing. In the example code it shouldn't make a large difference. Saving the add is not really relevant, ALU instructions wouldn't be the bottleneck. The address generation happening in ports 2 or 3 doesn't matter either since there are no loads.

Why would introducing useless MOV instructions speed up a tight loop in x86_64 assembly?

Background:
While optimizing some Pascal code with embedded assembly language, I noticed an unnecessary MOV instruction, and removed it.
To my surprise, removing the un-necessary instruction caused my program to slow down.
I found that adding arbitrary, useless MOV instructions increased performance even further.
The effect is erratic, and changes based on execution order: the same junk instructions transposed up or down by a single line produce a slowdown.
I understand that the CPU does all kinds of optimizations and streamlining, but, this seems more like black magic.
The data:
A version of my code conditionally compiles three junk operations in the middle of a loop that runs 2**20==1048576 times. (The surrounding program just calculates SHA-256 hashes).
The results on my rather old machine (Intel(R) Core(TM)2 CPU 6400 @ 2.13 GHz):
avg time (ms) with -dJUNKOPS: 1822.84 ms
avg time (ms) without: 1836.44 ms
The programs were run 25 times in a loop, with the run order changing randomly each time.
Excerpt:
{$asmmode intel}
procedure example_junkop_in_sha256;
var s1, t2 : uint32;
begin
// Here are parts of the SHA-256 algorithm, in Pascal:
// s0 {r10d} := ror(a, 2) xor ror(a, 13) xor ror(a, 22)
// s1 {r11d} := ror(e, 6) xor ror(e, 11) xor ror(e, 25)
// Here is how I translated them (side by side to show symmetry):
asm
MOV r8d, a ; MOV r9d, e
ROR r8d, 2 ; ROR r9d, 6
MOV r10d, r8d ; MOV r11d, r9d
ROR r8d, 11 {13 total} ; ROR r9d, 5 {11 total}
XOR r10d, r8d ; XOR r11d, r9d
ROR r8d, 9 {22 total} ; ROR r9d, 14 {25 total}
XOR r10d, r8d ; XOR r11d, r9d
// Here is the extraneous operation that I removed, causing a speedup
// s1 is the uint32 variable declared at the start of the Pascal code.
//
// I had cleaned up the code, so I no longer needed this variable, and
// could just leave the value sitting in the r11d register until I needed
// it again later.
//
// Since copying to RAM seemed like a waste, I removed the instruction,
// only to discover that the code ran slower without it.
{$IFDEF JUNKOPS}
MOV s1, r11d
{$ENDIF}
// The next part of the code just moves on to another part of SHA-256,
// maj { r12d } := (a and b) xor (a and c) xor (b and c)
mov r8d, a
mov r9d, b
mov r13d, r9d // Set aside a copy of b
and r9d, r8d
mov r12d, c
and r8d, r12d { a and c }
xor r9d, r8d
and r12d, r13d { c and b }
xor r12d, r9d
// Copying the calculated value to the same s1 variable is another speedup.
// As far as I can tell, it doesn't actually matter what register is copied,
// but moving this line up or down makes a huge difference.
{$IFDEF JUNKOPS}
MOV s1, r9d // after mov r12d, c
{$ENDIF}
// And here is where the two calculated values above are actually used:
// T2 {r12d} := S0 {r10d} + Maj {r12d};
ADD r12d, r10d
MOV T2, r12d
end
end;
Try it yourself:
The code is online at GitHub if you want to try it out yourself.
My questions:
Why would uselessly copying a register's contents to RAM ever increase performance?
Why would the same useless instruction provide a speedup on some lines, and a slowdown on others?
Is this behavior something that could be exploited predictably by a compiler?
The most likely cause of the speed improvement is that:
inserting a MOV shifts the subsequent instructions to different memory addresses
one of those moved instructions was an important conditional branch
that branch was being incorrectly predicted due to aliasing in the branch prediction table
moving the branch eliminated the alias and allowed the branch to be predicted correctly
Your Core2 doesn't keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. One disadvantage of global branch prediction is that the history is diluted by irrelevant information if the different conditional jumps are uncorrelated.
This little branch prediction tutorial shows how branch prediction buffers work. The prediction buffer is indexed by the lower portion of the address of the branch instruction. This works well unless two important uncorrelated branches share the same lower bits; in that case you end up with aliasing, which causes many mispredicted branches (which stall the instruction pipeline and slow down your program).
If you want to understand how branch mispredictions affect performance, take a look at this excellent answer: https://stackoverflow.com/a/11227902/1001643
Compilers typically don't have enough information to know which branches will alias and whether those aliases will be significant. However, that information can be determined at runtime with tools such as Cachegrind and VTune.
You may want to read http://research.google.com/pubs/pub37077.html
TL;DR: randomly inserting nop instructions in programs can easily increase performance by 5% or more, and no, compilers cannot easily exploit this. It's usually a combination of branch predictor and cache behaviour, but it can just as well be e.g. a reservation station stall (even in case there are no dependency chains that are broken or obvious resource over-subscriptions whatsoever).
I believe in modern CPUs the assembly instructions, while being the last visible layer to a programmer for providing execution instructions to a CPU, actually are several layers from actual execution by the CPU.
Modern CPUs are RISC/CISC hybrids that translate CISC x86 instructions into internal instructions that are more RISC in behavior. Additionally there are out-of-order execution analyzers, branch predictors, Intel's "micro-ops fusion" that try to group instructions into larger batches of simultaneous work (kind of like the VLIW/Itanium titanic). There are even cache boundaries that could make the code run faster for god-knows-why if it's bigger (maybe the cache controller slots it more intelligently, or keeps it around longer).
CISC has always had an assembly-to-microcode translation layer, but the point is that with modern CPUs things are much much much more complicated. With all the extra transistor real estate in modern semiconductor fabrication plants, CPUs can probably apply several optimization approaches in parallel and then select the one at the end that provides the best speedup. The extra instructions may be biasing the CPU to use one optimization path that is better than others.
The effect of the extra instructions probably depends on the CPU model / generation / manufacturer, and isn't likely to be predictable. Optimizing assembly language this way would require execution against many CPU architecture generations, perhaps using CPU-specific execution paths, and would only be desirable for really really important code sections, although if you're doing assembly, you probably already know that.
Preparing the cache
Move operations to memory can prepare the cache and make subsequent move operations faster. A CPU usually has two load units and one store unit. A load unit can read from memory into a register (one read per cycle); a store unit stores from a register to memory. There are also other units that perform operations between registers. All the units work in parallel, so on each cycle we may do several operations at once, but no more than two loads, one store, and several register operations. Usually this is up to 4 simple operations with plain registers, up to 3 simple operations with XMM/YMM registers, and 1-2 complex operations with any kind of registers. Your code has lots of operations with registers, so one dummy memory store operation is free (since there are more than 4 register operations anyway), but it prepares the memory cache for the subsequent store operation. To find out how memory stores work, refer to the Intel 64 and IA-32 Architectures Optimization Reference Manual.
Breaking the false dependencies
Although this does not exactly apply to your case, 32-bit mov operations on a 64-bit processor (as in your case) are sometimes used to clear the upper bits (32-63) of a register and break dependency chains.
It is well known that under x86-64, using 32-bit operands clears the upper bits of the 64-bit register. Please read the relevant section - 3.4.1.1 - of the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 1:
32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register
So the mov instructions, which may seem useless at first sight, clear the upper bits of the appropriate registers. What does this give us? It breaks dependency chains and allows instructions to execute in parallel, out of order, via the out-of-order execution machinery implemented internally by CPUs since the Pentium Pro in 1995.
A Quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.8:
Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core micro-architecture, a number of instructions can help clear execution dependency when software uses these instruction to clear register content to zero. Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
MOVZX and MOV with 32-bit operands on x64 are equivalent in this respect: they both break dependency chains.
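A hedged sketch (not from the answer) of what the zero-extension rule means in practice: both functions compile to a single 32-bit mov or movzx whose result does not depend on the previous contents of the destination register.

#include <cstdint>

uint64_t via_mov32(uint32_t lo) { return lo; }   // mov eax, edi  (bits 32-63 of rax cleared)
uint64_t via_movzx(uint16_t lo) { return lo; }   // movzx eax, di (bits 16-63 of rax cleared)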
That's why your code executes faster. If there are no dependencies, the CPU can internally rename the registers, even though at the first sight it may seem that the second instruction modifies a register used by the first instruction, and the two cannot execute in parallel. But due to register renaming they can.
Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.
I hope it is now clear why the seemingly useless mov instructions can make a difference.

Where is the L1 memory cache of Intel x86 processors documented?

I am trying to profile and optimize algorithms and I would like to understand the specific impact of the caches on various processors. For recent Intel x86 processors (e.g. Q9300), it is very hard to find detailed information about cache structure. In particular, most web sites (including Intel.com) that post processor specs do not include any reference to L1 cache. Is this because the L1 cache does not exist or is this information for some reason considered unimportant? Are there any articles or discussions about the elimination of the L1 cache?
[edit]
After running various tests and diagnostic programs (mostly those discussed in the answers below), I have concluded that my Q9300 seems to have a 32K L1 data cache. I still haven't found a clear explanation as to why this information is so difficult to come by. My current working theory is that the details of L1 caching are now being treated as trade secrets by Intel.
It is near impossible to find specs on Intel caches. When I was teaching a class on caches last year, I asked friends inside Intel (in the compiler group) and they couldn't find specs.
But wait!!! Jed, bless his soul, tells us that on Linux systems, you can squeeze lots of information out of the kernel:
grep . /sys/devices/system/cpu/cpu0/cache/index*/*
This will give you associativity, set size, and a bunch of other information (but not latency).
For example, I learned that although AMD advertises their 128K L1 cache, my AMD machine has a split I and D cache of 64K each.
Two suggestions which are now mostly obsolete thanks to Jed:
AMD publishes a lot more information about its caches, so you can at least get some information about a modern cache. For example, last year's AMD L1 caches delivered two words per cycle (peak).
The open-source tool valgrind has all sorts of cache models inside it, and it is invaluable for profiling and understanding cache behavior. It comes with a very nice visualization tool kcachegrind which is part of the KDE SDK.
For example: in Q3 2008, AMD K8/K10 CPUs use 64 byte cache lines, with split 64kB L1I and 64kB L1D caches. L1D is 2-way associative and exclusive with L2, with a latency of 3 cycles. The L2 cache is 16-way associative, with a latency of about 12 cycles.
AMD Bulldozer-family CPUs use a split L1 with a 16kiB 4-way associative L1D per cluster (2 per core).
Intel CPUs have kept L1 the same for a long time (from Pentium M to Haswell to Skylake, and presumably many generations after that): Split 32kB each I and D caches, with L1D being 8-way associative. 64 byte cache lines, matching the burst-transfer size of DDR DRAM. Load-use latency is ~4 cycles.
Also see the x86 tag wiki for links to more performance and microarchitectural data.
This Intel Manual: Intel® 64 and IA-32 Architectures Optimization Reference Manual has a decent discussion of cache considerations.
Page 46, Section 2.2.5.1 Intel® 64 and IA-32 Architectures Optimization Reference Manual
Even MicroSlop is waking up to the need for more tools to monitor cache usage and performance, and has a GetLogicalProcessorInformation() function example (...while blazing new trails in creating ridiculously long function names in the process) I think I'll code up.
UPDATE I: Haswell increases cache load performance 2X, from Inside the Tock; Haswell's Architecture
If there were any doubt how critical it is to make the best possible use of cache, this presentation by Cliff Click, formerly of Azul, should dispel any and all doubt. In his words, "memory is the new disk!".
UPDATE II: SkyLake's significantly improved cache performance specifications.
You are looking at the consumer specifications, not the developer specifications. Here is the documentation you want. The cache sizes vary by processor family sub-models, so they typically are not in the IA-32 development manuals, but you can easily look them up on NewEgg and such.
Edit: More specifically: Chapter 10 of Volume 3A (Systems Programming Guide), Chapter 7 of the Optimization Reference Manual, and potentially something in the TLB page-caching manual, although I would assume that one is further out from the L1 than you care about.
I did some more investigating. There is a group at ETH Zurich who built a memory-performance evaluation tool which might be able to get information about the size at least (and maybe also associativity) of L1 and L2 caches. The program works by trying different read patterns experimentally and measuring the resulting throughput. A simplified version was used for the popular textbook by Bryant and O'Hallaron.
L1 caches exist on these platforms. This will almost definitely remain true until memory and front-side bus speeds exceed the speed of the CPU, which is very likely a long way off.
On Windows, you can use the GetLogicalProcessorInformation to get some level of cache information (size, line size, associativity, etc.) The Ex version on Win7 will give even more data, like which cores share which cache. CpuZ also gives this information.
Locality of Reference has a major impact on performance of some algorithms; The size and speed of L1, L2 (and on newer CPUs L3) cache obviously play a large part in this. Matrix multiplication is one such algorithm.
Intel Manual Vol. 2 specifies the following formula to compute cache size:
This Cache Size in Bytes
= (Ways + 1) * (Partitions + 1) * (Line_Size + 1) * (Sets + 1)
= (EBX[31:22] + 1) * (EBX[21:12] + 1) * (EBX[11:0] + 1) * (ECX + 1)
Where Ways, Partitions, Line_Size and Sets are queried using cpuid with eax set to 0x04 and ecx set to the index of the cache to query (the sub-leaf).
Providing the header file declaration
x86_cache_size.h:
unsigned int get_cache_line_size(unsigned int cache_level);
The implementation looks as follows:
;1st argument (edi) - the cache level, used as the sub-leaf index for CPUID leaf 4
;despite its name, this returns the total cache size in bytes (per the formula above)
get_cache_line_size:
push rbx
;sub-leaf index argument to be used with the CPUID instruction
mov ecx, edi
;CPUID leaf 4: deterministic cache parameters
mov eax, 0x04
cpuid
;line size = EBX[11:0] + 1
mov eax, ebx
and eax, 0xfff
inc eax
;partitions = EBX[21:12] + 1
shr ebx, 12
mov edx, ebx
and edx, 0x3ff
inc edx
mul edx
;ways of associativity = EBX[31:22] + 1
shr ebx, 10
mov edx, ebx
and edx, 0x3ff
inc edx
mul edx
;number of sets = ECX + 1 (ECX now holds the Sets field returned by CPUID)
inc ecx
mul ecx
pop rbx
ret
Which on my machine works as follows:
#include "x86_cache_size.h"
int main(void){
unsigned int L1_cache_size = get_cache_line_size(1);
unsigned int L2_cache_size = get_cache_line_size(2);
unsigned int L3_cache_size = get_cache_line_size(3);
//L1 size = 32768, L2 size = 262144, L3 size = 8388608
printf("L1 size = %u, L2 size = %u, L3 size = %u\n", L1_cache_size, L2_cache_size, L3_cache_size);
}
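A hedged alternative sketch (mine, not the answer's): the same CPUID leaf-4 query via GCC/Clang's <cpuid.h>, without hand-written assembly. The sub-leaf index enumerates the caches; enumeration stops when the cache-type field in EAX[4:0] reads 0.

#include <cpuid.h>
#include <cstdio>

static unsigned cache_size_bytes(unsigned sub_leaf)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(4, sub_leaf, &eax, &ebx, &ecx, &edx) || (eax & 0x1f) == 0)
        return 0;                                     // no cache at this sub-leaf
    unsigned ways       = (ebx >> 22) + 1;            // EBX[31:22] + 1
    unsigned partitions = ((ebx >> 12) & 0x3ff) + 1;  // EBX[21:12] + 1
    unsigned line_size  = (ebx & 0xfff) + 1;          // EBX[11:0]  + 1
    unsigned sets       = ecx + 1;                    // ECX + 1
    return ways * partitions * line_size * sets;
}

int main()
{
    for (unsigned i = 0; ; ++i) {
        unsigned size = cache_size_bytes(i);
        if (size == 0) break;
        std::printf("cache sub-leaf %u: %u bytes\n", i, size);
    }
}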

Resources