Unnecessary instructions generated for _mm_movemask_epi8 intrinsic in x64 mode - gcc

The intrinsic function _mm_movemask_epi8 from SSE2 is defined by Intel with the following prototype:
int _mm_movemask_epi8 (__m128i a);
This intrinsic function directly corresponds to the pmovmskb instruction, which is generated by all compilers.
According to this reference, the pmovmskb instruction can write the resulting integer mask to either a 32-bit or a 64-bit general-purpose register in x64 mode. In any case, only the 16 lower bits of the result can be nonzero, i.e. the result is surely within the range [0, 65535].
Speaking of the intrinsic function _mm_movemask_epi8, its return value is of type int, which is a signed 32-bit integer on most platforms. Unfortunately, there is no alternative function which returns a 64-bit integer in x64 mode. As a result:
The compiler usually generates the pmovmskb instruction with a 32-bit destination register (e.g. eax).
The compiler cannot assume that the upper 32 bits of the whole register (e.g. rax) are zero.
The compiler inserts an unnecessary instruction (e.g. mov eax, eax) to zero the upper half of the 64-bit register whenever the register is later used as a 64-bit value (e.g. as an array index).
An example of code and generated assembly with such a problem can be seen in this answer. Also the comments to that answer contain some related discussion. I regularly experience this problem with MSVC2013 compiler, but it seems that it is also present on GCC.
The questions are:
Why is this happening?
Is there any way to reliably avoid generation of unnecessary instructions on popular compilers? In particular, when result is used as index, i.e. in x = array[_mm_movemask_epi8(xmmValue)];
What is the approximate cost of unnecessary instructions like mov eax, eax on modern CPU architectures? Is there any chance that these instructions are completely eliminated by the CPU internally so that they do not actually occupy execution units (Agner Fog's instruction tables document mentions such a possibility)?

Why is this happening?
gcc's internal instruction definitions that tell it what pmovmskb does must be failing to inform it that the upper 32 bits of rax will always be zero. My guess is that it's treated like a function call return value, where the ABI allows a function returning a 32-bit int to leave garbage in the upper 32 bits of rax.
GCC does know about 32-bit operations in general zero-extending for free, but this missed optimization is widespread for intrinsics, also affecting scalar intrinsics like _mm_popcnt_u32.
There's also the issue of gcc (not) knowing that the actual result has set bits only in the low 16 bits of its 32-bit int result (unless you used AVX2 vpmovmskb ymm). So actual sign extension is unnecessary; implicit zero extension is totally fine.
Is there any way to reliably avoid generation of unnecessary instructions on popular compilers? In particular, when result is used as index, i.e. in x = array[_mm_movemask_epi8(xmmValue)];
No, other than fixing gcc. Has anyone reported this as a compiler missed-optimization bug?
clang doesn't have this bug. I added code to Paul R's test to actually use the result as an array index, and clang is still fine.
gcc always either zero- or sign-extends (to a different register in this case, perhaps because it wants to "keep" the 32-bit value in the bottom of RAX, not because it's optimizing for mov-elimination).
Casting to unsigned helps with GCC6 and later; it will use the pmovmskb result directly as part of an addressing mode, but returning it still results in a mov rax, rdx.
And with older GCC, the cast at least gets it to use mov instead of movsxd or cdqe.
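For example, a minimal sketch of the cast-to-unsigned workaround (the table and lookup function here are hypothetical, just to show the result being used as an array index):
#include <emmintrin.h>
#include <cstdint>

extern const uint8_t table[1 << 16];   // hypothetical 64 KiB lookup table

uint8_t lookup(__m128i v) {
    // Casting the int result to unsigned tells the compiler the value is
    // non-negative, so it can use the 32-bit pmovmskb result directly in
    // the addressing mode instead of emitting a movsxd/cdqe first.
    return table[(unsigned)_mm_movemask_epi8(v)];
}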
What is the approximate cost of unnecessary instructions like mov eax, eax on modern CPU architectures? Is there any chance that these instructions are completely eliminated by CPU internally and they do not actually occupy time of execution units (Agner Fog's instruction tables document mentions such a possibility).
mov same,same is never eliminated on SnB-family microarchitectures or AMD Zen. mov ecx, eax would be eliminated. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for details.
Even if it doesn't take an execution unit, it still takes a slot in the fused-domain part of the pipeline, and a slot in the uop-cache. And code-size. If you're close to the front-end 4 fused-domain uops per clock limit (pipeline width), then it's a problem.
It also costs an extra 1c of latency in the dep chain.
(Back-end throughput is not a problem, though. On Haswell and newer, it can run on port6 which has no vector execution units. On AMD, the integer ports are separate from the vector ports.)

gcc.godbolt.org is a great online resource for testing this kind of issue with different compilers.
clang seems to do the best with this, e.g.
#include <emmintrin.h>   // _mm_movemask_epi8 is SSE2, declared in emmintrin.h
#include <cstdint>

int32_t test32(const __m128i v) {
    int32_t mask = _mm_movemask_epi8(v);
    return mask;
}

int64_t test64(const __m128i v) {
    int64_t mask = _mm_movemask_epi8(v);
    return mask;
}
generates:
test32(long long __vector(2)): # #test32(long long __vector(2))
vpmovmskb eax, xmm0
ret
test64(long long __vector(2)): # #test64(long long __vector(2))
vpmovmskb eax, xmm0
ret
Whereas gcc generates an extra cdqe instruction in the 64-bit case:
test32(long long __vector(2)):
vpmovmskb eax, xmm0
ret
test64(long long __vector(2)):
vpmovmskb eax, xmm0
cdqe
ret

Related

What's the purpose of xchg ax,ax prior to the break instruction int 3 in DebugBreak()?

In MASM, I've always inserted a standalone break instruction
00007ff7`63141120 cc int 3
However, replacing that instruction with the MSVC DebugBreak function generates
KERNELBASE!DebugBreak:
00007ff8`6b159b90 6690 xchg ax,ax
00007ff8`6b159b92 cc int 3
00007ff8`6b159b93 c3 ret
I was surprised to see the xchg instruction prior to the break instruction
xchg ax,ax
As noted from another S.O. article:
Actually, xchg ax,ax is just how MS disassembles "66 90". 66 is the
operand size override, so it supposedly operates on ax instead of eax.
However, the CPU still executes it as a nop. The 66 prefix is used
here to make the instruction two bytes in size, usually for alignment
purposes.
MSVC, like most compilers, aligns functions to 16 byte boundaries.
Question What is the purpose of that xchg instruction?
MSVC generates a 2-byte nop before any single-byte instruction at the beginning of a function (except ret in empty functions1). I've tried the __halt, _enable, and _disable intrinsics and seen the same effect.
Apparently it is for patching. The /hotpatch option gives the same change for x86, and /hotpatch is not recognized on x64. According to the /hotpatch documentation, this is expected behavior (emphasis mine):
Because instructions are always two bytes or larger on the ARM architecture, and because x64 compilation is always treated as if /hotpatch has been specified, you don't have to specify /hotpatch when you compile for these targets;
So hotpatching support is unconditional for x64, and its result is seen in DebugBreak implementation.
See here: https://godbolt.org/z/1G737cErf
See this post on why it is needed for hotpatching: Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?. It looks like hotpatching is currently smart enough to use any instruction of two or more bytes, not just MOV EDI, EDI; still, it cannot use a single-byte instruction, as the two-byte backward jump may be written at the exact moment when the instruction pointer points at the second instruction.
1 As discussed in the comments, empty functions have a three-byte ret 0, although it is not apparent from MSVC assembly output, as it is represented there as just ret.

Are some general purpose registers faster than others?

In x86-64, will certain instructions execute faster if some general purpose registers are preferred over others?
For instance, would mov eax, ecx execute faster than mov r8d, ecx? I can imagine that the latter would need a REX prefix which would make the instruction fetch slower?
What about using rax instead of rcx? What about add or xor? Other operations? Smaller registers like r15b vs al? al vs ah?
AMD vs Intel? Newer processors? Older processors? Combinations of instructions?
Clarification: Should certain general purpose registers be preferred over others, and which ones are they?
In general, architectural registers are all equal, and renamed onto a large array of physical registers.
(Except partial registers can be slower, especially high-byte AH/BH/CH/DH, which are slow to read after writing the full register on Haswell and later. See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent and also Why doesn't GCC use partial registers? for problems when writing 8-bit and 16-bit registers. The rest of this answer is just going to consider 32/64-bit operand-size.)
But some instructions require specific registers, like legacy variable-count shifts (without BMI2 shrx etc.), which require the count in CL. Division requires the dividend in EDX:EAX (or RDX:RAX for the slower 64-bit version).
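For example (a minimal sketch; the exact register moves depend on the compiler and calling convention), a variable-count shift has to get its count into CL unless BMI2 shlx is available:
#include <cstdint>

uint32_t shift_left(uint32_t x, uint32_t n) {
    // Without BMI2, compilers typically emit something like
    //   mov ecx, esi / mov eax, edi / shl eax, cl
    // because the legacy shl instruction only takes a variable count in CL.
    // With -mbmi2 this can become a single shlx instead.
    return x << n;
}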
Using a call-preserved register like RBX means your function has to spend extra instructions saving/restoring it.
But of course there are perf differences if you need more instructions. So let's assume all else is equal, and just talk about the uops, latency, and code-size of a single instruction just by changing which register is used for one of its operands. TL:DR: the only perf difference is due to instruction-encoding restrictions / differences. Sometimes a different register will allow / require (or get the assembler to pick) a different encoding, which will often be smaller / larger as a special case, and sometimes even executes differently.
Generally smaller code is faster, and packs better in the uop cache and I-cache, so unless you've analyzed a specific case and found a problem, favour the smaller encoding. Often that means keeping a byte value in AL so you can use those special-case instructions, and avoiding RBP / R13 for pointers.
Special cases where a specific encoding is extra slow, not just size
LEA with RBP or R13 as a base can be slower on Intel if the addressing mode didn't already have a +displacement constant.
e.g. lea eax, [rbp + 12] is encodeable as-written, and is just as fast as lea eax, [rcx + 12].
But lea eax, [rbp + rcx*4] can only be encoded in machine code as lea eax, [rbp + rcx*4 + 0] (because of addressing mode escape-code stuff), which is a 3-component LEA, and thus slower on Intel (3 cycle latency on Sandybridge-family instead of 1 cycle, see https://agner.org/optimize/ instruction tables and microarch PDF). On AMD, having a scaled-index would already make it a slow-LEA even with lea eax, [rdx + rcx*4]
Outside of LEA, using RBP / R13 as the base in any addressing mode always requires a disp8/32 byte or dword, but I don't think the actual AGUs are slower for a 3-component addressing mode. So it's just a code-size effect.
Other cases include Which Intel microarchitecture introduced the ADC reg,0 single-uop special case? where the short-form 2-byte encoding for adc al, imm8 is 2 uops even on modern uarches like Skylake, where adc bl, imm8 is 1 uop.
So not only does the adc reg,0 special case not work for adc al,0 on Sandybridge through Haswell, Broadwell and newer forgot (or chose not to) optimize how that encoding decodes to uops. (Of course you could manually encode adc al,0 using the 3-byte Mod/RM encoding, but assemblers will always pick the shortest encoding so adc al,0 will assemble to the short form by default.) Only a problem with byte registers; adc eax,0 will use the opcode ModRM imm8 3-byte encoding, not 5-byte opcode imm32.
For other cases of op al,imm8, the only difference is code-size, which only indirectly matters for performance. (Because of decoding, uop-cache packing, and I-cache misses).
See Tips for golfing in x86/x64 machine code for more about special cases of code-size, like xchg eax, ecx being 1-byte vs. xchg edx, ecx being 2 bytes.
add rsp, 8 can need an extra stack-sync uop if there hasn't been an explicit use of RSP or ESP since the last push/pop/call/ret (along the path of execution of course, not in the static code layout). (What is the stack engine in the Sandybridge microarchitecture?). This is why compilers like clang use a dummy push or pop to reserve / free a single stack slot: Why does this function push RAX to the stack as the first operation?
LEA will be slower with EBP, RBP, or R13 as the base (PDF warning, page 3-22). But generally the answer is No.
Taking a step back, it's important to realize that since the advent of register renaming, architectural registers don't correspond directly to actual, physical registers on most micro-architectures. For example, each Cascade Lake core has a register file of 180 integer and 168 FP registers.
You have stuffed too many questions together; however, if I understood the question well, you are confusing the processor architecture with the small but fast register file, which fills the speed gap between the processor and memory technologies. The register file is small enough that it can only support one instruction at a time, i.e. the current instruction, and fast enough that it can almost keep up with the processor speed.
I would like to give a short background. The naming conventions of these registers serve two purposes: first, they keep older versions of the x86 ISA implementations compatible up to now, and second, every one of these register names has a special purpose besides its general-purpose use. For example, the ECX register is used as a counter to implement loops; instructions like JECXZ and LOOP use the ECX register exclusively. Though you need to watch out for some flags that you would not want to lose.
And now the answer to your question stems from the second purpose. Some registers would seem to be faster because these special registers are hardcoded into the processor and can be accessed much more quickly; however, the difference should not be much.
The second thing you might know is that not all instructions have the same complexity, especially in x86: the opcode of an instruction can be 1-3 bytes, and as more and more functionality is added to the instruction in terms of prefixes, addressing modes, etc., these instructions start to become slower. So it is not the case that some registers are slower than others; it is just that some registers are encoded into the instruction, and therefore those instructions run faster with that combination of registers. If otherwise used, it would seem slower. I hope that helps. Thanks

x86 mfence and C++ memory barrier

I'm checking how the compiler emits instructions for multi-core memory barriers on x86_64. The code below is the one I'm testing, using gcc_x86_64_8.3.
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> flag {false};
int any_value {0};

void set()
{
    any_value = 10;
    flag.store(true, std::memory_order_release);
}

void get()
{
    while (!flag.load(std::memory_order_acquire));
    assert(any_value == 10);
}

int main()
{
    std::thread a {set};
    get();
    a.join();
}
When I use std::memory_order_seq_cst, I can see that the MFENCE instruction is used at any optimization level (-O1, -O2, -O3). This instruction makes sure the store buffers are flushed, therefore updating their data in L1D cache (and using the MESI protocol to make sure other threads can see the effect).
However, when I use std::memory_order_release/acquire with no optimizations, the MFENCE instruction is also used, but it is omitted with -O1, -O2, -O3 optimizations, and I don't see other instructions that flush the buffers.
In the case where MFENCE is not used, what makes sure the store buffer data is committed to cache memory to ensure the memory order semantics?
Below is the assembly code for the get/set functions with -O3, like what we get on the Godbolt compiler explorer:
set():
mov DWORD PTR any_value[rip], 10
mov BYTE PTR flag[rip], 1
ret
.LC0:
.string "/tmp/compiler-explorer-compiler119218-62-hw8j86.n2ft/example.cpp"
.LC1:
.string "any_value == 10"
get():
.L8:
movzx eax, BYTE PTR flag[rip]
test al, al
je .L8
cmp DWORD PTR any_value[rip], 10
jne .L15
ret
.L15:
push rax
mov ecx, OFFSET FLAT:get()::__PRETTY_FUNCTION__
mov edx, 17
mov esi, OFFSET FLAT:.LC0
mov edi, OFFSET FLAT:.LC1
call __assert_fail
The x86 memory ordering model provides #StoreStore and #LoadStore barriers for all store instructions1, which is all that release semantics require. Also, the processor commits a store as soon as possible: when the store instruction has retired, the store is the oldest in the store buffer, the core has the target cache line in a writeable coherence state, and a cache port is available to perform the store operation2. So there is no need for an MFENCE instruction. The flag will become visible to the other thread as soon as possible, and when it does, any_value is guaranteed to be 10.
On the other hand, sequential consistency also requires #StoreLoad and #LoadLoad barriers. MFENCE is required to provide both3 barriers and so it is used at all optimization levels.
Related: Size of store buffers on Intel hardware? What exactly is a store buffer?.
Footnotes:
(1) There are exceptions that don't apply here. In particular, non-temporal stores and stores to the uncacheable write-combining memory types provide only the #LoadStore barrier. Anyway, these barriers are provided for stores to the write-back memory type on both Intel and AMD processors.
(2) This is in contrast to write-combining stores which are made globally-visible under certain conditions. See Section 11.3.1 of the Intel manual Volume 3.
(3) See the discussion under Peter's answer.
x86's TSO memory model is sequential-consistency + a store buffer, so only seq-cst stores need any special fencing. (Stalling after a store until the store buffer drains, before later loads, is all we need to recover sequential consistency). The weaker acq/rel model is compatible with the StoreLoad reordering caused by a store buffer.
(See discussion in comments re: whether "allowing StoreLoad reordering" is an accurate and sufficient description of what x86 allows. A core always sees its own stores in program order because loads snoop the store buffer, so you could say that store-forwarding also reorders loads of recently-stored data. Except you can't always: Globally Invisible load instructions)
(And BTW, compilers other than gcc use xchg to do a seq-cst store. This is actually more efficient on current CPUs. GCC's mov+mfence might have been cheaper in the past, but is currently usually worse even if you don't care about the old value. See Why does a std::atomic store with sequential consistency use XCHG? for a comparison between GCC's mov+mfence vs. xchg. Also my answer on Which is a better write barrier on x86: lock+addl or xchgl?)
Fun fact: you can achieve sequential consistency by instead fencing seq-cst loads instead of stores. But cheap loads are much more valuable than cheap stores for most use-cases, so everyone uses ABIs where the full barriers go on the stores.
See https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for details of how C++11 atomic ops map to asm instruction sequences for x86, PowerPC, ARMv7, ARMv8, and Itanium. Also When are x86 LFENCE, SFENCE and MFENCE instructions required?
when I use std::memory_order_release/acquire with no optimizations MFENCE instruction is also used
That's because flag.store(true, std::memory_order_release); doesn't inline, because you disabled optimization. That includes inlining of very simple member functions like std::atomic<T>::store(T, std::memory_order = std::memory_order_seq_cst).
When the ordering parameter to the __atomic_store_n() GCC builtin is a runtime variable (in the std::atomic<T>::store() header implementation), GCC plays it conservatively and promotes it to seq_cst.
It might actually be worth it for gcc to branch over mfence because it's so expensive, but that's not what we get. (But that would make larger code-size for functions with runtime variable order params, and the code path might not be hot. So branching is probably only a good idea in the libatomic implementation, or with profile-guided optimization for rare cases where a function is large enough to not inline but takes a variable order.)
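For contrast, here is a small sketch (with optimization enabled so the stores inline) showing that a compile-time-constant ordering compiles as described above: the release store is a plain mov, while the seq-cst store gets a full barrier (mov + mfence from GCC, xchg from clang):
#include <atomic>

std::atomic<bool> ready {false};

void publish_release() {
    // On x86 this inlines to a plain mov store: release ordering needs no
    // extra fence because normal stores already provide StoreStore and
    // LoadStore ordering.
    ready.store(true, std::memory_order_release);
}

void publish_seq_cst() {
    // A seq-cst store needs a full barrier after it on x86: GCC has
    // traditionally emitted mov + mfence here, while clang emits xchg,
    // as discussed above.
    ready.store(true, std::memory_order_seq_cst);
}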

Are scaled-index addressing modes a good idea?

Consider the following code:
void foo(int* __restrict__ a)
{
    int i; int val = 0;
    for (i = 0; i < 100; i++) {
        val = 2 * i;
        a[i] = val;
    }
}
This compiles (with maximum optimization but no unrolling or vectorization) into...
GCC 7.2:
foo(int*):
xor eax, eax
.L2:
mov DWORD PTR [rdi], eax
add eax, 2
add rdi, 4
cmp eax, 200
jne .L2
rep ret
clang 5.0:
foo(int*): # #foo(int*)
xor eax, eax
.LBB0_1: # =>This Inner Loop Header: Depth=1
mov dword ptr [rdi + 2*rax], eax
add rax, 2
cmp rax, 200
jne .LBB0_1
ret
What are the pros and cons of GCC's vs clang's approach? i.e. an extra variable incremented separately, vs multiplying via a more complex addressing mode?
Notes:
This question also relates to this one with roughly the same code, but using floats rather than ints.
Yes, take advantage of the power of x86 addressing modes to save uops, in cases where an index doesn't unlaminate into more extra uops than it would cost to do pointer increments.
(In many cases unrolling and using pointer increments is a win because of unlamination on Intel Sandybridge-family, but if you're not unrolling, or if you're only using mov loads instead of folding memory operands into ALU ops for micro-fusion, then indexed addressing modes often break even on some CPUs and are a win on others.)
It's essential to read and understand Micro fusion and addressing modes if you want to make optimal choices here. (And note that IACA gets it wrong, and doesn't simulate Haswell and later keeping some uops micro-fused, so you can't even just check your work by having it do static analysis for you.)
Indexed addressing modes are generally cheap. At worst they cost one extra uop for the front-end (on Intel SnB-family CPUs in some situations), and/or prevent a store-address uop from using port7 (which only supports base + displacement addressing modes). See Agner Fog's microarch pdf, and also David Kanter's Haswell write-up, for more about the store-AGU on port7 which Intel added in Haswell.
On Haswell+, if you need your loop to sustain more than 2 memory ops per clock, then avoid indexed stores.
At best they're free other than the code-size cost of the extra byte in the machine-code encoding. (Having an index register requires a SIB (Scale Index Base) byte in the encoding).
More often the only penalty is the 1 extra cycle of load-use latency vs. a simple [base + 0-2047] addressing mode, on Intel Sandybridge-family CPUs.
It's usually only worth using an extra instruction to avoid an indexed addressing mode if you're going to use that addressing mode in multiple instructions. (e.g. load / modify / store).
Scaling the index is free (on modern CPUs at least) if you're already using a 2-register addressing mode. For lea, Agner Fog's table lists AMD Ryzen as having 2c latency and only 2-per-clock throughput for lea with scaled-index addressing modes (or 3-component), otherwise 1c latency and 0.25c throughput (e.g. lea rax, [rcx + rdx] is faster than lea rax, [rcx + 2*rdx], but not by enough to be worth using extra instructions instead). Ryzen also doesn't like a 32-bit destination in 64-bit mode, for some reason. But the worst-case LEA is still not bad at all. And anyway, this is mostly unrelated to address-mode choice for loads, because most CPUs (other than in-order Atom) run LEA on the ALUs, not the AGUs used for actual loads/stores.
The main question is between one-register unscaled (so it can be a "base" register in the machine-code encoding: [base + idx*scale + disp]) or two-register. Note that for Intel's micro-fusion limitations, [disp32 + idx*scale] (e.g. indexing a static array) is an indexed addressing mode.
Neither function is totally optimal (even without considering unrolling or vectorization), but clang's looks very close.
The only thing clang could do better is save 2 bytes of code size by avoiding the REX prefixes with add eax, 2 and cmp eax, 200. It promoted all the operands to 64-bit because it's using them with pointers and I guess proved that the C loop doesn't need them to wrap, so it uses 64-bit everywhere in the asm. This is pointless; 32-bit operations are always at least as fast as 64-bit, and implicit zero-extension is free. But this only costs 2 bytes of code-size, and costs no performance other than indirect front-end effects.
You've constructed your loop so the compiler needs to keep a specific value in registers and can't totally transform the problem into just a pointer-increment + compare against an end pointer (which compilers often do when they don't need the loop variable for anything except array indexing).
You also can't transform it to counting a negative index up towards zero (which compilers never do, but which reduces the loop overhead to a total of 1 macro-fused add + branch uop on Intel CPUs, which can fuse add + jcc, while AMD can only fuse test or cmp with jcc).
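As an illustration of those two transformations, here are hand-written sketches of what the source could look like; whether a given compiler keeps exactly this shape in the generated asm is not guaranteed:
// Pointer increment + compare against an end pointer:
void foo_ptr(int* __restrict__ a)
{
    int val = 0;
    for (int* p = a; p != a + 100; ++p) {
        *p = val;
        val += 2;
    }
}

// Negative index counting up towards zero, so the loop overhead is a
// single add + branch (macro-fusable on Intel):
void foo_negidx(int* __restrict__ a)
{
    int* end = a + 100;
    for (long i = -100; i != 0; ++i) {
        end[i] = 2 * (int)(i + 100);
    }
}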
Clang has done a good job noticing that it can use 2*var as the array index (in bytes). This is a good optimization for tune=generic. The indexed store will un-laminate on Intel Sandybridge and Ivybridge, but stay micro-fused on Haswell and later. (And on other CPUs, like Nehalem, Silvermont, Ryzen, Jaguar, or whatever, there's no disadvantage.)
gcc's loop has 1 extra uop in the loop. It can still in theory run at 1 store per clock on Core2 / Nehalem, but it's right up against the 4 uops per clock limit. (And actually, Core2 can't macro-fuse the cmp/jcc in 64-bit mode, so it bottlenecks on the front-end).
Indexed addressing (in loads and stores, lea is different still) has some trade-offs, for example
On many µarchs, instructions that use indexed addressing have a slightly longer latency than instruction that don't. But usually throughput is a more important consideration.
On Netburst, stores with a SIB byte generate an extra µop, and therefore may cost throughput as well. This happens regardless of whether the SIB byte is there for indexed addressing or not, but indexed addressing always costs the extra µop. It does not apply to loads.
On Haswell/Broadwell (and still in Skylake/Kabylake), stores with indexed addressing cannot use port 7 for address generation, instead one of the more general address generation ports will be used, reducing the throughput available for loads.
So for loads it's usually good (or not bad) to use indexed addressing if it saves an add somewhere, unless they are part of a chain of dependent loads. For stores it's more dangerous to use indexed addressing. In the example code it shouldn't make a large difference. Saving the add is not really relevant, ALU instructions wouldn't be the bottleneck. The address generation happening in ports 2 or 3 doesn't matter either since there are no loads.

Why would introducing useless MOV instructions speed up a tight loop in x86_64 assembly?

Background:
While optimizing some Pascal code with embedded assembly language, I noticed an unnecessary MOV instruction, and removed it.
To my surprise, removing the unnecessary instruction caused my program to slow down.
I found that adding arbitrary, useless MOV instructions increased performance even further.
The effect is erratic, and changes based on execution order: the same junk instructions transposed up or down by a single line produce a slowdown.
I understand that the CPU does all kinds of optimizations and streamlining, but, this seems more like black magic.
The data:
A version of my code conditionally compiles three junk operations in the middle of a loop that runs 2**20==1048576 times. (The surrounding program just calculates SHA-256 hashes).
The results on my rather old machine (Intel(R) Core(TM)2 CPU 6400 @ 2.13 GHz):
avg time (ms) with -dJUNKOPS: 1822.84 ms
avg time (ms) without: 1836.44 ms
The programs were run 25 times in a loop, with the run order changing randomly each time.
Excerpt:
{$asmmode intel}
procedure example_junkop_in_sha256;
var s1, t2 : uint32;
begin
// Here are parts of the SHA-256 algorithm, in Pascal:
// s0 {r10d} := ror(a, 2) xor ror(a, 13) xor ror(a, 22)
// s1 {r11d} := ror(e, 6) xor ror(e, 11) xor ror(e, 25)
// Here is how I translated them (side by side to show symmetry):
asm
MOV r8d, a ; MOV r9d, e
ROR r8d, 2 ; ROR r9d, 6
MOV r10d, r8d ; MOV r11d, r9d
ROR r8d, 11 {13 total} ; ROR r9d, 5 {11 total}
XOR r10d, r8d ; XOR r11d, r9d
ROR r8d, 9 {22 total} ; ROR r9d, 14 {25 total}
XOR r10d, r8d ; XOR r11d, r9d
// Here is the extraneous operation that I removed, causing a speedup
// s1 is the uint32 variable declared at the start of the Pascal code.
//
// I had cleaned up the code, so I no longer needed this variable, and
// could just leave the value sitting in the r11d register until I needed
// it again later.
//
// Since copying to RAM seemed like a waste, I removed the instruction,
// only to discover that the code ran slower without it.
{$IFDEF JUNKOPS}
MOV s1, r11d
{$ENDIF}
// The next part of the code just moves on to another part of SHA-256,
// maj { r12d } := (a and b) xor (a and c) xor (b and c)
mov r8d, a
mov r9d, b
mov r13d, r9d // Set aside a copy of b
and r9d, r8d
mov r12d, c
and r8d, r12d { a and c }
xor r9d, r8d
and r12d, r13d { c and b }
xor r12d, r9d
// Copying the calculated value to the same s1 variable is another speedup.
// As far as I can tell, it doesn't actually matter what register is copied,
// but moving this line up or down makes a huge difference.
{$IFDEF JUNKOPS}
MOV s1, r9d // after mov r12d, c
{$ENDIF}
// And here is where the two calculated values above are actually used:
// T2 {r12d} := S0 {r10d} + Maj {r12d};
ADD r12d, r10d
MOV T2, r12d
end
end;
Try it yourself:
The code is online at GitHub if you want to try it out yourself.
My questions:
Why would uselessly copying a register's contents to RAM ever increase performance?
Why would the same useless instruction provide a speedup on some lines, and a slowdown on others?
Is this behavior something that could be exploited predictably by a compiler?
The most likely cause of the speed improvement is that:
inserting a MOV shifts the subsequent instructions to different memory addresses
one of those moved instructions was an important conditional branch
that branch was being incorrectly predicted due to aliasing in the branch prediction table
moving the branch eliminated the alias and allowed the branch to be predicted correctly
Your Core2 doesn't keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. One disadvantage of global branch prediction is that the history is diluted by irrelevant information if the different conditional jumps are uncorrelated.
This little branch prediction tutorial shows how branch prediction buffers work. The prediction buffer is indexed by the lower portion of the address of the branch instruction. This works well unless two important uncorrelated branches share the same lower bits. In that case, you end up with aliasing, which causes many mispredicted branches (which stall the instruction pipeline and slow down your program).
If you want to understand how branch mispredictions affect performance, take a look at this excellent answer: https://stackoverflow.com/a/11227902/1001643
Compilers typically don't have enough information to know which branches will alias and whether those aliases will be significant. However, that information can be determined at runtime with tools such as Cachegrind and VTune.
You may want to read http://research.google.com/pubs/pub37077.html
TL;DR: randomly inserting nop instructions in programs can easily increase performance by 5% or more, and no, compilers cannot easily exploit this. It's usually a combination of branch predictor and cache behaviour, but it can just as well be e.g. a reservation station stall (even in case there are no dependency chains that are broken or obvious resource over-subscriptions whatsoever).
I believe that in modern CPUs, assembly instructions, while being the last layer visible to a programmer for providing execution instructions to a CPU, are actually several layers removed from actual execution by the CPU.
Modern CPUs are RISC/CISC hybrids that translate CISC x86 instructions into internal instructions that are more RISC in behavior. Additionally there are out-of-order execution analyzers, branch predictors, Intel's "micro-ops fusion" that try to group instructions into larger batches of simultaneous work (kind of like the VLIW/Itanium titanic). There are even cache boundaries that could make the code run faster for god-knows-why if it's bigger (maybe the cache controller slots it more intelligently, or keeps it around longer).
CISC has always had an assembly-to-microcode translation layer, but the point is that with modern CPUs things are much much much more complicated. With all the extra transistor real estate in modern semiconductor fabrication plants, CPUs can probably apply several optimization approaches in parallel and then select the one at the end that provides the best speedup. The extra instructions may be biasing the CPU to use one optimization path that is better than others.
The effect of the extra instructions probably depends on the CPU model / generation / manufacturer, and isn't likely to be predictable. Optimizing assembly language this way would require execution against many CPU architecture generations, perhaps using CPU-specific execution paths, and would only be desirable for really really important code sections, although if you're doing assembly, you probably already know that.
Preparing the cache
Move operations to memory can prepare the cache and make subsequent move operations faster. A CPU usually has two load units and one store unit. A load unit can read from memory into a register (one read per cycle), and a store unit stores from a register to memory. There are also other units that do operations between registers. All the units work in parallel. So, on each cycle, we may do several operations at once, but no more than two loads, one store, and several register operations. Usually it is up to 4 simple operations with plain registers, up to 3 simple operations with XMM/YMM registers, and 1-2 complex operations with any kind of registers. Your code has lots of operations with registers, so one dummy memory store operation is free (since there are more than 4 register operations anyway), but it prepares the memory cache for the subsequent store operation. To find out how memory stores work, please refer to the Intel 64 and IA-32 Architectures Optimization Reference Manual.
Breaking the false dependencies
Although this does not refer exactly to your case, 32-bit mov operations on a 64-bit processor (as in your case) are sometimes used to clear the higher bits (32-63) of a register and break dependency chains.
It is well known that under x86-64, using 32-bit operands clears the higher bits of the 64-bit register. Please read the relevant section - 3.4.1.1 - of the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 1:
32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register
So the mov instructions, which may seem useless at first sight, clear the higher bits of the appropriate registers. What does this give us? It breaks dependency chains and allows the instructions to execute in parallel, in random order, by the out-of-order algorithm implemented internally by CPUs since the Pentium Pro in 1995.
A Quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.8:
Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core micro-architecture, a number of instructions can help clear execution dependency when software uses these instruction to clear register content to zero. Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
The MOVZX and MOV with 32-bit operands for x64 are equivalent - they all break dependency chains.
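A small illustration of the implicit zero-extension described in the quoted manual text (a sketch; the exact instruction the compiler picks may differ):
#include <cstdint>

uint64_t widen(uint32_t x) {
    // Writing a 32-bit register zeroes bits 32-63 of the containing 64-bit
    // register, so no separate zero-extension instruction is needed; this
    // typically compiles to a single mov eax, edi, and that mov also starts
    // a fresh dependency chain for rax instead of merging with its old value.
    return x;
}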
That's why your code executes faster. If there are no dependencies, the CPU can internally rename the registers, even though at first sight it may seem that the second instruction modifies a register used by the first instruction, so the two cannot execute in parallel. But due to register renaming they can.
Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.
I think you now see that it is quite obvious.

Resources