INC instruction vs ADD 1: Does it matter? - performance

From Ira Baxter answer on, Why do the INC and DEC instructions not affect the Carry Flag (CF)?
Mostly, I stay away from INC and DEC now, because they do partial condition code updates, and this can cause funny stalls in the pipeline, and ADD/SUB don't. So where it doesn't matter (most places), I use ADD/SUB to avoid the stalls. I use INC/DEC only when keeping the code small matters, e.g., fitting in a cache line where the size of one or two instructions makes enough difference to matter. This is probably pointless nano[literally!]-optimization, but I'm pretty old-school in my coding habits.
And I would like to ask why it can cause stalls in the pipeline while add doesn't? After all, both ADD and INC updates flag registers. The only difference is that INC doesn't update CF. But why it matters?

Update: the Efficiency cores on Alder Lake are Gracemont, and run inc reg as a single uop, but at only 1/clock, vs. 4/clock for add reg, 1 (https://uops.info/). This may be a false dependency on FLAGS like P4 had; the uops.info tests didn't try adding a dep-breaking instruction. Other than the TL:DR, I haven't updated other parts of this answer.
TL:DR/advice for modern CPUs: Probably use add; Intel Alder Lake's E-cores are relevant for "generic" tuning and seem to run inc slowly.
Other than Alder Lake and earlier Silvermont-family, use inc except with a memory destination; that's fine on mainstream Intel or any AMD. (e.g. like gcc -mtune=core2, -mtune=haswell, or -mtune=znver1). inc mem costs an extra uop vs. add on Intel P6 / SnB-family; the load can't micro-fuse.
If you care about Silvermont-family (including KNL in Xeon Phi, and some netbooks, chromebooks, and NAS servers), probably avoid inc. add 1 only costs 1 extra byte in 64-bit code, or 2 in 32-bit code. But it's not a performance disaster (just locally 1 extra ALU port used, not creating false dependencies or big stalls), so if you don't care much about SMont then don't worry about it.
Writing CF instead of leaving it unmodified can potentially be useful with other surrounding code that might benefit from CF dep-breaking, e.g. shifts. See below.
If you want to inc/dec without touching any flags, lea eax, [rax+1] runs efficiently and has the same code-size as add eax, 1. (Usually on fewer possible execution ports than add/inc, though, so add/inc are better when destroying FLAGS is not a problem. https://agner.org/optimize/)
On modern CPUs, add is never slower than inc (except for indirect code-size / decode effects), but usually it's not faster either, so you should prefer inc for code-size reasons. Especially if this choice is repeated many times in the same binary (e.g. if you are a compiler-writer).
inc saves 1 byte (64-bit mode), or 2 bytes (opcodes 0x40..F inc r32/dec r32 short form in 32-bit mode, re-purposed as the REX prefix for x86-64). This makes a small percentage difference in total code size. This helps instruction-cache hit rates, iTLB hit rate, and number of pages that have to be loaded from disk.
Advantages of inc:
code-size directly
Not using an immediate can have uop-cache effects on Sandybridge-family, which could offset the better micro-fusion of add. (See Agner Fog's table 9.1 in the Sandybridge section of his microarch guide.) Perf counters can easily measure issue-stage uops, but it's harder to measure how things pack into the uop cache and uop-cache read bandwidth effects.
Leaving CF unmodified is an advantage in some cases, on CPUs where you can read CF after inc without a stall. (Not on Nehalem and earlier.)
There is one exception among modern CPUs: Silvermont/Goldmont/Knight's Landing decodes inc/dec efficiently as 1 uop, but expands to 2 in the allocate/rename (aka issue) stage. The extra uop merges partial flags. inc throughput is only 1 per clock, vs. 0.5c (or 0.33c Goldmont) for independent add r32, imm8 because of the dep chain created by the flag-merging uops.
Unlike P4, the register result doesn't have a false-dep on flags (see below), so out-of-order execution takes the flag-merging off the latency critical path when nothing uses the flag result. (But the OOO window is much smaller than mainstream CPUs like Haswell or Ryzen.) Running inc as 2 separate uops is probably a win for Silvermont in most cases; most x86 instructions write all the flags without reading them, breaking these flag dependency chains.
SMont/KNL has a queue between decode and allocate/rename (See Intel's optimization manual, figure 16-2) so expanding to 2 uops during issue can fill bubbles from decode stalls (on instructions like one-operand mul, or pshufb, which produce more than 1 uop from the decoder and cause a 3-7 cycle stall for microcode). Or on Silvermont, just an instruction with more than 3 prefixes (including escape bytes and mandatory prefixes), e.g. REX + any SSSE3 or SSE4 instruction. But note that there is a ~28 uop loop buffer, so small loops don't suffer from these decode stalls.
inc/dec aren't the only instructions that decode as 1 but issue as 2: push/pop, call/ret, and lea with 3 components do this too. So do KNL's AVX512 gather instructions. Source: Intel's optimization manual, 17.1.2 Out-of-Order Engine (KNL). It's only a small throughput penalty (and sometimes not even that if anything else is a bigger bottleneck), so it's generally fine to still use inc for "generic" tuning.
Intel's optimization manual still recommends add 1 over inc in general, to avoid risks of partial-flag stalls. But since Intel's compiler doesn't do that by default, it's not too likely that future CPUs will make inc slow in all cases, like P4 did.
Clang 5.0 and Intel's ICC 17 (on Godbolt) do use inc when optimizing for speed (-O3), not just for size. -mtune=pentium4 makes them avoid inc/dec, but the default -mtune=generic doesn't put much weight on P4.
ICC17 -xMIC-AVX512 (equivalent to gcc's -march=knl) does avoid inc, which is probably a good bet in general for Silvermont / KNL. But it's not usually a performance disaster to use inc, so it's probably still appropriate for "generic" tuning to use inc/dec in most code, especially when the flag result isn't part of the critical path.
Other than Silvermont, this is mostly-stale optimization advice left over from Pentium4. On modern CPUs, there's only a problem if you actually read a flag that wasn't written by the last insn that wrote any flags. e.g. in BigInteger adc loops. (And in that case, you need to preserve CF so using add would break your code.)
add writes all the condition-flag bits in the EFLAGS register. Register-renaming makes write-only easy for out-of-order execution: see write-after-write and write-after-read hazards. add eax, 1 and add ecx, 1 can execute in parallel because they are fully independent of each other. (Even Pentium4 renames the condition flag bits separate from the rest of EFLAGS, since even add leaves the interrupts-enabled and many other bits unmodified.)
On P4, inc and dec depend on the previous value of the all the flags, so they can't execute in parallel with each other or preceding flag-setting instructions. (e.g. add eax, [mem] / inc ecx makes the inc wait until after the add, even if the add's load misses in cache.) This is called a false dependency. Partial-flag writes work by reading the old value of the flags, updating the bits other than CF, then writing the full flags.
All other out-of-order x86 CPUs (including AMD's), rename different parts of flags separately, so internally they do a write-only update to all the flags except CF. (source: Agner Fog's microarchitecture guide). Only a few instructions, like adc or cmc, truly read and then write flags. But also shl r, cl (see below).
Cases where add dest, 1 is preferable to inc dest, at least for Intel P6/SnB uarch families:
Memory-destination: add [rdi], 1 can micro-fuse the store and the load+add on Intel Core2 and SnB-family, so it's 2 fused-domain uops / 4 unfused-domain uops.
inc [rdi] can only micro-fuse the store, so it's 3F / 4U.
According to Agner Fog's tables, AMD and Silvermont run memory-dest inc and add the same, as a single macro-op / uop.
But beware of uop-cache effects with add [label], 1 which needs a 32-bit address and an 8-bit immediate for the same uop.
Before a variable-count shift/rotate to break the dependency on flags and avoid partial-flag merging: shl reg, cl has an input dependency on the flags, because of unfortunate CISC history: it has to leave them unmodified if the shift count is 0.
On Intel SnB-family, variable-count shifts are 3 uops (up from 1 on Core2/Nehalem). AFAICT, two of the uops read/write flags, and an independent uop reads reg and cl, and writes reg. It's a weird case of having better latency (1c + inevitable resource conflicts) than throughput (1.5c), and only being able to achieve max throughput if mixed with instructions that break dependencies on flags. (I posted more about this on Agner Fog's forum). Use BMI2 shlx when possible; it's 1 uop and the count can be in any register.
Anyway, inc (writing flags but leaving CF unmodified) before variable-count shl leaves it with a false dependency on whatever wrote CF last, and on SnB/IvB can require an extra uop to merge flags.
Core2/Nehalem manage to avoid even the false dep on flags: Merom runs a loop of 6 independent shl reg,cl instructions at nearly two shifts per clock, same performance with cl=0 or cl=13. Anything better than 1 per clock proves there's no input-dependency on flags.
I tried loops with shl edx, 2 and shl edx, 0 (immediate-count shifts), but didn't see a speed difference between dec and sub on Core2, HSW, or SKL. I don't know about AMD.
Update: The nice shift performance on Intel P6-family comes at the cost of a large performance pothole which you need to avoid: when an instruction depends on the flag-result of a shift instruction: The front end stalls until the instruction is retired. (Source: Intel's optimization manual, (Section 3.5.2.6: Partial Flag Register Stalls)). So shr eax, 2 / jnz is pretty catastrophic for performance on Intel pre-Sandybridge, I guess! Use shr eax, 2 / test eax,eax / jnz if you care about Nehalem and earlier. Intel's examples makes it clear this applies to immediate-count shifts, not just count=cl.
In processors based on Intel Core microarchitecture [this means Core 2 and later], shift immediate by 1 is handled by special hardware such that it does not experience partial flag stall.
Intel actually means the special opcode with no immediate, which shifts by an implicit 1. I think there is a performance difference between the two ways of encoding shr eax,1, with the short encoding (using the original 8086 opcode D1 /5) producing a write-only (partial) flag result, but the longer encoding (C1 /5, imm8 with an immediate 1) not having its immediate checked for 0 until execution time, but without tracking the flag output in the out-of-order machinery.
Since looping over bits is common, but looping over every 2nd bit (or any other stride) is very uncommon, this seems like a reasonable design choice. This explains why compilers like to test the result of a shift instead of directly using flag results from shr.
Update: for variable count shifts on SnB-family, Intel's optimization manual says:
3.5.1.6 Variable Bit Count Rotation and Shift
In Intel microarchitecture code name Sandy Bridge, The “ROL/ROR/SHL/SHR reg, cl” instruction has three micro-ops. When the flag result is not needed, one of these micro-ops may be discarded, providing
better performance in many common usages. When these instructions update partial flag results that are subsequently used, the full three micro-ops flow must go through the execution and retirement pipeline,
experiencing slower performance. In Intel microarchitecture code name Ivy Bridge, executing the full three micro-ops flow to use the updated partial flag result has additional delay.
Consider the looped sequence below:
loop:
shl eax, cl
add ebx, eax
dec edx ; DEC does not update carry, causing SHL to execute slower three micro-ops flow
jnz loop
The DEC instruction does not modify the carry flag. Consequently, the
SHL EAX, CL instruction needs to execute the three micro-ops flow in
subsequent iterations. The SUB instruction will update all flags. So
replacing DEC with SUB will allow SHL EAX, CL to execute the two
micro-ops flow.
Terminology
Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because they never need to be merged. It has false dependencies instead.
Several answers / comments mix up the terminology. They describe a false dependency, but then call it a partial-flag stall. It's a slowdown which happens because of writing only some of the flags, but the term "partial-flag stall" is what happens on pre-SnB Intel hardware when partial-flag writes have to be merged. Intel SnB-family CPUs insert an extra uop to merge flags without stalling. Nehalem and earlier stall for ~7 cycles. I'm not sure how big the penalty is on AMD CPUs.
(Note that partial-register penalties are not always the same as partial-flags, see below).
### Partial flag stall on Intel P6-family CPUs:
bigint_loop:
adc eax, [array_end + rcx*4] # partial-flag stall when adc reads CF
inc rcx # rcx counts up from negative values towards zero
# test rcx,rcx # eliminate partial-flag stalls by writing all flags, or better use add rcx,1
jnz
# this loop doesn't do anything useful; it's not normally useful to loop the carry-out back to the carry-in for the same accumulator.
# Note that `test` will change the input to the next adc, and so would replacing inc with add 1
In other cases, e.g. a partial flag write followed by a full flag write, or a read of only flags written by inc, is fine. On SnB-family CPUs, inc/dec can even macro-fuse with a jcc, the same as add/sub.
After P4, Intel mostly gave up on trying to get people to re-compile with -mtune=pentium4 or modify hand-written asm as much to avoid serious bottlenecks. (Tuning for a specific microarchitecture will always be a thing, but P4 was unusual in deprecating so many things that used to be fast on previous CPUs, and thus were common in existing binaries.) P4 wanted people to use a RISC-like subset of the x86, and also had branch-prediction hints as prefixes for JCC instructions. (It also had other serious problems, like the trace cache that just wasn't good enough, and weak decoders that meant bad performance on trace-cache misses. Not to mention the whole philosophy of clocking very high ran into the power-density wall.)
When Intel abandoned P4 (NetBurst uarch), they returned to P6-family designs (Pentium-M / Core2 / Nehalem) which inherited their partial-flag / partial-reg handling from earlier P6-family CPUs (PPro to PIII) which pre-dated the netburst mis-step. (Not everything about P4 was inherently bad, and some of the ideas re-appeared in Sandybridge, but overall NetBurst is widely considered a mistake.) Some very-CISC instructions are still slower than the multi-instruction alternatives, e.g. enter, loop, or bt [mem], reg (because the value of reg affects which memory address is used), but these were all slow in older CPUs so compilers already avoided them.
Pentium-M even improved hardware support for partial-regs (lower merging penalties). In Sandybridge, Intel kept partial-flag and partial-reg renaming and made it much more efficient when merging is needed (merging uop inserted with no or minimal stall). SnB made major internal changes and is considered a new uarch family, even though it inherits a lot from Nehalem, and some ideas from P4. (But note that SnB's decoded-uop cache is not a trace cache, though, so it's a very different solution to the decoder throughput/power problem that NetBurst's trace cache tried to solve.)
For example, inc al and inc ah can run in parallel on P6/SnB-family CPUs, but reading eax afterwards requires merging.
PPro/PIII stall for 5-6 cycles when reading the full reg. Core2/Nehalem stall for only 2 or 3 cycles while inserting a merging uop for partial regs, but partial flags are still a longer stall.
SnB inserts a merging uop without stalling, like for flags. Intel's optimization guide says that for merging AH/BH/CH/DH into the wider reg, inserting the merging uop takes an entire issue/rename cycle during which no other uops can be allocated. But for low8/low16, the merging uop is "part of the flow", so it apparently doesn't cause additional front-end throughput penalties beyond taking up one of the 4 slots in an issue/rename cycle.
In IvyBridge (or at least Haswell), Intel dropped partial-register renaming for low8 and low16 registers, keeping it only for high8 registers (AH/BH/CH/DH). Reading high8 registers has extra latency. Also, setcc al has a false dependency on the old value of rax, unlike in Nehalem and earlier (and probably Sandybridge). See this HSW/SKL partial-register performance Q&A for the details.
(I've previously claimed that Haswell could merge AH with no uop, but that's not true and not what Agner Fog's guide says. I skimmed too quickly and unfortunately repeated my wrong understanding in lots of comments and other posts.)
AMD CPUs, and Intel Silvermont, don't rename partial regs (other than flags), so mov al, [mem] has a false dependency on the old value of eax. (The upside is no partial-reg merging slowdowns when reading the full reg later.)
Normally, the only time add instead of inc will make your code faster on AMD or mainstream Intel is when your code actually depends on the doesn't-touch-CF behaviour of inc. i.e. usually add only helps when it would break your code, but note the shl case mentioned above, where the instruction reads flags but usually your code doesn't care about that, so it's a false dependency.
If you do actually want to leave CF unmodified, pre SnB-family CPUs have serious problems with partial-flag stalls, but on SnB-family the overhead of having the CPU merge the partial flags is very low, so it can be best to keep using inc or dec as part of a loop condition when targeting those CPU, with some unrolling. (For details, see the BigInteger adc Q&A I linked earlier). It can be useful to use lea to do arithmetic without affecting flags at all, if you don't need to branch on the result.
Skylake doesn't have partial-flag merging costs
Update: Skylake doesn't have partial-flag merging uops at all: CF is just a separate register from the rest of FLAGS. Instructions that need both parts (like cmovbe) read both inputs separately. That makes cmovbe a 2-uop instruction, but most other cmovcc instructions 1-uop on Skylake. See What is a Partial Flag Stall?.
adc only reads CF so it can be single-uop on Skylake with no interaction at all with an inc or dec in the same loop.
(TODO: rewrite earlier parts of this answer.)

Depending on the CPU implementation of the instructions, a partial register update may cause a stall. According to Agner Fog's optimization guide, page 62,
For historical reasons, the INC and DEC instructions leave the carry flag unchanged, while the other arithmetic flags are written to. This causes a false dependence on the previous value of the flags and costs an extra μop. To avoid these problems, it is recommended that you always use ADD and SUB instead of INC and DEC. For example, INC EAX should be replaced by ADD EAX,1.
See also page 83 on "Partial flags stalls" and page 100 on "Partial flags stall".

Related

Timing of executing 8 bit and 64 bit instructions on 64 bit x64/Amd64 processors

Is there any execution timing difference between 8 it and 64 bit instructions on 64 bit x64/Amd64 processor, when those instructions are similar/same except bit width?
Is there a way to find real processor timing of executing these 2 tiny assembly functions?
-Thanks.
; 64 bit instructions
add64:
mov $0x1, %rax
add $0x2, %rax
ret
; 8 bit instructions
add8:
mov $0x1, %al
add $0x2, %al
ret
Yes, there's a difference. mov $0x1, %al has a false dependency on the old value of RAX on most CPUs, including everything newer than Sandybridge. It's a 2-input 1-output instruction; from the CPU's point of view it's like add $1, %al as far as scheduling it independently or not relative to other uses of RAX. Only writing a 32 or 64-bit register starts a new dependency chain.
This means the AL return value of your add8 function might not be ready until after a cache miss for some independent work the caller happened to be doing in EAX before the call, but the RAX result of add64 could be ready right away for out-of-order execution to get started on later instructions in the caller that use the return value. (Assuming their other inputs are also ready.)
Why doesn't GCC use partial registers? and
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
and What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? - Important background for understanding performance on modern OoO exec CPUs.
Their code-size also differs: Both the 8-bit instructions are 2 bytes long. (Thanks to the AL, imm8 short-form encoding; add $1, %dl would be 3 bytes). The RAX instructions are 7 and 4 bytes long. This matters for L1i cache footprint (and on a large scale, for how many bytes have to get paged in from disk). On a small scale, how many instructions can fit into a 16 or 32-byte fetch block if the CPU is doing legacy decode because the code wasn't already hot in the uop cache. Also code-alignment of later instructions is affected by varying lengths of previous instructions, sometimes affecting which branches alias each other.
https://agner.org/optimize/ explains the details of the pipelines of various x86 microarchitectures, including front-end decoding effects that can make instruction-length matter beyond just code density in the I-cache / uop-cache.
Generally 32-bit operand-size is the most efficient (for performance, and pretty good for code-size). 32 and 8 are the operand-sizes that x86-64 can use without extra prefixes, and in practice with 8-bit to avoid stalls and badness you need more instructions or longer instructions because they don't zero-extend. The advantages of using 32bit registers/instructions in x86-64.
A few instructions are actually slower in the ALUs for 64-bit operand-size, not just front-end effects. That includes div on most CPUs, and imul on some older CPUs. Also popcnt and bswap. e.g. Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux
Note that mov $0x1, %rax will assemble to 7 bytes with GAS, unless you use as -O2 (not the same as gcc -O2, see this for examples) to get it to optimize to mov $1, %eax which exactly the same architectural effects, but is shorter (no REX or ModRM byte). Some assemblers do that optimization by default, but GAS doesn't. Why NASM on Linux changes registers in x86_64 assembly has more about why this optimization is safe and good, and why you should do it yourself in the source especially if your assembler doesn't do it for you.
But other than the false dep and code-size, they're the same for the back-end of the CPU: all those instructions are single-uop and can run on any scalar-integer ALU execution port1. (https://uops.info/ has automated test results for every form of every unprivileged instruction).
Footnote 1: Excavator (last-gen Bulldozer-family) can also run mov $imm, %reg on 2 more ports (AGU) for 32 and 64-bit operand-size. But merging a new low-8 or low-16 into a full register needs an ALU port. So mov $1, %rax has 4/clock throughput on Excavator, but mov $1, %al only has 2/clock throughput. (And of course only if you use a few different destination registers, not actually AL repeatedly; that would be a latency bottleneck of 1/clock because of the false dependency from writing a partial register on that microarchitecture.)
Previous Bulldozer-family CPUs starting with Piledriver can run mov reg, reg (for r32 or r64) on EX0, EX1, AGU0, AGU1, while most ALU instructions including mov $imm, %reg can only run on EX0/1. Further extending the AGU port's capabilities to also handle mov-immediate was a new feature in Excavator.
Fortunately Bulldozer was obsoleted by AMD's much better Zen architecture which has 4 full scalar integer ALU ports / execution units. (And a wider front end and a uop cache, good caches, and generally doesn't suck in a lot of the ways that Bulldozer sucked.)
Is there a way to measure it?
yes, but generally not in a function you call with call. Instead put it in an unrolled loop so you can run it lots of times with minimal other instructions. Especially useful to look at CPU performance counter results to find front-end / back-end uop counts, as well as just the overall time for your loop.
You can construct your loop to measure latency or throughput; see RDTSCP in NASM always returns the same value (timing a single instruction). Also:
Assembly - How to score a CPU instruction by latency and throughput
Idiomatic way of performance evaluation?
Can x86's MOV really be "free"? Why can't I reproduce this at all? is a good specific example of constructing a microbenchmark to measure / prove something specific.
Generally you don't need to measure yourself (although it's good to understand how, that helps you know what the measurements really mean). People have already done that for most CPU microarchitectures. You can predict performance for a specific CPU for some loops (if you can assume no stalls or cache misses) based on analyzing the instructions. Often that can predict performance fairly accurately, but medium-length dependency chains that OoO exec can only partially hide makes it too hard to accurately predict or account for every cycle.
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? has links to lots of good details, and stuff about CPU internals.
How many CPU cycles are needed for each assembly instruction? (you can't add up a cycle count for each instruction; front-end and back-end throughput, and latency, could each be the bottleneck for a loop.)

Are scaled-index addressing modes a good idea?

Consider the following code:
void foo(int* __restrict__ a)
{
int i; int val = 0;
for (i = 0; i < 100; i++) {
val = 2 * i;
a[i] = val;
}
}
This complies (with maximum optimization but no unrolling or vectorization) into...
GCC 7.2:
foo(int*):
xor eax, eax
.L2:
mov DWORD PTR [rdi], eax
add eax, 2
add rdi, 4
cmp eax, 200
jne .L2
rep ret
clang 5.0:
foo(int*): # #foo(int*)
xor eax, eax
.LBB0_1: # =>This Inner Loop Header: Depth=1
mov dword ptr [rdi + 2*rax], eax
add rax, 2
cmp rax, 200
jne .LBB0_1
ret
What are the pros and cons of GCC's vs clang's approach? i.e. an extra variable incremented separately, vs multiplying via a more complex addressing mode?
Notes:
This question also relates to this one with about the same code, but with float's rather than int's.
Yes, take advantage of the power of x86 addressing modes to save uops, in cases where an index doesn't unlaminate into more extra uops than it would cost to do pointer increments.
(In many cases unrolling and using pointer increments is a win because of unlamination on Intel Sandybridge-family, but if you're not unrolling or if you're only using mov loads instead of folding memory operands into ALU ops for micro-fusion, then indexed addressing modes are often break even on some CPUs and a win on others.)
It's essential to read and understand Micro fusion and addressing modes if you want to make optimal choices here. (And note that IACA gets it wrong, and doesn't simulate Haswell and later keeping some uops micro-fused, so you can't even just check your work by having it do static analysis for you.)
Indexed addressing modes are generally cheap. At worst they cost one extra uop for the front-end (on Intel SnB-family CPUs in some situations), and/or prevent a store-address uop from using port7 (which only supports base + displacement addressing modes). See Agner Fog's microarch pdf, and also David Kanter's Haswell write-up, for more about the store-AGU on port7 which Intel added in Haswell.
On Haswell+, if you need your loop to sustain more than 2 memory ops per clock, then avoid indexed stores.
At best they're free other than the code-size cost of the extra byte in the machine-code encoding. (Having an index register requires a SIB (Scale Index Base) byte in the encoding).
More often the only penalty is the 1 extra cycle of load-use latency vs. a simple [base + 0-2047] addressing mode, on Intel Sandybridge-family CPUs.
It's usually only worth using an extra instruction to avoid an indexed addressing mode if you're going to use that addressing mode in multiple instructions. (e.g. load / modify / store).
Scaling the index is free (on modern CPUs at least) if you're already using a 2-register addressing mode. For lea, Agner Fog's table lists AMD Ryzen as having 2c latency and only 2 per clock throughput for lea with scaled-index addressing modes (or 3-component), otherwise 1c latency and 0.25c throughput. e.g. lea rax, [rcx + rdx] is faster than lea rax, [rcx + 2*rdx], but not by enough to be worth using extra instructions instead.) Ryzen also doesn't like a 32-bit destination in 64-bit mode, for some reason. But the worst-case LEA is still not bad at all. And anyway, mostly unrelated to address-mode choice for loads, because most CPUs (other than in-order Atom) run LEA on the ALUs, not the AGUs used for actual loads/stores.
The main question is between one-register unscaled (so it can be a "base" register in the machine-code encoding: [base + idx*scale + disp]) or two-register. Note that for Intel's micro-fusion limitations, [disp32 + idx*scale] (e.g. indexing a static array) is an indexed addressing mode.
Neither function is totally optimal (even without considering unrolling or vectorization), but clang's looks very close.
The only thing clang could do better is save 2 bytes of code size by avoiding the REX prefixes with add eax, 2 and cmp eax, 200. It promoted all the operands to 64-bit because it's using them with pointers and I guess proved that the C loop doesn't need them to wrap, so in asm it uses 64-bit everywhere. This is pointless; 32-bit operations are always at least as fast as 64, and implicit zero-extension is free. But this only costs 2 bytes of code-size, and costs no performance other than indirect front-end effects from that.
You've constructed your loop so the compiler needs to keep a specific value in registers and can't totally transform the problem into just a pointer-increment + compare against an end pointer (which compilers often do when they don't need the loop variable for anything except array indexing).
You also can't transform to counting a negative index up towards zero (which compilers never do, but reduces the loop overhead to a total of 1 macro-fused add + branch uop on Intel CPUs (which can fuse add + jcc, while AMD can only fuse test or cmp / jcc).
Clang has done a good job noticing that it can use 2*var as the array index (in bytes). This is a good optimization for tune=generic. The indexed store will un-laminate on Intel Sandybridge and Ivybridge, but stay micro-fused on Haswell and later. (And on other CPUs, like Nehalem, Silvermont, Ryzen, Jaguar, or whatever, there's no disadvantage.)
gcc's loop has 1 extra uop in the loop. It can still in theory run at 1 store per clock on Core2 / Nehalem, but it's right up against the 4 uops per clock limit. (And actually, Core2 can't macro-fuse the cmp/jcc in 64-bit mode, so it bottlenecks on the front-end).
Indexed addressing (in loads and stores, lea is different still) has some trade-offs, for example
On many µarchs, instructions that use indexed addressing have a slightly longer latency than instruction that don't. But usually throughput is a more important consideration.
On Netburst, stores with a SIB byte generate an extra µop, and therefore may cost throughput as well. The SIB byte causes an extra µop regardless of whether you use it for indexes addressing or not, but indexed addressing always costs the extra µop. It doesn't apply to loads.
On Haswell/Broadwell (and still in Skylake/Kabylake), stores with indexed addressing cannot use port 7 for address generation, instead one of the more general address generation ports will be used, reducing the throughput available for loads.
So for loads it's usually good (or not bad) to use indexed addressing if it saves an add somewhere, unless they are part of a chain of dependent loads. For stores it's more dangerous to use indexed addressing. In the example code it shouldn't make a large difference. Saving the add is not really relevant, ALU instructions wouldn't be the bottleneck. The address generation happening in ports 2 or 3 doesn't matter either since there are no loads.

Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?

I'm doing micro-optimization on a performance critical part of my code and came across the sequence of instructions (in AT&T syntax):
add %rax, %rbx
mov %rdx, %rax
mov %rbx, %rdx
I thought I finally had a use case for xchg which would allow me to shave an instruction and write:
add %rbx, %rax
xchg %rax, %rdx
However, to my dimay I found from Agner Fog's instruction tables, that xchg is a 3 micro-op instruction with a 2 cycle latency on Sandy Bridge, Ivy Bridge, Broadwell, Haswell and even Skylake. 3 whole micro-ops and 2 cycles of latency! The 3 micro-ops throws off my 4-1-1-1 cadence and the 2 cycle latency makes it worse than the original in the best case since the last 2 instructions in the original might execute in parallel.
Now... I get that the CPU might be breaking the instruction into micro-ops that are equivalent to:
mov %rax, %tmp
mov %rdx, %rax
mov %tmp, %rdx
where tmp is an anonymous internal register and I suppose the last two micro-ops could be run in parallel so the latency is 2 cycles.
Given that register renaming occurs on these micro-architectures, though, it doesn't make sense to me that this is done this way. Why wouldn't the register renamer just swap the labels? In theory, this would have a latency of only 1 cycle (possibly 0?) and could be represented as a single micro-op so it would be much cheaper.
Supporting efficient xchg is non-trivial, and presumably not worth the extra complexity it would require in various parts of the CPU. A real CPU's microarchitecture is much more complicated than the mental model that you can use while optimizing software for it. For example, speculative execution makes everything more complicated, because it has to be able to roll back to the point where an exception occurred.
Making fxch efficient was important for x87 performance because the stack nature of x87 makes it (or alternatives like fld st(2)) hard to avoid. Compiler-generated FP code (for targets without SSE support) really does use fxch a significant amount. It seems that fast fxch was done because it was important, not because it's easy. Intel Haswell even dropped support for single-uop fxch. It's still zero-latency, but decodes to 2 uops on HSW and later (up from 1 in P5, and PPro through IvyBridge).
xchg is usually easy to avoid. In most cases, you can just unroll a loop so it's ok that the same value is now in a different register. e.g. Fibonacci with add rax, rdx / add rdx, rax instead of add rax, rdx / xchg rax, rdx. Compilers generally don't use xchg reg,reg, and usually hand-written asm doesn't either. (This chicken/egg problem is pretty similar to loop being slow (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?). loop would have been very useful for for adc loops on Core2/Nehalem where an adc + dec/jnz loop causes partial-flag stalls.)
Since xchg is still slow-ish on previous CPUs, compilers wouldn't start using it with -mtune=generic for several years. Unlike fxch or mov-elimination, a design-change to support fast xchg wouldn't help the CPU run most existing code faster, and would only enable performance gains over the current design in rare cases where it's actually a useful peephole optimization.
Integer registers are complicated by partial-register stuff, unlike x87
There are 4 operand sizes of xchg, 3 of which use the same opcode with REX or operand-size prefixes. (xchg r8,r8 is a separate opcode, so it's probably easier to make the decoders decode it differently from the others). The decoders already have to recognize xchg with a memory operand as special, because of the implicit lock prefix, but it's probably less decoder complexity (transistor-count + power) if the reg-reg forms all decode to the same number of uops for different operand sizes.
Making some r,r forms decode to a single uop would be even more complexity, because single-uop instructions have to be handled by the "simple" decoders as well as the complex decoder. So they would all need to be able to parse xchg and decide whether it was a single uop or multi-uop form.
AMD and Intel CPUs behave somewhat similarly from a programmer's perspective, but there are many signs that the internal implementation is vastly different. For example, Intel mov-elimination only works some of the time, limited by some kind of microarchitectural resources, but AMD CPUs that do mov-elimination do it 100% of the time (e.g. Bulldozer for the low lane of vector regs).
See Intel's optimization manual, Example 3-23. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions, where they discuss overwriting the zero-latency-movzx result right away to free up the internal resource sooner. (I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work significantly more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx instructions.)
I don't know exactly what needs tracking in a limited-size table(?) for mov-elimination. Probably it's related to needing to free register-file entries as soon as possible when they're no longer needed, because Physical Register File size limits rather than ROB size can be the bottleneck for the out-of-order window size. Swapping around indices might make this harder.
xor-zeroing is eliminated 100% of the time on Intel Sandybridge-family; it's assumed that this works by renaming to a physical zero register, and this register never needs to be freed.
If xchg used the same mechanism that mov-elimination does, it also could probably only work some of the time. It would need to decode to enough uops to work in cases where it isn't handled at rename. (Or else the issue/rename stage would have to insert extra uops when an xchg will take more than 1 uop, like it does when un-laminating micro-fused uops with indexed addressing modes that can't stay micro-fused in the ROB, or when inserting merging uops for flags or high-8 partial registers. But that's a significant complication that would only be worth doing if xchg was a common and important instruction.)
Note that xchg r32,r32 has to zero-extend both results to 64 bits, so it can't be a simple swap of RAT (Register Alias Table) entries. It would be more like truncating both registers in-place. And note that Intel CPUs never eliminate mov same,same. It does already need to support mov r32,r32 and movzx r32, r8 with no execution port, so presumably it has some bits that indicate that rax = al or something. (And yes, Intel HSW/SKL do that, not just Ivybridge, despite what Agner's microarch guide says.)
We know P6 and SnB had upper-zeroed bits like this, because xor eax,eax before setz al avoids a partial-register stall when reading eax. HSW/SKL never rename al separately in the first place, only ah. It may not be a coincidence that partial-register renaming (other than AH) seems to have been dropped in the same uarch that introduced mov-elimination (Ivybridge). Still, setting that bit for 2 registers at once would be a special case that required special support.
xchg r64,r64 could maybe just swap the RAT entries, but decoding that differently from the r32 case is yet another complication. It might still need to trigger partial-register merging for both inputs, but add r64,r64 needs to do that, too.
Also note that an Intel uop (other than fxch) only ever produces one register result (plus flags). Not touching flags doesn't "free up" an output slot; For example mulx r64,r64,r64 still takes 2 uops to produce 2 integer outputs on HSW/SKL, even though all the "work" is done in the multiply unit on port 1, same as with mul r64 which does produce a flag result.)
Even if it is as simple as "swap the RAT entries", building a RAT that supports writing more than one entry per uop is a complication. What to do when renaming 4 xchg uops in a single issue group? It seems to me like it would make the logic significantly more complicated. Remember that this has to be built out of logic gates / transistors. Even if you say "handle that special case with a trap to microcode", you have to build the whole pipeline to support the possibility that that pipeline stage could take that kind of exception.
Single-uop fxch requires support for swapping RAT entries (or some other mechanism) in the FP RAT (fRAT), but it's a separate block of hardware from the integer RAT (iRAT). Leaving out that complication in the iRAT seems reasonable even if you have it in the fRAT (pre-Haswell).
Issue/rename complexity is definitely an issue for power consumption, though. Note that Skylake widened a lot of the front-end (legacy decode and uop cache fetch), and retirement, but kept the 4-wide issue/rename limit. SKL also added replicated execution units on more port in the back-end, so issue bandwidth is a bottleneck even more of the time, especially in code with a mix of loads, stores, and ALU.
The RAT (or the integer register file, IDK) may even have limited read ports, since there seem to be some front-end bottlenecks in issuing/renaming many 3-input uops like add rax, [rcx+rdx]. I posted some microbenchmarks (this and the follow-up post) showing Skylake being faster than Haswell when reading lots of registers, e.g. with micro-fusion of indexed addressing modes. Or maybe the bottleneck there was really some other microarchitectural limit.
But how does 1-uop fxch work? IDK how it's done in Sandybridge / Ivybridge. In P6-family CPUs, an extra remapping table exists basically to support FXCH. That might only be needed because P6 uses a Retirement Register File with 1 entry per "logical" register, instead of a physical register file (PRF). As you say, you'd expect it to be simpler when even "cold" register values are just a pointer to a PRF entry. (Source: US patent 5,499,352: Floating point register alias table FXCH and retirement floating point register array (describes Intel's P6 uarch).
One main reason the rfRAT array 802 is included within the present invention fRAT logic is a direct result of the manner in which the present invention implements the FXCH instruction.
(Thanks Andy Glew (#krazyglew), I hadn't thought of looking up patents to find out about CPU internals.) It's pretty heavy going, but may provide some insight into the bookkeeping needed for speculative execution.
Interesting tidbit: the patent describes integer as well, and mentions that there are some "hidden" logical registers which are reserved for use by microcode. (Intel's 3-uop xchg almost certain uses one of these as a temporary.)
We might be able to get some insight from looking at what AMD does.
Interestingly, AMD has 2-uop xchg r,r in K10, Bulldozer-family, Bobcat/Jaguar, and Ryzen. (But Jaguar xchg r8,r8 is 3 uops. Maybe to support the xchg ah,al corner case without a special uop for swapping the low 16 of a single reg).
Presumably both uops read the old values of the input architectural registers before the first one updates the RAT. IDK exactly how this works, since they aren't necessarily issued/renamed in the same cycle (but they are at least contiguous in the uop flow, so at worst the 2nd uop is the first uop in the next cycle). I have no idea if Haswell's 2-uop fxch works similarly, or if they're doing something else.
Ryzen is a new architecture designed after mov-elimination was "invented", so presumably they take advantage of it wherever possible. (Bulldozer-family renames vector moves (but only for the low 128b lane of YMM vectors); Ryzen is the first AMD architecture to do it for GP regs too.) xchg r32,r32 and r64,r64 are zero-latency (renamed), but still 2 uops each. (r8 and r16 need an execution unit, because they merge with the old value instead of zero-extending or copying the entire reg, but are still only 2 uops).
Ryzen's fxch is 1 uop. AMD (like Intel) probably isn't spending a lot of transistors on making x87 fast (e.g. fmul is only 1 per clock and on the same port as fadd), so presumably they were able to do this without a lot of extra support. Their micro-coded x87 instructions (like fyl2x) are faster than on recent Intel CPUs, so maybe Intel cares even less (at least about the microcoded x87 instruction).
Maybe AMD could have made xchg r64,r64 a single uop too, more easily than Intel. Maybe even xchg r32,r32 could be single uop, since like Intel it needs to support mov r32,r32 zero-extension with no execution port, so maybe it could just set whatever "upper 32 zeroed" bit exists to support that. Ryzen doesn't eliminate movzx r32, r8 at rename, so presumably there's only an upper32-zero bit, not bits for other widths.
What Intel might be able to do cheaply if they wanted to:
It's possible that Intel could support 2-uop xchg r,r the way Ryzen does (zero latency for the r32,r32 and r64,r64 forms, or 1c for the r8,r8 and r16,r16 forms) without too much extra complexity in critical parts of the core, like the issue/rename and retirement stages that manage the Register Alias Table (RAT). But maybe not, if they can't have 2 uops read the "old" value of a register when the first uop writes it.
Stuff like xchg ah,al is definitely a extra complication, since Intel CPUs don't rename partial registers separately anymore, except AH/BH/CH/DH.
xchg latency in practice on current hardware
Your guess about how it might work internally is good. It almost certainly uses one of the internal temporary registers (accessible only to microcode). Your guess about how they can reorder is too limited, though.
In fact, one direction has 2c latency and the other direction has ~1c latency.
00000000004000e0 <_start.loop>:
4000e0: 48 87 d1 xchg rcx,rdx # slow version
4000e3: 48 83 c1 01 add rcx,0x1
4000e7: 48 83 c1 01 add rcx,0x1
4000eb: 48 87 ca xchg rdx,rcx
4000ee: 48 83 c2 01 add rdx,0x1
4000f2: 48 83 c2 01 add rdx,0x1
4000f6: ff cd dec ebp
4000f8: 7f e6 jg 4000e0 <_start.loop>
This loop runs in ~8.06 cycles per iteration on Skylake. Reversing the xchg operands makes it run in ~6.23c cycles per iteration (measured with perf stat on Linux). uops issued/executed counters are equal, so no elimination happened. It looks like the dst <- src direction is the slow one, since putting the add uops on that dependency chain makes things slower than when they're on the dst -> src dependency chain.
If you ever want to use xchg reg,reg on the critical path (code-size reasons?), do it with the dst -> src direction on the critical path, because that's only about 1c latency.
Other side-topics from comments and the question
The 3 micro-ops throws off my 4-1-1-1 cadence
Sandybridge-family decoders are different from Core2/Nehalem. They can produce up to 4 uops total, not 7, so the patterns are 1-1-1-1, 2-1-1, 3-1, or 4.
Also beware that if the last uop is one that can macro-fuse, they will hang onto it until the next decode cycle in case the first instruction in the next block is a jcc. (This is a win when code runs multiple times from the uop cache for each time it's decoded. And that's still usually 3 uops per clock decode throughput.)
Skylake has an extra "simple" decoder so it can do 1-1-1-1-1 up to 4-1 I guess, but > 4 uops for one instruction still requires the microcode ROM. Skylake beefed up the uop cache, too, and can often bottleneck on the 4 fused-domain uops per clock issue/rename throughput limit if the back-end (or branch misses) aren't a bottleneck first.
I'm literally searching for ~1% speed bumps so hand optimization has been working out on the main loop code. Unfortunately that's ~18kB of code so I'm not even trying to consider the uop cache anymore.
That seems kinda crazy, unless you're mostly limiting yourself to asm-level optimization in shorter loops inside your main loop. Any inner loops within the main loop will still run from the uop cache, and that should probably be where you're spending most of your time optimizing. Compilers usually do a good-enough job that it's not practical for a human to do much over a large scale. Try to write your C or C++ in such a way that the compiler can do a good job with it, of course, but looking for tiny peephole optimizations like this over 18kB of code seems like going down the rabbit hole.
Use perf counters like idq.dsb_uops vs. uops_issued.any to see how many of your total uops came from the uop cache (DSB = Decoded Stream Buffer or something). Intel's optimization manual has some suggestions for other perf counters to look at for code that doesn't fit in the uop cache, such as DSB2MITE_SWITCHES.PENALTY_CYCLES. (MITE is the legacy-decode path). Search the pdf for DSB to find a few places it's mentioned.
Perf counters will help you find spots with potential problems, e.g. regions with higher than average uops_issued.stall_cycles could benefit from finding ways to expose more ILP if there are any, or from solving a front-end problem, or from reducing branch-mispredicts.
As discussed in comments, a single uop produces at most 1 register result
As an aside, with a mul %rbx, do you really get %rdx and %rax all at once or does the ROB technically have access to the lower part of the result one cycle earlier than the higher part? Or is it like the "mul" uop goes into the multiplication unit and then the multiplication unit issues two uops straight into the ROB to write the result at the end?
Terminology: the multiply result doesn't go into the ROB. It goes over the forwarding network to whatever other uops read it, and goes into the PRF.
The mul %rbx instruction decodes to 2 uops in the decoders. They don't even have to issue in the same cycle, let alone execute in the same cycle.
However, Agner Fog's instruction tables only list a single latency number. It turns out that 3 cycles is the latency from both inputs to RAX. The minimum latency for RDX is 4c, according to InstlatX64 testing on both Haswell and Skylake-X.
From this, I conclude that the 2nd uop is dependent on the first, and exists to write the high half of the result to an architectural register. The port1 uop produces a full 128b multiply result.
I don't know where the high-half result lives until the p6 uop reads it. Perhaps there's some sort of internal queue between the multiply execution unit and hardware connected to port 6. By scheduling the p6 uop with a dependency on the low-half result, that might arrange for the p6 uops from multiple in-flight mul instructions to run in the correct order. But then instead of actually using that dummy low-half input, the uop would take the high half result from the queue output in an execution unit that's connected to port 6 and return that as the result. (This is pure guess work, but I think it's plausible as one possible internal implementation. See comments for some earlier ideas).
Interestingly, according to Agner Fog's instruction tables, on Haswell the two uops for mul r64 go to ports 1 and 6. mul r32 is 3 uops, and runs on p1 + p0156. Agner doesn't say whether that's really 2p1 + p0156 or p1 + 2p0156 like he does for some other insns. (However, he says that mulx r32,r32,r32 runs on p1 + 2p056 (note that p056 doesn't include p1).)
Even more strangely, he says that Skylake runs mulx r64,r64,r64 on p1 p5 but mul r64 on p1 p6. If that's accurate and not a typo (which is a possibility), it pretty much rules out the possibility that the extra uop is an upper-half multiplier.

Deterministic execution time on x86-64

Is there an x64 instruction(s) that takes a fixed amount of time, regardless of the micro-architectural state such as caches, branch predictors, etc.?
For instance, if a hypothetical add or increment instruction always takes n cycles, then I can implement a timer in my program by performing that add instruction multiple times. Perhaps an increment instruction with register operands may work, but it's not clear to me whether Intel's spec guarantees that it would take deterministic number of cycles. Note that I am not interested in current time, but only a primitive / instruction sequence that takes a fixed number of cycles.
Assume that I have a way to force atomic execution i.e. no context switches during timer's execution i.e. only my program gets to run.
On a related note, I also cannot use system services to keep track of time, because I am working in a setting where my program is a user-level program running on an untrusted OS.
The x86 ISA documents don't guarantee anything about what takes a certain amount of cycles. The ISA allows things like Transmeta's Crusoe that JIT-compiled x86 instructions to an internal VLIW instruction set. It could conceivably do optimizations between adjacent instructions.
The best you can do is write something that will work on as many known microarchitectures as possible. I'm not aware of any x86-64 microarchitectures that are "weird" like Transmeta, only the usual superscalar decode-to-uops designs like Intel and AMD use.
Simple integer ALU instructions like ADD are almost all 1c latency, and tiny loops that don't touch memory are almost totally unaffected anything, and are very predictable. If they run a lot of iterations, they're also almost totally unaffected by anything to do with the impact of surrounding code on the out-of-order core, and recover very quickly from disruptions like timer interrupts.
On nearly every Intel microarchitecture, this loop will run at one iteration per clock:
mov ecx, 1234567 ; or use a 64-bit register for higher counts.
ALIGN 16
.loop:
sub ecx, 1 ; not dec because of Pentium 4.
jnz .loop
Agner Fog's microarch guide and instruction tables say that VIA Nano3000 has a taken-branch throughput of one per 3 cycles, so this loop would only run at one iteration per 3 clocks there. AMD Bulldozer-family and Jaguar similarly have a max throughput of one taken JCC per 2 clocks.
See also other performance links in the x86 tag wiki.
If you want a more power-efficient loop, you could use PAUSE in the loop, but it waits ~100 cycles on Skylake, up from ~5 cycles on previous microarchitectures. (You can make cycle-accurate predictions for more complicated loops that don't touch memory, but that depends on microarchitectural details.)
You could make a more reliable loop that's less likely to have different bottlenecks on different CPUs by making a longer dependency chain within each iteration. Since each instruction depends on the previous, it can still only run at one instruction per cycle (not counting the branch), drastically the branches per cycle.
# one add/sub per clock, limited by latency
# should run one iteration per 6 cycles on every CPU listed in Agner Fog's tables
# And should be the same on all future CPUs unless they do magic inter-instruction optimizations.
# Or it could be slower on CPUs that always have a bubble on taken branches, but it seems unlikely anyone would design one.
ALIGN 16
.loop:
add ecx, 1
sub ecx, 1 ; net result ecx+0
add ecx, 1
sub ecx, 1 ; net result ecx+0
add ecx, 1
sub ecx, 2 ; net result ecx-1
jnz .loop
Unrolling like this ensures that front-end effects are not a bottleneck. It gives the frontend decoders plenty of time to queue up the 6 add/sub insns and the jcc before the next branch.
Using add/sub instead of dec/inc avoids a partial-flag false dependency on Pentium 4. (Although I don't think that would be an issue anyway.)
Pentium4's double-clocked ALUs can each run two ADDs per clock, but the latency is still one cycle. i.e. apparently it can't forward a result internally to chew through this dependency chain twice as fast as any other CPU.
And yes, Prescott P4 is an x86-64 CPU, so we can't quite ignore P4 if we need a general purpose answer.

Why would introducing useless MOV instructions speed up a tight loop in x86_64 assembly?

Background:
While optimizing some Pascal code with embedded assembly language, I noticed an unnecessary MOV instruction, and removed it.
To my surprise, removing the un-necessary instruction caused my program to slow down.
I found that adding arbitrary, useless MOV instructions increased performance even further.
The effect is erratic, and changes based on execution order: the same junk instructions transposed up or down by a single line produce a slowdown.
I understand that the CPU does all kinds of optimizations and streamlining, but, this seems more like black magic.
The data:
A version of my code conditionally compiles three junk operations in the middle of a loop that runs 2**20==1048576 times. (The surrounding program just calculates SHA-256 hashes).
The results on my rather old machine (Intel(R) Core(TM)2 CPU 6400 # 2.13 GHz):
avg time (ms) with -dJUNKOPS: 1822.84 ms
avg time (ms) without: 1836.44 ms
The programs were run 25 times in a loop, with the run order changing randomly each time.
Excerpt:
{$asmmode intel}
procedure example_junkop_in_sha256;
var s1, t2 : uint32;
begin
// Here are parts of the SHA-256 algorithm, in Pascal:
// s0 {r10d} := ror(a, 2) xor ror(a, 13) xor ror(a, 22)
// s1 {r11d} := ror(e, 6) xor ror(e, 11) xor ror(e, 25)
// Here is how I translated them (side by side to show symmetry):
asm
MOV r8d, a ; MOV r9d, e
ROR r8d, 2 ; ROR r9d, 6
MOV r10d, r8d ; MOV r11d, r9d
ROR r8d, 11 {13 total} ; ROR r9d, 5 {11 total}
XOR r10d, r8d ; XOR r11d, r9d
ROR r8d, 9 {22 total} ; ROR r9d, 14 {25 total}
XOR r10d, r8d ; XOR r11d, r9d
// Here is the extraneous operation that I removed, causing a speedup
// s1 is the uint32 variable declared at the start of the Pascal code.
//
// I had cleaned up the code, so I no longer needed this variable, and
// could just leave the value sitting in the r11d register until I needed
// it again later.
//
// Since copying to RAM seemed like a waste, I removed the instruction,
// only to discover that the code ran slower without it.
{$IFDEF JUNKOPS}
MOV s1, r11d
{$ENDIF}
// The next part of the code just moves on to another part of SHA-256,
// maj { r12d } := (a and b) xor (a and c) xor (b and c)
mov r8d, a
mov r9d, b
mov r13d, r9d // Set aside a copy of b
and r9d, r8d
mov r12d, c
and r8d, r12d { a and c }
xor r9d, r8d
and r12d, r13d { c and b }
xor r12d, r9d
// Copying the calculated value to the same s1 variable is another speedup.
// As far as I can tell, it doesn't actually matter what register is copied,
// but moving this line up or down makes a huge difference.
{$IFDEF JUNKOPS}
MOV s1, r9d // after mov r12d, c
{$ENDIF}
// And here is where the two calculated values above are actually used:
// T2 {r12d} := S0 {r10d} + Maj {r12d};
ADD r12d, r10d
MOV T2, r12d
end
end;
Try it yourself:
The code is online at GitHub if you want to try it out yourself.
My questions:
Why would uselessly copying a register's contents to RAM ever increase performance?
Why would the same useless instruction provide a speedup on some lines, and a slowdown on others?
Is this behavior something that could be exploited predictably by a compiler?
The most likely cause of the speed improvement is that:
inserting a MOV shifts the subsequent instructions to different memory addresses
one of those moved instructions was an important conditional branch
that branch was being incorrectly predicted due to aliasing in the branch prediction table
moving the branch eliminated the alias and allowed the branch to be predicted correctly
Your Core2 doesn't keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. One disadvantage of global branch prediction is that the history is diluted by irrelevant information if the different conditional jumps are uncorrelated.
This little branch prediction tutorial shows how branch prediction buffers work. The cache buffer is indexed by the lower portion of the address of the branch instruction. This works well unless two important uncorrelated branches share the same lower bits. In that case, you end-up with aliasing which causes many mispredicted branches (which stalls the instruction pipeline and slowing your program).
If you want to understand how branch mispredictions affect performance, take a look at this excellent answer: https://stackoverflow.com/a/11227902/1001643
Compilers typically don't have enough information to know which branches will alias and whether those aliases will be significant. However, that information can be determined at runtime with tools such as Cachegrind and VTune.
You may want to read http://research.google.com/pubs/pub37077.html
TL;DR: randomly inserting nop instructions in programs can easily increase performance by 5% or more, and no, compilers cannot easily exploit this. It's usually a combination of branch predictor and cache behaviour, but it can just as well be e.g. a reservation station stall (even in case there are no dependency chains that are broken or obvious resource over-subscriptions whatsoever).
I believe in modern CPUs the assembly instructions, while being the last visible layer to a programmer for providing execution instructions to a CPU, actually are several layers from actual execution by the CPU.
Modern CPUs are RISC/CISC hybrids that translate CISC x86 instructions into internal instructions that are more RISC in behavior. Additionally there are out-of-order execution analyzers, branch predictors, Intel's "micro-ops fusion" that try to group instructions into larger batches of simultaneous work (kind of like the VLIW/Itanium titanic). There are even cache boundaries that could make the code run faster for god-knows-why if it's bigger (maybe the cache controller slots it more intelligently, or keeps it around longer).
CISC has always had an assembly-to-microcode translation layer, but the point is that with modern CPUs things are much much much more complicated. With all the extra transistor real estate in modern semiconductor fabrication plants, CPUs can probably apply several optimization approaches in parallel and then select the one at the end that provides the best speedup. The extra instructions may be biasing the CPU to use one optimization path that is better than others.
The effect of the extra instructions probably depends on the CPU model / generation / manufacturer, and isn't likely to be predictable. Optimizing assembly language this way would require execution against many CPU architecture generations, perhaps using CPU-specific execution paths, and would only be desirable for really really important code sections, although if you're doing assembly, you probably already know that.
Preparing the cache
Move operations to memory can prepare the cache and make subsequent move operations faster. A CPU usually have two load units and one store units. A load unit can read from memory into a register (one read per cycle), a store unit stores from register to memory. There are also other units that do operations between registers. All the units work in parallel. So, on each cycle, we may do several operations at once, but no more than two loads, one store, and several register operations. Usually it is up to 4 simple operations with plain registers, up to 3 simple operations with XMM/YMM registers and a 1-2 complex operations with any kind of registers. Your code has lots of operations with registers, so one dummy memory store operation is free (since there are more than 4 register operations anyway), but it prepares memory cache for the subsequent store operation. To find out how memory stores work, please refer to the Intel 64 and IA-32 Architectures Optimization Reference Manual.
Breaking the false dependencies
Although this does not exactly refer to your case, but sometimes using 32-bit mov operations under the 64-bit processor (as in your case) are used to clear the higher bits (32-63) and break the dependency chains.
It is well known that under x86-64, using 32-bit operands clears the higher bits of the 64-bit register. Pleas read the relevant section - 3.4.1.1 - of The Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 1:
32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register
So, the mov instructions, that may seem useless at the first sight, clear the higher bits of the appropriate registers. What it gives to us? It breaks dependency chains and allows the instructions to execute in parallel, in random order, by the Out-of-Order algorithm implemented internally by CPUs since Pentium Pro in 1995.
A Quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.8:
Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core micro-architecture, a number of instructions can help clear execution dependency when software uses these instruction to clear register content to zero. Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For
moves, this can be accomplished with 32-bit moves or by using MOVZX.
Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
The MOVZX and MOV with 32-bit operands for x64 are equivalent - they all break dependency chains.
That's why your code executes faster. If there are no dependencies, the CPU can internally rename the registers, even though at the first sight it may seem that the second instruction modifies a register used by the first instruction, and the two cannot execute in parallel. But due to register renaming they can.
Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.
I think you now see that it is too obvious.

Resources