Does a Length-Changing Prefix (LCP) incur a stall on a simple x86_64 instruction? - performance

Consider a simple instruction like
mov RCX, RDI # 48 89 f9
The 48 is the REX prefix for x86_64. It is not an LCP. But consider adding an LCP (for alignment purposes):
.byte 0x67
mov RCX, RDI # 67 48 89 f9
67 is an address size prefix which in this case is for an instruction without addresses. This instruction also has no immediates, and it doesn't use the F7 opcode (False LCP stalls; F7 would be TEST, NOT, NEG, MUL, IMUL, DIV + IDIV). Assume that it doesn't cross a 16-byte boundary either. Those are the LCP stall cases mentioned in Intel's Optimization Reference Manual.
Would this instruction incur an LCP stall (on Skylake, Haswell, ...) ? What about two LCPs?
My daily driver is a MacBook. So I don't have access to VTune and I can't look at the ILD_STALL event. Is there any other way to know?

TL:DR: 67h is safe here on all CPUs. In 64-bit mode (footnote 1), 67h can only LCP-stall with addr32 movabs load/store of the accumulator (AL/AX/EAX/RAX) from/to a moffs 32-bit absolute address (vs. the normal 64-bit absolute for that special opcode).
67h isn't length-changing with normal instructions that use a ModRM, even with a disp32 component in the addressing mode, because 32 and 64-bit address-size use identical ModRM formats. That 67h-LCP-stallable form of mov is special and doesn't use a modrm addressing mode.
(It also almost certainly won't have some other meaning in future CPUs, like being part of a longer opcode the way rep is (footnote 3).)
A Length Changing Prefix is when the opcode (+ modrm) would imply a different length in bytes for the non-prefix part of the instruction's machine code, if you ignored prefixes. I.e. it changes the length of the rest of the instruction. (Parallel length-finding is hard, and done separately from full decode: later insns in a 16-byte block don't even have known start points. So this min(16-byte, 6-instruction) stage needs to look at as few bits as possible after prefixes, for the normal fast case to work. This is the stage where LCP stalls can happen.)
Usually only with an actual imm16 / imm32 opcode, e.g. 66h is length-changing in add cx, 1234, but not add cx, 12: after prefixes or in the appropriate mode, add r/m16, imm8 and add r/m32, imm8 are both opcode + modrm + imm8, 3 bytes regardless (https://www.felixcloutier.com/x86/add). Pre-decode hardware can find the right length by just skipping prefixes, not modifying interpretation of later opcode+modrm based on what it saw, unlike when 66h means the opcode implies 2 immediate bytes instead of 4. Assemblers will always pick the imm8 encoding when possible because it's shorter (or equal length for the no-modrm add ax, imm16 special case).
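For concreteness, a minimal sketch of that length difference (byte encodings as an assembler would emit them):
add cx, 0x1234   ; 66 81 C1 34 12    : the 66h shrinks the immediate from 4 bytes to 2 = length-changing
add cx, 12       ; 66 83 C1 0C       : imm8 form, same length after the prefix as add ecx, 12 = not an LCP
add ecx, 0x1234  ; 81 C1 34 12 00 00 : 32-bit operand-size, no prefix needed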
(Note that REX.W=1 is length-changing for mov r64, imm64 vs. mov r32, imm32, but all hardware handles that relatively common instruction efficiently so only 66h and 67h can ever actually LCP-stall.)
SnB-family doesn't have any false (footnote 2) LCP stalls for prefixes that can be length-changing for this opcode but not for this particular instruction, for either 66h or 67h. So F7 is a non-issue on SnB, unlike Core2 and Nehalem. (Earlier P6-family Intel CPUs didn't support 64-bit mode.) Atom/Silvermont don't have LCP penalties at all, nor do AMD or Via CPUs.
Agner Fog's microarch guide covers this well, and explains things clearly. Search for "length-changing prefixes". (This answer is an attempt to put those pieces together with some reminders about how x86 instruction encoding works, etc.)
Footnote 1: 67h increases length-finding difficulty more in non-64-bit modes:
In 64-bit mode, 67h changes from 64 to 32-bit address size, both of which use disp0 / 8 / 32 (0, 1, or 4 bytes of immediate displacement as part of the instruction), and which use the same ModRM + optional SIB encoding for normal addressing modes. RIP+rel32 repurposes the shorter (no SIB) encoding of 32-bit mode's two redundant ways to encode [disp32], so length decoding is unaffected. Note that REX was already designed not to be length-changing (except for mov r64, imm64), by burdening R13 and R12 in the same ways as RBP and RSP as ModRM "escape codes" to signal no base reg, or presence of a SIB byte, respectively.
In 16 and 32-bit modes, 67h switches to 32 or 16-bit address size. Not only are [x + disp32] vs. [x + disp16] different lengths after the ModRM byte (just like immediates for the operand-size prefix), but also 16-bit address-size can't signal a SIB byte. Why don't x86 16-bit addressing modes have a scale factor, while the 32-bit version has it? So the same bits in the mode and /rm fields can imply different lengths.
Footnote 2: "False" LCP stalls
This need (see footnote 1) to sometimes look differently at ModRM even to find the length is presumably why Intel CPUs before Sandybridge have "false" LCP stalls in 16/32-bit modes on 67h prefixes on any instruction with a ModRM, even when they aren't length-changing (e.g. register addressing mode). Instead of optimistically length-finding and checking somehow, a Core2/Nehalem just punts if they see addr32 + most opcodes, if they're not in 64-bit mode.
Fortunately there's basically zero reason to ever use it in 32-bit code so this mostly only matters for 16-bit code that uses 32-bit registers without switching to protected mode. Or code using 67h for padding like you're doing, except in 32-bit mode. .byte 0x67 / mov ecx, edi would be a problem for Core 2 / Nehalem. (I didn't check earlier 32-bit-only P6 family CPUs. They're a lot more obsolete than Nehalem.)
False LCP stalls for 67h never happen in 64-bit mode; as discussed above that's the easy case, and the length pre-decoders already have to know what mode they're in, so fortunately there's no downside to using it for padding. Unlike rep (which could become part of some future opcode), 67h is extremely likely to be safely ignored for instructions where it can apply to some form of the same opcode, even if there isn't actually a memory operand for this one.
Sandybridge-family doesn't ever have any false LCP stalls, removing both the 16/32-bit mode address-size (67h) and the all-modes 66 F7 cases (which need to look at ModRM to disambiguate instructions like neg di or mul di from test di, imm16).
SnB-family also removes some 66h true-LCP stalls, e.g. from mov-immediate like mov word ptr [rdi], 0 which is actually useful.
Footnote 3: forward compat of using 67h for padding
When 67h applies to the opcode in general (i.e. it can use a memory operand), it's very unlikely that it will mean something else for the same opcode with a modrm that just happens to encode reg,reg operands. So this is safe for What methods can be used to efficiently extend instruction length on modern x86?.
In fact, "relaxing" a 6-byte call [RIP+rel32] to a 5-byte call rel32 is done by GNU binutils by padding the call rel32 with a 67h address-size prefix, even though that prefix is never meaningful for E8 call rel32. (This happens when linking code compiled with -fno-plt, which uses call [RIP + foo@gotpcrel] for any foo that's not found in the current compilation unit and doesn't have "hidden" visibility.)
But that's not a good precedent: at this point it's too widespread for CPU vendors to want to break that specific prefix+opcode combo (like for What does `rep ret` mean?), but some homebrewed thing in your program like 67h cdq would not get the same treatment from vendors.
The rules, for Sandybridge-family CPUs
Edited/condensed from Agner's microarch PDF: these cases can LCP-stall, taking an extra 2 to 3 cycles in pre-decode (if they miss in the uop cache).
Any ALU op with an imm16 that would be imm32 without a 66h. (Except mov-immediate).
Remember that mov and test don't have imm8 forms for wider operand-size so prefer test al, 1, or imm32 if necessary. Or sometimes even test ah, imm8 if you want to test bits in the top half of AX, although beware of 1 cycle of extra latency for reading AH after writing the full reg on HSW and later. GCC uses this trick but should maybe start being careful with it, perhaps sometimes using bt reg, imm8 when feeding a setcc or cmovcc (which can't macro-fuse with test like JCC can).
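As a hedged illustration of those alternatives (encodings as an assembler would emit them; the AH form assumes you only care about bits in the top half of AX):
test ax, 0x8001   ; 66 A9 01 80    : imm16 after 66h = true LCP stall in legacy decode
test al, 1        ; A8 01          : byte operand-size, no LCP
test eax, 0x8001  ; A9 01 80 00 00 : imm32, no LCP (same ZF as the 16-bit test; SF differs)
test ah, 0x80     ; F6 C4 80       : no LCP, but beware the extra AH-read latency on HSW and later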
67h with movabs moffs (A0/A1/A2/A3 opcodes in 64-bit mode, and probably also in 16 or 32-bit mode). Confirmed by my testing with perf counters for ild_stall.lcp on Skylake when LLVM was deciding whether to optimize mov al, [0x123456] to use 67 A0 4-byte-address or a normal opcode + modrm + sib + disp32 (to get absolute instead of rip-relative). That refers to an old version of Agner's guide; he updated soon after I sent him my test results.
If one of the instructions NEG, NOT, DIV, IDIV, MUL and IMUL with a single operand has a 16-bit operand and there is a 16-bytes boundary between the opcode byte and the mod-reg-rm byte. These instructions have a bogus length-changing prefix because these instructions have the same opcode as the TEST instruction with a 16-bit immediate operand [...]
No penalty on SnB-family for div cx or whatever, regardless of alignment.
The address size prefix (67H) will always cause a delay in 16-bit and 32-bit mode on any instruction that has a mod/reg/rm byte even if it doesn't change the length of the instruction.
SnB-family removed this penalty, making address-size prefixes usable as padding if you're careful.
Or to summarize another way:
SnB-family has no false LCP stalls.
SnB-family has LCP stalls on every 66h and 67h true LCP except for:
mov r/m16, imm16 and the mov r16, imm16 no-modrm version.
67h address size interaction with ModRM (in 16/32-bit modes).
(That excludes the no-modrm absolute address load/store of AL/AX/EAX/RAX forms; they can still LCP-stall, presumably even in 32-bit mode, like in 64-bit.)
Length-changing REX doesn't stall (on any CPU).
Some examples
(This part ignores the false LCP stalls that some CPUs have in some non-length-changing cases which turn out not to matter here, but perhaps that's why you were worried about 67h for mov reg,reg.)
In your case, the rest of the instruction bytes, starting after the 67, decode as a 3-byte instruction whether the current address-size is 32 or 64. Same even with addressing modes like mov eax, [e/rsi + 1024] (reg+disp32) or addr32 mov edx, [RIP + rel32].
In 16 and 32-bit modes, 67h switches between 16 and 32-bit address size. [x + disp32] vs. [x + disp16] are different lengths after the ModRM byte, but also non-16-bit address-size can signal a SIB byte depending on the R/M field. But in 64-bit mode, 32 and 64-bit address-size both use [x + disp32], and the same ModRM->SIB or not encoding.
There is only one case where a 67h address-size prefix is length-changing in 64-bit mode: movabs load/store with 8-byte vs. 4-byte absolute addresses, and yes it does LCP-stall Intel CPUs. (I posted test results on https://bugs.llvm.org/show_bug.cgi?id=34733#c3)
For example, addr32 movabs [0x123456], al
.intel_syntax noprefix
addr32 mov [0x123456], cl # non-AL to make movabs impossible
mov [0x123456], al # GAS picks normal absolute [disp32]
addr32 mov [0x123456], al # GAS picks A2 movabs since addr32 makes that the shortest choice, same as NASM does.
movabs [0x123456], al # 64-bit absolute address
Note that GAS (fortunately) doesn't choose to use an addr32 prefix on its own, even with as -Os (gcc -Wa,-Os).
$ gcc -c foo.s
$ objdump -drwC -Mintel foo.o
...
0: 67 88 0c 25 56 34 12 00 mov BYTE PTR ds:0x123456,cl
8: 88 04 25 56 34 12 00 mov BYTE PTR ds:0x123456,al # same encoding after the 67
f: 67 a2 56 34 12 00 addr32 mov ds:0x123456,al
15: a2 56 34 12 00 00 00 00 00 movabs ds:0x123456,al # different length for same opcode
As you can see from the last 2 instructions, which use the A2 mov moffs, al opcode: with a 67 prefix the rest of the instruction is a different length for the same opcode.
This does LCP-stall on Skylake, so it's only fast when running from the uop cache.
Of course the more common source of LCP stalls is with the 66 prefix and an imm16 (instead of imm32). Like add ax, 1234, as in this random test where I wanted to see if jumping over the LCP-stalling instruction could avoid the problem: Label in %rep section in NASM. But not cases like add ax, 12 which will use add r/m16, imm8 (which is the same length after the 66 prefix as add r/m32, imm8).
Also, Sandybridge-family reportedly avoid LCP stalls for mov-immediate with 16-bit immediate.
Related:
Another example of working around add r/m16, imm16: add 1 byte immediate value to a 2 bytes memory location
x86 assembly 16 bit vs 8 bit immediate operand encoding - choose add r/m16, imm8 instead of the also-3-byte add ax, imm16 form.
Sign or Zero Extension of address in 64bit mode for MOV moffs32? - how address-size interacts with the moffs forms of movabs. (The kind that can LCP-stall)
What methods can be used to efficiently extend instruction length on modern x86? - the general case of what you're doing.
Tuning advice and uarch details:
Usually don't try to save space with addr32 mov [0x123456], al, except maybe when it's a choice between saving 1 byte or using 15 bytes of padding including actual NOPs inside a loop. (more tuning advice below)
One LCP stall usually won't be a disaster with a uop cache, especially if length-decoding probably isn't a front-end bottleneck here (although it often can be if the front-end is a bottleneck at all). Hard to test a single instance in one function by micro-benchmarking, though; only a real full-app benchmark will accurately reflect when the code can run from the uop cache (what Intel perf counters call the DSB), bypassing legacy decode (MITE).
There are queues between stages in modern CPUs that can at least partly absorb stalls https://www.realworldtech.com/haswell-cpu/2/ (more so than in PPro/PIII), and SnB-family has shorter LCP-stalls than Core2/Nehalem. (But other reasons for pre-decode slowness already dip into their capacity, and after an I-cache miss they may all be empty.)
When prefixes aren't length-changing, the pre-decode pipeline stage that finds instruction boundaries (before steering chunks of bytes to actual complex/simple decoders or doing actual decoding) will find the correct instruction-length / end by skipping all prefixes and then looking at just the opcode (and modrm if applicable).
This pre-decode length-finding is where LCP stalls happen, so fun fact: even Core 2's pre-decode loop buffer can hide LCP stalls in subsequent iterations because it locks down up to 64 bytes / 18 insns of x86 machine code after finding instruction boundaries, using the decode queue (pre-decode output) as a buffer.
In later CPUs, the LSD and uop cache are post decode, so unless something defeats the uop cache (like the pesky JCC-erratum mitigation or simply having too many uops for the uop cache in a 32-byte aligned block of x86 machine code), loops only pay the LCP-stall cost on the first iteration, if they weren't already hot.
I'd say generally work around LCP stalls if you can do so cheaply, especially for code that usually runs "cold". Or if you can just use 32-bit operand-size and avoid partial-register shenanigans, costing usually only a byte of code-size and no extra instructions or uops. Or if you'd have multiple LCP stalls in a row, e.g. from naively using 16-bit immediates, that would be too many bubbles for buffers to hide so you'd have a real problem and it's worth spending extra instructions. (e.g. mov eax, imm32 / add [mem], ax, or movzx load / add r32,imm32 / store, or whatever.)
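A rough sketch of those workarounds, with buf standing in for some 16-bit memory location (hypothetical label):
add word [buf], 0x1234    ; 66h prefix + imm16: true LCP stall in legacy decode

mov eax, 0x1234           ; imm32: no LCP
add [buf], ax             ; 66h prefix but no immediate: not length-changing

movzx eax, word [buf]     ; or: load,
add eax, 0x1234           ; modify with an imm32 (no LCP),
mov [buf], ax             ; store the low 16 bits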
Padding to end 16-byte fetch blocks at instruction boundaries: not needed
(This is separate from aligning the start of a fetch block at a branch target, which is also sometimes unnecessary given the uop cache.)
Wikichip's section on Skylake pre-decode incorrectly implies that a partial instruction left at the end of a block has to pre-decode on its own, rather than along with the next 16-byte group that contains the end of the instruction. It seems to be paraphrased from Agner Fog's text, with some changes and additions that make it wrong:
[from wikichip...] As with previous microarchitectures, the pre-decoder has a throughput of 6 macro-ops per cycle or until all 16 bytes are consumed, whichever happens first. Note that the predecoder will not load a new 16-byte block until the previous block has been fully exhausted. For example, suppose a new chunk was loaded, resulting in 7 instructions. In the first cycle, 6 instructions will be processed and a whole second cycle will be wasted for that last instruction. This will produce the much lower throughput of 3.5 instructions per cycle which is considerably less than optimal.
[this part is paraphrased from Agner Fog's Core2/Nehalem section, with the word "fully" having been added]
Likewise, if the 16-byte block resulted in just 4 instructions with 1 byte of the 5th instruction received, the first 4 instructions will be processed in the first cycle and a second cycle will be required for the last instruction. This will produce an average throughput of 2.5 instructions per cycle.
[nothing like this appears in the current version of Agner's guide, IDK where this misinformation came from. Perhaps made up based on a misunderstanding of what Agner said, but without testing.]
Fortunately no. The rest of the instruction is in the next fetch block, so reality makes a lot more sense: the leftover bytes are prepended to the next 16-byte block.
(Starting a new 16-byte pre-decode block starting with this instruction would also have been plausible, but my testing rules that out: 2.82 IPC with a repeating 5,6,6 byte = 17-byte pattern. If it only ever looked at 16 bytes and left the partial 5 or 6-byte instruction to be the start of the next block, that would give us 2 IPC.)
A repeating pattern of 3x 5 byte instructions unrolled many times (a NASM %rep 2500 or GAS .rept 2500 block, so 7.5k instructions in ~36kiB) runs at 3.19 IPC, pre-decoding and decoding at ~16 bytes per cycle. (16 bytes/cycle) / (5 bytes/insn) = 3.2 instructions per cycle theoretical.
(If wikichip was right, it would predict close to 2 IPC in a 3-1 pattern, which is of course unreasonably low and would not be an acceptable design for Intel for long runs of long or medium-length instructions when running from legacy decode. 2 IPC is so much narrower than the 4-wide pipeline that it would not be ok even for legacy decode. Intel learned from P4 that running at least decently well from legacy decode is important, even when your CPU caches decoded uops. That's why SnB's uop cache can be so small, only ~1.5k uops. A lot smaller than P4's trace cache, but P4's problem was trying to replace L1i with a trace cache, and having weak decoders. (Also the fact it was a trace cache, so it cached the same code multiple times.))
These perf differences are large enough that you can verify it on your Mac, using a plenty-large repeat count so you don't need perf counters to verify uop-cache misses. (Remember that L1i is inclusive of uop cache, so loops that don't fit in L1i will also evict themselves from uop cache.) Anyway, just measuring total time and knowing the approximate max-turbo that you'll hit is sufficient for a sanity check like this.
Getting better than the theoretical-max that wikichip predicts, even after startup overhead and conservative frequency estimates, will completely rule out that behaviour even on a machine where you don't have perf counters.
$ nasm -felf64 && ld # 3x 5 bytes, repeated 2.5k times
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_retired.retire_slots,uops_executed.thread,idq.dsb_uops -r2 ./testloop
Performance counter stats for './testloop' (2 runs):
604.16 msec task-clock # 1.000 CPUs utilized ( +- 0.02% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
1 page-faults # 0.002 K/sec
2,354,699,144 cycles # 3.897 GHz ( +- 0.02% )
7,502,000,195 instructions # 3.19 insn per cycle ( +- 0.00% )
7,506,746,328 uops_issued.any # 12425.167 M/sec ( +- 0.00% )
7,506,686,463 uops_retired.retire_slots # 12425.068 M/sec ( +- 0.00% )
7,506,726,076 uops_executed.thread # 12425.134 M/sec ( +- 0.00% )
0 idq.dsb_uops # 0.000 K/sec
0.6044392 +- 0.0000998 seconds time elapsed ( +- 0.02% )
(and from another run):
7,501,076,096 idq.mite_uops # 12402.209 M/sec ( +- 0.00% )
No clue why idq.mite_uops:u is not equal to issued or retired. There's nothing to un-laminate, and no stack-sync uops should be necessary, so IDK where the extra issued+retired uops could be coming from. The excess is consistent across runs, and proportional to the %rep count I think.
With other pattern like 5-5-6 (16 bytes) and 5-6-6 (17 bytes), I get similar results.
I do sometimes measure a slight difference when the 16-byte groups are misaligned relative to an absolute 16-byte boundary or not (put a nop at the top of the loop). But that seems to only happen with larger repeat counts. With %rep 2500 (39kiB total size), I still get 2.99 IPC (just under one 16-byte group per cycle), with 0 DSB uops, regardless of aligned vs. misaligned.
I still get 2.99 IPC at %rep 5000, but I see a diff at %rep 10000: 2.95 IPC misaligned vs. 2.99 IPC aligned. That largest %rep count is ~156kiB and still fits in the 256k L2 cache so IDK why anything would be different from half that size. (They're much larger than 32k L1i.) I think earlier I was seeing a difference at 5k, but I can't repro it now. Maybe that was with 17-byte groups.
The actual loop runs 1000000 times in a static executable under _start, with a raw syscall to _exit, so perf counters (and time) for the whole process is basically just the loop. (especially with perf --all-user to only count user-space.)
; complete Linux program
default rel
%use smartalign
alignmode p6, 64
global _start
_start:
mov ebp, 1000000
align 64
.loop:
%ifdef MISALIGN
nop
%endif
%rep 2500
mov eax, 12345 ; 5 bytes.
mov ecx, 123456 ; 5 bytes. Use r8d for 6 bytes
mov edx, 1234567 ; 5 bytes. Use r9d for 6 bytes
%endrep
dec ebp
jnz .loop
.end:
xor edi,edi
mov eax,231 ; __NR_exit_group from /usr/include/asm/unistd_64.h
syscall ; sys_exit_group(0)

Related

Are some general purpose registers faster than others?

In x86-64, will certain instructions execute faster if some general purpose registers are preferred over others?
For instance, would mov eax, ecx execute faster than mov r8d, ecx? I can imagine that the latter would need a REX prefix which would make the instruction fetch slower?
What about using rax instead of rcx? What about add or xor? Other operations? Smaller registers like r15b vs al? al vs ah?
AMD vs Intel? Newer processors? Older processors? Combinations of instructions?
Clarification: Should certain general purpose registers be preferred over others, and which ones are they?
In general, architectural registers are all equal, and renamed onto a large array of physical registers.
(Except partial registers can be slower, especially high-byte AH/BH/CH/DH which are slow to read after writing the full register, on Haswell and later. See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent, and also Why doesn't GCC use partial registers? for problems when writing 8-bit and 16-bit registers. The rest of this answer is just going to consider 32/64-bit operand-size.)
But some instructions require specific registers: legacy variable-count shifts (without BMI2 shrx etc.) require the count in CL, and division requires the dividend in EDX:EAX (or RDX:RAX for the slower 64-bit version).
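A minimal sketch of those fixed-register requirements (operand values are arbitrary):
mov ecx, 5       ; variable-count shifts take the count only in CL
shl rax, cl
xor edx, edx     ; dividend is RDX:RAX for 64-bit div (EDX:EAX for 32-bit)
div rcx          ; divides RDX:RAX by RCX: quotient -> RAX, remainder -> RDX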
Using a call-preserved register like RBX means your function has to spend extra instructions saving/restoring it.
But of course there are perf differences if you need more instructions. So let's assume all else is equal, and just talk about the uops, latency, and code-size of a single instruction just by changing which register is used for one of its operands. TL:DR: the only perf difference is due to instruction-encoding restrictions / differences. Sometimes a different register will allow / require (or get the assembler to pick) a different encoding, which will often be smaller / larger as a special case, and sometimes even executes differently.
Generally smaller code is faster, and packs better in the uop cache and I-cache, so unless you've analyzed a specific case and found a problem, favour the smaller encoding. Often that means keeping a byte value in AL so you can use those special-case instructions, and avoiding RBP / R13 for pointers.
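To make the code-size point concrete, a few encodings (byte sequences as an assembler would emit them):
cmp al, 7        ; 3C 07       : 2 bytes, AL short form with no ModRM
cmp bl, 7        ; 80 FB 07    : 3 bytes
mov eax, [rcx]   ; 8B 01       : 2 bytes
mov eax, [rbp]   ; 8B 45 00    : 3 bytes, RBP as a base forces an explicit disp8 = 0
mov eax, [r13]   ; 41 8B 45 00 : 4 bytes, same plus a REX prefix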
Special cases where a specific encoding is extra slow, not just size
LEA with RBP or R13 as a base can be slower on Intel if the addressing mode didn't already have a +displacement constant.
e.g. lea eax, [rbp + 12] is encodeable as-written, and is just as fast as lea eax, [rcx + 12].
But lea eax, [rbp + rcx*4] can only be encoded in machine code as lea eax, [rbp + rcx*4 + 0] (because of addressing mode escape-code stuff), which is a 3-component LEA, and thus slower on Intel (3 cycle latency on Sandybridge-family instead of 1 cycle, see https://agner.org/optimize/ instruction tables and microarch PDF). On AMD, having a scaled-index would already make it a slow-LEA even with lea eax, [rdx + rcx*4].
Outside of LEA, using RBP / R13 as the base in any addressing mode always requires a disp8/32 byte or dword, but I don't think the actual AGUs are slower for a 3-component addressing mode. So it's just a code-size effect.
Other cases include Which Intel microarchitecture introduced the ADC reg,0 single-uop special case? where the short-form 2-byte encoding for adc al, imm8 is 2 uops even on modern uarches like Skylake, where adc bl, imm8 is 1 uop.
So not only does the adc reg,0 special case not work for adc al,0 on Sandybridge through Haswell, but Broadwell and newer also forgot to (or chose not to) optimize how that encoding decodes to uops. (Of course you could manually encode adc al,0 using the 3-byte Mod/RM encoding, but assemblers will always pick the shortest encoding so adc al,0 will assemble to the short form by default.) It's only a problem with byte registers; adc eax,0 will use the opcode + ModRM + imm8 3-byte encoding, not the 5-byte opcode + imm32.
For other cases of op al,imm8, the only difference is code-size, which only indirectly matters for performance. (Because of decoding, uop-cache packing, and I-cache misses).
See Tips for golfing in x86/x64 machine code for more about special cases of code-size, like xchg eax, ecx being 1-byte vs. xchg edx, ecx being 2 bytes.
add rsp, 8 can need an extra stack-sync uop if there hasn't been an explicit use of RSP or ESP since the last push/pop/call/ret (along the path of execution of course, not in the static code layout). (What is the stack engine in the Sandybridge microarchitecture?). This is why compilers like clang use a dummy push or pop to reserve / free a single stack slot: Why does this function push RAX to the stack as the first operation?
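A hedged sketch of that trade-off (func is a hypothetical callee; whether a sync uop is actually inserted depends on the surrounding code):
sub rsp, 8       ; 4 bytes; explicit RSP use after earlier push/call may need a stack-sync uop
call func
add rsp, 8       ; 4 bytes; another explicit RSP use

push rax         ; 1 byte; reserves 8 bytes through the stack engine
call func
pop rcx          ; 1 byte; frees the slot, the value loaded into RCX is ignored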
LEA will be slower with EBP, RBP, or R13 as the base (PDF warning, page 3-22). But generally the answer is No.
Taking a step back, it's important to realize that since the advent of register renaming, architectural registers no longer correspond directly to physical registers on most micro-architectures. For example, each Cascade Lake core has a register file of 180 integer and 168 FP registers.
You have packed several questions together; if I understood correctly, you may be confusing the processor architecture with the small but fast register file, which fills the speed gap between the processor and memory technologies. The register file is small enough that it can only support one instruction at a time, i.e. the current instruction, and fast enough that it can almost keep up with processor speed.
As a short background: the naming conventions of these registers serve two purposes. One, they keep older x86 ISA implementations compatible up to the present; two, every register name has a special purpose besides its general-purpose use. For example, the ECX register is used as a counter to implement loops; instructions like JECXZ and LOOP use the ECX register exclusively. (Though you need to watch out for flags that you would not want to lose.)
The answer to your question stems from that second purpose. Some registers would seem to be faster because these special registers are hardcoded into the processor and can be accessed more quickly; however, the difference should not be large.
The second thing to know is that not all instructions are of the same complexity, especially in x86: the opcode can be from 1 to 3 bytes, and as more functionality is added to an instruction in terms of prefixes, addressing modes, etc., it starts to become slower. So it is not that some registers are slower than others; it is just that some registers are encoded into the instruction, so those instructions run faster with that combination of registers, and if used otherwise it would seem slower. I hope that helps.

Which is generally faster to test for zero in x86 ASM: "TEST EAX, EAX" versus "TEST AL, AL"?

Which is generally faster to test the byte in AL for zero / non-zero?
TEST EAX, EAX
TEST AL, AL
Assume a previous "MOVZX EAX, BYTE PTR [ESP+4]" instruction loaded a byte parameter with zero-extension to the remainder of EAX, preventing the combine-value penalty that I already know about.
So AL=EAX and there are no partial-register penalties for reading EAX.
Intuitively just examining AL might let you think it's faster, but I'm betting there are more penalty issues to consider for byte access of a >32-bit register.
Any info/details appreciated, thanks!
Code-size is equal, and so is performance on all x86 CPUs AFAIK.
Intel CPUs (with partial-register renaming) definitely don't have a penalty for reading AL after writing EAX. Other CPUs also have no penalty for reading low-byte registers.
Reading AH would have a penalty on Intel CPUs, like some extra latency. (How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent)
In general 32-bit operand-size and 8-bit operand size (with low-8 not high-8) are equal speed except for the false-dependencies or later partial-register reading penalties of writing an 8-bit register. Since TEST only reads registers, this can't be a problem. Even add al, bl is fine: the instruction already had an input dependency on both registers, and on Sandybridge-family a RMW to the low byte of a register doesn't rename it separately. (Haswell and later don't rename low-byte registers separately anyway).
Pick whichever operand-size you like. 8-bit and 32-bit are basically equal. The choice is just a matter of human readability. If you're going to work with the value as a 32-bit integer later, then go 32-bit. If it's logically still an 8-bit value and you were only using movzx as the x86 equivalent of ARM ldrb or MIPS lbu, then using 8-bit makes sense.
There are code-size advantages to instructions like cmp al, imm which can use the no-modrm short-form encoding. cmp al, 0 is still worse than test al,al on some old CPUs (Core 2), where cmp/jcc macro-fusion is less flexible than test/jcc macro-fusion. (Test whether a register is zero with CMP reg,0 vs OR reg,reg?)
There is one difference between these instructions: test al,al sets SF according to the high bit of AL (which can be non-zero). test eax,eax will always clear SF. If you only care about ZF then that makes no difference, but if you have a use for the high bit in SF for a later branch or cmovcc/setcc then you can avoid doing a 2nd test.
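A small sketch of using that SF difference (label names are placeholders):
test al, al          ; ZF = (AL == 0), SF = bit 7 of AL
js   .byte_negative  ; sign bit of the byte, for free from the same TEST
jz   .byte_zero

test eax, eax        ; if EAX was zero-extended from AL, SF here is always 0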
Other ways to test a byte in memory:
If you're consuming the flag result with setcc or cmovcc, not a jcc branch, then macro-fusion doesn't matter in the discussion below.
If you also need the actual value in a register later, movzx/test/jcc is almost certainly best. Otherwise you can consider a memory-destination compare.
cmp [mem], immediate can micro-fuse into a load+cmp uop on Intel, as long as the addressing mode is not RIP-relative. (On Sandybridge-family, indexed addressing modes will un-laminate even on Haswell and later: See Micro fusion and addressing modes). Agner Fog doesn't mention whether AMD has this limitation for fusing cmp/jcc with a memory operand.
;;; no downside for setcc or cmovcc, only with JCC on Intel
;;; unknown on AMD
cmp byte [esp+4], 0 ; micro-fuses into load+cmp with this addressing mode
jnz ... ; breaks macro-fusion on SnB-family
I don't have an AMD CPU to test whether Ryzen or any other AMD still fuses cmp/jcc when the cmp is mem, immediate. Modern AMD CPUs do in general fuse cmp/jcc and test/jcc. (But not add/sub/and/jcc fusion like SnB-family.)
cmp mem,imm / jcc (vs. movzx/test+jcc):
smaller code-size in bytes
same number of front-end / fused-domain uops (2) on mainstream Intel. This would be 3 front-end uops if micro-fusion of the cmp+load wasn't possible, e.g. with a RIP-relative addressing mode + immediate. Or on Sandybridge-family with an indexed addressing mode, it would unlaminate to 3 uops after decode but before issuing into the back-end.
Advantage: this is still 2 on Silvermont/Goldmont / KNL or very old CPUs without macro-fusion. The main advantage of movzx/test/jcc over this is macro-fusion, so it falls behind on CPUs where that doesn't happen.
3 back-end uops (unfused domain = execution ports and space in the scheduler aka RS) because cmp-immediate can't macro-fuse with a JCC on Intel Sandybridge-family CPUs (tested on Skylake). The uops are load, cmp, and a separate branch uop. (vs. 2 for movzx / test+jcc). Back-end uops usually aren't a bottleneck directly, but if the load isn't ready for a while it takes up more space in the RS, limiting how much further past this out-of-order execution can see.
cmp [mem], reg / jcc can macro + micro-fuse into a single compare+branch uop so it's excellent. If you need a zeroed register for anything later in your function, do xor-zero it first and use it for a single-uop compare+branch on memory.
movzx eax, [esp+4] ; 1 uop (load-port only on Intel and Ryzen)
test al,al ; fuses with jcc
jnz ... ; 1 uop
This is still 2 uops for the front-end but only 2 for the back-end as well. The test/jcc macro-fuse together. It costs more code-size, though.
If you aren't branching but instead using the FLAGS result for cmovcc or setcc, using cmp mem, imm has no downside. It can micro-fuse as long as you don't use a RIP-relative addressing mode (which always blocks micro-fusion when there's also an immediate), or an indexed addressing mode.
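And a hedged sketch of the xor-zero + memory-compare idea mentioned earlier, using the question's byte argument at [esp+4]:
xor ecx, ecx           ; zeroed register, also reusable later in the function
cmp byte [esp+4], cl   ; cmp [mem], reg: the load micro-fuses ...
jnz .nonzero           ; ... and the compare macro-fuses with the branch: one compare-and-branch uop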

Is there a penalty when base+offset is in a different page than the base?

The execution times for these three snippets:
pageboundary: dq (pageboundary + 8)
...
mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx - 8]
sub ecx, 1
jnz .loop
And this:
pageboundary: dq (pageboundary - 8)
...
mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx + 8]
sub ecx, 1
jnz .loop
And this:
pageboundary: dq (pageboundary - 4096)
...
mov rdx, [rel pageboundary]
.loop:
mov rdx, [rdx + 4096]
sub ecx, 1
jnz .loop
Are, on a 4770K, roughly 5 cycles per iteration for the first snippet and roughly 9 cycles per iteration for the second snippet, then 5 cycles for the third snippet. They both access the exact same address, which is 4K-aligned. In the second snippet, only the address calculation crosses the page boundary: rdx and rdx + 8 don't belong to the same page, the load is still aligned. With a large offset it's back to 5 cycles again.
How does this effect work in general?
Routing the result from the load through an ALU instruction like this:
.loop:
mov rdx, [rdx + 8]
or rdx, 0
sub ecx, 1
jnz .loop
Makes it take 6 cycles per iteration, which makes sense as 5+1. Reg+8 should be a special fast load and AFAIK take 4 cycles, so even in this case there seems to be some penalty, but only 1 cycle.
A test like this was used in response to some of the comments:
.loop:
lfence
; or rdx, 0
mov rdx, [rdx + 8]
; or rdx, 0
; uncomment one of the ORs
lfence
sub ecx, 1
jnz .loop
Putting the or before the mov makes the loop faster than without any or, putting the or after the mov makes it a cycle slower.
Optimization rule: in pointer-connected data structures like linked-lists / trees, put the next or left/right pointers in the first 16 bytes of the object. malloc typically returns 16-byte aligned blocks (alignof(max_align_t)), so this will ensure the linking pointers are in the same page as the start of the object.
Any other way of ensuring that important struct members are in the same page as the start of the object will also work.
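As a rough NASM-style sketch (a hypothetical tree node; the point is only that the link pointers sit at small offsets from the start of the object):
struc node
    .left:  resq 1    ; offset 0: within the first 16 bytes
    .right: resq 1    ; offset 8: within the first 16 bytes
    .key:   resd 1    ; offset 16: payload after the link pointers
endstruc

mov rdi, [rdi + node.right]   ; pointer-chasing load with a small disp: same page as the base, eligible for the 4-cycle fast path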
Sandybridge-family normally has 5 cycle L1d load-use latency, but there's a special case for pointer-chasing with small positive displacements with base+disp addressing modes.
Sandybridge-family has 4 cycle load-use latency for [reg + 0..2047] addressing modes, when the base reg is the result of a mov load, not an ALU instruction; there's a penalty instead if reg+disp is in a different page than reg.
Based on these test results on Haswell and Skylake (and probably original SnB but we don't know), it appears that all of the following conditions must be true:
base reg comes from another load. (A rough heuristic for pointer-chasing, and usually means that load latency is probably part of a dep chain). If objects are usually allocated not crossing a page boundary, then this is a good heuristic. (The HW can apparently detect which execution unit the input is being forwarded from.)
Addressing mode is [reg] or [reg+disp8/disp32]. (Or an indexed load with an xor-zeroed index register! Usually not practically useful, but might provide some insight into the issue/rename stage transforming load uops.)
displacement < 2048. i.e. all bits above bit 11 are zero (a condition HW can check without a full integer adder/comparator.)
(Skylake but not Haswell/Broadwell): the last load wasn't a retried fast-path. (So if base = the result of a 4 or 5 cycle load, it will attempt the fast path. But if base = the result of a 10 cycle retried load, it won't. The penalty on SKL seems to be 10, vs. 9 on HSW.)
I don't know if it's the last load attempted on that load port that matters, or if it's actually what happened to the load that produced that input. Perhaps experiments chasing two dep chains in parallel could shed some light; I've only tried one pointer chasing dep chain with a mix of page-changing and non-page-changing displacements.
If all those things are true, the load port speculates that the final effective address will be in the same page as the base register. This is a useful optimization in real cases when load-use latency forms a loop-carried dep chain, like for a linked list or binary tree.
microarchitectural explanation (my best guess at explaining the result, not from anything Intel published):
It seems that indexing the L1dTLB is on the critical path for L1d load latency. Starting that 1 cycle early (without waiting for the output of an adder to calculate the final address) shaves a cycle off the full process of indexing L1d using the low 12 bits of the address, then comparing the 8 tags in that set against the high bits of the physical address produced by the TLB. (Intel's L1d is VIPT 8-way 32kiB, so it has no aliasing problems because the index bits all come from the low 12 bits of the address: the offset within a page which is the same in both the virtual and physical address. i.e. the low 12 bits translate for free from virt to phys.)
Since we don't find an effect for crossing 64-byte boundaries, we know the load port is adding the displacement before indexing the cache.
As Hadi suggests, it seems likely that if there's carry-out from bit 11, the load port lets the wrong-TLB load complete and then redoes it using the normal path. (On HSW, the total load latency = 9. On SKL the total load latency can be 7.5 or 10).
Aborting right away and retrying on the next cycle (to make it 5 or 6 cycles instead of 9) would in theory be possible, but remember that the load ports are pipelined with 1 per clock throughput. The scheduler is expecting to be able to send another uop to the load port in the next cycle, and Sandybridge-family standardizes latencies for everything of 5 cycles and shorter. (There are no 2-cycle instructions).
I didn't test if 2M hugepages help, but probably not. I think the TLB hardware is simple enough that it couldn't recognize that a 1-page-higher index would still pick the same entry. So it probably does the slow retry any time the displacement crosses a 4k boundary, even if that's in the same hugepage. (Page-split loads work this way: if the data actually crosses a 4k boundary (e.g. 8-byte load from page-4), you pay the page-split penalty not just the cache-line split penalty, regardless of hugepages)
Intel's optimization manual documents this special case in section 2.4.5.2 L1 DCache (in the Sandybridge section), but doesn't mention any different-page limitation, or the fact that it's only for pointer-chasing, and doesn't happen when there's an ALU instruction in the dep chain.
(Sandybridge)
Table 2-21. Effect of Addressing Modes on Load Latency
-----------------------------------------------------------------------
Data Type | Base + Offset > 2048 | Base + Offset < 2048
| Base + Index [+ Offset] |
----------------------+--------------------------+----------------------
Integer | 5 | 4
MMX, SSE, 128-bit AVX | 6 | 5
X87 | 7 | 6
256-bit AVX | 7 | 7
(remember, 256-bit loads on SnB take 2 cycles in the load port, unlike on HSW/SKL)
The text around this table also doesn't mention the limitations that exist on Haswell/Skylake, and may also exist on SnB (I don't know).
Maybe Sandybridge doesn't have those limitations and Intel didn't document the Haswell regression, or else Intel just didn't document the limitations in the first place. The table is pretty definite about that addressing mode always being 4c latency with offset = 0..2047.
@Harold's experiment of putting an ALU instruction as part of the load/use pointer-chasing dependency chain confirms that it's this effect that's causing the slowdown: an ALU insn decreased the total latency, effectively giving an instruction like and rdx, rdx negative incremental latency when added to the mov rdx, [rdx-8] dep chain in this specific page-crossing case.
Previous guesses in this answer included the suggestion that using the load result in an ALU vs. another load was what determined the latency. That would be super weird and require looking into the future. That was a wrong interpretation on my part of the effect of adding an ALU instruction into the loop. (I hadn't known about the 9-cycle effect on page crossing, and was thinking that the HW mechanism was a forwarding fast-path for the result inside the load port. That would make sense.)
We can prove that it's the source of the base reg input that matters, not the destination of the load result: Store the same address at 2 separate locations, before and after a page boundary. Create a dep chain of ALU => load => load, and check that it's the 2nd load that's vulnerable to this slowdown / able to benefit from the speedup with a simple addressing mode.
%define off 16
lea rdi, [buf+4096 - 16]
mov [rdi], rdi
mov [rdi+off], rdi
mov ebp, 100000000
.loop:
and rdi, rdi
mov rdi, [rdi] ; base comes from AND
mov rdi, [rdi+off] ; base comes from a load
dec ebp
jnz .loop
... sys_exit_group(0)
section .bss
align 4096
buf: resb 4096*2
Timed with Linux perf on SKL i7-6700k.
off = 8, the speculation is correct and we get total latency = 10 cycles = 1 + 5 + 4. (10 cycles per iteration).
off = 16, the [rdi+off] load is slow, and we get 16 cycles / iter = 1 + 5 + 10. (The penalty seems to be higher on SKL than HSW)
With the load order reversed (doing the [rdi+off] load first), it's always 10c regardless of off=8 or off=16, so we've proved that mov rdi, [rdi+off] doesn't attempt the speculative fast-path if its input is from an ALU instruction.
Without the and, and off=8, we get the expected 8c per iter: both loads use the fast path. (@harold confirms HSW also gets 8 here).
Without the and, and off=16, we get 15c per iter: 5+10. The mov rdi, [rdi+16] attempts the fast path and fails, taking 10c. Then mov rdi, [rdi] doesn't attempt the fast-path because its input failed. (@harold's HSW takes 13 here: 4 + 9. So that confirms HSW does attempt the fast-path even if the last fast-path failed, and that the fast-path fail penalty really is only 9 on HSW vs. 10 on SKL.)
It's unfortunate that SKL doesn't realize that [base] with no displacement can always safely use the fast path.
On SKL, with just mov rdi, [rdi+16] in the loop, the average latency is 7.5 cycles. Based on tests with other mixes, I think it alternates between 5c and 10c: after a 5c load that didn't attempt the fast path, the next one does attempt it and fails, taking 10c. That makes the next load use the safe 5c path.
Adding a zeroed index register actually speeds it up in this case where we know the fast-path is always going to fail. Or using no base register, like [nosplit off + rdi*1], which NASM assembles to 48 8b 3c 3d 10 00 00 00 mov rdi,QWORD PTR [rdi*1+0x10]. Notice that this requires a disp32, so it's bad for code size.
Also beware that indexed addressing modes for micro-fused memory operands are un-laminated in some cases, while base+disp modes aren't. But if you're using pure loads (like mov or vbroadcastss), there's nothing inherently wrong with an indexed addressing mode. Using an extra zeroed register isn't great, though.
On Ice Lake, this special 4 cycle fast path for pointer chasing loads is gone: GP register loads that hit in L1 now generally take 5 cycles, with no difference based on the presence of indexing or the size of the offset.
I've conducted a sufficient number of experiments on Haswell to determine exactly when memory loads are issued speculatively before the effective address is fully calculated. These results also confirm Peter's guess.
I've varied the following parameters:
The offset from pageboundary. The offset used is the same in the definition of pageboundary and the load instruction.
The sign of the offset is either + or -. The sign used in the definition is always the opposite of the one used in the load instruction.
The alignment of pageboundary within the executable binary.
In all of the following graphs, the Y axis represents the load latency in core cycles. The X axis represents the configuration in the form NS1S2, where N is the offset, S1 is the sign of the offset used in the definition, and S2 is the sign used in the load instruction.
The following graph shows that loads are issued before calculating the effective address only when the offset is positive or zero. Note that for all of the offsets between 0-15, the base address and the effective address used in the load instruction are both within the same 4K page.
The next graph shows the point where this pattern changes. The change occurs at offset 213, which is the smallest offset where the base address and the effective address used in the load instruction are both within different 4K pages.
Another important observation that can be made from the previous two graphs is that even if the base address points to a different cache set than the effective address, no penalty is incurred. So it seems that the cache set is opened after calculating the effective address. This indicates that the L1 DTLB hit latency is 2 cycles (that is, it takes 2 cycles for the L1D to receive the tag), but it takes only 1 cycle to open the cache's data array set and the cache's tag array set (which occurs in parallel).
The next graph shows what happens when pageboundary is aligned on a 4K page boundary. In this case, any offset that is not zero will make the base and effective addresses reside within different pages. For example, if the base address of pageboundary is 4096, then the base address of pageboundary used in the load instruction is 4096 - offset, which is obviously in a different 4K page for any non-zero offset.
The next graph shows that the pattern changes again starting from offset 2048. At this point, loads are never issued before calculating the effective address.
This analysis can be confirmed by measuring the number of uops dispatched to the load ports 2 and 3. The total number of retired load uops is 1 billion (equal to the number of iterations). However, when the measured load latency is 9 cycles, the number of load uops dispatched to each of the two ports is 1 billion. Also when the load latency is 5 or 4 cycles, the number of load uops dispatched to each of the two ports is 0.5 billion. So something like this would be happening:
The load unit checks whether the offset is non-negative and smaller than 2048. In that case, it will issue a data load request using the base address. It will also begin calculating the effective address.
In the next cycle, the effective address calculation is completed. If it turns out that the load is to a different 4K page, the load unit waits until the issued load completes and then it discards the results and replays the load. Either way, it supplies the data cache with the set index and line offset.
In the next cycle, the tag comparison is performed and the data is forwarded to the load buffer. (I'm not sure whether the address-speculative load will be aborted in the case of a miss in the L1D or the DTLB.)
In the next cycle, the load buffer receives the data from the cache. If it's supposed to discard the data, it's discarded and it tells the dispatcher to replay the load with address speculation disabled for it. Otherwise, the data is written back. If a following instruction requires the data for its address calculation, it will receive the data in the next cycle (so it will be dispatched in the next cycle if all of its other operands are ready).
These steps explain the observed 4, 5, and 9 cycle latencies.
It might happen that the target page is a hugepage. The only way for the load unit to know whether the base address and the effective address point to the same page when using hugepages is to have the TLB supply the load unit with the size of the page being accessed. Then the load unit has to check whether the effective address is within that page. In modern processors, on a TLB miss, dedicated page-walk hardware is used. In this case, I think that the load unit will not supply the cache set index and cache line offset to the data cache and will use the actual effective address to access the TLB. This requires enabling the page-walk hardware to distinguish between loads with speculative addresses and other loads. Only if that other access missed the TLB will the page walk take place. Now if the target page turned out to be a hugepage and it's a hit in the TLB, it might be possible to inform the load unit that the size of the page is larger than 4K or maybe even of the exact size of the page. The load unit can then make a better decision regarding whether the load should be replayed. However, this logic should take no more than the time for the (potentially wrong) data to reach the load buffer allocated for the load. I think this time is only one cycle.

How does RIP-relative addressing perform compared to mov reg, imm64?

It is a known fact that x86-64 instructions do not support 64-bit immediate values (except for mov). Hence, when migrating code from 32 to 64 bits, an instruction like this:
cmp rax, addr32
cannot be replaced with the following:
cmp rax, addr64
Under these circumstances, I'm considering two alternatives: (a) using a scratch register for loading the constant or (b) using rip-relative addressing. The two approaches look like this:
mov r11, addr64 ; scratch register
cmp rax, r11
ptr64: dq addr64
...
cmp rax, [rel ptr64] ; encoded as cmp rax, [rip+offset]
I wrote a very simple loop to compare the performance of both approaches (which I paste below). While (b) uses an indirect pointer, (a) has the immediate encoded in the instruction (which could lead to a worse usage of i-cache). Surprisingly, I found that (b) ran ~10% faster than (a). Is this result something to be expected in more common real-world code?
true: dq 0xFFFF0000FFFF0000
false: dq 0xAAAABBBBAAAABBBB
main:
or rax, 1 ; rax is odd and constant "true" is even
mov rcx, 0x1
shl rcx, 30
branch:
mov r11, 0xFFFF0000FFFF0000 ; not present in (b)
cmp rax, r11 ; vs cmp rax, [rel true]
je next
add rax, 2
loop branch
next:
mov rax, 0
ret
Surprisingly, I found that (b) ran ~10% faster than (a)
You probably tested on a CPU other than AMD Bulldozer-family or Ryzen, which have a fast loop instruction. On other CPUs, loop is very slow, mostly on purpose for historical reasons, so you bottleneck on it. e.g. 7 uops, one per 5c throughput on Haswell.
mov r64, imm64 is bad for uop cache throughput because of the large immediate taking 2 slots in Intel's uop cache. (See the Sandybridge uop cache section in Agner Fog's microarch pdf, and Which is faster, imm64 or m64 for x86-64? where I listed the details.)
Even apart from that, it's not too surprising that 1 extra uop in the loop makes it run slower. You're probably not on an AMD CPU (with single-uop / 1 per 2 clock loop), because the extra mov in such a tiny loop would make more than 10% difference. Or no difference at all, since it's just 3 vs. 4 uops per 2 clocks, if it's correct that even tiny loops are limited to one taken jump per 2 clocks.
On Intel, loop is 7 uops, one per 5 clocks throughput on most CPUs, so the 4-per-clock issue/rename bottleneck won't be what you're hitting. loop is micro-coded, so the front-end can't run from the loop buffer. (And Skylake CPUs have their LSD disabled by a microcode update to fix the partial-register erratum anyway.) So the mov r64,imm64 uop has to be re-read from the uop cache every time through the loop.
A load that hits in cache has very good throughput (2 loads per clock, and in this case micro-fusion means no extra uops to use a memory operand instead of register for cmp). So the main penalty in using a constant from memory is the extra cache footprint and cache misses, but your microbenchmark won't reveal that at all. It also has no other pressure on the load ports.
In the general case:
If possible, use a RIP-relative lea to generate 64-bit address constants.
e.g. lea rax, [rel addr64]. Yes, this takes an extra instruction to get the constant into a register. (BTW, just use default rel. You can use [abs fs:0] if you need it.)
You can avoid the extra instruction if you build position-dependent code with the default (small) code model, so static addresses fit in the low 32 bits of virtual address space and can be used as immediates. (Actually the low 2GiB, so sign or zero extending both work.) See 32-bit absolute addresses no longer allowed in x86-64 Linux? if gcc complains about absolute addressing; -pie is enabled by default on most distros. This of course doesn't work in Linux shared libraries, which only support text relocations for 64-bit addresses. But you should avoid relocations whenever possible by using lea to make position-independent code.
Most integer build-time constants fit in 32 bits, so you can use cmp r64, imm32 or cmp r32, imm32 even in PIC code.
If you do need a 64-bit non-address constant, try to hoist the mov r64, imm64 out of a loop. Your cmp loop would have been fine if the mov wasn't inside the loop. x86-64 has enough registers that you (or the compiler) can usually avoid reloads inside inner-most loops in integer code.
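For example, a hedged rework of the question's loop with the constant hoisted (and the slow loop instruction replaced by dec/jnz):
mov r11, 0xFFFF0000FFFF0000   ; 10-byte mov r64, imm64, now paid once outside the loop
branch:
    cmp rax, r11
    je next
    add rax, 2
    dec rcx
    jnz branch
next: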

Are scaled-index addressing modes a good idea?

Consider the following code:
void foo(int* __restrict__ a)
{
int i; int val = 0;
for (i = 0; i < 100; i++) {
val = 2 * i;
a[i] = val;
}
}
This compiles (with maximum optimization but no unrolling or vectorization) into...
GCC 7.2:
foo(int*):
xor eax, eax
.L2:
mov DWORD PTR [rdi], eax
add eax, 2
add rdi, 4
cmp eax, 200
jne .L2
rep ret
clang 5.0:
foo(int*): # #foo(int*)
xor eax, eax
.LBB0_1: # =>This Inner Loop Header: Depth=1
mov dword ptr [rdi + 2*rax], eax
add rax, 2
cmp rax, 200
jne .LBB0_1
ret
What are the pros and cons of GCC's vs clang's approach? i.e. an extra variable incremented separately, vs multiplying via a more complex addressing mode?
Notes:
This question also relates to this one with about the same code, but with floats rather than ints.
Yes, take advantage of the power of x86 addressing modes to save uops, in cases where an index doesn't unlaminate into more extra uops than it would cost to do pointer increments.
(In many cases unrolling and using pointer increments is a win because of unlamination on Intel Sandybridge-family, but if you're not unrolling or if you're only using mov loads instead of folding memory operands into ALU ops for micro-fusion, then indexed addressing modes are often break even on some CPUs and a win on others.)
It's essential to read and understand Micro fusion and addressing modes if you want to make optimal choices here. (And note that IACA gets it wrong, and doesn't simulate Haswell and later keeping some uops micro-fused, so you can't even just check your work by having it do static analysis for you.)
Indexed addressing modes are generally cheap. At worst they cost one extra uop for the front-end (on Intel SnB-family CPUs in some situations), and/or prevent a store-address uop from using port7 (which only supports base + displacement addressing modes). See Agner Fog's microarch pdf, and also David Kanter's Haswell write-up, for more about the store-AGU on port7 which Intel added in Haswell.
On Haswell+, if you need your loop to sustain more than 2 memory ops per clock, then avoid indexed stores.
At best they're free other than the code-size cost of the extra byte in the machine-code encoding. (Having an index register requires a SIB (Scale Index Base) byte in the encoding).
More often the only penalty is the 1 extra cycle of load-use latency vs. a simple [base + 0-2047] addressing mode, on Intel Sandybridge-family CPUs.
It's usually only worth using an extra instruction to avoid an indexed addressing mode if you're going to use that addressing mode in multiple instructions. (e.g. load / modify / store).
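A sketch of that load/modify/store case, with rdi + rax*4 standing in for whatever indexed address you'd otherwise repeat:
add dword [rdi + rax*4], 1   ; indexed RMW: un-laminates to more uops on SnB-family, and the store-address uop can't use port 7

lea rcx, [rdi + rax*4]       ; pay for the indexed calculation once ...
add dword [rcx], 1           ; ... so the load and the store both use a simple [base] addressing mode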
Scaling the index is free (on modern CPUs at least) if you're already using a 2-register addressing mode. For lea, Agner Fog's table lists AMD Ryzen as having 2c latency and only 2 per clock throughput for lea with scaled-index addressing modes (or 3-component); otherwise it's 1c latency and 0.25c throughput. e.g. lea rax, [rcx + rdx] is faster than lea rax, [rcx + 2*rdx], but not by enough to be worth using extra instructions instead. Ryzen also doesn't like a 32-bit destination in 64-bit mode, for some reason. But the worst-case LEA is still not bad at all. And anyway, this is mostly unrelated to address-mode choice for loads, because most CPUs (other than in-order Atom) run LEA on the ALUs, not the AGUs used for actual loads/stores.
The main question is between one-register unscaled (so it can be a "base" register in the machine-code encoding: [base + idx*scale + disp]) or two-register. Note that for Intel's micro-fusion limitations, [disp32 + idx*scale] (e.g. indexing a static array) is an indexed addressing mode.
Neither function is totally optimal (even without considering unrolling or vectorization), but clang's looks very close.
The only thing clang could do better is save 2 bytes of code size by avoiding the REX prefixes with add eax, 2 and cmp eax, 200. It promoted all the operands to 64-bit because it's using them with pointers and I guess proved that the C loop doesn't need them to wrap, so in asm it uses 64-bit everywhere. This is pointless; 32-bit operations are always at least as fast as 64, and implicit zero-extension is free. But this only costs 2 bytes of code-size, and costs no performance other than indirect front-end effects from that.
You've constructed your loop so the compiler needs to keep a specific value in registers and can't totally transform the problem into just a pointer-increment + compare against an end pointer (which compilers often do when they don't need the loop variable for anything except array indexing).
You also can't transform it to counting a negative index up towards zero (which compilers never do here, but it reduces the loop overhead to a total of 1 macro-fused add + branch uop on Intel CPUs, which can fuse add + jcc, while AMD can only fuse test or cmp / jcc).
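A hedged sketch of what that transformation would look like for this loop (rdi = a on entry, as in foo):
    lea rdi, [rdi + 400]   ; point one-past-the-end of the 100-int array
    mov rdx, -400          ; negative byte index, counted up towards zero
    xor eax, eax
.loop:
    mov [rdi + rdx], eax   ; a[i] = val (indexed store: un-laminates on SnB/IvB, stays fused on HSW+)
    add eax, 2
    add rdx, 4             ; add + jnz macro-fuse on Intel: 1 uop of loop overhead
    jnz .loop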
Clang has done a good job noticing that it can use 2*var as the array index (in bytes). This is a good optimization for tune=generic. The indexed store will un-laminate on Intel Sandybridge and Ivybridge, but stay micro-fused on Haswell and later. (And on other CPUs, like Nehalem, Silvermont, Ryzen, Jaguar, or whatever, there's no disadvantage.)
gcc's loop has 1 extra uop in the loop. It can still in theory run at 1 store per clock on Core2 / Nehalem, but it's right up against the 4 uops per clock limit. (And actually, Core2 can't macro-fuse the cmp/jcc in 64-bit mode, so it bottlenecks on the front-end).
Indexed addressing (in loads and stores, lea is different still) has some trade-offs, for example
On many µarchs, instructions that use indexed addressing have a slightly longer latency than instruction that don't. But usually throughput is a more important consideration.
On Netburst, stores with a SIB byte generate an extra µop, and therefore may cost throughput as well. The SIB byte causes an extra µop regardless of whether you use it for indexed addressing or not, but indexed addressing always costs the extra µop. It doesn't apply to loads.
On Haswell/Broadwell (and still in Skylake/Kabylake), stores with indexed addressing cannot use port 7 for address generation, instead one of the more general address generation ports will be used, reducing the throughput available for loads.
So for loads it's usually good (or not bad) to use indexed addressing if it saves an add somewhere, unless they are part of a chain of dependent loads. For stores it's more dangerous to use indexed addressing. In the example code it shouldn't make a large difference. Saving the add is not really relevant, ALU instructions wouldn't be the bottleneck. The address generation happening in ports 2 or 3 doesn't matter either since there are no loads.
