x86_64: is IMUL faster than 2x SHL + 2x ADD?

x86_64: is IMUL faster than 2x SHL + 2x ADD? - performance

When looking at the assembly produced by Visual Studio (2015U2) in /O2 (release) mode I saw that this 'hand-optimized' piece of C code is translated back into a multiplication:
int64_t calc(int64_t a) {
return (a << 6) + (a << 16) - a;
}
Assembly:
imul rdx,qword ptr [a],1003Fh
So I was wondering if that is really faster than doing it the way it is written, something like:
mov rbx,qword ptr [a]
mov rax,rbx
shl rax,6
mov rcx,rbx
shl rcx,10h
add rax,rcx
sub rax,rbx
I was always under the impression that multiplication is always slower than a few shifts/adds? Is that no longer the case with modern Intel x86_64 processors?

That's right, modern x86 CPUs (especially Intel) have very high performance multipliers.
imul r, r/m and imul r, r/m, imm are both 3 cycle latency, one per 1c throughput on Intel SnB-family and AMD Ryzen, even for 64bit operand size.
On AMD Bulldozer-family, it's 4c or 6c latency, and one per 2c or one per 4c throughput. (The slower times for 64-bit operand-size).
Data from Agner Fog's instruction tables. See also other stuff in the x86 tag wiki.
The transistor budget in modern CPUs is pretty huge, and allows for the amount of hardware parallelism needed to do a 64 bit multiply with such low latency. (It takes a lot of adders to make a large fast multiplier. How modern X86 processors actually compute multiplications?).
Being limited by power budget, not transistor budget, means that having dedicated hardware for many different functions is possible, as long as they can't all be switching at the same time (https://en.wikipedia.org/wiki/Dark_silicon). e.g. you can't saturate the pext/pdep unit, the integer multiplier, and the vector FMA units all at once on an Intel CPU, because many of them are on the same execution ports.
Fun fact: imul r64 is also 3c, so you can get a full 64*64 => 128b multiply result in 3 cycles. imul r32 is 4c latency and an extra uop, though. My guess is that the extra uop / cycle is splitting the 64bit result from the regular 64bit multiplier into two 32bit halves.
Compilers typically optimize for latency, and typically don't know how to optimize short independent dependency chains for throughput vs. long loop-carried dependency chains that bottleneck on latency.
gcc and clang3.8 and later use up to two LEA instructions instead of imul r, r/m, imm. I think gcc will use imul if the alternative is 3 or more instructions (not including mov), though.
That's a reasonable tuning choice, since a 3 instruction dep chain would be the same length as an imul on Intel. Using two 1-cycle instructions spends an extra uop to shorten the latency by 1 cycle.
clang3.7 and earlier tends to favour imul except for multipliers that only require a single LEA or shift. So clang very recently changed to optimizing for latency instead of throughput for multiplies by small constants. (Or maybe for other reasons, like not competing with other things that are only on the same port as the multiplier.)
e.g. this code on the Godbolt compiler explorer:
int foo (int a) { return a * 63; }
# gcc 6.1 -O3 -march=haswell (and clang actually does the same here)
mov eax, edi # tmp91, a
sal eax, 6 # tmp91,
sub eax, edi # tmp92, a
ret
clang3.8 and later makes the same code.

Related

Does a Length-Changing Prefix (LCP) incur a stall on a simple x86_64 instruction?

Consider a simple instruction like
mov RCX, RDI # 48 89 f9
The 48 is the REX prefix for x86_64. It is not an LCP. But consider adding an LCP (for alignment purposes):
.byte 0x67
mov RCX, RDI # 67 48 89 f9
67 is an address size prefix which in this case is for an instruction without addresses. This instruction also has no immediates, and it doesn't use the F7 opcode (False LCP stalls; F7 would be TEST, NOT, NEG, MUL, IMUL, DIV + IDIV). Assume that it doesn't cross a 16-byte boundary either. Those are the LCP stall cases mentioned in Intel's Optimization Reference Manual.
Would this instruction incur an LCP stall (on Skylake, Haswell, ...) ? What about two LCPs?
My daily driver is a MacBook. So I don't have access to VTune and I can't look at the ILD_STALL event. Is there any other way to know?

TL:DR: 67h is safe here on all CPUs. In 64-bit mode1, 67h can only LCP-stall with addr32 movabs load/store of the accumulator (AL/AX/EAX/RAX) from/to a moffs 32-bit absolute address (vs. the normal 64-bit absolute for that special opcode).
67h isn't length-changing with normal instructions that use a ModRM, even with a disp32 component in the addressing mode, because 32 and 64-bit address-size use identical ModRM formats. That 67h-LCP-stallable form of mov is special and doesn't use a modrm addressing mode.
(It also almost certainly won't have some other meaning in future CPUs, like being part of longer opcode the way rep is3.)
A Length Changing Prefix is when the opcode(+modrm) would imply a different length in bytes for the non-prefixes part of the instruction's machine code, if you ignored prefixes. I.e. it changes the length of the rest of the instruction. (Parallel length-finding is hard, and done separately from full decode: Later insns in a 16-byte block don't even have known start points. So this min(16-byte, 6-instruction) stage needs to look at as few bits as possible after prefixes, for the normal fast case to work. This is the stage where LCP stalls can happen.)
Usually only with an actual imm16 / imm32 opcode, e.g. 66h is length-changing in add cx, 1234, but not add cx, 12: after prefixes or in the appropriate mode, add r/m16, imm8 and add r/m32, imm8 are both opcode + modrm + imm8, 3 bytes regardless, (https://www.felixcloutier.com/x86/add). Pre-decode hardware can find the right length by just skipping prefixes, not modifying interpretation of later opcode+modrm based on what it saw, unlike when 66h means the opcode implies 2 immediate bytes instead of 4. Assemblers will always pick the imm8 encoding when possible because it's shorter (or equal length for the no-modrm add ax, imm16 special case).
(Note that REX.W=1 is length-changing for mov r64, imm64 vs. mov r32, imm32, but all hardware handles that relatively common instruction efficiently so only 66h and 67h can ever actually LCP-stall.)
SnB-family doesn't have any false2 LCP stalls for prefixes that can be length-changing for this opcode but not this this particular instruction, for either 66h or 67h. So F7 is a non-issue on SnB, unlike Core2 and Nehalem. (Earlier P6-family Intel CPUs didn't support 64-bit mode.) Atom/Silvermont don't have LCP penalties at all, nor do AMD or Via CPUs.
Agner Fog's microarch guide covers this well, and explains things clearly. Search for "length-changing prefixes". (This answer is an attempt to put those pieces together with some reminders about how x86 instruction encoding works, etc.)
Footnote 1: 67h increases length-finding difficulty more in non-64-bit modes:
In 64-bit mode, 67h changes from 64 to 32-bit address size, both of which use disp0 / 8 / 32 (0, 1, or 4 bytes of immediate displacement as part of the instruction), and which use the same ModRM + optional SIB encoding for normal addressing modes. RIP+rel32 repurposes the shorter (no SIB) encoding of 32-bit mode's two redundant ways to encode [disp32], so length decoding is unaffected. Note that REX was already designed not to be length-changing (except for mov r64, imm64), by burdening R13 and R12 in the same ways as RBP and RSP as ModRM "escape codes" to signal no base reg, or presence of a SIB byte, respectively.
In 16 and 32-bit modes, 67h switches to 32 or 16-bit address size. Not only are [x + disp32] vs. [x + disp16] different lengths after the ModRM byte (just like immediates for the operand-size prefix), but also 16-bit address-size can't signal a SIB byte. Why don't x86 16-bit addressing modes have a scale factor, while the 32-bit version has it? So the same bits in the mode and /rm fields can imply different lengths.
Footnote 2: "False" LCP stalls
This need (see footnote 1) to sometimes look differently at ModRM even to find the length is presumably why Intel CPUs before Sandybridge have "false" LCP stalls in 16/32-bit modes on 67h prefixes on any instruction with a ModRM, even when they aren't length-changing (e.g. register addressing mode). Instead of optimistically length-finding and checking somehow, a Core2/Nehalem just punts if they see addr32 + most opcodes, if they're not in 64-bit mode.
Fortunately there's basically zero reason to ever use it in 32-bit code so this mostly only matters for 16-bit code that uses 32-bit registers without switching to protected mode. Or code using 67h for padding like you're doing, except in 32-bit mode. .byte 0x67 / mov ecx, edi would be a problem for Core 2 / Nehalem. (I didn't check earlier 32-bit-only P6 family CPUs. They're a lot more obsolete than Nehalem.)
False LCP stalls for 67h never happen in 64-bit mode; as discussed above that's the easy case, and the length pre-decoders already have to know what mode they're in, so fortunately there's no downside to using it for padding. Unlike rep (which could become part of some future opcode), 67h is extremely likely to be safely ignored for instructions where it can apply to some form of the same opcode, even if there isn't actually a memory operand for this one.
Sandybridge-family doesn't ever have any false LCP stalls, removing both the 16/32-bit mode address-size (67h) and the all-modes 66 F7 cases (which needs to look at ModRM to disambiguate instructions like neg di or mul di from test di, imm16.)
SnB-family also removes some 66h true-LCP stalls, e.g. from mov-immediate like mov word ptr [rdi], 0 which is actually useful.
Footnote 3: forward compat of using 67h for padding
When 67h applies to the opcode in general (i.e. it can use a memory operand), it's very unlikely that it will mean something else for the same opcode with a modrm that just happens to encode reg,reg operands. So this is safe for What methods can be used to efficiently extend instruction length on modern x86?.
In fact, "relaxing" a 6-byte call [RIP+rel32] to a 5-byte call rel32 is done by GNU binutils by padding the call rel32 with a 67h address-size prefix, even though that's never meaningful for E8 call rel32. (This happens when linking code compiled with -fno-plt, which uses call [RIP + foo#gotpcrel] for any foo that's not found in the current compilation unit and doesn't have "hidden" visibility.)
But that's not a good precedent: at this point it's too widespread for CPU vendors to want to break that specific prefix+opcode combo (like for What does `rep ret` mean?), but some homebrewed thing in your program like 67h cdq would not get the same treatment from vendors.
The rules, for Sandybridge-family CPUs
edited/condensed from Agner's microarch PDF, these cases can LCP-stall, taking an extra 2 to 3 cycles in pre-decode (if they miss in the uop cache).
Any ALU op with an imm16 that would be imm32 without a 66h. (Except mov-immediate).
Remember that mov and test don't have imm8 forms for wider operand-size so prefer test al, 1, or imm32 if necessary. Or sometimes even test ah, imm8 if you want to test bits in the top half of AX, although beware of 1 cycle of extra latency for reading AH after writing the full reg on HSW and later. GCC uses this trick but should maybe start being careful with it, perhaps sometimes using bt reg, imm8 when feeding a setcc or cmovcc (which can't macro-fuse with test like JCC can).
67h with movabs moffs (A0/A1/A2/A3 opcodes in 64-bit mode, and probably also in 16 or 32-bit mode). Confirmed by my testing with perf counters for ild_stall.lcp on Skylake when LLVM was deciding whether to optimize mov al, [0x123456] to use 67 A0 4-byte-address or a normal opcode + modrm + sib + disp32 (to get absolute instead of rip-relative). That refers to an old version of Agner's guide; he updated soon after I sent him my test results.
If one of the instructions NEG, NOT, DIV, IDIV, MUL and IMUL with a single operand
has a 16-bit operand and there is a 16-bytes boundary between the opcode byte and
the mod-reg-rm byte. These instructions have a bogus length-changing prefix
because these instructions have the same opcode as the TEST instruction with a 16-
bit immediate operand [...]
No penalty on SnB-family for div cx or whatever, regardless of alignment.
The address size prefix (67H) will always cause a delay in 16-bit and 32-bit mode on any
instruction that has a mod/reg/rm byte even if it doesn't change the length of the instruction.
SnB-family removed this penalty, making address-size prefixes usable as padding if you're careful.
Or to summarize another way:
SnB-family has no false LCP stalls.
SnB-family has LCP stalls on every 66h and 67h true LCP except for:
mov r/m16, imm16 and the mov r16, imm16 no-modrm version.
67h address size interaction with ModRM (in 16/32-bit modes).
(That excludes the no-modrm absolute address load/store of AL/AX/EAX/RAX forms- they can still LCP-stall, presumably even in 32-bit mode, like in 64-bit.)
Length-changing REX doesn't stall (on any CPU).
Some examples
(This part ignores the false LCP stalls that some CPUs have in some non-length-changing cases which turn out not to matter here, but perhaps that's why you were worried about 67h for mov reg,reg.)
In your case, the rest of the instruction bytes, starting after the 67, decode as a 3-byte instruction whether the current address-size is 32 or 64. Same even with addressing modes like mov eax, [e/rsi + 1024] (reg+disp32) or addr32 mov edx, [RIP + rel32].
In 16 and 32-bit modes, 67h switches between 16 and 32-bit address size. [x + disp32] vs. [x + disp16] are different lengths after the ModRM byte, but also non-16-bit address-size can signal a SIB byte depending on the R/M field. But in 64-bit mode, 32 and 64-bit address-size both use [x + disp32], and the same ModRM->SIB or not encoding.
There is only one case where a 67h address-size prefix is length-changing in 64-bit mode: movabs load/store with 8-byte vs. 4-byte absolute addresses, and yes it does LCP-stall Intel CPUs. (I posted test results on https://bugs.llvm.org/show_bug.cgi?id=34733#c3)
For example, addr32 movabs [0x123456], al
.intel_syntax noprefix
addr32 mov [0x123456], cl # non-AL to make movabs impossible
mov [0x123456], al # GAS picks normal absolute [disp32]
addr32 mov [0x123456], al # GAS picks A2 movabs since addr32 makes that the shortest choice, same as NASM does.
movabs [0x123456], al # 64-bit absolute address
Note that GAS (fortunately) doesn't choose to use an addr32 prefix on its own, even with as -Os (gcc -Wa,-Os).
$ gcc -c foo.s
$ objdump -drwC -Mintel foo.o
...
0: 67 88 0c 25 56 34 12 00 mov BYTE PTR ds:0x123456,cl
8: 88 04 25 56 34 12 00 mov BYTE PTR ds:0x123456,al # same encoding after the 67
f: 67 a2 56 34 12 00 addr32 mov ds:0x123456,al
15: a2 56 34 12 00 00 00 00 00 movabs ds:0x123456,al # different length for same opcode
As you can see from the last 2 instructions, using the a2 mov moffs, al opcode, with a 67 the rest of the instruction is a different length for the same opcode.
This does LCP-stall on Skylake, so it's only fast when running from the uop cache.
Of course the more common source of LCP stalls is with the 66 prefix and an imm16 (instead of imm32). Like add ax, 1234, as in this random test where I wanted to see if jumping over the LCP-stalling instruction could avoid the problem: Label in %rep section in NASM. But not cases like add ax, 12 which will use add r/m16, imm8 (which is the same length after the 66 prefix as add r/m32, imm8).
Also, Sandybridge-family reportedly avoid LCP stalls for mov-immediate with 16-bit immediate.
Related:
Another example of working around add r/m16, imm16: add 1 byte immediate value to a 2 bytes memory location
x86 assembly 16 bit vs 8 bit immediate operand encoding - choose add r/m16, imm8 instead of the also-3-byte add ax, imm16 form.
Sign or Zero Extension of address in 64bit mode for MOV moffs32? - how address-size interacts with the moffs forms of movabs. (The kind that can LCP-stall)
What methods can be used to efficiently extend instruction length on modern x86? - the general case of what you're doing.
Tuning advice and uarch details:
Usually don't try to save space with addr32 mov [0x123456], al, except maybe when it's a choice between saving 1 byte or using 15 bytes of padding including actual NOPs inside a loop. (more tuning advice below)
One LCP stall usually won't be a disaster with a uop cache, especially if length-decoding probably isn't a front-end bottleneck here (although it often can be if the front-end is a bottleneck at all). Hard to test a single instance in one function by micro-benchmarking, though; only a real full-app benchmark will accurately reflect when the code can run from the uop cache (what Intel perf counters call the DSB), bypassing legacy decode (MITE).
There are queues between stages in modern CPUs that can at least partly absorb stalls https://www.realworldtech.com/haswell-cpu/2/ (moreso than in PPro/PIII), and SnB-family has shorter LCP-stalls than Core2/Nehalem. (But other reasons for pre-decode slowness already dips into their capacity, and after an I-cache miss they may all be empty.)
When prefixes aren't length-changing, the pre-decode pipeline stage that finds instruction boundaries (before steering chunks of bytes to actual complex/simple decoders or doing actual decoding) will find the correct instruction-length / end by skipping all prefixes and then looking at just the opcode (and modrm if applicable).
This pre-decode length-finding is where LCP stalls happen, so fun fact: even Core 2's pre-decode loop buffer can hide LCP stalls in subsequent iterations because it locks down up to 64 bytes / 18 insns of x86 machine code after finding instruction boundaries, using the decode queue (pre-decode output) as a buffer.
In later CPUs, the LSD and uop cache are post decode, so unless something defeats the uop cache (like the pesky JCC-erratum mitigation or simply having too many uops for the uop cache in a 32-byte aligned block of x86 machine code), loops only pay the LCP-stall cost on the first iteration, if they weren't already hot.
I'd say generally work around LCP stalls if you can do so cheaply, especially for code that usually runs "cold". Or if you can just use 32-bit operand-size and avoid partial-register shenanigans, costing usually only a byte of code-size and no extra instructions or uops. Or if you'd have multiple LCP stalls in a row, e.g. from naively using 16-bit immediates, that would be too many bubbles for buffers to hide so you'd have a real problem and it's worth spending extra instructions. (e.g. mov eax, imm32 / add [mem], ax, or movzx load / add r32,imm32 / store, or whatever.)
Padding to end 16-byte fetch blocks at instruction boundaries: not needed
(This is separate from aligning the start of a fetch block at a branch target, which is also sometimes unnecessary given the uop cache.)
Wikichip's section on Skylake pre-decode is incorrectly implies that a partial instruction left at the end of a block has to pre-decode on its own, rather than along with the next 16-byte group that contains the end of the instruction. It seems to be paraphrased from Agner Fog's text, with some changes and additions that make it wrong:
[from wikichip...] As with previous microarchitectures, the pre-decoder has a throughput of 6 macro-ops per cycle or until all 16 bytes are consumed, whichever happens first. Note that the predecoder will not load a new 16-byte block until the previous block has been fully exhausted. For example, suppose a new chunk was loaded, resulting in 7 instructions. In the first cycle, 6 instructions will be processed and a whole second cycle will be wasted for that last instruction. This will produce the much lower throughput of 3.5 instructions per cycle which is considerably less than optimal.
[this part is paraphrased from Agner Fog's Core2/Nehalem section, with the word "fully" having been added"]
Likewise, if the 16-byte block resulted in just 4 instructions with 1 byte of the 5th instruction received, the first 4 instructions will be processed in the first cycle and a second cycle will be required for the last instruction. This will produce an average throughput of 2.5 instructions per cycle.
[nothing like this appears in the current version of Agner's guide, IDK where this misinformation came from. Perhaps made up based on a misunderstanding of what Agner said, but without testing.]
Fortunately no. The rest of the instruction is in the next fetch block, so reality makes a lot more sense: the leftover bytes are prepended to the next 16-byte block.
(Starting a new 16-byte pre-decode block starting with this instruction would also have been plausible, but my testing rules that out: 2.82 IPC with a repeating 5,6,6 byte = 17-byte pattern. If it only ever looked at 16 bytes and left the partial 5 or 6-byte instruction to be the start of the next block, that would give us 2 IPC.)
A repeating pattern of 3x 5 byte instructions unrolled many times (a NASM %rep 2500 or GAS .rept 2500 block, so 7.5k instructions in ~36kiB) runs at 3.19 IPC, pre-decoding and decoding at ~16 bytes per cycle. (16 bytes/cycle) / (5 bytes/insn) = 3.2 instructions per cycle theoretical.
(If wikichip was right, it would predict close to 2 IPC in a 3-1 pattern, which is of course unreasonably low and would be not be an acceptable design for Intel for long runs of long or medium-length when running from legacy decode. 2 IPC is so much narrower than the 4-wide pipeline that it would not be ok even for legacy decode. Intel learned from P4 that running at least decently well from legacy decode is important, even when your CPU caches decoded uops. That's why SnB's uop cache can be so small, only ~1.5k uops. A lot smaller than P4's trace cache, but P4's problem was trying to replace L1i with a trace cache, and having weak decoders. (Also the fact it was a trace cache, so it cached the same code multiple times.))
These perf differences are large enough that you can verify it on your Mac, using a plenty-large repeat count so you don't need perf counters to verify uop-cache misses. (Remember that L1i is inclusive of uop cache, so loops that don't fit in L1i will also evict themselves from uop cache.) Anyway, just measuring total time and knowing the approximate max-turbo that you'll hit is sufficient for a sanity check like this.
Getting better than the theoretical-max that wikichip predicts, even after startup overhead and conservative frequency estimates, will completely rule out that behaviour even on a machine where you don't have perf counters.
$ nasm -felf64 && ld # 3x 5 bytes, repeated 2.5k times
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_retired.retire_slots,uops_executed.thread,idq.dsb_uops -r2 ./testloop
Performance counter stats for './testloop' (2 runs):
604.16 msec task-clock # 1.000 CPUs utilized ( +- 0.02% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
1 page-faults # 0.002 K/sec
2,354,699,144 cycles # 3.897 GHz ( +- 0.02% )
7,502,000,195 instructions # 3.19 insn per cycle ( +- 0.00% )
7,506,746,328 uops_issued.any # 12425.167 M/sec ( +- 0.00% )
7,506,686,463 uops_retired.retire_slots # 12425.068 M/sec ( +- 0.00% )
7,506,726,076 uops_executed.thread # 12425.134 M/sec ( +- 0.00% )
0 idq.dsb_uops # 0.000 K/sec
0.6044392 +- 0.0000998 seconds time elapsed ( +- 0.02% )
(and from another run):
7,501,076,096 idq.mite_uops # 12402.209 M/sec ( +- 0.00% )
No clue why idq.mite_uops:u is not equal to issued or retired. There's nothing to un-laminate, and no stack-sync uops should be necessary, so IDK where the extra issued+retired uops could be coming from. The excess is consistent across runs, and proportional to the %rep count I think.
With other pattern like 5-5-6 (16 bytes) and 5-6-6 (17 bytes), I get similar results.
I do sometimes measure a slight difference when the 16-byte groups are misaligned relative to an absolute 16-byte boundary or not (put a nop at the top of the loop). But that seems to only happen with larger repeat counts. %rep 2500 for 39kiB total size, I still get 2.99 IPC (just under one 16-byte group per cycle), with 0 DSB uops, regardless of aligned vs. misaligned.
I still get 2.99IPC at %rep 5000, but I see a diff at %rep 10000: 2.95 IPC misaligned vs. 2.99 IPC aligned. That largest %rep count is ~156kiB and still fits in the 256k L2 cache so IDK why anything would be different from half that size. (They're much larger than 32k Li1). I think earlier I was seeing a different at 5k, but I can't repro it now. Maybe that was with 17-byte groups.
The actual loop runs 1000000 times in a static executable under _start, with a raw syscall to _exit, so perf counters (and time) for the whole process is basically just the loop. (especially with perf --all-user to only count user-space.)
; complete Linux program
default rel
%use smartalign
alignmode p6, 64
global _start
_start:
mov ebp, 1000000
align 64
.loop:
%ifdef MISALIGN
nop
%endif
%rep 2500
mov eax, 12345 ; 5 bytes.
mov ecx, 123456 ; 5 bytes. Use r8d for 6 bytes
mov edx, 1234567 ; 5 bytes. Use r9d for 6 bytes
%endrep
dec ebp
jnz .loop
.end:
xor edi,edi
mov eax,231 ; __NR_exit_group from /usr/include/asm/unistd_64.h
syscall ; sys_exit_group(0)

Timing of executing 8 bit and 64 bit instructions on 64 bit x64/Amd64 processors

Is there any execution timing difference between 8 it and 64 bit instructions on 64 bit x64/Amd64 processor, when those instructions are similar/same except bit width?
Is there a way to find real processor timing of executing these 2 tiny assembly functions?
-Thanks.
; 64 bit instructions
add64:
mov $0x1, %rax
add $0x2, %rax
ret
; 8 bit instructions
add8:
mov $0x1, %al
add $0x2, %al
ret

Yes, there's a difference. mov $0x1, %al has a false dependency on the old value of RAX on most CPUs, including everything newer than Sandybridge. It's a 2-input 1-output instruction; from the CPU's point of view it's like add $1, %al as far as scheduling it independently or not relative to other uses of RAX. Only writing a 32 or 64-bit register starts a new dependency chain.
This means the AL return value of your add8 function might not be ready until after a cache miss for some independent work the caller happened to be doing in EAX before the call, but the RAX result of add64 could be ready right away for out-of-order execution to get started on later instructions in the caller that use the return value. (Assuming their other inputs are also ready.)
Why doesn't GCC use partial registers? and
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
and What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? - Important background for understanding performance on modern OoO exec CPUs.
Their code-size also differs: Both the 8-bit instructions are 2 bytes long. (Thanks to the AL, imm8 short-form encoding; add $1, %dl would be 3 bytes). The RAX instructions are 7 and 4 bytes long. This matters for L1i cache footprint (and on a large scale, for how many bytes have to get paged in from disk). On a small scale, how many instructions can fit into a 16 or 32-byte fetch block if the CPU is doing legacy decode because the code wasn't already hot in the uop cache. Also code-alignment of later instructions is affected by varying lengths of previous instructions, sometimes affecting which branches alias each other.
https://agner.org/optimize/ explains the details of the pipelines of various x86 microarchitectures, including front-end decoding effects that can make instruction-length matter beyond just code density in the I-cache / uop-cache.
Generally 32-bit operand-size is the most efficient (for performance, and pretty good for code-size). 32 and 8 are the operand-sizes that x86-64 can use without extra prefixes, and in practice with 8-bit to avoid stalls and badness you need more instructions or longer instructions because they don't zero-extend. The advantages of using 32bit registers/instructions in x86-64.
A few instructions are actually slower in the ALUs for 64-bit operand-size, not just front-end effects. That includes div on most CPUs, and imul on some older CPUs. Also popcnt and bswap. e.g. Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux
Note that mov $0x1, %rax will assemble to 7 bytes with GAS, unless you use as -O2 (not the same as gcc -O2, see this for examples) to get it to optimize to mov $1, %eax which exactly the same architectural effects, but is shorter (no REX or ModRM byte). Some assemblers do that optimization by default, but GAS doesn't. Why NASM on Linux changes registers in x86_64 assembly has more about why this optimization is safe and good, and why you should do it yourself in the source especially if your assembler doesn't do it for you.
But other than the false dep and code-size, they're the same for the back-end of the CPU: all those instructions are single-uop and can run on any scalar-integer ALU execution port1. (https://uops.info/ has automated test results for every form of every unprivileged instruction).
Footnote 1: Excavator (last-gen Bulldozer-family) can also run mov $imm, %reg on 2 more ports (AGU) for 32 and 64-bit operand-size. But merging a new low-8 or low-16 into a full register needs an ALU port. So mov $1, %rax has 4/clock throughput on Excavator, but mov $1, %al only has 2/clock throughput. (And of course only if you use a few different destination registers, not actually AL repeatedly; that would be a latency bottleneck of 1/clock because of the false dependency from writing a partial register on that microarchitecture.)
Previous Bulldozer-family CPUs starting with Piledriver can run mov reg, reg (for r32 or r64) on EX0, EX1, AGU0, AGU1, while most ALU instructions including mov $imm, %reg can only run on EX0/1. Further extending the AGU port's capabilities to also handle mov-immediate was a new feature in Excavator.
Fortunately Bulldozer was obsoleted by AMD's much better Zen architecture which has 4 full scalar integer ALU ports / execution units. (And a wider front end and a uop cache, good caches, and generally doesn't suck in a lot of the ways that Bulldozer sucked.)
Is there a way to measure it?
yes, but generally not in a function you call with call. Instead put it in an unrolled loop so you can run it lots of times with minimal other instructions. Especially useful to look at CPU performance counter results to find front-end / back-end uop counts, as well as just the overall time for your loop.
You can construct your loop to measure latency or throughput; see RDTSCP in NASM always returns the same value (timing a single instruction). Also:
Assembly - How to score a CPU instruction by latency and throughput
Idiomatic way of performance evaluation?
Can x86's MOV really be "free"? Why can't I reproduce this at all? is a good specific example of constructing a microbenchmark to measure / prove something specific.
Generally you don't need to measure yourself (although it's good to understand how, that helps you know what the measurements really mean). People have already done that for most CPU microarchitectures. You can predict performance for a specific CPU for some loops (if you can assume no stalls or cache misses) based on analyzing the instructions. Often that can predict performance fairly accurately, but medium-length dependency chains that OoO exec can only partially hide makes it too hard to accurately predict or account for every cycle.
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? has links to lots of good details, and stuff about CPU internals.
How many CPU cycles are needed for each assembly instruction? (you can't add up a cycle count for each instruction; front-end and back-end throughput, and latency, could each be the bottleneck for a loop.)

Which is generally faster to test for zero in x86 ASM: "TEST EAX, EAX" versus "TEST AL, AL"?

Which is generally faster to test the byte in AL for zero / non-zero?
TEST EAX, EAX
TEST AL, AL
Assume a previous "MOVZX EAX, BYTE PTR [ESP+4]" instruction loaded a byte parameter with zero-extension to the remainder of EAX, preventing the combine-value penalty that I already know about.
So AL=EAX and there are no partial-register penalties for reading EAX.
Intuitively just examining AL might let you think it's faster, but I'm betting there are more penalty issues to consider for byte access of a >32-bit register.
Any info/details appreciated, thanks!

Code-size is equal, and so is performance on all x86 CPUs AFAIK.
Intel CPUs (with partial-register renaming) definitely don't have a penalty for reading AL after writing EAX. Other CPUs also have no penalty for reading low-byte registers.
Reading AH would have a penalty on Intel CPUs, like some extra latency. (How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent)
In general 32-bit operand-size and 8-bit operand size (with low-8 not high-8) are equal speed except for the false-dependencies or later partial-register reading penalties of writing an 8-bit register. Since TEST only reads registers, this can't be a problem. Even add al, bl is fine: the instruction already had an input dependency on both registers, and on Sandybridge-family a RMW to the low byte of a register doesn't rename it separately. (Haswell and later don't rename low-byte registers separately anyway).
Pick whichever operand-size you like. 8-bit and 32-bit are basically equal. The choice is just a matter of human readability. If you're going to work with the value as a 32-bit integer later, then go 32-bit. If it's logically still an 8-bit value and you were only using movzx as the x86 equivalent of ARM ldrb or MIPS lbu, then using 8-bit makes sense.
There are code-size advantages to instructions like cmp al, imm which can use the no-modrm short-form encoding. cmp al, 0 is still worse than test al,al on some old CPUs (Core 2), where cmp/jcc macro-fusion is less flexible than test/jcc macro-fusion. (Test whether a register is zero with CMP reg,0 vs OR reg,reg?)
There is one difference between these instructions: test al,al sets SF according to the high bit of AL (which can be non-zero). test eax,eax will always clear SF. If you only care about ZF then that makes no difference, but if you have a use for the high bit in SF for a later branch or cmovcc/setcc then you can avoid doing a 2nd test.
Other ways to test a byte in memory:
If you're consuming the flag result with setcc or cmovcc, not a jcc branch, then macro-fusion doesn't matter in the discussion below.
If you also need the actual value in a register later, movzx/test/jcc is almost certainly best. Otherwise you can consider a memory-destination compare.
cmp [mem], immediate can micro-fuse into a load+cmp uop on Intel, as long as the addressing mode is not RIP-relative. (On Sandybridge-family, indexed addressing modes will un-laminate even on Haswell and later: See Micro fusion and addressing modes). Agner Fog doesn't mention whether AMD has this limitation for fusing cmp/jcc with a memory operand.
;;; no downside for setcc or cmovcc, only with JCC on Intel
;;; unknown on AMD
cmp byte [esp+4], 0 ; micro-fuses into load+cmp with this addressing mode
jnz ... ; breaks macro-fusion on SnB-family
I don't have an AMD CPU to test whether Ryzen or any other AMD still fuses cmp/jcc when the cmp is mem, immediate. Modern AMD CPUs do in general do cmp/jcc and test/jcc fusion. (But not add/sub/and/jcc fusion like SnB-family).
cmp mem,imm / jcc (vs. movzx/test+jcc):
smaller code-size in bytes
same number of front-end / fused-domain uops (2) on mainstream Intel. This would be 3 front-end uops if micro-fusion of the cmp+load wasn't possible, e.g. with a RIP-relative addressing mode + immediate. Or on Sandybridge-family with an indexed addressing mode, it would unlaminate to 3 uops after decode but before issuing into the back-end.
Advantage: this is still 2 on Silvermont/Goldmont / KNL or very old CPUs without macro-fusion. The main advantage of movzx/test/jcc over this is macro-fusion, so it falls behind on CPUs where that doesn't happen.
3 back-end uops (unfused domain = execution ports and space in the scheduler aka RS) because cmp-immediate can't macro-fuse with a JCC on Intel Sandybridge-family CPUs (tested on Skylake). The uops are load, cmp, and a separate branch uop. (vs. 2 for movzx / test+jcc). Back-end uops usually aren't a bottleneck directly, but if the load isn't ready for a while it takes up more space in the RS, limiting how much further past this out-of-order execution can see.
cmp [mem], reg / jcc can macro + micro-fuse into a single compare+branch uop so it's excellent. If you need a zeroed register for anything later in your function, do xor-zero it first and use it for a single-uop compare+branch on memory.
movzx eax, [esp+4] ; 1 uop (load-port only on Intel and Ryzen)
test al,al ; fuses with jcc
jnz ... ; 1 uop
This is still 2 uops for the front-end but only 2 for the back-end as well. The test/jcc macro-fuse together. It costs more code-size, though.
If you aren't branching but instead using the FLAGS result for cmovcc or setcc, using cmp mem, imm has no downside. It can micro-fuse as long as you don't use a RIP-relative addressing mode (which always blocks micro-fusion when there's also an immediate), or an indexed addressing mode.

Assembly - How to score a CPU instruction by latency and throughput

I'm looking for a type of a formula / way to measure how fast an instruction is, or more specific to give a "score" each of the instruction by CPU cycles.
Let's take the follow assembly program for an example,
nop
mov eax,dword ptr [rbp+34h]
inc eax
mov dword ptr [rbp+34h],eax
and the following Intel Skylake information:
mov r,m : Throughput=0.5 Latency=2
mov m,r
: Throughput=1 Latency=2
nop : Throughput=0.25 Latency=non
inc : Throughput=0.25 Latency=1
I know that the order of the instructions in the program are matter in here but
I'm looking to create something general that not need to be "accurate to the single cycle"
any one have any idea how can I do that?

There is no formula you can apply; you have to measure.
The same instruction on different versions of the same uarch family can have different performance. e.g. mulps:
Sandybridge 1c / 5c throughput/latency.
HSW 0.5 / 5. BDW 0.5 / 3 (faster multiply path in the FMA unit? FMA is still 5c).
SKL 0.5 / 4 (lower latency FMA, too). SKL runs addps on the FMA unit as well, dropping the dedicated FP multiply unit so add latency is higher, but throughput is higher.
There's no way you could predict any of this without measuring, or knowing some microarchitectural details. We expect FP math ops won't be single-cycle latency, because they're much more complicated than integer ops. (So if they were single cycle, the clock speed is set too low for integer ops.)
You measure by repeating the instruction many times in an unrolled loop. Or fully unrolled with no looping, but then you defeat the uop-cache and can get front-end bottlenecks. (e.g. for decoding 10-byte mov r64, imm64)
https://uops.info/ has already automated this testing for every form of every (unprivileged) instruction, and you can even click on any table entry to see what test loops they used. e.g. Skylake xchg r32, eax latency testing (https://uops.info/html-lat/SKL/XCHG_R32_EAX-Measurements.html) from each input operand to each output. (2 cycle latency from EAX -> R8D, but 1 cycle latency from R8D -> EAX.) So we can guess that the 3 uops include copying EAX to an internal temporary, but moving directly from the other operand to EAX.
https://uops.info/ is the current best source of test data; when it and Agner's tables disagree, my own measurements and/or other sources have always confirmed uops.info's testing was accurate. And they don't try to make up a latency number for 2 halves of a round-trip like movd xmm0,eax and back, they show you the range of possible latencies assuming the rest of the chain was the minimum plausible.
Agner Fog creates his instruction tables (which you appear to be reading) by timing large non-looping blocks of code that repeat an instruction. https://agner.org/optimize/. The intro section of his instruction-tables explains briefly how he measures, and his microarch guide explains more details of how different x86 microarchitectures work internally. Unfortunately there are occasional typos or copy/paste errors in his hand-edited tables.
http://instlatx64.atw.hu/ also has results of experimental measurements. I think they use a similar technique of a large block of the same instruction repeated, maybe small enough to fit in the uop cache. But they don't use perf counters to measure what execution port each instruction needs, so their throughput numbers don't help you figure out which instructions compete with which other instructions.
These latter two sources have been around for longer than uops.info, and cover some older CPUs, especially older AMD.
To measure latency yourself, you make the output of each instruction an input for the next.
mov ecx, 10000000
inc_latency:
inc eax
inc eax
inc eax
inc eax
inc eax
inc eax
sub ecx,1 ; avoid partial-flag false dep for P4
jnz inc_latency ; dec or sub/jnz macro-fuses into 1 uop on Intel SnB-family
This dependency chain of 7 inc instructions will bottleneck the loop at 1 iteration per 7 * inc_latency cycles. Using perf counters for core clock cycles (not RDTSC cycles), you can easily measure the time for all the iterations to 1 part in 10k, and with more care probably even more precisely than that. The repeat count of 10000000 hides start/stop overhead of whatever timing you use.
I normally put a loop like this in a Linux static executable that just makes a sys_exit(0) system call directly (with a syscall) instruction, and time the whole executable with perf stat ./testloop to get time and a cycle count. (See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example).
Another example is Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths, with the added complication of using lfence to drain the out-of-order execution window for two dep chains.
To measure throughput, you use separate registers, and/or include an xor-zeroing occasionally to break dep chains and let out-of-order exec overlap things. Don't forget to also use perf counters to see which ports it can run on, so you can tell which other instructions it will compete with. (e.g. FMA (p01) and shuffles (p5) don't compete at all for back-end resources on Haswell/Skylake, only for front-end throughput.) Don't forget to measure front-end uop counts, too: some instructions decode to multiply uops.
How many different dependency chains do we need to avoid a bottleneck? Well we know the latency (measure it first), and we know the max possible throughput (number of execution ports, or front-end throughput.)
For example, if FP multiply had 0.25c throughput (4 per clock), we could keep 20 in flight at once on Haswell (5c latency). That's more than we have registers, so we could just use all 16 and discover that in fact the throughput is only 0.5c. But if it had turned out that 16 registers was a bottleneck, we could add xorps xmm0,xmm0 occasionally and let out-of-order execution overlap some blocks.
More is normally better; having just barely enough to hide latency can slow down with imperfect scheduling. If we wanted to go nuts measuring inc, we'd do this:
mov ecx, 10000000
inc_latency:
%rep 10 ;; source-level repeat of a block, no runtime branching
inc eax
inc ebx
; not ecx, we're using it as a loop counter
inc edx
inc esi
inc edi
inc ebp
inc r8d
inc r9d
inc r10d
inc r11d
inc r12d
inc r13d
inc r14d
inc r15d
%endrep
sub ecx,1 ; break partial-flag false dep for P4
jnz inc_latency ; dec/jnz macro-fuses into 1 uop on Intel SnB-family
If we were worried about partial-flag false dependencies or flag-merging effects, we might experiment with mixing in an xor eax,eax somewhere to let OoO exec overlap more than just when sub wrote all flags. (See INC instruction vs ADD 1: Does it matter?)
There's a similar problem for measuring throughput and latency of shl r32, cl on Sandybridge-family: the flag dependency chain isn't normally relevant for a computation, but putting shl back-to-back creates a dependency through FLAGS as well as through the register. (Or for throughput, there isn't even a register dep).
I posted about this on Agner Fog's blog: https://www.agner.org/optimize/blog/read.php?i=415#860. I mixed shl edx,cl in with four add edx,1 instructions, to see what incremental slowdown adding one more instruction had, where the FLAGS dependency was a non-issue. On SKL, it only slows down by an extra 1.23 cycles on average, so the true latency cost of that shl was only ~1.23 cycles, not 2. (It's not a whole number or just 1 because of resource conflicts to run the flag-merging uops of the shl, I guess. BMI2 shlx edx, edx, ecx would be exactly 1c because it's only a single uop.)
Related: for static performance analysis of whole blocks of code (containing different instructions), see What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?. (It's using the word "latency" for the end-to-end latency of a whole computation, but actually asking about things small enough for OoO exec to overlap different parts, so instruction latency and throughput both matter.)
The Latency=2 numbers for load/store appear to be from Agner Fog's instruction tables (https://agner.org/optimize/). They unfortunately aren't accurate for a chain of mov rax, [rax]. You'll find that's 4c
latency if you measure it by putting that in a loop.
Agner splits up load/store latency into something that makes the total store/reload latency come out correct, but for some reason he doesn't make the load part equal to the L1d load-use latency when it comes from cache instead of the store buffer. (But also note that if the load feeds an ALU instruction instead of another load, the latency is 5c. So the simple addressing-mode fast-path only helps for pure pointer-chasing.)

Are scaled-index addressing modes a good idea?

Consider the following code:
void foo(int* __restrict__ a)
{
int i; int val = 0;
for (i = 0; i < 100; i++) {
val = 2 * i;
a[i] = val;
}
}
This complies (with maximum optimization but no unrolling or vectorization) into...
GCC 7.2:
foo(int*):
xor eax, eax
.L2:
mov DWORD PTR [rdi], eax
add eax, 2
add rdi, 4
cmp eax, 200
jne .L2
rep ret
clang 5.0:
foo(int*): # #foo(int*)
xor eax, eax
.LBB0_1: # =>This Inner Loop Header: Depth=1
mov dword ptr [rdi + 2*rax], eax
add rax, 2
cmp rax, 200
jne .LBB0_1
ret
What are the pros and cons of GCC's vs clang's approach? i.e. an extra variable incremented separately, vs multiplying via a more complex addressing mode?
Notes:
This question also relates to this one with about the same code, but with float's rather than int's.

Yes, take advantage of the power of x86 addressing modes to save uops, in cases where an index doesn't unlaminate into more extra uops than it would cost to do pointer increments.
(In many cases unrolling and using pointer increments is a win because of unlamination on Intel Sandybridge-family, but if you're not unrolling or if you're only using mov loads instead of folding memory operands into ALU ops for micro-fusion, then indexed addressing modes are often break even on some CPUs and a win on others.)
It's essential to read and understand Micro fusion and addressing modes if you want to make optimal choices here. (And note that IACA gets it wrong, and doesn't simulate Haswell and later keeping some uops micro-fused, so you can't even just check your work by having it do static analysis for you.)
Indexed addressing modes are generally cheap. At worst they cost one extra uop for the front-end (on Intel SnB-family CPUs in some situations), and/or prevent a store-address uop from using port7 (which only supports base + displacement addressing modes). See Agner Fog's microarch pdf, and also David Kanter's Haswell write-up, for more about the store-AGU on port7 which Intel added in Haswell.
On Haswell+, if you need your loop to sustain more than 2 memory ops per clock, then avoid indexed stores.
At best they're free other than the code-size cost of the extra byte in the machine-code encoding. (Having an index register requires a SIB (Scale Index Base) byte in the encoding).
More often the only penalty is the 1 extra cycle of load-use latency vs. a simple [base + 0-2047] addressing mode, on Intel Sandybridge-family CPUs.
It's usually only worth using an extra instruction to avoid an indexed addressing mode if you're going to use that addressing mode in multiple instructions. (e.g. load / modify / store).
Scaling the index is free (on modern CPUs at least) if you're already using a 2-register addressing mode. For lea, Agner Fog's table lists AMD Ryzen as having 2c latency and only 2 per clock throughput for lea with scaled-index addressing modes (or 3-component), otherwise 1c latency and 0.25c throughput. e.g. lea rax, [rcx + rdx] is faster than lea rax, [rcx + 2*rdx], but not by enough to be worth using extra instructions instead.) Ryzen also doesn't like a 32-bit destination in 64-bit mode, for some reason. But the worst-case LEA is still not bad at all. And anyway, mostly unrelated to address-mode choice for loads, because most CPUs (other than in-order Atom) run LEA on the ALUs, not the AGUs used for actual loads/stores.
The main question is between one-register unscaled (so it can be a "base" register in the machine-code encoding: [base + idx*scale + disp]) or two-register. Note that for Intel's micro-fusion limitations, [disp32 + idx*scale] (e.g. indexing a static array) is an indexed addressing mode.
Neither function is totally optimal (even without considering unrolling or vectorization), but clang's looks very close.
The only thing clang could do better is save 2 bytes of code size by avoiding the REX prefixes with add eax, 2 and cmp eax, 200. It promoted all the operands to 64-bit because it's using them with pointers and I guess proved that the C loop doesn't need them to wrap, so in asm it uses 64-bit everywhere. This is pointless; 32-bit operations are always at least as fast as 64, and implicit zero-extension is free. But this only costs 2 bytes of code-size, and costs no performance other than indirect front-end effects from that.
You've constructed your loop so the compiler needs to keep a specific value in registers and can't totally transform the problem into just a pointer-increment + compare against an end pointer (which compilers often do when they don't need the loop variable for anything except array indexing).
You also can't transform to counting a negative index up towards zero (which compilers never do, but reduces the loop overhead to a total of 1 macro-fused add + branch uop on Intel CPUs (which can fuse add + jcc, while AMD can only fuse test or cmp / jcc).
Clang has done a good job noticing that it can use 2*var as the array index (in bytes). This is a good optimization for tune=generic. The indexed store will un-laminate on Intel Sandybridge and Ivybridge, but stay micro-fused on Haswell and later. (And on other CPUs, like Nehalem, Silvermont, Ryzen, Jaguar, or whatever, there's no disadvantage.)
gcc's loop has 1 extra uop in the loop. It can still in theory run at 1 store per clock on Core2 / Nehalem, but it's right up against the 4 uops per clock limit. (And actually, Core2 can't macro-fuse the cmp/jcc in 64-bit mode, so it bottlenecks on the front-end).

Indexed addressing (in loads and stores, lea is different still) has some trade-offs, for example
On many µarchs, instructions that use indexed addressing have a slightly longer latency than instruction that don't. But usually throughput is a more important consideration.
On Netburst, stores with a SIB byte generate an extra µop, and therefore may cost throughput as well. The SIB byte causes an extra µop regardless of whether you use it for indexes addressing or not, but indexed addressing always costs the extra µop. It doesn't apply to loads.
On Haswell/Broadwell (and still in Skylake/Kabylake), stores with indexed addressing cannot use port 7 for address generation, instead one of the more general address generation ports will be used, reducing the throughput available for loads.
So for loads it's usually good (or not bad) to use indexed addressing if it saves an add somewhere, unless they are part of a chain of dependent loads. For stores it's more dangerous to use indexed addressing. In the example code it shouldn't make a large difference. Saving the add is not really relevant, ALU instructions wouldn't be the bottleneck. The address generation happening in ports 2 or 3 doesn't matter either since there are no loads.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio