Cannot interpret memory bandwidth numbers

Cannot interpret memory bandwidth numbers - performance

I have written a benchmark to compute memory bandwidth:
#include <benchmark/benchmark.h>
double sum_array(double* v, long n)
{
double s = 0;
for (long i =0 ; i < n; ++i) {
s += v[i];
}
return s;
}
void BM_MemoryBandwidth(benchmark::State& state) {
long n = state.range(0);
double* v = (double*) malloc(state.range(0)*sizeof(double));
for (auto _ : state) {
benchmark::DoNotOptimize(sum_array(v, n));
}
free(v);
state.SetComplexityN(state.range(0));
state.SetBytesProcessed(int64_t(state.range(0))*int64_t(state.iterations())*sizeof(double));
}
BENCHMARK(BM_MemoryBandwidth)->RangeMultiplier(2)->Range(1<<5, 1<<23)->Complexity(benchmark::oN);
BENCHMARK_MAIN();
I compile with
g++-9 -masm=intel -fverbose-asm -S -g -O3 -ffast-math -march=native --std=c++17 -I/usr/local/include memory_bandwidth.cpp
This produces a bunch of moves from RAM, and then some addpd instructions which perf says are hot, so I go into the generated asm and remove them, then assemble and link via
$ g++-9 -c memory_bandwidth.s -o memory_bandwidth.o
$ g++-9 memory_bandwidth.o -o memory_bandwidth.x -L/usr/local/lib -lbenchmark -lbenchmark_main -pthread -fPIC
At this point, get a perf output that I expect: Movement of data into xmm registers, increments of the pointer, and a jmp at the end of the loop:
All fine and well up to here. Now here's where things get weird:
I inquire of my hardware what the memory bandwidth is:
$ sudo lshw -class memory
*-memory
description: System Memory
physical id: 3c
slot: System board or motherboard
size: 16GiB
*-bank:1
description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
vendor: AMI
physical id: 1
slot: ChannelA-DIMM1
size: 8GiB
width: 64 bits
clock: 2400MHz (0.4ns)
So I should be getting at most 8 bytes * 2.4 GHz = 19.2 gigabytes/second.
But instead I get 48 gigabytes/second:
-------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------
BM_MemoryBandwidth/32 6.43 ns 6.43 ns 108045392 bytes_per_second=37.0706G/s
BM_MemoryBandwidth/64 11.6 ns 11.6 ns 60101462 bytes_per_second=40.9842G/s
BM_MemoryBandwidth/128 21.4 ns 21.4 ns 32667394 bytes_per_second=44.5464G/s
BM_MemoryBandwidth/256 47.6 ns 47.6 ns 14712204 bytes_per_second=40.0884G/s
BM_MemoryBandwidth/512 86.9 ns 86.9 ns 8057225 bytes_per_second=43.9169G/s
BM_MemoryBandwidth/1024 165 ns 165 ns 4233063 bytes_per_second=46.1437G/s
BM_MemoryBandwidth/2048 322 ns 322 ns 2173012 bytes_per_second=47.356G/s
BM_MemoryBandwidth/4096 636 ns 636 ns 1099074 bytes_per_second=47.9781G/s
BM_MemoryBandwidth/8192 1264 ns 1264 ns 553898 bytes_per_second=48.3047G/s
BM_MemoryBandwidth/16384 2524 ns 2524 ns 277224 bytes_per_second=48.3688G/s
BM_MemoryBandwidth/32768 5035 ns 5035 ns 138843 bytes_per_second=48.4882G/s
BM_MemoryBandwidth/65536 10058 ns 10058 ns 69578 bytes_per_second=48.5455G/s
BM_MemoryBandwidth/131072 20103 ns 20102 ns 34832 bytes_per_second=48.5802G/s
BM_MemoryBandwidth/262144 40185 ns 40185 ns 17420 bytes_per_second=48.6035G/s
BM_MemoryBandwidth/524288 80351 ns 80347 ns 8708 bytes_per_second=48.6171G/s
BM_MemoryBandwidth/1048576 160855 ns 160851 ns 4353 bytes_per_second=48.5699G/s
BM_MemoryBandwidth/2097152 321657 ns 321643 ns 2177 bytes_per_second=48.5787G/s
BM_MemoryBandwidth/4194304 648490 ns 648454 ns 1005 bytes_per_second=48.1915G/s
BM_MemoryBandwidth/8388608 1307549 ns 1307485 ns 502 bytes_per_second=47.8017G/s
BM_MemoryBandwidth_BigO 0.16 N 0.16 N
BM_MemoryBandwidth_RMS 1 % 1 %
What am I misunderstanding about memory bandwidth that has made my calculations come out wrong by more than a factor of 2?
(Also, this is kinda an insane workflow to empirically determine how much memory bandwidth I have. Is there a better way?)
Full asm for sum_array after removing add instructions:
_Z9sum_arrayPdl:
.LVL0:
.LFB3624:
.file 1 "example_code/memory_bandwidth.cpp"
.loc 1 5 1 view -0
.cfi_startproc
.loc 1 6 5 view .LVU1
.loc 1 7 5 view .LVU2
.LBB1545:
# example_code/memory_bandwidth.cpp:7: for (long i =0 ; i < n; ++i) {
.loc 1 7 24 is_stmt 0 view .LVU3
test rsi, rsi # n
jle .L7 #,
lea rax, -1[rsi] # tmp105,
cmp rax, 1 # tmp105,
jbe .L8 #,
mov rdx, rsi # bnd.299, n
shr rdx # bnd.299
sal rdx, 4 # tmp107,
mov rax, rdi # ivtmp.311, v
add rdx, rdi # _44, v
pxor xmm0, xmm0 # vect_s_10.306
.LVL1:
.p2align 4,,10
.p2align 3
.L5:
.loc 1 8 9 is_stmt 1 discriminator 2 view .LVU4
# example_code/memory_bandwidth.cpp:8: s += v[i];
.loc 1 8 11 is_stmt 0 discriminator 2 view .LVU5
movupd xmm2, XMMWORD PTR [rax] # tmp115, MEM[base: _24, offset: 0B]
add rax, 16 # ivtmp.311,
.loc 1 8 11 discriminator 2 view .LVU6
cmp rax, rdx # ivtmp.311, _44
jne .L5 #,
movapd xmm1, xmm0 # tmp110, vect_s_10.306
unpckhpd xmm1, xmm0 # tmp110, vect_s_10.306
mov rax, rsi # tmp.301, n
and rax, -2 # tmp.301,
test sil, 1 # n,
je .L10 #,
.L3:
.LVL2:
.loc 1 8 9 is_stmt 1 view .LVU7
# example_code/memory_bandwidth.cpp:8: s += v[i];
.loc 1 8 11 is_stmt 0 view .LVU8
addsd xmm0, QWORD PTR [rdi+rax*8] # <retval>, *_3
.LVL3:
# example_code/memory_bandwidth.cpp:7: for (long i =0 ; i < n; ++i) {
.loc 1 7 5 view .LVU9
inc rax # i
.LVL4:
# example_code/memory_bandwidth.cpp:7: for (long i =0 ; i < n; ++i) {
.loc 1 7 24 view .LVU10
cmp rsi, rax # n, i
jle .L1 #,
.loc 1 8 9 is_stmt 1 view .LVU11
# example_code/memory_bandwidth.cpp:8: s += v[i];
.loc 1 8 11 is_stmt 0 view .LVU12
addsd xmm0, QWORD PTR [rdi+rax*8] # <retval>, *_6
.LVL5:
.loc 1 8 11 view .LVU13
ret
.LVL6:
.p2align 4,,10
.p2align 3
.L7:
.loc 1 8 11 view .LVU14
.LBE1545:
# example_code/memory_bandwidth.cpp:6: double s = 0;
.loc 1 6 12 view .LVU15
pxor xmm0, xmm0 # <retval>
.loc 1 10 5 is_stmt 1 view .LVU16
.LVL7:
.L1:
# example_code/memory_bandwidth.cpp:11: }
.loc 1 11 1 is_stmt 0 view .LVU17
ret
.p2align 4,,10
.p2align 3
.L10:
.loc 1 11 1 view .LVU18
ret
.LVL8:
.L8:
.LBB1546:
# example_code/memory_bandwidth.cpp:7: for (long i =0 ; i < n; ++i) {
.loc 1 7 15 view .LVU19
xor eax, eax # tmp.301
.LBE1546:
# example_code/memory_bandwidth.cpp:6: double s = 0;
.loc 1 6 12 view .LVU20
pxor xmm0, xmm0 # <retval>
jmp .L3 #
.cfi_endproc
.LFE3624:
.size _Z9sum_arrayPdl, .-_Z9sum_arrayPdl
.section .text.startup,"ax",#progbits
.p2align 4
.globl main
.type main, #function
Full output of lshw -class memory:
*-firmware
description: BIOS
vendor: American Megatrends Inc.
physical id: 0
version: 1.90
date: 10/21/2016
size: 64KiB
capacity: 15MiB
capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
*-memory
description: System Memory
physical id: 3c
slot: System board or motherboard
size: 16GiB
*-bank:0
description: [empty]
physical id: 0
slot: ChannelA-DIMM0
*-bank:1
description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
product: CMU16GX4M2A2400C16
vendor: AMI
physical id: 1
serial: 00000000
slot: ChannelA-DIMM1
size: 8GiB
width: 64 bits
clock: 2400MHz (0.4ns)
*-bank:2
description: [empty]
physical id: 2
slot: ChannelB-DIMM0
*-bank:3
description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
product: CMU16GX4M2A2400C16
vendor: AMI
physical id: 3
serial: 00000000
slot: ChannelB-DIMM1
size: 8GiB
width: 64 bits
clock: 2400MHz (0.4ns)
Is the CPU relevant here? Well here's the specs:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Pentium(R) CPU G4400 # 3.30GHz
Stepping: 3
CPU MHz: 3168.660
CPU max MHz: 3300.0000
CPU min MHz: 800.0000
BogoMIPS: 6624.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0,1
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust erms invpcid rdseed smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
The data produced by the clang compile is much more intelligible. The performance monotonically decreases until it hits 19.8Gb/s as the vector gets much larger than cache:
Here's the benchmark output:

It looks like from your hardware description that you have two DIMM slots that are placed into two channels. This interleaves memory between the two DIMM chips, so that memory accesses will be reading from both chips. (One possibility is that bytes 0-7 are in DIMM1 and bytes 8-15 are in DIMM2, but this depends on the hardware implementation.) This doubles the memory bandwidth because you're accessing two hardware chips instead of one.
Some systems support three or four channels, further increasing the maximum bandwidth.

Related

Counting differences between 2 buffers seems too slow

My problem
I have 2 adjacent buffers of bytes of identical size (around 20 MB each). I just want to count the differences between them.
My question
How much time this loop should take to run on a 4.8GHz Intel I7 9700K with 3600MT RAM ?
How do we compute max theoretical speed ?
What I tried
uint64_t compareFunction(const char *const __restrict buffer, const uint64_t commonSize)
{
uint64_t diffFound = 0;
for(uint64_t byte = 0; byte < commonSize; ++byte)
diffFound += static_cast<uint64_t>(buffer[byte] != buffer[byte + commonSize]);
return diffFound;
}
It takes 11ms on my PC (9700K 4.8Ghz RAM 3600 Windows 10 Clang 14.0.6 -O3 MinGW ) and I feel it is too slow and that I am missing something.
40MB should take less than 2ms to be read on the CPU (my RAM bandwidth is between 20 and 30GB/s)
I don't know how to count cycles required to execute one iteration (especially because CPUs are superscalar nowadays). If I assume 1 cycle per operation and if I don't mess up my counting, it should be 10 ops per iteration -> 200 million ops -> at 4.8 Ghz with only one execution unit -> 40ms. Obviously I am wrong on how to compute the number of cycles per loop.
Fun fact: I tried on Linux PopOS GCC 11.2 -O3 and it ran at 4.5ms. Why such a difference?
Here are the dissassemblies vectorised and scalar produced by clang:
compareFunction(char const*, unsigned long): # #compareFunction(char const*, unsigned long)
test rsi, rsi
je .LBB0_1
lea r8, [rdi + rsi]
neg rsi
xor edx, edx
xor eax, eax
.LBB0_4: # =>This Inner Loop Header: Depth=1
movzx r9d, byte ptr [rdi + rdx]
xor ecx, ecx
cmp r9b, byte ptr [r8 + rdx]
setne cl
add rax, rcx
add rdx, 1
mov rcx, rsi
add rcx, rdx
jne .LBB0_4
ret
.LBB0_1:
xor eax, eax
ret
Clang14 O3:
.LCPI0_0:
.quad 1 # 0x1
.quad 1 # 0x1
compareFunction(char const*, unsigned long): # #compareFunction(char const*, unsigned long)
test rsi, rsi
je .LBB0_1
cmp rsi, 4
jae .LBB0_4
xor r9d, r9d
xor eax, eax
jmp .LBB0_11
.LBB0_1:
xor eax, eax
ret
.LBB0_4:
mov r9, rsi
and r9, -4
lea rax, [r9 - 4]
mov r8, rax
shr r8, 2
add r8, 1
test rax, rax
je .LBB0_5
mov rdx, r8
and rdx, -2
lea r10, [rdi + 6]
lea r11, [rdi + rsi]
add r11, 6
pxor xmm0, xmm0
xor eax, eax
pcmpeqd xmm2, xmm2
movdqa xmm3, xmmword ptr [rip + .LCPI0_0] # xmm3 = [1,1]
pxor xmm1, xmm1
.LBB0_7: # =>This Inner Loop Header: Depth=1
movzx ecx, word ptr [r10 + rax - 6]
movd xmm4, ecx
movzx ecx, word ptr [r10 + rax - 4]
movd xmm5, ecx
movzx ecx, word ptr [r11 + rax - 6]
movd xmm6, ecx
pcmpeqb xmm6, xmm4
movzx ecx, word ptr [r11 + rax - 4]
movd xmm7, ecx
pcmpeqb xmm7, xmm5
pxor xmm6, xmm2
punpcklbw xmm6, xmm6 # xmm6 = xmm6[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm4, xmm6, 212 # xmm4 = xmm6[0,1,1,3,4,5,6,7]
pshufd xmm4, xmm4, 212 # xmm4 = xmm4[0,1,1,3]
pand xmm4, xmm3
paddq xmm4, xmm0
pxor xmm7, xmm2
punpcklbw xmm7, xmm7 # xmm7 = xmm7[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm0, xmm7, 212 # xmm0 = xmm7[0,1,1,3,4,5,6,7]
pshufd xmm5, xmm0, 212 # xmm5 = xmm0[0,1,1,3]
pand xmm5, xmm3
paddq xmm5, xmm1
movzx ecx, word ptr [r10 + rax - 2]
movd xmm0, ecx
movzx ecx, word ptr [r10 + rax]
movd xmm1, ecx
movzx ecx, word ptr [r11 + rax - 2]
movd xmm6, ecx
pcmpeqb xmm6, xmm0
movzx ecx, word ptr [r11 + rax]
movd xmm7, ecx
pcmpeqb xmm7, xmm1
pxor xmm6, xmm2
punpcklbw xmm6, xmm6 # xmm6 = xmm6[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm0, xmm6, 212 # xmm0 = xmm6[0,1,1,3,4,5,6,7]
pshufd xmm0, xmm0, 212 # xmm0 = xmm0[0,1,1,3]
pand xmm0, xmm3
paddq xmm0, xmm4
pxor xmm7, xmm2
punpcklbw xmm7, xmm7 # xmm7 = xmm7[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm1, xmm7, 212 # xmm1 = xmm7[0,1,1,3,4,5,6,7]
pshufd xmm1, xmm1, 212 # xmm1 = xmm1[0,1,1,3]
pand xmm1, xmm3
paddq xmm1, xmm5
add rax, 8
add rdx, -2
jne .LBB0_7
test r8b, 1
je .LBB0_10
.LBB0_9:
movzx ecx, word ptr [rdi + rax]
movd xmm2, ecx
movzx ecx, word ptr [rdi + rax + 2]
movd xmm3, ecx
add rax, rsi
movzx ecx, word ptr [rdi + rax]
movd xmm4, ecx
pcmpeqb xmm4, xmm2
movzx eax, word ptr [rdi + rax + 2]
movd xmm2, eax
pcmpeqb xmm2, xmm3
pcmpeqd xmm3, xmm3
pxor xmm4, xmm3
punpcklbw xmm4, xmm4 # xmm4 = xmm4[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm4, xmm4, 212 # xmm4 = xmm4[0,1,1,3,4,5,6,7]
pshufd xmm4, xmm4, 212 # xmm4 = xmm4[0,1,1,3]
movdqa xmm5, xmmword ptr [rip + .LCPI0_0] # xmm5 = [1,1]
pand xmm4, xmm5
paddq xmm0, xmm4
pxor xmm2, xmm3
punpcklbw xmm2, xmm2 # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm2, xmm2, 212 # xmm2 = xmm2[0,1,1,3,4,5,6,7]
pshufd xmm2, xmm2, 212 # xmm2 = xmm2[0,1,1,3]
pand xmm2, xmm5
paddq xmm1, xmm2
.LBB0_10:
paddq xmm0, xmm1
pshufd xmm1, xmm0, 238 # xmm1 = xmm0[2,3,2,3]
paddq xmm1, xmm0
movq rax, xmm1
cmp r9, rsi
je .LBB0_13
.LBB0_11:
lea r8, [r9 + rsi]
sub rsi, r9
add r8, rdi
add rdi, r9
xor edx, edx
.LBB0_12: # =>This Inner Loop Header: Depth=1
movzx r9d, byte ptr [rdi + rdx]
xor ecx, ecx
cmp r9b, byte ptr [r8 + rdx]
setne cl
add rax, rcx
add rdx, 1
cmp rsi, rdx
jne .LBB0_12
.LBB0_13:
ret
.LBB0_5:
pxor xmm0, xmm0
xor eax, eax
pxor xmm1, xmm1
test r8b, 1
jne .LBB0_9
jmp .LBB0_10

TLDR: the reason why the Clang code is so slow comes from a poor vectorization method saturating the port 5 (known to be often an issue). GCC does a better job here, but it is still far from being efficient. One can write a much faster chunk-based code using AVX-2 not saturating the port 5.
Analysis of the unvectorized Clang code
To understand what is going on it is better to start with a simple example. Indeed, as you said, modern processor are superscalar so it is not easy to understand the speed of some generated code on such architecture.
The code generated by Clang using the -O1 optimization flag is a good start. Here is the code of the hot loop produced by GodBold provided in your question:
(instructions) (ports)
.LBB0_4:
movzx r9d, byte ptr [rdi + rdx] p23
xor ecx, ecx p0156
cmp r9b, byte ptr [r8 + rdx] p0156+p23
setne cl p06
add rax, rcx p0156
add rdx, 1 p0156
mov rcx, rsi (optimized)
add rcx, rdx p0156
jne .LBB0_4 p06
Modern processors like the Coffee Lake 9700K are structured in two big parts: a front-end fetching/decoding the instructions (and splitting them into micro-instructions, aka. uops), and a back-end scheduling/executing them. The back-end schedule the uops on many ports and each of them can execute some specific sets of instructions (eg. only memory load, or only arithmetic instruction). For each instruction, I put the ports that can execute them. p0156+p23 means the instruction is split in two uops: the first can be executed by the ports 0 or 1 or 5 or 6, and the second can be executed by the ports 2 or 3. Note that the front-end can somehow optimize the code so not to produce any uops for basic instructions like the mov in the loop (thanks to a mechanism called register renaming).
For each loop iteration, the processor needs to read 2 value from memory. A Coffee Lake processor like the 9700K can load two values per cycle so the loop will at least take 1 cycle/iteration (assuming the loads in r9d and r9b does not conflict due to the use of different part of the same r9 64-bit register). This processor has a uops cache and the loop has a lot of instructions so the decoding part should not be a problem. That being said, there is 9 uops to execute and the processor can only execute 6 of them per cycle so the loop cannot take less than 1.5 cycle/iteration. More precisely, the ports 0, 1, 5 and 6 are under pressure, so even assuming the processor perfectly load balance the uops, 2 cycle/iterations are needed. This is an optimistic lower-bound execution time since the processor may not perfectly schedule the instruction and there are many things that could possibly go wrong (like a sneaky hidden dependency I did not see). With a frequency of 4.8GHz, the final execution time is at least 8.3 ms. It can reach 12.5 ms with 3 cycle/iteration (note that 2.5 cycle/iteration is possible due to the scheduling of uops to ports).
The loop can be improved using unrolling. Indeed, a significant number of instructions are needed just to do the loop and not the actual computation. Unrolling can help to increase the ratio of useful instructions so to make a better usage of available ports. Still, the 2 loads prevent the loop to be faster than 1 cycle/iteration, that is 4.2 ms.
Analysis of the vectorized Clang code
The vectorized code generated by Clang is complex. One could try to apply the same analysis than in the previous code but it would be a tedious task.
One can note that even though the code is vectorized, the loads are not vectorized. This is an issue since only 2 loads can be done per cycle. That being said, loads are performed by pairs two contiguous char values so loads are not so slow compared to the previously generated code.
Clang does that since only two 64-bit values can fit in a 128-bit SSE register and a 64-bit and it needs to do that because diffFound is a 64-bit integer. The 8-bit to 64-bit conversion is the biggest issue in the code because it requires several SSE instructions to do the conversion. Moreover, only 4 integers can be computed at a time since there is 3 SSE integer units on Coffee Lake and each of them can only compute two 64-bit integers at a time. In the end, Clang only put 2 values in each SSE register (and use 4 of them so to compute 8 items per loop iteration) so one should expect a code running more than twice faster (especially due to SSE and the loop unrolling), but this is not much the case due to fewer SSE ports than ALU ports and a more instructions required for the type conversions. Put it shortly, the vectorization is clearly inefficient, but this is not so easy for Clang to generate an efficient code in this case. Still, with 28 SSE instructions and 3 SSE integer units computing 8 items per loop, one should expect the computing part of the code to take about 28/3/8 ~= 1.2 cycle/item which is far from what you can observe (and this is not due to other instruction since they can mostly be executed in parallel as they can mostly be scheduled on other ports).
In fact, the performance issue certainly comes from the saturation of the port 5. Indeed, this port is the only one that can shuffle items of SIMD registers. Thus, the instructions punpcklbw, pshuflw, pshufd and even the movd can only be executed on the port 5. This is a pretty common issue with SIMD codes. This is a big issue since there is 20 instructions per loop and the processor may not even use it perfectly. This means the code should take at least 10.4 ms which is very close to the observed execution time (11 ms).
Analysis of the vectorized GCC code
The code generated by GCC is actually pretty good compared to the one of Clang. Firstly, GCC loads items using SIMD instruction directly which is much more efficient as 16 items are computed per instruction (and by iteration): it only need 2 load uops per iteration reducing the pressure on the port 2 and 3 (1 cycle/iteration for that, so 0.0625 cycle/item). Secondly, GCC only uses 14 punpckhwd instructions while each iteration compute 16 items, reducing critical pressure on the port 5 (0.875 cycle/item for that). Thirdly, the SIMD registers are nearly fully used, at least for the comparison since the pcmpeqb comparison instruction compare 16 items at a time (as opposed to 2 with Clang). The other instructions like paddq are cheap (for example, paddq can be scheduled on the 3 SSE ports) and they should not impact much the execution time. In the end, this version should still be bounded by the port 5, but it should be much faster than the Clang version. Indeed, one should expect the execution time to reach 1 cycle/item (since the port scheduling is certainly not perfect and memory loads may introduce some stalling cycles). This means an execution time of 4.2 ms. This is close to the observed results.
Faster implementation
The GCC implementation is not perfect.
First of all, it does not use AVX2 supported by your processor since the -mavx2 flag is not provided (or any similar flag like -march=native). Indeed, GCC like other mainstream compilers only use SSE2 by default for sake of compatibility with previous architecture: SSE2 is safe to use on all x86-64 processors, but not other instruction sets like SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2. With such flag, GCC should be able to produce a memory bound code.
Moreover, the compiler could theoretically perform a multi-level sum reduction. The idea is to accumulate the result of the comparison in a 8-bit wide SIMD lane using chunks with a size of 1024 items (ie. 64x16 items). This is safe since the value of each lane cannot exceed 64. To avoid overflow, the accumulated values needs to be stored in wider SIMD lanes (eg. 64-bit ones). With this strategy, the overhead of the punpckhwd instructions is 64 time smaller. This is a big improvement since it removes the saturation of the port 5. This strategy should be sufficient to generate a memory-bound code, even using only SSE2. Here is an example of untested code requiring the flag -fopenmp-simd to be efficient.
uint64_t compareFunction(const char *const __restrict buffer, const uint64_t commonSize)
{
uint64_t byteChunk = 0;
uint64_t diffFound = 0;
if(commonSize >= 127)
{
for(; byteChunk < commonSize-127; byteChunk += 128)
{
uint8_t tmpDiffFound = 0;
#pragma omp simd reduction(+:tmpDiffFound)
for(uint64_t byte = byteChunk; byte < byteChunk + 128; ++byte)
tmpDiffFound += buffer[byte] != buffer[byte + commonSize];
diffFound += tmpDiffFound;
}
}
for(uint64_t byte = byteChunk; byte < commonSize; ++byte)
diffFound += buffer[byte] != buffer[byte + commonSize];
return diffFound;
}
Both GCC and Clang generates a rather efficient code (while sub-optimal for data fitting in the cache), especially Clang. Here is for example the code generated by Clang using AVX2:
.LBB0_4:
lea r10, [rdx + 128]
vmovdqu ymm2, ymmword ptr [r9 + rdx - 96]
vmovdqu ymm3, ymmword ptr [r9 + rdx - 64]
vmovdqu ymm4, ymmword ptr [r9 + rdx - 32]
vpcmpeqb ymm2, ymm2, ymmword ptr [rcx + rdx - 96]
vpcmpeqb ymm3, ymm3, ymmword ptr [rcx + rdx - 64]
vpcmpeqb ymm4, ymm4, ymmword ptr [rcx + rdx - 32]
vmovdqu ymm5, ymmword ptr [r9 + rdx]
vpaddb ymm2, ymm4, ymm2
vpcmpeqb ymm4, ymm5, ymmword ptr [rcx + rdx]
vpaddb ymm3, ymm4, ymm3
vpaddb ymm2, ymm3, ymm2
vpaddb ymm2, ymm2, ymm0
vextracti128 xmm3, ymm2, 1
vpaddb xmm2, xmm2, xmm3
vpshufd xmm3, xmm2, 238
vpaddb xmm2, xmm2, xmm3
vpsadbw xmm2, xmm2, xmm1
vpextrb edx, xmm2, 0
add rax, rdx
mov rdx, r10
cmp r10, r8
jb .LBB0_4
All the loads are 256-bit SIMD ones. The number of vpcmpeqb is optimal. The number of vpaddb is relatively good. There are few other instructions, but they should clearly not be a bottleneck. The loop operate on 128 items per iteration and I expect it to takes less than a dozen of cycles per iteration for data already in the cache (otherwise it should be completely memory-bound). This means <0.1 cycle/item, that is, far less than the previous implementation. In fact, the uiCA tool indicates about 0.055 cycle/item, that is 81 GiB/s! One may manually write a better code using SIMD intrinsics, but at the expense of a significantly worse portability, maintenance and readability.
Note that generating a sequential memory-bound does not always mean the RAM throughput will be saturated. In fact, on one core, there is sometimes not enough concurrency to hide the latency of memory operations though it should be fine on your processor (like it is on my i5-9600KF with 2 interleaved 3200 MHz DDR4 memory channels).

Yes, if your data is not hot in cache, even SSE2 should keep up with memory bandwidth. Compare-and-sum of 32 compare results per cycle (from two 32-byte loads) is totally possible if data is hot in L1d cache, or whatever bandwidth outer levels of cache can provide.
If not, the compiler did a bad job. That's unfortunately common for problems like this reducing into a wider variable; compilers don't know good vectorization strategies for summing bytes, especially compare-result bytes that must be 0/-1. They probably widen to 64-bit with pmovsxbq right away (or even worse if SSE4.1 instructions aren't available).
So even -O3 -march=native doesn't help much; this is a big missed-optimization; hopefully GCC and clang will learn how to vectorize this kind of loop at some point, summing compare results probably comes up in enough codebases to be worth recognizing that pattern.
The efficient way is to use psadbw to sum horizontally into qwords. But only after an inner loop does some iterations of vsum -= cmp(p, q), subtracting 0 or -1 to increment a counter or not. 8-bit elements can do 255 iterations of that without risk of overflow. And with unrolling for multiple vector accumulators, that's many vectors of 32 bytes each, so you don't have to break out of that inner loop very often.
See How to count character occurrences using SIMD for manually-vectorized AVX2 code. (And one answer has a Godbolt link to an SSE2 version.) Summing the compare results is the same problem as that, but you're loading two vectors to feed pcmpeqb instead of broadcasting one byte outside the loop to find occurrences of a single char.
An answer there has benchmarks that report 28 GB/s for AVX2, 23 GB/s for SSE2, on an i7-6700 Skylake (at only 3.4GHz, maybe they disabled turbo or are just reporting the rated speed. DRAM speed not mentioned.)
I'd expect 2 input streams of data to achieve about the same sustained bandwidth as one.
This is more interesting to optimize if you benchmark repeated passes over smaller arrays that fit in L2 cache, then efficiency of your ALU instructions matters. (The strategy in the answers on that question are pretty good and well tuned for that case.)
Fast counting the number of equal bytes between two arrays is an older Q&A using a worse strategy, not using psadbw to sum bytes to 64-bit. (But not as bad as GCC/clang, still hsumming as it widens to 32-bit.)
Multiple threads/cores will barely help on a modern desktop, especially at high core clocks like yours. Memory latency is low enough and each core has enough buffers to keep enough requests in flight that it can nearly saturate dual-channel DRAM controllers.
On a big Xeon, that would be very different; you need most of the cores to achieve peak aggregate bandwidth, even for just memcpy or memset so there's zero ALU work, just loads/stores. The higher latency means a single core has much less memory bandwidth available than on a desktop (even in an absolute sense, let alone as a percentage of 6 channels instead of 2). See also Enhanced REP MOVSB for memcpy and Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?
Portable source that compiles to less-bad asm, micro-optimized from Jérôme's: 5.5 cycles per 4x 32-byte vectors, down from 7 or 8, assuming L1d cache hits.
Still not good (as it reduces to scalar every 128 bytes, or 192 if you want to try that), but
#Jérôme Richard came up with a clever way to give clang something it could vectorize a short with a good strategy, with a uint8_t sum, using that as an inner loop short enough to not overflow.
But clang still does some dumb things with that loop, as we can see in his answer. I modified the loop control to use a pointer increment, which reduces the loop overhead a bit, just one pointer-add and compare/jcc, not LEA/MOV. I don't know why clang was doing it inefficiently using integer indexing.
And it avoids an indexed addressing mode for the vpcmpeqb memory source operands, letting them stay micro-fused on Intel CPUs. (Clang doesn't seem to know that this matters at all! Reversing operands to != in the source was enough to make it use indexed addressing modes for vpcmpeqb instead of for vmovdqu pure loads.)
// micro-optimized version of Jérôme's function, clang compiles this better
// instead of 2 arrays, it compares first and 2nd half of one array, which lets it index one relative to the other with an offset if we hand-hold clang into doing that.
uint64_t compareFunction_sink_fixup(const char *const __restrict buffer, const size_t commonSize)
{
uint64_t byteChunk = 0;
uint64_t diffFound = 0;
const char *endp = buffer + commonSize;
const char *__restrict ptr = buffer;
if(commonSize >= 127) {
// A signed type for commonSize wouldn't avoid UB in pointer subtraction creating a pointer before the object
// in practice it would be fine except maybe when inlining into a function where the compiler could see a compile-time-constant array size.
for(; ptr < endp-127 ; ptr += 128)
{
uint8_t tmpDiffFound = 0;
#pragma omp simd reduction(+:tmpDiffFound)
for(int off = 0 ; off < 128; ++off)
tmpDiffFound += ptr[off + commonSize] != ptr[off];
// without AVX-512, we get -1 for ==, 0 for not-equal. So clang adds set1_epi(4) to each bucket that holds the sum of four 0 / -1 elements
diffFound += tmpDiffFound;
}
}
// clang still auto-vectorizes, but knows the max trip count is only 127
// so doesn't unroll, just 4 bytes per iter.
for(int byte = 0 ; byte < commonSize % 128 ; ++byte)
diffFound += ptr[byte] != ptr[byte + commonSize];
return diffFound;
}
Godbolt with clang15 -O3 -fopenmp-simd -mavx2 -march=skylake -mbranches-within-32B-boundaries
# The main loop, from clang 15 for x86-64 Skylake
.LBB0_4: # =>This Inner Loop Header: Depth=1
vmovdqu ymm2, ymmword ptr [rdi + rsi]
vmovdqu ymm3, ymmword ptr [rdi + rsi + 32] # Indexed addressing modes are fine here
vmovdqu ymm4, ymmword ptr [rdi + rsi + 64]
vmovdqu ymm5, ymmword ptr [rdi + rsi + 96]
vpcmpeqb ymm2, ymm2, ymmword ptr [rdi] # non-indexed allow micro-fusion without un-lamination
vpcmpeqb ymm3, ymm3, ymmword ptr [rdi + 32]
vpcmpeqb ymm4, ymm4, ymmword ptr [rdi + 64]
vpaddb ymm2, ymm4, ymm2
vpcmpeqb ymm4, ymm5, ymmword ptr [rdi + 96]
vpaddb ymm3, ymm4, ymm3
vpaddb ymm2, ymm2, ymm3
vpaddb ymm2, ymm2, ymm0 # add a vector of set1_epi8(4) to turn sums of 0 / -1 into sums of 1 / 0
vextracti128 xmm3, ymm2, 1
vpaddb xmm2, xmm2, xmm3
vpshufd xmm3, xmm2, 238 # xmm3 = xmm2[2,3,2,3]
vpaddb xmm2, xmm2, xmm3 # reduced to 8 bytes
vpsadbw xmm2, xmm2, xmm1 # hsum to one qword
vpextrb edx, xmm2, 0 # extract and zero-extend
add rax, rdx # accumulate the chunk sum
sub rdi, -128 # pointer increment (with a sign_extended_imm8 instead of +imm32)
cmp rdi, rcx
jb .LBB0_4 # }while(p < endp)
This could use 192 instead of 128 to further amortize the loop overhead, at the cost of needing to do %192 (not a power of 2), and making the cleanup loop worst case be 191 bytes. We can't go to 256, or anything higher than UINT8_MAX (255), and sticking to multiples of 32 is necessary. Or 64 for good measure.
There's an extra vpaddb of a fixup constant, set1_epi8(4), which turns the sum of four 0 / -1 into a sum of four 1 / 0 results from the C != operator.
I don't think there's any way to get rid of it or sink it out of the loop while still accumulating into a uint8_t, which is necessary for clang to vectorize this way. It doesn't know how to use vpsadbw to do a widening (non-truncating) sum of bytes, which is ironic because that's what it actually does when used against an all-zero register. If you do something like sum += ptr[off + commonSize] == ptr[off] ? -1 : 0 you can get it to use the vpcmpeqb result directly, summing 4 vectors down to one with 3 adds, and eventually feeding that to vpsadbw after some reduction steps. So you get a sum of matches * 0xFF truncated to uint8_t for each block of 128 bytes. Or as an int8_t, that's a sum of -1 * matches, so 0..-128, which doesn't overflow a signed byte. So that's interesting. But adding with zero-extension into a 64-bit counter might destroy information, and sign-extension inside the outer loop would cost another instruction. It would be a scalar movsx instruction instead of vpaddb, but that's not important for Skylake, probably only if using AVX-512 with 512-bit vectors (which clang and GCC both do badly, not using masked adds). Can we do 128*n_chunks - count after the loop to recover the differences from the sum of matches? No, I don't think so.
uiCA static analysis predicts Skylake (such as your CPU) will run the main loop at 5.51 cycles / iter (4 vectors) if data is hot in L1d cache, or 5.05 on Ice Lake / Rocket Lake. (I had to hand-tweak the asm to emulate the padding effect -mbranches-within-32B-boundaries would have, for uiCA's default assumption of where the top of the loop is relative to a 32-byte alignment boundary. I could have just changed that setting in uiCA instead. :/)
The only missed micro-optimization in implementing this sub-optimal strategy is that it's using vpextrb (because it doesn't prove that truncation to uint8_t isn't needed?) instead of vmovd or vmovq. So it costs an extra uop for the front-end, and for port 5 in the back end. With that optimized (comment + uncomment in the link), 5.25c / iter on Skylake, or 4.81 on Ice Lake, pretty close to the 2 load/clock bottleneck.
(Doing 6 vectors per iter, 192 bytes, predicts 7 cycles per iter on SKL, or 1.166 per vector, down from 5.5 / iter = 1.375 per vector. Or about 6.5 on ICL/RKL = 1.08 c/vec, hitting back-end ALU port bottlecks.)
This is not bad for something we were able to coax clang into generating from portable C++ source, vs. 4 cycles per 4 vectors of 32 byte-compares each for efficient manual vectorization. This will very likely keep up with memory or cache bandwidth even from L2 cache, so it's pretty usable, and not much slower with data hot in L1d. Taking a few more uops does hurt out-of-order exec, and uses up more execution resources that another logical core sharing a physical core could use. (Hyperthreading).
Unfortunately gcc/clang do not make good use of AVX-512 for this. If you were using 512-bit vectors (or AVX-512 features on 256-bit vectors), you'd compare into mask registers, then do something like vpaddb zmm0{k1}, zmm0, zmm1 merge-masking to conditionally increment a vector, where zmm1 = set1_epi8( 1 ). (Or a -1 constant with sub.) Instruction and uop count per vector should be about the same as AVX2 if done properly, but gcc/clang use about twice as many, so the only saving is in the reduction to scalar which seems to be the price for getting anything at all usable.
This version also avoids unrolling of the clean-up loop, just vectorizing with its dumb 4 bytes per iter strategy, which is about right for cleanup of size%128 bytes. It's pretty silly that it uses both vpxor to flip and vpand to turn 0xff into 0x01, when it could have used vpandn to do both those things in one instruction. That would get that cleanup loop down to 8 uops, just twice the pipeline width on Haswell / Skylake, so it would issue more efficiently from the loop buffer, except Skylake disabled that in microcode updates. It would help a bit on Haswell

Correct me if I am wrong but the answer seems to be
-march=native for the win.
the scalar version of the code was CPU bottlenecked and not RAM bottlenecked
use uica.uops.info to have an estimate of the cycles per loop
I will try to write my own AVX code to compare.
Details
After an afternoon tinkering around with the suggestions, here is what I found with clang:
-O1 around 10ms, scalar code
-O3 enables SSE2 and is as slow as O1, maybe poor assembly code
-O3 -march=westmere enables also SSE2 but is faster (7ms)
-O3 -march=native enables AVX -> 2.5ms and we are probably RAM bandwidth limited (close to the theoretical speed)
The scalar 10ms makes sense now because according to that awesome tool uica.uops.info it takes
2.35 cycles per loop
47 million cycles for the whole comparison (20 million iterations)
Processor is clocked at 4.8GHz meaning it should take around 9.8ms and it is close to what is measured.
g++ seems to generate better default code when no flags are added
O1 11ms
O2 scalar still but 9ms
O3 SSE 4.5ms
O3 -march=westmere 7ms like clang
O3 -march=native 3.4ms, slightly slower than clang

32-byte aligned routine does not fit the uops cache

KbL i7-8550U
I'm researching the behavior of uops-cache and came across a misunderstanding regarding it.
As specified in the Intel Optimization Manual 2.5.2.2 (emp. mine):
The Decoded ICache consists of 32 sets. Each set contains eight Ways.
Each Way can hold up to six micro-ops.
-
All micro-ops in a Way represent instructions which are statically
contiguous in the code and have their EIPs within the same aligned
32-byte region.
-
Up to three Ways may be dedicated to the same 32-byte aligned chunk,
allowing a total of 18 micro-ops to be cached per 32-byte region of
the original IA program.
-
A non-conditional branch is the last micro-op in a Way.
CASE 1:
Consider the following routine:
uop.h
void inhibit_uops_cache(size_t);
uop.S
align 32
inhibit_uops_cache:
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
jmp decrement_jmp_tgt
decrement_jmp_tgt:
dec rdi
ja inhibit_uops_cache ;ja is intentional to avoid Macro-fusion
ret
To make sure that the code of the routine is actually 32-bytes aligned here is the asm
0x555555554820 <inhibit_uops_cache> mov edx,esi
0x555555554822 <inhibit_uops_cache+2> mov edx,esi
0x555555554824 <inhibit_uops_cache+4> mov edx,esi
0x555555554826 <inhibit_uops_cache+6> mov edx,esi
0x555555554828 <inhibit_uops_cache+8> mov edx,esi
0x55555555482a <inhibit_uops_cache+10> mov edx,esi
0x55555555482c <inhibit_uops_cache+12> jmp 0x55555555482e <decrement_jmp_tgt>
0x55555555482e <decrement_jmp_tgt> dec rdi
0x555555554831 <decrement_jmp_tgt+3> ja 0x555555554820 <inhibit_uops_cache>
0x555555554833 <decrement_jmp_tgt+5> ret
0x555555554834 <decrement_jmp_tgt+6> nop
0x555555554835 <decrement_jmp_tgt+7> nop
0x555555554836 <decrement_jmp_tgt+8> nop
0x555555554837 <decrement_jmp_tgt+9> nop
0x555555554838 <decrement_jmp_tgt+10> nop
0x555555554839 <decrement_jmp_tgt+11> nop
0x55555555483a <decrement_jmp_tgt+12> nop
0x55555555483b <decrement_jmp_tgt+13> nop
0x55555555483c <decrement_jmp_tgt+14> nop
0x55555555483d <decrement_jmp_tgt+15> nop
0x55555555483e <decrement_jmp_tgt+16> nop
0x55555555483f <decrement_jmp_tgt+17> nop
running as
int main(void){
inhibit_uops_cache(4096 * 4096 * 128L);
}
I got the counters
Performance counter stats for './bin':
6 431 201 748 idq.dsb_cycles (56,91%)
19 175 741 518 idq.dsb_uops (57,13%)
7 866 687 idq.mite_uops (57,36%)
3 954 421 idq.ms_uops (57,46%)
560 459 dsb2mite_switches.penalty_cycles (57,28%)
884 486 frontend_retired.dsb_miss (57,05%)
6 782 598 787 cycles (56,82%)
1,749000366 seconds time elapsed
1,748985000 seconds user
0,000000000 seconds sys
This is exactly what I expected to get.
The vast majority of uops came from uops cache. Also uops number perfectly matches with my expectation
mov edx, esi - 1 uop;
jmp imm - 1 uop; near
dec rdi - 1 uop;
ja - 1 uop; near
4096 * 4096 * 128 * 9 = 19 327 352 832 approximately equal to the counters 19 326 755 442 + 3 836 395 + 1 642 975
CASE 2:
Consider the implementation of inhibit_uops_cache which is different by one instruction commented out:
align 32
inhibit_uops_cache:
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
; mov edx, esi
jmp decrement_jmp_tgt
decrement_jmp_tgt:
dec rdi
ja inhibit_uops_cache ;ja is intentional to avoid Macro-fusion
ret
disas:
0x555555554820 <inhibit_uops_cache> mov edx,esi
0x555555554822 <inhibit_uops_cache+2> mov edx,esi
0x555555554824 <inhibit_uops_cache+4> mov edx,esi
0x555555554826 <inhibit_uops_cache+6> mov edx,esi
0x555555554828 <inhibit_uops_cache+8> mov edx,esi
0x55555555482a <inhibit_uops_cache+10> jmp 0x55555555482c <decrement_jmp_tgt>
0x55555555482c <decrement_jmp_tgt> dec rdi
0x55555555482f <decrement_jmp_tgt+3> ja 0x555555554820 <inhibit_uops_cache>
0x555555554831 <decrement_jmp_tgt+5> ret
0x555555554832 <decrement_jmp_tgt+6> nop
0x555555554833 <decrement_jmp_tgt+7> nop
0x555555554834 <decrement_jmp_tgt+8> nop
0x555555554835 <decrement_jmp_tgt+9> nop
0x555555554836 <decrement_jmp_tgt+10> nop
0x555555554837 <decrement_jmp_tgt+11> nop
0x555555554838 <decrement_jmp_tgt+12> nop
0x555555554839 <decrement_jmp_tgt+13> nop
0x55555555483a <decrement_jmp_tgt+14> nop
0x55555555483b <decrement_jmp_tgt+15> nop
0x55555555483c <decrement_jmp_tgt+16> nop
0x55555555483d <decrement_jmp_tgt+17> nop
0x55555555483e <decrement_jmp_tgt+18> nop
0x55555555483f <decrement_jmp_tgt+19> nop
running as
int main(void){
inhibit_uops_cache(4096 * 4096 * 128L);
}
I got the counters
Performance counter stats for './bin':
2 464 970 970 idq.dsb_cycles (56,93%)
6 197 024 207 idq.dsb_uops (57,01%)
10 845 763 859 idq.mite_uops (57,19%)
3 022 089 idq.ms_uops (57,38%)
321 614 dsb2mite_switches.penalty_cycles (57,35%)
1 733 465 236 frontend_retired.dsb_miss (57,16%)
8 405 643 642 cycles (56,97%)
2,117538141 seconds time elapsed
2,117511000 seconds user
0,000000000 seconds sys
The counters are completely unexpected.
I expected all the uops come from dsb as before since the routine matches the requirements of uops cache.
By contrast, almost 70% of uops came from Legacy Decode Pipeline.
QUESTION: What's wrong with the CASE 2? What counters to look at to understand what's going on?
UPD: Following #PeterCordes idea I checked the 32-byte alignment of the unconditional branch target decrement_jmp_tgt. Here is the result:
CASE 3:
Aligning onconditional jump target to 32 byte as follows
align 32
inhibit_uops_cache:
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
mov edx, esi
; mov edx, esi
jmp decrement_jmp_tgt
align 32 ; align 16 does not change anything
decrement_jmp_tgt:
dec rdi
ja inhibit_uops_cache
ret
disas:
0x555555554820 <inhibit_uops_cache> mov edx,esi
0x555555554822 <inhibit_uops_cache+2> mov edx,esi
0x555555554824 <inhibit_uops_cache+4> mov edx,esi
0x555555554826 <inhibit_uops_cache+6> mov edx,esi
0x555555554828 <inhibit_uops_cache+8> mov edx,esi
0x55555555482a <inhibit_uops_cache+10> jmp 0x555555554840 <decrement_jmp_tgt>
#nops to meet the alignment
0x555555554840 <decrement_jmp_tgt> dec rdi
0x555555554843 <decrement_jmp_tgt+3> ja 0x555555554820 <inhibit_uops_cache>
0x555555554845 <decrement_jmp_tgt+5> ret
and running as
int main(void){
inhibit_uops_cache(4096 * 4096 * 128L);
}
I got the following counters
Performance counter stats for './bin':
4 296 298 295 idq.dsb_cycles (57,19%)
17 145 751 147 idq.dsb_uops (57,32%)
45 834 799 idq.mite_uops (57,32%)
1 896 769 idq.ms_uops (57,32%)
136 865 dsb2mite_switches.penalty_cycles (57,04%)
161 314 frontend_retired.dsb_miss (56,90%)
4 319 137 397 cycles (56,91%)
1,096792233 seconds time elapsed
1,096759000 seconds user
0,000000000 seconds sys
The result is perfectly expected. More then 99% of the uops came from dsb.
Avg dsb uops delivery rate = 17 145 751 147 / 4 296 298 295 = 3.99
Which is close to the peak bandwith.

This is not the answer to the OP's problem, but is one to watch out for
See Code alignment dramatically affects performance for compiler options to work around this performance pothole Intel introduced into Skylake-derived CPUs, as part of this workaround.
Other observations: the block of 6 mov instructions should fill a uop cache line, with jmp in a line by itself. In case 2, the 5 mov + jmp should fit in one cache line (or more properly "way").
(Posting this for the benefit of future readers who might have the same symptoms but a different cause. I realized right as I finished writing it that 0x...30 is not a 32-byte boundary, only 0x...20 and 40, so this erratum shouldn't be the problem for the code in the question.)
A recent (late 2019) microcode update introduced a new performance pothole. It works around Intel's JCC erratum on Skylake-derived microarchitectures. (KBL142 on your Kaby-Lake specifically).
Microcode Update (MCU) to Mitigate JCC Erratum
This erratum can be prevented by a microcode update (MCU). The MCU prevents
jump instructions from being cached in the Decoded ICache when the jump
instructions cross a 32-byte boundary or when they end on a 32-byte boundary. In
this context, Jump Instructions include all jump types: conditional jump (Jcc), macrofused op-Jcc (where op is one of cmp, test, add, sub, and, inc, or dec), direct
unconditional jump, indirect jump, direct/indirect call, and return.
Intel's whitepaper also includes a diagram of cases that trigger this non-uop-cacheable effect. (PDF screenshot borrowed from a Phoronix article with benchmarks before/after, and after with rebuilding with some workarounds in GCC/GAS that try to avoid this new performance pitfall).
The last byte of the ja in your code is ...30, so it's the culprit.
If this was a 32-byte boundary, not just 16, then we'd have the problem here:
0x55555555482a <inhibit_uops_cache+10> jmp # fine
0x55555555482c <decrement_jmp_tgt> dec rdi
0x55555555482f <decrement_jmp_tgt+3> ja # spans 16B boundary (not 32)
0x555555554831 <decrement_jmp_tgt+5> ret # fine
This section not fully updated, still talking about spanning a 32B boundary
JA itself spans a boundary.
Inserting a NOP after dec rdi should work, putting the 2-byte ja fully after the boundary with a new 32-byte chunk. Macro-fusion of dec/ja wasn't possible anyway because JA reads CF (and ZF) but DEC doesn't write CF.
Using sub rdi, 1 to move the JA would not work; it would macro-fuse, and the combined 6 bytes of x86 code corresponding to that instruction would still span the boundary.
You could use single-byte nops instead of mov before the jmp to move everything earlier, if that gets it all in before the last byte of a block.
ASLR can change what virtual page code executes from (bit 12 and higher of the address), but not the alignment within a page or relative to a cache line. So what we see in disassembly in one case will happen every time.

OBSERVATION 1: A branch with a target within the same 32-byte region which is predicted to be taken behaves much like the unconditional branch from the uops cache standpoint (i.e. it should be the last uop in the line).
Consider the following implementation of inhibit_uops_cache:
align 32
inhibit_uops_cache:
xor eax, eax
jmp t1 ;jz, jp, jbe, jge, jle, jnb, jnc, jng, jnl, jno, jns, jae
t1:
jmp t2 ;jz, jp, jbe, jge, jle, jnb, jnc, jng, jnl, jno, jns, jae
t2:
jmp t3 ;jz, jp, jbe, jge, jle, jnb, jnc, jng, jnl, jno, jns, jae
t3:
dec rdi
ja inhibit_uops_cache
ret
The code is tested for all the branches mentioned in the comment. The difference turned out to be very insignificant, so I provide for only 2 of them:
jmp:
Performance counter stats for './bin':
4 748 772 552 idq.dsb_cycles (57,13%)
7 499 524 594 idq.dsb_uops (57,18%)
5 397 128 360 idq.mite_uops (57,18%)
8 696 719 idq.ms_uops (57,18%)
6 247 749 210 dsb2mite_switches.penalty_cycles (57,14%)
3 841 902 993 frontend_retired.dsb_miss (57,10%)
21 508 686 982 cycles (57,10%)
5,464493212 seconds time elapsed
5,464369000 seconds user
0,000000000 seconds sys
jge:
Performance counter stats for './bin':
4 745 825 810 idq.dsb_cycles (57,13%)
7 494 052 019 idq.dsb_uops (57,13%)
5 399 327 121 idq.mite_uops (57,13%)
9 308 081 idq.ms_uops (57,13%)
6 243 915 955 dsb2mite_switches.penalty_cycles (57,16%)
3 842 842 590 frontend_retired.dsb_miss (57,16%)
21 507 525 469 cycles (57,16%)
5,486589670 seconds time elapsed
5,486481000 seconds user
0,000000000 seconds sys
IDK why the number of dsb uops is 7 494 052 019, which is significantly lesser then 4096 * 4096 * 128 * 4 = 8 589 934 592.
Replacing any of the jmp with a branch that is predicted not to be taken yields a result which is significantly different. For example:
align 32
inhibit_uops_cache:
xor eax, eax
jnz t1 ; perfectly predicted to not be taken
t1:
jae t2
t2:
jae t3
t3:
dec rdi
ja inhibit_uops_cache
ret
results in the following counters:
Performance counter stats for './bin':
5 420 107 670 idq.dsb_cycles (56,96%)
10 551 728 155 idq.dsb_uops (57,02%)
2 326 542 570 idq.mite_uops (57,16%)
6 209 728 idq.ms_uops (57,29%)
787 866 654 dsb2mite_switches.penalty_cycles (57,33%)
1 031 630 646 frontend_retired.dsb_miss (57,19%)
11 381 874 966 cycles (57,05%)
2,927769205 seconds time elapsed
2,927683000 seconds user
0,000000000 seconds sys
Considering another example which is similar to the CASE 1:
align 32
inhibit_uops_cache:
nop
nop
nop
nop
nop
xor eax, eax
jmp t1
t1:
dec rdi
ja inhibit_uops_cache
ret
results in
Performance counter stats for './bin':
6 331 388 209 idq.dsb_cycles (57,05%)
19 052 030 183 idq.dsb_uops (57,05%)
343 629 667 idq.mite_uops (57,05%)
2 804 560 idq.ms_uops (57,13%)
367 020 dsb2mite_switches.penalty_cycles (57,27%)
55 220 850 frontend_retired.dsb_miss (57,27%)
7 063 498 379 cycles (57,19%)
1,788124756 seconds time elapsed
1,788101000 seconds user
0,000000000 seconds sys
jz:
Performance counter stats for './bin':
6 347 433 290 idq.dsb_cycles (57,07%)
18 959 366 600 idq.dsb_uops (57,07%)
389 514 665 idq.mite_uops (57,07%)
3 202 379 idq.ms_uops (57,12%)
423 720 dsb2mite_switches.penalty_cycles (57,24%)
69 486 934 frontend_retired.dsb_miss (57,24%)
7 063 060 791 cycles (57,19%)
1,789012978 seconds time elapsed
1,788985000 seconds user
0,000000000 seconds sys
jno:
Performance counter stats for './bin':
6 417 056 199 idq.dsb_cycles (57,02%)
19 113 550 928 idq.dsb_uops (57,02%)
329 353 039 idq.mite_uops (57,02%)
4 383 952 idq.ms_uops (57,13%)
414 037 dsb2mite_switches.penalty_cycles (57,30%)
79 592 371 frontend_retired.dsb_miss (57,30%)
7 044 945 047 cycles (57,20%)
1,787111485 seconds time elapsed
1,787049000 seconds user
0,000000000 seconds sys
All these experiments made me think that the observation corresponds to the real behavior of the uops cache. I also ran another experiments and judging by the counters br_inst_retired.near_taken and br_inst_retired.not_taken the result correlates with the observation.
Consider the following implementation of inhibit_uops_cache:
align 32
inhibit_uops_cache:
t0:
;nops 0-9
jmp t1
t1:
;nop 0-6
dec rdi
ja t0
ret
Collecting dsb2mite_switches.penalty_cycles and frontend_retired.dsb_miss we have:
The X-axis of the plot stands for the number of nops, e.g. 24 means 2 nops after the t1 label, 4 nops after the t0 label:
align 32
inhibit_uops_cache:
t0:
nop
nop
nop
nop
jmp t1
t1:
nop
nop
dec rdi
ja t0
ret
Judging by the plots I came to the
OBSERVATION 2: In case there are 2 branches within a 32-byte region that are predicted to be taken there is no observable correlation between dsb2mite switches and dsb misses. So the dsb misses may occur independently from the dsb2mite switches.
Increasing frontend_retired.dsb_miss rate correlate well with the increasing idq.mite_uops rate and decreasing idq.dsb_uops. This can be seen on the following plot:
OBSERVATION 3: The dsb misses occurring for some (unclear?) reason causes IDQ read bubbles and therefore RAT underflow.
Conclusion: Taking all the measurements into account there are definitely some differences between the behavior defined in the Intel Optimization Manual, 2.5.2.2 Decoded ICache

Vmovntpd instruction on Intel Xeon Platinum 8168 CPU

I have a simple vector-vector addition algorithm implementation in assembly. It uses AVX to read 4 doubles from the A vector, and 4 doubles from B vector. The algorithm adds these numbers and writes the result back to the C vector. If I use vmovntpd to write back the result, the performance becames extremely random. I have made this test on an azure server, with Intel Xeon Platinum 8168 CPU. If I run this test on my laptop (Intel Core i7-2640M CPU), this random effect disappears. What is the problem on the server? One more info: The server has 44 CPU-s.
[Edit]
Here is my code:
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Dense to dense
;; Without cache (for storing the result)
;; AVX-512
;; Without tolerances
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
global _denseToDenseAddAVX512_nocache_64_linux
_denseToDenseAddAVX512_nocache_64_linux:
push rbp
mov rbp, rsp
; c = a + lambda * b
; rdi: address1
; rsi: address2
; rdx: address3
; rcx: count
; xmm0: lambda
mov rax, rcx
shr rcx, 4
and rax, 0x0F
vzeroupper
vmovupd zmm5, [abs_mask]
sub rsp, 8
movlpd [rbp - 8], xmm0
vbroadcastsd zmm7, [rbp - 8]
vmovapd zmm6, zmm7
cmp rcx, 0
je after_loop_denseToDenseAddAVX512_nocache_64_linux
start_denseToDenseAddAVX512_nocache_64_linux:
vmovapd zmm0, [rdi] ; a
vmovapd zmm1, zmm7
vmulpd zmm1, zmm1, [rsi] ; b
vaddpd zmm0, zmm0, zmm1 ; zmm0 = c = a + b
vmovntpd [rdx], zmm0
vmovapd zmm2, [rdi + 64] ; a
vmovapd zmm3, zmm6
vmulpd zmm3, zmm3, [rsi + 64] ; b
vaddpd zmm2, zmm2, zmm3 ; zmm2 = c = a + b
vmovntpd [rdx + 64], zmm2
add rdi, 128
add rsi, 128
add rdx, 128
loop start_denseToDenseAddAVX512_nocache_64_linux
after_loop_denseToDenseAddAVX512_nocache_64_linux:
cmp rax, 0
je end_denseToDenseAddAVX512_nocache_64_linux
mov rcx, rax
last_loop_denseToDenseAddAVX512_nocache_64_linux:
movlpd xmm0, [rdi] ; a
movapd xmm1, xmm7
mulsd xmm1, [rsi] ; b
addsd xmm0, xmm1 ; xmm0 = c = a + b
movlpd [rdx], xmm0
add rdi, 8
add rsi, 8
add rdx, 8
loop last_loop_denseToDenseAddAVX512_nocache_64_linux
end_denseToDenseAddAVX512_nocache_64_linux:
mov rsp, rbp
pop rbp
ret

Okay, I've found the solution! This is a NUMA architecture with 44 CPUs, so I disabled the NUMA, and I've limited the number of online cpu-s to 1 with the following kernel parameters: numa=off maxcpus=1 nr_cpus=1.

libsvm compiled with AVX vs no AVX

I compiled a libsvm benchmarking app which does svm_predict() 100 times on the same image using the same model. The libsvm is compiled statically (MSVC 2017) by directly including svm.cpp and svm.h in my project.
EDIT: adding benchmark details
for (int i = 0; i < counter; i++)
{
std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
double label = svm_predict(model, input);
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
total_time += duration;
std::cout << "\n\n\n" << sum << " label:" << label << " duration:" << duration << "\n\n\n";
}
This is the loop that I benchmark without any major modifications to the libsvm code.
After 100 runs the average of one run is 4.7 ms with no difference if I use or not AVX instructions. To make sure the compiler generates the correct instructions I used Intel Software Development Emulator to check the instructions mix
with AVX:
*isa-ext-AVX 36578280
*isa-ext-SSE 4
*isa-ext-SSE2 4
*isa-set-SSE 4
*isa-set-SSE2 4
*scalar-simd 36568174
*sse-scalar 4
*sse-packed 4
*avx-scalar 36568170
*avx128 8363
*avx256 1765
The other part
without AVX:
*isa-ext-SSE 11781
*isa-ext-SSE2 36574119
*isa-set-SSE 11781
*isa-set-SSE2 36574119
*scalar-simd 36564559
*sse-scalar 36564559
*sse-packed 21341
I would expect to get some performance improvment I know that avx128/256/512 are not used that much but still. I have a i7-8550U CPU, do you think that if run the same test on a skylake i9 X series I would see a bigger difference ?
EDIT
I added the instruction mix for each binary
With AVX:
ADD 16868725
AND 49
BT 6
CALL_NEAR 14032515
CDQ 4
CDQE 3601
CMOVLE 6
CMOVNZ 2
CMOVO 12
CMOVZ 6
CMP 25417120
CMPXCHG_LOCK 1
CPUID 3
CQO 12
DEC 68
DIV 1
IDIV 12
IMUL 3621
INC 8496372
JB 325
JBE 5
JL 7101
JLE 38338
JMP 8416984
JNB 6
JNBE 3
JNL 806
JNLE 61
JNS 1
JNZ 22568320
JS 2
JZ 8465164
LEA 16829868
MOV 42209230
MOVSD_XMM 4
MOVSXD 1141
MOVUPS 4
MOVZX 3684
MUL 12
NEG 72
NOP 4219
NOT 1
OR 14
POP 1869
PUSH 1870
REP_STOSD 6
RET_NEAR 1758
ROL 5
ROR 10
SAR 8
SBB 5
SETNZ 4
SETZ 26
SHL 1626
SHR 519
SUB 6530
TEST 5616533
VADDPD 594
VADDSD 8445597
VCOMISD 3
VCVTSI2SD 3603
VEXTRACTF128 6
VFMADD132SD 12
VFMADD231SD 6
VHADDPD 6
VMOVAPD 12
VMOVAPS 2375
VMOVDQU 1
VMOVSD 11256384
VMOVUPD 582
VMULPD 582
VMULSD 8451540
VPXOR 1
VSUBSD 8407425
VUCOMISD 3600
VXORPD 2362
VXORPS 3603
VZEROUPPER 4
XCHG 8
XGETBV 1
XOR 8414763
*total 213991340
Part2
No AVX:
ADD 16869910
ADDPD 1176
ADDSD 8445609
AND 49
BT 6
CALL_NEAR 14032515
CDQ 4
CDQE 3601
CMOVLE 6
CMOVNZ 2
CMOVO 12
CMOVZ 6
CMP 25417408
CMPXCHG_LOCK 1
COMISD 3
CPUID 3
CQO 12
CVTDQ2PD 3603
DEC 68
DIV 1
IDIV 12
IMUL 3621
INC 8496369
JB 325
JBE 5
JL 7392
JLE 38338
JMP 8416984
JNB 6
JNBE 3
JNL 803
JNLE 61
JNS 1
JNZ 22568317
JS 2
JZ 8465164
LEA 16829548
MOV 42209235
MOVAPS 7073
MOVD 3603
MOVDQU 2
MOVSD_XMM 11256376
MOVSXD 1141
MOVUPS 2344
MOVZX 3684
MUL 12
MULPD 1170
MULSD 8451546
NEG 72
NOP 4159
NOT 1
OR 14
POP 1865
PUSH 1866
REP_STOSD 6
RET_NEAR 1758
ROL 5
ROR 10
SAR 8
SBB 5
SETNZ 4
SETZ 26
SHL 1626
SHR 516
SUB 6515
SUBSD 8407425
TEST 5616533
UCOMISD 3600
UNPCKHPD 6
XCHG 8
XGETBV 1
XOR 8414745
XORPS 2364
*total 214000270

Almost all arithmetic instructions you are listing work on scalars e.g., (V)SUBSD means SUBstract Scalar Double. The V in front essentially just means that AVX encoding is used (this also clears the upper half of the register, which the SSE instructions don't do). But given the instructions you listed, there should be barely any runtime difference.
Modern x86 uses SSE1/2 or AVX for scalar FP math, using just the low element of XMM vector registers. It's somewhat better than x87 (more registers, and flat register set), but it's still only one result per instruction.
There are a few thousand packed SIMD instructions, vs. ~36 million scalar instructions, so only a relatively unimportant part of the code got auto-vectorized and could benefit from 256-bit vectors.

Why is gcc -O3 auto-vectorizing factorial? That many extra instructions looks worse

Here's a very simple factorial function.
int factorial(int num) {
if (num == 0)
return 1;
return num*factorial(num-1);
}
GCC's assembly for this function on -O2 is reasonable.
factorial(int):
mov eax, 1
test edi, edi
je .L1
.L2:
imul eax, edi
sub edi, 1
jne .L2
.L1:
ret
However, on -O3 or -Ofast, it decides to make things way more complicated (almost 100 lines!):
factorial(int):
test edi, edi
je .L28
lea edx, [rdi-1]
mov ecx, edi
cmp edx, 6
jbe .L8
mov DWORD PTR [rsp-12], edi
movd xmm5, DWORD PTR [rsp-12]
mov edx, edi
xor eax, eax
movdqa xmm0, XMMWORD PTR .LC0[rip]
movdqa xmm4, XMMWORD PTR .LC2[rip]
shr edx, 2
pshufd xmm2, xmm5, 0
paddd xmm2, XMMWORD PTR .LC1[rip]
.L5:
movdqa xmm3, xmm2
movdqa xmm1, xmm2
paddd xmm2, xmm4
add eax, 1
pmuludq xmm3, xmm0
psrlq xmm1, 32
psrlq xmm0, 32
pmuludq xmm1, xmm0
pshufd xmm0, xmm3, 8
pshufd xmm1, xmm1, 8
punpckldq xmm0, xmm1
cmp eax, edx
jne .L5
movdqa xmm2, xmm0
movdqa xmm1, xmm0
mov edx, edi
psrldq xmm2, 8
psrlq xmm0, 32
and edx, -4
pmuludq xmm1, xmm2
psrlq xmm2, 32
sub edi, edx
pmuludq xmm0, xmm2
pshufd xmm1, xmm1, 8
pshufd xmm0, xmm0, 8
punpckldq xmm1, xmm0
movdqa xmm0, xmm1
psrldq xmm1, 4
pmuludq xmm0, xmm1
movd eax, xmm0
cmp ecx, edx
je .L1
lea edx, [rdi-1]
.L3:
imul eax, edi
test edx, edx
je .L1
imul eax, edx
mov edx, edi
sub edx, 2
je .L1
imul eax, edx
mov edx, edi
sub edx, 3
je .L1
imul eax, edx
mov edx, edi
sub edx, 4
je .L1
imul eax, edx
mov edx, edi
sub edx, 5
je .L1
imul eax, edx
sub edi, 6
je .L1
imul eax, edi
.L1:
ret
.L28:
mov eax, 1
ret
.L8:
mov eax, 1
jmp .L3
.LC0:
.long 1
.long 1
.long 1
.long 1
.LC1:
.long 0
.long -1
.long -2
.long -3
.LC2:
.long -4
.long -4
.long -4
.long -4
I got these results using Compiler Explorer, so it should be the same in a real-world use case.
What's up with that? Are there any cases where this would be faster? Clang seems to do something like this too, but on -O2.

imul r32,r32 has 3 cycle latency on typical modern x86 CPUs (http://agner.org/optimize/). So the scalar implementation can do one multiply per 3 clock cycles, because they're dependent. It's fully pipelined, though, so your scalar loop leaves 2/3rds of the potential throughput unused.
In 3 cycles, the pipeline in Core2 or later can feed 12 uops into the out-of-order part of the core. For small inputs, it might be best to keep the code small and let out-of-order execution overlap the dependency chain with later code, especially if that later code doesn't all depend on the factorial result. But compilers aren't good at knowing when to optimize for latency vs. throughput, and without profile-guided optimization they have no data on how large n usually is.
I suspect that gcc's auto-vectorizer isn't looking at how quickly this will overflow for large n.
A useful scalar optimization would have been unrolling with multiple accumulators, e.g. take advantage of the fact that multiplication is associative and do these in parallel in the loop: prod(n*3/4 .. n) * prod(n/2 .. n*3/4) * prod(n/4 .. n/2) * prod(1..n/4) (with non-overlapping ranges, of course). Multiplication is associative even when it wraps; the product bits only depend on bits at that position and lower, not on (discarded) high bits.
Or more simply, do f0 *= i; f1 *= i+1; f2 *= i+2; f3 *= i+3; i+=4;. And then outside the loop, return (f0*f1) * (f2*f3);. This would be a win in scalar code, too. Of course you also have to account for n % 4 != 0 when unrolling.
What gcc has chosen to do is basically the latter, using pmuludq to do 2 packed multiplies with one instruction (5c latency / 1c or 0.5c throughput on Intel CPUs) It's similar on AMD CPUs; see Agner Fog's instruction tables. Each vector loop iteration does 4 iterations of the factorial loop in your C source, and there's significant instruction-level parallelism within one iteration
The inner loop is only 12 uops long (cmp/jcc macro-fuses into 1), so it can issue at 1 iteration per 3 cycles, same throughput as the latency bottleneck in your scalar version, but doing 4x as much work per iteration.
.L5:
movdqa xmm3, xmm2 ; copy the old i vector
movdqa xmm1, xmm2
paddd xmm2, xmm4 ; [ i0, i1 | i2, i3 ] += 4
add eax, 1
pmuludq xmm3, xmm0 ; [ f0 | f2 ] *= [ i0 | i2 ]
psrlq xmm1, 32 ; bring odd 32 bit elements down to even: [ i1 | i3 ]
psrlq xmm0, 32
pmuludq xmm1, xmm0 ; [ f1 | f3 ] *= [ i1 | i3 ]
pshufd xmm0, xmm3, 8
pshufd xmm1, xmm1, 8
punpckldq xmm0, xmm1 ; merge back into [ f0 f1 f2 f3 ]
cmp eax, edx
jne .L5
So gcc wastes a whole lot of effort emulating a packed 32-bit multiply instead of leaving two separate vector accumulators separate when using pmuludq. I also looked at clang6.0. I think it's falling into the same trap. (Source+asm on the Godbolt compiler explorer)
You didn't use -march=native or anything, so only SSE2 (baseline for x86-64) is available, so only widening 32x32 => 64 bit SIMD multiplies like pmuludq are available for 32-bit input elements. SSE4.1 pmulld is 2 uops on Haswell and later (single-uop on Sandybridge), but would avoid all of gcc's stupid shuffling.
Of course there's a latency bottleneck here, too, especially because of gcc's missed optimizations increasing the length of the loop-carried dep chain involving the accumulators.
Unrolling with more vector accumulators could hide a lot of the pmuludq latency.
With good vectorization, the SIMD integer multipliers can manage 2x or 4x the throughput of the scalar integer multiply unit. (Or, with AVX2, 8x the throughput using vectors of 8x 32-bit integers.)
But the wider the vectors and the more unrolling, the more cleanup code you need.
gcc -march=haswell
We get an inner loop like this:
.L5:
inc eax
vpmulld ymm1, ymm1, ymm0
vpaddd ymm0, ymm0, ymm2
cmp eax, edx
jne .L5
Super simple, but a 10c latency loop-carried dependency chain :/ (pmulld is 2 dependent uops on Haswell and later). Unrolling with multiple accumulators can give up to a 10x throughput boost for large inputs, for 5c latency / 0.5c throughput for SIMD integer multiply uops on Skylake.
But 4 multiplies per 5 cycles is still much better than 1 per 3 for scalar.
Clang unrolls with multiple accumulators by default, so it should be good. But it's a lot of code, so I didn't analyze it by hand. Plug it into IACA or benchmark it for large inputs. (What is IACA and how do I use it?)
Efficient strategies for handling unroll epilogue:
A lookup table for factorial [0..7] is probably the best bet. Arrange things so your vector / unrolled loop does n%8 .. n, instead of 1 .. n/8*8, so the left-over part is always the same for every n.
After a horizontal vector product, do one more scalar multiply with the table lookup result. A SIMD loop already needs some vector constants so you'll probably touch memory anyway, and the table lookup can happen in parallel with the main loop.
8! is 40320, which fits in 16 bits, so a 1..8 lookup table only needs 8 * 2 bytes of storage. Or use 32-bit entries so you can use a memory source operand for imul instead of a separate movzx.

It doesn't make it worse. It runs faster for large numbers. Here are the results for factorial(1000000000):
-O2: 0.78 sec
-O3: 0.5 sec
Of course, using that large number is undefined behavior (because of overflow with signed arithmetic). But the timing is the same with unsigned numbers, for which it is not undefined behavior.
Note, this usage of factorial is usually pointless, as it doesn't calculate num!, but num! & UINT_MAX. But the compiler doesn't know about this.
Maybe with PGO, the compiler won't vectorize this code, if it is always called with small numbers.
If you don't like this behavior, but you want to use -O3, turn off autovectorization with -fno-tree-loop-vectorize.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio