Measuring memory access time x86 - performance

I'm trying to measure cached / non-cached memory access time and the results are confusing me.
Here is the code:
#include <stdio.h>
#include <string.h>     // memset
#include <x86intrin.h>
#include <stdint.h>

#define SIZE 32*1024

char arr[SIZE];

int main()
{
    char *addr;
    unsigned int dummy;
    uint64_t tsc1, tsc2;
    unsigned i;
    volatile char val;

    memset(arr, 0x0, SIZE);
    for (addr = arr; addr < arr + SIZE; addr += 64) {
        _mm_clflush((void *) addr);
    }
    asm volatile("sfence\n\t"
                 :
                 :
                 : "memory");

    tsc1 = __rdtscp(&dummy);
    for (i = 0; i < SIZE; i++) {
        asm volatile (
            "mov %0, %%al\n\t" // load data
            :
            : "m" (arr[i])
        );
    }
    tsc2 = __rdtscp(&dummy);
    printf("(1) tsc: %llu\n", tsc2 - tsc1);

    tsc1 = __rdtscp(&dummy);
    for (i = 0; i < SIZE; i++) {
        asm volatile (
            "mov %0, %%al\n\t" // load data
            :
            : "m" (arr[i])
        );
    }
    tsc2 = __rdtscp(&dummy);
    printf("(2) tsc: %llu\n", tsc2 - tsc1);

    return 0;
}
the output:
(1) tsc: 451248
(2) tsc: 449568
I expected that the first value would be much larger, because the caches were invalidated by clflush in case (1).
Info about my CPU (Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz) caches:
Cache ID 0:
- Level: 1
- Type: Data Cache
- Sets: 64
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 1:
- Level: 1
- Type: Instruction Cache
- Sets: 128
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 4
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 2:
- Level: 2
- Type: Unified Cache
- Sets: 512
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 262144 bytes (256 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 3:
- Level: 3
- Type: Unified Cache
- Sets: 8192
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 12
- Total Size: 6291456 bytes (6144 kb)
- Is fully associative: false
- Is Self Initializing: true
Code disassembly between two rdtscp instructions
400614: 0f 01 f9 rdtscp
400617: 89 ce mov %ecx,%esi
400619: 48 8b 4d d8 mov -0x28(%rbp),%rcx
40061d: 89 31 mov %esi,(%rcx)
40061f: 48 c1 e2 20 shl $0x20,%rdx
400623: 48 09 d0 or %rdx,%rax
400626: 48 89 45 c0 mov %rax,-0x40(%rbp)
40062a: c7 45 b4 00 00 00 00 movl $0x0,-0x4c(%rbp)
400631: eb 0d jmp 400640 <main+0x8a>
400633: 8b 45 b4 mov -0x4c(%rbp),%eax
400636: 8a 80 80 10 60 00 mov 0x601080(%rax),%al
40063c: 83 45 b4 01 addl $0x1,-0x4c(%rbp)
400640: 81 7d b4 ff 7f 00 00 cmpl $0x7fff,-0x4c(%rbp)
400647: 76 ea jbe 400633 <main+0x7d>
400649: 48 8d 45 b0 lea -0x50(%rbp),%rax
40064d: 48 89 45 e0 mov %rax,-0x20(%rbp)
400651: 0f 01 f9 rdtscp
Looks like I'm missing / misunderstanding something. Could you suggest what?

A byte load like mov %0, %%al is so slow here (one cache line per 64 clocks, or per 32 clocks on Sandybridge specifically, not Haswell or later) that you might bottleneck on it whether your loads ultimately come from DRAM or from L1D.
Only every 64th load will miss in cache, because you're taking full advantage of spatial locality with your tiny byte-load loop. If you actually wanted to test how fast the cache can refill after flushing an L1D-sized block, you should use a SIMD movdqa loop, or just byte loads with a stride of 64. (You only need to touch one byte per cache line.)
To avoid the false dependency on the old value of RAX, you should use movzbl %0, %eax. This will let Sandybridge and later (or AMD since K8) use their full load throughput of 2 loads per clock to keep the memory pipeline closer to full. Multiple cache misses can be in flight at once: Intel CPU cores have 10 LFBs (line fill buffers) for lines to/from L1D, or 16 Superqueue entries for lines from L2 to off-core. See also Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?. (Many-core Xeon chips have worse single-thread memory bandwidth than desktops/laptops.)
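For example, a rough sketch of the timed loop with both fixes applied (stride of one cache line, zero-extending load into a full register; this is my illustration, not the asker's code):

// Touch one byte per 64-byte line; movzbl writes the whole 32-bit register,
// so there's no merge into AL and no false dependency on the old value of RAX.
for (i = 0; i < SIZE; i += 64) {
    asm volatile ("movzbl %1, %0" : "=r" (dummy) : "m" (arr[i]));
}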
But your bottleneck is far worse than that!
You compiled with optimizations disabled, so your loop uses addl $0x1,-0x4c(%rbp) for the loop counter, which gives you at least a 6-cycle loop-carried dependency chain. (Store/reload store-forwarding latency + 1 cycle for the ALU add.) http://agner.org/optimize/
(Maybe even higher because of resource conflicts for the load port. i7-720 is a Nehalem microarchitecture, so there's only one load port.)
This definitely means your loop doesn't bottleneck on cache misses, and will probably run at about the same speed whether you use clflush or not.
Also note that rdtsc counts reference cycles, not core clock cycles: it always ticks at your CPU's rated 1.6GHz, regardless of the CPU actually running slower (powersave) or faster (turbo). Control for this with a warm-up loop.
You also didn't declare a clobber on eax, so the compiler isn't expecting your code to modify rax. You end up with mov 0x601080(%rax),%al. But gcc reloads rax from memory every iteration, and doesn't use the rax that you modify, so you aren't actually skipping around in memory like you might be if you'd compiled with optimizations.
Hint: use volatile char * if you want to get the compiler to actually load, and not optimize it to fewer wider loads. You don't need inline asm for this.
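For example, a minimal no-asm sketch reusing the asker's arr, i and __rdtscp timing (compile with optimizations enabled, and do a warm-up pass first so the reference-cycle numbers aren't skewed by frequency ramp-up):

volatile char *p = arr;
char sink = 0;
for (i = 0; i < SIZE; i += 64) {   // one byte per 64-byte cache line
    sink += p[i];                  // volatile forces a real load every time
}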

disable all AVX-512 instructions for g++ build

Hi, I'm trying to build without any AVX-512 instructions by using these flags:
-march=native -mno-avx512f.
However, I still get a binary which has an
AVX-512 (vmovss) instruction generated (I'm using elfx86exts to check).
Any idea how to disable those?
-march=native -mno-avx512f is the correct pair of options; vmovss only requires AVX1.
There is an AVX512F EVEX encoding of vmovss, but GAS won't use it unless the register involved is xmm16..31. GCC won't emit asm using those registers when you disable AVX512F with -mno-avx512f, or don't enable it in the first place with something like -march=skylake or -march=znver2.
If you're still not sure, check the actual disassembly + machine code to see what prefix the instruction starts with:
a C5 or C4 byte: start of a 2 or 3 byte VEX prefix, AVX1 encoding.
a 62 byte: start of an EVEX prefix, AVX512F encoding
Example:
.intel_syntax noprefix
vmovss xmm15, [rdi]
vmovss xmm15, [r11]
vmovss xmm16, [rdi]
Assembled with gcc -c avx.s and disassembled with objdump -drwC -Mintel avx.o:
0000000000000000 <.text>:
0: c5 7a 10 3f vmovss xmm15,DWORD PTR [rdi] # AVX1
4: c4 41 7a 10 3b vmovss xmm15,DWORD PTR [r11] # AVX1
9: 62 e1 7e 08 10 07 vmovss xmm16,DWORD PTR [rdi] # AVX512F
Note the 2- and 3-byte VEX and the 4-byte EVEX prefixes before the 10 opcode byte. (The ModRM bytes are different too; xmm0 and xmm16 would differ only in the extra register bit from the prefix, not the ModRM.)
GAS uses the AVX1 VEX encoding of vmovss and other instructions when possible. So you can count on instructions that have a non-AVX512F form to be using the non-AVX512F form whenever possible. This is how the GNU toolchain (used by GCC) makes -mno-avx512f work.
This applies even when the EVEX encoding is shorter. e.g. when a [reg + constant] could use an AVX512 scaled disp8 (scaled by the element width) but the AVX1 encoding would need a 32-bit displacement that counts in bytes.
f: c5 7a 10 bf 00 01 00 00 vmovss xmm15,DWORD PTR [rdi+0x100] # AVX1 [reg+disp32]
17: 62 e1 7e 08 10 47 40 vmovss xmm16,DWORD PTR [rdi+0x100] # AVX512 [reg + disp8*4]
1e: c5 78 28 bf 00 01 00 00 vmovaps xmm15,XMMWORD PTR [rdi+0x100] # AVX1 [reg+disp32]
26: 62 e1 7c 08 28 47 10 vmovaps xmm16,XMMWORD PTR [rdi+0x100] # AVX512 [reg + disp8*16]
Note the last byte, or last 4 bytes, of the machine code encodings: it's a 32-bit little-endian 0x100 byte displacement for the AVX1 encodings, but an 8-bit displacement of 0x40 dwords or 0x10 dqwords for the AVX512 encodings.
But using an asm-source override of {evex} vmovaps xmm0, [rdi+256] we can get the compact encoding even for "low" registers:
62 f1 7c 08 28 47 10 vmovaps xmm0,XMMWORD PTR [rdi+0x100]
GCC will of course not do that with -mno-avx512f.
Unfortunately GCC and clang also miss that optimization when you do enable AVX512F, e.g. when compiling __m128 load(__m128 *p){ return p[16]; } with -O3 -march=skylake-avx512 (Godbolt). Use binary mode, or simply note the lack of an {evex} tag on that asm source line of compiler output.
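For reference, here is that test case as a standalone file (a sketch; the file name and comments are mine). With -O3 -march=skylake-avx512 the load still comes out as the 8-byte VEX encoding with a disp32 rather than the 7-byte EVEX disp8*16 form, unless you add the {evex} override by hand:

// evex_test.c -- compile with: gcc -O3 -march=skylake-avx512 -S evex_test.c
#include <immintrin.h>

// Expected asm (roughly): vmovaps 256(%rdi), %xmm0   (VEX, 4-byte displacement)
__m128 load(__m128 *p) { return p[16]; }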
I found an error in my use-case: one of the compiled units depended on the OpenVINO SDK, which added the -mavx512f flag explicitly.

libsvm compiled with AVX vs no AVX

I compiled a libsvm benchmarking app which does svm_predict() 100 times on the same image using the same model. The libsvm is compiled statically (MSVC 2017) by directly including svm.cpp and svm.h in my project.
EDIT: adding benchmark details
for (int i = 0; i < counter; i++)
{
    std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
    double label = svm_predict(model, input);
    std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    total_time += duration;
    std::cout << "\n\n\n" << sum << " label:" << label << " duration:" << duration << "\n\n\n";
}
This is the loop that I benchmark without any major modifications to the libsvm code.
After 100 runs the average for one run is 4.7 ms, with no difference whether or not I use AVX instructions. To make sure the compiler generates the expected instructions, I used the Intel Software Development Emulator to check the instruction mix.
with AVX:
*isa-ext-AVX 36578280
*isa-ext-SSE 4
*isa-ext-SSE2 4
*isa-set-SSE 4
*isa-set-SSE2 4
*scalar-simd 36568174
*sse-scalar 4
*sse-packed 4
*avx-scalar 36568170
*avx128 8363
*avx256 1765
The other run, without AVX:
*isa-ext-SSE 11781
*isa-ext-SSE2 36574119
*isa-set-SSE 11781
*isa-set-SSE2 36574119
*scalar-simd 36564559
*sse-scalar 36564559
*sse-packed 21341
I would expect to get some performance improvement. I know that avx128/256/512 instructions are not used that much, but still. I have an i7-8550U CPU; do you think that if I ran the same test on a Skylake i9 X-series I would see a bigger difference?
EDIT
I added the instruction mix for each binary
With AVX:
ADD 16868725
AND 49
BT 6
CALL_NEAR 14032515
CDQ 4
CDQE 3601
CMOVLE 6
CMOVNZ 2
CMOVO 12
CMOVZ 6
CMP 25417120
CMPXCHG_LOCK 1
CPUID 3
CQO 12
DEC 68
DIV 1
IDIV 12
IMUL 3621
INC 8496372
JB 325
JBE 5
JL 7101
JLE 38338
JMP 8416984
JNB 6
JNBE 3
JNL 806
JNLE 61
JNS 1
JNZ 22568320
JS 2
JZ 8465164
LEA 16829868
MOV 42209230
MOVSD_XMM 4
MOVSXD 1141
MOVUPS 4
MOVZX 3684
MUL 12
NEG 72
NOP 4219
NOT 1
OR 14
POP 1869
PUSH 1870
REP_STOSD 6
RET_NEAR 1758
ROL 5
ROR 10
SAR 8
SBB 5
SETNZ 4
SETZ 26
SHL 1626
SHR 519
SUB 6530
TEST 5616533
VADDPD 594
VADDSD 8445597
VCOMISD 3
VCVTSI2SD 3603
VEXTRACTF128 6
VFMADD132SD 12
VFMADD231SD 6
VHADDPD 6
VMOVAPD 12
VMOVAPS 2375
VMOVDQU 1
VMOVSD 11256384
VMOVUPD 582
VMULPD 582
VMULSD 8451540
VPXOR 1
VSUBSD 8407425
VUCOMISD 3600
VXORPD 2362
VXORPS 3603
VZEROUPPER 4
XCHG 8
XGETBV 1
XOR 8414763
*total 213991340
Part 2, without AVX:
ADD 16869910
ADDPD 1176
ADDSD 8445609
AND 49
BT 6
CALL_NEAR 14032515
CDQ 4
CDQE 3601
CMOVLE 6
CMOVNZ 2
CMOVO 12
CMOVZ 6
CMP 25417408
CMPXCHG_LOCK 1
COMISD 3
CPUID 3
CQO 12
CVTDQ2PD 3603
DEC 68
DIV 1
IDIV 12
IMUL 3621
INC 8496369
JB 325
JBE 5
JL 7392
JLE 38338
JMP 8416984
JNB 6
JNBE 3
JNL 803
JNLE 61
JNS 1
JNZ 22568317
JS 2
JZ 8465164
LEA 16829548
MOV 42209235
MOVAPS 7073
MOVD 3603
MOVDQU 2
MOVSD_XMM 11256376
MOVSXD 1141
MOVUPS 2344
MOVZX 3684
MUL 12
MULPD 1170
MULSD 8451546
NEG 72
NOP 4159
NOT 1
OR 14
POP 1865
PUSH 1866
REP_STOSD 6
RET_NEAR 1758
ROL 5
ROR 10
SAR 8
SBB 5
SETNZ 4
SETZ 26
SHL 1626
SHR 516
SUB 6515
SUBSD 8407425
TEST 5616533
UCOMISD 3600
UNPCKHPD 6
XCHG 8
XGETBV 1
XOR 8414745
XORPS 2364
*total 214000270
Almost all of the arithmetic instructions you are listing work on scalars, e.g. (V)SUBSD means SUBtract Scalar Double. The V in front essentially just means that the AVX encoding is used (this also clears the upper half of the register, which the SSE encodings don't do). But given the instructions you listed, there should be barely any runtime difference.
Modern x86 uses SSE1/2 or AVX for scalar FP math, using just the low element of XMM vector registers. It's somewhat better than x87 (more registers, and flat register set), but it's still only one result per instruction.
There are a few thousand packed SIMD instructions, vs. ~36 million scalar instructions, so only a relatively unimportant part of the code got auto-vectorized and could benefit from 256-bit vectors.
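If it helps to see the distinction in source form, here is a rough sketch (not libsvm code; names are made up) of the scalar pattern that dominates your instruction mix versus a hand-vectorized packed version. Only the second kind of code gets faster from AVX's 256-bit registers; compile the packed version with AVX enabled (/arch:AVX on MSVC, -mavx on gcc/clang):

#include <stddef.h>
#include <immintrin.h>

// Scalar loop: compiles to addsd/mulsd (or vaddsd/vmulsd with AVX enabled).
// Either way it's ONE double per instruction, so the encoding barely matters.
double dot_scalar(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

// Packed 256-bit loop: vmulpd/vaddpd handle FOUR doubles per instruction.
double dot_packed(const double *a, const double *b, size_t n)
{
    __m256d acc = _mm256_setzero_pd();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm256_add_pd(acc, _mm256_mul_pd(_mm256_loadu_pd(a + i),
                                               _mm256_loadu_pd(b + i)));
    double tmp[4];
    _mm256_storeu_pd(tmp, acc);
    double s = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < n; i++)             // leftover elements
        s += a[i] * b[i];
    return s;
}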

Why does loop alignment on 32 byte make code faster?

Look at this code:
one.cpp:
bool test(int a, int b, int c, int d);

int main() {
    volatile int va = 1;
    volatile int vb = 2;
    volatile int vc = 3;
    volatile int vd = 4;

    int a = va;
    int b = vb;
    int c = vc;
    int d = vd;
    int s = 0;

    __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
    __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
    __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
    __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");

    for (int i=0; i<2000000000; i++) {
        s += test(a, b, c, d);
    }
    return s;
}

two.cpp:

bool test(int a, int b, int c, int d) {
    // return a == d || b == d || c == d;
    return false;
}
There are 16 nops in one.cpp. You can comment/uncomment them to change the alignment of the loop's entry point between 16 and 32. I've compiled them with g++ one.cpp two.cpp -O3 -mtune=native.
Here are my questions:
the 32-aligned version is faster than the 16-aligned version. On Sandy Bridge the difference is 20%; on Haswell, 8%. Why the difference?
with the 32-aligned version, the code runs at the same speed on Sandy Bridge no matter which return statement is in two.cpp. I thought the return false version should be at least a little bit faster. But no, exactly the same speed!
If I remove the volatiles from one.cpp, the code becomes slower (Haswell: before: ~2.17 sec, after: ~2.38 sec). Why is that? And it only happens when the loop is aligned to 32.
The fact that the 32-aligned version is faster is strange to me, because the Intel® 64 and IA-32 Architectures Optimization Reference Manual says (page 3-9):
Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch
targets should be 16-byte aligned.
Another little question: are there any tricks to make only this loop 32-byte aligned (so the rest of the code could keep using 16-byte alignment)?
Note: I've tried compilers gcc 6, gcc 7 and clang 3.9, same results.
Here's the code with volatile (the code is the same for 16/32-aligned, just the addresses differ):
0000000000000560 <main>:
560: 41 57 push r15
562: 41 56 push r14
564: 41 55 push r13
566: 41 54 push r12
568: 55 push rbp
569: 31 ed xor ebp,ebp
56b: 53 push rbx
56c: bb 00 94 35 77 mov ebx,0x77359400
571: 48 83 ec 18 sub rsp,0x18
575: c7 04 24 01 00 00 00 mov DWORD PTR [rsp],0x1
57c: c7 44 24 04 02 00 00 mov DWORD PTR [rsp+0x4],0x2
583: 00
584: c7 44 24 08 03 00 00 mov DWORD PTR [rsp+0x8],0x3
58b: 00
58c: c7 44 24 0c 04 00 00 mov DWORD PTR [rsp+0xc],0x4
593: 00
594: 44 8b 3c 24 mov r15d,DWORD PTR [rsp]
598: 44 8b 74 24 04 mov r14d,DWORD PTR [rsp+0x4]
59d: 44 8b 6c 24 08 mov r13d,DWORD PTR [rsp+0x8]
5a2: 44 8b 64 24 0c mov r12d,DWORD PTR [rsp+0xc]
5a7: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
5ac: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
5b3: 00 00 00
5b6: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
5bd: 00 00 00
5c0: 44 89 e1 mov ecx,r12d
5c3: 44 89 ea mov edx,r13d
5c6: 44 89 f6 mov esi,r14d
5c9: 44 89 ff mov edi,r15d
5cc: e8 4f 01 00 00 call 720 <test(int, int, int, int)>
5d1: 0f b6 c0 movzx eax,al
5d4: 01 c5 add ebp,eax
5d6: 83 eb 01 sub ebx,0x1
5d9: 75 e5 jne 5c0 <main+0x60>
5db: 48 83 c4 18 add rsp,0x18
5df: 89 e8 mov eax,ebp
5e1: 5b pop rbx
5e2: 5d pop rbp
5e3: 41 5c pop r12
5e5: 41 5d pop r13
5e7: 41 5e pop r14
5e9: 41 5f pop r15
5eb: c3 ret
5ec: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
Without volatile:
0000000000000560 <main>:
560: 55 push rbp
561: 31 ed xor ebp,ebp
563: 53 push rbx
564: bb 00 94 35 77 mov ebx,0x77359400
569: 48 83 ec 08 sub rsp,0x8
56d: 66 0f 1f 84 00 00 00 nop WORD PTR [rax+rax*1+0x0]
574: 00 00
576: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
57d: 00 00 00
580: b9 04 00 00 00 mov ecx,0x4
585: ba 03 00 00 00 mov edx,0x3
58a: be 02 00 00 00 mov esi,0x2
58f: bf 01 00 00 00 mov edi,0x1
594: e8 47 01 00 00 call 6e0 <test(int, int, int, int)>
599: 0f b6 c0 movzx eax,al
59c: 01 c5 add ebp,eax
59e: 83 eb 01 sub ebx,0x1
5a1: 75 dd jne 580 <main+0x20>
5a3: 48 83 c4 08 add rsp,0x8
5a7: 89 e8 mov eax,ebp
5a9: 5b pop rbx
5aa: 5d pop rbp
5ab: c3 ret
5ac: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
This doesn't answer point 2 (return a == d || b == d || c == d; being the same speed as return false). That's still a maybe-interesting question, since that must compile to multiple uop-cache lines of instructions.
The fact that 32-aligned version is faster, is strange to me, because [Intel's manual says to align to 16]
That optimization-guide advice is a very general guideline, and definitely doesn't mean that larger never helps. Usually it doesn't, and padding to 32 would be more likely to hurt than help. (I-cache misses, ITLB misses, and more code bytes to load from disk).
In fact, 16B alignment is rarely necessary, especially on CPUs with a uop cache. For a small loop that can run from the loop buffer, alignment is usually totally irrelevant.
(Skylake microcode updates disabled the loop buffer to work around a partial-register AH-merging bug, SKL150. This creates problems for tiny loops that span a 32-byte boundary, only running one iteration per 2 clocks instead of the one iteration per 1.5 clocks you might get from a 6-uop loop on Haswell, or on SKL with older microcode. The LSD is not re-enabled until Ice Lake; it's still broken in Kaby/Coffee/Comet Lake, which are the same microarchitecture as SKL/SKX.)
Another SKL erratum workaround created another worse code-alignment pothole: How can I mitigate the impact of the Intel jcc erratum on gcc?
16B is still not bad as a broad recommendation, but it doesn't tell you everything you need to know to understand one specific case on a couple of specific CPUs.
Compilers usually default to aligning loop branches and function entry-points, but usually don't align other branch targets. The cost of executing a NOP (and code bloat) is often larger than the likely cost of an unaligned non-loop branch target.
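As for the asker's side question about aligning just this one loop: with the GNU toolchain you can either ask the compiler to pad all loop tops, or drop an assembler directive in front of the one you care about. A sketch (the directive is only emitted where GCC happens to place that statement, so check the disassembly):

// Whole translation unit:  g++ -O3 -falign-loops=32 one.cpp two.cpp
// Or just before one loop, via inline asm:
__asm__(".p2align 5");   // align the next emitted instruction to 32 bytes
for (int i = 0; i < 2000000000; i++) {
    s += test(a, b, c, d);
}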
Code alignment has some direct and some indirect effects. The direct effects include the uop cache on Intel SnB-family. For example, see Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs.
Another section of Intel's optimization manual goes into some detail about how the uop cache works:
2.3.2.2 Decoded ICache:
All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have their EIPs within the
same aligned 32-byte region. (I think this means an instruction that
extends past the boundary goes in the uop cache for the block
containing its start, rather than end. Spanning instructions have to
go somewhere, and the branch target address that would run the
instruction is the start of the insn, so it's most useful to put it in
a line for that block).
A multi micro-op instruction cannot be split across Ways.
An instruction which turns on the MSROM consumes an entire Way.
Up to two branches are allowed per Way.
A pair of macro-fused instructions is kept as one micro-op.
See also Agner Fog's microarch guide. He adds:
An unconditional jump or call always ends a μop cache line
lots of other stuff that probably isn't relevant here.
Also, note that if your code doesn't fit in the uop cache, it can't run from the loop buffer.
The indirect effects of alignment include:
larger/smaller code-size (L1I cache misses, TLB). Not relevant for your test
which branches alias each other in the BTB (Branch Target Buffer).
If I remove volatiles from one.cpp, code becomes slower. Why is that?
The larger instructions push the last instruction in the loop across a 32B boundary:
59e: 83 eb 01 sub ebx,0x1
5a1: 75 dd jne 580 <main+0x20>
So if you aren't running from the loop buffer (LSD), then without volatile one of the uop-cache fetch cycles gets only 1 uop.
If sub/jne macro-fuses, this might not apply. And I think only crossing a 64B boundary would break macro-fusion.
Also, those aren't real addresses. Have you checked what the addresses are after linking? There could be a 64B boundary there after linking, if the text section has less than 64B alignment.
Also related to 32-byte boundaries, the JCC erratum disables the uop cache for blocks where a branch (including macro-fused ALU+JCC) includes the last byte of the line, on Skylake CPUs.
How can I mitigate the impact of the Intel jcc erratum on gcc?
Sorry, I haven't actually tested this to say more about this specific case. The point is, when you bottleneck on the front-end from stuff like having a call/ret inside a tight loop, alignment becomes important and can get extremely complex. Boundary-crossing or not is affected for all later instructions. Do not expect it to be simple. If you've read my other answers, you'll know I'm not usually the kind of person to say "it's too complicated to fully explain", but alignment can be that way.
See also Code alignment in one object file is affecting the performance of a function in another object file
In your case, make sure tiny functions inline. Use link-time optimization if your code-base has any important tiny functions in separate .c files instead of in a .h where they can inline. Or change your code to put them in a .h.
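For example, a sketch of the build-level fix (assuming the GNU toolchain and that you can edit the headers):

// two.h: giving the caller the definition lets test() inline into the loop
inline bool test(int a, int b, int c, int d) {
    return a == d || b == d || c == d;
}

// ...or keep one.cpp / two.cpp separate and let the linker inline across
// translation units:
//   g++ -O3 -flto one.cpp two.cpp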

Why is this SSE code 6 times slower without VZEROUPPER on Skylake?

I've been trying to figure out a performance problem in an application and have finally narrowed it down to a really weird problem. The following piece of code runs 6 times slower on a Skylake CPU (i5-6500) if the VZEROUPPER instruction is commented out. I've tested Sandy Bridge and Ivy Bridge CPUs and both versions run at the same speed, with or without VZEROUPPER.
Now I have a fairly good idea of what VZEROUPPER does, and I think it should not matter at all to this code when there are no VEX-coded instructions and no calls to any function which might contain them. The fact that it does not matter on other AVX-capable CPUs appears to support this. So does Table 11-2 in the Intel® 64 and IA-32 Architectures Optimization Reference Manual.
So what is going on?
The only theory I have left is that there's a bug in the CPU and it's incorrectly triggering the "save the upper half of the AVX registers" procedure where it shouldn't. Or something else just as strange.
This is main.cpp:
#include <immintrin.h>

int slow_function( double i_a, double i_b, double i_c );

int main()
{
    /* DAZ and FTZ, does not change anything here. */
    _mm_setcsr( _mm_getcsr() | 0x8040 );

    /* This instruction fixes performance. */
    __asm__ __volatile__ ( "vzeroupper" : : : );

    int r = 0;
    for( unsigned j = 0; j < 100000000; ++j )
    {
        r |= slow_function(
                0.84445079384884236262,
                -6.1000481519580951328,
                5.0302160279288017364 );
    }
    return r;
}
and this is slow_function.cpp:
#include <immintrin.h>

int slow_function( double i_a, double i_b, double i_c )
{
    __m128d sign_bit = _mm_set_sd( -0.0 );
    __m128d q_a = _mm_set_sd( i_a );
    __m128d q_b = _mm_set_sd( i_b );
    __m128d q_c = _mm_set_sd( i_c );
    int vmask;
    const __m128d zero = _mm_setzero_pd();

    __m128d q_abc = _mm_add_sd( _mm_add_sd( q_a, q_b ), q_c );

    if( _mm_comigt_sd( q_c, zero ) && _mm_comigt_sd( q_abc, zero ) )
    {
        return 7;
    }

    __m128d discr = _mm_sub_sd(
        _mm_mul_sd( q_b, q_b ),
        _mm_mul_sd( _mm_mul_sd( q_a, q_c ), _mm_set_sd( 4.0 ) ) );

    __m128d sqrt_discr = _mm_sqrt_sd( discr, discr );
    __m128d q = sqrt_discr;
    __m128d v = _mm_div_pd(
        _mm_shuffle_pd( q, q_c, _MM_SHUFFLE2( 0, 0 ) ),
        _mm_shuffle_pd( q_a, q, _MM_SHUFFLE2( 0, 0 ) ) );
    vmask = _mm_movemask_pd(
        _mm_and_pd(
            _mm_cmplt_pd( zero, v ),
            _mm_cmple_pd( v, _mm_set1_pd( 1.0 ) ) ) );

    return vmask + 1;
}
The function compiles down to this with clang:
0: f3 0f 7e e2 movq %xmm2,%xmm4
4: 66 0f 57 db xorpd %xmm3,%xmm3
8: 66 0f 2f e3 comisd %xmm3,%xmm4
c: 76 17 jbe 25 <_Z13slow_functionddd+0x25>
e: 66 0f 28 e9 movapd %xmm1,%xmm5
12: f2 0f 58 e8 addsd %xmm0,%xmm5
16: f2 0f 58 ea addsd %xmm2,%xmm5
1a: 66 0f 2f eb comisd %xmm3,%xmm5
1e: b8 07 00 00 00 mov $0x7,%eax
23: 77 48 ja 6d <_Z13slow_functionddd+0x6d>
25: f2 0f 59 c9 mulsd %xmm1,%xmm1
29: 66 0f 28 e8 movapd %xmm0,%xmm5
2d: f2 0f 59 2d 00 00 00 mulsd 0x0(%rip),%xmm5 # 35 <_Z13slow_functionddd+0x35>
34: 00
35: f2 0f 59 ea mulsd %xmm2,%xmm5
39: f2 0f 58 e9 addsd %xmm1,%xmm5
3d: f3 0f 7e cd movq %xmm5,%xmm1
41: f2 0f 51 c9 sqrtsd %xmm1,%xmm1
45: f3 0f 7e c9 movq %xmm1,%xmm1
49: 66 0f 14 c1 unpcklpd %xmm1,%xmm0
4d: 66 0f 14 cc unpcklpd %xmm4,%xmm1
51: 66 0f 5e c8 divpd %xmm0,%xmm1
55: 66 0f c2 d9 01 cmpltpd %xmm1,%xmm3
5a: 66 0f c2 0d 00 00 00 cmplepd 0x0(%rip),%xmm1 # 63 <_Z13slow_functionddd+0x63>
61: 00 02
63: 66 0f 54 cb andpd %xmm3,%xmm1
67: 66 0f 50 c1 movmskpd %xmm1,%eax
6b: ff c0 inc %eax
6d: c3 retq
The generated code is different with gcc, but it shows the same problem. An older version of the Intel compiler generates yet another variation of the function which shows the problem too, but only if main.cpp is not built with the Intel compiler, since that inserts calls to initialize some of its own libraries which probably end up doing VZEROUPPER somewhere.
And of course, if the whole thing is built with AVX support so the intrinsics are turned into VEX coded instructions, there is no problem either.
I've tried profiling the code with perf on linux and most of the runtime usually lands on 1-2 instructions but not always the same ones depending on which version of the code I profile (gcc, clang, intel). Shortening the function appears to make the performance difference gradually go away so it looks like several instructions are causing the problem.
EDIT: Here's a pure assembly version, for linux. Comments below.
.text
.p2align 4, 0x90
.globl _start
_start:
#vmovaps %ymm0, %ymm1 # This makes SSE code crawl.
#vzeroupper # This makes it fast again.
movl $100000000, %ebp
.p2align 4, 0x90
.LBB0_1:
xorpd %xmm0, %xmm0
xorpd %xmm1, %xmm1
xorpd %xmm2, %xmm2
movq %xmm2, %xmm4
xorpd %xmm3, %xmm3
movapd %xmm1, %xmm5
addsd %xmm0, %xmm5
addsd %xmm2, %xmm5
mulsd %xmm1, %xmm1
movapd %xmm0, %xmm5
mulsd %xmm2, %xmm5
addsd %xmm1, %xmm5
movq %xmm5, %xmm1
sqrtsd %xmm1, %xmm1
movq %xmm1, %xmm1
unpcklpd %xmm1, %xmm0
unpcklpd %xmm4, %xmm1
decl %ebp
jne .LBB0_1
mov $0x1, %eax
int $0x80
Ok, so as suspected in comments, using VEX coded instructions causes the slowdown. Using VZEROUPPER clears it up. But that still does not explain why.
As I understand it, not using VZEROUPPER is supposed to involve a cost to transition to old SSE instructions but not a permanent slowdown of them. Especially not such a large one. Taking loop overhead into account, the ratio is at least 10x, perhaps more.
I have tried messing with the assembly a little and float instructions are just as bad as double ones. I could not pinpoint the problem to a single instruction either.
You are experiencing a penalty for "mixing" non-VEX SSE and VEX-encoded instructions - even though your entire visible application doesn't obviously use any AVX instructions!
Prior to Skylake, this type of penalty was only a one-time transition penalty, when switching from code that used vex to code that didn't, or vice-versa. That is, you never paid an ongoing penalty for whatever happened in the past unless you were actively mixing VEX and non-VEX. In Skylake, however, there is a state where non-VEX SSE instructions pay a high ongoing execution penalty, even without further mixing.
Straight from the horse's mouth, here's Figure 11-1[1] - the old (pre-Skylake) transition diagram:
As you can see, all of the penalties (red arrows) bring you to a new state, at which point there is no longer a penalty for repeating that action. For example, if you get to the dirty upper state by executing some 256-bit AVX, and you then execute legacy SSE, you pay a one-time penalty to transition to the preserved non-INIT upper state, but you don't pay any penalties after that.
In Skylake, everything is different per Figure 11-2:
There are fewer penalties overall, but critically for your case, one of them is a self-loop: the penalty for executing a legacy SSE instruction in the dirty upper state (Penalty A in Figure 11-2) keeps you in that state. That's what happens to you - any AVX instruction puts you in the dirty upper state, which slows all further SSE execution down.
Here's what Intel says (section 11.3) about the new penalty:
The Skylake microarchitecture implements a different state machine
than prior generations to manage the YMM state transition associated
with mixing SSE and AVX instructions. It no longer saves the entire
upper YMM state when executing an SSE instruction when in “Modified
and Unsaved” state, but saves the upper bits of individual register.
As a result, mixing SSE and AVX instructions will experience a penalty
associated with partial register dependency of the destination
registers being used and additional blend operation on the upper bits
of the destination registers.
So the penalty is apparently quite large - it has to blend the top bits all the time to preserve them, and it also makes instructions which are apparently independent become dependent, since there is a dependency on the hidden upper bits. For example, xorpd xmm0, xmm0 no longer breaks the dependency on the previous value of xmm0, since the result actually depends on the hidden upper bits of ymm0, which aren't cleared by the xorpd. That latter effect is probably what kills your performance, since you'll now have very long dependency chains that you wouldn't expect from the usual analysis.
This is among the worst type of performance pitfall: where the behavior/best practice for the prior architecture is essentially opposite of the current architecture. Presumably the hardware architects had a good reason for making the change, but it does just add another "gotcha" to the list of subtle performance issues.
I would file a bug against the compiler or runtime that inserted that AVX instruction and didn't follow up with a VZEROUPPER.
Update: Per the OP's comment below, the offending (AVX) code was inserted by the runtime linker ld and a bug already exists.
[1] From Intel's optimization manual.
I just made some experiments (on a Haswell). The transition between clean and dirty states is not expensive, but the dirty state makes every non-VEX vector operation dependent on the previous value of the destination register. In your case, for example movapd %xmm1, %xmm5 will have a false dependency on ymm5 which prevents out-of-order execution. This explains why vzeroupper is needed after AVX code.
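If you can't stop the earlier component from dirtying the upper YMM state, one workaround is to issue vzeroupper yourself before the hot SSE code. A sketch (the helper name is mine):

#include <immintrin.h>

// Call this once after anything that may have executed VEX/AVX instructions
// (library init, the dynamic linker, ...), before the hot non-VEX SSE loops.
static inline void clear_ymm_upper(void)
{
#ifdef __AVX__
    _mm256_zeroupper();                  // compiles to vzeroupper
#else
    __asm__ __volatile__("vzeroupper");  // same instruction without AVX codegen
#endif
}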

Can I stop gcc from allocating a local variable on the stack?

I'm using gcc on x86-64 and declaring some local variables with the "register" modifier. I would like to find a way to severely discourage the compiler from allocating and using stack space for these variables. I'd like these variables to remain in registers as much as possible. I'm mixing C/C++ code with inline assembly.
The variables are simply working storage and don't need to be permanently stored and retrieved later, yet I see my gcc -O2 code still tucking them into their local stack space from time to time. I understand that their state will need to be preserved when I make C/C++ function calls from time to time, but can I do something to be certain that this spilling is otherwise severely discouraged?
Here is an example of what I'm doing. This is a portion of an event-driven logic simulator for those who are wondering:
register __m128d VAL0, VAL1, diff0, diff1;
register __m128d *outputValPtr;
__m128d **cmfmLocs;
...
// all pointers are made to point to valid data
// cmfmLocs is a 0-terminated array of pointers with at least one entry
diff0 = outputValPtr[0];
diff1 = outputValPtr[1];
VAL0 = *(cmfmLocs[0]);
VAL1 = *(cmfmLocs[0]+1);
cfPin = 1;
do
{
    asm( "andpd %[src1], %[dest1]\n"
         "orpd %[src2], %[dest2]\n" :
         [dest1] "=x" (VAL0),
         [dest2] "=x" (VAL1) :
         [src1] "m" (*(cmfmLocs[cfPin])),
         [src2] "m" (*(cmfmLocs[cfPin]+1)) );
    cfPin++;
} while ( cmfmLocs[cfPin] );

asm( "xorpd %[val0], %[diffBit0]\n"
     "xorpd %[val1], %[diffBit1]\n"
     "orpd %[diffBit1], %[diffBit0]\n"
     "ptest %[diffBit0], %[diffBit0]\n"
     "jz dontSchedule\n"
     "movdqa %[val0], (%[permStor])\n"
     "movdqa %[val1], 16(%[permStor])\n" :
     [diffBit0] "=x" (diff0),
     [diffBit1] "=x" (diff1),
     [memWrite1] "=m" (outputValPtr[0]),
     [memWrite2] "=m" (outputValPtr[1]) :
     [val0] "x" (VAL0),
     [val1] "x" (VAL1),
     [permStor] "p" (outputValPtr) );
SCHEDULE_GOTOS;
asm( "dontSchedule:\n" );
This code produced the following assembly with -O2:
2348: 48 8b 4b 50 mov 0x50(%rbx),%rcx
234c: ba 01 00 00 00 mov $0x1,%edx
2351: 48 8b 41 08 mov 0x8(%rcx),%rax
2355: 0f 1f 00 nopl (%rax)
2358: 83 c2 01 add $0x1,%edx
235b: 66 0f 54 00 andpd (%rax),%xmm0
235f: 66 0f 56 48 10 orpd 0x10(%rax),%xmm1
2364: 0f b7 c2 movzwl %dx,%eax
2367: 66 0f 29 4c 24 20 movapd %xmm1,0x20(%rsp) # Why is this necessary?
236d: 66 0f 29 44 24 30 movapd %xmm0,0x30(%rsp) # Why is this necessary?
2373: 48 8b 04 c1 mov (%rcx,%rax,8),%rax
2377: 48 85 c0 test %rax,%rax
237a: 75 dc jne 2358 <TEST_LABEL+0x10>
237c: 66 0f 57 d0 xorpd %xmm0,%xmm2
2380: 66 0f 57 d9 xorpd %xmm1,%xmm3
2384: 66 0f 56 d3 orpd %xmm3,%xmm2
2388: 66 0f 38 17 d2 ptest %xmm2,%xmm2
238d: 0f 84 cf e7 ff ff je b62 <dontSchedule>
2393: 66 41 0f 7f 07 movdqa %xmm0,(%r15) # After storing here, xmm0/1 values
2398: 66 41 0f 7f 4f 10 movdqa %xmm1,0x10(%r15) # are not needed anymore.
... # my C scheduler routine here ...
0000000000000b62 <dontSchedule>:
I think I've got it now!
I'm using the intrinsics, mainly because they are easy to use and don't require leaving the "C world". The key for me was localizing the scope of my register variables. That probably should have been obvious, but I got bogged down in the details. My actual code now looks like this:
...
case SimF_AND:
{
    register __m128d VAL0 = *(cmfmLocs[0]);
    register __m128d VAL1 = *(cmfmLocs[0]+1);
    register __m128d diff0 = outputValPtr[0];
    register __m128d diff1 = outputValPtr[1];
    cfPin = 1;
    do
    {
        VAL0 = _mm_and_pd( VAL0, *(cmfmLocs[cfPin]) );
        VAL1 = _mm_or_pd( VAL1, *(cmfmLocs[cfPin]+1) );
        cfPin++;
    } while ( cmfmLocs[cfPin] );

    diff0 = _mm_or_pd( _mm_xor_pd( VAL0, diff0 ), _mm_xor_pd( VAL1, diff1 ) );
    if ( !_mm_testz_pd( diff0, diff0 ) )
    {
        outputValPtr[0] = VAL0;
        outputValPtr[1] = VAL1;
        outputValPtr[2] = _mm_xor_pd( VAL0, VAL0 );
        SCHEDULE_GOTOS;
    }
} // register variables go out of scope here
break;
...
So now it is very easy for both me and the compiler to see that these variables are not referenced after outputValPtr is updated. This produces assembly that does not reserve stack space for the locals so they don't generate any memory writes of their own anymore.
Thanks to all those who left responses. You definitely led me down the right path!
I was told long ago that there are three classes of C compilers: really dumb ones, which just don't care about the register keyword; dumb ones, which heed the keyword and reserve a few registers for it; and smart ones, which really do a much better job of shuffling values around than just keeping a value in a fixed register.
If you use GCC's inline assembly, where the values reside should be (almost) transparent. You can force arguments into specific registers by using constraints, and the compiler will make sure that gets respected.
Besides, "just working storage" isn't a good enough reason to use up a valuable register. Even in x86_64, which isn't register-starved. Writing parts of the program in assembly has a terrible cost in programmer time and portability, you better make very sure this is relevant performance wise (or the code can't be written portably in the first place).
