Spring Boot 2.3.4 + OpenJ9: Unhandled exception
Over the last few days we have been suffering a lot of JVM crashes. We had been using OpenJ9 (8 & 11) without any problems, but a few days ago we started seeing frequent crashes. Two examples from today:
Unhandled exception
Type=Segmentation error vmState=0x00000000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007FE3E0A0F0A0 Handler2=00007FE3E02FEA60 InaccessibleAddress=000055DAB1FD3000
RDI=000055DAB1FB8E5B RSI=0000000000000200 RAX=000055DAB1FD3000 RBX=00007FE3BBE38640
RCX=00007FE3DE7789E7 RDX=0000000000001402 R8=00007FE3E0AD42FF R9=0000000000000200
R10=0000000000000003 R11=0000000000000001 R12=00007FE3E0AD4128 R13=00007FE3BBE38448
R14=0000000000000001 R15=00007FE3BBE38640
RIP=00007FE3E0A9A3E0 GS=0000 FS=0000 RSP=00007FE3BBE38440
EFlags=0000000000010206 CS=0033 RBP=0000000000000002 ERR=0000000000000006
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=000055DAB1FD3000
xmm0 0114b4d70025b62a (f: 2471466.000000, d: 1.887162e-303)
xmm1 b54c590116b75901 (f: 381114624.000000, d: -5.919270e-52)
xmm2 590115d82a0010c7 (f: 704647360.000000, d: 5.514824e+120)
xmm3 b900e70016b4d72c (f: 380950304.000000, d: -4.069092e-34)
xmm4 0999011bb62b000a (f: 3056271360.000000, d: 1.985176e-262)
xmm5 b40011b4d7570119 (f: 3612803328.000000, d: -3.199957e-58)
xmm6 b4d72c0117b54d59 (f: 397757792.000000, d: -3.780091e-54)
xmm7 0010c72c4d0117b4 (f: 1291917184.000000, d: 2.333271e-308)
xmm8 2c03590019bd0500 (f: 431817984.000000, d: 1.132243e-96)
xmm9 0000000041000000 (f: 1090519040.000000, d: 5.387880e-315)
xmm10 000000003fa00000 (f: 1067450368.000000, d: 5.273906e-315)
xmm11 40d70eab41ea1e65 (f: 1105862272.000000, d: 2.361068e+04)
xmm12 000000003df950b9 (f: 1039749312.000000, d: 5.137044e-315)
xmm13 00000000464fe674 (f: 1179641472.000000, d: 5.828203e-315)
xmm14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm15 40fc85cbcbe05a39 (f: 3420477952.000000, d: 1.168287e+05)
Module=/usr/lib/jvm/java-11/lib/compressedrefs/libj9vm29.so
Module_base_address=00007FE3E097B000
Target=2_90_20200715_697 (Linux 4.18.0-147.13.2.el8_1.x86_64)
CPU=amd64 (4 logical CPUs) (0x7c2a8c000 RAM)
----------- Stack Backtrace -----------
(0x00007FE3E0A9A3E0 [libj9vm29.so+0x11f3e0])
(0x00007FE3E0A9BDB3 [libj9vm29.so+0x120db3])
(0x00007FE3E0A9C9FB [libj9vm29.so+0x1219fb])
(0x00007FE3E0A9CC97 [libj9vm29.so+0x121c97])
(0x00007FE3E0A9CDBF [libj9vm29.so+0x121dbf])
(0x00007FE3E0A78AD2 [libj9vm29.so+0xfdad2])
(0x00007FE3E0A74330 [libj9vm29.so+0xf9330])
(0x00007FE3D96A417D [libj9jvmti29.so+0x917d])
(0x00007FE35ADC3C9B [libinstrument.so+0x4c9b])
(0x00007FE3C16D7F68 [<unknown>+0x0])
---------------------------------------
Or this longer one:
Unhandled exception
Type=Segmentation error vmState=0x00000000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000002
Handler1=00007F70176C30A0 Handler2=00007F7016FB2A60 InaccessibleAddress=00007F6F80319000
RDI=00007F6F801BC90B RSI=00007F6F80318FFE RAX=0000000000000F00 RBX=00007F6FF00DE640
RCX=0000000000000000 RDX=00007F70121BFBFF R8=00007F70177882FF R9=0000000000000200
R10=0000000000000003 R11=0000000000000001 R12=00007F7017788128 R13=00007F6FF00DE448
R14=0000000000000001 R15=00007F6FF00DE640
RIP=00007F701774E410 GS=0000 FS=0000 RSP=00007F6FF00DE440
EFlags=0000000000010206 CS=0033 RBP=0000000000000002 ERR=0000000000000006
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=00007F6F80319000
malloc(): memory corruption
Unhandled exception
Type=Segmentation error vmState=0x00000000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007F70176C30A0 Handler2=00007F7016FB2A60 InaccessibleAddress=0000000000000004
RDI=0000000000000000 RSI=0000000000000000 RAX=0000000000000001 RBX=00007F70185FF531
RCX=0000000000000000 RDX=0000000000000B00 R8=000000003A2DC13B R9=0000000000000004
R10=00000000ED5C6C80 R11=00007F7018D71EA0 R12=00007F70185FF518 R13=0000000000000000
R14=00007F701003E0A0 R15=00007F701003E0A0
RIP=00007F7014BF960B GS=0000 FS=0000 RSP=00007F70185FF460
EFlags=0000000000010246 CS=0033 RBP=00007F70185FF508 ERR=0000000000000004
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=0000000000000004
xmm0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm1 69676e652f6c7173 (f: 795636096.000000, d: 5.604828e+199)
xmm2 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm4 43e0000000000000 (f: 0.000000, d: 9.223372e+18)
xmm5 000000003dd58307 (f: 1037402880.000000, d: 5.125451e-315)
xmm6 000000004a09b01c (f: 1242148864.000000, d: 6.137031e-315)
xmm7 0000000000000005 (f: 5.000000, d: 2.470328e-323)
xmm8 010000000e000d00 (f: 234884352.000000, d: 7.291122e-304)
xmm9 000000004901d000 (f: 1224855552.000000, d: 6.051590e-315)
xmm10 000000003f800000 (f: 1065353216.000000, d: 5.263544e-315)
xmm11 415a267974e94acc (f: 1961446144.000000, d: 6.855142e+06)
xmm12 0000000040c481b4 (f: 1086620032.000000, d: 5.368617e-315)
xmm13 000000004937cbec (f: 1228393472.000000, d: 6.069070e-315)
xmm14 000000003e800000 (f: 1048576000.000000, d: 5.180654e-315)
xmm15 3fef8cd2d486b2fc (f: 3565597440.000000, d: 9.859404e-01)
Module=/usr/lib/jvm/java-11/lib/compressedrefs/libj9gc29.so
Module_base_address=00007F7014BC3000
Target=2_90_20200715_697 (Linux 4.18.0-147.13.2.el8_1.x86_64)
CPU=amd64 (4 logical CPUs) (0x7c2a8c000 RAM)
----------- Stack Backtrace -----------
(0x00007F7014BF960B [libj9gc29.so+0x3660b])
(0x00007F7014CDAF54 [libj9gc29.so+0x117f54])
(0x00007F7014CDBC37 [libj9gc29.so+0x118c37])
(0x00007F7014BFA60E [libj9gc29.so+0x3760e])
(0x00007F7014BFA64E [libj9gc29.so+0x3764e])
(0x00007F7014BFB585 [libj9gc29.so+0x38585])
(0x00007F70141E8780 [libjclse29.so+0x32780])
(0x00007F70141EABD1 [libjclse29.so+0x34bd1])
(0x00007F70141EADEE [libjclse29.so+0x34dee])
(0x00007F6FF8345FBB [<unknown>+0x0])
---------------------------------------
Unhandled exception
Type=Segmentation error vmState=0x00030000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007F70176C30A0 Handler2=00007F7016FB2A60 InaccessibleAddress=FFFFFFFFFFFFFFF8
RDI=FFFFFFFFFFFFFFF0 RSI=0000000000F82568 RAX=0000000000FBDC88 RBX=00007F6FF0260D40
RCX=0000000000000000 RDX=00000000000A5C01 R8=0000000000000000 R9=0000000000000000
R10=0000000004200000 R11=0000000000000000 R12=0000000000000000 R13=0000000000AE8400
R14=0000000004200000 R15=0000000000000000
RIP=00007F70176F3F40 GS=0000 FS=0000 RSP=00007F6FF0260CA0
EFlags=0000000000010246 CS=0033 RBP=0000000000000000 ERR=0000000000000005
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=FFFFFFFFFFFFFFF8
xmm0 0000003000000020 (f: 32.000000, d: 1.018558e-312)
xmm1 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm2 00000000fe983e08 (f: 4271390208.000000, d: 2.110347e-314)
xmm3 0000000000000001 (f: 1.000000, d: 4.940656e-324)
xmm4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm5 0000000000f825da (f: 16262618.000000, d: 8.034801e-317)
xmm6 00000000fe983e08 (f: 4271390208.000000, d: 2.110347e-314)
xmm7 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm10 0000002000000020 (f: 32.000000, d: 6.790387e-313)
xmm11 0000000049d70a38 (f: 1238829568.000000, d: 6.120632e-315)
xmm12 000000004689a022 (f: 1183424512.000000, d: 5.846894e-315)
xmm13 0000000047ac082f (f: 1202456576.000000, d: 5.940925e-315)
xmm14 0000000048650dc0 (f: 1214582272.000000, d: 6.000833e-315)
xmm15 0000000046b73e38 (f: 1186414080.000000, d: 5.861665e-315)
Module=/usr/lib/jvm/java-11/lib/compressedrefs/libj9vm29.so
Module_base_address=00007F701762F000
Target=2_90_20200715_697 (Linux 4.18.0-147.13.2.el8_1.x86_64)
CPU=amd64 (4 logical CPUs) (0x7c2a8c000 RAM)
----------- Stack Backtrace -----------
(0x00007F70176F3F40 [libj9vm29.so+0xc4f40])
(0x00007F70176C3466 [libj9vm29.so+0x94466])
(0x00007F70176C3E51 [libj9vm29.so+0x94e51])
(0x00007F7017643B2C [libj9vm29.so+0x14b2c])
(0x00007F7017641B60 [libj9vm29.so+0x12b60])
(0x00007F70176FEC52 [libj9vm29.so+0xcfc52])
---------------------------------------
#0: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x89e995) [0x7f7015944995]
#1: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x8a9390) [0x7f701594f390]
#2: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x1615ce) [0x7f70152075ce]
#3: /usr/lib/jvm/java-11/lib/compressedrefs/libj9prt29.so(+0x1ac8a) [0x7f7016fb2c8a]
#4: /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0) [0x7f70193cd8a0]
#5: /usr/lib/jvm/java-11/lib/compressedrefs/libj9vm29.so(+0x6519c) [0x7f701769419c]
#6: /usr/lib/jvm/java-11/lib/compressedrefs/libj9vm29.so(+0x13c2c7) [0x7f701776b2c7]
#7: /usr/lib/jvm/java-11/lib/compressedrefs/libj9vm29.so(+0x658f3) [0x7f70176948f3]
#8: /usr/lib/jvm/java-11/lib/compressedrefs/libj9vm29.so(+0x8473e) [0x7f70176b373e]
#9: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x9465c9) [0x7f70159ec5c9]
#10: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x1c5c00) [0x7f701526bc00]
#11: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x152b55) [0x7f70151f8b55]
#12: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x1f52e2) [0x7f701529b2e2]
#13: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x1544ad) [0x7f70151fa4ad]
#14: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x2134f5) [0x7f70152b94f5]
#15: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x220370) [0x7f70152c6370]
#16: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x203932) [0x7f70152a9932]
#17: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x205686) [0x7f70152ab686]
#18: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x562d09) [0x7f7015608d09]
#19: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x505a5c) [0x7f70155aba5c]
#20: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x173d27) [0x7f7015219d27]
#21: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x174c71) [0x7f701521ac71]
#22: /usr/lib/jvm/java-11/lib/compressedrefs/libj9prt29.so(+0x1b7c3) [0x7f7016fb37c3]
#23: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x176a75) [0x7f701521ca75]
#24: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x177028) [0x7f701521d028]
#25: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x17292b) [0x7f701521892b]
#26: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x172df2) [0x7f7015218df2]
#27: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x172e9a) [0x7f7015218e9a]
#28: /usr/lib/jvm/java-11/lib/compressedrefs/libj9prt29.so(+0x1b7c3) [0x7f7016fb37c3]
#29: /usr/lib/jvm/java-11/lib/compressedrefs/libj9jit29.so(+0x1732f4) [0x7f70152192f4]
...
JVM details
openjdk version "11.0.8" 2020-07-14
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.8+10)
Eclipse OpenJ9 VM AdoptOpenJDK (build openj9-0.21.0, JRE 11 Linux amd64-64-Bit Compressed References 20200715_697 (JIT enabled, AOT enabled)
OpenJ9 - 34cf4c075
OMR - 113e54219
JCL - 95bb504fbb based on jdk-11.0.8+10)
We deploy our apps on OpenShift 4, and these crashes have appeared (we don't know whether it's a coincidence) in apps that were updated to Spring Boot 2.3.4.
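For what it's worth, the first backtrace runs through libinstrument.so / libj9jvmti29.so (the -javaagent / JVMTI path), while the later ones go through the GC and JIT libraries. A sketch of how the subsystem could be bisected with standard OpenJ9 switches (app.jar is just a placeholder for the Spring Boot artifact; this narrows the area down, it does not identify the bug):

    java -Xshareclasses:none -jar app.jar   # rule out the shared classes cache
    java -Xnoaot -jar app.jar               # disable AOT-compiled code, keep the JIT
    java -Xint -jar app.jar                 # interpreter only (slow); rules out JIT and AOT
    # also try running without any -javaagent to take the JVMTI path out of play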
Thanks
Related
Vmovntpd instruction on Intel Xeon Platinum 8168 CPU
I have a simple vector-vector addition algorithm implemented in assembly. It uses AVX to read 4 doubles from the A vector and 4 doubles from the B vector. The algorithm adds these numbers and writes the result back to the C vector. If I use vmovntpd to write back the result, the performance becomes extremely random. I ran this test on an Azure server with an Intel Xeon Platinum 8168 CPU. If I run this test on my laptop (Intel Core i7-2640M CPU), the random effect disappears. What is the problem on the server? One more piece of information: the server has 44 CPUs. [Edit] Here is my code:

    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    ;; Dense to dense
    ;; Without cache (for storing the result)
    ;; AVX-512
    ;; Without tolerances
    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    global _denseToDenseAddAVX512_nocache_64_linux

    _denseToDenseAddAVX512_nocache_64_linux:
        push rbp
        mov rbp, rsp
        ; c = a + lambda * b
        ; rdi: address1
        ; rsi: address2
        ; rdx: address3
        ; rcx: count
        ; xmm0: lambda
        mov rax, rcx
        shr rcx, 4
        and rax, 0x0F
        vzeroupper
        vmovupd zmm5, [abs_mask]
        sub rsp, 8
        movlpd [rbp - 8], xmm0
        vbroadcastsd zmm7, [rbp - 8]
        vmovapd zmm6, zmm7
        cmp rcx, 0
        je after_loop_denseToDenseAddAVX512_nocache_64_linux

    start_denseToDenseAddAVX512_nocache_64_linux:
        vmovapd zmm0, [rdi]             ; a
        vmovapd zmm1, zmm7
        vmulpd zmm1, zmm1, [rsi]        ; b
        vaddpd zmm0, zmm0, zmm1         ; zmm0 = c = a + b
        vmovntpd [rdx], zmm0
        vmovapd zmm2, [rdi + 64]        ; a
        vmovapd zmm3, zmm6
        vmulpd zmm3, zmm3, [rsi + 64]   ; b
        vaddpd zmm2, zmm2, zmm3         ; zmm2 = c = a + b
        vmovntpd [rdx + 64], zmm2
        add rdi, 128
        add rsi, 128
        add rdx, 128
        loop start_denseToDenseAddAVX512_nocache_64_linux

    after_loop_denseToDenseAddAVX512_nocache_64_linux:
        cmp rax, 0
        je end_denseToDenseAddAVX512_nocache_64_linux
        mov rcx, rax

    last_loop_denseToDenseAddAVX512_nocache_64_linux:
        movlpd xmm0, [rdi]   ; a
        movapd xmm1, xmm7
        mulsd xmm1, [rsi]    ; b
        addsd xmm0, xmm1     ; xmm0 = c = a + b
        movlpd [rdx], xmm0
        add rdi, 8
        add rsi, 8
        add rdx, 8
        loop last_loop_denseToDenseAddAVX512_nocache_64_linux

    end_denseToDenseAddAVX512_nocache_64_linux:
        mov rsp, rbp
        pop rbp
        ret
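One detail worth noting while reading the code above (it does not explain the randomness by itself): non-temporal stores are weakly ordered, so the customary pattern is a single sfence after the NT-store loop. A minimal intrinsics sketch of the same kernel, assuming AVX-512F, 64-byte-aligned pointers, and n a multiple of 8 (tail handling omitted; the function name is hypothetical):

    #include <immintrin.h>

    /* c = a + lambda * b using non-temporal stores, as in the asm above. */
    void dense_add_nt(const double *a, const double *b, double *c,
                      size_t n, double lambda) {
        const __m512d vl = _mm512_set1_pd(lambda);
        for (size_t i = 0; i + 8 <= n; i += 8) {
            __m512d va = _mm512_load_pd(a + i);
            __m512d vb = _mm512_load_pd(b + i);
            /* vmovntpd: bypass the cache for the result vector */
            _mm512_stream_pd(c + i, _mm512_add_pd(va, _mm512_mul_pd(vl, vb)));
        }
        _mm_sfence();  /* order the weakly-ordered NT stores before later stores */
    }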
Okay, I've found the solution! This is a NUMA architecture with 44 CPUs, so I disabled NUMA and limited the number of online CPUs to 1 with the following kernel parameters: numa=off maxcpus=1 nr_cpus=1.
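A less drastic alternative than rebooting with numa=off, assuming the variance really comes from NUMA memory placement: pin both execution and allocation to a single node with numactl, which needs no reboot (the binary name is a placeholder):

    numactl --cpunodebind=0 --membind=0 ./vector_add_bench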
AVX mat4 inv implementation is slower than SSE
I implemented a 4x4 matrix inverse in SSE2 and AVX. Both are faster than the plain implementation, but if AVX is enabled (-mavx) then the SSE2 implementation runs faster than my manual AVX implementation. It seems the compiler makes my SSE2 implementation more AVX-friendly :( In my AVX implementation there are fewer multiplications and fewer additions, so I expected AVX to be faster than SSE. Maybe some instructions like _mm256_permute2f128_ps, _mm256_permutevar_ps/_mm256_permute_ps make AVX slower? I'm not trying to load SSE/XMM registers into AVX/YMM registers. How can I make my AVX implementation faster than SSE?

My CPU: Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz (Ivy Bridge)

    Plain with -O3      : 0.045853 secs
    SSE2  with -O3      : 0.026021 secs
    SSE2  with -O3 -mavx: 0.024336 secs
    AVX1  with -O3 -mavx: 0.031798 secs

Updated (see bottom of question), all with -O3 -mavx flags:

    AVX1 (reduced div)  : 0.027666 secs
    AVX1 (using rcp_ps) : 0.023205 secs
    SSE2 (using rcp_ps) : 0.021969 secs

Initial matrix:

    Matrix (float4x4):
    |0.0714  -0.6589  0.7488  2.0000|
    |0.9446   0.2857  0.1613  4.0000|
    |-0.3202  0.6958  0.6429  6.0000|
    |0.0000   0.0000  0.0000  1.0000|

Test code:

    start = clock();
    for (int i = 0; i < 1000000; i++) {
        glm_mat4_inv_sse2(m, m);
        /* glm_mat4_inv_avx(m, m); */
        /* glm_mat4_inv(m, m);     */
    }
    end = clock();
    total = (float)(end - start) / CLOCKS_PER_SEC;
    printf("%f secs\n\n", total);

Implementations:

Library: http://github.com/recp/cglm
SSE Impl: https://gist.github.com/recp/690025c955c2e69a91e3a60a13768dee
AVX Impl: https://gist.github.com/recp/8ccc5ad0d19f5516de55f9bf7b5045b2

SSE2 implementation output (using godbolt; options -O3):

    glm_mat4_inv_sse2:
        movaps  xmm8, XMMWORD PTR [rdi+32]
        movaps  xmm2, XMMWORD PTR [rdi+16]
        movaps  xmm5, XMMWORD PTR [rdi+48]
        movaps  xmm6, XMMWORD PTR [rdi]
        movaps  xmm4, xmm8
        movaps  xmm13, xmm8
        movaps  xmm11, xmm8
        shufps  xmm11, xmm2, 170
        shufps  xmm4, xmm5, 238
        movaps  xmm3, xmm11
        movaps  xmm1, xmm8
        pshufd  xmm12, xmm4, 127
        shufps  xmm13, xmm2, 255
        movaps  xmm0, xmm13
        movaps  xmm9, xmm8
        pshufd  xmm4, xmm4, 42
        shufps  xmm9, xmm2, 85
        shufps  xmm1, xmm5, 153
        movaps  xmm7, xmm9
        mulps   xmm0, xmm4
        pshufd  xmm10, xmm1, 42
        movaps  xmm1, xmm11
        shufps  xmm5, xmm8, 0
        mulps   xmm3, xmm12
        pshufd  xmm5, xmm5, 128
        mulps   xmm7, xmm12
        mulps   xmm1, xmm10
        subps   xmm3, xmm0
        movaps  xmm0, xmm13
        mulps   xmm0, xmm10
        mulps   xmm13, xmm5
        subps   xmm7, xmm0
        movaps  xmm0, xmm9
        mulps   xmm0, xmm4
        subps   xmm0, xmm1
        movaps  xmm1, xmm8
        movaps  xmm8, xmm11
        shufps  xmm1, xmm2, 0
        mulps   xmm8, xmm5
        movaps  xmm11, xmm7
        mulps   xmm4, xmm1
        mulps   xmm5, xmm9
        movaps  xmm9, xmm2
        mulps   xmm12, xmm1
        shufps  xmm9, xmm6, 85
        pshufd  xmm9, xmm9, 168
        mulps   xmm1, xmm10
        movaps  xmm10, xmm2
        shufps  xmm10, xmm6, 0
        pshufd  xmm10, xmm10, 168
        subps   xmm4, xmm8
        mulps   xmm7, xmm10
        movaps  xmm8, xmm2
        shufps  xmm2, xmm6, 255
        shufps  xmm8, xmm6, 170
        pshufd  xmm8, xmm8, 168
        pshufd  xmm2, xmm2, 168
        mulps   xmm11, xmm8
        subps   xmm12, xmm13
        movaps  xmm13, XMMWORD PTR .LC0[rip]
        subps   xmm1, xmm5
        movaps  xmm5, xmm3
        mulps   xmm5, xmm9
        mulps   xmm3, xmm10
        subps   xmm5, xmm11
        movaps  xmm11, xmm0
        mulps   xmm11, xmm2
        mulps   xmm0, xmm10
        addps   xmm5, xmm11
        movaps  xmm11, xmm12
        mulps   xmm11, xmm8
        mulps   xmm12, xmm9
        xorps   xmm5, xmm13
        subps   xmm3, xmm11
        movaps  xmm11, xmm4
        mulps   xmm4, xmm9
        subps   xmm7, xmm12
        mulps   xmm11, xmm2
        mulps   xmm2, xmm1
        mulps   xmm1, xmm8
        subps   xmm0, xmm4
        addps   xmm3, xmm11
        movaps  xmm11, XMMWORD PTR .LC1[rip]
        addps   xmm2, xmm7
        addps   xmm0, xmm1
        movaps  xmm1, xmm5
        xorps   xmm3, xmm11
        xorps   xmm2, xmm13
        shufps  xmm1, xmm3, 0
        xorps   xmm0, xmm11
        movaps  xmm4, xmm2
        shufps  xmm4, xmm0, 0
        shufps  xmm1, xmm4, 136
        mulps   xmm1, xmm6
        pshufd  xmm4, xmm1, 27
        addps   xmm1, xmm4
        pshufd  xmm4, xmm1, 65
        addps   xmm1, xmm4
        movaps  xmm4, XMMWORD PTR .LC2[rip]
        divps   xmm4, xmm1
        mulps   xmm5, xmm4
        mulps   xmm3, xmm4
        mulps   xmm2, xmm4
        mulps   xmm0, xmm4
        movaps  XMMWORD PTR [rsi], xmm5
        movaps  XMMWORD PTR [rsi+16], xmm3
        movaps  XMMWORD PTR [rsi+32], xmm2
        movaps  XMMWORD PTR [rsi+48], xmm0
        ret
    .LC0:
        .long 0, 2147483648, 0, 2147483648
    .LC1:
        .long 2147483648, 0, 2147483648, 0
    .LC2:
        .long 1065353216, 1065353216, 1065353216, 1065353216

SSE2 implementation (AVX enabled) output (using godbolt; options -O3 -mavx):

    glm_mat4_inv_sse2:
        vmovaps xmm9, XMMWORD PTR [rdi+32]
        vmovaps xmm6, XMMWORD PTR [rdi+48]
        vmovaps xmm2, XMMWORD PTR [rdi+16]
        vmovaps xmm7, XMMWORD PTR [rdi]
        vshufps xmm5, xmm9, xmm6, 238
        vpshufd xmm13, xmm5, 127
        vpshufd xmm5, xmm5, 42
        vshufps xmm1, xmm9, xmm6, 153
        vshufps xmm11, xmm9, xmm2, 170
        vshufps xmm12, xmm9, xmm2, 255
        vmulps  xmm3, xmm11, xmm13
        vpshufd xmm1, xmm1, 42
        vmulps  xmm0, xmm12, xmm5
        vshufps xmm10, xmm9, xmm2, 85
        vshufps xmm6, xmm6, xmm9, 0
        vpshufd xmm6, xmm6, 128
        vmulps  xmm8, xmm10, xmm13
        vmulps  xmm4, xmm10, xmm5
        vsubps  xmm3, xmm3, xmm0
        vmulps  xmm0, xmm12, xmm1
        vsubps  xmm8, xmm8, xmm0
        vmulps  xmm0, xmm11, xmm1
        vsubps  xmm4, xmm4, xmm0
        vshufps xmm0, xmm9, xmm2, 0
        vmulps  xmm9, xmm12, xmm6
        vmulps  xmm13, xmm0, xmm13
        vmulps  xmm5, xmm0, xmm5
        vmulps  xmm0, xmm0, xmm1
        vsubps  xmm12, xmm13, xmm9
        vmulps  xmm9, xmm11, xmm6
        vmovaps xmm13, XMMWORD PTR .LC0[rip]
        vmulps  xmm6, xmm10, xmm6
        vshufps xmm10, xmm2, xmm7, 85
        vpshufd xmm10, xmm10, 168
        vsubps  xmm5, xmm5, xmm9
        vshufps xmm9, xmm2, xmm7, 170
        vpshufd xmm9, xmm9, 168
        vsubps  xmm1, xmm0, xmm6
        vmulps  xmm11, xmm8, xmm9
        vshufps xmm0, xmm2, xmm7, 0
        vshufps xmm2, xmm2, xmm7, 255
        vmulps  xmm6, xmm3, xmm10
        vpshufd xmm2, xmm2, 168
        vpshufd xmm0, xmm0, 168
        vmulps  xmm3, xmm3, xmm0
        vmulps  xmm8, xmm8, xmm0
        vmulps  xmm0, xmm4, xmm0
        vsubps  xmm6, xmm6, xmm11
        vmulps  xmm11, xmm4, xmm2
        vaddps  xmm6, xmm6, xmm11
        vmulps  xmm11, xmm12, xmm9
        vmulps  xmm12, xmm12, xmm10
        vxorps  xmm6, xmm6, xmm13
        vsubps  xmm3, xmm3, xmm11
        vmulps  xmm11, xmm5, xmm2
        vmulps  xmm5, xmm5, xmm10
        vsubps  xmm8, xmm8, xmm12
        vmulps  xmm2, xmm1, xmm2
        vmulps  xmm1, xmm1, xmm9
        vaddps  xmm3, xmm3, xmm11
        vmovaps xmm11, XMMWORD PTR .LC1[rip]
        vsubps  xmm0, xmm0, xmm5
        vaddps  xmm2, xmm8, xmm2
        vxorps  xmm3, xmm3, xmm11
        vaddps  xmm0, xmm0, xmm1
        vshufps xmm1, xmm6, xmm3, 0
        vxorps  xmm2, xmm2, xmm13
        vxorps  xmm0, xmm0, xmm11
        vshufps xmm4, xmm2, xmm0, 0
        vshufps xmm1, xmm1, xmm4, 136
        vmulps  xmm1, xmm1, xmm7
        vpshufd xmm4, xmm1, 27
        vaddps  xmm1, xmm1, xmm4
        vpshufd xmm4, xmm1, 65
        vaddps  xmm1, xmm1, xmm4
        vmovaps xmm4, XMMWORD PTR .LC2[rip]
        vdivps  xmm1, xmm4, xmm1
        vmulps  xmm6, xmm6, xmm1
        vmulps  xmm3, xmm3, xmm1
        vmulps  xmm2, xmm2, xmm1
        vmulps  xmm1, xmm0, xmm1
        vmovaps XMMWORD PTR [rsi], xmm6
        vmovaps XMMWORD PTR [rsi+16], xmm3
        vmovaps XMMWORD PTR [rsi+32], xmm2
        vmovaps XMMWORD PTR [rsi+48], xmm1
        ret
    .LC0:
        .long 0, 2147483648, 0, 2147483648
    .LC1:
        .long 2147483648, 0, 2147483648, 0
    .LC2:
        .long 1065353216, 1065353216, 1065353216, 1065353216

AVX implementation output (using godbolt; options -O3 -mavx):

    glm_mat4_inv_avx:
        vmovaps ymm3, YMMWORD PTR [rdi]
        vmovaps ymm1, YMMWORD PTR [rdi+32]
        vmovdqa ymm2, YMMWORD PTR .LC1[rip]
        vmovdqa ymm0, YMMWORD PTR .LC0[rip]
        vperm2f128 ymm6, ymm3, ymm3, 3
        vperm2f128 ymm5, ymm1, ymm1, 0
        vperm2f128 ymm1, ymm1, ymm1, 17
        vmovdqa ymm10, YMMWORD PTR .LC4[rip]
        vpermilps ymm9, ymm5, ymm0
        vpermilps ymm7, ymm1, ymm2
        vperm2f128 ymm8, ymm6, ymm6, 0
        vpermilps ymm1, ymm1, ymm0
        vpermilps ymm5, ymm5, ymm2
        vpermilps ymm0, ymm8, ymm0
        vmulps  ymm4, ymm7, ymm9
        vpermilps ymm8, ymm8, ymm2
        vpermilps ymm11, ymm6, 1
        vmulps  ymm2, ymm5, ymm1
        vmulps  ymm7, ymm0, ymm7
        vmulps  ymm1, ymm8, ymm1
        vmulps  ymm0, ymm0, ymm5
        vmulps  ymm5, ymm8, ymm9
        vmovdqa ymm9, YMMWORD PTR .LC3[rip]
        vmovdqa ymm8, YMMWORD PTR .LC2[rip]
        vsubps  ymm4, ymm4, ymm2
        vsubps  ymm7, ymm7, ymm1
        vperm2f128 ymm2, ymm4, ymm4, 0
        vperm2f128 ymm4, ymm4, ymm4, 17
        vshufps ymm1, ymm2, ymm4, 77
        vpermilps ymm1, ymm1, ymm9
        vsubps  ymm5, ymm0, ymm5
        vpermilps ymm0, ymm2, ymm8
        vmulps  ymm0, ymm0, ymm11
        vperm2f128 ymm1, ymm1, ymm2, 0
        vshufps ymm2, ymm2, ymm4, 74
        vpermilps ymm4, ymm6, 90
        vmulps  ymm1, ymm1, ymm4
        vpermilps ymm2, ymm2, ymm10
        vpermilps ymm6, ymm6, 191
        vmovaps ymm11, YMMWORD PTR .LC5[rip]
        vperm2f128 ymm2, ymm2, ymm2, 0
        vperm2f128 ymm4, ymm3, ymm3, 0
        vpermilps ymm12, ymm4, YMMWORD PTR .LC7[rip]
        vmulps  ymm2, ymm2, ymm6
        vinsertf128 ymm6, ymm7, xmm5, 1
        vperm2f128 ymm5, ymm7, ymm5, 49
        vshufps ymm7, ymm6, ymm5, 77
        vpermilps ymm9, ymm7, ymm9
        vsubps  ymm0, ymm0, ymm1
        vpermilps ymm1, ymm4, YMMWORD PTR .LC6[rip]
        vpermilps ymm4, ymm4, YMMWORD PTR .LC8[rip]
        vaddps  ymm2, ymm0, ymm2
        vpermilps ymm0, ymm6, ymm8
        vshufps ymm6, ymm6, ymm5, 74
        vpermilps ymm6, ymm6, ymm10
        vmulps  ymm1, ymm1, ymm0
        vmulps  ymm0, ymm12, ymm9
        vmulps  ymm6, ymm4, ymm6
        vxorps  ymm2, ymm2, ymm11
        vdpps   ymm3, ymm3, ymm2, 255
        vsubps  ymm0, ymm1, ymm0
        vdivps  ymm2, ymm2, ymm3
        vaddps  ymm0, ymm0, ymm6
        vxorps  ymm0, ymm0, ymm11
        vdivps  ymm0, ymm0, ymm3
        vperm2f128 ymm5, ymm2, ymm2, 3
        vshufps ymm1, ymm2, ymm5, 68
        vshufps ymm2, ymm2, ymm5, 238
        vperm2f128 ymm4, ymm0, ymm0, 3
        vshufps ymm6, ymm0, ymm4, 68
        vshufps ymm0, ymm0, ymm4, 238
        vshufps ymm3, ymm1, ymm6, 136
        vshufps ymm1, ymm1, ymm6, 221
        vinsertf128 ymm1, ymm3, xmm1, 1
        vshufps ymm3, ymm2, ymm0, 136
        vshufps ymm0, ymm2, ymm0, 221
        vinsertf128 ymm0, ymm3, xmm0, 1
        vmovaps YMMWORD PTR [rsi], ymm1
        vmovaps YMMWORD PTR [rsi+32], ymm0
        vzeroupper
        ret
    .LC0:
        .long 2, 1, 1, 0, 0, 0, 0, 0
    .LC1:
        .long 3, 3, 2, 3, 2, 1, 1, 1
    .LC2:
        .long 0, 0, 1, 2, 0, 0, 1, 2
    .LC3:
        .long 0, 1, 1, 2, 0, 1, 1, 2
    .LC4:
        .long 0, 2, 3, 3, 0, 2, 3, 3
    .LC5:
        .long 0, 2147483648, 0, 2147483648, 2147483648, 0, 2147483648, 0
    .LC6:
        .long 1, 0, 0, 0, 1, 0, 0, 0
    .LC7:
        .long 2, 2, 1, 1, 2, 2, 1, 1
    .LC8:
        .long 3, 3, 3, 2, 3, 3, 3, 2

EDIT: I'm using Xcode (Version 10.0 (10A255)) on macOS (on a MacBook Pro (Retina, Mid 2012) 15') to build and run the tests with the -O3 optimization option. It compiles the test code with clang. I used GCC 8.2 in godbolt to view the asm (sorry for this), but the assembly output seems similar. I had enabled shufd via the cglm option CGLM_USE_INT_DOMAIN and forgot to disable it when viewing the asm.

    #ifdef CGLM_USE_INT_DOMAIN
    #  define glmm_shuff1(xmm, z, y, x, w)                                   \
         _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(xmm),           \
                                            _MM_SHUFFLE(z, y, x, w)))
    #else
    #  define glmm_shuff1(xmm, z, y, x, w)                                   \
         _mm_shuffle_ps(xmm, xmm, _MM_SHUFFLE(z, y, x, w))
    #endif

Whole test code (except headers):

    #include <cglm/cglm.h>
    #include <sys/time.h>
    #include <time.h>

    int main(int argc, const char * argv[]) {
        CGLM_ALIGN(32) mat4 m = GLM_MAT4_IDENTITY_INIT;
        double start, end, total;

        /* generate invertible matrix */
        glm_translate(m, (vec3){1,2,3});
        glm_rotate(m, M_PI_2, (vec3){1,2,3});
        glm_translate(m, (vec3){1,2,3});

        glm_mat4_print(m, stderr);

        start = clock();
        for (int i = 0; i < 1000000; i++) {
            glm_mat4_inv_sse2(m, m);
            /* glm_mat4_inv_avx(m, m); */
            /* glm_mat4_inv(m, m);     */
        }
        end = clock();

        total = (float)(end - start) / CLOCKS_PER_SEC;
        printf("%f secs\n\n", total);

        glm_mat4_print(m, stderr);
    }

EDIT 2: I have reduced one division by using multiplication (1 set_ps + 1 div_ps + 2 mul_ps seems better than 2 div_ps):

Old version:

    r1 = _mm256_div_ps(r1, y4);
    r2 = _mm256_div_ps(r2, y4);

New version (the SSE2 version used division like this):

    y5 = _mm256_div_ps(_mm256_set1_ps(1.0f), y4);
    r1 = _mm256_mul_ps(r1, y5);
    r2 = _mm256_mul_ps(r2, y5);

New version (fast version):

    y5 = _mm256_rcp_ps(y4);
    r1 = _mm256_mul_ps(r1, y5);
    r2 = _mm256_mul_ps(r2, y5);

Now it is better than before, but still not faster than SSE on the Ivy Bridge CPU. I have updated the test results.
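Since _mm256_rcp_ps is only accurate to roughly 12 bits, a common refinement (not in the code above; a sketch) is one Newton-Raphson step, which restores near-full single precision at the cost of a subtract and two multiplies:

    #include <immintrin.h>

    /* One Newton-Raphson iteration for x ~= 1/d:  x' = x * (2 - d*x).
       Sketch only; d would be the determinant vector (y4) above. */
    static inline __m256 rcp_nr(__m256 d) {
        __m256 x = _mm256_rcp_ps(d);                    /* ~12-bit estimate   */
        __m256 t = _mm256_sub_ps(_mm256_set1_ps(2.0f),
                                 _mm256_mul_ps(d, x));  /* 2 - d*x            */
        return _mm256_mul_ps(x, t);                     /* refined reciprocal */
    }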
Your CPU is an Intel IvyBridge. Sandybridge / IvyBridge has 1-per-clock mul and add throughput, on different ports so they don't compete with each other. But it has only 1-per-clock shuffle throughput for 256-bit shuffles, and for all FP shuffles (even 128-bit shufps). However, it has 2-per-clock throughput for integer shuffles, and I notice your compiler is using pshufd as a copy-and-shuffle between FP instructions. This is a solid win when compiling for SSE2, especially where the VEX encoding isn't available (so it's saving a movaps by replacing movaps xmm0, xmm1 / shufps xmm0, xmm0, 65 or whatever). Your compiler is doing this even when AVX is available so it could have used vshufps xmm0, xmm1, xmm1, 65, but it's either cleverly choosing vpshufd for microarchitectural reasons, or it got lucky, or its heuristics / instruction cost model were designed with this in mind. (I suspect it was clang, but you didn't say in the question or show the C source you compiled from.)

In Haswell and later (which support AVX2 and thus 256-bit versions of every integer shuffle), all shuffles can only run on port 5. But on IvB, where only AVX1 is supported, it's only FP shuffles that go up to 256 bits. Integer shuffles are always only 128 bits and can run on port 1 or port 5, because there are 128-bit shuffle execution units on both of those ports. (https://agner.org/optimize/)

I haven't looked at the asm in a ton of detail because it's long, but if it costs you more shuffles to save on adds / multiplies by using wider vectors, that would be slower. As well as because all your shuffles become FP shuffles, so they only run on port 5, not taking advantage of port 1. I suspect there's so much shuffling that it's a bottleneck vs. port 0 (FP multiply) or port 1 (FP add). BTW, Haswell and later have two FMA units, one each on p0 and p1, so multiply has twice the throughput. Skylake and later runs FP add on those FMA units as well, so they both have 2-per-clock throughput. (And if you can usefully use actual FMA instructions, you can get twice the work done.)

Also, your benchmark is testing latency, not throughput, because the same m is the input and output. There might be enough instruction-level parallelism to just bottleneck on shuffle throughput, though.

Lane-crossing shuffles like vperm2f128 and vinsertf128 have 2-cycle latency on IvB, vs. in-lane shuffles (including all 128-bit shuffles) having only single-cycle latency. Intel's guides claim a different number, IIRC, but 2 cycles is what Agner Fog's actual measurements found in practice in a dependency chain. (This is probably 1 cycle + some kind of bypass delay.) On Haswell and later, lane-crossing shuffles are 3-cycle latency. See: Why are some Haswell AVX latencies advertised by Intel as 3x slower than Sandy Bridge? Also related: Do 128bit cross lane operations in AVX512 give better performance?

You can sometimes reduce the amount of shuffling with an unaligned load that cuts into 128-bit halves at a useful point, and then use in-lane shuffles. That's potentially useful for AVX1 because it lacks vpermps or other lane-crossing shuffles with granularity less than 128 bits.
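On the latency-vs-throughput point: a throughput test needs iterations that don't chain through the same matrix. A sketch of the idea, reusing the question's own harness (the batch size 8 is arbitrary; a real harness would also keep the results live, e.g. by summing an element, so the work isn't optimized away):

    /* Invert several independent matrices per iteration so iterations
       don't form one long dependency chain through a single m. */
    CGLM_ALIGN(32) mat4 in[8], out[8];
    /* ... initialize in[0..7] with invertible matrices ... */
    for (int i = 0; i < 1000000; i++) {
        for (int k = 0; k < 8; k++) {
            glm_mat4_inv_avx(in[k], out[k]);  /* out != in: no loop-carried dep */
        }
    }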
Why does this SSE2 program (integers) generate movaps (float)?
The following loops transpose an integer matrix to another integer matrix. When I compiled it, interestingly it generates a movaps instruction to store the result into the output matrix. Why does gcc do this?

Data:

    int __attribute__(( aligned(16))) t[N][M],
        __attribute__(( aligned(16))) c_tra[N][M];

Loops:

    for( i=0; i<N; i+=4){
        for(j=0; j<M; j+=4){
            row0 = _mm_load_si128((__m128i *)&t[i][j]);
            row1 = _mm_load_si128((__m128i *)&t[i+1][j]);
            row2 = _mm_load_si128((__m128i *)&t[i+2][j]);
            row3 = _mm_load_si128((__m128i *)&t[i+3][j]);

            __t0 = _mm_unpacklo_epi32(row0, row1);
            __t1 = _mm_unpacklo_epi32(row2, row3);
            __t2 = _mm_unpackhi_epi32(row0, row1);
            __t3 = _mm_unpackhi_epi32(row2, row3);

            /* values back into I[0-3] */
            row0 = _mm_unpacklo_epi64(__t0, __t1);
            row1 = _mm_unpackhi_epi64(__t0, __t1);
            row2 = _mm_unpacklo_epi64(__t2, __t3);
            row3 = _mm_unpackhi_epi64(__t2, __t3);

            _mm_store_si128((__m128i *)&c_tra[j][i], row0);
            _mm_store_si128((__m128i *)&c_tra[j+1][i], row1);
            _mm_store_si128((__m128i *)&c_tra[j+2][i], row2);
            _mm_store_si128((__m128i *)&c_tra[j+3][i], row3);
        }
    }

Generated assembly code:

    .L39:
        lea rcx, [rsi+rdx]
        movdqa xmm1, XMMWORD PTR [rdx]
        add rdx, 16
        add rax, 2048
        movdqa xmm6, XMMWORD PTR [rcx+rdi]
        movdqa xmm3, xmm1
        movdqa xmm2, XMMWORD PTR [rcx+r9]
        punpckldq xmm3, xmm6
        movdqa xmm5, XMMWORD PTR [rcx+r10]
        movdqa xmm4, xmm2
        punpckhdq xmm1, xmm6
        punpckldq xmm4, xmm5
        punpckhdq xmm2, xmm5
        movdqa xmm5, xmm3
        punpckhqdq xmm3, xmm4
        punpcklqdq xmm5, xmm4
        movdqa xmm4, xmm1
        punpckhqdq xmm1, xmm2
        punpcklqdq xmm4, xmm2
        movaps XMMWORD PTR [rax-2048], xmm5
        movaps XMMWORD PTR [rax-1536], xmm3
        movaps XMMWORD PTR [rax-1024], xmm4
        movaps XMMWORD PTR [rax-512], xmm1
        cmp r11, rdx
        jne .L39

Compiled with gcc -Wall -msse4.2 -masm="intel" -O2 -c -S on Skylake, Linux Mint. With -mavx2 or -march=native it generates the VEX encoding: vmovaps.
Functionally those instructions are the same. I don't like to copy and paste other people's statements as my own, so here are a few links explaining it:

Difference between MOVDQA and MOVAPS x86 instructions?
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/279587
http://masm32.com/board/index.php?topic=1138.0
https://www.gamedev.net/blog/615/entry-2250281-demystifying-sse-move-instructions/

Short version: for the most part, you should try to use the move instruction that corresponds with the operations you are going to use on those registers. However, there is an additional complication: loads and stores to and from memory execute on a separate port from the integer and floating point units, so instructions that load from memory into a register or store from a register into memory will experience the same delay regardless of the data type you attach to the move. Thus in this case movaps, movapd, and movdqa will have the same delay no matter what data you use. Since movaps (and movups) is encoded in binary form with one fewer byte than the other two, it makes sense to use it for all reg-mem moves, regardless of the data type.

So it is a GCC optimization.
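For concreteness, the legacy-SSE encodings differ only by the 66 operand-size prefix on movdqa, which is where the one-byte saving comes from (bytes shown are for the simple register-indirect forms):

    movaps xmm5, [rax]   ; 0F 28 28      (3 bytes)
    movdqa xmm5, [rax]   ; 66 0F 6F 28   (4 bytes)
    movups xmm5, [rax]   ; 0F 10 28      (3 bytes)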
Arithmetic optimisation inside a C for-loop
I have two functions with for-loops which look very similar. The amount of data to process is very large, so I am trying to optimise the loops as much as possible. The execution time for the second function is 320 sec, but the first one takes 460 sec. Could somebody please suggest what makes the difference and how to optimise the computation?

The first one:

    int ii, jj;
    double c1, c2;
    for (ii = 0; ii < n; ++ii) {
        a[jj] += b[ii] * c1;
        a[++jj] += b[ii] * c2;
    }

The second one:

    int ii, jj;
    double c1, c2;
    for (ii = 0; ii < n; ++ii) {
        b[ii] += a[jj] * c1;
        b[ii] += a[++jj] * c2;
    }

And here is the assembler output for the first loop:

    movl -104(%rbp), %eax
    movq -64(%rbp), %rcx
    cmpl (%rcx), %eax
    jge LBB0_12
    ## BB#10:  ## in Loop: Header=BB0_9 Depth=5
    movslq -88(%rbp), %rax
    movq -48(%rbp), %rcx
    movsd (%rcx,%rax,8), %xmm0  ## xmm0 = mem[0],zero
    mulsd -184(%rbp), %xmm0
    movslq -108(%rbp), %rax
    movq -224(%rbp), %rcx       ## 8-byte Reload
    addsd (%rcx,%rax,8), %xmm0
    movsd %xmm0, (%rcx,%rax,8)
    movslq -88(%rbp), %rax
    movq -48(%rbp), %rdx
    movsd (%rdx,%rax,8), %xmm0  ## xmm0 = mem[0],zero
    mulsd -192(%rbp), %xmm0
    movl -108(%rbp), %esi
    addl $1, %esi
    movl %esi, -108(%rbp)
    movslq %esi, %rax
    addsd (%rcx,%rax,8), %xmm0
    movsd %xmm0, (%rcx,%rax,8)
    movl -88(%rbp), %esi
    addl $1, %esi
    movl %esi, -88(%rbp)

and for the second one:

    movl -104(%rbp), %eax
    movq -64(%rbp), %rcx
    cmpl (%rcx), %eax
    jge LBB0_12
    ## BB#10:  ## in Loop: Header=BB0_9 Depth=5
    movslq -108(%rbp), %rax
    movq -224(%rbp), %rcx       ## 8-byte Reload
    movsd (%rcx,%rax,8), %xmm0  ## xmm0 = mem[0],zero
    mulsd -184(%rbp), %xmm0
    movslq -88(%rbp), %rax
    movq -48(%rbp), %rdx
    addsd (%rdx,%rax,8), %xmm0
    movsd %xmm0, (%rdx,%rax,8)
    movl -108(%rbp), %esi
    addl $1, %esi
    movl %esi, -108(%rbp)
    movslq %esi, %rax
    movsd (%rcx,%rax,8), %xmm0  ## xmm0 = mem[0],zero
    mulsd -192(%rbp), %xmm0
    movslq -88(%rbp), %rax
    movq -48(%rbp), %rdx
    addsd (%rdx,%rax,8), %xmm0
    movsd %xmm0, (%rdx,%rax,8)
    movl -88(%rbp), %esi
    addl $1, %esi
    movl %esi, -88(%rbp)

The original function is much bigger, so here I provide only the pieces responsible for these for-loops. The rest of the C code and its assembler output is exactly the same for both functions.
The structure of that calculation is pretty weird, but it can be optimized significantly. Some problems with that code are:

- Reloading data from a pointer after writing through another pointer that isn't known not to alias. I assume they won't alias, because this algorithm would be even weirder if that were allowed, but if they really are supposed to (maybe) alias, ignore this. In general, structure your loop body as: first load everything, do calculations, then store back. Don't mix loading and storing; it makes the compiler more conservative.
- Reloading data that was stored in the previous iteration. The compiler can see through this a bit, but it complicates matters. Don't do it.
- Implicitly treating the first and last items differently. It looks like a nice homogeneous loop at first, but due to its weird structure it's actually special-casing the first and last items.

So let's first fix the second loop, which is simpler. The only problem here is the first store to b[ii], which has to Really Happen(tm) because it might alias with a[jj + 1]. But it can trivially be written so that that problem goes away:

    for (ii = 0; ii < n; ++ii) {
        b[ii] += a[jj] * c1 + a[jj + 1] * c2;
        jj++;
    }

You can tell by the assembly output that the compiler is happier now, and of course benchmarking confirms it's faster.

Old asm (only the main loop, not the extra cruft):

    .LBB0_14: # =>This Inner Loop Header: Depth=1
        vmulpd ymm4, ymm2, ymmword ptr [r8 - 8]
        vaddpd ymm4, ymm4, ymmword ptr [rax]
        vmovupd ymmword ptr [rax], ymm4
        vmulpd ymm5, ymm3, ymmword ptr [r8]
        vaddpd ymm4, ymm4, ymm5
        vmovupd ymmword ptr [rax], ymm4
        add r8, 32
        add rax, 32
        add r11, -4
        jne .LBB0_14

New asm (only the main loop):

    .LBB1_20: # =>This Inner Loop Header: Depth=1
        vmulpd ymm4, ymm2, ymmword ptr [rax - 104]
        vmulpd ymm5, ymm2, ymmword ptr [rax - 72]
        vmulpd ymm6, ymm2, ymmword ptr [rax - 40]
        vmulpd ymm7, ymm2, ymmword ptr [rax - 8]
        vmulpd ymm8, ymm3, ymmword ptr [rax - 96]
        vmulpd ymm9, ymm3, ymmword ptr [rax - 64]
        vmulpd ymm10, ymm3, ymmword ptr [rax - 32]
        vmulpd ymm11, ymm3, ymmword ptr [rax]
        vaddpd ymm4, ymm4, ymm8
        vaddpd ymm5, ymm5, ymm9
        vaddpd ymm6, ymm6, ymm10
        vaddpd ymm7, ymm7, ymm11
        vaddpd ymm4, ymm4, ymmword ptr [rcx - 96]
        vaddpd ymm5, ymm5, ymmword ptr [rcx - 64]
        vaddpd ymm6, ymm6, ymmword ptr [rcx - 32]
        vaddpd ymm7, ymm7, ymmword ptr [rcx]
        vmovupd ymmword ptr [rcx - 96], ymm4
        vmovupd ymmword ptr [rcx - 64], ymm5
        vmovupd ymmword ptr [rcx - 32], ymm6
        vmovupd ymmword ptr [rcx], ymm7
        sub rax, -128
        sub rcx, -128
        add rbx, -16
        jne .LBB1_20

That also got unrolled more (automatically), but the more significant difference (not that unrolling is useless, but reducing the loop overhead isn't such a big deal usually; it can mostly be handled by the ports that aren't busy with vector instructions) is the reduction in stores, which takes it from a ratio of 2/3 (potentially bottlenecked by store throughput, where half the stores are useless) to 4/12 (bottlenecked by something that really has to happen).

Now for the first loop: once you take out the first and last iterations, it's just adding two scaled b's to every a, and then we put the first and last iterations back in separately:

    a[0] += b[0] * c1;
    for (ii = 1; ii < n; ++ii) {
        a[ii] += b[ii - 1] * c2 + b[ii] * c1;
    }
    a[n] += b[n - 1] * c2;

That takes it from this (note that this isn't even vectorized):

    .LBB0_3: # =>This Inner Loop Header: Depth=1
        vmulsd xmm3, xmm0, qword ptr [rsi + 8*rax]
        vaddsd xmm2, xmm2, xmm3
        vmovsd qword ptr [rdi + 8*rax], xmm2
        vmulsd xmm2, xmm1, qword ptr [rsi + 8*rax]
        vaddsd xmm2, xmm2, qword ptr [rdi + 8*rax + 8]
        vmovsd qword ptr [rdi + 8*rax + 8], xmm2
        vmulsd xmm3, xmm0, qword ptr [rsi + 8*rax + 8]
        vaddsd xmm2, xmm2, xmm3
        vmovsd qword ptr [rdi + 8*rax + 8], xmm2
        vmulsd xmm2, xmm1, qword ptr [rsi + 8*rax + 8]
        vaddsd xmm2, xmm2, qword ptr [rdi + 8*rax + 16]
        vmovsd qword ptr [rdi + 8*rax + 16], xmm2
        lea rax, [rax + 2]
        cmp ecx, eax
        jne .LBB0_3

To this:

    .LBB1_6: # =>This Inner Loop Header: Depth=1
        vmulpd ymm4, ymm2, ymmword ptr [rbx - 104]
        vmulpd ymm5, ymm2, ymmword ptr [rbx - 72]
        vmulpd ymm6, ymm2, ymmword ptr [rbx - 40]
        vmulpd ymm7, ymm2, ymmword ptr [rbx - 8]
        vmulpd ymm8, ymm3, ymmword ptr [rbx - 96]
        vmulpd ymm9, ymm3, ymmword ptr [rbx - 64]
        vmulpd ymm10, ymm3, ymmword ptr [rbx - 32]
        vmulpd ymm11, ymm3, ymmword ptr [rbx]
        vaddpd ymm4, ymm4, ymm8
        vaddpd ymm5, ymm5, ymm9
        vaddpd ymm6, ymm6, ymm10
        vaddpd ymm7, ymm7, ymm11
        vaddpd ymm4, ymm4, ymmword ptr [rcx - 96]
        vaddpd ymm5, ymm5, ymmword ptr [rcx - 64]
        vaddpd ymm6, ymm6, ymmword ptr [rcx - 32]
        vaddpd ymm7, ymm7, ymmword ptr [rcx]
        vmovupd ymmword ptr [rcx - 96], ymm4
        vmovupd ymmword ptr [rcx - 64], ymm5
        vmovupd ymmword ptr [rcx - 32], ymm6
        vmovupd ymmword ptr [rcx], ymm7
        sub rbx, -128
        sub rcx, -128
        add r11, -16
        jne .LBB1_6

Nice and vectorized this time, and much less storing and loading going on. Both changes combined made it about twice as fast on my PC, but of course YMMV.

I still think this code is weird, though. Note how we're modifying a[n] in the last iteration of the first loop, then using it in the first iteration of the second loop, while the other a's just sort of stand to the side and watch. It's odd. Maybe it really has to be that way, but frankly it looks like a bug to me.
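As an aside not in the original answer: if the no-aliasing assumption really holds, stating it in the source lets the compiler vectorize the original structure from the start. A sketch using C99 restrict (the wrapper function is hypothetical):

    /* Promise the compiler that a and b never overlap (C99 restrict).
       Undefined behavior if they actually do alias. */
    void update(double *restrict a, const double *restrict b,
                double c1, double c2, int jj, int n) {
        for (int ii = 0; ii < n; ++ii) {
            a[jj]     += b[ii] * c1;
            a[jj + 1] += b[ii] * c2;
            ++jj;
        }
    }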
Why specify address of variable in ASM instead of just copying it into register?
In my quest to learn assembly (using GCC on x86_64), I have come across some SSE examples where, instead of just copying a C variable into a register, the address is copied into EAX instead. Why do that when you can just do this:

    typedef float v4sf __attribute__((vector_size(16)));

    typedef union {
        v4sf v;
        float f[4];
    } Vec4;

    Vec4 vector;
    vector.v = (v4sf){ 64.1, 128.2, 256.3, 512.4 };
    float blah = 2.2;

    __asm__("movups %0, %%xmm0 \n\t"
            "movups %1, %%xmm1 \n\t"
            "shufps $0x00, %%xmm1, %%xmm1 \n\t"
            "mulps %%xmm1, %%xmm0 \n\t"
            "movups %%xmm0, %0 \n\t"
            : "+m"(vector)
            : "m"(blah)
            : "%xmm0", "%xmm1"
    );

Does copying the vector into xmm0 (rather than keeping it in memory) cause a performance hit? Here is an example of what I'm talking about (it's Intel syntax):

    void powf_schlickSSE(const float *a, const float b, float *result) {
        __asm {
            mov eax, a                  //load address of vector
            movss xmm0, dword ptr [b]   //load exponent into SSE register
            movups xmm1, [eax]          //load vector into SSE register
            shufps xmm0, xmm0, 0        //shuffle b into all floats
            movaps xmm2, xmm1           //duplicate vector
            mov eax, result             //load address of result
            mulps xmm1, xmm0            //xmm1 = a*b
            subps xmm0, xmm1            //xmm0 = b-a*b
            addps xmm0, xmm2            //xmm0 = b-a*b+a
            rcpps xmm0, xmm0            //xmm0 = 1 / (b-a*b+a)
            mulps xmm2, xmm0            //xmm2 = a * (1 / (b-a*b+a))
            movups [eax], xmm2          //store result
        }
    }
I can see multiple reasons:

- MSVC (which that Intel-syntax code comes from, right?) doesn't support passing __m128 values into assembly blocks, or at least the version the code was written for didn't. Or maybe that version didn't support SSE at all except via inline assembly.
- The rest of the program didn't deal with vector types, so passing by pointer was the simplest solution.
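For completeness (not part of the original answer): the same Schlick approximation, pow(a, b) ≈ a / (b - a*b + a) per the asm comments above, can be written with intrinsics, which sidesteps the question entirely by letting the compiler pick registers and addressing modes. A sketch (the function name is hypothetical):

    #include <xmmintrin.h>

    /* Schlick-style fast pow on 4 floats: result = a / (b - a*b + a). */
    void powf_schlick_intrin(const float *a, float b, float *result) {
        __m128 va  = _mm_loadu_ps(a);
        __m128 vb  = _mm_set1_ps(b);                      /* broadcast b     */
        __m128 ab  = _mm_mul_ps(va, vb);                  /* a*b             */
        __m128 den = _mm_add_ps(_mm_sub_ps(vb, ab), va);  /* b - a*b + a     */
        _mm_storeu_ps(result, _mm_mul_ps(va, _mm_rcp_ps(den)));
    }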