AVX mat4 inv implementation is slower than SSE - performance

I implemented a 4x4 matrix inverse in SSE2 and AVX. Both are faster than the plain implementation. But if AVX is enabled (-mavx), the SSE2 implementation runs faster than my manual AVX implementation. It seems the compiler makes my SSE2 implementation more AVX-friendly :(
My AVX implementation has fewer multiplications and fewer additions, so I expected it to be faster than SSE. Maybe some instructions like _mm256_permute2f128_ps, _mm256_permutevar_ps/_mm256_permute_ps make AVX slower? I'm not trying to load an SSE/XMM register into an AVX/YMM register.
How can I make my AVX implementation faster than SSE?
My CPU: Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz (Ivy Bridge)
Plain with -O3 : 0.045853 secs
SSE2 with -O3 : 0.026021 secs
SSE2 with -O3 -mavx: 0.024336 secs
AVX1 with -O3 -mavx: 0.031798 secs
Updated (See bottom of question) all have -O3 -mavx flags:
AVX1 (reduced div) : 0.027666 secs
AVX1 (using rcp_ps) : 0.023205 secs
SSE2 (using rcp_ps) : 0.021969 secs
Initial Matrix:
Matrix (float4x4):
|0.0714 -0.6589 0.7488 2.0000|
|0.9446 0.2857 0.1613 4.0000|
|-0.3202 0.6958 0.6429 6.0000|
|0.0000 0.0000 0.0000 1.0000|
Test code:
start = clock();
for (int i = 0; i < 1000000; i++) {
  glm_mat4_inv_sse2(m, m);
  // glm_mat4_inv_avx(m, m);
  // glm_mat4_inv(m, m)
}
end = clock();
total = (float)(end - start) / CLOCKS_PER_SEC;
printf("%f secs\n\n", total);
Implementations:
Library: http://github.com/recp/cglm
SSE Impl: https://gist.github.com/recp/690025c955c2e69a91e3a60a13768dee
AVX Impl: https://gist.github.com/recp/8ccc5ad0d19f5516de55f9bf7b5045b2
SSE2 implementation output (using godbolt; options -O3):
glm_mat4_inv_sse2:
movaps xmm8, XMMWORD PTR [rdi+32]
movaps xmm2, XMMWORD PTR [rdi+16]
movaps xmm5, XMMWORD PTR [rdi+48]
movaps xmm6, XMMWORD PTR [rdi]
movaps xmm4, xmm8
movaps xmm13, xmm8
movaps xmm11, xmm8
shufps xmm11, xmm2, 170
shufps xmm4, xmm5, 238
movaps xmm3, xmm11
movaps xmm1, xmm8
pshufd xmm12, xmm4, 127
shufps xmm13, xmm2, 255
movaps xmm0, xmm13
movaps xmm9, xmm8
pshufd xmm4, xmm4, 42
shufps xmm9, xmm2, 85
shufps xmm1, xmm5, 153
movaps xmm7, xmm9
mulps xmm0, xmm4
pshufd xmm10, xmm1, 42
movaps xmm1, xmm11
shufps xmm5, xmm8, 0
mulps xmm3, xmm12
pshufd xmm5, xmm5, 128
mulps xmm7, xmm12
mulps xmm1, xmm10
subps xmm3, xmm0
movaps xmm0, xmm13
mulps xmm0, xmm10
mulps xmm13, xmm5
subps xmm7, xmm0
movaps xmm0, xmm9
mulps xmm0, xmm4
subps xmm0, xmm1
movaps xmm1, xmm8
movaps xmm8, xmm11
shufps xmm1, xmm2, 0
mulps xmm8, xmm5
movaps xmm11, xmm7
mulps xmm4, xmm1
mulps xmm5, xmm9
movaps xmm9, xmm2
mulps xmm12, xmm1
shufps xmm9, xmm6, 85
pshufd xmm9, xmm9, 168
mulps xmm1, xmm10
movaps xmm10, xmm2
shufps xmm10, xmm6, 0
pshufd xmm10, xmm10, 168
subps xmm4, xmm8
mulps xmm7, xmm10
movaps xmm8, xmm2
shufps xmm2, xmm6, 255
shufps xmm8, xmm6, 170
pshufd xmm8, xmm8, 168
pshufd xmm2, xmm2, 168
mulps xmm11, xmm8
subps xmm12, xmm13
movaps xmm13, XMMWORD PTR .LC0[rip]
subps xmm1, xmm5
movaps xmm5, xmm3
mulps xmm5, xmm9
mulps xmm3, xmm10
subps xmm5, xmm11
movaps xmm11, xmm0
mulps xmm11, xmm2
mulps xmm0, xmm10
addps xmm5, xmm11
movaps xmm11, xmm12
mulps xmm11, xmm8
mulps xmm12, xmm9
xorps xmm5, xmm13
subps xmm3, xmm11
movaps xmm11, xmm4
mulps xmm4, xmm9
subps xmm7, xmm12
mulps xmm11, xmm2
mulps xmm2, xmm1
mulps xmm1, xmm8
subps xmm0, xmm4
addps xmm3, xmm11
movaps xmm11, XMMWORD PTR .LC1[rip]
addps xmm2, xmm7
addps xmm0, xmm1
movaps xmm1, xmm5
xorps xmm3, xmm11
xorps xmm2, xmm13
shufps xmm1, xmm3, 0
xorps xmm0, xmm11
movaps xmm4, xmm2
shufps xmm4, xmm0, 0
shufps xmm1, xmm4, 136
mulps xmm1, xmm6
pshufd xmm4, xmm1, 27
addps xmm1, xmm4
pshufd xmm4, xmm1, 65
addps xmm1, xmm4
movaps xmm4, XMMWORD PTR .LC2[rip]
divps xmm4, xmm1
mulps xmm5, xmm4
mulps xmm3, xmm4
mulps xmm2, xmm4
mulps xmm0, xmm4
movaps XMMWORD PTR [rsi], xmm5
movaps XMMWORD PTR [rsi+16], xmm3
movaps XMMWORD PTR [rsi+32], xmm2
movaps XMMWORD PTR [rsi+48], xmm0
ret
.LC0:
.long 0
.long 2147483648
.long 0
.long 2147483648
.LC1:
.long 2147483648
.long 0
.long 2147483648
.long 0
.LC2:
.long 1065353216
.long 1065353216
.long 1065353216
.long 1065353216
SSE2 implementation (AVX enabled) output (using godbolt; options -O3 -mavx):
glm_mat4_inv_sse2:
vmovaps xmm9, XMMWORD PTR [rdi+32]
vmovaps xmm6, XMMWORD PTR [rdi+48]
vmovaps xmm2, XMMWORD PTR [rdi+16]
vmovaps xmm7, XMMWORD PTR [rdi]
vshufps xmm5, xmm9, xmm6, 238
vpshufd xmm13, xmm5, 127
vpshufd xmm5, xmm5, 42
vshufps xmm1, xmm9, xmm6, 153
vshufps xmm11, xmm9, xmm2, 170
vshufps xmm12, xmm9, xmm2, 255
vmulps xmm3, xmm11, xmm13
vpshufd xmm1, xmm1, 42
vmulps xmm0, xmm12, xmm5
vshufps xmm10, xmm9, xmm2, 85
vshufps xmm6, xmm6, xmm9, 0
vpshufd xmm6, xmm6, 128
vmulps xmm8, xmm10, xmm13
vmulps xmm4, xmm10, xmm5
vsubps xmm3, xmm3, xmm0
vmulps xmm0, xmm12, xmm1
vsubps xmm8, xmm8, xmm0
vmulps xmm0, xmm11, xmm1
vsubps xmm4, xmm4, xmm0
vshufps xmm0, xmm9, xmm2, 0
vmulps xmm9, xmm12, xmm6
vmulps xmm13, xmm0, xmm13
vmulps xmm5, xmm0, xmm5
vmulps xmm0, xmm0, xmm1
vsubps xmm12, xmm13, xmm9
vmulps xmm9, xmm11, xmm6
vmovaps xmm13, XMMWORD PTR .LC0[rip]
vmulps xmm6, xmm10, xmm6
vshufps xmm10, xmm2, xmm7, 85
vpshufd xmm10, xmm10, 168
vsubps xmm5, xmm5, xmm9
vshufps xmm9, xmm2, xmm7, 170
vpshufd xmm9, xmm9, 168
vsubps xmm1, xmm0, xmm6
vmulps xmm11, xmm8, xmm9
vshufps xmm0, xmm2, xmm7, 0
vshufps xmm2, xmm2, xmm7, 255
vmulps xmm6, xmm3, xmm10
vpshufd xmm2, xmm2, 168
vpshufd xmm0, xmm0, 168
vmulps xmm3, xmm3, xmm0
vmulps xmm8, xmm8, xmm0
vmulps xmm0, xmm4, xmm0
vsubps xmm6, xmm6, xmm11
vmulps xmm11, xmm4, xmm2
vaddps xmm6, xmm6, xmm11
vmulps xmm11, xmm12, xmm9
vmulps xmm12, xmm12, xmm10
vxorps xmm6, xmm6, xmm13
vsubps xmm3, xmm3, xmm11
vmulps xmm11, xmm5, xmm2
vmulps xmm5, xmm5, xmm10
vsubps xmm8, xmm8, xmm12
vmulps xmm2, xmm1, xmm2
vmulps xmm1, xmm1, xmm9
vaddps xmm3, xmm3, xmm11
vmovaps xmm11, XMMWORD PTR .LC1[rip]
vsubps xmm0, xmm0, xmm5
vaddps xmm2, xmm8, xmm2
vxorps xmm3, xmm3, xmm11
vaddps xmm0, xmm0, xmm1
vshufps xmm1, xmm6, xmm3, 0
vxorps xmm2, xmm2, xmm13
vxorps xmm0, xmm0, xmm11
vshufps xmm4, xmm2, xmm0, 0
vshufps xmm1, xmm1, xmm4, 136
vmulps xmm1, xmm1, xmm7
vpshufd xmm4, xmm1, 27
vaddps xmm1, xmm1, xmm4
vpshufd xmm4, xmm1, 65
vaddps xmm1, xmm1, xmm4
vmovaps xmm4, XMMWORD PTR .LC2[rip]
vdivps xmm1, xmm4, xmm1
vmulps xmm6, xmm6, xmm1
vmulps xmm3, xmm3, xmm1
vmulps xmm2, xmm2, xmm1
vmulps xmm1, xmm0, xmm1
vmovaps XMMWORD PTR [rsi], xmm6
vmovaps XMMWORD PTR [rsi+16], xmm3
vmovaps XMMWORD PTR [rsi+32], xmm2
vmovaps XMMWORD PTR [rsi+48], xmm1
ret
.LC0:
.long 0
.long 2147483648
.long 0
.long 2147483648
.LC1:
.long 2147483648
.long 0
.long 2147483648
.long 0
.LC2:
.long 1065353216
.long 1065353216
.long 1065353216
.long 1065353216
AVX implementation output (using godbolt; options -O3 -mavx):
glm_mat4_inv_avx:
vmovaps ymm3, YMMWORD PTR [rdi]
vmovaps ymm1, YMMWORD PTR [rdi+32]
vmovdqa ymm2, YMMWORD PTR .LC1[rip]
vmovdqa ymm0, YMMWORD PTR .LC0[rip]
vperm2f128 ymm6, ymm3, ymm3, 3
vperm2f128 ymm5, ymm1, ymm1, 0
vperm2f128 ymm1, ymm1, ymm1, 17
vmovdqa ymm10, YMMWORD PTR .LC4[rip]
vpermilps ymm9, ymm5, ymm0
vpermilps ymm7, ymm1, ymm2
vperm2f128 ymm8, ymm6, ymm6, 0
vpermilps ymm1, ymm1, ymm0
vpermilps ymm5, ymm5, ymm2
vpermilps ymm0, ymm8, ymm0
vmulps ymm4, ymm7, ymm9
vpermilps ymm8, ymm8, ymm2
vpermilps ymm11, ymm6, 1
vmulps ymm2, ymm5, ymm1
vmulps ymm7, ymm0, ymm7
vmulps ymm1, ymm8, ymm1
vmulps ymm0, ymm0, ymm5
vmulps ymm5, ymm8, ymm9
vmovdqa ymm9, YMMWORD PTR .LC3[rip]
vmovdqa ymm8, YMMWORD PTR .LC2[rip]
vsubps ymm4, ymm4, ymm2
vsubps ymm7, ymm7, ymm1
vperm2f128 ymm2, ymm4, ymm4, 0
vperm2f128 ymm4, ymm4, ymm4, 17
vshufps ymm1, ymm2, ymm4, 77
vpermilps ymm1, ymm1, ymm9
vsubps ymm5, ymm0, ymm5
vpermilps ymm0, ymm2, ymm8
vmulps ymm0, ymm0, ymm11
vperm2f128 ymm1, ymm1, ymm2, 0
vshufps ymm2, ymm2, ymm4, 74
vpermilps ymm4, ymm6, 90
vmulps ymm1, ymm1, ymm4
vpermilps ymm2, ymm2, ymm10
vpermilps ymm6, ymm6, 191
vmovaps ymm11, YMMWORD PTR .LC5[rip]
vperm2f128 ymm2, ymm2, ymm2, 0
vperm2f128 ymm4, ymm3, ymm3, 0
vpermilps ymm12, ymm4, YMMWORD PTR .LC7[rip]
vmulps ymm2, ymm2, ymm6
vinsertf128 ymm6, ymm7, xmm5, 1
vperm2f128 ymm5, ymm7, ymm5, 49
vshufps ymm7, ymm6, ymm5, 77
vpermilps ymm9, ymm7, ymm9
vsubps ymm0, ymm0, ymm1
vpermilps ymm1, ymm4, YMMWORD PTR .LC6[rip]
vpermilps ymm4, ymm4, YMMWORD PTR .LC8[rip]
vaddps ymm2, ymm0, ymm2
vpermilps ymm0, ymm6, ymm8
vshufps ymm6, ymm6, ymm5, 74
vpermilps ymm6, ymm6, ymm10
vmulps ymm1, ymm1, ymm0
vmulps ymm0, ymm12, ymm9
vmulps ymm6, ymm4, ymm6
vxorps ymm2, ymm2, ymm11
vdpps ymm3, ymm3, ymm2, 255
vsubps ymm0, ymm1, ymm0
vdivps ymm2, ymm2, ymm3
vaddps ymm0, ymm0, ymm6
vxorps ymm0, ymm0, ymm11
vdivps ymm0, ymm0, ymm3
vperm2f128 ymm5, ymm2, ymm2, 3
vshufps ymm1, ymm2, ymm5, 68
vshufps ymm2, ymm2, ymm5, 238
vperm2f128 ymm4, ymm0, ymm0, 3
vshufps ymm6, ymm0, ymm4, 68
vshufps ymm0, ymm0, ymm4, 238
vshufps ymm3, ymm1, ymm6, 136
vshufps ymm1, ymm1, ymm6, 221
vinsertf128 ymm1, ymm3, xmm1, 1
vshufps ymm3, ymm2, ymm0, 136
vshufps ymm0, ymm2, ymm0, 221
vinsertf128 ymm0, ymm3, xmm0, 1
vmovaps YMMWORD PTR [rsi], ymm1
vmovaps YMMWORD PTR [rsi+32], ymm0
vzeroupper
ret
.LC0:
.long 2
.long 1
.long 1
.long 0
.long 0
.long 0
.long 0
.long 0
.LC1:
.long 3
.long 3
.long 2
.long 3
.long 2
.long 1
.long 1
.long 1
.LC2:
.long 0
.long 0
.long 1
.long 2
.long 0
.long 0
.long 1
.long 2
.LC3:
.long 0
.long 1
.long 1
.long 2
.long 0
.long 1
.long 1
.long 2
.LC4:
.long 0
.long 2
.long 3
.long 3
.long 0
.long 2
.long 3
.long 3
.LC5:
.long 0
.long 2147483648
.long 0
.long 2147483648
.long 2147483648
.long 0
.long 2147483648
.long 0
.LC6:
.long 1
.long 0
.long 0
.long 0
.long 1
.long 0
.long 0
.long 0
.LC7:
.long 2
.long 2
.long 1
.long 1
.long 2
.long 2
.long 1
.long 1
.LC8:
.long 3
.long 3
.long 3
.long 2
.long 3
.long 3
.long 3
.long 2
EDIT:
I'm using Xcode (Version 10.0 (10A255)) on macOS (on a 15-inch MacBook Pro (Retina, Mid 2012)) to build and run the tests with the -O3 optimization option. It compiles the test code with clang. I used GCC 8.2 on godbolt to view the asm (sorry for this), but the assembly output seems similar.
I had enabled pshufd by turning on the cglm option CGLM_USE_INT_DOMAIN, and I forgot to disable it when viewing the asm.
#ifdef CGLM_USE_INT_DOMAIN
# define glmm_shuff1(xmm, z, y, x, w) \
_mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(xmm), \
_MM_SHUFFLE(z, y, x, w)))
#else
# define glmm_shuff1(xmm, z, y, x, w) \
_mm_shuffle_ps(xmm, xmm, _MM_SHUFFLE(z, y, x, w))
#endif
Whole test code (except headers):
#include <cglm/cglm.h>
#include <sys/time.h>
#include <time.h>
int
main(int argc, const char * argv[]) {
  CGLM_ALIGN(32) mat4 m = GLM_MAT4_IDENTITY_INIT;
  double start, end, total;
  /* generate invertible matrix */
  glm_translate(m, (vec3){1,2,3});
  glm_rotate(m, M_PI_2, (vec3){1,2,3});
  glm_translate(m, (vec3){1,2,3});
  glm_mat4_print(m, stderr);
  start = clock();
  for (int i = 0; i < 1000000; i++) {
    glm_mat4_inv_sse2(m, m);
    // glm_mat4_inv_avx(m, m);
    // glm_mat4_inv(m, m);
  }
  end = clock();
  total = (float)(end - start) / CLOCKS_PER_SEC;
  printf("%f secs\n\n", total);
  glm_mat4_print(m, stderr);
}
EDIT 2:
I have reduced one division by using multiplication (1 set_ps + 1 div_ps + 2 mul_ps seems better than 2 div_ps):
Old Version:
r1 = _mm256_div_ps(r1, y4);
r2 = _mm256_div_ps(r2, y4);
New Version (the SSE2 version already did its division like this):
y5 = _mm256_div_ps(_mm256_set1_ps(1.0f), y4);
r1 = _mm256_mul_ps(r1, y5);
r2 = _mm256_mul_ps(r2, y5);
New Version (Fast version):
y5 = _mm256_rcp_ps(y4);
r1 = _mm256_mul_ps(r1, y5);
r2 = _mm256_mul_ps(r2, y5);
Now it is better than before, but still not faster than SSE on the Ivy Bridge CPU. I updated the test results above.
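A note on the rcp_ps variant: the hardware reciprocal is only an approximation (Intel documents a maximum relative error of 1.5 * 2^-12), so the fast version trades accuracy for speed. A common middle ground is one Newton-Raphson step, x1 = x0 * (2 - d * x0), which roughly doubles the number of correct bits for the cost of two multiplies and a subtract. The sketch below shows the idea with 128-bit SSE intrinsics. The helper names rcp_nr_ps and rcp_nr_max_rel_error are hypothetical and not part of cglm, and whether the extra instructions still beat a single div_ps on Ivy Bridge would have to be measured.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE: _mm_rcp_ps, _mm_mul_ps, _mm_sub_ps */

/* Hypothetical helper (not part of cglm): approximate 1/d per lane with
   _mm_rcp_ps, then refine the estimate with one Newton-Raphson step. */
static inline __m128
rcp_nr_ps(__m128 d) {
  __m128 x0 = _mm_rcp_ps(d);               /* rough estimate of 1/d     */
  __m128 t  = _mm_mul_ps(d, x0);           /* d * x0, close to 1        */
  t = _mm_sub_ps(_mm_set1_ps(2.0f), t);    /* 2 - d * x0                */
  return _mm_mul_ps(x0, t);                /* x1 = x0 * (2 - d * x0)    */
}

/* Largest relative error of rcp_nr_ps over the four lanes of in[].
   Inputs must be nonzero. */
static float
rcp_nr_max_rel_error(const float in[4]) {
  float out[4], worst = 0.0f;
  _mm_storeu_ps(out, rcp_nr_ps(_mm_loadu_ps(in)));
  for (int i = 0; i < 4; i++) {
    assert(in[i] != 0.0f);
    float err = out[i] * in[i] - 1.0f;     /* exact reciprocal gives 0  */
    if (err < 0.0f) err = -err;
    if (err > worst) worst = err;
  }
  return worst;
}
```

The same three operations exist in 256-bit form (_mm256_rcp_ps, _mm256_mul_ps, _mm256_sub_ps), so the refinement would slot in where y5 is computed.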

Your CPU is an Intel IvyBridge.
Sandybridge / IvyBridge has 1-per-clock mul and add throughput, on different ports so they don't compete with each other.
But it has only 1-per-clock shuffle throughput for 256-bit shuffles and for all FP shuffles (even 128-bit shufps). However, it has 2-per-clock throughput for integer shuffles, and I notice your compiler is using pshufd as a copy-and-shuffle between FP instructions. This is a solid win when compiling for SSE2, especially where the VEX encoding isn't available (so it saves a movaps by replacing movaps xmm0, xmm1 / shufps xmm0, xmm0, 65 or whatever). Your compiler does this even when AVX is available and it could have used vshufps xmm0, xmm1, xmm1, 65, so it's either cleverly choosing vpshufd for microarchitectural reasons, or it got lucky, or its heuristics / instruction cost model were designed with this in mind. (I suspect it was clang, but you didn't say in the question or show the C source you compiled from.)
In Haswell and later (which supports AVX2 and thus 256-bit versions of every integer shuffle), all shuffles can only run on port 5. But in IvB where only AVX1 is supported, it's only FP shuffles that go up to 256 bits. Integer shuffles are always only 128 bits, and can run on port 1 or port 5, because there are 128-bit shuffle execution units on both those ports. (https://agner.org/optimize/)
I haven't looked at the asm in a ton of detail because it's long, but if it costs you more shuffles to save on adds / multiplies by using wider vectors, that would be slower.
On top of that, all your shuffles become FP shuffles, so they only run on port 5 and can't take advantage of port 1. I suspect there's so much shuffling that it is the bottleneck, rather than port 0 (FP multiply) or port 1 (FP add).
BTW, Haswell and later have two FMA units, one each on p0 and p1, so multiply has twice the throughput. Skylake and later run FP add on those FMA units as well, so both have 2-per-clock throughput. (And if you can usefully use actual FMA instructions, you can get twice the work done.)
Also, your benchmark is testing latency, not throughput, because the same m is the input and output. There might be enough instruction-level parallelism to just bottleneck on shuffle throughput, though.
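To make the latency / throughput distinction concrete, here is a minimal sketch of the two loop shapes. The function transpose4 is only a stand-in for glm_mat4_inv_sse2 / glm_mat4_inv_avx so that the example is self-contained, and bench_latency and bench_throughput are hypothetical names. The timing calls are left out.

```c
#include <assert.h>
#include <stddef.h>

typedef float mat4f[4][4];

/* Stand-in for the function under test (glm_mat4_inv_* in the question).
   Safe to call with src == dst. */
static void
transpose4(mat4f src, mat4f dst) {
  mat4f t;
  for (int r = 0; r < 4; r++)
    for (int c = 0; c < 4; c++)
      t[c][r] = src[r][c];
  for (int r = 0; r < 4; r++)
    for (int c = 0; c < 4; c++)
      dst[r][c] = t[r][c];
}

/* Latency-bound shape: every call consumes the previous call's output,
   so consecutive calls form one long dependency chain. */
static void
bench_latency(mat4f m, size_t iters) {
  for (size_t i = 0; i < iters; i++)
    transpose4(m, m);
}

/* Throughput-bound shape: each call reads and writes its own matrix, so
   the CPU is free to overlap independent calls. */
static void
bench_throughput(mat4f *in, mat4f *out, size_t count) {
  assert(in != out);
  for (size_t i = 0; i < count; i++)
    transpose4(in[i], out[i]);
}
```

In a real measurement the outputs would also need to be used afterwards (printed or summed), otherwise an optimizing compiler may remove part of the work.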
Lane-crossing shuffles like vperm2f128 and vinsertf128 have 2-cycle latency on IvB, vs. in-lane shuffles (including all 128-bit shuffles) having only single-cycle latency. Intel's guides claim a different number, IIRC, but 2 cycles is what Agner Fog's actual measurements found in practice in a dependency chain. (This is probably 1 cycle + some kind of bypass delay.) On Haswell and later, lane-crossing shuffles have 3-cycle latency. See: Why are some Haswell AVX latencies advertised by Intel as 3x slower than Sandy Bridge?
Also related: Do 128bit cross lane operations in AVX512 give better performance? You can sometimes reduce the amount of shuffling with an unaligned load that cuts into 128-bit halves at a useful point, and then use in-lane shuffles. That's potentially useful for AVX1 because it lacks vpermps or other lane-crossing shuffles with granularity less than 128 bits.

Related

Will intel -03 convert pairs of __m256d instructions into __m512d

Will code written for 256-bit vector registers be compiled to use 512-bit instructions by the (2019) Intel compiler at the O3 optimization level?
E.g. will operations on two __m256d objects either be converted to the same number of operations on masked __m512d objects, or be grouped to make the most of the register, in the best case reducing the total number of operations by a factor of 2?
arch: Knights Landing
Unfortunately, no: code written with AVX/AVX-2 intrinsics is not rewritten by ICC to use AVX-512 (checked with both ICC 2019 and ICC 2021). There is no instruction fusing. Here is an example (see on GodBolt).
#include <x86intrin.h>
void compute(double* restrict data, int size)
{
__m256d cst1 = _mm256_set1_pd(23.42);
__m256d cst2 = _mm256_set1_pd(815.0);
__m256d v1 = _mm256_load_pd(data);
__m256d v2 = _mm256_load_pd(data+4);
__m256d v3 = _mm256_load_pd(data+8);
__m256d v4 = _mm256_load_pd(data+12);
v1 = _mm256_fmadd_pd(v1, cst1, cst2);
v2 = _mm256_fmadd_pd(v2, cst1, cst2);
v3 = _mm256_fmadd_pd(v3, cst1, cst2);
v4 = _mm256_fmadd_pd(v4, cst1, cst2);
_mm256_store_pd(data, v1);
_mm256_store_pd(data+4, v2);
_mm256_store_pd(data+8, v3);
_mm256_store_pd(data+12, v4);
}
Generated code:
compute:
vmovupd ymm0, YMMWORD PTR .L_2il0floatpacket.0[rip] #5.20
vmovupd ymm1, YMMWORD PTR .L_2il0floatpacket.1[rip] #6.20
vmovupd ymm2, YMMWORD PTR [rdi] #7.33
vmovupd ymm3, YMMWORD PTR [32+rdi] #8.33
vmovupd ymm4, YMMWORD PTR [64+rdi] #9.33
vmovupd ymm5, YMMWORD PTR [96+rdi] #10.33
vfmadd213pd ymm2, ymm0, ymm1 #11.10
vfmadd213pd ymm3, ymm0, ymm1 #12.10
vfmadd213pd ymm4, ymm0, ymm1 #13.10
vfmadd213pd ymm5, ymm0, ymm1 #14.10
vmovupd YMMWORD PTR [rdi], ymm2 #15.21
vmovupd YMMWORD PTR [32+rdi], ymm3 #16.21
vmovupd YMMWORD PTR [64+rdi], ymm4 #17.21
vmovupd YMMWORD PTR [96+rdi], ymm5 #18.21
vzeroupper #19.1
ret #19.1
The same code is generated by both versions of ICC.
Note that using AVX-512 will not always speed up your code by a factor of two. For example, on Skylake SP (server processors) there are two AVX/AVX-2 SIMD units that can be fused to execute AVX-512 instructions, but fusing does not improve throughput (assuming the SIMD units are the bottleneck). However, Skylake SP also supports an optional additional 512-bit SIMD unit that does not support AVX/AVX-2 (only available on some processors). In that case, AVX-512 can make your code twice as fast.

x86: partial register stall in cvtsi2sd in big loops but not in small loops [duplicate]

This question already has answers here:
Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster?
Floating point division vs floating point multiplication
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
I'm testing floating point code generation of some compiler. The code in question calculates a simple sum:
double sum = 0.0;
for (int i = 1; i <= n; i++) {
sum += 1.0 / i;
}
I get two versions of this code, a simple one:
L017: cvtsi2sd xmm1, eax
add eax, byte 1
movsd xmm2, qword [rel L007]
divsd xmm2, xmm1
addsd xmm0, xmm2
cmp edx, eax
jge short L017
and one with the loop unrolled:
L017: cvtsi2sd xmm1, ecx
movsd xmm2, qword [rel L008]
divsd xmm2, xmm1
addsd xmm2, xmm0
lea edx, [rcx+1H]
add ecx, byte 2
cvtsi2sd xmm0, edx
movsd xmm1, qword [rel L008]
divsd xmm1, xmm0
addsd xmm2, xmm1
movapd xmm0, xmm2
cmp eax, ecx
jnz short L017
The unrolled one runs much longer than the simple one: 4 s vs 1.6 s when n=1000000000.
The problem is resolved by clearing the register before cvtsi2sd :
L017: pxor xmm1, xmm1
cvtsi2sd xmm1, ecx
movsd xmm2, qword [rel L008]
...
The new unrolled version runs as fast as the simple one. OK, this is a partial register stall, although I'd never expected it to be so bad. There are however two questions:
Why does it not affect the simple version? It looks like tight loops do not suffer from this stall.
What makes cvtsi2sd so much different from other scalar FP instructions, such as addsd? They also affect only part of an xmm register (in non-VEX mode, and this is the mode used).
I observed this on three processors:
i7-6820HQ (Skylake)
i7-8665U (Whiskey Lake)
Xeon Gold 5120 (Skylake)
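For reference, the dependency-breaking fix can also be expressed at the source level. The sketch below assumes SSE2 intrinsics from <emmintrin.h>, and harmonic_sum is a hypothetical name. _mm_cvtsi32_sd copies the upper half of its first operand into the result, so passing a freshly zeroed vector gives the conversion a destination that does not depend on older register contents. Compilers typically emit a register-zeroing instruction followed by cvtsi2sd for this pattern, but the exact code generation is up to the compiler.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */

/* Hypothetical helper: sum of 1/i for i = 1..n, with the int -> double
   conversion written so that its destination starts from a zeroed vector. */
static double
harmonic_sum(int n) {
  assert(n >= 0);
  const __m128d one = _mm_set_sd(1.0);
  __m128d sum = _mm_setzero_pd();
  for (int i = 1; i <= n; i++) {
    /* cvtsi2sd merges into its destination register; a zeroed source
       removes the dependency on whatever that register held before. */
    __m128d x = _mm_cvtsi32_sd(_mm_setzero_pd(), i);
    sum = _mm_add_sd(sum, _mm_div_sd(one, x));
  }
  return _mm_cvtsd_f64(sum);
}
```

The arithmetic is the same sequence of double-precision divides and adds as the scalar loop in the question, so the result should match it.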

Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says

In the Intel Intrinsics Guide, vmulpd and vfmadd213pd have a latency of 5, and vaddpd has a latency of 3.
I wrote some test code, but all of the results are 1 cycle slower.
Here is my test code:
.CODE
test_latency PROC
vxorpd ymm0, ymm0, ymm0
vxorpd ymm1, ymm1, ymm1
loop_start:
vmulpd ymm0, ymm0, ymm1
vmulpd ymm0, ymm0, ymm1
vmulpd ymm0, ymm0, ymm1
vmulpd ymm0, ymm0, ymm1
sub rcx, 4
jg loop_start
ret
test_latency ENDP
END
#include <stdio.h>
#include <omp.h>
#include <stdint.h>
#include <windows.h>
extern "C" void test_latency(int64_t n);
int main()
{
SetThreadAffinityMask(GetCurrentThread(), 1); // Avoid context switch
int64_t n = (int64_t)3e9;
double start = omp_get_wtime();
test_latency(n);
double end = omp_get_wtime();
double time = end - start;
double freq = 3.3e9; // My CPU frequency
double latency = freq * time / n;
printf("latency = %f\n", latency);
}
My CPU is a Core i5 4590, and I locked its frequency at 3.3GHz. The output is: latency = 6.102484.
Strangely enough, if I change vmulpd ymm0, ymm0, ymm1 to vmulpd ymm0, ymm0, ymm0, then the output becomes: latency = 5.093745.
Is there an explanation? Is my test code problematic?
MORE RESULTS
results on Core i5 4590 @ 3.3GHz
vmulpd ymm0, ymm0, ymm1 6.056094
vmulpd ymm0, ymm0, ymm0 5.054515
vaddpd ymm0, ymm0, ymm1 4.038062
vaddpd ymm0, ymm0, ymm0 3.029360
vfmadd213pd ymm0, ymm0, ymm1 6.052501
vfmadd213pd ymm0, ymm1, ymm0 6.053163
vfmadd213pd ymm0, ymm1, ymm1 6.055160
vfmadd213pd ymm0, ymm0, ymm0 5.041532
(without vzeroupper)
vmulpd xmm0, xmm0, xmm1 6.050404
vmulpd xmm0, xmm0, xmm0 5.042191
vaddpd xmm0, xmm0, xmm1 4.044518
vaddpd xmm0, xmm0, xmm0 3.024233
vfmadd213pd xmm0, xmm0, xmm1 6.047219
vfmadd213pd xmm0, xmm1, xmm0 6.046022
vfmadd213pd xmm0, xmm1, xmm1 6.052805
vfmadd213pd xmm0, xmm0, xmm0 5.046843
(with vzeroupper)
vmulpd xmm0, xmm0, xmm1 5.062350
vmulpd xmm0, xmm0, xmm0 5.039132
vaddpd xmm0, xmm0, xmm1 3.019815
vaddpd xmm0, xmm0, xmm0 3.026791
vfmadd213pd xmm0, xmm0, xmm1 5.043748
vfmadd213pd xmm0, xmm1, xmm0 5.051424
vfmadd213pd xmm0, xmm1, xmm1 5.049090
vfmadd213pd xmm0, xmm0, xmm0 5.051947
(without vzeroupper)
mulpd xmm0, xmm1 5.047671
mulpd xmm0, xmm0 5.042176
addpd xmm0, xmm1 3.019492
addpd xmm0, xmm0 3.028642
(with vzeroupper)
mulpd xmm0, xmm1 5.046220
mulpd xmm0, xmm0 5.057278
addpd xmm0, xmm1 3.025577
addpd xmm0, xmm0 3.031238
MY GUESS
I changed test_latency like this:
.CODE
test_latency PROC
vxorpd ymm0, ymm0, ymm0
vxorpd ymm1, ymm1, ymm1
loop_start:
vaddpd ymm1, ymm1, ymm1 ; added this line
vmulpd ymm0, ymm0, ymm1
vmulpd ymm0, ymm0, ymm1
vmulpd ymm0, ymm0, ymm1
vmulpd ymm0, ymm0, ymm1
sub rcx, 4
jg loop_start
ret
test_latency ENDP
END
Finally I get a result of 5 cycles. Other instructions achieve the same effect:
vmovupd ymm1, ymm0
vmovupd ymm1, [mem]
vmovdqu ymm1, [mem]
vxorpd ymm1, ymm1, ymm1
vpxor ymm1, ymm1, ymm1
vmulpd ymm1, ymm1, ymm1
vshufpd ymm1, ymm1, ymm1, 0
But these instructions do not:
vmovupd ymm1, ymm2 ; suppose ymm2 is zeroed
vpaddq ymm1, ymm1, ymm1
vpmulld ymm1, ymm1, ymm1
vpand ymm1, ymm1, ymm1
In the case of ymm instructions, I guess the conditions to avoid the 1 extra cycle are:
All inputs are from the same domain.
All inputs are fresh enough. (A move from an old value doesn't work.)
As for VEX xmm, the condition seems a little blurry. It seems related to the state of the upper half, but I don't know which one is cleaner:
vxorpd ymm1, ymm1, ymm1
vxorpd xmm1, xmm1, xmm1
vzeroupper
This is a hard question for me.
I've been meaning to write something up about this for a few years now, since noticing it on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Intel-Performance-Quirks#after-an-integer-to-fp-bypass-latency-can-be-increased-indefinitely
Bypass-delay latency is "sticky": an integer SIMD instruction can "infect" all future instructions that read that value, even long after the instruction is done. I'm surprised that "infection" survived across a zeroing idiom, especially an FP zeroing instruction like vxorpd, but I can reproduce that effect on SKL (i7-6700k, counting clock cycles directly in a test loop with perf on Linux instead of messing around with time and frequency.)
(On Skylake, it seems 3 or more vxorpd zeroing instructions in a row before the loop happen to work, removing the extra bypass latency. AFAIK, xor-zeroing is always eliminated, unlike mov-elimination which sometimes fails. But perhaps the difference is just in creating a gap between issue of the vpaddb into the back-end and the first vmulpd; in my test loop I "dirty" / pollute the register right before the loop.)
(update: trying my test code again now, even one vxorps seems to clean the register. Perhaps a microcode update changed something.)
Presumably some previous use of YMM1 in the caller involved an integer instruction. (TODO: investigate how common it is for a register to get into this state, and when it can survive xor-zeroing! I expected it to only happen when constructing an FP bit-pattern with integer instructions, including stuff like vpcmpeqd ymm1,ymm1,ymm1 to make a -NaN (all-one bits).)
On Skylake I can fix it by doing vaddpd ymm1, ymm1, ymm1 before the loop, after the xor-zeroing. (Or before; it might not matter! That might be more optimal, putting it at the end of the previous dep chain instead of the start of this.)
As I wrote in a comment on another question:
xsave/rstor can fix the issue where writing a register with a SIMD-integer instruction like paddd creates extra latency indefinitely for reading it with an FP instruction, affecting latency from both inputs. e.g. paddd xmm0, xmm0 then in a loop addps xmm1, xmm0 has 5c latency instead of the usual 4, until the next save/restore. It's bypass latency but still happens even if you don't touch the register until after the paddd has definitely retired (by padding with >ROB uops) before the loop.
Test program:
; taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread -r1 ./bypass-latency
default rel
global _start
_start:
vmovaps xmm1, [one] ; FP load into ymm1 (zeroing the upper lane)
vpaddd ymm1, ymm1,ymm0 ; ymm1 written in the ivec domain
;vxorps ymm1, ymm1,ymm1 ; In 2017, ymm1 still makes vaddps slow (5c) after this
; but I can't reproduce that now with updated microcode.
vxorps ymm0, ymm0, ymm0 ; zeroing-idiom on ymm0
mov rcx, 50000000
align 32 ; doesn't help or hurt, as expected since the bottleneck isn't frontend
.loop:
vaddps ymm0, ymm0,ymm1
vaddps ymm0, ymm0,ymm1
dec rcx
jnz .loop
xor edi,edi
mov eax,231
syscall ; exit_group(0)
section .rodata
align 16
one: times 4 dd 1.0
Perf results for a static executable on i7-6700k:
Performance counter stats for './foo' (4 runs):
129.01 msec task-clock # 0.998 CPUs utilized ( +- 0.51% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
2 page-faults # 0.016 K/sec
500,053,798 cycles # 3.876 GHz ( +- 0.00% )
50,000,042 branches # 387.576 M/sec ( +- 0.00% )
200,000,059 instructions # 0.40 insn per cycle ( +- 0.00% )
150,020,084 uops_issued.any # 1162.883 M/sec ( +- 0.00% )
150,014,866 uops_executed.thread # 1162.842 M/sec ( +- 0.00% )
0.129244 +- 0.000670 seconds time elapsed ( +- 0.52% )
500M cycles for 50M iterations = 10 cycle loop-carried dependency for 2x vaddps, or 5 each.
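The arithmetic behind that last line, written out as a small C sketch. The function name cycles_per_op is hypothetical; the inputs used in the note below are the cycle count from the perf output above and the loop parameters from the test program.

```c
#include <assert.h>

/* Cycles per dependent instruction = total cycles / loop iterations /
   number of instructions in the loop-carried chain per iteration. */
static double
cycles_per_op(double cycles, double iterations, int chain_len) {
  assert(iterations > 0.0 && chain_len > 0);
  return cycles / iterations / chain_len;
}
```

With cycles = 500053798, iterations = 50000000 and chain_len = 2, this gives roughly 5.0 cycles per vaddps, i.e. one cycle more than the usual 4.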

Why this SSE2 program (integers) generate movaps (float)?

The following loops transpose an integer matrix into another integer matrix. Interestingly, when I compiled it, GCC generated movaps instructions to store the result into the output matrix. Why does GCC do this?
data:
int __attribute__(( aligned(16))) t[N][M]
, __attribute__(( aligned(16))) c_tra[N][M];
loops:
for( i=0; i<N; i+=4){
for(j=0; j<M; j+=4){
row0 = _mm_load_si128((__m128i *)&t[i][j]);
row1 = _mm_load_si128((__m128i *)&t[i+1][j]);
row2 = _mm_load_si128((__m128i *)&t[i+2][j]);
row3 = _mm_load_si128((__m128i *)&t[i+3][j]);
__t0 = _mm_unpacklo_epi32(row0, row1);
__t1 = _mm_unpacklo_epi32(row2, row3);
__t2 = _mm_unpackhi_epi32(row0, row1);
__t3 = _mm_unpackhi_epi32(row2, row3);
/* values back into I[0-3] */
row0 = _mm_unpacklo_epi64(__t0, __t1);
row1 = _mm_unpackhi_epi64(__t0, __t1);
row2 = _mm_unpacklo_epi64(__t2, __t3);
row3 = _mm_unpackhi_epi64(__t2, __t3);
_mm_store_si128((__m128i *)&c_tra[j][i], row0);
_mm_store_si128((__m128i *)&c_tra[j+1][i], row1);
_mm_store_si128((__m128i *)&c_tra[j+2][i], row2);
_mm_store_si128((__m128i *)&c_tra[j+3][i], row3);
}
}
Assembly generated code:
.L39:
lea rcx, [rsi+rdx]
movdqa xmm1, XMMWORD PTR [rdx]
add rdx, 16
add rax, 2048
movdqa xmm6, XMMWORD PTR [rcx+rdi]
movdqa xmm3, xmm1
movdqa xmm2, XMMWORD PTR [rcx+r9]
punpckldq xmm3, xmm6
movdqa xmm5, XMMWORD PTR [rcx+r10]
movdqa xmm4, xmm2
punpckhdq xmm1, xmm6
punpckldq xmm4, xmm5
punpckhdq xmm2, xmm5
movdqa xmm5, xmm3
punpckhqdq xmm3, xmm4
punpcklqdq xmm5, xmm4
movdqa xmm4, xmm1
punpckhqdq xmm1, xmm2
punpcklqdq xmm4, xmm2
movaps XMMWORD PTR [rax-2048], xmm5
movaps XMMWORD PTR [rax-1536], xmm3
movaps XMMWORD PTR [rax-1024], xmm4
movaps XMMWORD PTR [rax-512], xmm1
cmp r11, rdx
jne .L39
Compiled with: gcc -Wall -msse4.2 -masm="intel" -O2 -c -S
CPU: Skylake
OS: Linux Mint
With -mavx2 or -march=native it generates the VEX-encoded version: vmovaps.
Functionally, those instructions are the same.
I don't like to copy and paste other people's statements as my own, so here are a few links explaining it:
Difference between MOVDQA and MOVAPS x86 instructions?
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/279587
http://masm32.com/board/index.php?topic=1138.0
https://www.gamedev.net/blog/615/entry-2250281-demystifying-sse-move-instructions/
Short version:
So for the most part, you should try to use the move instruction that
corresponds with the operations you are going to use on those
registers. However, there is an additional complication. Loads and
stores to and from memory execute on a separate port from the integer
and floating point units; thus instructions that load from memory into
a register or store from a register into memory will experience the
same delay regardless of the data type you attach to the move. Thus
in this case, movaps, movapd, and movdqa will have the same delay no
matter what data you use. Since movaps (and movups) is encoded in
binary form with one less byte than the other two, it makes sense to
use it for all reg-mem moves, regardless of the data type.
So it is a GCC optimization.
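A minimal sketch of why that substitution is safe: an aligned 128-bit store writes the same 16 bytes whether it is expressed as an integer store or as a float store of the same bits. The helper name stores_match is hypothetical; _mm_castsi128_ps only reinterprets the register and generates no instruction.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>
#include <string.h>

/* Hypothetical check: store the same four ints once through the integer
   intrinsic and once through the float intrinsic, then compare the bytes. */
static int
stores_match(int32_t a, int32_t b, int32_t c, int32_t d) {
  _Alignas(16) int32_t via_int[4];
  _Alignas(16) float   via_ps[4];
  __m128i v = _mm_set_epi32(d, c, b, a);       /* lanes 0..3: a, b, c, d */

  _mm_store_si128((__m128i *)via_int, v);      /* integer-typed store    */
  _mm_store_ps(via_ps, _mm_castsi128_ps(v));   /* float-typed store      */

  assert(via_int[0] == a);                     /* lane 0 holds a         */
  return memcmp(via_int, via_ps, sizeof via_int) == 0;
}
```

Which mnemonic the compiler actually picks for each intrinsic is its own choice, as the question's output shows.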

arithmetics optimisation inside C for-loop

I have two functions with for-loops, which look very similar. The amount of data to process is very large, so I am trying to optimise the cycles as much as possible. The execution time for the second function is 320 sec, but the first one takes 460 sec. Could somebody please give me any suggestions as to what makes the difference and how to optimise the computation?
int ii, jj;
double c1, c2;
for (ii = 0; ii < n; ++ii) {
a[jj] += b[ii] * c1;
a[++jj] += b[ii] * c2;
}
The second one:
int ii, jj;
double c1, c2;
for (ii = 0; ii < n; ++ii) {
b[ii] += a[jj] * c1;
b[ii] += a[++jj] * c2;
}
And here is the assembler output for the first loop:
movl -104(%rbp), %eax
movq -64(%rbp), %rcx
cmpl (%rcx), %eax
jge LBB0_12
## BB#10: ## in Loop: Header=BB0_9 Depth=5
movslq -88(%rbp), %rax
movq -48(%rbp), %rcx
movsd (%rcx,%rax,8), %xmm0 ## xmm0 = mem[0],zero
mulsd -184(%rbp), %xmm0
movslq -108(%rbp), %rax
movq -224(%rbp), %rcx ## 8-byte Reload
addsd (%rcx,%rax,8), %xmm0
movsd %xmm0, (%rcx,%rax,8)
movslq -88(%rbp), %rax
movq -48(%rbp), %rdx
movsd (%rdx,%rax,8), %xmm0 ## xmm0 = mem[0],zero
mulsd -192(%rbp), %xmm0
movl -108(%rbp), %esi
addl $1, %esi
movl %esi, -108(%rbp)
movslq %esi, %rax
addsd (%rcx,%rax,8), %xmm0
movsd %xmm0, (%rcx,%rax,8)
movl -88(%rbp), %esi
addl $1, %esi
movl %esi, -88(%rbp)
and for the second one:
movl -104(%rbp), %eax
movq -64(%rbp), %rcx
cmpl (%rcx), %eax
jge LBB0_12
## BB#10: ## in Loop: Header=BB0_9 Depth=5
movslq -108(%rbp), %rax
movq -224(%rbp), %rcx ## 8-byte Reload
movsd (%rcx,%rax,8), %xmm0 ## xmm0 = mem[0],zero
mulsd -184(%rbp), %xmm0
movslq -88(%rbp), %rax
movq -48(%rbp), %rdx
addsd (%rdx,%rax,8), %xmm0
movsd %xmm0, (%rdx,%rax,8)
movl -108(%rbp), %esi
addl $1, %esi
movl %esi, -108(%rbp)
movslq %esi, %rax
movsd (%rcx,%rax,8), %xmm0 ## xmm0 = mem[0],zero
mulsd -192(%rbp), %xmm0
movslq -88(%rbp), %rax
movq -48(%rbp), %rdx
addsd (%rdx,%rax,8), %xmm0
movsd %xmm0, (%rdx,%rax,8)
movl -88(%rbp), %esi
addl $1, %esi
movl %esi, -88(%rbp)
The original function is much bigger, so here I provided only the pieces responsible for those for-loops. The rest of the c-code and its assembler output is exactly the same for both functions.
The structure of that calculation is pretty weird, but it can be optimized significantly. Some problems with that code are:
Reloading data from a pointer after writing through another pointer that isn't known not to alias it. I assume they won't alias, because this algorithm would be even weirder if that were allowed, but if they really are supposed to be able to alias, ignore this. In general, structure your loop body as: first load everything, do the calculations, then store back. Don't mix loading and storing; it makes the compiler more conservative.
Reloading data that was stored in the previous iteration. The compiler can see through this a bit, but it complicates matters. Don't do it.
Implicitly treating the first and last items differently. It looks like a nice homogeneous loop at first, but due to its weird structure it's actually special-casing the first and last elements.
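To make the first point concrete, here is a minimal sketch of the "load everything, calculate, then store" shape. The function name `scaled_pair_add` and its parameters are made up for illustration and do not come from the original code. The C99 `restrict` qualifier is an extra assumption on my part: it promises the compiler that the two arrays never overlap, so it is only appropriate if that is really true in the caller.

```c
#include <stddef.h>

/* Hypothetical helper, not from the original code.
 * Computes dst[i] += src[i] * c1 + src[i + 1] * c2 for i in [0, n),
 * so src must hold at least n + 1 elements.
 * `restrict` promises that dst and src do not overlap; calling this
 * with overlapping arrays is undefined behavior. */
void scaled_pair_add(double *restrict dst, const double *restrict src,
                     size_t n, double c1, double c2)
{
    for (size_t i = 0; i < n; ++i) {
        /* 1. load everything this iteration needs */
        double s0 = src[i];
        double s1 = src[i + 1];
        double d  = dst[i];

        /* 2. do the calculation in locals */
        d += s0 * c1 + s1 * c2;

        /* 3. store once */
        dst[i] = d;
    }
}
```

With a single store per iteration and no load that follows a store to a possibly aliasing location, the compiler has much less to be conservative about.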
So let's first fix the second loop, which is simpler. The only problem here is the first store to b[ii], which has to Really Happen(tm) because it might alias with a[jj + 1]. But it can trivially be written so that that problem goes away:
for (ii = 0; ii < n; ++ii) {
b[ii] += a[jj] * c1 + a[jj + 1] * c2;
jj++;
}
You can tell by the assembly output that the compiler is happier now, and of course benchmarking confirms it's faster.
Old asm (only main loop, not the extra cruft):
.LBB0_14: # =>This Inner Loop Header: Depth=1
vmulpd ymm4, ymm2, ymmword ptr [r8 - 8]
vaddpd ymm4, ymm4, ymmword ptr [rax]
vmovupd ymmword ptr [rax], ymm4
vmulpd ymm5, ymm3, ymmword ptr [r8]
vaddpd ymm4, ymm4, ymm5
vmovupd ymmword ptr [rax], ymm4
add r8, 32
add rax, 32
add r11, -4
jne .LBB0_14
New asm (only main loop):
.LBB1_20: # =>This Inner Loop Header: Depth=1
vmulpd ymm4, ymm2, ymmword ptr [rax - 104]
vmulpd ymm5, ymm2, ymmword ptr [rax - 72]
vmulpd ymm6, ymm2, ymmword ptr [rax - 40]
vmulpd ymm7, ymm2, ymmword ptr [rax - 8]
vmulpd ymm8, ymm3, ymmword ptr [rax - 96]
vmulpd ymm9, ymm3, ymmword ptr [rax - 64]
vmulpd ymm10, ymm3, ymmword ptr [rax - 32]
vmulpd ymm11, ymm3, ymmword ptr [rax]
vaddpd ymm4, ymm4, ymm8
vaddpd ymm5, ymm5, ymm9
vaddpd ymm6, ymm6, ymm10
vaddpd ymm7, ymm7, ymm11
vaddpd ymm4, ymm4, ymmword ptr [rcx - 96]
vaddpd ymm5, ymm5, ymmword ptr [rcx - 64]
vaddpd ymm6, ymm6, ymmword ptr [rcx - 32]
vaddpd ymm7, ymm7, ymmword ptr [rcx]
vmovupd ymmword ptr [rcx - 96], ymm4
vmovupd ymmword ptr [rcx - 64], ymm5
vmovupd ymmword ptr [rcx - 32], ymm6
vmovupd ymmword ptr [rcx], ymm7
sub rax, -128
sub rcx, -128
add rbx, -16
jne .LBB1_20
That also got unrolled more (automatically). Unrolling isn't useless, but reducing the loop overhead usually isn't such a big deal, since it can mostly be handled by the ports that aren't busy with vector instructions. The more significant difference is the reduction in stores, which takes it from a ratio of 2/3 (potentially bottlenecked by store throughput, where half the stores are useless) to 4/12 (bottlenecked by something that really has to happen).
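A rewrite like this is easy to sanity-check by running both versions on the same input. The sketch below is only an illustration. `second_loop_original` is my reading of the unoptimized assembly in the question, starting `jj` at 0 is an assumption, and all function names are made up. The test data is integer-valued on purpose: the rewrite adds the two products together before adding them to b[ii], so with arbitrary doubles the results may differ in the last bit.

```c
/* Second loop as I read it from the question's assembly (an assumption):
 * two separate read-modify-write updates of b[ii] per iteration. */
void second_loop_original(double *b, const double *a, int n,
                          double c1, double c2)
{
    int ii = 0, jj = 0;   /* starting jj at 0 is assumed */
    for (int k = 0; k < n; ++k) {
        b[ii] += a[jj] * c1;
        jj++;
        b[ii] += a[jj] * c2;
        ii++;
    }
}

/* Rewritten form: a single update of b[ii] per iteration. */
void second_loop_rewritten(double *b, const double *a, int n,
                           double c1, double c2)
{
    int jj = 0;
    for (int ii = 0; ii < n; ++ii) {
        b[ii] += a[jj] * c1 + a[jj + 1] * c2;
        jj++;
    }
}

/* Returns 1 if both versions agree on a small integer-valued input. */
int second_loop_versions_agree(void)
{
    enum { N = 8 };
    double a[N + 1], b1[N], b2[N];

    for (int i = 0; i <= N; ++i)
        a[i] = i + 1;
    for (int i = 0; i < N; ++i)
        b1[i] = b2[i] = 10 * i;

    second_loop_original(b1, a, N, 2.0, 3.0);
    second_loop_rewritten(b2, a, N, 2.0, 3.0);

    for (int i = 0; i < N; ++i)
        if (b1[i] != b2[i])
            return 0;
    return 1;
}
```

Calling `second_loop_versions_agree()` from a test or from `main` gives a quick yes/no answer before you start benchmarking.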
Now for that first loop, once you take out the first and last iterations, it's just adding two scaled b's to every a, and then we put the first and last iterations back in separately:
a[0] += b[0] * c1;
for (ii = 1; ii < n; ++ii) {
a[ii] += b[ii - 1] * c2 + b[ii] * c1;
}
a[n] += b[n - 1] * c2;
That takes it from this (note that this isn't even vectorized):
.LBB0_3: # =>This Inner Loop Header: Depth=1
vmulsd xmm3, xmm0, qword ptr [rsi + 8*rax]
vaddsd xmm2, xmm2, xmm3
vmovsd qword ptr [rdi + 8*rax], xmm2
vmulsd xmm2, xmm1, qword ptr [rsi + 8*rax]
vaddsd xmm2, xmm2, qword ptr [rdi + 8*rax + 8]
vmovsd qword ptr [rdi + 8*rax + 8], xmm2
vmulsd xmm3, xmm0, qword ptr [rsi + 8*rax + 8]
vaddsd xmm2, xmm2, xmm3
vmovsd qword ptr [rdi + 8*rax + 8], xmm2
vmulsd xmm2, xmm1, qword ptr [rsi + 8*rax + 8]
vaddsd xmm2, xmm2, qword ptr [rdi + 8*rax + 16]
vmovsd qword ptr [rdi + 8*rax + 16], xmm2
lea rax, [rax + 2]
cmp ecx, eax
jne .LBB0_3
To this:
.LBB1_6: # =>This Inner Loop Header: Depth=1
vmulpd ymm4, ymm2, ymmword ptr [rbx - 104]
vmulpd ymm5, ymm2, ymmword ptr [rbx - 72]
vmulpd ymm6, ymm2, ymmword ptr [rbx - 40]
vmulpd ymm7, ymm2, ymmword ptr [rbx - 8]
vmulpd ymm8, ymm3, ymmword ptr [rbx - 96]
vmulpd ymm9, ymm3, ymmword ptr [rbx - 64]
vmulpd ymm10, ymm3, ymmword ptr [rbx - 32]
vmulpd ymm11, ymm3, ymmword ptr [rbx]
vaddpd ymm4, ymm4, ymm8
vaddpd ymm5, ymm5, ymm9
vaddpd ymm6, ymm6, ymm10
vaddpd ymm7, ymm7, ymm11
vaddpd ymm4, ymm4, ymmword ptr [rcx - 96]
vaddpd ymm5, ymm5, ymmword ptr [rcx - 64]
vaddpd ymm6, ymm6, ymmword ptr [rcx - 32]
vaddpd ymm7, ymm7, ymmword ptr [rcx]
vmovupd ymmword ptr [rcx - 96], ymm4
vmovupd ymmword ptr [rcx - 64], ymm5
vmovupd ymmword ptr [rcx - 32], ymm6
vmovupd ymmword ptr [rcx], ymm7
sub rbx, -128
sub rcx, -128
add r11, -16
jne .LBB1_6
Nice and vectorized this time, and much less storing and loading going on.
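The same kind of check works for the first loop. Again this is a sketch under assumptions: `first_loop_original` is my reading of the question's unoptimized assembly with `ii` and `jj` both starting at 0, the function names are hypothetical, and the `n < 1` guard is something I added so that the peeled first and last iterations never touch b[-1]. Integer-valued inputs keep the comparison exact despite the reordered additions.

```c
/* First loop as I read it from the question's assembly (an assumption).
 * a must hold n + 1 elements, b must hold n. */
void first_loop_original(double *a, const double *b, int n,
                         double c1, double c2)
{
    int ii = 0, jj = 0;
    for (int k = 0; k < n; ++k) {
        a[jj] += b[ii] * c1;
        jj++;
        a[jj] += b[ii] * c2;
        ii++;
    }
}

/* Rewritten form: first and last iterations peeled off. */
void first_loop_rewritten(double *a, const double *b, int n,
                          double c1, double c2)
{
    if (n < 1)            /* guard added here; the original does nothing for n == 0 */
        return;
    a[0] += b[0] * c1;
    for (int ii = 1; ii < n; ++ii)
        a[ii] += b[ii - 1] * c2 + b[ii] * c1;
    a[n] += b[n - 1] * c2;
}

/* Returns 1 if both versions agree on a small integer-valued input. */
int first_loop_versions_agree(void)
{
    enum { N = 8 };
    double a1[N + 1], a2[N + 1], b[N];

    for (int i = 0; i <= N; ++i)
        a1[i] = a2[i] = 100 + i;
    for (int i = 0; i < N; ++i)
        b[i] = i + 1;

    first_loop_original(a1, b, N, 2.0, 3.0);
    first_loop_rewritten(a2, b, N, 2.0, 3.0);

    for (int i = 0; i <= N; ++i)
        if (a1[i] != a2[i])
            return 0;
    return 1;
}
```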
Both changes combined made it about twice as fast on my PC but of course YMMV.
I still think this code is weird, though. Note how we're modifying a[n] in the last iteration of the first loop, then using it in the first iteration of the second loop, while the other a's just sort of stand to the side and watch. It's odd. Maybe it really has to be that way, but frankly it looks like a bug to me.

Resources