What's the most efficient implementation of dense-matrix-vector multiplication?

If M is a dense m x n matrix and v is an n-component vector, then the product u = Mv is an m-component vector given by u[i] = sum(M[i,j] * v[j], 1 <= j <= n). One simple implementation of this multiplication is
allocate m-component vector u of zeroes
for i = 1:m
for j = 1:n
u[i] += M[i,j] * v[j]
end
end
which builds up the vector u one element at a time. Another implementation is to interchange the loops:
allocate m-component vector u of zeroes
for j = 1:n
for i = 1:m
u[i] += M[i,j] * v[j]
end
end
where the entire vector is built up together.
Which of these implementations (if either) is typically used in languages like C and Fortran that are designed for efficient numerical computation? My guess is that languages like C that internally store matrices in row-major order use the former implementation, while languages like Fortran that use column-major order use the latter, so that the inner loop accesses consecutive memory locations for the matrix M. Is this correct?
The former implementation seems more efficient, because the memory location being written to only changes m times, while in the latter implementation the memory location being written to changes m*n times, even though only m unique locations are ever written to. (Of course, by the same logic, the latter implementation would be more efficient for row-vector-matrix multiplication, but this is much less common.) On the other hand, I believe that Fortran is typically faster at dense-matrix-vector multiplication than C, so perhaps I am either (a) guessing their implementations wrong, or (b) misjudging the relative efficiency of the two implementations.

Probably using an established BLAS implementation is the most common approach. Apart from that, there are some issues with simple implementations that may be interesting to look at. For example, in C (or C++ for that matter), pointer aliasing often prevents a lot of optimization, and thus this
void matvec(double *M, size_t n, size_t m, double *v, double * u)
{
for (size_t i = 0; i < m; i++) {
for (size_t j = 0; j < n; j++) {
u[i] += M[i * n + j] * v[j];
}
}
}
is turned into this by Clang 5 (inner loop excerpt):
.LBB0_4: # Parent Loop BB0_3 Depth=1
vmovsd xmm1, qword ptr [rcx + 8*rax] # xmm1 = mem[0],zero
vfmadd132sd xmm1, xmm0, qword ptr [r13 + 8*rax - 24]
vmovsd qword ptr [r8 + 8*rbx], xmm1
vmovsd xmm0, qword ptr [rcx + 8*rax + 8] # xmm0 = mem[0],zero
vfmadd132sd xmm0, xmm1, qword ptr [r13 + 8*rax - 16]
vmovsd qword ptr [r8 + 8*rbx], xmm0
vmovsd xmm1, qword ptr [rcx + 8*rax + 16] # xmm1 = mem[0],zero
vfmadd132sd xmm1, xmm0, qword ptr [r13 + 8*rax - 8]
vmovsd qword ptr [r8 + 8*rbx], xmm1
vmovsd xmm0, qword ptr [rcx + 8*rax + 24] # xmm0 = mem[0],zero
vfmadd132sd xmm0, xmm1, qword ptr [r13 + 8*rax]
vmovsd qword ptr [r8 + 8*rbx], xmm0
add rax, 4
cmp r11, rax
jne .LBB0_4
That really hurts to look at, and it will hurt even more to execute. The compiler "had to" do this because u may alias with M and/or v, so stores into u are treated with great suspicion ("had to" is in quotes because the compiler could insert a test for aliasing and have a fast path for the nice case). In Fortran, procedure arguments by default cannot alias, so this problem wouldn't have existed. This is a typical reason why code that is just randomly typed out without special tricks is faster in Fortran than in C - the rest of my answer won't be about that, I'm just going to make the C code a bit faster (in the end I get back to a column-major M). In C, one way the aliasing problem can be fixed is with restrict; about the only thing it has going for it is that it's not intrusive (using an explicit accumulator instead of summing into u[i] also does the trick, without relying on a magic keyword):
void matvec(double *M, size_t n, size_t m, double *v, double * restrict u)
{
for (size_t i = 0; i < m; i++) {
for (size_t j = 0; j < n; j++) {
u[i] += M[i * n + j] * v[j];
}
}
}
Now this happens:
.LBB0_8: # Parent Loop BB0_3 Depth=1
vmovupd ymm5, ymmword ptr [rcx + 8*rbx]
vmovupd ymm6, ymmword ptr [rcx + 8*rbx + 32]
vmovupd ymm7, ymmword ptr [rcx + 8*rbx + 64]
vmovupd ymm8, ymmword ptr [rcx + 8*rbx + 96]
vfmadd132pd ymm5, ymm1, ymmword ptr [rax + 8*rbx - 224]
vfmadd132pd ymm6, ymm2, ymmword ptr [rax + 8*rbx - 192]
vfmadd132pd ymm7, ymm3, ymmword ptr [rax + 8*rbx - 160]
vfmadd132pd ymm8, ymm4, ymmword ptr [rax + 8*rbx - 128]
vmovupd ymm1, ymmword ptr [rcx + 8*rbx + 128]
vmovupd ymm2, ymmword ptr [rcx + 8*rbx + 160]
vmovupd ymm3, ymmword ptr [rcx + 8*rbx + 192]
vmovupd ymm4, ymmword ptr [rcx + 8*rbx + 224]
vfmadd132pd ymm1, ymm5, ymmword ptr [rax + 8*rbx - 96]
vfmadd132pd ymm2, ymm6, ymmword ptr [rax + 8*rbx - 64]
vfmadd132pd ymm3, ymm7, ymmword ptr [rax + 8*rbx - 32]
vfmadd132pd ymm4, ymm8, ymmword ptr [rax + 8*rbx]
add rbx, 32
add rbp, 2
jne .LBB0_8
It's not scalar any more, so that's good. But it's not ideal. While there are 8 FMAs here, they're arranged in four pairs of dependent FMAs. Taken across the whole loop, there are actually only 4 independent dependency chains of FMAs. FMA typically has a long latency and decent throughput though: for example, on Skylake it has a latency of 4 and a throughput of 2/cycle, so 4 × 2 = 8 independent chains of FMAs are needed to utilize all of that compute throughput. Haswell is even worse: FMA had a latency of 5 and already a throughput of 2/cycle, so it needed 10 independent chains. Another problem is that it is hard to actually feed all of those FMAs, and the structure above does not even really try: it uses 2 loads per FMA, while loads actually have the same throughput as FMAs, so their ratio should be 1:1.
Improving the load:FMA ratio can be done by unrolling the outer loop, which lets us re-use the loads from v for several rows of M (this is not even a caching consideration, but it helps for that, too). Unrolling the outer loop also works towards the goal of having more independent chains of FMAs. Compilers typically don't like to unroll anything but the inner loop, so this takes some manual work. "Tail" iterations omitted (or: assume m is a multiple of 4).
void matvec(double *M, size_t n, size_t m, double *v, double * restrict u)
{
size_t i;
for (i = 0; i + 3 < m; i += 4) {
for (size_t j = 0; j < n; j++) {
size_t it = i;
u[it] += M[it * n + j] * v[j];
it++;
u[it] += M[it * n + j] * v[j];
it++;
u[it] += M[it * n + j] * v[j];
it++;
u[it] += M[it * n + j] * v[j];
}
}
}
Unfortunately, Clang still decides to unroll the inner loop the wrong way, "wrong" here meaning a naive serial unroll. There's not much point as long as there are still only 4 independent chains:
.LBB0_8: # Parent Loop BB0_3 Depth=1
vmovupd ymm5, ymmword ptr [rcx + 8*rdx]
vmovupd ymm6, ymmword ptr [rcx + 8*rdx + 32]
vfmadd231pd ymm4, ymm5, ymmword ptr [r12 + 8*rdx - 32]
vfmadd231pd ymm3, ymm5, ymmword ptr [r13 + 8*rdx - 32]
vfmadd231pd ymm2, ymm5, ymmword ptr [rax + 8*rdx - 32]
vfmadd231pd ymm1, ymm5, ymmword ptr [rbx + 8*rdx - 32]
vfmadd231pd ymm4, ymm6, ymmword ptr [r12 + 8*rdx]
vfmadd231pd ymm3, ymm6, ymmword ptr [r13 + 8*rdx]
vfmadd231pd ymm2, ymm6, ymmword ptr [rax + 8*rdx]
vfmadd231pd ymm1, ymm6, ymmword ptr [rbx + 8*rdx]
add rdx, 8
add rdi, 2
jne .LBB0_8
This problem goes away if we stop being lazy and finally make some explicit accumulators:
void matvec(double *M, size_t n, size_t m, double *v, double *u)
{
size_t i;
for (i = 0; i + 3 < m; i += 4) {
double t0 = 0, t1 = 0, t2 = 0, t3 = 0;
for (size_t j = 0; j < n; j++) {
size_t it = i;
t0 += M[it * n + j] * v[j];
it++;
t1 += M[it * n + j] * v[j];
it++;
t2 += M[it * n + j] * v[j];
it++;
t3 += M[it * n + j] * v[j];
}
u[i] += t0;
u[i + 1] += t1;
u[i + 2] += t2;
u[i + 3] += t3;
}
}
Now Clang does this:
.LBB0_6: # Parent Loop BB0_3 Depth=1
vmovupd ymm8, ymmword ptr [r10 - 32]
vmovupd ymm9, ymmword ptr [r10]
vfmadd231pd ymm6, ymm8, ymmword ptr [rdi]
vfmadd231pd ymm7, ymm9, ymmword ptr [rdi + 32]
lea rax, [rdi + r14]
vfmadd231pd ymm4, ymm8, ymmword ptr [rdi + 8*rsi]
vfmadd231pd ymm5, ymm9, ymmword ptr [rdi + 8*rsi + 32]
vfmadd231pd ymm1, ymm8, ymmword ptr [rax + 8*rsi]
vfmadd231pd ymm3, ymm9, ymmword ptr [rax + 8*rsi + 32]
lea rax, [rax + r14]
vfmadd231pd ymm0, ymm8, ymmword ptr [rax + 8*rsi]
vfmadd231pd ymm2, ymm9, ymmword ptr [rax + 8*rsi + 32]
add rdi, 64
add r10, 64
add rbp, -8
jne .LBB0_6
Which is decent. The load:FMA ratio is 10:8 and there are too few accumulators for Haswell, so some improvement is still possible. Some other interesting unrolling combinations are (outer x inner) 4x3 (12 accumulators, 3 temporaries, 5/4 load:FMA), 5x2 (10, 2, 6/5), 7x2 (14, 2, 8/7), 15x1 (15, 1, 16/15). That makes it look as though unrolling the outer loop is better, but having too many different streams (even if not "streaming" in the sense of "streaming loads") is bad for automatic prefetching, and when actually streaming it may be bad to exceed the number of fill buffers (actual details are scarce). Manual prefetching is also an option. Getting to an actually good MVM procedure would take a lot more work, trying out a lot of these things.
Saving the stores into u until after the inner loop meant that restrict was no longer necessary. Most impressive, I think, is that no SIMD intrinsics were needed to get this far - Clang is pretty good at this, as long as there is no scary potential aliasing. GCC and ICC do try, but don't unroll enough; yet more manual unrolling would probably do the trick.
Loop tiling is also an option, but this is MVM. Tiling is extremely necessary for MMM, but MMM has an almost unlimited amount of data reuse, which MVM does not have. Only the vector is reused; the matrix is just streamed through once. Likely the memory throughput needed to stream a huge matrix will be a bigger problem than the vector not fitting in cache.
With a column-major M, it's different, with no significant loop-carried dependency. There is a dependency through memory, but it has a lot of time to resolve. The load:FMA ratio still has to be reduced though, so it still takes some unrolling of the outer loop, but overall it seems easier to deal with. It can be rearranged to use mostly additions, but FMA has a high throughput anyway (on HSW, higher than addition!). It does not need the horizontal sums, which were annoying, but they happened outside the inner loop anyway. In return there are stores in the inner loop. Without trying it, I don't expect a large inherent difference between the two approaches; it seems like both should be tunable to between 80 and 90 percent of the compute throughput (for cacheable sizes). The "annoying extra load" inherently prevents getting arbitrarily close to 100% either way.
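For concreteness, here is a minimal, unoptimized sketch of that column-major variant. The M[i + j * m] indexing is my assumption for how a column-major matrix would be laid out (it is not from the question), and none of the unrolling tricks above are applied:
#include <stddef.h>

void matvec_colmajor(const double *M, size_t n, size_t m, const double *v, double * __restrict u)
{
    // M is stored column-major: element (i, j) lives at M[i + j * m]
    for (size_t j = 0; j < n; j++) {
        double vj = v[j];              // one element of v is reused for the whole column
        for (size_t i = 0; i < m; i++) {
            u[i] += M[i + j * m] * vj; // contiguous loads from M, contiguous stores to u
        }
    }
}
As described above, the loop-carried dependency is now only through memory (u[i] is written in one j iteration and read again in the next), and the remaining tuning work is mostly about the load:FMA ratio via outer-loop unrolling.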


Counting differences between 2 buffers seems too slow

My problem
I have 2 adjacent buffers of bytes of identical size (around 20 MB each). I just want to count the differences between them.
My question
How much time should this loop take to run on a 4.8 GHz Intel i7-9700K with 3600 MT/s RAM?
How do we compute the maximum theoretical speed?
What I tried
uint64_t compareFunction(const char *const __restrict buffer, const uint64_t commonSize)
{
uint64_t diffFound = 0;
for(uint64_t byte = 0; byte < commonSize; ++byte)
diffFound += static_cast<uint64_t>(buffer[byte] != buffer[byte + commonSize]);
return diffFound;
}
It takes 11 ms on my PC (9700K at 4.8 GHz, RAM 3600, Windows 10, Clang 14.0.6 -O3 MinGW) and I feel it is too slow and that I am missing something.
40 MB should take less than 2 ms to be read by the CPU (my RAM bandwidth is between 20 and 30 GB/s).
I don't know how to count the cycles required to execute one iteration (especially because CPUs are superscalar nowadays). If I assume 1 cycle per operation and if I don't mess up my counting, it should be 10 ops per iteration -> 200 million ops -> at 4.8 GHz with only one execution unit -> 40 ms. Obviously I am wrong about how to compute the number of cycles per loop.
Fun fact: I tried on Linux PopOS with GCC 11.2 -O3 and it ran at 4.5 ms. Why such a difference?
Here are the disassemblies (scalar and vectorised) produced by Clang:
compareFunction(char const*, unsigned long): # #compareFunction(char const*, unsigned long)
test rsi, rsi
je .LBB0_1
lea r8, [rdi + rsi]
neg rsi
xor edx, edx
xor eax, eax
.LBB0_4: # =>This Inner Loop Header: Depth=1
movzx r9d, byte ptr [rdi + rdx]
xor ecx, ecx
cmp r9b, byte ptr [r8 + rdx]
setne cl
add rax, rcx
add rdx, 1
mov rcx, rsi
add rcx, rdx
jne .LBB0_4
ret
.LBB0_1:
xor eax, eax
ret
Clang14 O3:
.LCPI0_0:
.quad 1 # 0x1
.quad 1 # 0x1
compareFunction(char const*, unsigned long): # #compareFunction(char const*, unsigned long)
test rsi, rsi
je .LBB0_1
cmp rsi, 4
jae .LBB0_4
xor r9d, r9d
xor eax, eax
jmp .LBB0_11
.LBB0_1:
xor eax, eax
ret
.LBB0_4:
mov r9, rsi
and r9, -4
lea rax, [r9 - 4]
mov r8, rax
shr r8, 2
add r8, 1
test rax, rax
je .LBB0_5
mov rdx, r8
and rdx, -2
lea r10, [rdi + 6]
lea r11, [rdi + rsi]
add r11, 6
pxor xmm0, xmm0
xor eax, eax
pcmpeqd xmm2, xmm2
movdqa xmm3, xmmword ptr [rip + .LCPI0_0] # xmm3 = [1,1]
pxor xmm1, xmm1
.LBB0_7: # =>This Inner Loop Header: Depth=1
movzx ecx, word ptr [r10 + rax - 6]
movd xmm4, ecx
movzx ecx, word ptr [r10 + rax - 4]
movd xmm5, ecx
movzx ecx, word ptr [r11 + rax - 6]
movd xmm6, ecx
pcmpeqb xmm6, xmm4
movzx ecx, word ptr [r11 + rax - 4]
movd xmm7, ecx
pcmpeqb xmm7, xmm5
pxor xmm6, xmm2
punpcklbw xmm6, xmm6 # xmm6 = xmm6[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm4, xmm6, 212 # xmm4 = xmm6[0,1,1,3,4,5,6,7]
pshufd xmm4, xmm4, 212 # xmm4 = xmm4[0,1,1,3]
pand xmm4, xmm3
paddq xmm4, xmm0
pxor xmm7, xmm2
punpcklbw xmm7, xmm7 # xmm7 = xmm7[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm0, xmm7, 212 # xmm0 = xmm7[0,1,1,3,4,5,6,7]
pshufd xmm5, xmm0, 212 # xmm5 = xmm0[0,1,1,3]
pand xmm5, xmm3
paddq xmm5, xmm1
movzx ecx, word ptr [r10 + rax - 2]
movd xmm0, ecx
movzx ecx, word ptr [r10 + rax]
movd xmm1, ecx
movzx ecx, word ptr [r11 + rax - 2]
movd xmm6, ecx
pcmpeqb xmm6, xmm0
movzx ecx, word ptr [r11 + rax]
movd xmm7, ecx
pcmpeqb xmm7, xmm1
pxor xmm6, xmm2
punpcklbw xmm6, xmm6 # xmm6 = xmm6[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm0, xmm6, 212 # xmm0 = xmm6[0,1,1,3,4,5,6,7]
pshufd xmm0, xmm0, 212 # xmm0 = xmm0[0,1,1,3]
pand xmm0, xmm3
paddq xmm0, xmm4
pxor xmm7, xmm2
punpcklbw xmm7, xmm7 # xmm7 = xmm7[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm1, xmm7, 212 # xmm1 = xmm7[0,1,1,3,4,5,6,7]
pshufd xmm1, xmm1, 212 # xmm1 = xmm1[0,1,1,3]
pand xmm1, xmm3
paddq xmm1, xmm5
add rax, 8
add rdx, -2
jne .LBB0_7
test r8b, 1
je .LBB0_10
.LBB0_9:
movzx ecx, word ptr [rdi + rax]
movd xmm2, ecx
movzx ecx, word ptr [rdi + rax + 2]
movd xmm3, ecx
add rax, rsi
movzx ecx, word ptr [rdi + rax]
movd xmm4, ecx
pcmpeqb xmm4, xmm2
movzx eax, word ptr [rdi + rax + 2]
movd xmm2, eax
pcmpeqb xmm2, xmm3
pcmpeqd xmm3, xmm3
pxor xmm4, xmm3
punpcklbw xmm4, xmm4 # xmm4 = xmm4[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm4, xmm4, 212 # xmm4 = xmm4[0,1,1,3,4,5,6,7]
pshufd xmm4, xmm4, 212 # xmm4 = xmm4[0,1,1,3]
movdqa xmm5, xmmword ptr [rip + .LCPI0_0] # xmm5 = [1,1]
pand xmm4, xmm5
paddq xmm0, xmm4
pxor xmm2, xmm3
punpcklbw xmm2, xmm2 # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw xmm2, xmm2, 212 # xmm2 = xmm2[0,1,1,3,4,5,6,7]
pshufd xmm2, xmm2, 212 # xmm2 = xmm2[0,1,1,3]
pand xmm2, xmm5
paddq xmm1, xmm2
.LBB0_10:
paddq xmm0, xmm1
pshufd xmm1, xmm0, 238 # xmm1 = xmm0[2,3,2,3]
paddq xmm1, xmm0
movq rax, xmm1
cmp r9, rsi
je .LBB0_13
.LBB0_11:
lea r8, [r9 + rsi]
sub rsi, r9
add r8, rdi
add rdi, r9
xor edx, edx
.LBB0_12: # =>This Inner Loop Header: Depth=1
movzx r9d, byte ptr [rdi + rdx]
xor ecx, ecx
cmp r9b, byte ptr [r8 + rdx]
setne cl
add rax, rcx
add rdx, 1
cmp rsi, rdx
jne .LBB0_12
.LBB0_13:
ret
.LBB0_5:
pxor xmm0, xmm0
xor eax, eax
pxor xmm1, xmm1
test r8b, 1
jne .LBB0_9
jmp .LBB0_10
TL;DR: the reason the Clang code is so slow is a poor vectorization method that saturates port 5 (a known, common issue). GCC does a better job here, but it is still far from efficient. One can write much faster chunk-based code using AVX2 that does not saturate port 5.
Analysis of the unvectorized Clang code
To understand what is going on, it is better to start with a simple example. Indeed, as you said, modern processors are superscalar, so it is not easy to reason about the speed of generated code on such an architecture.
The code generated by Clang with the -O1 optimization flag is a good start. Here is the hot loop from the Godbolt output provided in your question:
(instructions) (ports)
.LBB0_4:
movzx r9d, byte ptr [rdi + rdx] p23
xor ecx, ecx p0156
cmp r9b, byte ptr [r8 + rdx] p0156+p23
setne cl p06
add rax, rcx p0156
add rdx, 1 p0156
mov rcx, rsi (optimized)
add rcx, rdx p0156
jne .LBB0_4 p06
Modern processors like the Coffee Lake 9700K are structured in two big parts: a front-end that fetches/decodes the instructions (and splits them into micro-operations, aka uops), and a back-end that schedules/executes them. The back-end schedules the uops on many ports, and each port can execute some specific set of instructions (e.g. only memory loads, or only arithmetic instructions). For each instruction, I list the ports that can execute it. p0156+p23 means the instruction is split into two uops: the first can be executed by port 0, 1, 5 or 6, and the second by port 2 or 3. Note that the front-end can optimize the code so as not to produce any uops for basic instructions like the mov in the loop (thanks to a mechanism called register renaming).
For each loop iteration, the processor needs to read 2 values from memory. A Coffee Lake processor like the 9700K can load two values per cycle, so the loop will take at least 1 cycle/iteration (assuming the loads into r9d and r9b do not conflict due to using different parts of the same 64-bit r9 register). This processor has a uop cache and the loop has a lot of instructions, so the decoding part should not be a problem. That being said, there are 9 uops to execute and the processor can only execute 6 of them per cycle, so the loop cannot take less than 1.5 cycles/iteration. More precisely, ports 0, 1, 5 and 6 are under pressure, so even assuming the processor perfectly load-balances the uops, 2 cycles/iteration are needed. This is an optimistic lower bound on the execution time, since the processor may not schedule the instructions perfectly and there are many things that could possibly go wrong (like a sneaky hidden dependency I did not see). With a frequency of 4.8 GHz and about 20 million iterations (one per byte of a 20 MB buffer), 2 cycles/iteration gives a final execution time of at least 8.3 ms. It can reach 12.5 ms with 3 cycles/iteration (note that 2.5 cycles/iteration is possible due to the scheduling of uops to ports).
The loop can be improved using unrolling. Indeed, a significant number of instructions are needed just to run the loop rather than to do the actual computation. Unrolling can increase the ratio of useful instructions and so make better use of the available ports. Still, the 2 loads prevent the loop from running faster than 1 cycle/iteration, that is, 4.2 ms.
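For illustration, a 4x manually unrolled scalar version might look like the sketch below (my own code, not from the question; it assumes commonSize is a multiple of 4). Each element still needs 2 loads, so at 2 loads/cycle this remains limited to about 1 cycle per element:
#include <cstdint>

uint64_t compareFunction_unrolled(const char *const __restrict buffer, const uint64_t commonSize)
{
    // Four independent counters reduce the per-element loop overhead
    // and give the scheduler more independent work.
    uint64_t d0 = 0, d1 = 0, d2 = 0, d3 = 0;
    for (uint64_t byte = 0; byte < commonSize; byte += 4) {
        d0 += buffer[byte + 0] != buffer[byte + 0 + commonSize];
        d1 += buffer[byte + 1] != buffer[byte + 1 + commonSize];
        d2 += buffer[byte + 2] != buffer[byte + 2 + commonSize];
        d3 += buffer[byte + 3] != buffer[byte + 3 + commonSize];
    }
    return d0 + d1 + d2 + d3;
}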
Analysis of the vectorized Clang code
The vectorized code generated by Clang is complex. One could try to apply the same analysis as for the previous code, but it would be a tedious task.
One can note that even though the code is vectorized, the loads are not vectorized. This is an issue since only 2 loads can be done per cycle. That being said, loads are performed on pairs of two contiguous char values, so they are not as slow as in the previously generated code.
Clang does that because only two 64-bit values fit in a 128-bit SSE register, and it needs 64-bit lanes because diffFound is a 64-bit integer. The 8-bit to 64-bit conversion is the biggest issue in this code because it requires several SSE instructions. Moreover, only 4 integers can be computed at a time since there are 3 SSE integer units on Coffee Lake and each of them can only compute two 64-bit integers at a time. In the end, Clang only puts 2 values in each SSE register (and uses 4 of them so as to compute 8 items per loop iteration), so one would expect code running more than twice as fast (especially thanks to SSE and the loop unrolling), but this is not really the case, due to there being fewer SSE ports than ALU ports and more instructions being required for the type conversions. Put shortly, the vectorization is clearly inefficient, but it is not easy for Clang to generate efficient code in this case. Still, with 28 SSE instructions and 3 SSE integer units computing 8 items per loop, one should expect the computing part of the code to take about 28/3/8 ~= 1.2 cycles/item, which is far from what you observe (and this is not due to the other instructions, since they can mostly be executed in parallel, being scheduled on other ports).
In fact, the performance issue almost certainly comes from the saturation of port 5. Indeed, this port is the only one that can shuffle items of SIMD registers. Thus, the punpcklbw, pshuflw, pshufd and even the movd instructions can only be executed on port 5. This is a pretty common issue with SIMD code. It is a big problem here since there are 20 such instructions per loop iteration and the processor may not even use the port perfectly: 8 items per iteration over 20 million items means 2.5 million iterations, so at least 50 million port-5 cycles, i.e. the code should take at least 10.4 ms at 4.8 GHz, which is very close to the observed execution time (11 ms).
Analysis of the vectorized GCC code
The code generated by GCC is actually pretty good compared to Clang's. Firstly, GCC loads items using SIMD instructions directly, which is much more efficient as 16 items are loaded per instruction (and per iteration): it only needs 2 load uops per iteration, reducing the pressure on ports 2 and 3 (1 cycle/iteration for that, so 0.0625 cycles/item). Secondly, GCC only uses 14 punpckhwd instructions while each iteration computes 16 items, reducing the critical pressure on port 5 (0.875 cycles/item for that). Thirdly, the SIMD registers are nearly fully used, at least for the comparison, since the pcmpeqb comparison instruction compares 16 items at a time (as opposed to 2 with Clang). The other instructions like paddq are cheap (for example, paddq can be scheduled on the 3 SSE ports) and should not impact the execution time much. In the end, this version should still be bound by port 5, but it should be much faster than the Clang version. Indeed, one should expect the execution time to reach 1 cycle/item (since the port scheduling is certainly not perfect and memory loads may introduce some stall cycles). This means an execution time of 4.2 ms, which is close to the observed results.
Faster implementation
The GCC implementation is not perfect.
First of all, it does not use the AVX2 instruction set supported by your processor, since the -mavx2 flag is not provided (nor any similar flag like -march=native). Indeed, GCC, like other mainstream compilers, only uses SSE2 by default for the sake of compatibility with older architectures: SSE2 is safe to use on all x86-64 processors, but not other instruction sets like SSE3, SSSE3, SSE4.1, SSE4.2, AVX and AVX2. With such a flag, GCC should be able to produce memory-bound code.
Moreover, the compiler could theoretically perform a multi-level sum reduction. The idea is to accumulate the results of the comparisons in 8-bit wide SIMD lanes, working on chunks of 1024 items (i.e. 64x16 items). This is safe since the value of each lane cannot exceed 64. To avoid overflow, the accumulated values then need to be added into wider SIMD lanes (e.g. 64-bit ones). With this strategy, the overhead of the punpckhwd instructions is 64 times smaller. This is a big improvement since it removes the saturation of port 5. This strategy should be sufficient to generate memory-bound code, even using only SSE2. Here is an example of untested code requiring the flag -fopenmp-simd to be efficient:
uint64_t compareFunction(const char *const __restrict buffer, const uint64_t commonSize)
{
uint64_t byteChunk = 0;
uint64_t diffFound = 0;
if(commonSize >= 127)
{
for(; byteChunk < commonSize-127; byteChunk += 128)
{
uint8_t tmpDiffFound = 0;
#pragma omp simd reduction(+:tmpDiffFound)
for(uint64_t byte = byteChunk; byte < byteChunk + 128; ++byte)
tmpDiffFound += buffer[byte] != buffer[byte + commonSize];
diffFound += tmpDiffFound;
}
}
for(uint64_t byte = byteChunk; byte < commonSize; ++byte)
diffFound += buffer[byte] != buffer[byte + commonSize];
return diffFound;
}
Both GCC and Clang generate rather efficient code from this (though sub-optimal for data fitting in the cache), especially Clang. Here is, for example, the code generated by Clang using AVX2:
.LBB0_4:
lea r10, [rdx + 128]
vmovdqu ymm2, ymmword ptr [r9 + rdx - 96]
vmovdqu ymm3, ymmword ptr [r9 + rdx - 64]
vmovdqu ymm4, ymmword ptr [r9 + rdx - 32]
vpcmpeqb ymm2, ymm2, ymmword ptr [rcx + rdx - 96]
vpcmpeqb ymm3, ymm3, ymmword ptr [rcx + rdx - 64]
vpcmpeqb ymm4, ymm4, ymmword ptr [rcx + rdx - 32]
vmovdqu ymm5, ymmword ptr [r9 + rdx]
vpaddb ymm2, ymm4, ymm2
vpcmpeqb ymm4, ymm5, ymmword ptr [rcx + rdx]
vpaddb ymm3, ymm4, ymm3
vpaddb ymm2, ymm3, ymm2
vpaddb ymm2, ymm2, ymm0
vextracti128 xmm3, ymm2, 1
vpaddb xmm2, xmm2, xmm3
vpshufd xmm3, xmm2, 238
vpaddb xmm2, xmm2, xmm3
vpsadbw xmm2, xmm2, xmm1
vpextrb edx, xmm2, 0
add rax, rdx
mov rdx, r10
cmp r10, r8
jb .LBB0_4
All the loads are 256-bit SIMD ones. The number of vpcmpeqb instructions is optimal. The number of vpaddb instructions is relatively good. There are a few other instructions, but they should clearly not be a bottleneck. The loop operates on 128 items per iteration and I expect it to take less than a dozen cycles per iteration for data already in the cache (otherwise it should be completely memory-bound). This means <0.1 cycle/item, far less than the previous implementation. In fact, the uiCA tool indicates about 0.055 cycles/item, that is, 81 GiB/s! One may manually write better code using SIMD intrinsics, but at the expense of significantly worse portability, maintainability and readability.
Note that generating sequential memory-bound code does not always mean the RAM throughput will be saturated. In fact, on one core there is sometimes not enough concurrency to hide the latency of memory operations, though it should be fine on your processor (as it is on my i5-9600KF with 2 interleaved 3200 MHz DDR4 memory channels).
Yes, if your data is not hot in cache, even SSE2 should keep up with memory bandwidth. Compare-and-sum of 32 compare results per cycle (from two 32-byte loads) is totally possible if data is hot in L1d cache, or whatever bandwidth outer levels of cache can provide.
If not, the compiler did a bad job. That's unfortunately common for problems like this reducing into a wider variable; compilers don't know good vectorization strategies for summing bytes, especially compare-result bytes that must be 0/-1. They probably widen to 64-bit with pmovsxbq right away (or even worse if SSE4.1 instructions aren't available).
So even -O3 -march=native doesn't help much; this is a big missed optimization. Hopefully GCC and clang will learn how to vectorize this kind of loop at some point; summing compare results probably comes up in enough codebases to be worth recognizing that pattern.
The efficient way is to use psadbw to sum horizontally into qwords. But only after an inner loop does some iterations of vsum -= cmp(p, q), subtracting 0 or -1 to increment a counter or not. 8-bit elements can do 255 iterations of that without risk of overflow. And with unrolling for multiple vector accumulators, that's many vectors of 32 bytes each, so you don't have to break out of that inner loop very often.
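A minimal AVX2 intrinsics sketch of that idea (my own illustration with made-up names, not code from the linked answers): count matching bytes with vsum -= cmpeq(p, q) in an inner loop capped at 255 iterations, widen with vpsadbw against zero, and turn the match count into a difference count at the end.
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Count differing bytes between a and b; size assumed to be a multiple of 32 for brevity.
uint64_t count_diff_avx2(const uint8_t *a, const uint8_t *b, size_t size)
{
    uint64_t matches = 0;
    size_t i = 0;
    while (i < size) {
        __m256i vsum = _mm256_setzero_si256();
        size_t chunk_end = i + 255 * 32;           // at most 255 inner iterations,
        if (chunk_end > size) chunk_end = size;    // so each byte lane stays <= 255 (no overflow)
        for (; i < chunk_end; i += 32) {
            __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
            __m256i eq = _mm256_cmpeq_epi8(va, vb); // 0xFF where equal, 0x00 where different
            vsum = _mm256_sub_epi8(vsum, eq);       // vsum -= (-1 or 0): counts equal bytes
        }
        __m256i sad = _mm256_sad_epu8(vsum, _mm256_setzero_si256()); // widen byte sums to qwords
        __m128i lo  = _mm256_castsi256_si128(sad);
        __m128i hi  = _mm256_extracti128_si256(sad, 1);
        __m128i s   = _mm_add_epi64(lo, hi);
        matches += (uint64_t)_mm_cvtsi128_si64(s) + (uint64_t)_mm_extract_epi64(s, 1);
    }
    return size - matches;                          // differences = total bytes - matching bytes
}
Unrolling with several vsum accumulators, as described above, would be the next step; the single accumulator here is just to show the psadbw reduction.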
See How to count character occurrences using SIMD for manually-vectorized AVX2 code. (And one answer has a Godbolt link to an SSE2 version.) Summing the compare results is the same problem as that, but you're loading two vectors to feed pcmpeqb instead of broadcasting one byte outside the loop to find occurrences of a single char.
An answer there has benchmarks that report 28 GB/s for AVX2, 23 GB/s for SSE2, on an i7-6700 Skylake (at only 3.4GHz, maybe they disabled turbo or are just reporting the rated speed. DRAM speed not mentioned.)
I'd expect 2 input streams of data to achieve about the same sustained bandwidth as one.
This is more interesting to optimize if you benchmark repeated passes over smaller arrays that fit in L2 cache, then efficiency of your ALU instructions matters. (The strategy in the answers on that question are pretty good and well tuned for that case.)
Fast counting the number of equal bytes between two arrays is an older Q&A using a worse strategy, not using psadbw to sum bytes to 64-bit. (But not as bad as GCC/clang, still hsumming as it widens to 32-bit.)
Multiple threads/cores will barely help on a modern desktop, especially at high core clocks like yours. Memory latency is low enough and each core has enough buffers to keep enough requests in flight that it can nearly saturate dual-channel DRAM controllers.
On a big Xeon, that would be very different; you need most of the cores to achieve peak aggregate bandwidth, even for just memcpy or memset so there's zero ALU work, just loads/stores. The higher latency means a single core has much less memory bandwidth available than on a desktop (even in an absolute sense, let alone as a percentage of 6 channels instead of 2). See also Enhanced REP MOVSB for memcpy and Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?
Portable source that compiles to less-bad asm, micro-optimized from Jérôme's: 5.5 cycles per 4x 32-byte vectors, down from 7 or 8, assuming L1d cache hits.
Still not good (as it reduces to scalar every 128 bytes, or 192 if you want to try that), but Jérôme Richard came up with a clever way to give clang something it could vectorize with a good strategy: a uint8_t sum, used in an inner loop short enough not to overflow.
But clang still does some dumb things with that loop, as we can see in his answer. I modified the loop control to use a pointer increment, which reduces the loop overhead a bit, just one pointer-add and compare/jcc, not LEA/MOV. I don't know why clang was doing it inefficiently using integer indexing.
And it avoids an indexed addressing mode for the vpcmpeqb memory source operands, letting them stay micro-fused on Intel CPUs. (Clang doesn't seem to know that this matters at all! Reversing operands to != in the source was enough to make it use indexed addressing modes for vpcmpeqb instead of for vmovdqu pure loads.)
// micro-optimized version of Jérôme's function, clang compiles this better
// instead of 2 arrays, it compares first and 2nd half of one array, which lets it index one relative to the other with an offset if we hand-hold clang into doing that.
uint64_t compareFunction_sink_fixup(const char *const __restrict buffer, const size_t commonSize)
{
uint64_t byteChunk = 0;
uint64_t diffFound = 0;
const char *endp = buffer + commonSize;
const char *__restrict ptr = buffer;
if(commonSize >= 127) {
// A signed type for commonSize wouldn't avoid UB in pointer subtraction creating a pointer before the object
// in practice it would be fine except maybe when inlining into a function where the compiler could see a compile-time-constant array size.
for(; ptr < endp-127 ; ptr += 128)
{
uint8_t tmpDiffFound = 0;
#pragma omp simd reduction(+:tmpDiffFound)
for(int off = 0 ; off < 128; ++off)
tmpDiffFound += ptr[off + commonSize] != ptr[off];
// without AVX-512, we get -1 for ==, 0 for not-equal. So clang adds set1_epi(4) to each bucket that holds the sum of four 0 / -1 elements
diffFound += tmpDiffFound;
}
}
// clang still auto-vectorizes, but knows the max trip count is only 127
// so doesn't unroll, just 4 bytes per iter.
for(int byte = 0 ; byte < commonSize % 128 ; ++byte)
diffFound += ptr[byte] != ptr[byte + commonSize];
return diffFound;
}
Godbolt with clang15 -O3 -fopenmp-simd -mavx2 -march=skylake -mbranches-within-32B-boundaries
# The main loop, from clang 15 for x86-64 Skylake
.LBB0_4: # =>This Inner Loop Header: Depth=1
vmovdqu ymm2, ymmword ptr [rdi + rsi]
vmovdqu ymm3, ymmword ptr [rdi + rsi + 32] # Indexed addressing modes are fine here
vmovdqu ymm4, ymmword ptr [rdi + rsi + 64]
vmovdqu ymm5, ymmword ptr [rdi + rsi + 96]
vpcmpeqb ymm2, ymm2, ymmword ptr [rdi] # non-indexed allow micro-fusion without un-lamination
vpcmpeqb ymm3, ymm3, ymmword ptr [rdi + 32]
vpcmpeqb ymm4, ymm4, ymmword ptr [rdi + 64]
vpaddb ymm2, ymm4, ymm2
vpcmpeqb ymm4, ymm5, ymmword ptr [rdi + 96]
vpaddb ymm3, ymm4, ymm3
vpaddb ymm2, ymm2, ymm3
vpaddb ymm2, ymm2, ymm0 # add a vector of set1_epi8(4) to turn sums of 0 / -1 into sums of 1 / 0
vextracti128 xmm3, ymm2, 1
vpaddb xmm2, xmm2, xmm3
vpshufd xmm3, xmm2, 238 # xmm3 = xmm2[2,3,2,3]
vpaddb xmm2, xmm2, xmm3 # reduced to 8 bytes
vpsadbw xmm2, xmm2, xmm1 # hsum to one qword
vpextrb edx, xmm2, 0 # extract and zero-extend
add rax, rdx # accumulate the chunk sum
sub rdi, -128 # pointer increment (with a sign_extended_imm8 instead of +imm32)
cmp rdi, rcx
jb .LBB0_4 # }while(p < endp)
This could use 192 instead of 128 to further amortize the loop overhead, at the cost of needing to do %192 (not a power of 2), and making the cleanup loop worst case be 191 bytes. We can't go to 256, or anything higher than UINT8_MAX (255), and sticking to multiples of 32 is necessary. Or 64 for good measure.
There's an extra vpaddb of a fixup constant, set1_epi8(4), which turns the sum of four 0 / -1 into a sum of four 1 / 0 results from the C != operator.
I don't think there's any way to get rid of it or sink it out of the loop while still accumulating into a uint8_t, which is necessary for clang to vectorize this way. It doesn't know how to use vpsadbw to do a widening (non-truncating) sum of bytes, which is ironic because that's what it actually does when used against an all-zero register. If you do something like sum += ptr[off + commonSize] == ptr[off] ? -1 : 0 you can get it to use the vpcmpeqb result directly, summing 4 vectors down to one with 3 adds, and eventually feeding that to vpsadbw after some reduction steps. So you get a sum of matches * 0xFF truncated to uint8_t for each block of 128 bytes. Or as an int8_t, that's a sum of -1 * matches, so 0..-128, which doesn't overflow a signed byte. So that's interesting. But adding with zero-extension into a 64-bit counter might destroy information, and sign-extension inside the outer loop would cost another instruction. It would be a scalar movsx instruction instead of vpaddb, but that's not important for Skylake, probably only if using AVX-512 with 512-bit vectors (which clang and GCC both do badly, not using masked adds). Can we do 128*n_chunks - count after the loop to recover the differences from the sum of matches? No, I don't think so.
uiCA static analysis predicts Skylake (such as your CPU) will run the main loop at 5.51 cycles / iter (4 vectors) if data is hot in L1d cache, or 5.05 on Ice Lake / Rocket Lake. (I had to hand-tweak the asm to emulate the padding effect -mbranches-within-32B-boundaries would have, for uiCA's default assumption of where the top of the loop is relative to a 32-byte alignment boundary. I could have just changed that setting in uiCA instead. :/)
The only missed micro-optimization in implementing this sub-optimal strategy is that it's using vpextrb (because it doesn't prove that truncation to uint8_t isn't needed?) instead of vmovd or vmovq. So it costs an extra uop for the front-end, and for port 5 in the back end. With that optimized (comment + uncomment in the link), 5.25c / iter on Skylake, or 4.81 on Ice Lake, pretty close to the 2 load/clock bottleneck.
(Doing 6 vectors per iter, 192 bytes, predicts 7 cycles per iter on SKL, or 1.166 per vector, down from 5.5 / iter = 1.375 per vector. Or about 6.5 on ICL/RKL = 1.08 c/vec, hitting back-end ALU port bottlenecks.)
This is not bad for something we were able to coax clang into generating from portable C++ source, vs. 4 cycles per 4 vectors of 32 byte-compares each for efficient manual vectorization. This will very likely keep up with memory or cache bandwidth even from L2 cache, so it's pretty usable, and not much slower with data hot in L1d. Taking a few more uops does hurt out-of-order exec, and uses up more execution resources that another logical core sharing a physical core could use. (Hyperthreading).
Unfortunately gcc/clang do not make good use of AVX-512 for this. If you were using 512-bit vectors (or AVX-512 features on 256-bit vectors), you'd compare into mask registers, then do something like vpaddb zmm0{k1}, zmm0, zmm1 merge-masking to conditionally increment a vector, where zmm1 = set1_epi8( 1 ). (Or a -1 constant with sub.) Instruction and uop count per vector should be about the same as AVX2 if done properly, but gcc/clang use about twice as many, so the only saving is in the reduction to scalar which seems to be the price for getting anything at all usable.
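For illustration, a minimal AVX-512BW sketch of that compare-into-mask plus merge-masked add (my own code with a made-up helper name, not what gcc/clang emit):
#include <immintrin.h>

// One 64-byte step of the strategy described above: compare into a mask register, then a
// merge-masked vpaddb adds set1_epi8(1) only in the lanes where the bytes differ. As with
// the AVX2 version, each byte lane can absorb at most 255 such steps before it would
// overflow, so a psadbw-style widening still has to happen periodically.
static inline __m512i count_step_avx512(__m512i vsum, const char *p, const char *q)
{
    __m512i vp = _mm512_loadu_si512(p);
    __m512i vq = _mm512_loadu_si512(q);
    __mmask64 diff = _mm512_cmpneq_epi8_mask(vp, vq);                   // 1 bit per differing byte
    return _mm512_mask_add_epi8(vsum, diff, vsum, _mm512_set1_epi8(1)); // conditional +1 per lane
}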
This version also avoids unrolling of the clean-up loop, just vectorizing with its dumb 4 bytes per iteration strategy, which is about right for cleanup of size%128 bytes. It's pretty silly that it uses both vpxor to flip and vpand to turn 0xff into 0x01, when it could have used vpandn to do both those things in one instruction. That would get that cleanup loop down to 8 uops, just twice the pipeline width on Haswell / Skylake, so it would issue more efficiently from the loop buffer, except Skylake disabled that in microcode updates. It would help a bit on Haswell.
Correct me if I am wrong, but the answer seems to be:
-march=native for the win.
the scalar version of the code was CPU bottlenecked and not RAM bottlenecked
use uica.uops.info to get an estimate of the cycles per loop
I will try to write my own AVX code to compare.
Details
After an afternoon tinkering around with the suggestions, here is what I found with clang:
-O1 around 10ms, scalar code
-O3 enables SSE2 and is as slow as O1, maybe poor assembly code
-O3 -march=westmere enables also SSE2 but is faster (7ms)
-O3 -march=native enables AVX -> 2.5ms and we are probably RAM bandwidth limited (close to the theoretical speed)
The scalar 10ms makes sense now because according to that awesome tool uica.uops.info it takes
2.35 cycles per loop
47 million cycles for the whole comparison (20 million iterations)
The processor is clocked at 4.8 GHz, meaning it should take around 9.8 ms, which is close to what is measured.
g++ seems to generate better default code when no flags are added
O1 11ms
O2 scalar still but 9ms
O3 SSE 4.5ms
O3 -march=westmere 7ms like clang
O3 -march=native 3.4ms, slightly slower than clang

Will Intel -O3 convert pairs of __m256d instructions into __m512d

Will code written for 256-bit vector registers be compiled to use 512-bit instructions by the (2019) Intel compiler at the -O3 level of optimization?
e.g. will operations on two __m256d objects either be converted to the same number of operations on masked __m512d objects, or be grouped to make the most use of the register, in the best case dropping the total number of operations by a factor of 2?
arch: Knights Landing
Unfortunately, no: code written to use AVX/AVX2 intrinsics is not yet rewritten by ICC to use AVX-512 (with either ICC 2019 or ICC 2021). There is no instruction fusing. Here is an example (see it on Godbolt).
#include <x86intrin.h>
void compute(double* restrict data, int size)
{
__m256d cst1 = _mm256_set1_pd(23.42);
__m256d cst2 = _mm256_set1_pd(815.0);
__m256d v1 = _mm256_load_pd(data);
__m256d v2 = _mm256_load_pd(data+4);
__m256d v3 = _mm256_load_pd(data+8);
__m256d v4 = _mm256_load_pd(data+12);
v1 = _mm256_fmadd_pd(v1, cst1, cst2);
v2 = _mm256_fmadd_pd(v2, cst1, cst2);
v3 = _mm256_fmadd_pd(v3, cst1, cst2);
v4 = _mm256_fmadd_pd(v4, cst1, cst2);
_mm256_store_pd(data, v1);
_mm256_store_pd(data+4, v2);
_mm256_store_pd(data+8, v3);
_mm256_store_pd(data+12, v4);
}
Generated code:
compute:
vmovupd ymm0, YMMWORD PTR .L_2il0floatpacket.0[rip] #5.20
vmovupd ymm1, YMMWORD PTR .L_2il0floatpacket.1[rip] #6.20
vmovupd ymm2, YMMWORD PTR [rdi] #7.33
vmovupd ymm3, YMMWORD PTR [32+rdi] #8.33
vmovupd ymm4, YMMWORD PTR [64+rdi] #9.33
vmovupd ymm5, YMMWORD PTR [96+rdi] #10.33
vfmadd213pd ymm2, ymm0, ymm1 #11.10
vfmadd213pd ymm3, ymm0, ymm1 #12.10
vfmadd213pd ymm4, ymm0, ymm1 #13.10
vfmadd213pd ymm5, ymm0, ymm1 #14.10
vmovupd YMMWORD PTR [rdi], ymm2 #15.21
vmovupd YMMWORD PTR [32+rdi], ymm3 #16.21
vmovupd YMMWORD PTR [64+rdi], ymm4 #17.21
vmovupd YMMWORD PTR [96+rdi], ymm5 #18.21
vzeroupper #19.1
ret #19.1
The same code is generated for both versions of ICC.
Note that using AVX-512 will not always speed up your code by a factor of two. For example, on Skylake SP (server processors) there are 2 AVX/AVX2 SIMD units that can be fused to execute AVX-512 instructions, but this fusing does not improve throughput (assuming the SIMD units are the bottleneck). However, Skylake SP also optionally has an additional 512-bit SIMD unit that does not execute AVX/AVX2 instructions (only available on some processors). In that case, AVX-512 can make your code twice as fast.
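So to get 512-bit instructions here, the intrinsics have to be rewritten by hand. Here is a sketch of what a manual AVX-512 conversion of the example above could look like (my own illustration, assuming AVX-512F is available on the target; this is not ICC output):
#include <immintrin.h>

void compute_avx512(double* __restrict data, int size)
{
    // Two 512-bit vectors replace the four 256-bit ones of the original example.
    // Unaligned loads/stores are used so as not to assume 64-byte alignment.
    __m512d cst1 = _mm512_set1_pd(23.42);
    __m512d cst2 = _mm512_set1_pd(815.0);
    __m512d v1 = _mm512_loadu_pd(data);
    __m512d v2 = _mm512_loadu_pd(data + 8);
    v1 = _mm512_fmadd_pd(v1, cst1, cst2);
    v2 = _mm512_fmadd_pd(v2, cst1, cst2);
    _mm512_storeu_pd(data, v1);
    _mm512_storeu_pd(data + 8, v2);
}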

Why does this SSE2 program (integers) generate movaps (float)?

The following loops transpose an integer matrix into another integer matrix. When I compiled it, interestingly it generated a movaps instruction to store the result into the output matrix. Why does GCC do this?
data:
int __attribute__(( aligned(16))) t[N][M]
, __attribute__(( aligned(16))) c_tra[N][M];
loops:
for( i=0; i<N; i+=4){
for(j=0; j<M; j+=4){
row0 = _mm_load_si128((__m128i *)&t[i][j]);
row1 = _mm_load_si128((__m128i *)&t[i+1][j]);
row2 = _mm_load_si128((__m128i *)&t[i+2][j]);
row3 = _mm_load_si128((__m128i *)&t[i+3][j]);
__t0 = _mm_unpacklo_epi32(row0, row1);
__t1 = _mm_unpacklo_epi32(row2, row3);
__t2 = _mm_unpackhi_epi32(row0, row1);
__t3 = _mm_unpackhi_epi32(row2, row3);
/* values back into I[0-3] */
row0 = _mm_unpacklo_epi64(__t0, __t1);
row1 = _mm_unpackhi_epi64(__t0, __t1);
row2 = _mm_unpacklo_epi64(__t2, __t3);
row3 = _mm_unpackhi_epi64(__t2, __t3);
_mm_store_si128((__m128i *)&c_tra[j][i], row0);
_mm_store_si128((__m128i *)&c_tra[j+1][i], row1);
_mm_store_si128((__m128i *)&c_tra[j+2][i], row2);
_mm_store_si128((__m128i *)&c_tra[j+3][i], row3);
}
}
Assembly generated code:
.L39:
lea rcx, [rsi+rdx]
movdqa xmm1, XMMWORD PTR [rdx]
add rdx, 16
add rax, 2048
movdqa xmm6, XMMWORD PTR [rcx+rdi]
movdqa xmm3, xmm1
movdqa xmm2, XMMWORD PTR [rcx+r9]
punpckldq xmm3, xmm6
movdqa xmm5, XMMWORD PTR [rcx+r10]
movdqa xmm4, xmm2
punpckhdq xmm1, xmm6
punpckldq xmm4, xmm5
punpckhdq xmm2, xmm5
movdqa xmm5, xmm3
punpckhqdq xmm3, xmm4
punpcklqdq xmm5, xmm4
movdqa xmm4, xmm1
punpckhqdq xmm1, xmm2
punpcklqdq xmm4, xmm2
movaps XMMWORD PTR [rax-2048], xmm5
movaps XMMWORD PTR [rax-1536], xmm3
movaps XMMWORD PTR [rax-1024], xmm4
movaps XMMWORD PTR [rax-512], xmm1
cmp r11, rdx
jne .L39
Compiled with gcc -Wall -msse4.2 -masm="intel" -O2 -c -S, on Skylake, Linux Mint.
-mavx2 or -march=native generates the VEX encoding: vmovaps.
Functionally those instructions are the same.
I don't like to copy+paste other people's statements as mine, so here are a few links explaining it:
Difference between MOVDQA and MOVAPS x86 instructions?
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/279587
http://masm32.com/board/index.php?topic=1138.0
https://www.gamedev.net/blog/615/entry-2250281-demystifying-sse-move-instructions/
Short version:
So for the most part, you should try to use the move instruction that
corresponds with the operations you are going to use on those
registers. However, there is an additional complication. Loads and
stores to and from memory execute on a separate port from the integer
and floating point units; thus instructions that load from memory into
a register or store from a register into memory will experience the
same delay regardless of the data type you attach to the move. Thus
in this case, movaps, movapd, and movdqa will have the same delay no
matter what data you use. Since movaps (and movups) is encoded in
binary form with one less byte than the other two, it makes sense to
use it for all reg-mem moves, regardless of the data type.
So it is a GCC optimization.

Arithmetic optimisation inside a C for-loop

I have two functions with for-loops which look very similar. The amount of data to process is very large, so I am trying to optimise the loops as much as possible. The execution time for the second function is 320 sec, but the first one takes 460 sec. Could somebody please give me suggestions on what makes the difference and how to optimise the computation? The first one:
int ii, jj;
double c1, c2;
for (ii = 0; ii < n; ++ii) {
a[jj] += b[ii] * c1;
a[++jj] += b[ii] * c2;
}
The second one:
int ii, jj;
double c1, c2;
for (ii = 0; ii < n; ++ii) {
b[ii] += a[jj] * c1;
b[ii] += a[++jj] * c2;
}
And here is the assembler output for the first loop:
movl -104(%rbp), %eax
movq -64(%rbp), %rcx
cmpl (%rcx), %eax
jge LBB0_12
## BB#10: ## in Loop: Header=BB0_9 Depth=5
movslq -88(%rbp), %rax
movq -48(%rbp), %rcx
movsd (%rcx,%rax,8), %xmm0 ## xmm0 = mem[0],zero
mulsd -184(%rbp), %xmm0
movslq -108(%rbp), %rax
movq -224(%rbp), %rcx ## 8-byte Reload
addsd (%rcx,%rax,8), %xmm0
movsd %xmm0, (%rcx,%rax,8)
movslq -88(%rbp), %rax
movq -48(%rbp), %rdx
movsd (%rdx,%rax,8), %xmm0 ## xmm0 = mem[0],zero
mulsd -192(%rbp), %xmm0
movl -108(%rbp), %esi
addl $1, %esi
movl %esi, -108(%rbp)
movslq %esi, %rax
addsd (%rcx,%rax,8), %xmm0
movsd %xmm0, (%rcx,%rax,8)
movl -88(%rbp), %esi
addl $1, %esi
movl %esi, -88(%rbp)
and for the second one:
movl -104(%rbp), %eax
movq -64(%rbp), %rcx
cmpl (%rcx), %eax
jge LBB0_12
## BB#10: ## in Loop: Header=BB0_9 Depth=5
movslq -108(%rbp), %rax
movq -224(%rbp), %rcx ## 8-byte Reload
movsd (%rcx,%rax,8), %xmm0 ## xmm0 = mem[0],zero
mulsd -184(%rbp), %xmm0
movslq -88(%rbp), %rax
movq -48(%rbp), %rdx
addsd (%rdx,%rax,8), %xmm0
movsd %xmm0, (%rdx,%rax,8)
movl -108(%rbp), %esi
addl $1, %esi
movl %esi, -108(%rbp)
movslq %esi, %rax
movsd (%rcx,%rax,8), %xmm0 ## xmm0 = mem[0],zero
mulsd -192(%rbp), %xmm0
movslq -88(%rbp), %rax
movq -48(%rbp), %rdx
addsd (%rdx,%rax,8), %xmm0
movsd %xmm0, (%rdx,%rax,8)
movl -88(%rbp), %esi
addl $1, %esi
movl %esi, -88(%rbp)
The original function is much bigger, so here I provided only the pieces responsible for those for-loops. The rest of the c-code and its assembler output is exactly the same for both functions.
The structure of that calculation is pretty weird, but it can be optimized significantly. Some problems with that code are:
reloading data from a pointer after writing to another pointer that isn't known to not alias. I assume they won't alias because this algorithm would be even weirder if that were allowed, but if they're really supposed to maybe alias, ignore this. In general, structure your loop body as: first load everything, do the calculations, then store back. Don't mix loading and storing; it makes the compiler more conservative.
reloading data that was stored in the previous iteration. The compiler can see through this a bit, but it complicates matters. Don't do it.
implicitly treating the first and last items differently. It looks like a nice homogeneous loop at first, but due to its weird structure it's actually special-casing the first and last elements.
So let's first fix the second loop, which is simpler. The only problem here is the first store to b[ii], which has to Really Happen(tm) because it might alias with a[jj + 1]. But it can trivially be written so that that problem goes away:
for (ii = 0; ii < n; ++ii) {
b[ii] += a[jj] * c1 + a[jj + 1] * c2;
jj++;
}
You can tell by the assembly output that the compiler is happier now, and of course benchmarking confirms it's faster.
Old asm (only main loop, not the extra cruft):
.LBB0_14: # =>This Inner Loop Header: Depth=1
vmulpd ymm4, ymm2, ymmword ptr [r8 - 8]
vaddpd ymm4, ymm4, ymmword ptr [rax]
vmovupd ymmword ptr [rax], ymm4
vmulpd ymm5, ymm3, ymmword ptr [r8]
vaddpd ymm4, ymm4, ymm5
vmovupd ymmword ptr [rax], ymm4
add r8, 32
add rax, 32
add r11, -4
jne .LBB0_14
New asm (only main loop):
.LBB1_20: # =>This Inner Loop Header: Depth=1
vmulpd ymm4, ymm2, ymmword ptr [rax - 104]
vmulpd ymm5, ymm2, ymmword ptr [rax - 72]
vmulpd ymm6, ymm2, ymmword ptr [rax - 40]
vmulpd ymm7, ymm2, ymmword ptr [rax - 8]
vmulpd ymm8, ymm3, ymmword ptr [rax - 96]
vmulpd ymm9, ymm3, ymmword ptr [rax - 64]
vmulpd ymm10, ymm3, ymmword ptr [rax - 32]
vmulpd ymm11, ymm3, ymmword ptr [rax]
vaddpd ymm4, ymm4, ymm8
vaddpd ymm5, ymm5, ymm9
vaddpd ymm6, ymm6, ymm10
vaddpd ymm7, ymm7, ymm11
vaddpd ymm4, ymm4, ymmword ptr [rcx - 96]
vaddpd ymm5, ymm5, ymmword ptr [rcx - 64]
vaddpd ymm6, ymm6, ymmword ptr [rcx - 32]
vaddpd ymm7, ymm7, ymmword ptr [rcx]
vmovupd ymmword ptr [rcx - 96], ymm4
vmovupd ymmword ptr [rcx - 64], ymm5
vmovupd ymmword ptr [rcx - 32], ymm6
vmovupd ymmword ptr [rcx], ymm7
sub rax, -128
sub rcx, -128
add rbx, -16
jne .LBB1_20
That also got unrolled more (automatically), but the more significant difference (not that unrolling is useless, but reducing the loop overhead isn't such a big deal usually; it can mostly be handled by the ports that aren't busy with vector instructions) is the reduction in stores, which takes it from a store:load ratio of 2/3 (potentially bottlenecked by store throughput, where half the stores are useless) to 4/12 (bottlenecked by something that really has to happen).
Now for that first loop, once you take out the first and last iterations, it's just adding two scaled b's to every a, and then we put the first and last iterations back in separately:
a[0] += b[0] * c1;
for (ii = 1; ii < n; ++ii) {
a[ii] += b[ii - 1] * c2 + b[ii] * c1;
}
a[n] += b[n - 1] * c2;
That takes it from this (note that this isn't even vectorized):
.LBB0_3: # =>This Inner Loop Header: Depth=1
vmulsd xmm3, xmm0, qword ptr [rsi + 8*rax]
vaddsd xmm2, xmm2, xmm3
vmovsd qword ptr [rdi + 8*rax], xmm2
vmulsd xmm2, xmm1, qword ptr [rsi + 8*rax]
vaddsd xmm2, xmm2, qword ptr [rdi + 8*rax + 8]
vmovsd qword ptr [rdi + 8*rax + 8], xmm2
vmulsd xmm3, xmm0, qword ptr [rsi + 8*rax + 8]
vaddsd xmm2, xmm2, xmm3
vmovsd qword ptr [rdi + 8*rax + 8], xmm2
vmulsd xmm2, xmm1, qword ptr [rsi + 8*rax + 8]
vaddsd xmm2, xmm2, qword ptr [rdi + 8*rax + 16]
vmovsd qword ptr [rdi + 8*rax + 16], xmm2
lea rax, [rax + 2]
cmp ecx, eax
jne .LBB0_3
To this:
.LBB1_6: # =>This Inner Loop Header: Depth=1
vmulpd ymm4, ymm2, ymmword ptr [rbx - 104]
vmulpd ymm5, ymm2, ymmword ptr [rbx - 72]
vmulpd ymm6, ymm2, ymmword ptr [rbx - 40]
vmulpd ymm7, ymm2, ymmword ptr [rbx - 8]
vmulpd ymm8, ymm3, ymmword ptr [rbx - 96]
vmulpd ymm9, ymm3, ymmword ptr [rbx - 64]
vmulpd ymm10, ymm3, ymmword ptr [rbx - 32]
vmulpd ymm11, ymm3, ymmword ptr [rbx]
vaddpd ymm4, ymm4, ymm8
vaddpd ymm5, ymm5, ymm9
vaddpd ymm6, ymm6, ymm10
vaddpd ymm7, ymm7, ymm11
vaddpd ymm4, ymm4, ymmword ptr [rcx - 96]
vaddpd ymm5, ymm5, ymmword ptr [rcx - 64]
vaddpd ymm6, ymm6, ymmword ptr [rcx - 32]
vaddpd ymm7, ymm7, ymmword ptr [rcx]
vmovupd ymmword ptr [rcx - 96], ymm4
vmovupd ymmword ptr [rcx - 64], ymm5
vmovupd ymmword ptr [rcx - 32], ymm6
vmovupd ymmword ptr [rcx], ymm7
sub rbx, -128
sub rcx, -128
add r11, -16
jne .LBB1_6
Nice and vectorized this time, and much less storing and loading going on.
Both changes combined made it about twice as fast on my PC but of course YMMV.
I still think this code is weird though. Note how we're modifying a[n] in the last iteration of the first loop, then using it in the first iteration of the second loop, while the other a's just sort of stand to the side and watch. It's odd. Maybe it really has to be that way, but frankly it looks like a bug to me.

Is it possible to get multiple sines in AVX/SSE?

I'm trying to write a C++ program, which launches a function I write in x64 assembler.
I'd like to speed things up a little (and play with CPU features), so I chose to use vector operations.
The problem is, I have to multiply sines by an integer, so I have to calculate the sines first.
Is it possible to do this in SSE/AVX? I'm aware of the fsin instruction, but not only is it an x87 FPU instruction, it also calculates only one sine at a time. So I'd have to push the value into the FPU, call fsin, pop it from the FPU to memory, and then put it into an AVX register. It doesn't seem worth the hassle to me.
Yes, there is a vector version using SSE/AVX! But the catch is that the Intel C++ compiler must be used.
This is called the Intel Short Vector Math Library (SVML), exposed as intrinsics:
for 128-bit SSE please use (double precision): _mm_sin_pd
for 256-bit AVX please use (double precision): _mm256_sin_pd
The two intrinsics are actually small functions consisting of hand-written SSE/AVX assembly, and you can process 4 sine calculations at once using AVX :=) The latency is about ~10 clock cycles (if I remember correctly) on a Haswell CPU.
By the way, the CPU needs to execute about 100 such intrinsics to warm up and reach its peak performance; if only a few sines need to be evaluated, it's better to use plain sin() instead.
Good luck!!
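For illustration, a call using those SVML intrinsics might look like the sketch below (my own example; _mm256_sin_pd is only available where the compiler provides SVML, e.g. the Intel compilers, and the function name scaled_sines is made up):
#include <immintrin.h>

// Multiply four sines by a (converted) integer factor in one go, as the question describes.
__m256d scaled_sines(__m256d angles, int factor)
{
    __m256d s = _mm256_sin_pd(angles);  // SVML intrinsic: 4 double-precision sines at once
    return _mm256_mul_pd(s, _mm256_set1_pd((double)factor));
}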
Since the vectorized sin/cos extensions are required by OpenMP 4.0, GCC/glibc offers them in libmvec as well (a minimal usage sketch follows after these links). See:
https://stackoverflow.com/a/54355153 for a header file declaring the function prototypes;
https://sourceware.org/glibc/wiki/libmvec for what libmvec is and how you normally would use it.
For a list of other SVML alternatives, see https://stackoverflow.com/a/36637424.
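A minimal sketch of how that typically looks from source (my own example; the exact flags and linking details are assumptions and depend on the glibc/GCC versions, so see the libmvec wiki linked above for the real requirements):
#include <cmath>

// Built with something roughly like: g++ -O3 -ffast-math -march=haswell sines.cpp -lmvec -lm
// With glibc's vector math declarations enabled, the loop is intended to auto-vectorize
// into calls to libmvec's SIMD sine (e.g. _ZGVdN4v_sin).
void sines(const double *in, double *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = std::sin(in[i]);
}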
Sample cosine approximation:
0.15 ULPS average (1 ULPS max) error
12x speedup for AVX512, 8-9x for AVX2
approximation of cosine over [-1,1] range.
Polynomial coefficients were found by a genetic algorithm. The multiplication series looks like a Chebyshev polynomial but it is not; Chebyshev polynomials are needed when you need a wider input range than just [-1,1].
// only optimized for [-1,1] input range!!
template<typename Type, int Simd>
inline
void cosFast(
Type * const __restrict__ data,
Type * const __restrict__ result) noexcept
{
alignas(64)
Type xSqr[Simd];
alignas(64)
Type xSqrSqr[Simd];
alignas(64)
Type xSqrSqrSqr[Simd];
alignas(64)
Type xSqrSqrSqrSqr[Simd];
#pragma GCC ivdep
for(int i=0;i<Simd;i++)
{
xSqr[i] = data[i]*data[i];
}
#pragma GCC ivdep
for(int i=0;i<Simd;i++)
{
xSqrSqr[i] = xSqr[i]*xSqr[i];
}
#pragma GCC ivdep
for(int i=0;i<Simd;i++)
{
xSqrSqrSqr[i] = xSqrSqr[i]*xSqr[i];
}
#pragma GCC ivdep
for(int i=0;i<Simd;i++)
{
xSqrSqrSqrSqr[i] = xSqrSqr[i]*xSqrSqr[i];
}
#pragma GCC ivdep
for(int i=0;i<Simd;i++)
{
result[i] = Type(2.37711074060342753000441e-05)*xSqrSqrSqrSqr[i] +
Type(-0.001387712893937020908197155)*xSqrSqrSqr[i] +
Type(0.04166611039514833692010143)*xSqrSqr[i] +
Type(-0.4999998698566363586337502)*xSqr[i] +
Type(0.9999999941252593060880827);
}
}
If you need to use a wider range, then you should do a range-reduction at high precision and use something like this:
range reduce to -pi,pi at high precision
divide by "4" to put it in -1,1 range
compute same series as above => tmp
compute (Chebyshev) L_"4" (tmp) = result
.L29:
vmovups ymm7, YMMWORD PTR [r14+rax]
vmulps ymm1, ymm7, ymm7
vmovups ymm7, YMMWORD PTR [r14+32+rax]
vmulps ymm3, ymm1, ymm1
vmulps ymm6, ymm3, ymm3
vmulps ymm6, ymm6, YMMWORD PTR .LC11[rip]
vmulps ymm0, ymm7, ymm7
vmulps ymm5, ymm3, ymm1
vfmadd132ps ymm5, ymm6, YMMWORD PTR .LC12[rip]
vmulps ymm2, ymm0, ymm0
vmulps ymm6, ymm2, ymm2
vmulps ymm6, ymm6, YMMWORD PTR .LC11[rip]
vfmadd132ps ymm3, ymm5, YMMWORD PTR .LC13[rip]
vmulps ymm4, ymm2, ymm0
vfmadd132ps ymm4, ymm6, YMMWORD PTR .LC12[rip]
vfmadd132ps ymm1, ymm3, YMMWORD PTR .LC14[rip]
vfmadd132ps ymm2, ymm4, YMMWORD PTR .LC13[rip]
vaddps ymm1, ymm1, YMMWORD PTR .LC15[rip]
vfmadd132ps ymm0, ymm2, YMMWORD PTR .LC14[rip]
vaddps ymm0, ymm0, YMMWORD PTR .LC15[rip]
vmovups YMMWORD PTR [r15+rax], ymm1
vmovups YMMWORD PTR [r15+32+rax], ymm0
add rax, 64
cmp rax, 16777216
jne .L29
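To make the recipe above concrete, here is a scalar sketch (my own illustration): cos_poly is a hypothetical scalar version of the [-1,1] series above with the same coefficients, and the range reduction shown is a plain remainder rather than the high-precision reduction the recipe actually calls for.
#include <cmath>

// Hypothetical scalar version of the [-1,1] polynomial above (same coefficients).
static double cos_poly(double x)
{
    double x2 = x * x;
    return 0.9999999941252593060880827 + x2 * (-0.4999998698566363586337502 +
           x2 * (0.04166611039514833692010143 + x2 * (-0.001387712893937020908197155 +
           x2 * 2.37711074060342753000441e-05)));
}

double cos_wide_range(double x)
{
    const double pi = 3.14159265358979323846;
    x = std::remainder(x, 2.0 * pi);     // reduce to [-pi, pi] (only to double precision here)
    double c = cos_poly(x * 0.25);       // x/4 lies in about [-0.79, 0.79], inside [-1, 1]
    double c2 = c * c;
    return 8.0 * c2 * (c2 - 1.0) + 1.0;  // Chebyshev T_4: cos(4t) = 8cos^4(t) - 8cos^2(t) + 1
}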
There is no sine instruction in SSE/AVX. However, depending on the precision you require, you can write an approximation to the sine function either as a polynomial using a Taylor/Madhava series, or as the quotient of two polynomials using a Padé approximant. And of course there are many more polynomial approximation techniques.
Whether this yields the precision you want, and how fast the method is, depends on your exact problem. Generally speaking, polynomial approximation is very fast, as one can evaluate an n-th degree polynomial using n FMA instructions (the Padé approximant also requires one division) by writing it in Horner form:
a + x*(b + x*(c + x*(...))).
However, sines are notoriously ill-behaved when approximated by polynomials over a wide range, so the use cases are limited.
