I modified the code from a previous experiment (Agner Fog's Optimizing Assembly, example 12.10a) to make it more dependent:
movsd xmm2, [x]
movsd xmm1, [one]
xorps xmm0, xmm0
mov eax, coeff
L1:
movsd xmm3, [eax]
mulsd xmm3, xmm1
mulsd xmm1, xmm2
addsd xmm1, xmm3
add eax, 8
cmp eax, coeff_end
jb L1
And now it takes ~13 cycles per iteration, but I have no idea why it takes so many.
Please help me understand.
(update)
I'm sorry. Yes, @Peter Cordes is definitely right: it takes 9 cycles per iteration in fact. The misunderstanding was my own fault; I mixed up two similar pieces of code (with two instructions swapped). The 13-cycle code is here:
movsd xmm2, [x]
movsd xmm1, [one]
xorps xmm0, xmm0
mov eax, coeff
L1:
movsd xmm3, [eax]
mulsd xmm1, xmm2
mulsd xmm3, xmm1
addsd xmm1, xmm3
add eax, 8
cmp eax, coeff_end
jb L1
It runs at exactly one iteration per 9c for me, on a Core2 E6600, which is expected:
movsd xmm3, [eax] ; independent, depends only on eax
A: mulsd xmm3, xmm1 ; 5c: depends on xmm1:C from last iteration
B: mulsd xmm1, xmm2 ; 5c: depends on xmm1:C from last iteration
C: addsd xmm1, xmm3 ; 3c: depends on xmm1:B from THIS iteration (and xmm3:A from this iteration)
When xmm1:C is ready from iteration i, the next iteration can start calculating:
A: producing xmm3:A in 5c
B: producing xmm1:B in 5c (but there's a resource conflict; these multiplies can't both start in the same cycle in Core2 or IvyBridge, only Haswell and later)
Regardless of which one runs first, both have to finish before C can run. So the loop-carried dependency chain is 5 + 3 cycles, +1c for the resource conflict that stops both multiplies from starting in the same cycle.
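To make that chain easier to see outside of asm, here is a scalar C sketch of the same loop body (illustrative only: like the posted asm, it folds the add back into the power register, so it is not a useful polynomial sum). The latency comments are the Core2/IvB numbers quoted above.
#include <stddef.h>

/* Sketch of the loop-carried dependency chain in the posted loop. */
double chain_demo(const double *coeff, size_t n, double x)
{
    double xn = 1.0;                /* xmm1 */
    for (size_t i = 0; i < n; i++) {
        double c = coeff[i];        /* load: not on the loop-carried chain */
        double term = c * xn;       /* A: 5c, reads xn:C from the last iteration */
        xn = xn * x;                /* B: 5c, reads xn:C from the last iteration */
        xn = xn + term;             /* C: 3c, reads B and A from THIS iteration */
    }
    return xn;
}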
Test code that runs at the expected speed:
This slows down to one iteration per ~11c when the array is 8B * 128 * 1024. If you're testing with an even bigger array instead of using a repeat-loop around what you posted, then that's why you're seeing a higher latency.
If a load arrives late, there's no way for the CPU to "catch up", since it delays the loop-carried dependency chain. If the load was only needed in a dependency chain that forked off from the loop-carried chain, then the pipeline could absorb an occasional slow load more easily. So, some loops can be more sensitive to memory delays than others.
default REL
%macro IACA_start 0
mov ebx, 111
db 0x64, 0x67, 0x90
%endmacro
%macro IACA_end 0
mov ebx, 222
db 0x64, 0x67, 0x90
%endmacro
global _start
_start:
movsd xmm2, [x]
movsd xmm1, [one]
xorps xmm0, xmm0
mov ecx, 10000
outer_loop:
mov eax, coeff
IACA_start ; outside the loop
ALIGN 32 ; this matters on Core2, .78 insn per cycle vs. 0.63 without
L1:
movsd xmm3, [eax]
mulsd xmm3, xmm1
mulsd xmm1, xmm2
addsd xmm1, xmm3
add eax, 8
cmp eax, coeff_end
jb L1
IACA_end
dec ecx
jnz outer_loop
;mov eax, 1
;int 0x80 ; exit() for 32bit code
xor edi, edi
mov eax, 231 ; exit_group(0). __NR_exit = 60.
syscall
section .data
x:
one: dq 1.0
section .bss
coeff: resq 24*1024 ; 6 * L1 size. Doesn't run any faster when it fits in L1 (resb)
coeff_end:
Experimental test
$ asm-link interiteration-test.asm
+ yasm -felf64 -Worphan-labels -gdwarf2 interiteration-test.asm
+ ld -o interiteration-test interiteration-test.o
$ perf stat ./interiteration-test
Performance counter stats for './interiteration-test':
928.543744 task-clock (msec) # 0.995 CPUs utilized
152 context-switches # 0.164 K/sec
1 cpu-migrations # 0.001 K/sec
52 page-faults # 0.056 K/sec
2,222,536,634 cycles # 2.394 GHz (50.14%)
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,723,575,954 instructions # 0.78 insns per cycle (75.06%)
246,414,304 branches # 265.377 M/sec (75.16%)
51,483 branch-misses # 0.02% of all branches (74.74%)
0.933372495 seconds time elapsed
Each branch / every 7 instructions is one iteration of the inner loop.
$ bc -l
bc 1.06.95
1723575954 / 7
246225136.28571428571428571428
# ~= number of branches: good
2222536634 / .
9.026
# cycles per iteration
IACA agrees: 9c per iteration on IvB
(not counting the nops from ALIGN):
$ iaca.sh -arch IVB interiteration-test
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - interiteration-test
Binary Format - 64Bit
Architecture - IVB
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 9.00 Cycles Throughput Bottleneck: InterIteration
Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |
-------------------------------------------------------------------------
| Cycles | 2.0 0.0 | 1.0 | 0.5 0.5 | 0.5 0.5 | 0.0 | 2.0 |
-------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
---------------------------------------------------------------------
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | movsd xmm3, qword ptr [eax]
| 1 | 1.0 | | | | | | | mulsd xmm3, xmm1
| 1 | 1.0 | | | | | | CP | mulsd xmm1, xmm2
| 1 | | 1.0 | | | | | CP | addsd xmm1, xmm3
| 1 | | | | | | 1.0 | | add eax, 0x8
| 1 | | | | | | 1.0 | | cmp eax, 0x63011c
| 0F | | | | | | | | jb 0xffffffffffffffe7
Total Num Of Uops: 6
With the addsd change suggested in my comments above (i.e. addsd xmm0,xmm3), this can be coded to use the full width of the registers, and the performance is twice as fast.
Loosely:
For the initial value of ones, it needs to be:
double ones[2] = { 1.0, x }
And we need to replace x with x2:
double x2[2] = { x * x, x * x }
If there is an odd number of coefficients, pad it with a zero to produce an even number of them.
And the pointer increment changes to 16 (two doubles per iteration); a C intrinsics sketch of this is below.
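For reference, here is roughly what that wide version looks like as C with SSE2 intrinsics (a sketch under the assumptions above: an even number of coefficients and a 16-byte-aligned array; the function and variable names are mine, not from the posted code):
#include <emmintrin.h>
#include <stddef.h>

/* Two coefficients per iteration with packed doubles (sketch). */
double poly_wide(const double *coeff, size_t n, double x)
{
    __m128d xn  = _mm_set_pd(x, 1.0);     /* {1, x}     -- the "one" pair */
    __m128d x2  = _mm_set1_pd(x * x);     /* {x^2, x^2} -- the "x2" pair  */
    __m128d sum = _mm_setzero_pd();

    for (size_t i = 0; i < n; i += 2) {           /* pointer increment of 16 bytes */
        __m128d c = _mm_load_pd(&coeff[i]);       /* {c[i], c[i+1]}                */
        sum = _mm_add_pd(sum, _mm_mul_pd(c, xn)); /* {c[i]*x^i, c[i+1]*x^(i+1)}    */
        xn  = _mm_mul_pd(xn, x2);                 /* advance both powers by x^2    */
    }
    __m128d hi = _mm_unpackhi_pd(sum, sum);       /* horizontal add of the two lanes */
    return _mm_cvtsd_f64(_mm_add_sd(sum, hi));
}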
Here are the test results I got. I did a number of trials, took the ones that had the best time, and lengthened the runtime by doing 100 iterations. std is the C version, dbl is your version, and qed is the "wide" version:
R=1463870188
C=100
T=100
I=100
x: 3.467957099973322e+00 3.467957099973322e+00
one: 1.000000000000000e+00 3.467957099973322e+00
x2: 1.202672644725538e+01 1.202672644725538e+01
std: 2.803772098439484e+56 (ELAP: 0.000019312)
dbl: 2.803772098439484e+56 (ELAP: 0.000019312)
qed: 2.803772098439492e+56 (ELAP: 0.000009060)
rtn_loop: 2.179378907910304e+55 2.585834207648461e+56
rtn_shuf: 2.585834207648461e+56 2.179378907910304e+55
rtn_add: 2.803772098439492e+56 2.585834207648461e+56
This was done on an i7 920 @ 2.67 GHz.
I think if you take the elapsed numbers and convert them, you'll see that your version is faster than you think.
I apologize in advance for switching to AT&T syntax, as I had difficulty getting the assembler to work the other way. Also, I'm using Linux, so I used the rdi and rsi registers to pass the coefficient pointers. If you're on Windows, the ABI is different and you'll have to adjust for that.
I did a C version and disassembled it. It was virtually identical to your code, except that it rearranged the non-xmm instructions a bit; I've added it below.
I believe I posted all the files, so you could conceivably run this on your system if you wished.
Here's the original code:
# xmmloop/dbl.s -- implement using single double
.globl dbl
# dbl -- compute result using single double
#
# arguments:
# rdi -- pointer to coeff vector
# rsi -- pointer to coeff vector end
dbl:
movsd x(%rip),%xmm2 # get x value
movsd one(%rip),%xmm1 # get ones
xorps %xmm0,%xmm0 # sum = 0
dbl_loop:
movsd (%rdi),%xmm3 # c[i]
add $8,%rdi # increment to next vector element
cmp %rsi,%rdi # done yet?
mulsd %xmm1,%xmm3 # c[i]*x^i
mulsd %xmm2,%xmm1 # x^(i+1)
addsd %xmm3,%xmm0 # sum += c[i]*x^i
jb dbl_loop # no, loop
retq
Here's the code changed to use movapd et al.:
# xmmloop/qed.s -- implement using packed doubles
.globl qed
# qed -- compute result using packed doubles
#
# arguments:
# rdi -- pointer to coeff vector
# rsi -- pointer to coeff vector end
qed:
movapd x2(%rip),%xmm2 # get x^2 value
movapd one(%rip),%xmm1 # get [1,x]
xorpd %xmm4,%xmm4 # sum = 0
qed_loop:
movapd (%rdi),%xmm3 # c[i]
add $16,%rdi # increment to next coefficient
cmp %rsi,%rdi # done yet?
mulpd %xmm1,%xmm3 # c[i]*x^i
mulpd %xmm2,%xmm1 # x^(i+2)
addpd %xmm3,%xmm4 # sum += c[i]*x^i
jb qed_loop # no, loop
movapd %xmm4,rtn_loop(%rip) # save intermediate DEBUG
movapd %xmm4,%xmm0 # get lower sum
shufpd $1,%xmm4,%xmm4 # get upper value into lower half
movapd %xmm4,rtn_shuf(%rip) # save intermediate DEBUG
addsd %xmm4,%xmm0 # add upper sum to lower
movapd %xmm0,rtn_add(%rip) # save intermediate DEBUG
retq
Here's a C version of the code:
// xmmloop/std -- compute result using C code
#include <xmmloop.h>
// std -- compute result using C
double
std(const double *cur,const double *ep)
{
double xt;
double xn;
double ci;
double sum;
xt = x[0];
xn = one[0];
sum = 0;
for (; cur < ep; ++cur) {
ci = *cur; // get c[i]
ci *= xn; // c[i]*x^i
xn *= xt; // x^(i+1)
sum += ci; // sum += c[i]*x^i
}
return sum;
}
Here's the test program I used:
// xmmloop/xmmloop -- test program
#define _XMMLOOP_GLO_
#include <xmmloop.h>
// tvget -- get high precision time
double
tvget(void)
{
struct timespec ts;
double sec;
clock_gettime(CLOCK_REALTIME,&ts);
sec = ts.tv_nsec;
sec /= 1e9;
sec += ts.tv_sec;
return sec;
}
// timeit -- get best time
void
timeit(fnc_p proc,double *cofptr,double *cofend,const char *tag)
{
double tvbest;
double tvbeg;
double tvdif;
double sum;
sum = 0;
tvbest = 1e9;
for (int trycnt = 1; trycnt <= opt_T; ++trycnt) {
tvbeg = tvget();
for (int iter = 1; iter <= opt_I; ++iter)
sum = proc(cofptr,cofend);
tvdif = tvget();
tvdif -= tvbeg;
if (tvdif < tvbest)
tvbest = tvdif;
}
printf("%s: %.15e (ELAP: %.9f)\n",tag,sum,tvbest);
}
// main -- main program
int
main(int argc,char **argv)
{
char *cp;
double *cofptr;
double *cofend;
double *cur;
double val;
long rseed;
int cnt;
--argc;
++argv;
rseed = 0;
cnt = 0;
for (; argc > 0; --argc, ++argv) {
cp = *argv;
if (*cp != '-')
break;
switch (cp[1]) {
case 'C':
cp += 2;
cnt = strtol(cp,&cp,10);
break;
case 'R':
cp += 2;
rseed = strtol(cp,&cp,10);
break;
case 'T':
cp += 2;
opt_T = (*cp != 0) ? strtol(cp,&cp,10) : 1;
break;
case 'I':
cp += 2;
opt_I = (*cp != 0) ? strtol(cp,&cp,10) : 1;
break;
}
}
if (rseed == 0)
rseed = time(NULL);
srand48(rseed);
printf("R=%ld\n",rseed);
if (cnt == 0)
cnt = 100;
if (cnt & 1)
++cnt;
printf("C=%d\n",cnt);
if (opt_T == 0)
opt_T = 100;
printf("T=%d\n",opt_T);
if (opt_I == 0)
opt_I = 100;
printf("I=%d\n",opt_I);
cofptr = malloc(sizeof(double) * cnt);
cofend = &cofptr[cnt];
val = drand48();
for (; val < 3; val += 1.0);
x[0] = val;
x[1] = val;
DMP(x);
one[0] = 1.0;
one[1] = val;
DMP(one);
val *= val;
x2[0] = val;
x2[1] = val;
DMP(x2);
for (cur = cofptr; cur < cofend; ++cur) {
val = drand48();
val *= 1e3;
*cur = val;
}
timeit(std,cofptr,cofend,"std");
timeit(dbl,cofptr,cofend,"dbl");
timeit(qed,cofptr,cofend,"qed");
DMP(rtn_loop);
DMP(rtn_shuf);
DMP(rtn_add);
return 0;
}
And the header file:
// xmmloop/xmmloop.h -- common control
#ifndef _xmmloop_xmmloop_h_
#define _xmmloop_xmmloop_h_
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#ifdef _XMMLOOP_GLO_
#define EXTRN_XMMLOOP /**/
#else
#define EXTRN_XMMLOOP extern
#endif
#define XMMALIGN __attribute__((aligned(16)))
EXTRN_XMMLOOP int opt_T;
EXTRN_XMMLOOP int opt_I;
EXTRN_XMMLOOP double x[2] XMMALIGN;
EXTRN_XMMLOOP double x2[2] XMMALIGN;
EXTRN_XMMLOOP double one[2] XMMALIGN;
EXTRN_XMMLOOP double rtn_loop[2] XMMALIGN;
EXTRN_XMMLOOP double rtn_shuf[2] XMMALIGN;
EXTRN_XMMLOOP double rtn_add[2] XMMALIGN;
#define DMP(_sym) \
printf(#_sym ": %.15e %.15e\n",_sym[0],_sym[1]);
typedef double (*fnc_p)(const double *cofptr,const double *cofend);
double std(const double *cofptr,const double *cofend);
double dbl(const double *cofptr,const double *cofend);
double qed(const double *cofptr,const double *cofend);
#endif
Related
I have recently started the Go track on exercism.io and had fun optimizing the "nth-prime" calculation. Actually I came across a funny fact I can't explain. Imagine the following code:
// Package prime provides ...
package prime
// Nth function checks for the prime number on position n
func Nth(n int) (int, bool) {
if n <= 0 {
return 0, false
}
if (n == 1) {
return 2, true
}
currentNumber := 1
primeCounter := 1
for n > primeCounter {
currentNumber+=2
if isPrime(currentNumber) {
primeCounter++
}
}
return currentNumber, primeCounter==n
}
// isPrime function checks if a number
// is a prime number
func isPrime(n int) bool {
//useless because never triggered but makes it faster??
if n < 2 {
println("n < 2")
return false
}
//useless because never triggered but makes it faster??
if n%2 == 0 {
println("n%2")
return n==2
}
for i := 3; i*i <= n; i+=2 {
if n%i == 0 {
return false
}
}
return true
}
In the private function isPrime I have two initial if-statements that are never triggered, because I only pass in odd numbers greater than 2. The benchmark returns the following:
Running tool: /usr/bin/go test -benchmem -run=^$ -bench ^(BenchmarkNth)$
BenchmarkNth-8 100 18114825 ns/op 0 B/op 0
If I remove the never triggered if-statements the benchmark goes slower:
Running tool: /usr/bin/go test -benchmem -run=^$ -bench ^(BenchmarkNth)$
BenchmarkNth-8 50 21880749 ns/op 0 B/op 0
I have run the benchmark multiple times, changing the code back and forth, and I always get more or less the same numbers. I can't think of a reason why these two if-statements should make the execution faster. Yes, it is micro-optimization, but I want to know: why?
Here is the whole exercise from exercism with test-cases: nth-prime
The Go version I am using is 1.12.1 linux/amd64, on Manjaro Linux with i3.
What happens is that the compiler is given some guarantees about the input when those ifs are present.
If those assertions are removed, the compiler has to add the check itself, and the way it does that is by validating the value on every iteration. We can take a look at the assembly code to prove it (by passing -gcflags=-S to the go test command).
With the if's:
0x004b 00075 (func.go:16) JMP 81
0x004d 00077 (func.go:16) LEAQ 2(BX), AX
0x0051 00081 (func.go:16) MOVQ AX, DX
0x0054 00084 (func.go:16) IMULQ AX, AX
0x0058 00088 (func.go:16) CMPQ AX, CX
0x005b 00091 (func.go:16) JGT 133
0x005d 00093 (func.go:17) TESTQ DX, DX
0x0060 00096 (func.go:17) JEQ 257
0x0066 00102 (func.go:17) MOVQ CX, AX
0x0069 00105 (func.go:17) MOVQ DX, BX
0x006c 00108 (func.go:17) CQO
0x006e 00110 (func.go:17) IDIVQ BX
0x0071 00113 (func.go:17) TESTQ DX, DX
0x0074 00116 (func.go:17) JNE 77
Without the if's:
0x0016 00022 (func.go:16) JMP 28
0x0018 00024 (func.go:16) LEAQ 2(BX), AX
0x001c 00028 (func.go:16) MOVQ AX, DX
0x001f 00031 (func.go:16) IMULQ AX, AX
0x0023 00035 (func.go:16) CMPQ AX, CX
0x0026 00038 (func.go:16) JGT 88
0x0028 00040 (func.go:17) TESTQ DX, DX
0x002b 00043 (func.go:17) JEQ 102
0x002d 00045 (func.go:17) MOVQ CX, AX
0x0030 00048 (func.go:17) MOVQ DX, BX
0x0033 00051 (func.go:17) CMPQ BX, $-1
0x0037 00055 (func.go:17) JEQ 64
0x0039 00057 (func.go:17) CQO
0x003b 00059 (func.go:17) IDIVQ BX
0x003e 00062 (func.go:17) JMP 69
0x0040 00064 func.go:17) NEGQ AX
0x0043 00067 (func.go:17) XORL DX, DX
0x0045 00069 (func.go:17) TESTQ DX, DX
0x0048 00072 (func.go:17) JNE 24
The instruction at offset 51 in the assembly, 0x0033 00051 (func.go:17) CMPQ BX, $-1, is the culprit.
Line 16 of the Go code, for i := 3; i*i <= n; i+=2, is translated the same way in both cases. But line 17, if n%i == 0, which runs on every iteration, compiles to more instructions, and as a result more work for the CPU in total.
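For what it's worth, the guarded path is the usual "divisor might be -1" fixup: Go defines the result of dividing the most negative integer by -1, but the hardware IDIVQ faults on that case, so the compiler has to branch around it unless it can rule it out. Written out in C purely as an illustration of what the extra instructions do (this is not the compiler's actual code):
#include <stdint.h>

/* Roughly what CMPQ BX,$-1 / JEQ ... / NEGQ AX / XORL DX,DX amount to:
   guard the one divisor value for which IDIVQ would fault. */
int64_t go_style_mod(int64_t n, int64_t i)
{
    if (i == -1)        /* CMPQ BX, $-1 ; JEQ              */
        return 0;       /* XORL DX, DX: x % -1 is always 0 */
    return n % i;       /* CQO ; IDIVQ BX                  */
}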
Something similar was done in the encoding/base64 package, ensuring the loop won't receive a nil value. You can take a look here:
https://go-review.googlesource.com/c/go/+/151158/3/src/encoding/base64/base64.go
This check was added intentionally. In your case, you optimized it accidentally :)
I want to run the following code (in Intel syntax) in gcc (AT&T syntax).
; float a[128], b[128], c[128];
; for (int i = 0; i < 128; i++) a[i] = b[i] + c[i];
; Assume that a, b and c are aligned by 32
xor ecx, ecx ; Loop counter i = 0
L: vmovaps ymm0, [b+rcx] ; Load 8 elements from b
vaddps ymm0,ymm0,[c+rcx] ; Add 8 elements from c
vmovaps [a+rcx], ymm0 ; Store result in a
add ecx,32 ; 8 elements * 4 bytes = 32
cmp ecx, 512 ; 128 elements * 4 bytes = 512
jb L ;Loop
The code is from Optimizing Subroutines in Assembly Language.
The code I've written so far is:
static inline void addArray(float *a, float *b, float *c) {
__asm__ __volatile__ (
"nop \n"
"xor %%ecx, %%ecx \n" //;Loop counter set to 0
"loop: \n\t"
"vmovaps %1, %%ymm0 \n" //;Load 8 elements from b <== WRONG
"vaddps %2, %%ymm0, %%ymm0 \n" //;Add 8 elements from c <==WRONG
"vmovaps %%ymm0, %0 \n" //;Store result in a
"add 0x20, %%ecx \n" //;8 elemtns * 4 bytes = 32 (0x20)
"cmp 0x200,%%ecx \n" //;128 elements * 4 bytes = 512 (0x200)
"jb loop \n" //;Loop"
"nop \n"
: "=m"(a) //Outputs
: "m"(b), "m"(c) //Inputs
: "%ecx","%ymm0" //Modifies ECX and YMM0
);
}
The lines marked as "wrong" generate the following (excerpt from gdb disassemble):
0x0000000000000b78 <+19>: vmovaps -0x10(%rbp),%ymm0
0x0000000000000b7d <+24>: vaddps -0x18(%rbp),%ymm0,%ymm0
I want to get something like this (I guess):
vmovaps -0x10(%rbp,%ecx,%0x8),%ymm0
But I do not know how to specify %ecx as my index register.
Can you help me, please?
EDIT
I've tried (%1, %%rcx):
__asm__ __volatile__ (
"nop \n"
"xor %%ecx, %%ecx \n" //;Loop counter set to 0
"loop: \n\t"
"vmovaps (%1, %%rcx), %%ymm0 \n" //;Load 8 elements from b <== MODIFIED HERE
"vaddps %2, %%ymm0, %%ymm0 \n" //;Add 8 elements from c
"vmovaps %%ymm0, %0 \n" //;Store result in a
"add 0x20, %%ecx \n" //;8 elemtns * 4 bytes = 32 (0x20)
"cmp 0x200,%%ecx \n" //;128 elements * 4 bytes = 512 (0x200)
"jb loop \n" //;Loop"
"nop \n"
: "=m"(a) //Outputs
: "m"(b), "m"(c) //Inputs
: "%ecx","%ymm0" //Modifies ECX and YMM0
);
And I got:
inline1.cpp: Assembler messages:
inline1.cpp:90: Error: found '(', expected: ')'
inline1.cpp:90: Error: junk `(%rbp),%rcx)' after expression
I don't think it is possible to translate this literally into GAS inline assembly. In AT&T syntax, the syntax is:
displacement(base register, offset register, scalar multiplier)
which would produce something akin to:
movl -4(%ebp, %ecx, 4), %eax
or in your case:
vmovaps -16(%rsp, %rcx, 1), %ymm0
The problem is, when you use a memory constraint (m), the inline assembler is going to emit the following wherever you write %n (where n is the number of the input/output):
-16(%rsp)
There is no way to manipulate the above into the form you actually want. You can write:
(%1, %%rcx)
but this will produce:
(-16(%rsp),%rcx)
which is clearly wrong. There is no way to get the offset register inside of those parentheses, where it belongs, since %n is emitting the whole -16(%rsp) as a chunk.
Of course, this is not really an issue, since you write inline assembly to get speed, and there's nothing speedy about loading from memory. You should have the inputs in registers, and when you use a register constraint (r) for the input/output, you don't have a problem. Notice that this will require modifying your code slightly.
Other things wrong with your inline assembly include:
Numeric literals begin with $.
Instructions should have size suffixes, like l for 32-bit and q for 64-bit.
You are clobbering memory when you write through a, so you should have a memory clobber.
The nop instructions at the beginning and the end are completely pointless. They aren't even aligning the branch target.
Every line should really end with a tab character (\t), in addition to a new-line (\n), so that you get proper alignment when you inspect the disassembly.
Here is my version of the code:
void addArray(float *a, float *b, float *c) {
__asm__ __volatile__ (
"xorl %%ecx, %%ecx \n\t" // Loop counter set to 0
"loop: \n\t"
"vmovaps (%1,%%rcx), %%ymm0 \n\t" // Load 8 elements from b
"vaddps (%2,%%rcx), %%ymm0, %%ymm0 \n\t" // Add 8 elements from c
"vmovaps %%ymm0, (%0,%%rcx) \n\t" // Store result in a
"addl $0x20, %%ecx \n\t" // 8 elemtns * 4 bytes = 32 (0x20)
"cmpl $0x200, %%ecx \n\t" // 128 elements * 4 bytes = 512 (0x200)
"jb loop" // Loop"
: // Outputs
: "r" (a), "r" (b), "r" (c) // Inputs
: "%ecx", "%ymm0", "memory" // Modifies ECX, YMM0, and memory
);
}
This causes the compiler to emit the following:
addArray(float*, float*, float*):
xorl %ecx, %ecx
loop:
vmovaps (%rsi,%rcx), %ymm0 # b
vaddps (%rdx,%rcx), %ymm0, %ymm0 # c
vmovaps %ymm0, (%rdi,%rcx) # a
addl $0x20, %ecx
cmpl $0x200, %ecx
jb loop
vzeroupper
retq
Or, in the more familiar Intel syntax:
addArray(float*, float*, float*):
xor ecx, ecx
loop:
vmovaps ymm0, YMMWORD PTR [rsi + rcx]
vaddps ymm0, ymm0, YMMWORD PTR [rdx + rcx]
vmovaps YMMWORD PTR [rdi + rcx], ymm0
add ecx, 32
cmp ecx, 512
jb loop
vzeroupper
ret
In the System V 64-bit calling convention, the first three parameters are passed in the rdi, rsi, and rdx registers, so the code doesn't need to move the parameters into registers—they are already there.
But you are not using input/output constraints to their fullest. You don't need rcx to be used as the counter. Nor do you need to use ymm0 as the scratch register. If you let the compiler pick which free registers to use, it will make the code more efficient. You also won't need to provide an explicit clobber list:
#include <stdint.h>
#include <x86intrin.h>
void addArray(float *a, float *b, float *c) {
uint64_t temp = 0;
__m256 ymm;
__asm__ __volatile__(
"loop: \n\t"
"vmovaps (%3,%0), %1 \n\t" // Load 8 elements from b
"vaddps (%4,%0), %1, %1 \n\t" // Add 8 elements from c
"vmovaps %1, (%2,%0) \n\t" // Store result in a
"addl $0x20, %0 \n\t" // 8 elemtns * 4 bytes = 32 (0x20)
"cmpl $0x200, %0 \n\t" // 128 elements * 4 bytes = 512 (0x200)
"jb loop" // Loop
: "+r" (temp), "=x" (ymm)
: "r" (a), "r" (b), "r" (c)
: "memory"
);
}
Of course, as has been mentioned in the comments, this entire exercise is a waste of time. GAS-style inline assembly, although powerful, is exceedingly difficult to write correctly (I'm not even 100% positive that my code here is correct!), so you should not write anything using inline assembly that you absolutely don't have to. And this is certainly not a case where you have to—the compiler will optimize the addition loop automatically:
void addArray(float *a, float *b, float *c) {
for (int i = 0; i < 128; i++) a[i] = b[i] + c[i];
}
With -O2 and -mavx2, GCC compiles this to the following:
addArray(float*, float*, float*):
xor eax, eax
.L2:
vmovss xmm0, DWORD PTR [rsi+rax]
vaddss xmm0, xmm0, DWORD PTR [rdx+rax]
vmovss DWORD PTR [rdi+rax], xmm0
add rax, 4
cmp rax, 512
jne .L2
rep ret
Well, that looks awfully familiar, doesn't it? To be fair, it isn't vectorized like your code is. You can get that by using -O3 or -ftree-vectorize, but you also get a lot more code generated, so I'd need a benchmark to convince me that it was actually faster and worth the explosion in code size. But most of this is to handle cases where the input isn't aligned—if you indicate that it is aligned and that the pointer is restricted, that solves these problems and improves the code generation substantially. Notice that it is completely unrolling the loop, as well as vectorizing the addition.
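For example, something along these lines (a sketch using GCC-specific builtins; the function name is mine, and the 32-byte alignment is a promise the caller has to keep):
/* Hinted version: restrict-qualified pointers plus an alignment assumption
   let GCC emit a clean aligned vector loop without the unaligned/remainder
   handling. */
void addArrayHinted(float * __restrict a, float * __restrict b, float * __restrict c)
{
    a = (float *)__builtin_assume_aligned(a, 32);
    b = (float *)__builtin_assume_aligned(b, 32);
    c = (float *)__builtin_assume_aligned(c, 32);
    for (int i = 0; i < 128; i++)
        a[i] = b[i] + c[i];
}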
I have std::vector<double> X,Y both of size N (with N%16==0) and I want to calculate sum(X[i]*Y[i]). That's a classical use case for Fused Multiply and Add (FMA), which should be fast on AVX-capable processors. I know all my target CPU's are Intel, Haswell or newer.
How do I get GCC to emit that AVX code? -mfma is part of the solution, but do I need other switches?
And is std::vector<double>::operator[] hindering this? I know I can transform
size_t N = X.size();
double sum = 0.0;
for (size_t i = 0; i != N; ++i) sum += X[i] * Y[i];
to
size_t N = X.size();
double sum = 0.0;
double const* Xp = &X[0];
double const* Yp = &Y[0];
for (size_t i = 0; i != N; ++i) sum += Xp[i] * Yp[i];
so the compiler can spot that &X[0] doesn't change in the loop. But is this sufficient or even necessary?
Current compiler is GCC 4.9.2, Debian 8, but could upgrade to GCC 5 if necessary.
Did you look at the assembly? I put
double foo(std::vector<double> &X, std::vector<double> &Y) {
size_t N = X.size();
double sum = 0.0;
for (size_t i = 0; i <N; ++i) sum += X[i] * Y[i];
return sum;
}
into http://gcc.godbolt.org/ and looked at the assembly in GCC 4.9.2 with -O3 -mfma and I see
.L3:
vmovsd (%rcx,%rax,8), %xmm1
vfmadd231sd (%rsi,%rax,8), %xmm1, %xmm0
addq $1, %rax
cmpq %rdx, %rax
jne .L3
So it uses fma. However, it does not vectorize the loop (the s in sd means single, i.e. not packed, and the d means double-precision floating point).
To vectorize the loop you need to enable associative math e.g. with -Ofast. Using -Ofast -mavx2 -mfma gives
.L8:
vmovupd (%rax,%rsi), %xmm2
addq $1, %r10
vinsertf128 $0x1, 16(%rax,%rsi), %ymm2, %ymm2
vfmadd231pd (%r12,%rsi), %ymm2, %ymm1
addq $32, %rsi
cmpq %r10, %rdi
ja .L8
So now it's vectorized (pd means packed doubles). However, it's not unrolled. This is currently a limitation of GCC: you need to unroll several times to hide the latency of the dependency chain. If you want the compiler to do this for you, consider using Clang, which unrolls four times; otherwise, unroll by hand with intrinsics.
Note that, unlike GCC, Clang does not use fma by default with -mfma. To use fma with Clang, pass -ffp-contract=fast (e.g. -O3 -mfma -ffp-contract=fast), use #pragma STDC FP_CONTRACT ON, or enable associative math with e.g. -Ofast. You're going to want to enable associative math anyway if you want Clang to vectorize the loop.
See Fused multiply add and default rounding modes and https://stackoverflow.com/a/34461738/2542702 for more info about enabling fma with different compilers.
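For completeness, the pragma route mentioned above looks like this (a minimal sketch; the pragma only licenses contraction of the mul+add into an fma, it does not by itself vectorize or reorder the reduction):
/* Per-file (or per-block) contraction without -Ofast. */
#pragma STDC FP_CONTRACT ON

double dot_contract(const double *X, const double *Y, int N)
{
    double sum = 0.0;
    for (int i = 0; i < N; ++i)
        sum += X[i] * Y[i];   /* mul+add may now be contracted to fma */
    return sum;
}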
GCC creates a lot of extra code to handle misalignment and the case where N is not a multiple of 8. You can tell the compiler to assume the arrays are aligned using __builtin_assume_aligned, and that N is a multiple of 8 by writing the loop bound as N & -8.
The following code with -Ofast -mavx2 -mfma
double foo2(double * __restrict X, double * __restrict Y, int N) {
X = (double*)__builtin_assume_aligned(X,32);
Y = (double*)__builtin_assume_aligned(Y,32);
double sum = 0.0;
for (int i = 0; i < (N &-8); ++i) sum += X[i] * Y[i];
return sum;
}
produces the following simple assembly
andl $-8, %edx
jle .L4
subl $4, %edx
vxorpd %xmm0, %xmm0, %xmm0
shrl $2, %edx
xorl %ecx, %ecx
leal 1(%rdx), %eax
xorl %edx, %edx
.L3:
vmovapd (%rsi,%rdx), %ymm2
addl $1, %ecx
vfmadd231pd (%rdi,%rdx), %ymm2, %ymm0
addq $32, %rdx
cmpl %eax, %ecx
jb .L3
vhaddpd %ymm0, %ymm0, %ymm0
vperm2f128 $1, %ymm0, %ymm0, %ymm1
vaddpd %ymm1, %ymm0, %ymm0
vzeroupper
ret
.L4:
vxorpd %xmm0, %xmm0, %xmm0
ret
I'm not sure this will get you all the way there, but I'm almost sure it's a big part of the solution.
Break the loop into two: an outer loop from 0 to N with step M > 1 (I'd try M of 16, 8, and 4 and look at the asm), and an inner loop from 0 to M. Don't worry about the extra iterator math; GCC is smart enough with it.
GCC should unroll the inner loop, and then it can SIMD it and maybe use FMA.
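A sketch of what that blocked loop could look like (M = 8 here; N is assumed to be a multiple of M, and you still need -Ofast or similar for the compiler to reorder the reduction):
#include <stddef.h>

/* Fixed-size inner loop gives the compiler an easy unroll/vectorize target. */
double dot_blocked(const double *X, const double *Y, size_t N)
{
    double sum = 0.0;
    for (size_t i = 0; i < N; i += 8) {
        double partial = 0.0;
        for (size_t j = 0; j < 8; ++j)        /* inner loop, constant trip count */
            partial += X[i + j] * Y[i + j];
        sum += partial;
    }
    return sum;
}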
I want to test if two SSE registers are not both zero without destroying them.
This is the code I currently have:
uint8_t *src; // Assume it is initialized and 16-byte aligned
__m128i xmm0, xmm1, xmm2;
xmm0 = _mm_load_si128((__m128i const*)&src[i]); // Need to preserve xmm0 & xmm1
xmm1 = _mm_load_si128((__m128i const*)&src[i+16]);
xmm2 = _mm_or_si128(xmm0, xmm1);
if (!_mm_testz_si128(xmm2, xmm2)) { // Test both are not zero
}
Is this the best way (using up to SSE 4.2)?
I learned something useful from this question. Let's first look at some scalar code
extern foo2(int x, int y);
void foo(int x, int y) {
if((x || y)!=0) foo2(x,y);
}
Compile this with gcc -O3 -S -masm=intel test.c and the important assembly is
mov eax, edi ; edi = x, esi = y -> copy x into eax
or eax, esi ; eax = x | y and set zero flag in FLAGS if zero
jne .L4 ; jump not zero
Now let's look at testing SIMD registers for zero. Unlike scalar code there is no SIMD FLAGS register. However, with SSE4.1 there are SIMD test instructions which can set the zero flag (and carry flag) in the scalar FLAGS register.
extern foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
__m128i z = _mm_or_si128(x,y);
if (!_mm_testz_si128(z,z)) foo2(x,y);
}
Compile with c99 -msse4.1 -O3 -masm=intel -S test_SSE.c and the important assembly is
movdqa xmm2, xmm0 ; xmm0 = x, xmm1 = y, copy x into xmm2
por xmm2, xmm1 ; xmm2 = x | y
ptest xmm2, xmm2 ; set zero flag if zero
jne .L4 ; jump not zero
Notice that this takes one more instruction because the packed bit-wise OR does not set the zero flag. Notice also that both the scalar version and the SIMD version need to use an additional register (eax in the scalar case and xmm2 in the SIMD case). So to answer your question your current solution is the best you can do.
However, if you don't have a processor with SSE4.1 or better, an alternative that only needs SSE2 is to use _mm_movemask_epi8:
extern foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
if (_mm_movemask_epi8(_mm_or_si128(x,y))) foo2(x,y);
}
The important assembly is
movdqa xmm2, xmm0
por xmm2, xmm1
pmovmskb eax, xmm2
test eax, eax
jne .L4
Notice that this needs one more instruction than with the SSE4.1 ptest instruction.
Until now I have been using the pmovmskb instruction because the latency is better on pre-Sandy Bridge processors than with ptest. However, that reasoning predates Haswell: on Haswell the latency of pmovmskb is worse than the latency of ptest. They both have the same throughput, but in this case that's not really important. What's important (which I did not realize before) is that pmovmskb does not set the FLAGS register and so it requires another instruction. So now I'll be using ptest in my critical loop. Thank you for your question.
Edit: as suggested by the OP there is a way this can be done without using another SSE register.
extern foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
if (_mm_movemask_epi8(x) | _mm_movemask_epi8(y)) foo2(x,y);
}
The relevant assembly from GCC is:
pmovmskb eax, xmm0
pmovmskb edx, xmm1
or edx, eax
jne .L4
Instead of using another xmm register this uses two scalar registers.
Note that fewer instructions does not necessarily mean better performance. Which of these solutions is best? You have to test each of them to find out.
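If it helps, here is a rough timing harness for comparing the variants (a sketch, not a rigorous benchmark: compile with -O2 -msse4.1, swap in the other test functions by hand, and check the generated asm to make sure the check was not optimized away):
#include <smmintrin.h>   /* SSE4.1 for _mm_testz_si128 */
#include <stdio.h>
#include <time.h>

static int test_ptest(__m128i x, __m128i y)   /* the SSE4.1 variant from above */
{
    __m128i z = _mm_or_si128(x, y);
    return !_mm_testz_si128(z, z);
}

int main(void)
{
    __m128i y = _mm_set1_epi32(0);
    long long hits = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < 100000000L; i++) {
        __m128i x = _mm_cvtsi32_si128((int)(i & 3));  /* varies so the test stays in the loop */
        hits += test_ptest(x, y);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("hits=%lld, %.3f s\n", hits, sec);
    return 0;
}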
If you use C/C++, you cannot control the individual CPU registers. If you want full control, you must use assembler.
Most modern CMOS cameras can produce 12-bit Bayered images.
What would be the fastest way to convert an array of 12-bit image data to 16-bit so that processing is possible? The actual problem is padding each 12-bit number with 4 zeros; little endian can be assumed, and SSE2/SSE3/SSE4 is also acceptable.
Code added:
int* imagePtr = (int*)Image.data;
fixed (float* imageData = img.Data)
{
float* imagePointer = imageData;
for (int t = 0; t < total; t++)
{
int i1 = *imagePtr;
imagePtr = (int*)((ushort*)imagePtr + 1);
int i2 = *imagePtr;
imagePtr = (int*)((ushort*)imagePtr + 2);
*imagePointer = (float)(((i1 << 4) & 0x00000FF0) | ((i1 >> 8) & 0x0000000F));
imagePointer++;
*imagePointer = (float)((i1 >> 12) & 0x00000FFF);
imagePointer++;
*imagePointer = (float)(((i2 >> 4) & 0x00000FF0) | ((i2 >> 12) & 0x0000000F));
imagePointer++;
*imagePointer = (float)((i2 >> 20) & 0x00000FFF);
imagePointer++;
}
}
I cannot guarantee this is the fastest approach, but it is one that uses SSE. Eight 12-bit to 16-bit conversions are done per iteration, roughly two conversions per step (i.e., each iteration takes multiple steps).
This approach straddles the 12bit integers around the 16bit boundaries in the xmm register. Below shows how this is done.
One xmm register is being used (assume xmm0). The state of the register is represented by one line of letters.
Each letter represents 4 bits of a 12bit integer (ie, AAA is the entire first 12bit word in the array).
Each gap represents a 16-bit boundary.
>>2 indicates a logical right shift by one byte (two 4-bit letters).
The caret (^) symbol is used to highlight which 12-bit integers are straddling a 16-bit boundary in each step.
:
load
AAAB BBCC CDDD EEEF FFGG GHHH JJJK KKLL
^^^
>>2
00AA ABBB CCCD DDEE EFFF GGGH HHJJ JKKK
^^^ ^^^
>>2
0000 AAAB BBCC CDDD EEEF FFGG GHHH JJJK
^^^ ^^^
>>2
0000 00AA ABBB CCCD DDEE EFFF GGGH HHJJ
^^^ ^^^
>>2
0000 0000 AAAB BBCC CDDD EEEF FFGG GHHH
^^^
At each step, we can extract the aligned 12bit integers and store them in the xmm1 register. At the end, our xmm1 will look as follows. Question marks denote values which we do not care about.
AAA? ?BBB CCC? ?DDD EEE? ?FFF GGG? ?HHH
Extract the high aligned integers (A, C, E, G) into xmm2 and then, on xmm2, perform a right logical word shift of 4 bits. This will convert the high aligned integers to low aligned. Blend these adjusted integers back into xmm1. The state of xmm1 is now:
?AAA ?BBB ?CCC ?DDD ?EEE ?FFF ?GGG ?HHH
Finally we can mask out the integers (ie, convert the ?'s to 0's) with 0FFFh on each word.
0AAA 0BBB 0CCC 0DDD 0EEE 0FFF 0GGG 0HHH
Now xmm1 contains eight consecutive converted integers.
The following NASM program demonstrates this algorithm.
global main
segment .data
sample dw 1234, 5678, 9ABCh, 1234, 5678, 9ABCh, 1234, 5678
low12 times 8 dw 0FFFh
segment .text
main:
movdqa xmm0, [sample]
pblendw xmm1, xmm0, 10000000b
psrldq xmm0, 1
pblendw xmm1, xmm0, 01100000b
psrldq xmm0, 1
pblendw xmm1, xmm0, 00011000b
psrldq xmm0, 1
pblendw xmm1, xmm0, 00000110b
psrldq xmm0, 1
pblendw xmm1, xmm0, 00000001b
pblendw xmm2, xmm1, 10101010b
psrlw xmm2, 4
pblendw xmm1, xmm2, 10101010b
pand xmm1, [low12] ; low12 could be stored in another xmm register
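The same sequence expressed with SSE4.1 intrinsics, in case that is easier to drop into C code (a sketch mirroring the assembly above, with the same blend masks; pblendw is _mm_blend_epi16 and needs SSE4.1):
#include <smmintrin.h>   /* SSE4.1 */

/* Unpack eight packed 12-bit values (same layout as `sample` above) into
   eight 16-bit words. `gathered` starts zeroed instead of uninitialized;
   every word lane is overwritten by the blends anyway. */
static __m128i unpack12to16(const unsigned char *src)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    __m128i gathered = _mm_setzero_si128();

    gathered = _mm_blend_epi16(gathered, v, 0x80);   /* pblendw ..., 10000000b */
    v = _mm_srli_si128(v, 1);                        /* psrldq xmm0, 1         */
    gathered = _mm_blend_epi16(gathered, v, 0x60);   /* 01100000b              */
    v = _mm_srli_si128(v, 1);
    gathered = _mm_blend_epi16(gathered, v, 0x18);   /* 00011000b              */
    v = _mm_srli_si128(v, 1);
    gathered = _mm_blend_epi16(gathered, v, 0x06);   /* 00000110b              */
    v = _mm_srli_si128(v, 1);
    gathered = _mm_blend_epi16(gathered, v, 0x01);   /* 00000001b              */

    /* Shift the high-aligned words down by 4 bits, blend back, then mask. */
    __m128i high = _mm_srli_epi16(gathered, 4);      /* psrlw xmm2, 4          */
    gathered = _mm_blend_epi16(gathered, high, 0xAA);/* 10101010b              */
    return _mm_and_si128(gathered, _mm_set1_epi16(0x0FFF));
}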
I'd try to build a solution around the SSSE3 instruction PSHUFB;
Given A=[a0, a1, a2, a3 ... a7], B=[b0, b1, b2, .. b7];
PSHUFB(A,B) = [a_b0, a_b1, a_b2, ... a_b7],
except that a result byte will be zero if the top bit of bX is 1.
Thus, if
A = [aa ab bb cc cd dd ee ef] == input vector
C=PSHUFB(A, [0 1 1 2 3 4 4 5]) = [aa ab ab bb cc cd cd dd]
C=PSRLW (C, [4 0 4 0]) = [0a aa ab bb 0c cc cd dd] // (>> 4)
C=PSLLW (C, 4) = [aa a0 bb b0 cc c0 dd d0] // << by immediate
A complete solution would read in 3 or 6 mmx / xmm registers and output 4/8 mmx/xmm registers each round. The middle two outputs will have to be combined from two input chunks, requiring some extra copying and combining of registers.