I would like to accelerate a program I'm working on by dynamically generating code with LLVM's JIT. The algorithm can operate on vectors, and I'd rather like to use the SIMD vector extensions in LLVM to do this (not only does it make some operations faster, it actually makes the code generation simpler).
Do I stand any chance of having this work in a reasonably portable way?
On the C side of things, I'll be compiling with gcc, clang or, maybe, icc. My vectors are going to be simple float x 4 or double x 4 things. The de facto standard for non-platform-specific vector operations in this world appears to be the gcc vector extension:
typedef double Vector4 __attribute__ ((vector_size (sizeof(double)*4)));
Inspection of generated code shows that clang will pass a double x 4 vector in registers, while gcc wants it on the stack --- which is bad. (They both pass float x 4 vectors in registers.)
My understanding is that the two systems are supposed to be ABI compatible, but obviously vectors don't count. Can I actually do this?
My example program is:
typedef double real;
typedef real Vector4 __attribute__ ((vector_size (sizeof(real)*4)));
Vector4 scale(Vector4 a)
{
Vector4 s = {2, 2, 2, 2};
return a*s;
}
This compiles with LLVM into:
scale:
movapd .LCPI0_0(%rip), %xmm2
mulpd %xmm2, %xmm0
mulpd %xmm2, %xmm1
ret
...but gcc produces this horror:
scale:
subq $64, %rsp
movq %rdi, %rax
movsd .LC0(%rip), %xmm0
movapd 72(%rsp), %xmm1
movsd %xmm0, -56(%rsp)
movsd %xmm0, -48(%rsp)
movsd %xmm0, -72(%rsp)
movsd %xmm0, -64(%rsp)
mulpd -56(%rsp), %xmm1
movapd 88(%rsp), %xmm0
mulpd -72(%rsp), %xmm0
movapd %xmm1, -104(%rsp)
movq -104(%rsp), %rdx
movapd %xmm1, -24(%rsp)
movapd %xmm0, -8(%rsp)
movq %rdx, (%rdi)
movq -16(%rsp), %rdx
movq %rdx, 8(%rdi)
movq -8(%rsp), %rdx
movq %rdx, 16(%rdi)
movq (%rsp), %rdx
movq %rdx, 24(%rdi)
addq $64, %rsp
ret
If I redefine real to be a float, I get this from both compilers (they produce identical code):
scale:
mulps .LCPI0_0(%rip), %xmm0
ret
These were all compiled with $CC -O3 -S -msse test.c.
Update: It suddenly occurs to me that the simple solution is to just use LLVM to create a trampoline that translates from structures to vectors and vice versa. That way the interoperability problem is reduced to pass-by-value structures, which are nailed down by the ABI; the vectors only exist in LLVM-land. It means I only get to use the SIMD stuff inside LLVM, but I can live with that.
However, I would still like to know the answer to the above; vectors are awesome and I'd like to be able to use them more.
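To make the trampoline idea concrete, here is a minimal sketch of the C-side view I have in mind (Vec4Box and jit_scale are hypothetical names, not part of any real API): only a pass-by-value struct crosses the compiler boundary, and the generated code does the conversion internally.

/* Hypothetical C-side interface for the trampoline approach: only
   pass-by-value structs cross the boundary, so the layout is pinned
   down by the platform ABI regardless of which C compiler is used. */
typedef struct { double v[4]; } Vec4Box;

/* Implemented in JIT-generated code; internally the trampoline
   converts the struct to an LLVM <4 x double> and back. */
extern Vec4Box jit_scale(Vec4Box a);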
Update update: Turns out that the way C passes structures by value is insanely... er, insane! A struct { double x, y, z; } is passed by pointer; a struct { float x, y, z; } is passed in a pair of %xmm registers: x and y are packed into the first, and z is in the second...
Simple and unpainful it's not!
Basically I am trying to understand why both gcc and clang use xmm registers for their __builtin_memset expansion even when the memory destination and size are both divisible by sizeof(ymm) (or zmm for that matter) and the CPU supports AVX2 / AVX512, and why GCC implements __builtin_memset on medium-sized values without any SIMD at all (again assuming the CPU supports SIMD).
For example:
__builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 64);
Will compile to:
vpcmpeqd %xmm0, %xmm0, %xmm0
vmovdqa %xmm0, (%rdi)
vmovdqa %xmm0, 16(%rdi)
vmovdqa %xmm0, 32(%rdi)
vmovdqa %xmm0, 48(%rdi)
I am trying to understand why this is chosen as opposed to something like
vpcmpeqd %ymm0, %ymm0, %ymm0
vmovdqa %ymm0, (%rdi)
vmovdqa %ymm0, 32(%rdi)
If you mix the __builtin_memset with AVX2 instructions they still use xmm, so it's definitely not to save the vzeroupper.
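For example, here is a minimal sketch of the kind of mixing I mean (fill_and_add is just an illustrative name; compile with -O3 -mavx2, and dst is assumed to be 64-byte aligned):

#include <immintrin.h>

/* Even with explicit AVX2 (ymm) code right next to it, the
   __builtin_memset expansion still uses xmm stores. */
void fill_and_add(float *dst, const float *src) {
    __builtin_memset(__builtin_assume_aligned(dst, 64), -1, 64);
    __m256 v = _mm256_loadu_ps(src);                 /* AVX2 code in the same function */
    _mm256_store_ps(dst + 16, _mm256_add_ps(v, v));  /* dst + 16 floats = dst + 64 bytes */
}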
Second, GCC implements __builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 512) as:
movq $-1, %rdx
xorl %eax, %eax
.L8:
movl %eax, %ecx
addl $32, %eax
movq %rdx, (%rdi,%rcx)
movq %rdx, 8(%rdi,%rcx)
movq %rdx, 16(%rdi,%rcx)
movq %rdx, 24(%rdi,%rcx)
cmpl $512, %eax
jb .L8
ret
Why would gcc choose this over a loop with xmm (or ymm / zmm) registers?
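For comparison, here is a hand-written sketch (my own, not something GCC emits) of what a ymm version of that 512-byte fill could look like with intrinsics; memset_m1_512 is a hypothetical name:

#include <immintrin.h>

/* Hand-written ymm alternative to GCC's scalar 8-byte loop above.
   Assumes ptr is 64-byte aligned, as in the __builtin_assume_aligned call. */
void memset_m1_512(void *ptr) {
    char *p = __builtin_assume_aligned(ptr, 64);
    __m256i ones = _mm256_set1_epi8(-1);   /* all bits set, like memset with -1 */
    for (int off = 0; off < 512; off += 32)
        _mm256_store_si256((__m256i *)(p + off), ones);
}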
Here is a godbolt link with the examples (and a few others)
Thank you.
Edit: clang uses ymm (but not zmm)
I have std::vector<double> X,Y, both of size N (with N%16==0), and I want to calculate sum(X[i]*Y[i]). That's a classic use case for fused multiply-add (FMA), which should be fast on AVX-capable processors. I know all my target CPUs are Intel, Haswell or newer.
How do I get GCC to emit that AVX code? -mfma is part of the solution, but do I need other switches?
And is std::vector<double>::operator[] hindering this? I know I can transform
size_t N = X.size();
double sum = 0.0;
for (size_t i = 0; i != N; ++i) sum += X[i] * Y[i];
to
size_t N = X.size();
double sum = 0.0;
double const* Xp = &X[0];
double const* Yp = &Y[0];
for (size_t i = 0; i != N; ++i) sum += Xp[i] * Yp[i];
so the compiler can spot that &X[0] doesn't change in the loop. But is this sufficient or even necessary?
Current compiler is GCC 4.9.2, Debian 8, but could upgrade to GCC 5 if necessary.
Did you look at the assembly? I put
double foo(std::vector<double> &X, std::vector<double> &Y) {
size_t N = X.size();
double sum = 0.0;
for (size_t i = 0; i <N; ++i) sum += X[i] * Y[i];
return sum;
}
into http://gcc.godbolt.org/ and looked at the assembly in GCC 4.9.2 with -O3 -mfma and I see
.L3:
vmovsd (%rcx,%rax,8), %xmm1
vfmadd231sd (%rsi,%rax,8), %xmm1, %xmm0
addq $1, %rax
cmpq %rdx, %rax
jne .L3
So it uses fma. However, it does not vectorize the loop (the s in sd means single, i.e. not packed, and the d means double-precision floating point).
To vectorize the loop you need to enable associative math e.g. with -Ofast. Using -Ofast -mavx2 -mfma gives
.L8:
vmovupd (%rax,%rsi), %xmm2
addq $1, %r10
vinsertf128 $0x1, 16(%rax,%rsi), %ymm2, %ymm2
vfmadd231pd (%r12,%rsi), %ymm2, %ymm1
addq $32, %rsi
cmpq %r10, %rdi
ja .L8
So now it's vectorized (pd means packed doubles). However, it's not unrolled. This is currently a limitation of GCC. You need to unroll several times because of the dependency chain on the accumulator. If you want the compiler to do this for you, consider using Clang, which unrolls four times; otherwise unroll by hand with intrinsics.
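In case it helps, here is a rough sketch of the "unroll by hand with intrinsics" option (my own code, not from the original; dot_unrolled is a hypothetical name, the arrays are assumed 32-byte aligned and n a multiple of 16, and it should be compiled with -O3 -mavx2 -mfma):

#include <immintrin.h>
#include <stddef.h>

/* Four independent accumulators hide the FMA latency; the horizontal
   sum is done once at the end. */
double dot_unrolled(const double *x, const double *y, size_t n) {
    __m256d acc0 = _mm256_setzero_pd(), acc1 = _mm256_setzero_pd();
    __m256d acc2 = _mm256_setzero_pd(), acc3 = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 16) {
        acc0 = _mm256_fmadd_pd(_mm256_load_pd(x + i),      _mm256_load_pd(y + i),      acc0);
        acc1 = _mm256_fmadd_pd(_mm256_load_pd(x + i + 4),  _mm256_load_pd(y + i + 4),  acc1);
        acc2 = _mm256_fmadd_pd(_mm256_load_pd(x + i + 8),  _mm256_load_pd(y + i + 8),  acc2);
        acc3 = _mm256_fmadd_pd(_mm256_load_pd(x + i + 12), _mm256_load_pd(y + i + 12), acc3);
    }
    __m256d acc = _mm256_add_pd(_mm256_add_pd(acc0, acc1), _mm256_add_pd(acc2, acc3));
    double tmp[4];
    _mm256_storeu_pd(tmp, acc);   /* horizontal sum of the four lanes */
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}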
Note that, unlike GCC, Clang does not use fma by default with -mfma. To use fma with Clang, use -ffp-contract=fast (e.g. -O3 -mfma -ffp-contract=fast) or #pragma STDC FP_CONTRACT ON, or enable associative math with e.g. -Ofast. You're going to want to enable associative math anyway if you want to vectorize the loop with Clang.
See Fused multiply add and default rounding modes and https://stackoverflow.com/a/34461738/2542702 for more info about enabling fma with different compilers.
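If you prefer a source-level switch over compiler flags, the pragma variant could look roughly like this (a sketch; dot is a hypothetical name, and it needs a compiler that honours STDC FP_CONTRACT, such as recent Clang):

#include <stddef.h>

#pragma STDC FP_CONTRACT ON   /* allow mul+add to be contracted into fma */

double dot(const double *x, const double *y, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];   /* the mul and add may now be fused */
    return sum;
}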
GCC creates a lot of extra code to handle misalignment and the case where N is not a multiple of 8. You can tell the compiler to assume the arrays are aligned using __builtin_assume_aligned, and that N is a multiple of 8 by looping to N & -8.
The following code with -Ofast -mavx2 -mfma
double foo2(double * __restrict X, double * __restrict Y, int N) {
X = (double*)__builtin_assume_aligned(X,32);
Y = (double*)__builtin_assume_aligned(Y,32);
double sum = 0.0;
for (int i = 0; i < (N &-8); ++i) sum += X[i] * Y[i];
return sum;
}
produces the following simple assembly
andl $-8, %edx
jle .L4
subl $4, %edx
vxorpd %xmm0, %xmm0, %xmm0
shrl $2, %edx
xorl %ecx, %ecx
leal 1(%rdx), %eax
xorl %edx, %edx
.L3:
vmovapd (%rsi,%rdx), %ymm2
addl $1, %ecx
vfmadd231pd (%rdi,%rdx), %ymm2, %ymm0
addq $32, %rdx
cmpl %eax, %ecx
jb .L3
vhaddpd %ymm0, %ymm0, %ymm0
vperm2f128 $1, %ymm0, %ymm0, %ymm1
vaddpd %ymm1, %ymm0, %ymm0
vzeroupper
ret
.L4:
vxorpd %xmm0, %xmm0, %xmm0
ret
I'm not sure this will get you all the way there, but I'm almost sure it's a big part of the solution.
You have to break the loop into two: an outer loop from 0 to N with step M > 1, and an inner loop from 0 to M. I'd try M of 16, 8 and 4 and look at the asm. Don't worry about the iterator math; GCC is smart enough with it.
GCC should unroll the inner loop, and then it can SIMD it and maybe use FMA.
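A rough sketch of what I mean (dot_blocked is a hypothetical name; M is fixed to 8 here just as an example, and n is assumed to be a multiple of M):

#include <stddef.h>

/* Outer loop strides by M, inner loop runs 0..M; the inner loop is the
   one GCC should be able to unroll and vectorize. */
double dot_blocked(const double *x, const double *y, size_t n) {
    enum { M = 8 };
    double sum = 0.0;
    for (size_t i = 0; i < n; i += M) {
        double partial = 0.0;
        for (size_t j = 0; j < M; ++j)
            partial += x[i + j] * y[i + j];
        sum += partial;
    }
    return sum;
}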
I'm a total noob at assembly code and at reading it as well.
I have this simple C code:
void saxpy()
{
for(int i = 0; i < ARRAY_SIZE; i++) {
float product = a*x[i];
z[i] = product + y[i];
}
}
Compiling it with
gcc -std=c99 -O3 -fno-tree-vectorize -S code.c -o code-O3.s
gives me the following assembly code:
saxpy:
.LFB0:
.cfi_startproc
movss a(%rip), %xmm1
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
movss x(%rax), %xmm0
addq $4, %rax
mulss %xmm1, %xmm0
addss y-4(%rax), %xmm0
movss %xmm0, z-4(%rax)
cmpq $262144, %rax
jne .L3
rep ret
.cfi_endproc
I do understand that loop unrolling has taken place, but I'm not able to understand the intention and idea behind
addq $4, %rax
mulss %xmm1, %xmm0
addss y-4(%rax), %xmm0
movss %xmm0, z-4(%rax)
Can someone explain the usage of 4, and what the statement
y-4(%rax)
means?
x, y, and z are global arrays. You left out the end of the listing where the symbols are declared.
I put your code on godbolt for you, with the necessary globals defined (and fixed the indenting). Look at the bottom.
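For reference, the globals have to look roughly like this (sizes inferred from the cmpq $262144 loop bound: 262144 bytes / 4 bytes per float = 65536 elements):

#define ARRAY_SIZE 65536          /* 262144 bytes of float data per array */
float a;
float x[ARRAY_SIZE], y[ARRAY_SIZE], z[ARRAY_SIZE];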
BTW, there's no unrolling going on here. There's just one scalar single-precision mul and one add in the loop. Try with -funroll-loops to see it unroll.
With -march=haswell, gcc will use an FMA instruction. If you un-cripple the compiler by leaving out -fno-tree-vectorize, and #define ARRAY_SIZE is small, like 100, it fully unrolls the loop with mostly 32-byte FMA ymm instructions, ending with some 16-byte FMA xmm.
Also, what is the need to add the immediate value 4 to the rax register, as done by the statement "addq $4, %rax"?
The loop increments %rax, used as a byte offset into the arrays, by 4 bytes (the size of one float) instead of using a scaled-index addressing mode. Because the increment happens right after the load from x, the later accesses compensate with a -4 displacement: y-4(%rax) and z-4(%rax) address the current iteration's elements of y and z.
Look at the links on https://stackoverflow.com/questions/tagged/x86. Also, single-stepping through code with a debugger is often a good way to make sure you understand what it's doing.
I'm a total noob at assembly, just poking around a bit to see what's going on. Anyway, I wrote a very simple function:
void multA(double *x,long size)
{
long i;
for(i=0; i<size; ++i){
x[i] = 2.4*x[i];
}
}
I compiled it with:
gcc -S -m64 -O2 fun.c
And I get this:
.file "fun.c"
.text
.p2align 4,,15
.globl multA
.type multA, #function
multA:
.LFB34:
.cfi_startproc
testq %rsi, %rsi
jle .L1
movsd .LC0(%rip), %xmm1
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
movsd (%rdi,%rax,8), %xmm0
mulsd %xmm1, %xmm0
movsd %xmm0, (%rdi,%rax,8)
addq $1, %rax
cmpq %rsi, %rax
jne .L3
.L1:
rep
ret
.cfi_endproc
.LFE34:
.size multA, .-multA
.section .rodata.cst8,"aM",#progbits,8
.align 8
.LC0:
.long 858993459
.long 1073951539
.ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
.section .note.GNU-stack,"",#progbits
The assembly output makes sense to me (mostly) except for the line xorl %eax, %eax. From googling, I gather that the purpose of this is simply to set %eax to zero, which in this case corresponds to my iterator long i;.
However, unless I am mistaken, %eax is a 32-bit register. So it seems to me that this should actually be xorq %rax, %rax, particularly since it is holding a 64-bit long int. Moreover, further down in the code, it actually uses the 64-bit register %rax to do the iterating, and %rax never gets initialized outside of xorl %eax, %eax, which would seem to only zero out the lower 32 bits of the register.
Am I missing something?
Also, out of curiosity, why are there two .long constants there at the bottom? The first one, 858993459 is equal to the double floating-point representation of 2.4 but I can't figure out what the second number is or why it is there.
I gather that the purpose of this is simply to set %eax to zero
Yes.
which in this case corresponds to my iterator long i;.
No. Your i is uninitialized in the declaration. Strictly speaking, that operation corresponds to the i = 0 expression in the for loop.
However, unless I am mistaken, %eax is a 32-bit register. So it seems to me that this should actually be xorq %rax, %rax, particularly since this is holding a 64-bit long int.
But clearing the lower double word of the register clears the entire register: on x86-64, any write to a 32-bit register zero-extends into the full 64-bit register. This is not intuitive, but it's implicit.
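If you want to convince yourself of that implicit zero-extension, a tiny sketch like this will do (zext is just an illustrative name):

/* With gcc on x86-64 at -O2 this compiles to just "movl %edi, %eax; ret":
   the 32-bit move alone is enough to zero the upper half of %rax. */
unsigned long zext(unsigned long x) {
    return (unsigned int)x;
}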
Just to answer the second part: .long means 32 bit, and the two integral constants side-by-side form the IEEE-754 representation of the double 2.4:
Dec: 1073951539 858993459
Hex: 0x40033333 0x33333333
400 3333333333333
S+E Mantissa
The exponent is offset by 1023, so the actual exponent is 0x400 − 1023 = 1. The leading "one" in the mantissa is implied, so it's 2^1 × 0b1.001100110011... (You recognize this periodic expansion as 3/15, i.e. 0.2. Sure enough, 2 × 1.2 = 2.4.)
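A quick sanity check of that decoding (assumes the usual IEEE-754 binary64 double, as in the listing above):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    double d = 2.4;
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);                 /* reinterpret the double's bytes */
    printf("%016llx\n", (unsigned long long)bits);  /* prints 4003333333333333 */
    return 0;
}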
I have three functions a(), b() and c() that are supposed to do the same thing:
typedef float Builtin __attribute__ ((vector_size (16)));
typedef struct {
float values[4];
} Struct;
typedef union {
Builtin b;
Struct s;
} Union;
extern void printv(Builtin);
extern void printv(Union);
extern void printv(Struct);
int a() {
Builtin m = { 1.0, 2.0, 3.0, 4.0 };
printv(m);
}
int b() {
Union m = { 1.0, 2.0, 3.0, 4.0 };
printv(m);
}
int c() {
Struct m = { 1.0, 2.0, 3.0, 4.0 };
printv(m);
}
When I compile this code I observe the following behaviour:
When calling printv() in a(), all 4 floats are passed in %xmm0. No writes to memory occur.
When calling printv() in b(), 2 floats are passed in %xmm0 and the other 2 in %xmm1. To accomplish this, 4 floats are loaded (.LC0) into %xmm2 and stored from there to memory. After that, 2 floats are read back from that memory into %xmm0 and the other 2 floats are loaded (.LC1) into %xmm1.
I'm a bit lost on what c() actually does.
Why are a(), b() and c() different?
Here is the assembly output for a():
vmovaps .LC0(%rip), %xmm0
call _Z6printvU8__vectorf
The assembly output for b():
vmovaps .LC0(%rip), %xmm2
vmovaps %xmm2, (%rsp)
vmovq .LC1(%rip), %xmm1
vmovq (%rsp), %xmm0
call _Z6printv5Union
And the assembly output for c():
andq $-32, %rsp
subq $32, %rsp
vmovaps .LC0(%rip), %xmm0
vmovaps %xmm0, (%rsp)
vmovq .LC2(%rip), %xmm0
vmovq 8(%rsp), %xmm1
call _Z6printv6Struct
The data:
.section .rodata.cst16,"aM",#progbits,16
.align 16
.LC0:
.long 1065353216
.long 1073741824
.long 1077936128
.long 1082130432
.section .rodata.cst8,"aM",#progbits,8
.align 8
.LC1:
.quad 4647714816524288000
.align 8
.LC2:
.quad 4611686019492741120
The quad 4647714816524288000 seems to be nothing more than the floats 3.0 and 4.0 in adjacent long words.
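That's easy to verify (little-endian x86 assumed, so the first float ends up in the low dword):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    float f[2] = { 3.0f, 4.0f };   /* 3.0 in the low dword, 4.0 in the high dword */
    uint64_t q;
    memcpy(&q, f, sizeof q);
    printf("%llu\n", (unsigned long long)q);   /* prints 4647714816524288000 */
    return 0;
}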
Nice question; I had to dig a little because I've never used SSE (in this case SSE2) myself. Essentially, vector instructions operate on multiple values stored in one register, i.e. the XMM registers. In C the data type float uses IEEE 754 and is thus 32 bits wide. Four floats therefore yield a 128-bit vector, which is exactly the width of an XMM register. The registers look like this:
SSE / AVX-128: |----------------|                  name: XMM0; size: 128 bits
AVX-256:       |----------------|----------------| name: YMM0; size: 256 bits
In your first case, a(), you use the SIMD vector type defined with
typedef float Builtin __attribute__ ((vector_size (16)));
which allows the entire vector to be moved into the XMM0 register in one go. In your second case, b(), you use a union. But because you do not load the data into the union's vector member explicitly (e.g. with a designated initializer for m.b), the data is not treated as a pure vector. This leads to the following behavior:
The data from .LC0 is loaded into XMM2 with:
vmovaps .LC0(%rip), %xmm2
but because your data can be interpreted either as a structure or as a vector, it has to be split up into two 64-bit chunks. The chunks still have to travel in XMM registers, because the data can be treated as a vector, but each chunk may be at most 64 bits long so that it could also be handled like a structure field in a 64-bit general-purpose register (which is only 64 bits wide and would overflow, losing data, if 128 bits were transferred into it). This is done in the following steps.
The vector in XMM2 is stored to memory with
vmovaps %xmm2, (%rsp)
now the upper 64 bits of the vector (bits 64-127), i.e. the floats 3.0 and 4.0, are loaded (vmovq moves a quadword, i.e. 64 bits) into XMM1 with
vmovq .LC1(%rip), %xmm1
and finally the lower 64 bits of the vector (bits 0-63), i.e. the floats 1.0 and 2.0, are read back from memory into XMM0 with
vmovq (%rsp), %xmm0
Now you have the upper and lower halves of the 128-bit vector in separate XMM registers.
In case c() I'm not quite sure either, but here it goes. First %rsp is aligned down to a 32-byte boundary and then 32 bytes are subtracted to make room for the data on the stack (this keeps the 32-byte alignment); this is done with
andq $-32, %rsp
subq $32, %rsp
Now this time the vector is loaded into XMM0 and then placed on the stack with
vmovaps .LC0(%rip), %xmm0
vmovaps %xmm0, (%rsp)
and finally the lower 64 bits of the vector (the floats 1.0 and 2.0, loaded from .LC2) are placed in XMM0 and the upper 64 bits (the floats 3.0 and 4.0, read back from 8(%rsp)) are placed in the XMM1 register with
vmovq .LC2(%rip), %xmm0
vmovq 8(%rsp), %xmm1
In all three cases the vector is passed differently. Hope this helps.