Assembly code x86 - gcc

So I'm a total noob at assembly code and at reading it as well. I have this simple C function:
void saxpy()
{
    for (int i = 0; i < ARRAY_SIZE; i++) {
        float product = a * x[i];
        z[i] = product + y[i];
    }
}
Compiling it with
gcc -std=c99 -O3 -fno-tree-vectorize -S code.c -o code-O3.s
gives me the following assembly code:
saxpy:
.LFB0:
    .cfi_startproc
    movss   a(%rip), %xmm1
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
.L3:
    movss   x(%rax), %xmm0
    addq    $4, %rax
    mulss   %xmm1, %xmm0
    addss   y-4(%rax), %xmm0
    movss   %xmm0, z-4(%rax)
    cmpq    $262144, %rax
    jne     .L3
    rep ret
    .cfi_endproc
I do understand that loop unrolling has taken place, but I'm not able to understand the intention and idea behind
    addq    $4, %rax
    mulss   %xmm1, %xmm0
    addss   y-4(%rax), %xmm0
    movss   %xmm0, z-4(%rax)
Can someone explain the usage of 4, and what an operand like
    y-4(%rax)
means?

x, y, and z are global arrays. You left out the end of the listing where the symbols are declared.
I put your code on godbolt for you, with the necessary globals defined (and fixed the indenting). Look at the bottom.
BTW, there's no unrolling going on here. There's one scalar single-precision mul and one add per loop iteration. Try with -funroll-loops to see it unroll.
With -march=haswell, gcc will use an FMA instruction. If you un-cripple the compiler by leaving out -fno-tree-vectorize, and #define ARRAY_SIZE is small, like 100, it fully unrolls the loop with mostly 32-byte FMA ymm instructions, ending with some 16-byte FMA xmm.
Also, what is the need to add the immediate value 4 to the rax register, as done by the statement addq $4, %rax?
The loop increments a pointer by 4 bytes, instead of using a scaled-index addressing mode.
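To make that concrete, here's a rough C-level sketch of what the optimized loop is doing (my own illustration, not your actual code: the globals are filled in to match the 262144-byte trip count, i.e. ARRAY_SIZE = 65536):

#define ARRAY_SIZE 65536                      /* 65536 floats * 4 bytes = 262144 bytes */
float a, x[ARRAY_SIZE], y[ARRAY_SIZE], z[ARRAY_SIZE];

void saxpy_as_compiled(void)
{
    /* %rax holds a byte offset into the arrays, stepped by 4 = sizeof(float). */
    for (long off = 0; off != 262144; off += 4) {
        float product = a * *(float *)((char *)x + off);   /* movss x(%rax); mulss */
        *(float *)((char *)z + off) =                       /* movss %xmm0, z-4(%rax) */
            product + *(float *)((char *)y + off);          /* addss y-4(%rax), %xmm0 */
    }
}

The -4 in y-4(%rax) and z-4(%rax) is only there because the addq $4, %rax happens before those accesses; subtracting 4 from the displacement compensates, so all three accesses in one iteration still refer to the same element.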
Look at the links on https://stackoverflow.com/questions/tagged/x86. Also, single-stepping through code with a debugger is often a good way to make sure you understand what it's doing.

Related

Trying to understand clang/gcc __builtin_memset on constant size / aligned pointers

Basically I am trying to understand why both gcc and clang use xmm registers for their __builtin_memset even when the memory destination and size are both divisible by sizeof(ymm) (or zmm for that matter) and the CPU supports AVX2 / AVX512, and why GCC implements __builtin_memset on medium-sized values without any SIMD at all (again assuming the CPU supports SIMD).
For example:
__builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 64);
will compile to:
vpcmpeqd %xmm0, %xmm0, %xmm0
vmovdqa %xmm0, (%rdi)
vmovdqa %xmm0, 16(%rdi)
vmovdqa %xmm0, 32(%rdi)
vmovdqa %xmm0, 48(%rdi)
I am trying to understand why this is chosen as opposed to something like
vpcmpeqd %ymm0, %ymm0, %ymm0
vmovdqa %ymm0, (%rdi)
vmovdqa %ymm0, 32(%rdi)
If you mix the __builtin_memset with AVX2 instructions they still use xmm, so it's definitely not to save the vzeroupper.
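For reference, the ymm pattern I'd expect is roughly what this hand-written intrinsics sketch produces (my own hypothetical helper, assuming AVX2 and an aligned pointer; it's not what the builtin actually emits):

#include <immintrin.h>

/* Hypothetical hand-written equivalent of the ymm sequence above
   (assumes AVX2 and that ptr is at least 32-byte aligned). */
static void fill_minus_one_64(void *ptr)
{
    __m256i ones = _mm256_set1_epi8(-1);                      /* all bits set, like vpcmpeqd */
    _mm256_store_si256((__m256i *)ptr, ones);                  /* vmovdqa %ymm0, (%rdi) */
    _mm256_store_si256((__m256i *)((char *)ptr + 32), ones);   /* vmovdqa %ymm0, 32(%rdi) */
}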
Second, for GCC's __builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 512), gcc implements it as:
    movq    $-1, %rdx
    xorl    %eax, %eax
.L8:
    movl    %eax, %ecx
    addl    $32, %eax
    movq    %rdx, (%rdi,%rcx)
    movq    %rdx, 8(%rdi,%rcx)
    movq    %rdx, 16(%rdi,%rcx)
    movq    %rdx, 24(%rdi,%rcx)
    cmpl    $512, %eax
    jb      .L8
    ret
Why would gcc choose this over a loop with xmm (or ymm / zmm) registers?
Here is a godbolt link with the examples (and a few others)
Thank you.
Edit: clang uses ymm (but not zmm)

Is movzbl followed by testl faster than testb?

Consider this C code:
int f(void) {
    int ret;
    char carry;
    __asm__(
        "nop # do something that sets eax and CF"
        : "=a"(ret), "=@ccc"(carry)
    );
    return carry ? -ret : ret;
}
When I compile it with gcc -O3, I get this:
f:
    nop # do something that sets eax and CF
    setc    %cl
    movl    %eax, %edx
    negl    %edx
    testb   %cl, %cl
    cmovne  %edx, %eax
    ret
If I change char carry to int carry, I instead get this:
f:
    nop # do something that sets eax and CF
    setc    %cl
    movl    %eax, %edx
    movzbl  %cl, %ecx
    negl    %edx
    testl   %ecx, %ecx
    cmovne  %edx, %eax
    ret
That change replaced testb %cl, %cl with movzbl %cl, %ecx and testl %ecx, %ecx. The program is actually equivalent, though, and GCC knows it. As evidence of this, if I compile with -Os instead of -O3, then both char carry and int carry result in the exact same assembly:
f:
    nop # do something that sets eax and CF
    jnc     .L1
    negl    %eax
.L1:
    ret
It seems like one of two things must be true, but I'm not sure which:
A testb is faster than a movzbl followed by a testl, so GCC's use of the latter with int is a missed optimization.
A testb is slower than a movzbl followed by a testl, so GCC's use of the former with char is a missed optimization.
My gut tells me that an extra instruction will be slower, but I also have a nagging doubt that it's preventing a partial register stall that I just don't see.
By the way, the usual recommended approach of xoring the register to zero before the setc doesn't work in my real example. You can't do it after the inline assembly runs, since xor will overwrite the carry flag, and you can't do it before the inline assembly runs, since in the real context of this code, every general-purpose call-clobbered register is already in use somehow.
There's no downside I'm aware of to reading a byte register with test vs. movzb.
If you are going to zero-extend, it's also a missed optimization not to xor-zero a reg ahead of the asm statement and setc into that, so the cost of zero-extension is off the critical path (on CPUs other than Intel IvyBridge and later, where movzx r32, r8 isn't zero latency). Assuming there's a free register, of course. Recent GCC does sometimes find this zero/set-flags/setcc optimization for generating a 32-bit boolean from a flag-setting instruction, but often misses it when things get complex.
Fortunately for you, your real use-case couldn't do that optimization anyway (except with mov $0, %eax zeroing, which would be off the critical path for latency but cause a partial-register stall on Intel P6 family, and cost more code size.) But it's still a missed optimization for your test case.

xorl %eax, %eax in x86_64 assembly code produced by gcc

I'm a total noob at assembly, just poking around a bit to see what's going on. Anyway, I wrote a very simple function:
void multA(double *x, long size)
{
    long i;
    for (i = 0; i < size; ++i) {
        x[i] = 2.4 * x[i];
    }
}
I compiled it with:
gcc -S -m64 -O2 fun.c
And I get this:
    .file   "fun.c"
    .text
    .p2align 4,,15
    .globl  multA
    .type   multA, @function
multA:
.LFB34:
    .cfi_startproc
    testq   %rsi, %rsi
    jle     .L1
    movsd   .LC0(%rip), %xmm1
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
.L3:
    movsd   (%rdi,%rax,8), %xmm0
    mulsd   %xmm1, %xmm0
    movsd   %xmm0, (%rdi,%rax,8)
    addq    $1, %rax
    cmpq    %rsi, %rax
    jne     .L3
.L1:
    rep
    ret
    .cfi_endproc
.LFE34:
    .size   multA, .-multA
    .section        .rodata.cst8,"aM",@progbits,8
    .align 8
.LC0:
    .long   858993459
    .long   1073951539
    .ident  "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
    .section        .note.GNU-stack,"",@progbits
The assembly output makes sense to me (mostly) except for the line xorl %eax, %eax. From googling, I gather that the purpose of this is simply to set %eax to zero, which in this case corresponds to my iterator long i;.
However, unless I am mistaken, %eax is a 32-bit register. So it seems to me that this should actually be xorq %rax, %rax, particularly since it is holding a 64-bit long int. Moreover, further down in the code, it actually uses the 64-bit register %rax to do the iterating, and %rax never gets initialized anywhere except by xorl %eax, %eax, which would seem to zero out only the lower 32 bits of the register.
Am I missing something?
Also, out of curiosity, why are there two .long constants there at the bottom? The first one, 858993459, is equal to the double floating-point representation of 2.4, but I can't figure out what the second number is or why it is there.
I gather that the purpose of this is simply to set %eax to zero
Yes.
which in this case corresponds to my iterator long i;.
No. Your i is uninitialized in the declaration. Strictly speaking, that operation corresponds to the i = 0 expression in the for loop.
However, unless I am mistaken, %eax is a 32-bit register. So it seems to me that this should actually be xorq %rax, %rax, particularly since this is holding a 64-bit long int.
But clearing the lower double word of the register clears the entire register: on x86-64, writing a 32-bit register implicitly zero-extends the result into the corresponding 64-bit register. This is not intuitive, but it's how the architecture is defined.
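You can verify that implicit zero-extension with a quick standalone test (my own snippet, not part of the compiler output above):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t r = 0xffffffffffffffffULL;
    /* The "k" operand modifier selects the 32-bit name of whatever register
       r is allocated to, so this is a 32-bit xor, just like xorl %eax, %eax. */
    __asm__("xorl %k0, %k0" : "+r"(r));
    printf("%llu\n", (unsigned long long)r);   /* prints 0: the upper 32 bits were cleared too */
    return 0;
}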
Just to answer the second part: .long means 32 bit, and the two integral constants side-by-side form the IEEE-754 representation of the double 2.4:
Dec:  1073951539  858993459
Hex:  0x40033333  0x33333333

      400   3333333333333
      S+E   Mantissa
The exponent is offset by 1023, so the actual exponent is 0x400 − 1023 = 1. The leading "one" in the mantissa is implied, so it's 2^1 × 0b1.001100110011... (You recognize this periodic expansion as 3/15, i.e. 0.2. Sure enough, 2 × 1.2 = 2.4.)
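If you want to check the two constants yourself, a little test program like this (my own, not part of the original listing) reproduces them:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    double d = 2.4;
    uint32_t half[2];
    memcpy(half, &d, sizeof d);   /* reinterpret the 8 bytes of the double as two 32-bit words */
    /* On little-endian x86 the low word comes first, matching the order of
       the two .long directives: 858993459 then 1073951539. */
    printf("%u %u\n", (unsigned)half[0], (unsigned)half[1]);
    printf("0x%08x 0x%08x\n", (unsigned)half[1], (unsigned)half[0]);   /* 0x40033333 0x33333333 */
    return 0;
}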

SIMD vector interoperability between LLVM and gcc

I would like to accelerate a program I'm working on by dynamically generating code with LLVM's JIT. The algorithm can operate on vectors, and I'd rather like to use the SIMD vector extensions in LLVM to do this (not only does it make some operations faster, it actually makes the code generation simpler).
Do I stand any chance of having this work in a reasonably portable way?
On the C side of things, I'll be compiling with gcc, clang or, maybe, icc. My vectors are going to be simple float x 4 or double x 4 things. The de facto standard for non-platform-specific vector operations in this world appears to be the gcc vector extension:
typedef double Vector4 __attribute__ ((vector_size (sizeof(double)*4)));
Inspection of generated code shows that clang will pass a double x 4 vector in registers, while gcc wants it on the stack --- which is bad. (They both pass float x 4 vectors in registers.)
My understanding is that the two systems are supposed to be ABI compatible, but obviously vectors don't count. Can I actually do this?
My example program is:
typedef double real;
typedef real Vector4 __attribute__ ((vector_size (sizeof(real)*4)));

Vector4 scale(Vector4 a)
{
    Vector4 s = {2, 2, 2, 2};
    return a*s;
}
This compiles with LLVM into:
scale:
    movapd  .LCPI0_0(%rip), %xmm2
    mulpd   %xmm2, %xmm0
    mulpd   %xmm2, %xmm1
    ret
...but gcc produces this horror:
scale:
    subq    $64, %rsp
    movq    %rdi, %rax
    movsd   .LC0(%rip), %xmm0
    movapd  72(%rsp), %xmm1
    movsd   %xmm0, -56(%rsp)
    movsd   %xmm0, -48(%rsp)
    movsd   %xmm0, -72(%rsp)
    movsd   %xmm0, -64(%rsp)
    mulpd   -56(%rsp), %xmm1
    movapd  88(%rsp), %xmm0
    mulpd   -72(%rsp), %xmm0
    movapd  %xmm1, -104(%rsp)
    movq    -104(%rsp), %rdx
    movapd  %xmm1, -24(%rsp)
    movapd  %xmm0, -8(%rsp)
    movq    %rdx, (%rdi)
    movq    -16(%rsp), %rdx
    movq    %rdx, 8(%rdi)
    movq    -8(%rsp), %rdx
    movq    %rdx, 16(%rdi)
    movq    (%rsp), %rdx
    movq    %rdx, 24(%rdi)
    addq    $64, %rsp
    ret
If I redefine real to be a float, I get this from both compilers (they produce identical code):
scale:
    mulps   .LCPI0_0(%rip), %xmm0
    ret
These were all compiled with $CC -O3 -S -msse test.c.
Update: It suddenly occurs to me that the simple solution is to just use LLVM to create a trampoline that translates from structures to vectors and vice versa. That way the interoperability problem is reduced to pass-by-value structures, which are nailed down by the ABI; the vectors only exist in LLVM-land. It means I only get to use the SIMD stuff inside LLVM, but I can live with that.
However, I would still like to know the answer to the above; vectors are awesome and I'd like to be able to use them more.
Update update: Turns out that the way C passes structures by value is insanely... er, insane! A struct { double x, y, z; } is passed by pointer; a struct { float x, y, z; } is passed as a pair of %xmm registers: x and y are packed into the first, and z is in the second...
Simple and unpainful it's not!

Benchmarking SSE instructions

I'm benchmarking some SSE code (multiplying 4 floats by 4 floats) against traditional C code doing the same thing. I think my benchmark code must be incorrect in some way because it seems to say that the non-SSE code is faster than the SSE by a factor of 2-3.
Can someone tell me what is wrong with the benchmarking code below? And perhaps suggest another approach that accurately shows the speeds for both the SSE and non-SSE code.
#include <time.h>
#include <string.h>
#include <stdio.h>

#define ITERATIONS 100000

#define MULT_FLOAT4(X, Y) ({ \
        asm volatile ( \
            "movaps (%0), %%xmm0\n\t" \
            "mulps (%1), %%xmm0\n\t" \
            "movaps %%xmm0, (%1)" \
            /* tell the compiler the asm writes %xmm0 and memory */ \
            :: "r" (X), "r" (Y) : "memory", "xmm0"); })

int main(void)
{
    int i, j;
    float a[4] __attribute__((aligned(16))) = { 10, 20, 30, 40 };
    time_t timer, sse_time, std_time;

    timer = time(NULL);
    for (j = 0; j < 5000; ++j)
        for (i = 0; i < ITERATIONS; ++i) {
            float b[4] __attribute__((aligned(16))) = { 0.1, 0.1, 0.1, 0.1 };
            MULT_FLOAT4(a, b);
        }
    sse_time = time(NULL) - timer;

    timer = time(NULL);
    for (j = 0; j < 5000; ++j)
        for (i = 0; i < ITERATIONS; ++i) {
            float b[4] __attribute__((aligned(16))) = { 0.1, 0.1, 0.1, 0.1 };
            b[0] *= a[0];
            b[1] *= a[1];
            b[2] *= a[2];
            b[3] *= a[3];
        }
    std_time = time(NULL) - timer;

    printf("sse_time %ld\nstd_time %ld\n", (long)sse_time, (long)std_time);
    return 0;
}
When you enable optimizations the non-SSE code is eliminated completely, whereas the SSE code remains, so this case is trivial. The more interesting part is when optimizations are turned off: in that case the SSE code is still slower, even though the code for the loops themselves is the same.
Non-SSE code of the innermost loop's body:
movl $0x3dcccccd, %eax
movl %eax, -80(%rbp)
movl $0x3dcccccd, %eax
movl %eax, -76(%rbp)
movl $0x3dcccccd, %eax
movl %eax, -72(%rbp)
movl $0x3dcccccd, %eax
movl %eax, -68(%rbp)
movss -80(%rbp), %xmm1
movss -48(%rbp), %xmm0
mulss %xmm1, %xmm0
movss %xmm0, -80(%rbp)
movss -76(%rbp), %xmm1
movss -44(%rbp), %xmm0
mulss %xmm1, %xmm0
movss %xmm0, -76(%rbp)
movss -72(%rbp), %xmm1
movss -40(%rbp), %xmm0
mulss %xmm1, %xmm0
movss %xmm0, -72(%rbp)
movss -68(%rbp), %xmm1
movss -36(%rbp), %xmm0
mulss %xmm1, %xmm0
movss %xmm0, -68(%rbp)
SSE code of the innermost loop's body:
movl $0x3dcccccd, %eax
movl %eax, -64(%rbp)
movl $0x3dcccccd, %eax
movl %eax, -60(%rbp)
movl $0x3dcccccd, %eax
movl %eax, -56(%rbp)
movl $0x3dcccccd, %eax
movl %eax, -52(%rbp)
leaq -48(%rbp), %rax
leaq -64(%rbp), %rdx
movaps (%rax), %xmm0
mulps (%rdx), %xmm0
movaps %xmm0, (%rdx)
I'm not sure about this, but here's my guess:
As you can see, the compiler stores the 4 float values with four 32-bit stores, which are then read back by a single 16-byte load. This causes a store-forwarding stall, which is costly when it happens. You can look this up in the Intel manuals. It doesn't occur in the scalar version, and that makes the performance difference.
To make it faster you need to make sure that this stall doesn't occur. If you are using a constant array of 4 floats, make it const and store the results in another aligned array; this way the compiler hopefully won't emit those unnecessary 4-byte movs before the load. Or, if you need to fill up the resulting array, do it with a 16-byte store. If you can't avoid those 4-byte movs, you need to do something else after the store but before the load (for example, calculating something else).
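As a sketch of that suggestion (my own intrinsics version, replacing the inline-asm macro; the helper name is made up), keeping everything in 16-byte chunks means no narrow store has to be forwarded to a wide load:

#include <xmmintrin.h>

/* Sketch: multiply two 16-byte-aligned float[4] arrays using only 16-byte
   loads and stores, writing into a separate aligned result array. */
static void mult_float4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_load_ps(a);               /* one 16-byte load */
    __m128 vb = _mm_load_ps(b);               /* one 16-byte load */
    _mm_store_ps(out, _mm_mul_ps(va, vb));    /* one 16-byte store */
}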
