I want to run the following code (in Intel syntax) in gcc (AT&T syntax).
; float a[128], b[128], c[128];
; for (int i = 0; i < 128; i++) a[i] = b[i] + c[i];
; Assume that a, b and c are aligned by 32
xor ecx, ecx ; Loop counter i = 0
L: vmovaps ymm0, [b+rcx] ; Load 8 elements from b
vaddps ymm0,ymm0,[c+rcx] ; Add 8 elements from c
vmovaps [a+rcx], ymm0 ; Store result in a
add ecx,32 ; 8 elements * 4 bytes = 32
cmp ecx, 512 ; 128 elements * 4 bytes = 512
jb L ;Loop
The code is from Agner Fog's "Optimizing subroutines in assembly language".
The code I've written so far is:
static inline void addArray(float *a, float *b, float *c) {
__asm__ __volatile__ (
"nop \n"
"xor %%ecx, %%ecx \n" //;Loop counter set to 0
"loop: \n\t"
"vmovaps %1, %%ymm0 \n" //;Load 8 elements from b <== WRONG
"vaddps %2, %%ymm0, %%ymm0 \n" //;Add 8 elements from c <==WRONG
"vmovaps %%ymm0, %0 \n" //;Store result in a
"add 0x20, %%ecx \n" //;8 elemtns * 4 bytes = 32 (0x20)
"cmp 0x200,%%ecx \n" //;128 elements * 4 bytes = 512 (0x200)
"jb loop \n" //;Loop"
"nop \n"
: "=m"(a) //Outputs
: "m"(b), "m"(c) //Inputs
: "%ecx","%ymm0" //Modifies ECX and YMM0
);
}
The lines marked as "wrong" generate the following (excerpt from gdb disassemble):
0x0000000000000b78 <+19>: vmovaps -0x10(%rbp),%ymm0
0x0000000000000b7d <+24>: vaddps -0x18(%rbp),%ymm0,%ymm0
I want to get something like this (I guess):
vmovaps -0x10(%rbp,%ecx,%0x8),%ymm0
But I do not know how to specify %ecx as my index register.
Can you help me, please?
EDIT
I've tried (%1, %%ecx):
__asm__ __volatile__ (
"nop \n"
"xor %%ecx, %%ecx \n" //;Loop counter set to 0
"loop: \n\t"
"vmovaps (%1, %%rcx), %%ymm0 \n" //;Load 8 elements from b <== MODIFIED HERE
"vaddps %2, %%ymm0, %%ymm0 \n" //;Add 8 elements from c
"vmovaps %%ymm0, %0 \n" //;Store result in a
"add 0x20, %%ecx \n" //;8 elemtns * 4 bytes = 32 (0x20)
"cmp 0x200,%%ecx \n" //;128 elements * 4 bytes = 512 (0x200)
"jb loop \n" //;Loop"
"nop \n"
: "=m"(a) //Outputs
: "m"(b), "m"(c) //Inputs
: "%ecx","%ymm0" //Modifies ECX and YMM0
);
And I got:
inline1.cpp: Assembler messages:
inline1.cpp:90: Error: found '(', expected: ')'
inline1.cpp:90: Error: junk `(%rbp),%rcx)' after expression
I don't think it is possible to translate this literally into GAS inline assembly. In AT&T syntax, a memory operand has the form:
displacement(base register, index register, scale factor)
which would produce something akin to:
movl -4(%ebp, %ecx, 4), %eax
or in your case:
vmovaps -16(%rsp,%rcx,1), %ymm0
The problem is, when you use a memory constraint (m), the inline assembler is going to emit the following wherever you write %n (where n is the number of the input/output):
-16(%rsp)
There is no way to manipulate the above into the form you actually want. You can write:
(%1, %%rcx)
but this will produce:
(-16(%rsp),%rcx)
which is clearly wrong. There is no way to get the offset register inside of those parentheses, where it belongs, since %n is emitting the whole -16(%rsp) as a chunk.
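To make this concrete, here is a minimal sketch (the variable is hypothetical, purely for illustration) of what an "m" operand expands to:
void demo(void) {
    float x = 0.0f;
    // GCC substitutes the complete memory operand -- e.g. "-4(%rbp)" or "x(%rip)" --
    // for %0, so there is no place left to splice in an index register.
    __asm__("movss %0, %%xmm0" : : "m"(x) : "%xmm0");
}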
Of course, this is not really an issue: you write inline assembly for speed, so you want the pointers in registers anyway. If you use a register constraint (r) for the inputs/outputs, you don't have this problem. Notice that this will require modifying your code slightly.
Other things wrong with your inline assembly include:
Immediate (constant) operands must begin with $; without it, 0x20 and 0x200 are treated as absolute memory addresses.
Instructions should have size suffixes, like l for 32-bit and q for 64-bit.
You are clobbering memory when you write through a, so you should have a memory clobber.
The nop instructions at the beginning and the end are completely pointless. They aren't even aligning the branch target.
Every line should really end with a tab character (\t), in addition to a new-line (\n), so that you get proper alignment when you inspect the disassembly.
Here is my version of the code:
void addArray(float *a, float *b, float *c) {
__asm__ __volatile__ (
"xorl %%ecx, %%ecx \n\t" // Loop counter set to 0
"loop: \n\t"
"vmovaps (%1,%%rcx), %%ymm0 \n\t" // Load 8 elements from b
"vaddps (%2,%%rcx), %%ymm0, %%ymm0 \n\t" // Add 8 elements from c
"vmovaps %%ymm0, (%0,%%rcx) \n\t" // Store result in a
"addl $0x20, %%ecx \n\t" // 8 elemtns * 4 bytes = 32 (0x20)
"cmpl $0x200, %%ecx \n\t" // 128 elements * 4 bytes = 512 (0x200)
"jb loop" // Loop"
: // Outputs
: "r" (a), "r" (b), "r" (c) // Inputs
: "%ecx", "%ymm0", "memory" // Modifies ECX, YMM0, and memory
);
}
This causes the compiler to emit the following:
addArray(float*, float*, float*):
xorl %ecx, %ecx
loop:
vmovaps (%rsi,%rcx), %ymm0 # b
vaddps (%rdx,%rcx), %ymm0, %ymm0 # c
vmovaps %ymm0, (%rdi,%rcx) # a
addl $0x20, %ecx
cmpl $0x200, %ecx
jb loop
vzeroupper
retq
Or, in the more familiar Intel syntax:
addArray(float*, float*, float*):
xor ecx, ecx
loop:
vmovaps ymm0, YMMWORD PTR [rsi + rcx]
vaddps ymm0, ymm0, YMMWORD PTR [rdx + rcx]
vmovaps YMMWORD PTR [rdi + rcx], ymm0
add ecx, 32
cmp ecx, 512
jb loop
vzeroupper
ret
In the System V 64-bit calling convention, the first three parameters are passed in the rdi, rsi, and rdx registers, so the code doesn't need to move the parameters into registers—they are already there.
But you are not using input/output constraints to their fullest. You don't need rcx to be the loop counter, nor ymm0 to be the scratch register; if you let the compiler pick which free registers to use, it can generate more efficient code, and you won't need to clobber any specific registers, only memory:
#include <stdint.h>
#include <x86intrin.h>
void addArray(float *a, float *b, float *c) {
uint64_t temp = 0;
__m256 ymm;
__asm__ __volatile__(
"loop: \n\t"
"vmovaps (%3,%0), %1 \n\t" // Load 8 elements from b
"vaddps (%4,%0), %1, %1 \n\t" // Add 8 elements from c
"vmovaps %1, (%2,%0) \n\t" // Store result in a
"addl $0x20, %0 \n\t" // 8 elemtns * 4 bytes = 32 (0x20)
"cmpl $0x200, %0 \n\t" // 128 elements * 4 bytes = 512 (0x200)
"jb loop" // Loop
: "+r" (temp), "=x" (ymm)
: "r" (a), "r" (b), "r" (c)
: "memory"
);
}
Of course, as has been mentioned in the comments, this entire exercise is a waste of time. GAS-style inline assembly, although powerful, is exceedingly difficult to write correctly (I'm not even 100% positive that my code here is correct!), so you should not write anything in inline assembly that you don't absolutely have to. And this is certainly not such a case: the compiler will optimize the addition loop automatically:
void addArray(float *a, float *b, float *c) {
for (int i = 0; i < 128; i++) a[i] = b[i] + c[i];
}
With -O2 and -mavx2, GCC compiles this to the following:
addArray(float*, float*, float*):
xor eax, eax
.L2:
vmovss xmm0, DWORD PTR [rsi+rax]
vaddss xmm0, xmm0, DWORD PTR [rdx+rax]
vmovss DWORD PTR [rdi+rax], xmm0
add rax, 4
cmp rax, 512
jne .L2
rep ret
Well, that looks awfully familiar, doesn't it? To be fair, it isn't vectorized like your code is. You can get that by using -O3 or -ftree-vectorize, but you also get a lot more code generated, so I'd need a benchmark to convince me that it was actually faster and worth the explosion in code size. Most of that extra code exists to handle inputs that aren't aligned; if you tell the compiler that the pointers are aligned and restricted, those problems go away and the code generation improves substantially, with the loop completely unrolled and the addition vectorized.
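For what it's worth, here is a sketch of one way you might pass those hints to GCC, using __restrict and __builtin_assume_aligned (compile with -O3 -mavx):
void addArray(float *__restrict a, float *__restrict b, float *__restrict c) {
    // Promise 32-byte alignment, matching the assumption in the assembly version.
    a = (float *)__builtin_assume_aligned(a, 32);
    b = (float *)__builtin_assume_aligned(b, 32);
    c = (float *)__builtin_assume_aligned(c, 32);
    for (int i = 0; i < 128; i++)
        a[i] = b[i] + c[i];
}
With those hints the vectorizer no longer has to emit the unaligned-prologue and scalar-remainder code, so the generated loop ends up much closer to the hand-written one.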
Related
Very specific optimisation task.
I have 3 arrays:
const char* inputTape
const int* inputOffset, organised in groups of four
char* outputTape
from which I must assemble the output tape, according to the following 5 operations:
int selectorOffset = inputOffset[4*i];
char selectorValue = inputTape[selectorOffset];
int outputOffset = inputOffset[4*i+1+selectorValue];
char outputValue = inputTape[outputOffset];
outputTape[i] = outputValue; // store byte
and then advance the counter.
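Put together, one pass looks like this (a minimal scalar sketch; the loop bound n and the exact signature are my assumption):
void assemble(char *outputTape, const char *inputTape, const int *inputOffset, int n) {
    for (int i = 0; i < n; i++) {
        int  selectorOffset = inputOffset[4*i];                     // 1. selector offset
        char selectorValue  = inputTape[selectorOffset];            // 2. selector value
        int  outputOffset   = inputOffset[4*i + 1 + selectorValue]; // 3. output offset
        char outputValue    = inputTape[outputOffset];              // 4. output value
        outputTape[i]       = outputValue;                          // 5. store byte
    }
}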
All iterations are the same and could all be done in parallel. The format of inputOffset could be changed, as long as the same input produces the same output.
OpenCL on the GPU fails on this algorithm (it runs the same as, or even slower than, the CPU).
In assembly, the best I got is 5 mov, 1 lea, and 1 dec instructions per iteration. Upd:
thanks to Peter Cordes' little hint
loop_start:
mov eax,dword ptr [rdx-10h] ; selector offset
movzx r10d,byte ptr [rax+r8] ; selector value
mov eax,dword ptr [rdx+r10*4-0Ch] ; output offset
movzx r10d,byte ptr [r8+rax] ; output value
mov byte ptr [r9+rcx-1],r10b ; store to outputTape
lea rdx, [rdx-10h] ; pointer to inputOffset for current
dec ecx ; loop counter, sets zero flag if (ecx == 0)
jne loop_start ; continue looping while non zero iterations left: ( ecx != 0 )
How could I optimise this for SSE/AVX operation? I am stumped...
UPD: better to see it than to hear it..
I've been through a couple of threads already, and everything seems to be worded correctly... but my output_01 is returning false. It seems that the inline assembly is writing zeros to my variables, and I just can't figure out why. Below is the code from the main C file which calls the assembly, and the assembly it calls (I've thrown in the header file as well, though I don't think it has any bearing... but the devil is in the details, right?).
main file
#include <stdio.h>
#include "lab.h"
int output_01();
int main()
{
printf("Starting exercise lab_01:\n");
if(output_01())
printf("lab_01: Successful!");
else
printf("lab_01: Failed!");
return 0;
}
int output_01()
{
int eax=0;
int edx=0;
int ecx=0;
asm("call lab_01;"
"mov %0, eax;"
"mov %1, edx;"
"mov %2, ecx;"
:
:"r" (eax), "r" (edx), "r" (ecx)
:
);
if(eax==3 && edx==1 && ecx==2)
{
printf("\teax = %i\n",eax);
printf("\tedx = %i\n",edx);
printf("\tecx = %i\n",ecx);
return 1;
}
else
{
printf("\teax = %i\n",eax);
printf("\tedx = %i\n",edx);
printf("\tecx = %i\n",ecx);
return 0;
}
}
assembly file
BITS 32 ;you must specify bits mode
segment .text ;you must specify a section
GLOBAL lab_01, labSize_01
lab_01:
;instructions:
;the following registers have the following values:
;eax = 1
;edx = 2
;ecx = 3
;Make it so that the registers have the following values, using only the allowed opcodes and registers:
;eax = 3
;edx = 1
;ecx = 2
;Allowed registers: eax,ebx,ecx,edx
;Allowed opcodes: mov, int3
;Non volatile registers: ebp, ebx, edi, esi
;Volatile registers: eax, ecx, edx
;Forbidden items: immediate values, memory addresses
;;;;;;;;;;;;; EXERCISE SETUP CODE - DO NOT TOUCH
int3 ;make it 'easier' to debug
push ebx; this is to save ebx onto the stack.
mov eax, 1
mov edx, 2
mov ecx, 3
;;;;;;;;;;;;; YOUR CODE BELOW
;int3 ;make it 'easier' to debug
mov ebx, eax ;hold 1
mov eax, ecx ;eax is set 3
mov ecx, edx ;ecx is set to 2
mov edx, ebx ;edx is set to 1
int3 ;make it 'easier' to debug
;;;;;;;;;;;;; YOUR CODE ABOVE
pop ebx;
ret
labSize_01 dd $-lab_01 -1
lab.h file:
extern int lab_01();
You listed the registers as inputs only, so the compiler just passes in the initial zero values and never reads anything back; you have no outputs at all. The correct asm for this is:
asm("call lab_01;"
: "=a" (eax), "=d" (edx), "=c" (ecx)
);
I want to test if two SSE registers are not both zero without destroying them.
This is the code I currently have:
uint8_t *src; // Assume it is initialized and 16-byte aligned
__m128i xmm0, xmm1, xmm2;
xmm0 = _mm_load_si128((__m128i const*)&src[i]); // Need to preserve xmm0 & xmm1
xmm1 = _mm_load_si128((__m128i const*)&src[i+16]);
xmm2 = _mm_or_si128(xmm0, xmm1);
if (!_mm_testz_si128(xmm2, xmm2)) { // Test both are not zero
}
Is this the best way (using up to SSE 4.2)?
I learned something useful from this question. Let's first look at some scalar code
extern foo2(int x, int y);
void foo(int x, int y) {
if((x || y)!=0) foo2(x,y);
}
Compile this with gcc -O3 -S -masm=intel test.c, and the important assembly is
mov eax, edi ; edi = x, esi = y -> copy x into eax
or eax, esi ; eax = x | y and set zero flag in FLAGS if zero
jne .L4 ; jump not zero
Now let's look at testing SIMD registers for zero. Unlike scalar code there is no SIMD FLAGS register. However, with SSE4.1 there are SIMD test instructions which can set the zero flag (and carry flag) in the scalar FLAGS register.
extern foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
__m128i z = _mm_or_si128(x,y);
if (!_mm_testz_si128(z,z)) foo2(x,y);
}
Compile with c99 -msse4.1 -O3 -masm=intel -S test_SSE.c, and the important assembly is
movdqa xmm2, xmm0 ; xmm0 = x, xmm1 = y, copy x into xmm2
por xmm2, xmm1 ; xmm2 = x | y
ptest xmm2, xmm2 ; set zero flag if zero
jne .L4 ; jump not zero
Notice that this takes one more instruction because the packed bit-wise OR does not set the zero flag. Notice also that both the scalar version and the SIMD version need to use an additional register (eax in the scalar case and xmm2 in the SIMD case). So to answer your question your current solution is the best you can do.
However, if you do not have a processor with SSE4.1 or better, you have to use _mm_movemask_epi8, which only needs SSE2:
extern foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
if (_mm_movemask_epi8(_mm_or_si128(x,y))) foo2(x,y);
}
The important assembly is
movdqa xmm2, xmm0
por xmm2, xmm1
pmovmskb eax, xmm2
test eax, eax
jne .L4
Notice that this needs one more instruction than with the SSE4.1 ptest instruction.
Until now I have been using the pmovmskb instruction, because its latency was better on pre-Sandy-Bridge processors than that of ptest. However, that was before Haswell: on Haswell the latency of pmovmskb is worse than the latency of ptest. They both have the same throughput, but in this case that is not really important. What is important (and what I did not realize before) is that pmovmskb does not set the FLAGS register, so it requires another instruction. Now I'll be using ptest in my critical loop. Thank you for your question.
Edit: As suggested by the OP, there is a way to do this without using another SSE register.
extern foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
if (_mm_movemask_epi8(x) | _mm_movemask_epi8(y)) foo2(x,y);
}
The relevant assembly from GCC is:
pmovmskb eax, xmm0
pmovmskb edx, xmm1
or edx, eax
jne .L4
Instead of using another xmm register this uses two scalar registers.
Note that fewer instructions do not necessarily mean better performance. Which of these solutions is best? You have to test each of them to find out.
If you use C/C++, you cannot control the individual CPU registers. If you want full control, you must use assembler.
In my quest to learn assembly (using GCC on x86_64), I have come across some SSE examples where, instead of just copying a C variable into a register, its address is copied into EAX. Why do that when you can just do this:
typedef float v4sf __attribute__((vector_size(16)));
typedef union {
v4sf v;
float f[4];
} Vec4;
Vec4 vector = { .v = (v4sf){ 64.1, 128.2, 256.3, 512.4 } };
float blah = 2.2;
__asm__("movups %0, %%xmm0 \n\t"
"movups %1, %%xmm1 \n\t"
"shufps $0x00, %%xmm1, %%xmm1 \n\t"
"mulps %%xmm1, %%xmm0 \n\t"
"movups %%xmm0, %0 \n\t"
: "+m"(vector)
: "m"(blah)
: "%xmm0","%xmm1"
);
Does copying the vector into xmm0 (rather than keeping it in memory) cause a performance hit?
Here is an example of what I'm talking about (it's Intel syntax):
void powf_schlickSSE(const float * a, const float b, float * result){
__asm {
mov eax, a //load address of vector
movss xmm0, dword ptr [b] //load exponent into SSE register
movups xmm1, [eax] //load vector into SSE register
shufps xmm0, xmm0, 0 //shuffle b into all floats
movaps xmm2, xmm1 //duplicate vector
mov eax, result //load address of result
mulps xmm1, xmm0 //xmm1 = a*b
subps xmm0, xmm1 //xmm0 = b-a*b
addps xmm0, xmm2 //xmm0 = b-a*b+a
rcpps xmm0, xmm0 //xmm0 = 1 / (b-a*b+a)
mulps xmm2, xmm0 //xmm2 = a * (1 / (b-a*b+a))
movups [eax], xmm2 //store result
}
}
I can see multiple reasons
MSVC (which that Intel-syntax code comes from, right?) doesn't support passing __m128 values into assembly blocks, or at least the version the code was written for didn't. Or maybe that version didn't support SSE at all except via inline assembly.
The rest of the program didn't deal with vector types, so passing by pointer was the simplest solution.
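For comparison, here is a minimal intrinsics sketch (my own, not part of the code above) of the same broadcast-and-multiply; it sidesteps the whole question by letting the compiler choose the registers and addressing:
#include <xmmintrin.h>

// Multiply all four lanes of v by the scalar s (equivalent to the shufps/mulps pair above).
static inline __m128 scale4(__m128 v, float s) {
    return _mm_mul_ps(v, _mm_set1_ps(s));
}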
I have an asm loop guaranteed not to go over 128 iterations that I want to unroll with a PC-relative jump. The idea is to unroll each iteration in reverse order and then jump however far into the loop it needs to be. The code would look like this:
#define __mul(i) \
"movq -"#i"(%3,%5,8),%%rax;" \
"mulq "#i"(%4,%6,8);" \
"addq %%rax,%0;" \
"adcq %%rdx,%1;" \
"adcq $0,%2;"
asm("jmp (128-count)*size_of_one_iteration" // I need to figure this jump out
__mul(127)
__mul(126)
__mul(125)
...
__mul(1)
__mul(0)
: "+r"(lo),"+r"(hi),"+r"(overflow)
: "r"(a.data),"r"(b.data),"r"(i-k),"r"(k)
: "%rax","%rdx");
Is something like this possible with gcc inline assembly?
In gcc inline assembly, you can use labels and have the assembler sort out the jump target for you. Something like (contrived example):
int max(int a, int b)
{
int result;
__asm__ __volatile__(
"movl %1, %0\n"
"cmpl %2, %0\n"
"jge a_is_larger\n"
"movl %2, %0\n"
"a_is_larger:\n" : "=&r"(result) : "r"(a), "r"(b) : "cc");
return (result);
}
That's one thing. The other thing you could do to avoid the multiplication is to make the assembler align the blocks for you, say, at a multiple of 32 bytes (I don't think the instruction sequence fits into 16 bytes), like:
#define mul(i) \
".align 32\n" \
".Lmul" #i ":\n" \
"movq -" #i "(%3,%5,8),%%rax\n"\
"mulq " #i "(%4,%6,8)\n" \
"addq %%rax,%0\n" \
"adcq %%rdx,%1\n" \
"adcq $0,%2\n"
This will simply pad the instruction stream with nops. If you choose not to align these blocks, you can still, in your main expression, use the generated local labels to find the size of one assembly block:
#ifdef UNALIGNED
__asm__ ("imul $(.Lmul0-.Lmul1), %[label]\n"
#else
__asm__ ("shlq $5, %[label]\n"
#endif
"leaq .Lmulblkstart, %[dummy]\n" /* this is PC-relative in 64bit */
"jmp (%[dummy], %[label])\n"
".align 32\n"
".Lmulblkstart:\n"
__mul(127)
...
__mul(0)
: ... [dummy]"=&r"(dummy) : [label]"r"((128-count)))
And for the case where count is a compile-time constant, you can even do:
__asm__("jmp .Lmul" #count "\n" ...);
A little note at the end:
Aligning the blocks is a good idea if the autogenerated __mul() blocks can come out with different lengths. For the constants 0..127 you use, that won't be the case, since they all fit into a byte; but if you scale them up, the displacements grow to 16- or 32-bit values and the instruction blocks grow alongside them. By padding the instruction stream, the jump-table technique can still be used.
This isn't a direct answer, but have you considered using a variant of
Duff's Device instead of inline
assembly? That would take the form of a switch statement:
switch(iterations) {
case 128: /* code for i=128 here */
case 127: /* code for i=127 here */
case 126: /* code for i=126 here */
/* ... */
case 1: /* code for i=1 here*/
break;
default: die("too many cases");
}
Sorry, I can't provide the answer in AT&T syntax; I hope you can easily perform the translation.
If you have the count in RCX and you can have a label just after __mul(0) then you could do this:
; rcx must be in [0..128] range.
imul ecx, ecx, -size_of_one_iteration ; Notice the multiplier is negative (using ecx is faster, the upper half of RCX will be automatically cleared by CPU)
lea rcx, [rcx + the_label] ; There is no memory read here
jmp rcx
Hope this helps.
EDIT:
I made a mistake yesterday. I had assumed that referencing a label in [rcx + the_label] is resolved as [rcx + rip + disp], but it is not, since there is no such addressing mode (only [rip + disp32] exists).
This code should work; additionally, it will leave rcx untouched and destroy rax and rdx instead (but your code seems not to read them before writing to them anyway):
; rcx must be in [0..128] range.
imul edx, ecx, -size_of_one_iteration ; Notice the multiplier is negative (using 32-bit registers is faster; the upper half of RDX will be automatically cleared by the CPU)
lea rax, [the_label] ; PC-relative addressing (There is no memory read here)
add rax, rdx
jmp rax