I write a lot of vectorized loops, so one common idiom is
volatile int dummy[1<<10];
for (int64_t i = 0; i + 16 <= argc; i += 16) // process all elements with whole vector
{
    int x = dummy[i];
}
// handle remainder (hopefully with SIMD too)
But the resulting machine code has one more instruction than I would like (using gcc 4.9):
.L3:
leaq -16(%rax), %rdx
addq $16, %rax
cmpq %rcx, %rax
movl -120(%rsp,%rdx,4), %edx
jbe .L3
If I change the code to for (int64_t i = 0; i <= argc - 16; i += 16), then the "extra"
instruction is gone:
.L2:
movl -120(%rsp,%rax,4), %ecx
addq $16, %rax
cmpq %rdx, %rax
jbe .L2
But why the difference? I was thinking maybe it was due to loop invariants, but only vaguely. Then I noticed that in the 5-instruction case, the increment is done before the load, which would require an extra mov due to x86's destructive 2-operand instructions.
So another explanation could be that it's trading instruction-level parallelism for one extra instruction.
Although it seems there would hardly be any performance difference, can someone (preferably someone who knows about compiler transformations) explain this mystery?
Ideally I would like to keep the i + 16 <= size form, since that has a more intuitive meaning (the last element of the vector doesn't go out of bounds).
If argc were below -2147483632 and i were below 2147483632, the expression i+16 <= argc would be required to yield an arithmetically correct result, while the expression i <= argc-16 would not (argc-16 would overflow in int arithmetic). The need to give an arithmetically correct result in that corner case prevents the compiler from optimizing the former expression to match the latter.
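To make that corner case concrete, here's a small sketch (my own example, assuming 32-bit int): with argc at INT_MIN, the original comparison is well-defined, but the rewritten one would overflow in int arithmetic:
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int     argc = INT_MIN;   /* below -2147483632 */
    int64_t i    = 0;

    /* i + 16 is evaluated in int64_t (argc gets promoted), so this is
       well-defined, and false for this argc */
    printf("%d\n", i + 16 <= argc);

    /* argc - 16 would be evaluated in int and overflow (undefined behaviour),
       which is why the compiler may not rewrite the test above into this form: */
    /* printf("%d\n", i <= argc - 16); */
    return 0;
}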
Consider this C code:
int f(void) {
    int ret;
    char carry;
    __asm__(
        "nop # do something that sets eax and CF"
        : "=a"(ret), "=@ccc"(carry)
    );
    return carry ? -ret : ret;
}
When I compile it with gcc -O3, I get this:
f:
nop # do something that sets eax and CF
setc %cl
movl %eax, %edx
negl %edx
testb %cl, %cl
cmovne %edx, %eax
ret
If I change char carry to int carry, I instead get this:
f:
nop # do something that sets eax and CF
setc %cl
movl %eax, %edx
movzbl %cl, %ecx
negl %edx
testl %ecx, %ecx
cmovne %edx, %eax
ret
That change replaced testb %cl, %cl with movzbl %cl, %ecx and testl %ecx, %ecx. The two versions are actually equivalent, though, and GCC knows it. As evidence of this, if I compile with -Os instead of -O3, then both char carry and int carry result in the exact same assembly:
f:
nop # do something that sets eax and CF
jnc .L1
negl %eax
.L1:
ret
It seems like one of two things must be true, but I'm not sure which:
A testb is faster than a movzbl followed by a testl, so GCC's use of the latter with int is a missed optimization.
A testb is slower than a movzbl followed by a testl, so GCC's use of the former with char is a missed optimization.
My gut tells me that an extra instruction will be slower, but I also have a nagging doubt that it's preventing a partial register stall that I just don't see.
By the way, the usual recommended approach of xoring the register to zero before the setc doesn't work in my real example. You can't do it after the inline assembly runs, since xor will overwrite the carry flag, and you can't do it before the inline assembly runs, since in the real context of this code, every general-purpose call-clobbered register is already in use somehow.
There's no downside I'm aware of to reading a byte register with test vs. movzb.
If you are going to zero-extend, it's also a missed optimization not to xor-zero a reg ahead of the asm statement and setc into that, so the cost of zero-extension is off the critical path (on CPUs other than Intel IvyBridge+, where movzx r32, r8 is not zero latency). Assuming there's a free register, of course. Recent GCC does sometimes find this zero/set-flags/setcc optimization for generating a 32-bit boolean from a flag-setting instruction, but often misses it when things get complex.
Fortunately for you, your real use-case couldn't do that optimization anyway (except with mov $0, %eax zeroing, which would be off the critical path for latency but cause a partial-register stall on Intel P6 family, and cost more code size.) But it's still a missed optimization for your test case.
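For the toy version (not the real code, where no call-clobbered register is spare), here's a rough sketch of hand-applying that xor-zero + setc idea; the "+r" operand and the %b1 byte-register modifier are my own choices, not something the question's code uses:
int f(void)
{
    int ret;
    int carry = 0;                    /* compiler materializes a zeroed register
                                         (typically with xor) before the asm */
    __asm__(
        "nop    # do something that sets eax and CF\n\t"
        "setc   %b1"                  /* write only the low byte of the pre-zeroed reg */
        : "=a"(ret), "+r"(carry)
    );
    return carry ? -ret : ret;
}
That keeps the zero-extension off the critical path, at the cost of tying up a register across the asm statement.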
This question is similar to another question I posted here. I am attempting to write the Assembly version of the following in c/c++:
int x[10];
for (int i = 0; i < 10; i++) {
    x[i] = i;
}
Essentially, creating an array storing the values 0 through 9.
My current logic is to create a label that loops up to 10 (jumping back to itself until reaching the end value). In the loop, I have placed the instructions to update the array at the current index of iteration. However, after compiling with gcc filename.s and running with ./a.out, the error Segmentation fault: 11 is printed to the console. My code is below:
.data
x:.fill 10, 4
index:.int 0
end:.int 10
.text
.globl _main
_main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
jmp outer_loop
leave
ret
outer_loop:
movl index(%rip), %eax;
cmpl end(%rip), %eax
jge end_loop
lea x(%rip), %rdi;
mov index(%rip), %rsi;
movl index(%rip), %eax;
movl %eax, (%rdi, %rsi, 4)
incl index(%rip)
jmp outer_loop
leave
ret
end_loop:
leave
ret
Oddly the code below
lea x(%rip), %rdi;
mov index(%rip), %rsi;
movl index(%rip), %eax;
movl %eax, (%rdi, %rsi, 4)
works only if it is not in a label that is called repetitively. Does anyone know how I can implement the code above in a loop, without Segmentation fault: 11 being raised? I am using x86 Assembly on MacOS with GNU GAS syntax compiled with gcc.
Please note that this question is not a duplicate of this question as different Assembly syntax is being used and the scope of the problem is different.
You're using a 64-bit instruction to access a 32-bit area of memory:
mov index(%rip), %rsi;
This results in %rsi being assigned the contents of memory starting at index and running through end (I'm assuming there's no padding between them, though I don't remember GAS's alignment rules). Thus, %rsi is effectively assigned the value 0xa00000000 (on the first iteration of the loop), and executing the following movl %eax, (%rdi, %rsi, 4) makes the CPU try to access an address that's not mapped by your process.
The solution is to remove the assignment, and replace the line after it with movl index(%rip), %esi. 32-bit operations are guaranteed to always clear out the upper bits of 64-bit registers, so you can then safely use %rsi in the address calculation, as it's going to contain the current index and nothing more.
Your debugger would've told you this, so please do use it next time.
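In C terms, that 8-byte load is doing roughly this (my own sketch, assuming a little-endian machine and end laid out right after index, as in the .data section above):
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* same layout as the .data section: a 4-byte index followed by a 4-byte end */
    struct { int32_t index; int32_t end; } mem = { 0, 10 };

    int64_t rsi;
    memcpy(&rsi, &mem, sizeof rsi);              /* what "mov index(%rip), %rsi" reads */
    printf("%#llx\n", (unsigned long long)rsi);  /* prints 0xa00000000, i.e. 10 << 32 */
    return 0;
}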
I am trying to write a for loop that does multiplication by adding a number (var a) to a running sum, (var b) times.
.globl times
times:
movl $0, %ecx # i = 0
cmpl %esi, %ecx #if i-b
jge end # if >= 0, jump to end
loop:
addl (%edi, %eax), %eax #sum += a    ie eax = sum + (a x 4bits)
incl %ecx # i++
cmpl %esi, %ecx # compare (i-b)
jl loop # < 0? loop b times total
end:
ret
Where am I going wrong? I've run through the logic and I can't figure out what the problem is.
TL;DR: you didn't zero EAX, and your ADD instruction uses a memory operand.
You should have used a debugger. You'd easily have seen that EAX wasn't zero to start with. See the bottom of the x86 tag wiki for tips on using gdb to debug asm.
I guess you're using the x86-64 System V ABI, so your args (a and b) are in %edi and %esi.
At the start of a function, registers other than the ones holding your args should be assumed to contain garbage. Even the high parts of registers that are holding your args can contain garbage. (exception to this rule: unofficially, narrow args are sign or zero extended to 32-bit by the caller)
Neither arg is a pointer, so you shouldn't dereference them. add (%edi, %eax), %eax calculates a 32-bit address as EDI+EAX, and then loads 32 bits from there. It adds that dword to EAX (the destination operand).
I'm shocked that your program didn't segfault, since you're using your integer arg as a pointer.
For many x86 instructions (like ADD), the destination operand is not write-only. add %edi, %eax does EAX += EDI. I think you're getting mixed up with 3-operand RISC syntax, where you might have an instruction like add %src1, %src2, %dst.
x86 has some instructions like that, added as recent extensions, like BMI2 bzhi, but the usual instructions are all 2-operand with destructive destinations (except for LEA, which, instead of loading from the address, stores the address in the destination). So lea (%edi, %eax), %eax would work, and you could even put the result in a different register. LEA is great for saving MOV instructions by doing a shift+add and a mov all in one instruction, using the addressing-mode syntax and machine-code encoding.
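Just to illustrate that non-destructive property, here's a toy sketch of mine (lea_add is a hypothetical helper, not anything from the question): LEA computes the address-style sum into a third register while leaving both inputs intact.
#include <stdio.h>

/* hypothetical helper: use LEA as a copy-and-add */
static int lea_add(int a, int b)
{
    int dst;
    __asm__("leal (%1,%2), %0"        /* dst = a + b, without clobbering a or b */
            : "=r"(dst)
            : "r"(a), "r"(b));
    return dst;
}

int main(void)
{
    printf("%d\n", lea_add(3, 4));    /* prints 7 */
    return 0;
}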
You have a comment that says ie eax = sum + (a x 4bits). No clue what you're talking about there. a is 4 bytes (not bits), and you're not multiplying a (%edi) by anything.
Just for fun, here's how I'd write your function (if I had to avoid imul %edi, %esi / mov %esi, %eax). I'll assume both args are non-negative, to keep it simple. If your args are signed integers, and you have to loop -b times if b is negative, then you need some extra code.
# args: int a(%edi), int b(%esi)   # comments are important for documenting inputs/outputs to blocks of code
# return value: product in %eax
# assumptions: b is non-negative.
times:
    xor   %eax, %eax      # zero eax
    test  %esi, %esi      # set flags from b
    jz    loop_end        # early-out if it's zero
loop:                     # do{
    add   %edi, %eax      #   sum += a,
    dec   %esi            #   b--  (setting flags based on the result, except for CF, so don't use ja or jb after it)
    jg    loop            # }while(b > 0)
loop_end:
    ret
Note the indenting style, so it's easy to find the branch targets. Some people like to indent extra for instructions inside loops.
Your way works fine (if you do it right), but my way illustrates that counting down is easier in asm (no need for an extra register or immediate to hold the upper bound). It also avoids redundant compares. But don't worry about optimizing until after you're comfortable writing code that at least works.
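As a rough C-level sketch of what that counting-down loop corresponds to (my own illustration, assuming b is non-negative, with unsigned types for simplicity):
unsigned times(unsigned a, unsigned b)
{
    unsigned sum = 0;            /* xor  %eax, %eax          */
    if (b != 0) {                /* test %esi, %esi / jz     */
        do {
            sum += a;            /* add  %edi, %eax          */
        } while (--b != 0);      /* dec  %esi / jg (or jnz)  */
    }
    return sum;
}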
This is pseudocode, keep that in mind.
mov X,ebx <- put into EBX your counter, your B
mov Y,edx <- put into EDX your value, your A
mov 0,eax <- Result
loop:
add eax,edx
dec ebx
jnz loop <- While EBX is not zero
The above should leave the result in EAX. Your code looks like it's missing the EAX initialisation.
I'm a total noob at assembly code and at reading it as well. I have a simple C function:
void saxpy()
{
    for (int i = 0; i < ARRAY_SIZE; i++) {
        float product = a*x[i];
        z[i] = product + y[i];
    }
}
Compiling it with
gcc -std=c99 -O3 -fno-tree-vectorize -S code.c -o code-O3.s
gives me the following assembly code:
saxpy:
.LFB0:
.cfi_startproc
movss a(%rip), %xmm1
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
movss x(%rax), %xmm0
addq $4, %rax
mulss %xmm1, %xmm0
addss y-4(%rax), %xmm0
movss %xmm0, z-4(%rax)
cmpq $262144, %rax
jne .L3
rep ret
.cfi_endproc
I do understand that loop unrolling has taken place, but I'm not able to understand the intention and idea behind:
addq $4, %rax
mulss %xmm1, %xmm0
addss y-4(%rax), %xmm0
movss %xmm0, z-4(%rax)
Can someone explain the usage of 4, and what the statement y-4(%rax) means?
x, y, and z are global arrays. You left out the end of the listing where the symbols are declared.
I put your code on godbolt for you, with the necessary globals defined (and fixed the indenting). Look at the bottom.
BTW, there's no unrolling going on here. There's one each scalar single-precision mul and add in the loop. Try with -funroll-loops to see it unroll.
With -march=haswell, gcc will use an FMA instruction. If you un-cripple the compiler by leaving out -fno-tree-vectorize, and #define ARRAY_SIZE is small, like 100, it fully unrolls the loop with mostly 32byte FMA ymm instructions, ending with some 16byte FMA xmm.
Also, what is the need to add an immediate value 4 to the rax register, which is done as per the statement addq $4, %rax?
The loop increments a byte offset in %rax by 4 (sizeof(float)) each iteration, instead of using a scaled-index addressing mode. Because the increment happens before the load from y and the store to z, those two use a -4 displacement (y-4(%rax), z-4(%rax)) to address the same element that x(%rax) did.
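As a rough C-level reconstruction (mine, not the original source), the loop walks one byte offset through all three arrays; the 65536-element size is inferred from the cmpq $262144 (65536 floats × 4 bytes):
#define ARRAY_SIZE 65536   /* inferred: 65536 * sizeof(float) == 262144, the cmpq limit */

float a;
float x[ARRAY_SIZE], y[ARRAY_SIZE], z[ARRAY_SIZE];

void saxpy(void)
{
    const char *xb = (const char *)x;   /* byte views of the arrays, matching the   */
    const char *yb = (const char *)y;   /* x(%rax), y-4(%rax), z-4(%rax) addressing */
    char *zb = (char *)z;

    for (long off = 0; off != ARRAY_SIZE * 4; off += 4) {      /* addq $4 / cmpq $262144 */
        float product = a * *(const float *)(xb + off);        /* movss x(%rax); mulss   */
        *(float *)(zb + off) =
            product + *(const float *)(yb + off);              /* addss; movss           */
    }
}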
Look at the links on https://stackoverflow.com/questions/tagged/x86. Also, single-stepping through code with a debugger is often a good way to make sure you understand what it's doing.
I'm a total noob at assembly, just poking around a bit to see what's going on. Anyway, I wrote a very simple function:
void multA(double *x, long size)
{
    long i;
    for (i = 0; i < size; ++i) {
        x[i] = 2.4*x[i];
    }
}
I compiled it with:
gcc -S -m64 -O2 fun.c
And I get this:
.file "fun.c"
.text
.p2align 4,,15
.globl multA
.type multA, @function
multA:
.LFB34:
.cfi_startproc
testq %rsi, %rsi
jle .L1
movsd .LC0(%rip), %xmm1
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
movsd (%rdi,%rax,8), %xmm0
mulsd %xmm1, %xmm0
movsd %xmm0, (%rdi,%rax,8)
addq $1, %rax
cmpq %rsi, %rax
jne .L3
.L1:
rep
ret
.cfi_endproc
.LFE34:
.size multA, .-multA
.section .rodata.cst8,"aM",@progbits,8
.align 8
.LC0:
.long 858993459
.long 1073951539
.ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
.section .note.GNU-stack,"",@progbits
The assembly output makes sense to me (mostly) except for the line xorl %eax, %eax. From googling, I gather that the purpose of this is simply to set %eax to zero, which in this case corresponds to my iterator long i;.
However, unless I am mistaken, %eax is a 32-bit register. So it seems to me that this should actually be xorq %rax, %rax, particularly since this is holding a 64-bit long int. Moreover, further down in the code, it actually uses the 64-bit register %rax to do the iterating, which never gets initialized outside of xorl %eax, %eax, which would seem to only zero out the lower 32 bits of the register.
Am I missing something?
Also, out of curiosity, why are there two .long constants there at the bottom? The first one, 858993459, is equal to the double floating-point representation of 2.4, but I can't figure out what the second number is or why it is there.
I gather that the purpose of this is simply to set %eax to zero
Yes.
which in this case corresponds to my iterator long i;.
No. Your i is uninitialized in the declaration. Strictly speaking, that operation corresponds to the i = 0 expression in the for loop.
However, unless I am mistaken, %eax is a 32-bit register. So it seems to me that this should actually be xorq %rax, %rax, particularly since this is holding a 64-bit long int.
But writing the lower double word of a register zero-extends into the full 64-bit register, so clearing %eax also clears all of %rax. This is not intuitive, but it's implicit in how x86-64 defines 32-bit operations.
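A quick way to convince yourself (a sketch I added, not from the original answer): load all-ones into %rax, clear only %eax, and observe that the whole 64-bit register reads back as zero.
#include <stdio.h>

int main(void)
{
    unsigned long long r = 0xffffffffffffffffULL;
    __asm__("movq %1, %%rax\n\t"
            "xorl %%eax, %%eax\n\t"   /* 32-bit write zero-extends into all of %rax */
            "movq %%rax, %0"
            : "=r"(r)
            : "r"(r)
            : "rax");
    printf("%#llx\n", r);             /* prints 0 */
    return 0;
}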
Just to answer the second part: .long means 32 bit, and the two integral constants side-by-side form the IEEE-754 representation of the double 2.4:
Dec:  1073951539  858993459
Hex:  0x40033333  0x33333333

      400          3333333333333
      S+E          Mantissa

The exponent is offset by 1023, so the actual exponent is 0x400 − 1023 = 1. The leading one in the mantissa is implied, so the value is 2¹ × 0b1.001100110011... (You recognize this periodic expansion as 3/15, i.e. 0.2. Sure enough, 2 × 1.2 = 2.4.)
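To check that numerically, here's a small sketch of mine that reassembles the two .long values into a double on a little-endian machine:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* the two .long values from .LC0, low word first, as laid out in memory */
    uint32_t words[2] = { 858993459u, 1073951539u };   /* 0x33333333, 0x40033333 */

    double d;
    memcpy(&d, words, sizeof d);   /* reinterpret the 8 bytes as an IEEE-754 double */
    printf("%.17g\n", d);          /* prints 2.3999999999999999, the double nearest 2.4 */
    return 0;
}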