I'm a total noob at assembly, just poking around a bit to see what's going on. Anyway, I wrote a very simple function:
void multA(double *x,long size)
{
long i;
for(i=0; i<size; ++i){
x[i] = 2.4*x[i];
}
}
I compiled it with:
gcc -S -m64 -O2 fun.c
And I get this:
.file "fun.c"
.text
.p2align 4,,15
.globl multA
.type multA, #function
multA:
.LFB34:
.cfi_startproc
testq %rsi, %rsi
jle .L1
movsd .LC0(%rip), %xmm1
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
movsd (%rdi,%rax,8), %xmm0
mulsd %xmm1, %xmm0
movsd %xmm0, (%rdi,%rax,8)
addq $1, %rax
cmpq %rsi, %rax
jne .L3
.L1:
rep
ret
.cfi_endproc
.LFE34:
.size multA, .-multA
.section .rodata.cst8,"aM",#progbits,8
.align 8
.LC0:
.long 858993459
.long 1073951539
.ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
.section .note.GNU-stack,"",#progbits
The assembly output makes sense to me (mostly) except for the line xorl %eax, %eax. From googling, I gather that the purpose of this is simply to set %eax to zero, which in this case corresponds to my iterator long i;.
However, unless I am mistaken, %eax is a 32-bit register. So it seems to me that this should actually be xorq %rax, %rax, particularly since this is holding a 64-bit long int. Moreover, further down in the code, it actually uses the 64-bit register %rax to do the iterating, which never gets initialized outside of xorl %eax %eax, which would seem to only zero out the lower 32 bits of the register.
Am I missing something?
Also, out of curiosity, why are there two .long constants there at the bottom? The first one, 858993459 is equal to the double floating-point representation of 2.4 but I can't figure out what the second number is or why it is there.
I gather that the purpose of this is simply to set %eax to zero
Yes.
which in this case corresponds to my iterator long i;.
No. Your i is uninitialized in the declaration. Strictly speaking, that operation corresponds to the i = 0 expression in the for loop.
However, unless I am mistaken, %eax is a 32-bit register. So it seems to me that this should actually be xorq %rax, %rax, particularly since this is holding a 64-bit long int.
But clearing the lower double word of the register clears the entire register. This is not intuitive, but it's implicit.
Just to answer the second part: .long means 32 bit, and the two integral constants side-by-side form the IEEE-754 representation of the double 2.4:
Dec: 1073951539 858993459
Hex: 0x40033333 0x33333333
400 3333333333333
S+E Mantissa
The exponent is offset by 1023, so the actual exponent is 0x400 − 1023 = 1. The leading "one" in the mantissa is implied, so it's 21 × 0b1.001100110011... (You recognize this periodic expansion as 3/15, i.e. 0.2. Sure enough, 2 × 1.2 = 2.4.)
Related
I found example of code on assembly, which finds the maximum number in array named data_items but that example was for x86 and I tried to adapt it for x64 because 32 bit absolute addressing is not supported by 64 bit system.
To be short there are three actions:
lea data_items(%rip), %rdi #(1) Obtaining data_items address
add $4, %rdi #(2) Incrementing the pointer to 4 to read a next item
movl (%rdi), %eax #(3) Reading data at %rdi to %eax
The main questions:
Is it correct way to pointing? Can it produce error after code relocation?
If the %rip register constantly grows, why lea data_items(%rip), %rdi loads correct memory address? May be getting an offset by %rip have special meaning rather than "dataItems + %rip"?
Full adapted code here:
.section __DATA,__data
data_items:
.long 3,67,34,222,45,75,54,34,44,33,22,11,66,0
.section __TEXT,__text
.globl _main
_main:
lea data_items(%rip), %rdi #(1)
movl (%rdi), %eax
movl %eax, %ebx
start_loop:
cmpl $0, %eax
je loop_exit
add $4, %rdi #(2)
movl (%rdi), %eax #(3)
cmpl %ebx, %eax
jle start_loop
movl %eax, %ebx
jmp start_loop
loop_exit:
mov $0x2000001, %rax
mov $0, %rdi
syscall
I am trying to understand scanf function a have 3 question regarding it.
this is c file:
#include <stdio.h>
#include <stdlib.h>
int main(){
int x;
printf("Enter X:\n");
scanf("%i",&x);
printf("You entered %d...\n",x);
return 0;
}
and here is gas:
.text
.section .rodata
.LC0:
.string "Enter X:"
.LC1:
.string "%i"
.LC2:
.string "You entered %d...\n"
.text
.globl main
.type main, #function
main:
pushq %rbp #
movq %rsp, %rbp #,
subq $16, %rsp #,
# a.c:5: printf("Enter X:\n");
leaq .LC0(%rip), %rdi #,
call puts#PLT #
# a.c:6: scanf("%i",&x);
leaq -4(%rbp), %rax #, tmp90
movq %rax, %rsi # tmp90,
leaq .LC1(%rip), %rdi #,
movl $0, %eax #,
call __isoc99_scanf#PLT #
# a.c:7: printf("You entered %d...\n",x);
movl -4(%rbp), %eax # x, x.0_1
movl %eax, %esi # x.0_1,
leaq .LC2(%rip), %rdi #,
movl $0, %eax #,
call printf#PLT #
# a.c:8: return 0;
movl $0, %eax #, _6
# a.c:9: }
leave
ret
.size main, .-main
.ident "GCC: (Debian 8.3.0-6) 8.3.0"
.section .note.GNU-stack,"",#progbits
1)
The rsi should take address of x int, but it takes the address from -4(%rbp), where there is nothing, in time of execution. Because the initialization of x variable comes from the stdin as scanf waits for input to init the variable. But the what is in -4(%rbp) in the time of instruction leaq -4(%rbp), %rax? It looks like garbage, not address of x, which value should be initialized from stdin.
2)according to this Integer describing number of floating point arguments in xmm registers not passed to rax, the movl $0, %eax is to zero FP registers in al, but that is the same convention for printf. So my question is, to which functions from glibc or other libraries apply this convetion? (So I have to zero %al in printf, scanf, ....?). I assume to every, that has va_list or variable argument?
3) where in the gas source is stack canary in that should protect scanf buffer from overflow? according to this: https://reverseengineering.stackexchange.com/questions/10823/how-does-scanf-interact-with-my-code-in-assembly, this should set canary (in masm):
0x080484c5 <+6>: mov eax,gs:0x14
0x080484cb <+12>: mov DWORD PTR [ebp-0xc],eax
0x080484ce <+15>: xor eax,eax
But I see nothing similar to this in my gas source, which is also output from gcc, which should set it by itself (unless there is some checking in the the scanf function itself which is not visible in my source). So where is it?
Consider this C code:
int f(void) {
int ret;
char carry;
__asm__(
"nop # do something that sets eax and CF"
: "=a"(ret), "=#ccc"(carry)
);
return carry ? -ret : ret;
}
When I compile it with gcc -O3, I get this:
f:
nop # do something that sets eax and CF
setc %cl
movl %eax, %edx
negl %edx
testb %cl, %cl
cmovne %edx, %eax
ret
If I change char carry to int carry, I instead get this:
f:
nop # do something that sets eax and CF
setc %cl
movl %eax, %edx
movzbl %cl, %ecx
negl %edx
testl %ecx, %ecx
cmovne %edx, %eax
ret
That change replaced testb %cl, %cl with movzbl %cl, %ecx and testl %ecx, %ecx. The program is actually equivalent, though, and GCC knows it. As evidence of this, if I compile with -Os instead of -O3, then both char carry and int carry result in the exact same assembly:
f:
nop # do something that sets eax and CF
jnc .L1
negl %eax
.L1:
ret
It seems like one of two things must be true, but I'm not sure which:
A testb is faster than a movzbl followed by a testl, so GCC's use of the latter with int is a missed optimization.
A testb is slower than a movzbl followed by a testl, so GCC's use of the former with char is a missed optimization.
My gut tells me that an extra instruction will be slower, but I also have a nagging doubt that it's preventing a partial register stall that I just don't see.
By the way, the usual recommended approach of xoring the register to zero before the setc doesn't work in my real example. You can't do it after the inline assembly runs, since xor will overwrite the carry flag, and you can't do it before the inline assembly runs, since in the real context of this code, every general-purpose call-clobbered register is already in use somehow.
There's no downside I'm aware of to reading a byte register with test vs. movzb.
If you are going to zero-extend, it's also a missed optimization not to xor-zero a reg ahead of the asm statement, and setc into that so the cost of zero-extension is off the critical path. (On CPUs other than Intel IvyBridge+ where movzx r32, r8 is not zero latency). Assuming there's a free register, of course. Recent GCC does sometimes find this zero/set-flags/setcc optimization for generating a 32-bit boolean from a flag-setting instruction, but often misses it when things get complex.
Fortunately for you, your real use-case couldn't do that optimization anyway (except with mov $0, %eax zeroing, which would be off the critical path for latency but cause a partial-register stall on Intel P6 family, and cost more code size.) But it's still a missed optimization for your test case.
This question is similar to another question I posted here. I am attempting to write the Assembly version of the following in c/c++:
int x[10];
for (int i = 0; i < 10; i++){
x[i] = i;
}
Essentially, creating an array storing the values 1 through 9.
My current logic is to create a label that loops up to 10 (calling itself until reaching the end value). In the label, I have placed the instructions to update the array at the current index of iteration. However, after compiling with gcc filename.s and running with ./a.out, the error Segmentation fault: 11 is printed to the console. My code is below:
.data
x:.fill 10, 4
index:.int 0
end:.int 10
.text
.globl _main
_main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
jmp outer_loop
leave
ret
outer_loop:
movl index(%rip), %eax;
cmpl end(%rip), %eax
jge end_loop
lea x(%rip), %rdi;
mov index(%rip), %rsi;
movl index(%rip), %eax;
movl %eax, (%rdi, %rsi, 4)
incl index(%rip)
jmp outer_loop
leave
ret
end_loop:
leave
ret
Oddly the code below
lea x(%rip), %rdi;
mov index(%rip), %rsi;
movl index(%rip), %eax;
movl %eax, (%rdi, %rsi, 4)
works only if it is not in a label that is called repetitively. Does anyone know how I can implement the code above in a loop, without Segmentation fault: 11 being raised? I am using x86 Assembly on MacOS with GNU GAS syntax compiled with gcc.
Please note that this question is not a duplicate of this question as different Assembly syntax is being used and the scope of the problem is different.
You're using a 64-bit instruction to access a 32-bit area of memory :
mov index(%rip), %rsi;
This results in %rsi being assigned the contents of memory starting from index and ending at end (I'm assuming no alignment, though I don't remember GAS's rules regarding it). Thus, %rsi effectively is assigned the value 0xa00000000 (assuming first iteration of the loop), and executing the following movl %eax, (%rdi, %rsi, 4) results in the CPU trying to access the address that's not mapped by your process.
The solution is to remove the assignment, and replace the line after it with movl index(%rip), %esi. 32-bit operations are guaranteed to always clear out the upper bits of 64-bit registers, so you can then safely use %rsi in the address calculation, as it's going to contain the current index and nothing more.
Your debugger would've told you this, so please do use it next time.
so im a total noob at assembly code and reading them as well
so i have a simple c code
void saxpy()
{
for(int i = 0; i < ARRAY_SIZE; i++) {
float product = a*x[i];
z[i] = product + y[i];
}
}
and the equivalent assembly code when compiled with
gcc -std=c99 -O3 -fno-tree-vectorize -S code.c -o code-O3.s
gives me the follows asssembly code
saxpy:
.LFB0:
.cfi_startproc
movss a(%rip), %xmm1
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
movss x(%rax), %xmm0
addq $4, %rax
mulss %xmm1, %xmm0
addss y-4(%rax), %xmm0
movss %xmm0, z-4(%rax)
cmpq $262144, %rax
jne .L3
rep ret
.cfi_endproc
i do understand that loop unrolling has taken place
but im not able to understand the intention and idea behind
addq $4, %rax
mulss %xmm1, %xmm0
addss y-4(%rax), %xmm0
movss %xmm0, z-4(%rax)
Can someone explain, the usage of 4, and
what does the statements mean
y-4(%rax)
x, y, and z are global arrays. You left out the end of the listing where the symbols are declared.
I put your code on godbolt for you, with the necessary globals defined (and fixed the indenting). Look at the bottom.
BTW, there's no unrolling going on here. There's one each scalar single-precision mul and add in the loop. Try with -funroll-loops to see it unroll.
With -march=haswell, gcc will use an FMA instruction. If you un-cripple the compiler by leaving out -fno-tree-vectorize, and #define ARRAY_SIZE is small, like 100, it fully unrolls the loop with mostly 32byte FMA ymm instructions, ending with some 16byte FMA xmm.
Also, what is the need to add an immediate value 4 to rax register.
which is done as per the statement "addq $4, %rax"
The loop increments a pointer by 4 bytes, instead of using a scaled-index addressing mode.
Look at the links on https://stackoverflow.com/questions/tagged/x86. Also, single-stepping through code with a debugger is often a good way to make sure you understand what it's doing.