The latest version of gcc is producing assembly that doesn't make sense to me. I compiled the code using no optimization; but, some parts of this code don't make sense, even with no optimization.
Here is the C source:
#include <stdio.h>
int main()
{
int a = 1324;
int b = 5657;
int difference = 9876;
int printf_answer = 2221;
difference = a - b;
printf_answer = printf("%d + %d = %d\n", a, b, difference);
return difference;
}
It produces this assembly:
.file "exampleIML-1b.c"
.section .rodata
.LC0:
.string "%d + %d = %d\n"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %rbx
subq $24, %rsp
movl $1324, -32(%rbp)
movl $5657, -28(%rbp)
movl $9876, -24(%rbp)
movl $2221, -20(%rbp)
movl -28(%rbp), %eax
movl -32(%rbp), %edx
movl %edx, %ecx
subl %eax, %ecx
movl %ecx, %eax
movl %eax, -24(%rbp)
movl $.LC0, %eax
movl -24(%rbp), %ecx
movl -28(%rbp), %edx
movl -32(%rbp), %ebx
.cfi_offset 3, -24
movl %ebx, %esi
movq %rax, %rdi
movl $0, %eax
call printf
movl %eax, -20(%rbp)
movl -24(%rbp), %eax
addq $24, %rsp
popq %rbx
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (GNU) 4.4.6 20120305 (Red Hat 4.4.6-4)"
.section .note.GNU-stack,"",#progbits
Several things don't make sense:
(1) Why are we pushing %rbx? What is in %rbx that needs to be saved?
(2) Why are we moving %edx to %ecx before subtracting? What doesn't it just do sub %eax, %edx?
(3) Similarly, why the move from %ecx back to %eax before storing the value?
(4) The compiler is putting the variable a in memory location -32(%rbp). Unless I'm adding wrong, isn't -32(%rbp) equal to the stack pointer? Shouldn't all local variables be stored at values less than the current stack pointer?
I'm using this version of gcc:
[eos17:~/Courses/CS451/IntelMachineLanguage]$ gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC)
GCC dictates how the stack is used. Contract between caller and callee on x86:
* after call instruction:
o %eip points at first instruction of function
o %esp+4 points at first argument
o %esp points at return address
* after ret instruction:
o %eip contains return address
o %esp points at arguments pushed by caller
o called function may have trashed arguments
o %eax contains return value (or trash if function is void)
o %ecx, %edx may be trashed
o %ebp, %ebx, %esi, %edi must contain contents from time of call
* Terminology:
o %eax, %ecx, %edx are "caller save" registers
o %ebp, %ebx, %esi, %edi are "callee save" registers
The main function is like any other function in this context. gcc decided to use ebx for intermediate calculations, so it preserves its value.
By default gcc compiles with optimization disabled, which is the case here, apparently.
You need to enable it with one of the optimization switches (e.g. -O2 or -O3).
Then you will not see redundant and seemingly meaningless things.
As for rbx, it has to be preserved because that's what the calling conventions require. Your function modifies it (movl -32(%rbp), %ebx), so it has to be saved and restored explicitly.
Related
GCC does not seem to support the different memory ordering settings as it generates the same code for relaxed, acquire and sequentially consistent.
I tried the following code with GCC 7.4 and 9.1:
#include <thread>
#include <atomic>
using namespace std;
atomic<int> z(0);
void Thr1()
{
z.store(1,memory_order_relaxed);
}
void Thr2()
{
z.store(2,memory_order_release);
}
void Thr3()
{
z.store(3);
}
//------------------------------------------
int main (int argc, char **argv)
{
thread t1(Thr1);
thread t2(Thr2);
thread t3(Thr3);
t1.join();
t2.join();
t3.join();
return 0;
}
When I generate assembly for the above, I get the following for each of the three functions:
_Z4Thr1v:
.LFB2992:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $1, -12(%rbp)
movl $0, -8(%rbp)
movl -8(%rbp), %eax
movl $65535, %esi
movl %eax, %edi
call _ZStanSt12memory_orderSt23__memory_order_modifier
movl %eax, -4(%rbp)
movl -12(%rbp), %edx
leaq z(%rip), %rax
movl %edx, (%rax)
mfence
nop
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE2992:
.size _Z4Thr1v, .-_Z4Thr1v
.globl _Z4Thr2v
.type _Z4Thr2v, #function
_Z4Thr2v:
.LFB2993:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $2, -12(%rbp)
movl $3, -8(%rbp)
movl -8(%rbp), %eax
movl $65535, %esi
movl %eax, %edi
call _ZStanSt12memory_orderSt23__memory_order_modifier
movl %eax, -4(%rbp)
movl -12(%rbp), %edx
leaq z(%rip), %rax
movl %edx, (%rax)
mfence
nop
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE2993:
.size _Z4Thr2v, .-_Z4Thr2v
.globl _Z4Thr3v
.type _Z4Thr3v, #function
_Z4Thr3v:
.LFB2994:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $3, -12(%rbp)
movl $5, -8(%rbp)
movl -8(%rbp), %eax
movl $65535, %esi
movl %eax, %edi
call _ZStanSt12memory_orderSt23__memory_order_modifier
movl %eax, -4(%rbp)
movl -12(%rbp), %edx
leaq z(%rip), %rax
movl %edx, (%rax)
mfence
nop
leave
.cfi_def_cfa 7, 8
ret
where all of the code ends in a memory fence instruction.
If you are interested in performance, then looking at non-optimised machine code is not going to get you anywhere. Here's what gcc -O2 generates:
Thr1():
mov DWORD PTR z[rip], 1
ret
Thr2():
mov DWORD PTR z[rip], 2
ret
Thr3():
mov DWORD PTR z[rip], 3
mfence
ret
As you can see, only sequentially consistent memory order requires mfence.
I have been trying for some time now to get a number from a keyboard and comparing it with a value on the stack. If it is correct it will print "Hello World!" and if incorrect, it should print out "Nope!". However, what happens now is no matter the input "jne" is called, nope is printed, and segfault. Perhaps one of you could lend a hand.
.section __DATA,__data
str:
.asciz "Hello world!\n"
sto:
.asciz "Nope!\n"
.section __TEXT,__text
.globl _main
_main:
push %rbp
mov %rsp,%rbp
sub $0x20, %rsp
movl $0x0, -0x4(%rbp)
movl $0x2, -0x8(%rbp)
movl $0x2000003, %eax
mov $0, %edi
subq $0x4, %rsi
movq %rsi, %rcx
syscall
cmp -0x8(%rbp), %edx
je L1
jne L2
xor %rbx, %rbx
xor %rax, %rax
movl $0x2000001, %eax
syscall
L1:
xor %rax, %rax
movl $0x2000004, %eax
movl $1, %edi
movq str#GOTPCREL(%rip), %rsi
movq $14, %rdx
syscall
ret
L2:
xor %eax, %eax
movl $0x2000004, %eax
movl $1, %edi
movq sto#GOTPCREL(%rip), %rsi
movq $6, %rdx
syscall
ret
I would start with this OS/X Syscall tutorial (The 64-bit part in your case). It is written for NASM syntax but the important information is the text and links for the SYSCALL calling convention. The SYSCALL table is found on this Apple webpage. Additional information on the standard calling convention for 64-bit OS/X can be found in the System V 64-bit ABI.
Of importance for SYSCALL convention:
arguments are passed in order via these registers rdi, rsi, rdx, r10, r8 and r9
syscall number in the rax register
the call is done via the syscall instruction
what OS X contributes to the mix is that you have to add 0x20000000 to the syscall number (still have to figure out why)
You have many issues with with your sys_read system call. The SYSCALL table says this:
3 AUE_NULL ALL { user_ssize_t read(int fd, user_addr_t cbuf, user_size_t nbyte); }
So given the calling convention, int fd is in RDI, user_addr_t cbuf (pointer to character buffer to hold return data) is in RSI, and user_size_t nbyte (maximum bytes buffer can contain) is in RDX.
Your program seg faulted on the ret because you didn't have proper function epilogue to match the function prologue at the top:
push %rbp #
mov %rsp,%rbp # Function prologue
You need to do the reverse at the bottom, set the result code in RAX and then do the ret. Something like:
mov %rbp,%rsp # \ Function epilogue
pop %rbp # /
xor %eax, %eax # Return value = 0
ret # Return to C runtime which will exit
# gracefully and return to OS
I did other minor cleanup, but tried to keep the structure of the code similar. You will have to learn more assembly to better understand the code that sets up RSI with the address for sys_read SYSCALL . You should try to find a good tutorial/book on x86-64 assembly language programming in general. Writing a primer on that subject is beyond the scope of this answer.
Code that might be closer to what you were looking for that takes the above into account:
.section __DATA,__data
str:
.asciz "Hello world!\n"
sto:
.asciz "Nope!\n"
.section __TEXT,__text
.globl _main
_main:
push %rbp #
mov %rsp,%rbp # Function prologue
sub $0x20, %rsp # Allocate 32 bytes of space on stack
# for temp local variables
movl $0x2, -4(%rbp) # Number for comparison
# 16-bytes from -20(%rbp) to -5(%rbp)
# for char input buffer
movl $0x2000003, %eax
mov $0, %edi # 0 for STDIN
lea -20(%rbp), %rsi # Address of temporary buffer on stack
mov $16, %edx # Read 16 character maximum
syscall
movb (%rsi), %r10b # RSI = pointer to buffer on stack
# get first byte
subb $48, %r10b # Convert first character to number 0-9
cmpb -4(%rbp), %r10b # Did we find magic number (2)?
jne L2 # If No exit with error message
L1: # If the magic number matched print
# Hello World
xor %rax, %rax
movl $0x2000004, %eax
movl $1, %edi
movq str#GOTPCREL(%rip), %rsi
movq $14, %rdx
syscall
jmp L0 # Jump to exit code
L2: # Print "Nope"
xor %eax, %eax
movl $0x2000004, %eax
movl $1, %edi
movq sto#GOTPCREL(%rip), %rsi
movq $6, %rdx
syscall
L0: # Code to exit main
mov %rbp,%rsp # \ Function epilogue
pop %rbp # /
xor %eax, %eax # Return value = 0
ret # Return to C runtime which will exit
# gracefully and return to OS
I'm checking the assembly code of the following simple program on https://godbolt.org/ with parameters -Wall -m32 using gcc 6.3 and to my surprise the generated assembly code takes three multiplications to perform one multiplication in C as shown below.
#include<stdio.h>
#include<inttypes.h>
int main(void){
int32_t ax=-854763;
int32_t bx=586478;
int64_t cx= (int64_t)ax*bx;
printf("%lld\n", cx);
}
The generated assembly code for the statement int64_t cx= (int64_t)ax*bx; is given below. There are two imull and one mull instructions.
movl -28(%ebp), %eax
movl %eax, %ecx
movl %eax, %ebx
sarl $31, %ebx
movl -32(%ebp), %eax
cltd
movl %ebx, %edi
imull %eax, %edi
movl %edx, %esi
imull %ecx, %esi
addl %edi, %esi
mull %ecx
leal (%esi,%edx), %ecx
movl %ecx, %edx
movl %eax, -40(%ebp)
movl %edx, -36(%ebp)
movl %eax, -40(%ebp)
movl %edx, -36(%ebp)`
Is it possible to enforce the gcc compiler/assembler to do it with just one multiplication without using inline assembly? Because in the original code I have to perform many such multiplications within a statement and in different statements.
Why the compiler does not perform it with one imul instruction? I have also checked the different versions of the gcc on https://godbolt.org/ but the result is same three multiplication instruction.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
This might be a simple question.
I want to know which of the 2 statements will take less time to get executed.
if ( a - b > 0 ) or if ( a > b )
In the 1st case, the difference has to be computed and then it has to be compared with 0, while in the 2nd case, a and b are compared directly.
Thanks.
As Keith Thompson points out, the two are not the same. If a and b are unsigned, for example, a-b is always non-negative, making the statement equivalent to if(a != b).
Anyway, I did an unrealistic test:
int main() {
volatile int a, b;
if(a-b>=0)
printf("a");
if(a>b)
printf("b");
return 0;
}
Compile it with -O3. Here's the disassembly:
pushq %rbp
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
subq $16, %rsp
movl -4(%rbp), %eax
subl -8(%rbp), %eax
testl %eax, %eax
jle LBB0_2
## BB#1:
movl $97, %edi
callq _putchar
LBB0_2:
movl -4(%rbp), %eax
cmpl -8(%rbp), %eax
jle LBB0_4
## BB#3:
movl $98, %edi
callq _putchar
LBB0_4:
xorl %eax, %eax
addq $16, %rsp
popq %rbp
ret
At -O3, a-b>0 is still using one extra instruction.
Even if you compile it with an ARM compiler, there's an extra instruction:
push {lr}
sub sp, sp, #12
ldr r2, [sp, #0]
ldr r3, [sp, #4]
subs r3, r2, r3
cmp r3, #0
ble .L2
movs r0, #97
bl putchar(PLT)
.L2:
ldr r2, [sp, #0]
ldr r3, [sp, #4]
cmp r2, r3
ble .L3
movs r0, #98
bl putchar(PLT)
.L3:
movs r0, #0
add sp, sp, #12
pop {pc}
Note that (1) volatile is unrealistic unless you are dealing with e.g. hardware registers or thread-shared memory, and (2) that the difference in practice is not even measurable.
Because the two have different semantics in some cases, write what is correct. Worry about optimization later!
If you really like to know, in C, suppose we have test.c:
int main()
{
int a = 1000, b = 2000;
if (a > b) {
int c = 2;
}
if (a - b > 0) {
int c = 3;
}
}
Compile with gcc -S -O0 test.c, we got test.s:
.file "test.c"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $1000, -16(%rbp)
movl $2000, -12(%rbp)
movl -16(%rbp), %eax
cmpl -12(%rbp), %eax
jle .L2
movl $2, -8(%rbp)
.L2:
movl -12(%rbp), %eax
movl -16(%rbp), %edx
subl %eax, %edx
movl %edx, %eax
testl %eax, %eax
jle .L4
movl $3, -4(%rbp)
.L4:
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
.section .note.GNU-stack,"",#progbits
Please see the above
movl $1000, -16(%rbp)
movl $2000, -12(%rbp)
movl -16(%rbp), %eax
cmpl -12(%rbp), %eax
jle .L2
movl $2, -8(%rbp)
and
movl -12(%rbp), %eax
movl -16(%rbp), %edx
subl %eax, %edx
movl %edx, %eax
testl %eax, %eax
jle .L4
movl $3, -4(%rbp)
a - b > 0 needs one more step.
Note
The above are compiled on Ubuntu 14.04 with gcc 4.8, with optimization turned off (-O0)
As pointed out by #Blastfurnace, With -O2 or -O3, the assembly code are no longer readable. You need to profile to get a idea. But I believe they will be optimized to the same code.
Who cares
If doing the subtract and testing the sign had the same result as the comparison and ran faster, the processor designers would have mapped the integer compare instruction to a subtract and test.
I think you can assume the compare is at least as fast as subtract and test, as well as having the really major advantage of being clearer.
for readability, I always go with a > b
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
x86 assembly registers — Why do they work the way they do?
I've compiled the following program:
#include <stdio.h>
int square(int x) {
return x * x;
}
int main() {
int y = square(9);
printf("%d\n", y);
return 0;
}
two times with different options on OSX with GCC 4.2.1:
gcc foo.c -o foo_32.s -S -fverbose-asm -m32 -O1
gcc foo.c -o foo_64.s -S -fverbose-asm -m64 -O1
The result for 32 bit:
_square: ## #square
## BB#0: ## %entry
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
imull %eax, %eax
popl %ebp
ret
And for 64-bit:
_square: ## #square
Leh_func_begin1:
## BB#0: ## %entry
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
movl %edi, %eax
imull %eax, %eax
popq %rbp
ret
As is evident, the 32-bit version retrieves the parameter from the stack, which is what one would expect with cdecl. The 64-bit version however uses the EDI register to pass the parameter.
Doesn't this violate the System V AMD64 ABI, which specifies that the RDI, RSI, RDX, RCX, R8, R9, XMM0–7 registers should be used? Or is this only the case for true 64-bit values like long?
EDI is simply the lower half of RDI so the compiler is passing the argument in RDI, but the argument is only 32-bits long, so it only takes up half of the register.