This question already has answers here:
How to load address of function or label into register
(1 answer)
32-bit absolute addresses no longer allowed in x86-64 Linux?
(1 answer)
I'm completing the final assignment for a compilers course and right now the deal is to translate some intermediate representation into x86_64 assembly source code and then build an executable through gcc by running
gcc output.s -o output
This executable should work properly. The issue is that I just can't get my code past GCC when it comes to (at least) one particular instruction. This is it:
mov L0, %rbx
Where L0 is a label.
The whole test file is as follows:
.text
.section .rodata
.text
.globl main
main:
// rbss for now
add $0, %rsp
mov %rsp, %rsi
// register spill area
add $0, %rsp
mov %rsp, %rdi
// store rax => rsp
mov %rax, %rcx
mov %rcx, ( %rsp )
// subI rsp, 4 => rsp
mov %rsp, %rcx
sub $4, %rcx
mov %rcx, %rsp
// lea L0 => rbx
mov L0, %rbx
// store rbp => rsp
mov %rbp, %rcx
mov %rcx, ( %rsp )
// subI rsp, 4 => rsp
mov %rsp, %rcx
sub $4, %rcx
mov %rcx, %rsp
// store rbx => rsp
mov %rbx, %rcx
mov %rcx, ( %rsp )
// subI rsp, 4 => rsp
mov %rsp, %rcx
sub $4, %rcx
mov %rcx, %rsp
// jumpI => Lmain
jmp Lmain
// L0 : halt
L0:
hlt
// Lmain : nop
Lmain:
// addI rsp, 0 => rbp
mov %rsp, %rcx
add $0, %rcx
mov %rcx, %rbp
// subI rsp, 0 => rsp
mov %rsp, %rcx
sub $0, %rcx
mov %rcx, %rsp
// addI rbp, 8 => rsp
mov %rbp, %rcx
add $8, %rcx
mov %rcx, %rsp
// loadAI rbp, 4 => rbp
mov %rdi, %rbx
add %rbp, %rbx
add $4, %rbx
mov ( %rbx ), %rcx
mov %rcx, %rbp
// jump => rbp
mov %rbp, %rbx
jmp *%rbx
Is there anything inherently wrong with using mov this way? I'm not using call/ret semantics because the translation must be carried out directly from ILOC, a toy intermediate code used for teaching.
When I try to run the aforementioned command I get some variation of:
/usr/bin/ld: /tmp/cccAoFmz.o: relocation R_X86_64_32S against `.text' can not be used when making a PIE object; recompile with -fPIE
collect2: error: ld returned 1 exit status
Could you guys help me to get a grasp of what's actually going on? I'm quite new to x86 programming and that's my first time with this kind of application. The whole assignment is done, my only issue is getting it to a working state (So no, huahuahua, I'm not getting you guys to do my homework :D).
Is there another way to get what I'm trying to achieve? Is my approach incorrect? I'm out of ideas right now.
Thank you so much :)
Best,
I found an example of assembly code that finds the maximum number in an array named data_items, but the example was written for x86. I tried to adapt it for x64 because 32-bit absolute addressing is not supported on my 64-bit system.
In short, there are three steps:
lea data_items(%rip), %rdi #(1) Obtaining data_items address
add $4, %rdi #(2) Incrementing the pointer to 4 to read a next item
movl (%rdi), %eax #(3) Reading data at %rdi to %eax
The main questions:
Is this a correct way to form the pointer? Can it produce an error after the code is relocated?
If the %rip register constantly grows, why does lea data_items(%rip), %rdi load the correct memory address? Maybe an offset from %rip has a special meaning other than "data_items + %rip"?
Full adapted code here:
.section __DATA,__data
data_items:
.long 3,67,34,222,45,75,54,34,44,33,22,11,66,0
.section __TEXT,__text
.globl _main
_main:
lea data_items(%rip), %rdi #(1)
movl (%rdi), %eax
movl %eax, %ebx
start_loop:
cmpl $0, %eax
je loop_exit
add $4, %rdi #(2)
movl (%rdi), %eax #(3)
cmpl %ebx, %eax
jle start_loop
movl %eax, %ebx
jmp start_loop
loop_exit:
mov $0x2000001, %rax
mov $0, %rdi
syscall
I am working with a team in a Computer Architecture class on a Y86 program to implement multiplication function imul. We have a block of code that works, but we are trying to make it as execution-time efficient as we can. Currently our block looks like this for imul:
imul:
# push all used registers to stack for preservation
pushq %rdi
pushq %rsi
pushq %r8
pushq %r9
pushq %r10
irmovq 0, %r9 # set 0 into r9
rrmovq %rdi, %r10 # preserve rdi in r10
subq %rsi, %rdi # compare rdi and rsi
rrmovq %r10, %rdi # restore rdi
jl continue # if rdi (looping value/count) less than rsi, don't swap
swap:
# swap rsi and rdi to make rdi smaller value of the two
rrmovq %rsi, %rdi
rrmovq %r10, %rsi
continue:
subq %r9, %rdi # check if rdi is zero
cmove %r9, %rax # if rdi = 0, rax = 0
je imulDone # if rdi = 0, jump to end
irmovq 1, %r8 # set 1 into r8
rrmovq %rsi, %rax # set rax equal to initial value from rsi
imulLoop:
subq %r8, %rdi # count - 1
je imulDone # if count = 0, jump to end
addq %rsi, %rax # add another instance of rsi into rax, looped addition
jmp imulLoop # restart loop
imulDone:
# pop all used registers from stack to original values and return
popq %r10
popq %r9
popq %r8
popq %rsi
popq %rdi
ret
Right now our best idea is to use immediate arithmetic instructions (isubq, etc.) instead of loading constants into registers and then using those registers with the normal OPq instructions. Would this be meaningfully more efficient in this particular instance? Thanks so much!
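To reason about where the time goes, it can help to restate the routine in C. The sketch below is only a model of the Y86 code above, assuming both operands are non-negative; the name imul_model is made up for illustration.

```c
#include <stdint.h>

/* Model of the Y86 imul routine: multiplication by repeated addition,
 * looping over the smaller of the two operands.
 * Assumption: a and b are both non-negative. */
int64_t imul_model(int64_t a, int64_t b)
{
    int64_t count = a;   /* plays the role of %rdi */
    int64_t addend = b;  /* plays the role of %rsi */

    /* Same test as "subq %rsi, %rdi; jl continue": swap unless a < b. */
    if (count >= addend) {
        count = b;
        addend = a;
    }

    if (count == 0) {
        return 0;
    }

    int64_t acc = addend;      /* rrmovq %rsi, %rax */
    while (--count != 0) {     /* subq %r8, %rdi; je imulDone */
        acc += addend;         /* addq %rsi, %rax */
    }
    return acc;
}
```

In this model the constant loads correspond to code that runs once, outside the loop, while the loop body runs min(a, b) - 1 times, so for large operands the loop is where most of the cycles are likely to be spent.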
I have been trying for some time now to read a number from the keyboard and compare it with a value on the stack. If it matches, the program should print "Hello World!"; if not, it should print "Nope!". However, no matter what the input is, the jne branch is taken, "Nope!" is printed, and the program segfaults. Perhaps one of you could lend a hand.
.section __DATA,__data
str:
.asciz "Hello world!\n"
sto:
.asciz "Nope!\n"
.section __TEXT,__text
.globl _main
_main:
push %rbp
mov %rsp,%rbp
sub $0x20, %rsp
movl $0x0, -0x4(%rbp)
movl $0x2, -0x8(%rbp)
movl $0x2000003, %eax
mov $0, %edi
subq $0x4, %rsi
movq %rsi, %rcx
syscall
cmp -0x8(%rbp), %edx
je L1
jne L2
xor %rbx, %rbx
xor %rax, %rax
movl $0x2000001, %eax
syscall
L1:
xor %rax, %rax
movl $0x2000004, %eax
movl $1, %edi
movq str@GOTPCREL(%rip), %rsi
movq $14, %rdx
syscall
ret
L2:
xor %eax, %eax
movl $0x2000004, %eax
movl $1, %edi
movq sto@GOTPCREL(%rip), %rsi
movq $6, %rdx
syscall
ret
I would start with this OS/X Syscall tutorial (The 64-bit part in your case). It is written for NASM syntax but the important information is the text and links for the SYSCALL calling convention. The SYSCALL table is found on this Apple webpage. Additional information on the standard calling convention for 64-bit OS/X can be found in the System V 64-bit ABI.
Of importance for SYSCALL convention:
arguments are passed in order via these registers rdi, rsi, rdx, r10, r8 and r9
syscall number in the rax register
the call is done via the syscall instruction
what OS X contributes to the mix is that you have to add 0x2000000 to the syscall number (still have to figure out why)
You have many issues with your sys_read system call. The SYSCALL table says this:
3 AUE_NULL ALL { user_ssize_t read(int fd, user_addr_t cbuf, user_size_t nbyte); }
So given the calling convention, int fd is in RDI, user_addr_t cbuf (pointer to character buffer to hold return data) is in RSI, and user_size_t nbyte (maximum bytes buffer can contain) is in RDX.
Your program segfaulted on the ret because you didn't have a proper function epilogue to match the function prologue at the top:
push %rbp #
mov %rsp,%rbp # Function prologue
You need to do the reverse at the bottom, set the result code in RAX and then do the ret. Something like:
mov %rbp,%rsp # \ Function epilogue
pop %rbp # /
xor %eax, %eax # Return value = 0
ret # Return to C runtime which will exit
# gracefully and return to OS
I did some other minor cleanup but tried to keep the structure of the code similar. You will have to learn more assembly to better understand the code that sets up RSI with the address for the sys_read SYSCALL. You should try to find a good tutorial or book on x86-64 assembly language programming in general. Writing a primer on that subject is beyond the scope of this answer.
Code that might be closer to what you were looking for that takes the above into account:
.section __DATA,__data
str:
.asciz "Hello world!\n"
sto:
.asciz "Nope!\n"
.section __TEXT,__text
.globl _main
_main:
push %rbp #
mov %rsp,%rbp # Function prologue
sub $0x20, %rsp # Allocate 32 bytes of space on stack
# for temp local variables
movl $0x2, -4(%rbp) # Number for comparison
# 16-bytes from -20(%rbp) to -5(%rbp)
# for char input buffer
movl $0x2000003, %eax
mov $0, %edi # 0 for STDIN
lea -20(%rbp), %rsi # Address of temporary buffer on stack
mov $16, %edx # Read 16 character maximum
syscall
movb (%rsi), %r10b # RSI = pointer to buffer on stack
# get first byte
subb $48, %r10b # Convert first character to number 0-9
cmpb -4(%rbp), %r10b # Did we find magic number (2)?
jne L2 # If No exit with error message
L1: # If the magic number matched print
# Hello World
xor %rax, %rax
movl $0x2000004, %eax
movl $1, %edi
movq str@GOTPCREL(%rip), %rsi
movq $14, %rdx
syscall
jmp L0 # Jump to exit code
L2: # Print "Nope"
xor %eax, %eax
movl $0x2000004, %eax
movl $1, %edi
movq sto@GOTPCREL(%rip), %rsi
movq $6, %rdx
syscall
L0: # Code to exit main
mov %rbp,%rsp # \ Function epilogue
pop %rbp # /
xor %eax, %eax # Return value = 0
ret # Return to C runtime which will exit
# gracefully and return to OS
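For comparison, here is roughly what the corrected assembly does, sketched in C. The helper names are hypothetical; read() is the POSIX wrapper around the same system call, and its three arguments correspond to RDI, RSI, and RDX in the assembly.

```c
#include <unistd.h>

/* Hypothetical helper: returns 1 when the first byte read is the ASCII
 * digit for `magic`, mirroring "subb $48, %r10b" followed by cmpb. */
int first_digit_matches(const char *buf, ssize_t nread, int magic)
{
    if (nread < 1) {
        return 0;               /* nothing was read */
    }
    return (buf[0] - '0') == magic;
}

/* Hypothetical helper: reads up to 16 bytes from fd into a stack buffer,
 * then checks the first character against `magic`. */
int read_and_check(int fd, int magic)
{
    char buf[16];
    /* fd -> RDI, buf -> RSI, sizeof buf -> RDX */
    ssize_t nread = read(fd, buf, sizeof buf);
    return first_digit_matches(buf, nread, magic);
}
```

Calling read_and_check(0, 2) would play the part of the comparison in _main: a result of 1 corresponds to the L1 branch and 0 to the L2 branch.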
This code implements memcpy() on x86 platforms, but I need a memcpy() for the x64 platform.
_asm {
mov esi, src
mov edi, dest
mov ecx, nbytes
shr ecx, 6 // 64 bytes per iteration
loop1:
movq mm1, 0[ESI] // Read in source data
movq mm2, 8[ESI]
movq mm3, 16[ESI]
movq mm4, 24[ESI]
movq mm5, 32[ESI]
movq mm6, 40[ESI]
movq mm7, 48[ESI]
movq mm0, 56[ESI]
movq 0[EDI], mm1 // Write to destination
movq 8[EDI], mm2
movq 16[EDI], mm3
movq 24[EDI], mm4
movq 32[EDI], mm5
movq 40[EDI], mm6
movq 48[EDI], mm7
movq 56[EDI], mm0
add esi, 64
add edi, 64
dec ecx
jnz loop1
emms
}
I have no knowledge of x64 assembly language.
How can I convert this code from x86 to x64?
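I suppose replacing esi and edi with rsi and rdi should do the trick, although that will not make it any faster (or fast). Note that the _asm { } block is MSVC inline assembly, which the MSVC x64 compiler does not support, so the routine would have to live in a separate assembly file.
Other than pointers, x64 is backwards compatible with x86.
In general, it is better to write a C loop or use the default memcpy. The compiler will generate much better code.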
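A minimal sketch of the C-loop suggestion follows. The function name copy_blocks is hypothetical; it mirrors the 64-bytes-per-iteration structure of the inline assembly and also copies the remaining tail bytes, which the assembly version does not handle.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical replacement for the MMX loop: copy in 64-byte blocks,
 * then copy whatever is left over. The regions must not overlap. */
void copy_blocks(void *dest, const void *src, size_t nbytes)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t blocks = nbytes / 64;    /* same as "shr ecx, 6" */

    while (blocks--) {
        memcpy(d, s, 64);           /* fixed size lets the compiler inline this */
        d += 64;
        s += 64;
    }
    memcpy(d, s, nbytes % 64);      /* tail bytes */
}
```

In practice a plain memcpy(dest, src, nbytes) is likely to be at least as fast, since library implementations are already tuned for the target CPU.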
This might be a simple question.
I want to know which of the 2 statements will take less time to get executed.
if ( a - b > 0 ) or if ( a > b )
In the 1st case, the difference has to be computed and then it has to be compared with 0, while in the 2nd case, a and b are compared directly.
Thanks.
As Keith Thompson points out, the two are not the same. If a and b are unsigned, for example, a-b is always non-negative, making the statement equivalent to if(a != b).
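The unsigned case can be checked with a short sketch. The function names are made up for illustration; the point is that unsigned subtraction wraps around instead of going negative.

```c
#include <stdbool.h>

/* With unsigned operands, a - b wraps modulo 2^N, so the result is
 * greater than zero whenever a != b. */
bool diff_positive_unsigned(unsigned a, unsigned b)
{
    return a - b > 0;
}

/* The direct comparison behaves as expected. */
bool greater_unsigned(unsigned a, unsigned b)
{
    return a > b;
}
```

For a = 1 and b = 2, the first function returns true while the second returns false, so the two conditions are not interchangeable.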
Anyway, I did an unrealistic test:
#include <stdio.h>

int main() {
    volatile int a, b;
    if(a-b>0)
        printf("a");
    if(a>b)
        printf("b");
    return 0;
}
Compile it with -O3. Here's the disassembly:
pushq %rbp
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
subq $16, %rsp
movl -4(%rbp), %eax
subl -8(%rbp), %eax
testl %eax, %eax
jle LBB0_2
## BB#1:
movl $97, %edi
callq _putchar
LBB0_2:
movl -4(%rbp), %eax
cmpl -8(%rbp), %eax
jle LBB0_4
## BB#3:
movl $98, %edi
callq _putchar
LBB0_4:
xorl %eax, %eax
addq $16, %rsp
popq %rbp
ret
At -O3, a-b>0 is still using one extra instruction.
Even if you compile it with an ARM compiler, there's an extra instruction:
push {lr}
sub sp, sp, #12
ldr r2, [sp, #0]
ldr r3, [sp, #4]
subs r3, r2, r3
cmp r3, #0
ble .L2
movs r0, #97
bl putchar(PLT)
.L2:
ldr r2, [sp, #0]
ldr r3, [sp, #4]
cmp r2, r3
ble .L3
movs r0, #98
bl putchar(PLT)
.L3:
movs r0, #0
add sp, sp, #12
pop {pc}
Note that (1) volatile is unrealistic unless you are dealing with e.g. hardware registers or thread-shared memory, and (2) that the difference in practice is not even measurable.
Because the two have different semantics in some cases, write what is correct. Worry about optimization later!
If you really want to know, suppose we have this C file, test.c:
int main()
{
int a = 1000, b = 2000;
if (a > b) {
int c = 2;
}
if (a - b > 0) {
int c = 3;
}
}
Compiling with gcc -S -O0 test.c gives test.s:
.file "test.c"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $1000, -16(%rbp)
movl $2000, -12(%rbp)
movl -16(%rbp), %eax
cmpl -12(%rbp), %eax
jle .L2
movl $2, -8(%rbp)
.L2:
movl -12(%rbp), %eax
movl -16(%rbp), %edx
subl %eax, %edx
movl %edx, %eax
testl %eax, %eax
jle .L4
movl $3, -4(%rbp)
.L4:
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
.section .note.GNU-stack,"",@progbits
Compare these two fragments of the listing above:
movl $1000, -16(%rbp)
movl $2000, -12(%rbp)
movl -16(%rbp), %eax
cmpl -12(%rbp), %eax
jle .L2
movl $2, -8(%rbp)
and
movl -12(%rbp), %eax
movl -16(%rbp), %edx
subl %eax, %edx
movl %edx, %eax
testl %eax, %eax
jle .L4
movl $3, -4(%rbp)
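a - b > 0 needs an extra step: the difference is first computed with subl and then tested against zero with testl, whereas a > b needs only a single cmpl.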
Note
The above was compiled on Ubuntu 14.04 with gcc 4.8, with optimization turned off (-O0).
As pointed out by @Blastfurnace, with -O2 or -O3 the assembly code is no longer readable. You need to profile to get an idea, but I believe both will be optimized to the same code.
Who cares
If doing the subtract and testing the sign had the same result as the comparison and ran faster, the processor designers would have mapped the integer compare instruction to a subtract and test.
I think you can assume the compare is at least as fast as subtract and test, as well as having the really major advantage of being clearer.
For readability, I always go with a > b.