What is this assembly function prologue / epilogue code doing with rbp / rsp / leave? - macos

I am just starting to learn assembly for the Mac using the GCC compiler to assemble my code. Unfortunately, there are VERY limited resources for learning how to do this if you are a beginner. I finally managed to find some simple sample code that I could start to wrap my head around, and I got it to assemble and run correctly. Here is the code:
.text # start of code indicator.
.globl _main # make the main function visible to the outside.
_main: # actually label this spot as the start of our main function.
push %rbp # save the base pointer to the stack.
mov %rsp, %rbp # put the previous stack pointer into the base pointer.
subl $8, %esp # Balance the stack onto a 16-byte boundary.
movl $0, %eax # Stuff 0 into EAX, which is where result values go.
leave # leave cleans up base and stack pointers again.
ret
The comments explain some things in the code (I kind of understand what lines 2 - 5 do), but I don't understand what most of this means. I do understand the basics of what registers are and what each register here (rbp, rsp, esp and eax) is used for and how big they are, and I also understand (generally) what the stack is, but this is still going over my head. Can anyone tell me exactly what this is doing? Also, could anyone point me in the direction of a good tutorial for beginners?

A stack is a data structure that follows the LIFO (last in, first out) principle. Whereas stacks in everyday life (outside computers, I mean) grow upward, stacks in x86 and x86-64 processors grow downward, toward lower addresses. See the Wikibooks article on the x86 stack (but please take into account that the code examples there are 32-bit x86 code in Intel syntax, and your code is 64-bit x86-64 code in AT&T syntax).
So, what your code does (my explanations here are with Intel syntax):
push %rbp
Pushes rbp to stack, practically subtracting 8 from rsp (because the size of rbp is 8 bytes) and then stores rbp to [ss:rsp].
So, in Intel syntax push rbp practically does this:
sub rsp, 8
mov [ss:rsp], rbp
Then:
mov %rsp, %rbp
This is obvious. Just store the value of rsp into rbp.
subl $8, %esp
Subtract 8 from esp and store it into esp. Actually this is a bug in your code, even if it causes no problems here. Any instruction with a 32-bit register (eax, ebx, ecx, edx, ebp, esp, esi or edi) as destination in x86-64 sets the topmost 32 bits of the corresponding 64-bit register (rax, rbx, rcx, rdx, rbp, rsp, rsi or rdi) to zero, causing the stack pointer to point somewhere below the 4 GiB limit, effectively doing this (in Intel syntax):
sub rsp,8
and rsp,0x00000000ffffffff
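This causes no problems here only because nothing reads or writes through rsp between this instruction and leave, which restores a sane value to rsp from rbp. Whether the truncated pointer would be usable depends on where the stack sits in the virtual address space, not on how much physical memory the machine has: if the stack is mapped above 4 GiB, any push, pop or call executed before leave would touch an unrelated or unmapped address and could cause a segmentation fault. In x86-64 code you almost never need esp (excluding possibly some optimizations or tweaks). To fix this bug: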
subq $8, %rsp
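To make the difference between the two subtractions concrete, here is a minimal Python sketch that models only the arithmetic. The starting stack pointer is a made-up value above 4 GiB, and the helper names sub_esp and sub_rsp are hypothetical, chosen just for this illustration.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sub_esp(rsp, imm):
    """Model `subl $imm, %esp`: 32-bit subtract, result zero-extended into rsp."""
    return ((rsp & MASK32) - imm) & MASK32


def sub_rsp(rsp, imm):
    """Model `subq $imm, %rsp`: full 64-bit subtract."""
    return (rsp - imm) & MASK64


# Hypothetical stack pointer above the 4 GiB line (an assumption, not a measured value).
rsp = 0x00007FFF5FBFF8A0

truncated = sub_esp(rsp, 8)  # upper 32 bits are lost
correct = sub_rsp(rsp, 8)    # upper 32 bits are kept

print(hex(truncated))
print(hex(correct))
```

The truncated result keeps only the low half of the pointer, which is why the 64-bit form is the one to use on the stack pointer.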
The instructions so far are the standard entry sequence (replace $8 according to the stack usage). Wikibooks has a useful article on x86 functions and stack frames (but note again that it uses 32-bit x86 assembly with Intel syntax, not 64-bit x86-64 assembly with AT&T syntax).
Then:
movl $0, %eax
This is obvious. Store 0 into eax. This has nothing to do with the stack.
leave
This is equivalent to mov rsp, rbp followed by pop rbp.
ret
And this, finally, sets rip to the value stored at [ss:rsp], effectively returning the code pointer back to where this procedure was called, and adds 8 to rsp.
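The whole sequence can be traced with a small Python model of the registers and the stack. This is only a sketch under stated assumptions: the Machine class, the starting register values and the return address are all hypothetical, and the model uses the corrected subq $8, %rsp rather than the 32-bit form.

```python
class Machine:
    """Toy model of rsp, rbp, rax, rip and a sparse 64-bit stack."""

    def __init__(self, rsp, rbp, return_address):
        self.mem = {}
        self.rsp = rsp
        self.rbp = rbp
        self.rax = None
        self.rip = None
        # The caller's `call` instruction has already pushed the return address.
        self.push(return_address)

    def push(self, value):
        self.rsp -= 8               # sub rsp, 8
        self.mem[self.rsp] = value  # mov [rsp], value

    def pop(self):
        value = self.mem[self.rsp]  # mov value, [rsp]
        self.rsp += 8               # add rsp, 8
        return value

    def leave(self):
        self.rsp = self.rbp         # mov rsp, rbp
        self.rbp = self.pop()       # pop rbp

    def ret(self):
        self.rip = self.pop()       # pop the return address into rip


def run_main(m):
    m.push(m.rbp)  # push %rbp
    m.rbp = m.rsp  # mov %rsp, %rbp
    m.rsp -= 8     # subq $8, %rsp
    m.rax = 0      # movl $0, %eax (writing eax also zeroes the top half of rax)
    m.leave()      # leave
    m.ret()        # ret


# All three starting values are made up for the illustration.
m = Machine(rsp=0x7FFF5FBFF8B0, rbp=0x7FFF5FBFF8F0, return_address=0x100000F00)
run_main(m)
print(hex(m.rsp), hex(m.rbp), hex(m.rip), m.rax)
```

After the function returns, rsp and rbp hold exactly the values the caller had before the call, which is the point of the prologue and epilogue pairing.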

Related

How can I know which registers WinAPI functions use for arguments? [duplicate]

I'm writing a function in x86 assembly that should be callable from C code, and I'm wondering which registers I have to restore before I return to the caller.
Currently I'm only restoring esp and ebp, while the return value is in eax.
Are there any other registers I should be concerned about, or could I leave whatever pleases me in them?
Using Microsoft's 32 bit ABI (cdecl or stdcall or other calling conventions), EAX, EDX and ECX are scratch registers (call clobbered). The other general-purpose integer registers are call-preserved.
The condition codes in EFLAGS are call-clobbered. DF=0 is required on call/return so you can use rep movsb without a cld first. The x87 stack must be empty on call, or on return from a function that doesn't return an FP value. (FP return values go in st0, with the x87 stack empty other than that.) XMM6 and 7 are call-preserved, the rest are call-clobbered scratch registers.
Outside of Windows, most 32-bit calling conventions (including i386 System V on Linux) agree with this choice of EAX, EDX and ECX as call-clobbered, but all the xmm registers are call-clobbered.
For x64 under Windows, you only need to restore RBX, RBP, RDI, RSI, R12, R13, R14, and R15. XMM6..15 are call-preserved. (And you have to reserve 32 bytes of shadow space for use by the callee, whether or not there are any args that don't fit in registers.)
See https://en.wikipedia.org/wiki/X86_calling_conventions#Microsoft_x64_calling_convention for more details.
Other OSes use the x86-64 System V ABI (see figure 3.4), where the call-preserved integer registers are RBP, RBX, RSP, R12, R13, R14, and R15. All the XMM/YMM/ZMM registers are call-clobbered.
EFLAGS and the x87 stack are the same as in 32-bit conventions: DF=0, condition flags are clobbered, and x87 stack is empty. (x86-64 conventions return FP values in XMM0, so the x87 stack registers always need to be empty on call/return.)
For links to official calling convention docs, see https://stackoverflow.com/tags/x86/info
32-bit: EBX, ESI, EDI, EBP
64-bit Windows: RBX, RSI, RDI, RBP, R12-R15, XMM6-XMM15
64-bit Linux, BSD, Mac: RBX, RBP, R12-R15
For details see "Software optimization resources" by Agner Fog. Calling conventions are described in this pdf.
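As a quick way to apply the three-line summary above, here is a small Python lookup. It is only a sketch: the ABI keys ("x86-32", "win64", "sysv64") and the function name must_save are hypothetical labels for this illustration, and like the summary it leaves out the stack pointer, which must always be restored.

```python
# Call-preserved registers, copied from the summary above.
CALL_PRESERVED = {
    "x86-32": {"ebx", "esi", "edi", "ebp"},
    "win64": {"rbx", "rsi", "rdi", "rbp", "r12", "r13", "r14", "r15"}
    | {f"xmm{i}" for i in range(6, 16)},
    "sysv64": {"rbx", "rbp", "r12", "r13", "r14", "r15"},
}


def must_save(abi, modified):
    """Return the registers a function has to save and restore, given what it modifies."""
    return sorted(set(modified) & CALL_PRESERVED[abi])


modified = ["rax", "rbx", "rsi", "r12"]
print(must_save("sysv64", modified))
print(must_save("win64", modified))
```

Note how rsi changes category between the two 64-bit conventions: it is a scratch register on Linux, BSD and Mac but call-preserved on Windows.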
If you are unsure which registers you need to preserve, the instructions below can save the day in 16-bit and 32-bit code (they are not valid in 64-bit mode).
PUSHA/PUSHAD -- Push all General Registers
POPA/POPAD -- Pop all General Registers
These instructions push and pop all eight general-purpose registers, including SI/ESI and DI/EDI, in a fixed order.
The order for PUSHA/PUSHAD instruction is as follows.
Opcode  Instruction  Clocks  Description
60      PUSHA        18      Push AX, CX, DX, BX, original SP, BP, SI, and DI
60      PUSHAD       18      Push EAX, ECX, EDX, EBX, original ESP, EBP, ESI, and EDI
And the order for the POPA/POPAD instruction is as follows (in reverse order).
Opcode  Instruction  Clocks  Description
61      POPA         24      Pop DI, SI, BP, SP, BX, DX, CX, and AX
61      POPAD        24      Pop EDI, ESI, EBP, ESP(***), EBX, EDX, ECX, and EAX
*** The ESP value is discarded instead of loaded into ESP.
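The push and pop order in the tables can be checked with a short Python model. This is a sketch of the documented ordering only, using a dictionary as a stand-in for memory; the function names pushad and popad and the register values are made up for the example.

```python
PUSHAD_ORDER = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def pushad(regs, stack):
    """Push all eight registers; the ESP slot holds the value ESP had before PUSHAD."""
    original_esp = regs["esp"]
    for name in PUSHAD_ORDER:
        value = original_esp if name == "esp" else regs[name]
        regs["esp"] -= 4
        stack[regs["esp"]] = value


def popad(regs, stack):
    """Pop in reverse order; the saved ESP slot is skipped rather than loaded."""
    for name in reversed(PUSHAD_ORDER):
        value = stack[regs["esp"]]
        regs["esp"] += 4
        if name != "esp":
            regs[name] = value


regs = {"eax": 1, "ecx": 2, "edx": 3, "ebx": 4,
        "esp": 0x1000, "ebp": 5, "esi": 6, "edi": 7}
saved = dict(regs)
stack = {}

pushad(regs, stack)
esp_after_push = regs["esp"]
for name in regs:
    if name != "esp":
        regs[name] = 0  # clobber everything except the stack pointer
popad(regs, stack)
print(regs == saved)
```

The pair moves the stack pointer by 32 bytes in each direction, so ESP ends up where it started even though its saved copy is never loaded.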

Understanding the pop instruction in assembly

I am a student studying computer systems for the first time (w/ Computer Systems: A Programmer's Perspective). We are working on assembly and I am starting to understand command suffixes for x86_64 such as using leaq in something like:
leaq (%rsp, %rdx), %rax
However, I am failing to understand using a suffix for pop. For example, using the same logic, it would make sense to me that we'd use popl for something like:
popl %edi
But, in the text and other examples online, I just see:
pop %edi
What is the difference? is popl even valid? Just looking for a little more insight. Anything helps, thank you.
What you can do in asm is limited by what the hardware can do. Implicit vs. explicit operand-size in the source (suffix or not) doesn't change the machine code it will assemble to.
So the right question to ask is whether the hardware can do a 32-bit push or pop in 64-bit mode. No, it can't, therefore no asm source syntax exists that will get it to do exactly what you were trying to do with one instruction.
See also: "Does each PUSH instruction push a multiple of 8 bytes on x64?" and "x86 Assembly pushl/popl don't work with 'Error: suffix or operands invalid'".
That's why your assembler won't accept pop %edi or popl %edi. Those are exactly equivalent because the 32-bit register implies DWORD (l) operand-size. The examples you saw of popl or pop %edi are for 32-bit mode, where EDI is the full register instead of the low half of RDI, and that instruction is encodable.
You only need a size suffix when the operand size is ambiguous, e.g. mov $1, (%rdi). Your assembler will give an error for that instead of guessing one of b/w/l/q.
But push is a bit special: push $1 will default to a 64-bit push, even though pushw $1 is possible. See "How many bytes does the push instruction push onto the stack when I don't specify the operand size?"
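The rule about which push/pop operand sizes exist in each mode can be written down as a tiny Python table. This is only an illustrative sketch: the table covers just the three names of one register, and pop_is_encodable is a hypothetical helper, not anything an assembler provides.

```python
# Operand sizes (in bits) that push/pop can encode in each CPU mode.
PUSH_POP_SIZES = {
    32: {16, 32},  # 32-bit mode: 16-bit and 32-bit
    64: {16, 64},  # 64-bit mode: 16-bit and 64-bit, but no 32-bit form
}

# Width of each register name used in the examples.
REG_BITS = {"di": 16, "edi": 32, "rdi": 64}


def pop_is_encodable(mode_bits, reg):
    """True if `pop reg` has a machine encoding in the given mode."""
    return REG_BITS[reg] in PUSH_POP_SIZES[mode_bits]


for mode in (32, 64):
    for reg in ("di", "edi", "rdi"):
        print(mode, reg, pop_is_encodable(mode, reg))
```

Since the suffix only restates the operand size that the register already implies, pop %edi and popl %edi succeed or fail together.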

Handling Register Integer Value Wrap Around In x86 Assembly GNU on MacOSX

I am working with a simple example in x86 GNU GAS on MacOSX whereby an integer value of 300 is moved into the eax register. As expected, only 300 mod 256 (the value 44) is actually stored in %eax, as echo $? reveals from the Mac terminal:
.globl _main
_main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl $300, %eax;
leave
ret
However, I was under the impression that there is an overflow/wrap-around flag to denote that a wraparound occurred or a register storing the result of the integer division of 300 and 256, the result being 1. I have been unable to find any information detailing this process (if it exists) for x86 GNU. Does anyone know how the wraparound value or an overflow flag can be accessed?
There are a couple of misconceptions in your question.
First, eax can hold values from 0 to 4294967295, so mov $300, %eax does in fact store 300 into eax.
Second, a mov instruction cannot overflow or wrap around; the size of the source and the size of the destination are the same. The overflow flag is used for arithmetic operations.
The reason echo $? prints 44 is that the operating system reports the low byte of the exit status of the process to the shell.
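The arithmetic behind both points is short enough to check in Python. This is just a sketch of the numbers involved; shell_visible_status is a hypothetical name for "keep the low byte".

```python
EAX_MAX = 0xFFFFFFFF  # largest unsigned value a 32-bit register can hold


def shell_visible_status(exit_status):
    """Keep only the low byte, as the shell sees it in $?."""
    return exit_status & 0xFF


value_in_eax = 300  # fits easily: 300 <= 4294967295
print(value_in_eax <= EAX_MAX)
print(shell_visible_status(value_in_eax))
```

So eax really does contain 300 when main returns; the truncation to 44 happens later, when the exit status is reported to the shell.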

Mach-O 64-bit format does not support 32-bit absolute addresses. NASM Accessing Array

Running this code off my Mac computer, using command:
nasm -f macho64 -o max.a maximum.asm
This is the code I am attempting to run on my computer that finds the largest number inside an array.
section .data
data_items:
dd 3,67,34,222,45,75,54,34,44,33,22,11,66,0
section .text
global _start
_start:
mov edi, 0
mov eax, [data_items + edi*4]
mov ebx, eax
start_loop:
cmp eax, 0
je loop_exit
inc edi
mov eax, [data_items + edi*4]
cmp eax, ebx
jle start_loop
mov ebx, eax
jmp start_loop
loop_exit:
mov eax, 1
int 0x80
Error:
maximum.asm:14: error: Mach-O 64-bit format does not support 32-bit absolute addresses
maximum.asm:21: error: Mach-O 64-bit format does not support 32-bit absolute addresses
First of all, beware of NASM bugs with the macho64 output format with 64-bit absolute addressing (NASM 2.13.02+) and with RIP-relative in NASM 2.11.08. 64-bit absolute addressing is not recommended, so this answer should work even for buggy NASM 2.13.02 and higher. (The bugs don't cause this error, they lead to wrong addresses being used at runtime.)
[data_items + edi*4] is a 32-bit addressing mode. Even [data_items + rdi*4] can only use a 32-bit absolute displacement, so it wouldn't work either. Note that using an address as a 32-bit (sign-extended) immediate like cmp rdi, data_items is also a problem: only mov allows a 64-bit immediate.
64-bit code on OS X can't use 32-bit absolute addressing at all. Executables are loaded at a base address above 4GiB, so label addresses just plain don't fit in 32-bit integers, with zero- or sign-extension. RIP-relative addressing is the best / most efficient solution, whether you need it to be position-independent or not (see footnote 1).
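The "doesn't fit" claim is plain range checking, sketched here in Python. Both label addresses are made-up examples (one just above the 4 GiB line, one in the low part of the address space as in a position-dependent Linux executable), and the helper names are hypothetical.

```python
def fits_sign_extended_32(addr):
    """Can addr be encoded as a disp32 or imm32 that the CPU sign-extends to 64 bits?"""
    return -2**31 <= addr < 2**31


def fits_zero_extended_32(addr):
    """Can addr be produced by a 32-bit operation that zero-extends to 64 bits?"""
    return 0 <= addr < 2**32


macos_label = 0x100001000  # hypothetical label above 4 GiB
linux_label = 0x404000     # hypothetical label in a non-PIE Linux executable

for name, addr in (("macos", macos_label), ("linux", linux_label)):
    print(name, fits_sign_extended_32(addr), fits_zero_extended_32(addr))
```

An address above 4 GiB fails both checks, so neither [disp32] addressing nor a 32-bit immediate can represent it.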
In NASM, default rel at the top of your file will make all [] memory operands prefer RIP-relative addressing. See also Section 3.3 Effective Addresses in the NASM manual.
default rel ; near the top of file; affects all instructions
my_func:
...
mov ecx, [data_items] ; uses the default: RIP-relative
;mov ecx, [abs data_items] ; override to absolute [disp32], unusable
mov ecx, [rel data_items] ; explicitly RIP-relative
But RIP-relative is only possible when there are no other registers involved, so for indexing a static array you need to get the address in a register first. Use a RIP-relative lea rsi, [rel data_items].
lea rsi, [data_items] ; can be outside the loop
...
mov eax, [rsi + rdi*4]
Or you could add rsi, 4 inside the loop and use a simpler addressing mode like mov eax, [rsi].
Note that mov rsi, data_items will work for getting an address into a register, but you don't want that because it's less efficient.
Technically, any address within +-2GiB of your array will work, so if you have multiple arrays you can address the others relative to one common base address, only tying up one register with a pointer. e.g. lea rbx, [arr1] / ... / mov eax, [rbx + rdi*4 + arr2-arr1]. Relative Addressing errors - Mac 10.10 mentions that Agner Fog's "optimizing assembly" guide has some examples of array addressing, including one using the __mh_execute_header as a reference point. (The code in that question looks like another attempt to port this 32-bit Linux example from the PGU book to 64-bit OS X, at the same time as learning asm in the first place.)
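The address arithmetic for RIP-relative LEA and for the common-base trick can be sketched in Python. All addresses here are hypothetical, as are the names rel32 and effective_address; the sketch only shows that the displacement is relative to the end of the instruction and that arr2-arr1 is an ordinary small constant.

```python
def rel32(target, next_instruction):
    """Displacement stored in a RIP-relative instruction, checked against +-2 GiB."""
    disp = target - next_instruction
    if not -2**31 <= disp < 2**31:
        raise ValueError("target is out of rel32 range")
    return disp


def effective_address(base, index, scale, disp):
    """base + index*scale + disp, as in [rbx + rdi*4 + arr2-arr1]."""
    return base + index * scale + disp


# Hypothetical layout: code and two arrays, all just above 4 GiB.
next_instruction = 0x100000F07  # address of the instruction after the lea
arr1 = 0x100002000
arr2 = 0x100002040

disp = rel32(arr1, next_instruction)  # what `lea rbx, [rel arr1]` would encode
rbx = next_instruction + disp         # what the CPU computes at run time
element = effective_address(rbx, 3, 4, arr2 - arr1)  # address of arr2[3] with rdi = 3

print(hex(disp), hex(rbx), hex(element))
```

Even though both arrays live above 4 GiB, every constant that ends up in the machine code is small, which is why this works where [arr2 + rdi*4] does not.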
Note that on Linux, position-dependent executables are loaded in the low 32 bits of virtual address space, so you will see code like mov eax, [array + rdi*4] or mov edi, symbol_name in Linux examples or compiler output on http://gcc.godbolt.org/. gcc -pie -fPIE will make position-independent executables on Linux, and is the default on many recent distros, but not Godbolt.
This doesn't help you on MacOS, but I mention it in case anyone's confused about code they've seen for other OSes, or why AMD64 architects bothered to allow [disp32] addressing modes at all on x86-64.
And BTW, prefer using 64-bit addressing modes in 64-bit code. e.g. use [rsi + rdi*4], not [esi + edi*4]. You usually don't want to truncate pointers to 32-bit, and it costs an extra address-size prefix to encode.
Similarly, you should be using syscall to make 64-bit system calls, not int 0x80. See "What are the calling conventions for UNIX & Linux system calls on i386 and x86-64" for the differences in which registers to pass args in.
Footnote 1:
64-bit absolute addressing is supported on OS X, but only in position-dependent executables (non-PIE). This related question x64 nasm: pushing memory addresses onto the stack & call function includes an ld warning from using gcc main.o to link:
ld: warning: PIE disabled. Absolute addressing (perhaps -mdynamic-no-pic) not
allowed in code signed PIE, but used in _main from main.o. To fix this warning,
don't compile with -mdynamic-no-pic or link with -Wl,-no_pie
So the linker checks if any 64-bit absolute relocations are used, and if so disables creation of a Position-Independent Executable. A PIE can benefit from ASLR for security. I think shared-library code always has to be position-independent on OS X; I don't know if jump tables or other cases of pointers-as-data are allowed (i.e. fixed up by the dynamic linker), or if they need to be initialized at runtime if you aren't making a position-dependent executable.
mov r64, imm64 is larger (10 bytes) and not faster than lea r64, [RIP_rel32] (7 bytes).
So you could use mov rsi, qword data_items instead of a RIP-relative LEA, but the LEA runs about as fast and takes less space in code caches and the uop cache. 64-bit immediates also have a uop-cache fetch penalty on Sandybridge-family (http://agner.org/optimize/): they take 2 cycles to read from a uop cache line instead of 1.
x86 also has a form of mov that loads/store from/to a 64-bit absolute address, but only for AL/AX/EAX/RAX. See http://felixcloutier.com/x86/MOV.html. You don't want this either, because it's larger and not faster than mov eax, [rel foo].
(Related: an AT&T syntax version of the same question)

Understanding OSX 16-Byte alignment

So it seems like everyone knows that OSX syscalls are always 16 byte stack aligned. Great, that makes sense when you have code like this:
section .data
msg db 'something', 10, 0
section .text
global start
start:
push 10 ; size of the message (4 bytes)
push msg ; the address of the message (4 bytes)
push 1 ; we want to write to STD_OUT (4 bytes)
mov eax, 4 ; write(...) syscall
sub esp, 4 ; move stack pointer down to 4 bytes for a total of 16.
int 0x80 ; invoke
add esp, 16 ; clean
Perfect, the stack is aligned to 16 bytes, makes perfect sense. How about though we call syscall(1) (exit). Logically that would look something like this:
push 69 ; return value
mov eax, 1 ; exit(...) syscall
sub esp, 12 ; push down stack for total of 16 bytes.
int 0x80 ; invoke
This doesn't work though, but this does:
push 69 ; return value
mov eax, 1 ; exit(...) syscall
sub esp, 4 ; push down stack for total of 8 bytes.
int 0x80 ; invoke
That works fine, but that's only 8 bytes???? Osx is cool, but this ABI is driving me nuts. Can someone shed some light on what I'm not understanding?
Short version: you probably don't need to align to 16 bytes, you just need to always leave a 4-byte gap before your argument list.
Long version:
Here's what I think is happening: I'm not sure that it's true that the stack should be 16-byte aligned. However, logic dictates that if it is and if padding or adjusting the stack is necessary to achieve that alignment, it must happen before the arguments for the syscall are pushed, not after. There can't be an arbitrary number of bytes between the stack pointer at the time of the int 0x80 instruction and where the arguments actually are. The kernel wouldn't know where to find the actual arguments. Subtracting from the stack pointer after pushing the arguments to achieve "alignment" doesn't align the arguments, it aligns the stack pointer by inserting an arbitrary number of bytes between the stack pointer and the arguments. Whatever else may be true, that can't be right.
Then why do the first and third snippets work at all? Don't they also insert arbitrary bytes there? They work by accident. It's because they both happen to insert 4 bytes. That adjustment isn't "successful" because it achieves stack alignment, it's part of the syscall ABI. Apparently, the syscall ABI expects and requires that there be a 4-byte slot before the argument list.
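The argument can be made concrete with a toy Python model of the 32-bit stack. It assumes exactly what this answer proposes, namely that the kernel skips one 4-byte slot at esp and reads the arguments from esp+4 upward; that behaviour is the hypothesis being illustrated, not something the model proves. The addresses and helper names are made up.

```python
def push(stack, esp, value):
    """Model a 32-bit push: move esp down by 4 and store the value there."""
    esp -= 4
    stack[esp] = value
    return esp


def kernel_args(stack, esp, count):
    """Assumed syscall ABI: skip one 4-byte slot, then read `count` arguments."""
    return [stack.get(esp + 4 + 4 * i) for i in range(count)]


def exit_call(gap):
    """push 69 / sub esp, gap / int 0x80, returning what the kernel would see."""
    stack, esp = {}, 0x1000
    esp = push(stack, esp, 69)
    esp -= gap
    return kernel_args(stack, esp, 1)


def write_call(gap):
    """push 10 / push msg / push 1 / sub esp, gap / int 0x80."""
    msg = 0x2000  # hypothetical address of the message
    stack, esp = {}, 0x1000
    for value in (10, msg, 1):
        esp = push(stack, esp, value)
    esp -= gap
    return kernel_args(stack, esp, 3)


print(exit_call(4))   # the snippet that works
print(exit_call(12))  # the "aligned" snippet that fails
print(write_call(4))
```

In this model the 12-byte adjustment leaves the kernel reading an uninitialized slot (shown as None), while the 4-byte adjustment lines the arguments up regardless of alignment.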
The source for the syscall() function can be found here. It looks like this:
LEAF(___syscall, 0)
popl %ecx // ret addr
popl %eax // syscall number
pushl %ecx
UNIX_SYSCALL_TRAP
movl (%esp),%edx // add one element to stack so
pushl %ecx // caller "pop" will work
jnb 2f
BRANCH_EXTERN(cerror)
2:
END(___syscall)
To call this library function, the caller will have set up the stack pointer to point to the arguments to the syscall() function, which starts with the syscall number and then has the real arguments for the actual syscall. However, the caller will then have used a call instruction to call it, which pushed the return address onto the stack.
So, the above code pops the return address, pops the syscall number into %eax, pushes the return address back onto the stack (where the syscall number originally was), and then does int 0x80. So, the stack pointer points to the return address and then the arguments. There's the extra 4 bytes: the return address. I suspect the kernel ignores the return address. I guess its presence in the syscall ABI may just be to make the ABI for system calls similar to that of function calls.
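The stack shuffle performed by the first three instructions of that stub can be traced in a few lines of Python. This is a sketch only: the stack is modelled as a list whose first element is the top of the stack, and the return address and message address are hypothetical values.

```python
def syscall_stub(stack):
    """Trace popl %ecx / popl %eax / pushl %ecx on a list (index 0 = top of stack)."""
    stack = list(stack)
    ecx = stack.pop(0)    # popl %ecx  -> return address
    eax = stack.pop(0)    # popl %eax  -> syscall number
    stack.insert(0, ecx)  # pushl %ecx -> return address now sits where the number was
    return eax, stack


RET = 0x1F00  # hypothetical return address pushed by the caller's `call`
MSG = 0x2000  # hypothetical address of the message

# Stack on entry to syscall(4, 1, MSG, 10): return address, then the arguments.
on_entry = [RET, 4, 1, MSG, 10]
number, at_trap = syscall_stub(on_entry)

print(number)
print(at_trap)
```

At the trap, the slot at the top of the stack holds the return address and the real arguments start one slot above it, matching the 4-byte gap described earlier.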
What does this mean for the alignment requirement of syscalls? Well, this function is guaranteed to change the alignment of the stack from how it was set up by its caller. The caller presumably set up the stack with 16-byte alignment and this function moves it by 4 bytes before the interrupt. It may just be a myth that the stack needs to be 16-byte aligned for syscalls. On the other hand, the 16-byte alignment requirement is definitely real for calling system library functions. The Wine project, for which I develop, was burned by it. It is mostly necessary for 128-bit SSE argument data types, but Apple made their lazy symbol resolver deliberately blow up if the alignment is wrong even for functions which don't use such arguments so that problems would be found early. Syscalls would not be subject to that early-failure mechanism. It may be that the kernel doesn't require the 16-byte alignment. I'm not sure if any syscalls take 128-bit arguments.

Resources