Understanding OSX 16-Byte alignment - macos

So it seems like everyone knows that OSX syscalls are always 16 byte stack aligned. Great, that makes sense when you have code like this:
section .data
msg db 'something', 10, 0
section .text
global start
start:
push 10 ; size of the message (4 bytes)
push msg ; the address of the message (4 bytes)
push 1 ; we want to write to STD_OUT (4 bytes)
mov eax, 4 ; write(...) syscall
sub esp, 4 ; move stack pointer down to 4 bytes for a total of 16.
int 0x80 ; invoke
add esp, 16 ; clean
Perfect, the stack is aligned to 16 bytes, which makes sense. But what about calling exit (syscall number 1)? Logically, that would look something like this:
push 69 ; return value
mov eax, 1 ; exit(...) syscall
sub esp, 12 ; push down stack for total of 16 bytes.
int 0x80 ; invoke
This doesn't work though, but this does:
push 69 ; return value
mov eax, 1 ; exit(...) syscall
sub esp, 4 ; push down stack for total of 8 bytes.
int 0x80 ; invoke
That works fine, but that's only 8 bytes? OS X is cool, but this ABI is driving me nuts. Can someone shed some light on what I'm not understanding?

Short version: you probably don't need to align to 16 bytes, you just need to always leave a 4-byte gap before your argument list.
Long version:
Here's what I think is happening: I'm not sure that it's true that the stack should be 16-byte aligned. However, logic dictates that if it is and if padding or adjusting the stack is necessary to achieve that alignment, it must happen before the arguments for the syscall are pushed, not after. There can't be an arbitrary number of bytes between the stack pointer at the time of the int 0x80 instruction and where the arguments actually are. The kernel wouldn't know where to find the actual arguments. Subtracting from the stack pointer after pushing the arguments to achieve "alignment" doesn't align the arguments, it aligns the stack pointer by inserting an arbitrary number of bytes between the stack pointer and the arguments. Whatever else may be true, that can't be right.
Then why do the first and third snippets work at all? Don't they also insert arbitrary bytes there? They work by accident. It's because they both happen to insert 4 bytes. That adjustment isn't "successful" because it achieves stack alignment, it's part of the syscall ABI. Apparently, the syscall ABI expects and requires that there be a 4-byte slot before the argument list.
The source for the syscall() function can be found here. It looks like this:
LEAF(___syscall, 0)
popl %ecx // ret addr
popl %eax // syscall number
pushl %ecx
UNIX_SYSCALL_TRAP
movl (%esp),%edx // add one element to stack so
pushl %ecx // caller "pop" will work
jnb 2f
BRANCH_EXTERN(cerror)
2:
END(___syscall)
To call this library function, the caller will have set up the stack pointer to point to the arguments to the syscall() function, which starts with the syscall number and then has the real arguments for the actual syscall. However, the caller will then have used a call instruction to call it, which pushed the return address onto the stack.
So, the above code pops the return address, pops the syscall number into %eax, pushes the return address back onto the stack (where the syscall number originally was), and then does int 0x80. So, the stack pointer points to the return address and then the arguments. There's the extra 4 bytes: the return address. I suspect the kernel ignores the return address. I guess its presence in the syscall ABI may just be to make the ABI for system calls similar to that of function calls.
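The layout described above can be sketched as a toy model in C. This is only an illustration under stated assumptions, not kernel code: the names syscall_arg and exit_status_seen are hypothetical, the stack is a small array of 4-byte slots, and the "kernel" simply skips one slot at the stack pointer before reading arguments.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model (an assumption for illustration, not kernel code): at the moment
   of `int 0x80`, the slot at esp is the ignored 4-byte gap (the return
   address in the libc wrapper) and the arguments follow it. */
uint32_t syscall_arg(const uint32_t *esp, unsigned index)
{
    return esp[1 + index];            /* argument i lives at [esp + 4 + 4*i] */
}

/* Builds the stack for exit(69), subtracts `pad_bytes` from the stack pointer
   after the push, and reports which value the model reads as argument 0. */
uint32_t exit_status_seen(unsigned pad_bytes)
{
    uint32_t stack[8] = {0};          /* stack[7] is the highest address */
    uint32_t *esp = &stack[6];        /* slots 6 and 7 stand for caller data */

    assert(pad_bytes % 4 == 0 && pad_bytes <= 20);

    *--esp = 69;                      /* push 69 */
    esp -= pad_bytes / 4;             /* sub esp, pad_bytes */
    return syscall_arg(esp, 0);
}
```

In this model, exit_status_seen(4) finds 69, while exit_status_seen(12) reads a slot that was never written. That matches the observation that the 8-byte version works and the "aligned" 16-byte version does not.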
What does this mean for the alignment requirement of syscalls? Well, this function is guaranteed to change the alignment of the stack from how it was set up by its caller. The caller presumably set up the stack with 16-byte alignment and this function moves it by 4 bytes before the interrupt. It may just be a myth that the stack needs to be 16-byte aligned for syscalls. On the other hand, the 16-byte alignment requirement is definitely real for calling system library functions. The Wine project, for which I develop, was burned by it. It is mostly necessary for 128-bit SSE argument data types, but Apple made their lazy symbol resolver deliberately blow up if the alignment is wrong even for functions which don't use such arguments so that problems would be found early. Syscalls would not be subject to that early-failure mechanism. It may be that the kernel doesn't require the 16-byte alignment. I'm not sure if any syscalls take 128-bit arguments.

Related

Dynamic Heap Allocation Confusion

The following code is provided in Kip Irvine's assembly book.
I want to know what this code is doing. I understand that
GetProcessHeap returns a 32-bit integer handle to the program's existing
heap area in EAX. If the function succeeds, it returns
a handle to the heap in EAX. If it fails, the return value in EAX
is NULL.
HeapAlloc allocates a block of memory from a heap. If it succeeds,
the return value in EAX contains the address of the memory block. If it
fails, the returned value in EAX is NULL.
How is character allocation being used in CALLOC?
How is integer allocation being used in IALLOC?
How is long integer allocation being used in LALLOC?
In MALLOC, how is size in bytes being allocated from the heap? Thanks
INCLUDE Irvine32.inc
HANDLE TEXTEQU <DWORD>
GetProcessHeap PROTO
HeapAlloc PROTO,
hHeap : HANDLE,
dwflags: DWORD,
dwbytes: DWORD
HeapFree PROTO,
hHeap : HANDLE,
dwflags: DWORD,
lpmem : DWORD
.data
hHeap HANDLE ?
.code
CALLOC MACRO size
mov eax, sizeof BYTE
imul eax, size
push eax
call MALLOC
ENDM
IALLOC MACRO size
mov eax, sizeof WORD
imul eax, size
push eax
call MALLOC
ENDM
LALLOC MACRO size
mov eax, sizeof DWORD
imul eax, size
push eax
call MALLOC
ENDM
MALLOC PROC
push ebp
mov ebp, esp
invoke GetProcessHeap
invoke HeapAlloc, eax, 8, [ebp + 8]
pop ebp
ret 4
MALLOC ENDP
MEMFREE PROC
push ebp
mov ebp, esp
invoke GetProcessHeap
invoke HeapFree, eax, 0, [ebp + 8]
pop ebp
ret 4
MEMFREE ENDP
Although your description of these functions is essentially correct, it is worth pointing out that GetProcessHeap, HeapAlloc, and HeapFree are actually Win32 API functions, meaning that they are provided as part of the operating system for applications to call. Irvine's library has just provided prototypes for these functions to make them easier to call. As such, the semantics of these functions can be obtained directly from the horse's mouth by reading Microsoft's MSDN documentation.
As the documentation explains, it is a common pattern for an application that needs to allocate a moderately sized block of memory to just obtain that memory from the process's default heap. This avoids the need to create, and the overhead of managing, a separate private heap just to make an allocation. The HeapAlloc and HeapFree (and similarly named) functions are the ones you should be using in modern Windows programming, and not the obsolete GlobalAlloc or LocalAlloc functions that you sometimes still see used or referenced in Windows programming materials that haven't been updated for this century.
Now, in the code you have, since everything ultimately goes back to MALLOC, let's start there. It sets up a stack frame, calls GetProcessHeap, calls HeapAlloc, and then tears down the stack frame. You should immediately see a bug. Remember how in your description of the GetProcessHeap and HeapAlloc functions, you were careful to describe what happens if they fail? Well, you were right; these functions can fail, and correctly written code should be checking for failures and handling them. This code does not.
In MALLOC, how is size in bytes being allocated from the heap?
The implementation of MALLOC is pretty simple: all it really does is obtain a handle to the process heap and then use HeapAlloc to allocate memory from that heap. So if you want to know how it works, go back to the documentation for HeapAlloc. From this, we see that the first parameter is the handle to the heap (returned from GetProcessHeap in eax), the second parameter is a bitwise combination of flags that control the allocation (in this case, 8, or HEAP_ZERO_MEMORY), and the third parameter is the number of bytes to allocate (in this case, [ebp + 8]).
[ebp + 8] reads the first (and presumably only) parameter that was passed to the MALLOC function on the stack. ([ebp + 4] is the return address, that is, the location in the calling function to which MALLOC will ret, and [ebp + 0] is the original value of ebp, saved upon entry to the MALLOC function.)
In my opinion, this is another (minor) bug with the MALLOC function: it is insufficiently documented! How are we supposed to know that it takes a parameter, much less what the size/type and meaning of that parameter are, without diving into its implementation? A function's basic purpose and interface should be documented right there in the code, using a comment.
So, the answer to your question is, MALLOC allocates as many bytes as you ask it to allocate.
How is character allocation being used in CALLOC?
This is a simple macro wrapped around the MALLOC function. Its purpose is basically to determine how many bytes to allocate, based on the number of characters you want to allocate space for, and then pass that value as the parameter to MALLOC.
The CALLOC macro takes a single parameter, size, which is the number of characters you want to allocate space for. (By the way, I think that size is a poor choice of name for this parameter, as it isn't very descriptive.)
It then multiplies the caller-specified size by the actual number of bytes required by a character, which is determined at assembly time by the expression sizeof BYTE. This will give the number of bytes that actually need to be allocated. (Now, since sizeof BYTE is just 1, this is pretty silly and inefficient code! I guess it was written to be "portable", but come on—it's assembly language!)
Finally, it pushes the result (the number of bytes to allocate) onto the stack and calls MALLOC to do the allocation.
And now that you understand how CALLOC works, you should understand how all of these *ALLOC macros work, since they're all the same. IALLOC multiplies its parameter by the size of a short integer (sizeof WORD) to arrive at the actual number of bytes that need to be allocated, while LALLOC multiplies its parameter by the size of a long integer (sizeof DWORD).
(Note that, although the multiplications in these other macros are necessary, they are also inefficient. sizeof WORD == 2, so you could just do a left-shift by 1. Or, better yet, an addition of the value to itself. sizeof DWORD == 4, so that's a left-shift by 2. Adds and shifts are much faster to execute than multiplications.)
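As a rough analogy, here is a minimal C sketch of what the macros and MALLOC do together, including the failure check that the assembly version omits. It is an assumption-laden stand-in, not Irvine's code: the function names are hypothetical, and the standard calloc() is swapped in for HeapAlloc with HEAP_ZERO_MEMORY because both return zero-filled memory or NULL on failure.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical portable stand-in for MALLOC: calloc() hands back zero-filled
   memory (much like HeapAlloc with HEAP_ZERO_MEMORY) or NULL on failure. */
void *alloc_zeroed(size_t count, size_t elem_size)
{
    return calloc(count, elem_size);
}

/* CALLOC analog: element size is sizeof BYTE (1 byte). */
void *calloc_chars(size_t count)
{
    return alloc_zeroed(count, sizeof(uint8_t));
}

/* IALLOC analog: element size is sizeof WORD (2 bytes). */
void *ialloc_words(size_t count)
{
    return alloc_zeroed(count, sizeof(uint16_t));
}

/* LALLOC analog: element size is sizeof DWORD (4 bytes). */
void *lalloc_dwords(size_t count)
{
    return alloc_zeroed(count, sizeof(uint32_t));
}
```

A caller would test the returned pointer against NULL before using it and release it with free(), which plays the role of MEMFREE here.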

Allocating memory using malloc() in 32-bit and 64-bit assembly language

I have to implement a 64-bit stack. To get comfortable with malloc, I managed to write two 32-bit integers into memory and read them back:
But when I try to do this with 64 bits:
The first snippet of code works perfectly fine. As Jester suggested, you are writing a 64-bit value in two separate (32-bit) halves. This is the way you have to do it on a 32-bit architecture. You don't have 64-bit registers available, and you can't write 64-bit chunks of memory at once. But you already seemed to know that, so I won't belabor it.
In the second snippet of code, you tried to target a 64-bit architecture (x86-64). Now, you no longer have to write 64-bit values in two 32-bit halves, since 64-bit architectures natively support 64-bit integers. You have 64-bit wide registers available, and you can write a 64-bit chunk to memory directly. Take advantage of that to simplify (and speed up) the code.
The 64-bit registers are Rxx instead of Exx. When you use QWORD PTR, you will want to use Rxx; when you use DWORD PTR, you will want to use Exx. Both are legal in 64-bit code, but only 32-bit DWORDs are legal in 32-bit code.
A couple of other things to note:
1. Although it is perfectly valid to clear a register using MOV xxx, 0, it is smaller and faster to use XOR eax, eax, so this is generally what you should write. It is a very old trick, something that any assembly-language programmer should know, and if you ever try to read other people's assembly programs, you'll need to be familiar with this idiom. (But actually, in the code you're writing, you don't need to do this at all. For the reason why, see point #2.)
2. In 64-bit mode, any instruction that writes a 32-bit register implicitly zeros the upper 32 bits of the corresponding 64-bit register, so you can simply write XOR eax, eax instead of XOR rax, rax. This is, again, smaller and faster.
3. The calling convention for 64-bit programs is different from the one used in 32-bit programs. The exact specification of the calling convention is going to vary, depending on which operating system you're using. As Peter Cordes commented, there is information on this in the x86 tag wiki. Both Windows and Linux x64 calling conventions pass at least the first 4 integer parameters in registers (rather than on the stack like the x86-32 calling convention), but which registers are actually used is different. Also, the 64-bit calling conventions have different requirements than do the 32-bit calling conventions for how you must set up the stack before calling functions.
(Since your screenshot says something about "MASM", I'll assume that you're using Windows in the sample code below.)
; Set up the stack, as required by the Windows x64 calling convention.
; (Note that we use the 64-bit form of the instruction, with the RSP register,
; to support stack pointers larger than 32 bits.)
sub rsp, 40
; Dynamically allocate 8 bytes of memory by calling malloc().
; (Note that the x64 calling convention passes the parameter in a register, rather
; than via the stack. On Windows, the first parameter is passed in RCX.)
; (Also note that we use the 32-bit form of the instruction here, storing the
; value into ECX, which is safe because it implicitly zeros the upper 32 bits.)
mov ecx, 8
call malloc
; Write a single 64-bit value into memory.
; (The pointer to the memory block allocated by malloc() is returned in RAX.)
mov qword ptr [rax], 1
; ... do whatever
; Clean up the stack space that we allocated at the top of the function.
add rsp, 40
If you wanted to do this in 32-bit halves, even on a 64-bit architecture, you certainly could. That would look like the following:
sub rsp, 40 ; set up stack
mov ecx, 8 ; request 8 bytes
call malloc ; allocate memory
mov dword ptr [rax], 1 ; write "1" into low 32 bits
mov dword ptr [rax+4], 2 ; write "2" into high 32 bits
; ... do whatever
add rsp, 40 ; clean up stack
Note that, apart from addressing through RAX (the full 64-bit pointer returned by malloc) rather than EAX, these last two MOV instructions are identical to what you wrote in the 32-bit version of the code. That makes sense, because you're doing exactly the same thing.
The reason the code you originally wrote didn't work is because EAX doesn't contain a QWORD PTR, it contains a DWORD PTR. Hence, the assembler generated the "invalid instruction operands" error, because there was a mismatch. This is the same reason that you don't offset by 8, because a DWORD PTR is only 4 bytes. A QWORD PTR is indeed 8 bytes, but you don't have one of those in EAX.
Or, if you wanted to write 16 bytes:
sub rsp, 40 ; set up stack
mov ecx, 16 ; request 16 bytes
call malloc ; allocate memory
mov qword ptr [rax], 1 ; write "1" into low 64 bits
mov qword ptr [rax+8], 2 ; write "2" into high 64 bits
; ... do whatever
add rsp, 40 ; clean up stack
Compare these three snippets of code, and make sure you understand the differences and why they need to be written as they are!
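To see why two 32-bit stores at offsets 0 and 4 produce the same bytes as one 64-bit store, here is a small C sketch. The function name is hypothetical, and the result shown assumes a little-endian host such as x86-64.

```c
#include <stdint.h>
#include <string.h>

/* Stores `low` at offset 0 and `high` at offset 4 of an 8-byte buffer, like
   the two DWORD stores, then reads the whole buffer back as one 64-bit value.
   Assumes a little-endian host (true for x86 and x86-64). */
uint64_t store_halves_and_load(uint32_t low, uint32_t high)
{
    unsigned char buf[8];
    uint64_t value;

    memcpy(buf, &low, sizeof low);          /* mov dword ptr [rax], low    */
    memcpy(buf + 4, &high, sizeof high);    /* mov dword ptr [rax+4], high */
    memcpy(&value, buf, sizeof value);      /* mov rdx, qword ptr [rax]    */
    return value;
}
```

On such a host, store_halves_and_load(1, 2) yields 0x0000000200000001, so the value at the lower address ends up in the low 32 bits.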

How does `sub rsp, 16` align the stack on Mac OSX?

I'm currently learning x64 asm on Mac OSX using nasm.
I've come across the problem of aligning the stack, a necessary step before calling some library functions such as malloc, which is done with these instructions:
push rbp
mov rbp, rsp
sub rsp, 16
Can anyone explain to me how the function prolog aligns the stack? I mean, if it's not already on a multiple of 16, why would sub rsp, 16 correct it?
Let's say that rsp = 0x35; after sub rsp, 16, rsp = 0x25, right? So rsp wasn't aligned on a multiple of 16 before the sub, and the sub didn't align it either, so I think I haven't quite understood what "aligning the stack" means.
Can someone tell me what I should understand when I read "the stack needs to be aligned on a 16-byte boundary"?
It doesn't align it; as you say, it just preserves the existing alignment. The sub rsp, 16 is only needed if you want to allocate space for 1-16 bytes of local variables. Make sure the number there is the space needed rounded up to a multiple of 16, assuming the stack is already aligned at that point. Note that the return address and the saved frame pointer together also add up to 16 bytes; if you don't use a frame pointer, you need to account for that too.
In general, the calling convention mandates that it be aligned in a particular way upon entry to all functions, so you just have to maintain that. The only place that is normally not the case is possibly at process or thread startup but that is usually taken care of by the system libraries.
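The arithmetic behind "keep it a multiple of 16" can be sketched in C. The helper names are hypothetical, and the sketch assumes the stack pointer is already 16-byte aligned right after push rbp.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounds the bytes needed for locals up to a multiple of 16, i.e. the N to
   use in `sub rsp, N` when rsp is already 16-byte aligned at that point. */
size_t frame_size(size_t locals_bytes)
{
    return (locals_bytes + 15) & ~(size_t)15;
}

/* Returns nonzero when `address` is a multiple of 16. */
int is_aligned16(uint64_t address)
{
    return (address & 15) == 0;
}
```

With these helpers, frame_size(24) is 32, and subtracting 32 from an aligned value stays aligned. Subtracting 16 from the question's 0x35 gives 0x25, which is still misaligned: the subtraction preserves whatever alignment was there, it does not create it.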

What is this assembly function prologue / epilogue code doing with rbp / rsp / leave?

I am just starting to learn assembly for the Mac, using the GCC compiler to assemble my code. Unfortunately, there are VERY limited resources for learning how to do this if you are a beginner. I finally managed to find some simple sample code that I could start to wrap my head around, and I got it to assemble and run correctly. Here is the code:
.text # start of code indicator.
.globl _main # make the main function visible to the outside.
_main: # actually label this spot as the start of our main function.
push %rbp # save the base pointer to the stack.
mov %rsp, %rbp # put the previous stack pointer into the base pointer.
subl $8, %esp # Balance the stack onto a 16-byte boundary.
movl $0, %eax # Stuff 0 into EAX, which is where result values go.
leave # leave cleans up base and stack pointers again.
ret
The comments explain some things in the code (I kind of understand what lines 2 - 5 do), but I don't understand what most of this means. I do understand the basics of what registers are and what each register here (rbp, rsp, esp and eax) is used for and how big they are, and I also understand (generally) what the stack is, but this is still going over my head. Can anyone tell me exactly what this is doing? Also, could anyone point me in the direction of a good tutorial for beginners?
A stack is a data structure that follows the LIFO (last in, first out) principle. Whereas stacks in everyday life (outside computers, I mean) grow upward, stacks in x86 and x86-64 processors grow downward. See the Wikibooks article on the x86 stack (but please take into account that the code examples there are 32-bit x86 code in Intel syntax, and your code is 64-bit x86-64 code in AT&T syntax).
So, what your code does (my explanations here are with Intel syntax):
push %rbp
Pushes rbp to stack, practically subtracting 8 from rsp (because the size of rbp is 8 bytes) and then stores rbp to [ss:rsp].
So, in Intel syntax push rbp practically does this:
sub rsp, 8
mov [ss:rsp], rbp
Then:
mov %rsp, %rbp
This is obvious. Just store the value of rsp into rbp.
subl $8, %esp
Subtract 8 from esp and store it into esp. Actually this is a bug in your code, even if it causes no problems here. Any instruction with a 32-bit register (eax, ebx, ecx, edx, ebp, esp, esi or edi) as destination in x86-64 sets the topmost 32 bits of the corresponding 64-bit register (rax, rbx, rcx, rdx, rbp, rsp, rsi or rdi) to zero, causing the stack pointer to point somewhere below the 4 GiB limit, effectively doing this (in Intel syntax):
sub rsp,8
and rsp,0x00000000ffffffff
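However, this causes no problems as long as the stack lies below the 4 GiB mark of the virtual address space, or, as in this particular code, as long as nothing touches the stack before rsp is restored. If the stack lives above 4 GiB (which is a matter of virtual addresses, not of how much RAM is installed), any push, call, or memory access through rsp in between may result in a segmentation fault. leave further below in your code restores a sane value to rsp. Generally in x86-64 code you never need esp (excluding possibly some optimizations or tweaks). To fix this bug: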
subq $8, %rsp
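The difference between the two forms can be modeled in C with fixed-width integers. This is a sketch, not processor documentation: the function names are hypothetical, and the sample address below is a made-up value chosen only because it lies above 4 GiB.

```c
#include <stdint.h>

/* Models `subl $8, %esp` in 64-bit mode: the subtraction is done on the low
   32 bits, and writing esp zero-extends the result into all of rsp. */
uint64_t sub_esp_8(uint64_t rsp)
{
    uint32_t esp = (uint32_t)rsp;         /* low half of rsp */
    return (uint64_t)(uint32_t)(esp - 8u); /* upper 32 bits become zero */
}

/* Models `subq $8, %rsp`: a full 64-bit subtraction. */
uint64_t sub_rsp_8(uint64_t rsp)
{
    return rsp - 8u;
}
```

For the hypothetical stack pointer 0x00007ffee3b5c9f0, sub_esp_8 returns 0x00000000e3b5c9e8 while sub_rsp_8 returns 0x00007ffee3b5c9e8, so only the 64-bit form still points into the original stack.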
The instructions so far are the standard entry sequence (replace $8 according to the stack usage). Wikibooks has a useful article on x86 functions and stack frames (but note again that it uses 32-bit x86 assembly with Intel syntax, not 64-bit x86-64 assembly with AT&T syntax).
Then:
movl $0, %eax
This is obvious. Store 0 into eax. This has nothing to do with the stack.
leave
This is equivalent to mov rsp, rbp followed by pop rbp.
ret
And this, finally, sets rip to the value stored at [ss:rsp], effectively returning control to the point from which this procedure was called, and adds 8 to rsp.
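The whole sequence can be traced with a toy machine in C. Everything here is an illustrative simplification: the struct and function names are hypothetical, the stack is an array of 8-byte slots, and rsp and rbp hold slot indices rather than real addresses. The corrected subq $8, %rsp is modeled as moving rsp down by one slot.

```c
#include <stdint.h>

/* Toy machine state: a stack of 8-byte slots plus the two pointer registers,
   which hold slot indices here instead of real addresses. */
struct machine {
    uint64_t stack[16];
    uint64_t rsp;   /* index of the slot currently on top of the stack */
    uint64_t rbp;
};

/* push: move rsp down one slot, then store the value there. */
void push64(struct machine *m, uint64_t value)
{
    m->rsp -= 1;
    m->stack[m->rsp] = value;
}

/* pop: load the value on top, then move rsp up one slot. */
uint64_t pop64(struct machine *m)
{
    uint64_t value = m->stack[m->rsp];
    m->rsp += 1;
    return value;
}

/* Runs the equivalent of the function body and returns the new rip. */
uint64_t run_main_body(struct machine *m)
{
    push64(m, m->rbp);       /* push %rbp                       */
    m->rbp = m->rsp;         /* mov %rsp, %rbp                  */
    m->rsp -= 1;             /* subq $8, %rsp (one 8-byte slot) */
    m->rsp = m->rbp;         /* leave, part 1: mov rsp, rbp     */
    m->rbp = pop64(m);       /* leave, part 2: pop rbp          */
    return pop64(m);         /* ret: pop return address to rip  */
}
```

If the caller first pushes a return address (as call would) and then runs run_main_body, the function hands back that same address, and both rsp and rbp end up with the values they had before the call.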

Why does gcc do this when generating assembly code?

I am playing around with gcc -S to understand how memory and the stack work. While experimenting, I found several things that are unclear to me. Could you please help me understand the reasons?
1. When a calling function sets up arguments for the function it calls, it uses mov relative to esp instead of push. What is the advantage of not using push?
2. A function that works with its stack-located arguments refers to them as ebp + (N + offset) (where N is the size reserved for the return address). I expected to see esp - offset, which is more understandable. What is the reason for using ebp as the reference point everywhere? I know these are equivalent, but still?
3. What is this magic at the beginning of main for? Why must esp be initialized in this particular way?
and esp,0xfffffff0
Thanks,
I will assume you are working in a 32-bit environment, because in a 64-bit environment the first several arguments are passed in registers.
Question 1
Perhaps you are passing a floating point argument here. You cannot push these directly, as the push instruction in a 32-bit runtime pushes 4 bytes at a time, so you would have to break up the value. It is sometimes easier to subtract 8 from esp and then mov the 8-byte quadword into [esp].
Question 2
ebp is frequently used to index the parameters and locals in stack frames in 32-bit code. This allows the offsets within frames to be fixed even as the stack pointer moves. For example consider
void f(int x) {
int a;
g(x, 5);
}
Now if you only accessed the stack frame contents with esp, then a is at [esp], the return address would be at [esp+4] and x would be at [esp+8]. Now let's generate code to call g. We have to first push 5 then push x. But after pushing 5, the offset of x from esp has changed! This is why ebp is used. Normally on entry to functions we push the old value of ebp to save it, then copy esp to ebp. Now ebp can be used to access stack frame contents. It won't move when we are in the middle of passing arguments.
Question 3
This and instruction zeros out the low 4 bits of esp, rounding it down to a 16-byte boundary. Since the stack grows downward, rounding down only moves esp further into unused stack space, so this is nice and safe.
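The effect of the mask is easy to check in C. This is a minimal sketch with a hypothetical function name; the sample value is an arbitrary 32-bit number standing in for esp.

```c
#include <stdint.h>

/* Models `and esp, 0xfffffff0`: clears the low 4 bits, which rounds the
   value down to the nearest multiple of 16. */
uint32_t align_down_16(uint32_t esp)
{
    return esp & 0xfffffff0u;
}
```

For example, align_down_16(0xbffff7ac) gives 0xbffff7a0, and a value that is already a multiple of 16 comes back unchanged. The result is never larger than the input, which is why the adjustment cannot run into data already on the stack.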
