How is the stack set up in a Windows function call? - winapi

To begin, I'd like to say I have sufficient background in assembly to understand most of what one needs to know to be a functional assembly programmer. Unfortunately I do not understand how a Windows API call works in terms of the return address.
Here's some example code written in GAS assembly for Windows using MinGW's as as the assembler and MinGW's ld as the linker...
.extern _ExitProcess#4
.text
.globl _main
_main:
pushl $0
call _ExitProcess#4
This code compiles and runs after assembling...
as program.s -o program.o
And linking it...
ld program.o -o program.exe -lkernel32
From my understanding, Windows API calls take arguments via push instructions, as can be seen above. Then during the call;
call _ExitProcess#4
the return address for the function is placed on the stack. Then, and this is where I'm confused, the function pops all the arguments off the stack.
I am confused because, since the stack is last in first out, in my mind while popping the arguments on the stack it would pop off the return address first. The arguments came first and the return address came next so it would technically be popped off first.
My question is, what does the layout of the stack look like after passing arguments via push operations to the function call and the return address placed on the stack? How are the arguments and the return address popped off the stack by the function as it executes? And finally, how is the return address popped off the stack and the function call rerturns to the address specified in the return addresss?

Almost all Windows API functions use the stdcall calling convention. This works like the normal "cdecl" convention, except as you've seen the called function is responsible for removing the argument when it returns. It does this using the RET instruction, which takes an optional immediate operand. This operand is the number of bytes to pop off the stack after first popping off the return value.
In both the cdecl and stdcall calling convention the arguments to a function aren't popped off the stack while the function is executing. They're left on the stack and accessed using ESP or EBP relative addressing. So when ExitProcess needs to access its argument it uses an instruction like mov 4(%esp), %eax or mov 4(%ebp), %eax.

Related

How to see result of MASM directives such as PROC, .SETFRAME. .PUSHREG

Writing x64 Assembly code using MASM, we can use these directives to provide frame unwinding information. For example, from .SETFRAME definition:
These directives do not generate code; they only generate .xdata and .pdata.
Since these directives don't produce any code, I cannot see their effects in Disassembly window. So, I don't see any difference, when I write assembly function with or without these directives. How can I see the result of these directives - using dumpbin or something else?
How to write code that can test this unwinding capability? For example, I intentionally write assembly code that causes an exception. I want to see the difference in exception handling behavior, when function is written with or without these directives.
In my case caller is written in C++, and can use try-catch, SSE etc. - whatever is relevant for this situation.
Answering your question:
How can I see the result of these directives - using dumpbin or something else?
You can use dumpbin /UNWINDINFO out.exe to see the additions to the .pdata resulting from your use of .SETFRAME.
The output will look something like the following:
00000054 00001530 00001541 000C2070
Unwind version: 1
Unwind flags: None
Size of prologue: 0x04
Count of codes: 2
Frame register: rbp
Frame offset: 0x0
Unwind codes:
04: SET_FPREG, register=rbp, offset=0x00
01: PUSH_NONVOL, register=rbp
A bit of explanation to the output:
The second hex number found in the output is the function address 00001530
Unwind codes express what happens in the function prolog. In the example what happens is:
RBP is pushed to the stack
RBP is used as the frame pointer
Other functions may look like the following:
000000D8 000016D0 0000178A 000C20E4
Unwind version: 1
Unwind flags: EHANDLER UHANDLER
Size of prologue: 0x05
Count of codes: 2
Unwind codes:
05: ALLOC_SMALL, size=0x20
01: PUSH_NONVOL, register=rbx
Handler: 000A2A50
One of the main differences here is that this function has an exception handler. This is indicated by the Unwind flags: EHANDLER UHANDLER as well as the Handler: 000A2A50.
Probably your best bet is to have your asm function call another C++ function, and have your C++ function throw a C++ exception. Ideally have the code there depend on multiple values in call-preserved registers, so you can make sure they get restored. But just having unwinding find the right return addresses to get back into your caller requires correct metadata to indicate where that is relative to RSP, for any given RIP.
So create a situation where a C++ exception needs to unwind the stack through your asm function; if it works then you got the stack-unwind metadata directives correct. Specifically, try{}catch in the C++ caller, and throw in a C++ function you call from asm.
That thrower can I think be extern "C" so you can call it from asm without name mangling. Or call it via a function pointer, or just look at MSVC compiler output and copy the mangled name into asm.
Apparently Windows SEH uses the same mechanism as plain C++ exceptions, so you could potentially set up a catch for the exception delivered by the kernel in response to a memory fault from something like mov ds:[0], eax (null deref). You could put this at any point in your function to make sure the exception unwind info was correct about the stack state at every point, not just getting back into sync before a function-call.
https://learn.microsoft.com/en-us/cpp/build/exception-handling-x64?view=msvc-170&viewFallbackFrom=vs-2019 has details about the metadata.
BTW, the non-Windows (e.g. GNU/Linux) equivalent of this metadata is DWARF .cfi directives which create a .eh_frame section.
I don't know equivalent details for Windows, but I do know they use similar metadata that makes it possible to unwind the stack without relying on RBP frame pointers. This lets compilers make optimized code that doesn't waste instructions on push rbp / mov rbp,rsp and leave in function prologues/epilogues, and frees up RBP for use as a general-purpose register. (Even more useful in 32-bit code where 7 instead of 6 registers besides the stack pointer is a much bigger deal than 15 vs. 14.)
The idea is that given a RIP, you can look up the offset from RSP to the return address on the stack, and the locations of any call-preserved registers. So you can restore them and continue unwinding into the parent using that return address.
The metadata indicates where each register was saved, relative to RSP or RBP, given the current RIP as a search key. In functions that use an RBP frame pointer, one piece of metadata can indicate that. (Other metadata for each push rbx / push r12 says which call-preserved regs were saved in which order).
In functions that don't use RBP as a frame pointer, every push / pop or sub/add RSP needs metadata for which RIP it happened at, so given a RIP, stack unwinding can see where the return address is, and where those saved call-preserved registers are. (Functions that use alloca or VLAs thus must use RBP as a frame pointer.)
This is the big-picture problem that the metadata has to solve. There are a lot of details, and it's much easier to leave things up to a compiler!

Why do calls to _Noreturn generate real calls instead of jumps? [duplicate]

This question already has an answer here:
Why noreturn/__builtin_unreachable prevents tail call optimization
(1 answer)
Closed 3 years ago.
I looked at
//#define _Noreturn //<= uncomment to see jmp's change to call's
_Noreturn void ext(int,char const*);
__attribute((noinline))
_Noreturn void intern(int X)
{
ext(X,"X"); //jmp unless ext is _Noreturn
}
void do_intern(void)
{
intern(0); //jmp unless intern is _Noreturn
}
int do_int_intern(void)
{
intern(0); //call in either case,
//would've expected a jmp if intern is _Noreturn
return 42; //erased if intern is _Noreturn
}
in
https://gcc.godbolt.org/z/MoT-LK and I noticed all gcc, clang, and icc generate real calls (call in x86_64) for calls to a _Noreturn function even though without the _Noreturn the call would have been a direct jump (jmp in x86_64).
Why is this?
For your examples, when compiled for x86_64, a JMP instruction couldn't be substituted for a CALL instruction as the stack won't be properly aligned. The callee expects the stack to be 16-byte aligned plus 8 for the return value.
Your example code, when compiled with -Os, generates this assembly on all the compilers in your godbolt link:
do_intern:
push rax ; ICC uses RSI here instead
xor edi, edi
call intern
To change the call into a JMP it would have to either add additional instruction to push a fake return value on the stack or undo the stack alignment made at the top of the function:
do_intern:
push rax
xor edi, edi
push rax ; or pop rax
jmp intern
Now if the compiler were really smart it could realize that there really isn't need to align the stack at the start of the function, so no need to undo it or push a fake return value:
do_intern:
xor edi, edi
jmp intern
But the compilers aren't smart enough to do this. Probably because it won't work in the general case where the function allocates variables on the stack and because there's rarely anything to be gained by improving the performance of calling functions that can never return.
In the tail-call case without _Noreturn, the CALL instruction could potentially be called multiple times during the execution of the program, so it can be worthwhile treating as a special case. With _Noreturn the CALL instruction can only be called once during the execution of the program (unless intern ends up making a recursive call to do_intern). Despite the similarity of the two cases, it would require new code to recognize the _Noreturn special case.
Note for GCC at least, it turns recognizing the _Noreturn special case is a lot harder than I initially thought, as explained by Richard Henderson on the GCC bugs mailing list:
This is the fault of my noreturn patch -- sibcalls to noreturn
functions no longer got an edge to EXIT, which meant that the code
intended to insert sibcall_epilogue patterns didn't.
It's a quandry: we need the more accurate CFG for the "control reaches
end of non-void function" test. But then if we brute force search for
sibcalls to insert the sibcall_epilogue patterns, we'll wind up with
flow2 warning about wanting to delete dead epilogue code (the loads
for the call-saved registers are dead becase we don't return).
I think the best way to solve this is to not create sibcalls to
noreturn functions. [...]
The term "sibcalls" refer to "sibling" calls, where a tail-call can use a jump instruction instead of a call instruction.
This is the reason why, at least initially, the optimization you're expecting was never (correctly) implemented in GCC. The fact that in most cases it would be better to have more accurate backtraces when non-returning functions like abort are called, seems to be a post hoc justification, though one that's probably significantly contributed to the optimization never being implemented in GCC, as well as clang and ICC.

If I am using the Win32 entry point, should I increase the esp value to remove the variables from the stack?

If I am using the Win32 entry point and I have the following code (in NASM):
extern _ExitProcess#4
global _start
section .text
_start:
mov ebp, esp
; Reserve space onto the stack for two 4 bytes variables
sub esp, 4
sub esp, 4
; ExitProcess(0)
push 0
call _ExitProcess#4
Now before exiting the process, should I increase the esp value to remove the two variables from the stack like I do with any "normal" function?
ExitProcess api can be called from any place. in any function and sub-function. and stack pointer of course can be any. you not need set any registers (include stack pointer) to some (and which ?) values. so answer - you not need increase the esp
as noted #HarryJohnston of course stack must be valid and aligned. as and before any api call. ExiProcess is usual api. and can be call as any another api. and like any another api it require only valid stack but not concrete stack pointer value. non-volatile registers need restore only we return to caller. but ExiProcess not return to caller. it at all never return
so rule is very simply - if you return from any function (entry point or absolute any - does not matter) - we need restore non volatile registers (stack pointer esp or rsp based on calling conventions) and return. if we not return to caller - we and not need restore/preserve any registers. if we return from thread or process entry point, despite good practice also restore all registers as well - in current windows implementations - even if we not do this, any way all will be work, because kernel32 shell caller simply just call ExitThread after we return. it not use any non volatile registers or local variables here. so code will be worked even without restore this from entry point, but much better restore it anyway

gcc, start symbol and push $0x00

So I've searched the interwebs for weeks and cannot find the answer to my questions.
I can see that the start symbol added by gcc sets up the initial two arguments (int argc, char *argv[]) and I believe the 3rd environment argument, but I have a couple questions about this. Why does it add all of this if the main() function is defined to have no arguments? Wouldn't it save space (and technically processing time) if it would just call _main without any arguments?
Second what does push $0x0 do? I have done tests and if you try to iterate through the command line arguments like what the default start symbol does then you need push $0x0 at the beginning, otherwise I have created misaligned stack errors if I do something like the following:
push $0x00
call _main
mov %eax, %edi
call _exit
also in my investigations I have found that the start symbol is added by the linker when you link to crt1.10.6.o
Any explanations or documentation would be greatly appreciated
The code for _start, as you have found, comes from an object file named crt1.o or similar. That object file normally comes from source code that's part of the C library. It is compiled in ignorance of how you have declared your main function, and so it has to assume you wish to make use of all three potential arguments. It does no harm to set up the arguments even if main won't use them (other than a trivial amount of space and time, as you mention).
I deduce from mov %eax,%edi that this is the x86-64 ELF ABI, which means that there are no frame pointers, so the push $0x00 is indeed just to keep the stack aligned to a 16-byte boundary, as you speculate. (In the x86-32 ELF ABI, it would also have served as the terminator for the linked list of frame pointers.)

GCC with the -fomit-frame-pointer option

I'm using GCC with the -fomit-frame-pointer and -O2 options. When I looked through the assembly code it generated,
push %ebp
movl %esp, %ebp
at the start and pop %ebp at the end is removed. But some redundant subl/addl instructions to esp is left in - subl $12, %esp at the start and addl $12, %esp at the end.
How will I be able to remove them as some inline assembly will jmp to another function before addl is excecuted.
You probably don't want to remove those -- that's usually the code that allocates and deallocates your local variables. If you remove those, your code will trample all over the return addresses and such.
The only safe way to get rid of them is not to use any local variables. Even in macros. And be really careful about inline functions, as they often have their own locals that'll get put in with yours. You may want to consider explicitly disabling function inlining for that section of code, if you can.
If you're absolutely sure that the adds and subs aren't needed (and i mean really, really sure), on my machine GCC apaprently does some stack manipulation to keep the stack aligned at 16 byte boundaries. You may be able to say "-mpreferred-stack-boundary=2", which will align to 4-byte boundaries -- which x86 processors like to do anyway, so no code is generated to realign it. Works on my box with my GCC; int main() { return 0; } turned into
main:
xorl %eax, %eax
ret
but the alignment code looked different to start with...so that may not be the problem for you.
Just so you're warned: optimization causes a lot of weird stuff like that to happen. Be careful with hand-coded assembler language and optimized <insert-almost-any-language-here> code, especially when you're doing something as unstructured as a jump from the middle of one function into another.
I solved the problem by giving a function prototype, then defining it manually like this:
void my_function();
asm (
".globl _my_function\n"
"_my_function:\n\t"
/* Assembler instructions go here */
);
Later I also wanted the function to be exported, so I added this at the end of the source file:
asm (
".section .drectve\n\t"
".ascii \" -export:my_function\"\n"
);
How will I be able to remove them as some inline assembly will jmp to another function before addl is executed.
This will corrupt your stack, that caller expects the stack pointer
to be corrected on function return. Does the other function return
by ret instruction? What exactly do you try to achieve? maybe there's another solution possible?
Please, show us the lines around the function call (in the caller) and your
entry/exit part of your function in question.

Resources