Windows 64 ABI, correct register use if I do NOT call the Windows API?

As suggested to me in another question, I checked the Windows ABI, and I'm left a little confused about what I can and cannot do if I'm not calling the Windows API myself.
My scenario: I'm programming in .NET and need a small chunk of asm targeting a specific processor for a time-critical section of code that does heavy multi-pass processing on an array.
When checking the register information in the ABI at https://msdn.microsoft.com/en-us/library/9z1stfyw.aspx
I'm left a little confused about what applies to me if I:
1) Don't call the Windows API from the asm code
2) Don't return a value and take a single parameter.
Here is what I understand. Am I getting all of it right?
RAX : I can overwrite this without preserving it, as the function doesn't return a value
RCX : I need to keep this for as long as I need the parameter, as this is where the single int parameter is passed; after that I can overwrite it and not restore it
RDX/R8/R9 : These are not initialized, as my method has no such parameters; I can overwrite them and not restore them
R10/R11 : I can overwrite those without saving them; if the caller needs them, it is in charge of preserving them
R12/R13/R14/R15/RDI/RSI/RBX : I can overwrite them, but I first need to save them (or can I just not save them if I'm not calling the Windows API?)
RBP/RSP : I'm assuming I shouldn't touch those?
If so, am I correct that this is the right way to handle this (if I don't care about the time taken to preserve data and need as many registers available as possible)? Or is there a way to use even more registers?
; save required registers
push r12
push r13
push r14
push r15
push rdi
push rsi
push rbx
; my own array processing code here, using rcx as the memory address passed as the first parameter
; safe to use rax rbx rcx rdx r8 r9 r10 r11 r12 r13 r14 r15 rdi rsi giving me 14 64bit registers
; 1 for the array address 13 for processing
; should not touch rbp rsp
; restore required registers
pop rbx
pop rsi
pop rdi
pop r15
pop r14
pop r13
pop r12

TL;DR: if you need registers that are marked preserved, push/pop them in proper order. With your code you can use those 14 registers you mention without issues. You may touch RBP if you preserve it, but don't touch RSP basically ever.
It does matter if you call Windows APIs but not in the way I assume you think. The ABI says what registers you must preserve. The preservation information means that the caller knows that there are registers you will not change. You don't need to call any Windows API functions for that requirement to be there.
The idea as an analogue (yeah, I know...): Here are five different colored stacks of sticky notes. You can use any of them, but if you need the red or the blue ones, could you keep the top one in a safe place and put it back when you stop since I need the phone numbers on them. About the other colors I don't care, they were just scratch paper and I've written the information elsewhere.
So if you call an external function, you know that it will never change the value of the registers marked as preserved. Any other register may change its value, and you have to make sure you don't keep anything there that needs to survive the call.
And when your function is called, the caller expects the same: if they put a value in a preserved register, it will have the same value after the call. But any non-preserved registers may be whatever and they will make sure they store those values if they need to keep them.
The return value register you may use however you want. If the function doesn't return a value the caller must not expect it to have any specific value and also will not expect it to preserve its value.
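As a rough sketch of the save/restore pattern described above (NASM syntax; `process_array` is a made-up name, and the body is left as a placeholder), the routine could look something like this. It also saves RBP, which frees one more register than the list in the question. Note that under the Windows x64 convention the array address arrives in RCX.

```nasm
; Sketch only: process_array is a hypothetical name, not from the question.
; Windows x64: the first integer/pointer argument arrives in RCX.
bits 64
section .text
global process_array

process_array:
    ; Save every callee-saved general-purpose register the body will use.
    push rbp
    push rbx
    push rsi
    push rdi
    push r12
    push r13
    push r14
    push r15

    ; Body goes here. The array address is in RCX on entry.
    ; RAX, RCX, RDX and R8-R11 are scratch; the registers saved above
    ; are now free as well. RSP must keep pointing at the saved values.

    ; Restore in exact reverse order of the pushes, then return.
    pop  r15
    pop  r14
    pop  r13
    pop  r12
    pop  rdi
    pop  rsi
    pop  rbx
    pop  rbp
    ret
```

If the body needs fewer registers, the matching push/pop pairs can simply be dropped.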

You only need to preserve the registers you use. If you don't use all of these, you don't need to preserve all of them.
You can freely use RAX, RCX, RDX, R8, R9, R10 and R11. These are all volatile: if the caller needs their values, the caller must preserve them, not your function.
Most of the time, these registers (or their subregisters like EAX) are enough for my purposes. I hardly ever need more.
Of course, if any of these (e.g. RCX) contain arguments for your function, it is up to you to preserve them for yourself as long as you need them. How you do that is also up to you. But if you push them, make sure that there is a corresponding pop somewhere.
Use This MSDN page as a guide.
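For comparison, here is a minimal sketch of a routine that stays entirely within the volatile registers, so it needs no push/pop at all. It is written in NASM syntax; `clear16` and its behaviour (zeroing 16 qwords) are invented purely for illustration.

```nasm
; Sketch only: clear16 is a hypothetical routine that zeroes 16 qwords.
; void clear16(uint64_t *data)   ; Windows x64: data arrives in RCX
bits 64
section .text
global clear16

clear16:
    mov  r8, rcx        ; keep the argument for ourselves; RCX is now free
    xor  eax, eax       ; RAX = 0 (writing EAX zero-extends into RAX)
    mov  ecx, 16        ; reuse RCX as the loop counter
.next:
    mov  [r8], rax      ; store one zero qword
    add  r8, 8          ; advance to the next element
    dec  ecx
    jnz  .next
    ret                 ; nothing was pushed, so there is nothing to pop
```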

Related

Position of GCC stack canaries

Unless I am misunderstanding something, it seems the position of the canary value can be before or after ebp, therefore in the second case the attacker can overwrite the frame pointer without touching the canary.
For example in this snippet, the canary is located at a lower address (ebp-0xc) than ebp therefore protecting it (an attacker must overwrite the canary to overwrite ebp):
0x080484e0 <+52>: mov eax,DWORD PTR [ebp-0xc]
0x080484e3 <+55>: xor eax,DWORD PTR gs:0x14
0x080484ea <+62>: je 0x80484f1 <func+69>
0x080484ec <+64>: call 0x8048360 <__stack_chk_fail@plt>
However looking at other code the canary is after rbp+8:
How should I interpret this? Does this depend on GCC version or something else?
The canary is always below the frame pointer, with every version of gcc I've tried. You can see that confirmed in the gdb disassembly immediately below the IDA disassembly in the blog post you linked, which has mov rax, QWORD PTR [rbp-0x8].
I think this is just an artifact of IDA's disassembler. Instead of displaying the numerical offset for rbp-relative addresses, it assigns a name to each stack slot, and displays the name instead; basically assuming that every rbp-relative access is to a local variable or argument. And it looks like it always displays that name with a + regardless of whether the offset is positive or negative. Note that buf and fd also get a + sign even though they are local variables which are clearly below the frame pointer.
In this example, it has named the canary var_8 as if it were a local variable. So I suppose to translate this properly, you have to think of var_8 as having the value -8.
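One way to picture the naming scheme (a sketch in NASM syntax; the `equ` line is my own stand-in for what IDA does internally, and `canary_load` is a made-up label) is to define the name as a negative constant. Both loads below then refer to the same stack slot and assemble to the same encoding.

```nasm
; Sketch only: mimics IDA's display convention with an assembler constant.
bits 64
var_8 equ -8                  ; IDA-style name for the slot 8 bytes below RBP

section .text
canary_load:
    mov  rax, [rbp+var_8]     ; as IDA displays it
    mov  rax, [rbp-8]         ; the same instruction with the numeric offset
```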

If I am using the Win32 entry point, should I increase the esp value to remove the variables from the stack?

If I am using the Win32 entry point and I have the following code (in NASM):
extern _ExitProcess@4
global _start
section .text
_start:
mov ebp, esp
; Reserve space onto the stack for two 4 bytes variables
sub esp, 4
sub esp, 4
; ExitProcess(0)
push 0
call _ExitProcess@4
Now before exiting the process, should I increase the esp value to remove the two variables from the stack like I do with any "normal" function?
ExitProcess can be called from anywhere, in any function or sub-function, and the stack pointer can hold any value at that point. You do not need to set any register (including the stack pointer) to a particular value first. So the answer is: you do not need to increase esp.
As @HarryJohnston noted, the stack must of course be valid and aligned, as before any API call. ExitProcess is an ordinary API and is called like any other: it requires only a valid stack, not a specific stack pointer value. Non-volatile registers need to be restored only when we return to the caller, and ExitProcess never returns.
So the rule is simple: if you return from a function (an entry point or any other, it does not matter), you must restore the non-volatile registers (including the stack pointer, esp or rsp depending on the calling convention) and then return. If you do not return to the caller, you do not need to preserve or restore any registers. Returning from a thread or process entry point is a special case: restoring all registers is good practice, but in current Windows implementations everything still works even if you do not, because the kernel32 code that calls the entry point simply calls ExitThread after you return and does not use any non-volatile registers or local variables at that point. So the code will work without restoring them at the entry point, but it is much better to restore them anyway.
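To make the contrast concrete, here is a small NASM sketch along the lines of the question's code. The `helper` function is invented for illustration: it returns to its caller, so it releases its locals first, while `_start` ends in ExitProcess and leaves its locals on the stack.

```nasm
; Sketch only: helper is a hypothetical function added for contrast.
bits 32
extern _ExitProcess@4
global _start

section .text

; Ordinary function: it returns, so ESP and EBP must be restored.
helper:
    push ebp
    mov  ebp, esp
    sub  esp, 8             ; two 4-byte locals at [ebp-4] and [ebp-8]
    mov  esp, ebp           ; release the locals
    pop  ebp
    ret

; Entry point: it ends in ExitProcess, which never returns.
_start:
    mov  ebp, esp
    sub  esp, 8             ; two 4-byte locals
    call helper
    push 0                  ; exit code
    call _ExitProcess@4     ; no "add esp, 8" is needed before this call
```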

Direction Flag DF in the Windows 64-bit calling convention?

According to the docs I can find on calling windows functions, the following applies:-
The Microsoft x64 calling convention[12][13] is followed on Windows
and pre-boot UEFI (for long mode on x86-64). It uses registers RCX,
RDX, R8, R9 for the first four integer or pointer arguments (in that
order), and additional arguments are pushed onto the stack (right to
left). Integer return values (similar to x86) are returned in RAX if
64 bits or less.
In the Microsoft x64 calling convention, it's the caller's
responsibility to allocate 32 bytes of "shadow space" on the stack
right before calling the function (regardless of the actual number of
parameters used), and to pop the stack after the call. The shadow
space is used to spill RCX, RDX, R8, and R9,[14] but must be made
available to all functions, even those with fewer than four
parameters.
The registers RAX, RCX, RDX, R8, R9, R10, R11 are considered volatile
(caller-saved).[15]
The registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are
considered nonvolatile (callee-saved).[15]
So, I have been happily calling kernel32 until a call to GetEnvironmentVariableA failed under certain circumstances. I finally traced it back to the fact that the direction flag DF was set and I needed to clear it.
I have not up till now been able to find any mention of this and wondered if it was prudent to always clear it before a call.
Or maybe that would cause other problems. Anyone aware of the conventions of calling in this instance?
Windows assumes that the direction flag is cleared. Although the article states this only for the C run-time, it holds for Windows as a whole (I think because Windows itself is primarily written in C/C++). So when your program begins to execute, you can assume that DF is 0. Usually you do not need to change this flag. However, if you temporarily set it to 1 in some internal routine, you must clear it with cld before calling any Windows API or any external module, because that code assumes DF is 0.
All Windows interrupt handlers clear DF at the very beginning of execution, so it is safe to temporarily set DF to 1 in your own internal code; the main point is to reset it to 0 before any external call.
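A minimal sketch of that discipline, in NASM syntax: the hypothetical routine `copy_backward` below (name and argument layout are my own assumptions) sets DF for a descending copy and clears it again before returning, so any API called afterwards sees DF = 0. RSI and RDI are saved because they are nonvolatile in the Windows x64 convention.

```nasm
; Sketch only: copy_backward is a hypothetical routine.
; void copy_backward(void *dst_last, const void *src_last, size_t count)
; RCX = address of the last destination byte
; RDX = address of the last source byte
; R8  = number of bytes to copy
bits 64
section .text
global copy_backward

copy_backward:
    push rsi                ; RSI and RDI are nonvolatile on Windows x64
    push rdi
    mov  rdi, rcx
    mov  rsi, rdx
    mov  rcx, r8
    std                     ; DF = 1: string instructions walk downwards
    rep  movsb              ; copy count bytes, from the last to the first
    cld                     ; DF = 0 again before any external code runs
    pop  rdi
    pop  rsi
    ret
```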

Compiler generated unexpected `IN AL, DX` (opcode `EC`) while setting up call stack

I was looking at some compiler output, and when a function is called it usually starts setting up the call stack like so:
PUSH EBP
MOV EBP, ESP
PUSH EDI
PUSH ESI
PUSH EBX
So we save the base pointer of the calling routine on the stack, move our own base pointer up, and then store the contents of a few registers on the stack. These are then restored to their original values at the end of the routine, like so:
LEA ESP, [EBP-0Ch]
POP EBX
POP ESI
POP EDI
POP EBP
RET
So far, so good. However, I noticed that in one routine the code that sets up the call stack looks a little different. In fact, it looks like this:
IN AL, DX
PUSH EDI
PUSH ESI
PUSH EBX
This is quite confusing for a number of reasons. For one thing, the end-of-method code is identical to that quoted above for the other method, and in particular seems to expect a saved copy of EBP to be available on the stack.
For another, if I understand correctly the command IN AL, DX reads into the AL register, which is the low byte of the EAX register, and as it so happens the very next command here is
XOR EAX, EAX
as the program wants to zero a few things it allocated on the stack.
Question: I'm wondering exactly what's going on here that I don't understand. The machine code being translated as IN AL, DX is the single byte EC, whereas the pair of instructions
PUSH EBP
MOV EBP, ESP
would correspond to the three bytes 55 8B EC. Is the disassembler misreading this somehow? Or is something relying on a side effect I don't understand?
If anyone's curious, this machine code was generated by the CLR's JIT compiler, and I'm viewing it with the Visual Studio debugger. Here's a minimal reproduction in C#:
class C {
string s = "";
public void f(string s) {
this.s = s;
}
}
However, note that this seems to be non-deterministic; sometimes I seem to get the IN AL, DX version, while other times there's a PUSH EBP followed by a MOV EBP, ESP.
EDIT: I'm starting to strongly suspect a disassembler bug -- I just got another situation where it shows IN AL, DX (opcode EC) and the two preceding bytes in memory are 55 8B. So perhaps the disassembler is simply confused about the entry point of the method. (Though I'd still like some insight as to why that's happening!)
Sounds like you are using VS2015. Your conclusion is correct, its debugging engine has a lot of bugs. Yes, wrong address. Not the only problem, it does not restore breakpoints properly and you are apt to see the INT3 instruction still in the code. And it can't correctly refresh the disassembly when the jitter has re-generated the code and replace stub calls. You can't trust anything you see.
I recommend you use Tools > Options > Debugging > General and tick the "Use Managed Compatibility Mode" checkbox. That forces the debugger to use an older debugging engine, VS2010 vintage. It is much more stable.
You'll lose some features with this engine, like return value inspection and 64-bit Edit+Continue. Won't be missed when you do this kind of debugging. You will however see fake code addresses, as was always common before, so all CALL addresses are wrong and you can't easily identify calls into the CLR. Flipping the engine back-and-forth is a workaround of sorts, but of course a big annoyance.
This has not been worked on either, I saw no improvements in the Updates. But they no doubt had a big bug list to work through, VS2015 shipped before it was done. Hopefully VS2017 is better, we'll find out soon.
As Hans answered, it's a bug in Visual Studio.
To confirm the same, I disassembled a binary using IDA 6.5 and Visual Studio 2019. Here is the screenshot:
Visual Studio 2019 missed 2 bytes (0x55 0x8B) while considering the start of main.
Note: 'Use managed compatibility mode' mentioned by Hans didn't fix the issue in VS2019.
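The misread can be reproduced on paper by writing the prologue out byte by byte. The sketch below (NASM syntax, with `f` as a placeholder label) emits the same bytes as PUSH EBP / MOV EBP, ESP. A disassembler that starts two bytes late sees only EC, which on its own decodes as IN AL, DX.

```nasm
; Sketch only: the prologue spelled out as raw bytes.
bits 32
section .text
f:
    db 0x55             ; push ebp
    db 0x8B             ; mov ebp, esp: opcode byte
    db 0xEC             ; mov ebp, esp: ModRM byte; decoded alone: in al, dx
    push edi            ; 57
    push esi            ; 56
    push ebx            ; 53
```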

Having problems with GdiGradientFill in FASM

I'm trying to write an ASM version of a Java app I developed recently, as a project in Win32 ASM, but as the title states, I'm having problems with GdiGradientFill; I'd prefer, for the moment, to use FASM, and avoid higher level ASM constructs, such as INVOKE and the use of the WIN32 includes.
What I have, atm:
PUSH [hWnd]
CALL [User32.GetWindowDC]
MOV [hDC], EAX
PUSH rectClient
PUSH [hWnd]
CALL [User32.GetClientRect]
PUSH [rectClient.left]
POP [colorOne.xPos]
PUSH [rectClient.top]
POP [colorOne.yPos]
MOV [colorOne.red], 0xC000
MOV [colorOne.green], 0xC000
MOV [colorOne.blue], 0xC000
MOV [colorOne.alpha], 0x0000
PUSH [rectClient.right]
POP [colorTwo.xPos]
PUSH [rectClient.bottom]
POP [colorTwo.yPos]
MOV [colorTwo.red], 0x0000
MOV [colorTwo.green], 0x2800
MOV [colorTwo.blue], 0x7700
MOV [colorTwo.alpha], 0x0C00
MOV [gRect.UpperLeft], 0
MOV [gRect.LowerRight], 1
PUSH GRADIENT_FILL_RECT_H
PUSH 1
PUSH gRect
PUSH 2
PUSH colorOne
PUSH [hDC]
CALL [GDI32.GdiGradientFill]
However, the code returns only a FALSE, and after going through both MSDN
(http://msdn.microsoft.com/en-us/library/windows/desktop/dd373585(v=vs.85).aspx)
and some other examples (http://www.asmcommunity.net/board/index.php?topic=4100.0), I still can't see what I am doing wrong, can anyone see the flaw here?
An additional problem has been with my attempts to use Msimg32's GradientFill, as this always leads to a crash, however, I have seen some reports that Win2K+ OS's simply pass the parameters from Msimg32 to GDI32; is this accurate, or has anyone else experienced problems with this form?
Pastebin link for whole code: http://pastebin.com/GEHDw6Qe
Thanks for any help, SS
EDIT:
Code is now working, honestly, I have no idea what has changed, I can't see anything different between the previous and now working data, other than changing the PUSH / POP sequence to MOV EAX, [rectClient.left], etc. (The PUSH / POP method works, also) - Many thanks to those who offered assistance!
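You're passing what looks like a RECT as the 4th parameter to GdiGradientFill. The function expects a GRADIENT_TRIANGLE in triangle mode; in the rectangle modes such as GRADIENT_FILL_RECT_H, it expects a GRADIENT_RECT, which holds two indices into the vertex array rather than coordinates.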
Also, PUSH/POP is a very weird way to copy from one memory location to another. You're doing 4 memory accesses instead of two. Copy via a register; this is not Java.
Are you sure GetWindowDC is what you need? That one returns the DC for the whole window, title and border and all. For just the client area, people normally use GetDC(). When done, call ReleaseDC().
