call immediate versus call dword near [dword addr] - windows

So recently I've been wanting to call some win32 calls from assembly, and I've been using NASM as my external assembler. I was calling SendMessage in my code in the following way:
call __imp__SendMessageW@16
This was assembled into a relative call (0xE8 opcode), and the result was an access violation. In the debugger, the computed call target seemed to be the correct one (in that __imp__SendMessageW@16 really did seem to reside there), but nonetheless it did not work. Examining the assembly produced by Visual Studio when I called the function from C++, I noticed that it wasn't using a relative immediate call but instead (in the language of MASM) a call dword ptr [__imp__SendMessageW@16], corresponding to the opcode bytes 0xFF 0x15. After some futzing around I figured out that NASM syntax encodes this as call dword near [dword __imp__SendMessageW@16], and after making the change my code suddenly worked.
My question is, why does one work and not the other? Is there some relocation of code going on that causes the relative immediate call to jump somewhere unfriendly? I've never been much of an assembly programmer but my impression was always that the two calls should do the same thing and the main difference is that one is position independent and the other is not (assuming that they move the IP to the same place). The relocation of code theory makes sense given that, but then how do you explain the debugger showing the right address?
Also: what's the logic behind the [] syntax in this call? The offset is still an immediate (just little endian encoded immediately after 0xFF15), there's no memory access going on here beyond the instruction fetch (I tend to think of [] as a dereference outside the context of lea).

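call dword [__imp__SendMessageW@16]
__imp__SendMessageW@16 is the address of a slot in your imports section, and that slot contains the address of the API function. You use the square brackets to dereference it (call the address STORED at this address). A direct call jumps to the slot itself and tries to execute the pointer bytes as code, which is why it crashes even though the debugger shows the right symbol at the target.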
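To make the difference concrete, here is a minimal Python sketch that models the two encodings. All addresses (CALL_SITE, IAT_SLOT, SENDMESSAGEW) are made up for illustration, and the `memory` dict only stands in for the one import slot; this is not how the loader actually lays things out.

```python
import struct

def encode_direct_call(instr_addr: int, target: int) -> bytes:
    """E8 rel32: the displacement is relative to the end of the 5-byte instruction."""
    rel32 = (target - (instr_addr + 5)) & 0xFFFFFFFF
    return b"\xE8" + struct.pack("<I", rel32)

def encode_indirect_call(slot_addr: int) -> bytes:
    """FF 15 disp32: the operand is the absolute address of a pointer slot."""
    return b"\xFF\x15" + struct.pack("<I", slot_addr)

def direct_call_target(instr_addr: int, code: bytes) -> int:
    """Where execution continues after an E8 call: the operand address itself."""
    (rel32,) = struct.unpack("<i", code[1:5])
    return (instr_addr + 5 + rel32) & 0xFFFFFFFF

def indirect_call_target(code: bytes, memory: dict) -> int:
    """Where execution continues after an FF 15 call: the pointer read from the slot."""
    (slot_addr,) = struct.unpack("<I", code[2:6])
    return memory[slot_addr]

# Hypothetical addresses, chosen only for the example.
CALL_SITE = 0x00401000      # where the call instruction lives
IAT_SLOT = 0x00402000       # what the __imp__ symbol resolves to
SENDMESSAGEW = 0x7E42A000   # the real function, somewhere inside a DLL
memory = {IAT_SLOT: SENDMESSAGEW}

direct = encode_direct_call(CALL_SITE, IAT_SLOT)
indirect = encode_indirect_call(IAT_SLOT)

print(direct.hex(), hex(direct_call_target(CALL_SITE, direct)))
print(indirect.hex(), hex(indirect_call_target(indirect, memory)))
```

Both instructions carry the slot's address, which is why the debugger shows the "right" symbol either way. Only the indirect form reads the slot and continues at the function it points to.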

Related

How did debuggers for 16-bit real mode programs produce stack traces?

I'm messing around with running old DOS programs in an emulator, and I've gotten to the point where I'd like to trace the program's stack. However, I'm running into a problem, specifically how to detect near calls and far calls. Some context:
A near call pushes only the IP onto the stack, and is expected to be paired with a ret which pops only the IP to return to.
A far call pushes both the CS and IP onto the stack, and is expected to be paired with a retf which pops both the CS and IP to return to.
There is no way to know whether a call is a near call or a far call, except by knowing which kind of instruction called it, or which return it uses.
Luckily, for the period this program was developed in, BP-based stack frames were very common, so walking the stack doesn't seem to be a problem: I just follow the BP-chain. Unfortunately, getting the CS and/or IP is difficult, because there doesn't seem to be any way for me to determine whether a call is a near call or a far call by looking at the stack alone.
I have metadata about functions available, so I can tell whether a function is a near or far call if I already know the actual CS and IP, but I can't figure out the IP and CS unless I already know if it's a far call or near call.
I'm having a little success by just guessing and seeing if my guess results in a valid function lookup, but I think this method will produce a lot of false positives.
So my question is this: How did debuggers of the DOS era deal with this problem and produce stack traces? Is there some algorithm for this I'm missing, or did they just encode debug information in the stack? (If this is the case, then I'll have to come up with something else.)
Just a guess, I've never actually used 16-bit x86 development tools (modern or back in the day):
You know the CS:IP value of the current function (or one that triggered a fault or whatever from an exception frame).
You might have metadata that tells you whether this is a "far" function that's called with a far call or not. Or you could attempt decoding until you get to a retn or retf, and use that to decide whether the return address is a near IP or a far CS:IP.
(Assuming this is a normal function that returns with some kind of ret. Or if it ends with a jmp tailcall to another function, then the return address probably matches that, but that's another level of assumptions. And figuring out that a near jmp is the end of a function instead of just a jump within a large function is an ambiguous problem without any symbol metadata.)
But anyway, apply the same thing to the parent function: after one level of successful backtracing, you now have the CS:IP of the instruction after the call in your parent function, and the SS:BP value of the BP linked list.
And BTW, yes there's a very good reason for legacy BP stack frames being widely used: [SP] isn't a valid 16-bit addressing mode, and only [BP] as a base implies SS as a segment, so yes, using BP for access to the stack was the only good option for random access (not just push/pop for temporaries). No reason not to save/restore it first (before any other registers or reserving stack space) to make a conventional stack-frame.
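The approach above could be sketched roughly like this in Python. Everything here is an assumption for illustration: the Func metadata records, the `stack` dict (SS-relative offset to 16-bit word), and the example addresses are hypothetical, and the sketch assumes every function has already run a conventional push bp / mov bp, sp prologue.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

class Func(NamedTuple):
    name: str
    cs: int        # code segment the function lives in
    start: int     # first IP of the function
    end: int       # one past the last IP
    is_far: bool   # True if it returns with retf (was reached by a far call)

def find_func(funcs: List[Func], cs: int, ip: int) -> Optional[Func]:
    """Look up the function containing CS:IP in the (hypothetical) metadata."""
    for f in funcs:
        if f.cs == cs and f.start <= ip < f.end:
            return f
    return None

def walk_stack(funcs: List[Func], stack: Dict[int, int],
               cs: int, ip: int, bp: int,
               max_frames: int = 64) -> List[Tuple[str, int, int]]:
    """Follow the BP chain; the *current* function's near/far kind decides
    whether its return address is IP only or CS:IP."""
    trace = []
    for _ in range(max_frames):
        func = find_func(funcs, cs, ip)
        if func is None:
            break                      # unknown code: stop instead of guessing
        trace.append((func.name, cs, ip))
        if bp not in stack or (bp + 2) not in stack:
            break                      # end of the BP chain
        if func.is_far:
            if (bp + 4) not in stack:
                break
            cs = stack[bp + 4]         # far frame: [bp+4] holds the return CS
        ip = stack[bp + 2]             # [bp+2] holds the return IP
        bp = stack[bp]                 # [bp] holds the caller's saved BP
    return trace

# Hypothetical program: main far-calls helper, which near-calls leaf.
funcs = [
    Func("main",   0x1000, 0x0000, 0x0100, is_far=False),
    Func("helper", 0x2000, 0x0000, 0x0080, is_far=True),
    Func("leaf",   0x2000, 0x0080, 0x00C0, is_far=False),
]
stack = {
    0xFFE0: 0xFFF0, 0xFFE2: 0x0040,                  # leaf frame: saved BP, return IP
    0xFFF0: 0x0000, 0xFFF2: 0x0020, 0xFFF4: 0x1000,  # helper frame: saved BP, IP, CS
}
trace = walk_stack(funcs, stack, cs=0x2000, ip=0x0090, bp=0xFFE0)
for name, seg, off in trace:
    print(f"{name} at {seg:04X}:{off:04X}")
```

The sketch stops as soon as a lookup fails rather than guessing, which at least avoids the false positives mentioned in the question.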

Does the Win32 entry point have to preserve any registers values (callee-saved registers)?

I am writing a program in NASM, and I do not want to link it with the CRT, and so I will specify the entry point (which will be the Win32 entry point). This is the program source code:
global _myEntryPoint
section .text
_myEntryPoint:
mov eax, 12345
Now this is what I know about the Win32 entry point (please correct me if I am wrong):
The Win32 entry point does not return a value like a normal function does (to exit the Win32 entry point I have to call ExitProcess()).
The Win32 entry point does not take any arguments.
Now what I don't know is the following:
Does the Win32 entry point have to preserve any register values (callee-saved registers)? I think the answer is no, since when the Win32 entry point exits, it terminates the process and does not return execution to a function that expects some register values to be preserved.
As described in my answer to the proposed duplicate, you shouldn't return from the Win32 entry point at all, in which case there is obviously no need for you to preserve any registers. The way your question is phrased vaguely suggests that you were worried that you might need to restore registers before calling ExitProcess, but this is definitely not the case; calling ExitProcess does not cause you to return from the entry point, it just stops running your code. (See also here for an update, and this may also be of interest.)
Should you ignore that advice and return from the entry point anyway, well, in practice the answer is the same: you don't actually need to have preserved any registers. To the best of my knowledge this behaviour is not documented, however, so if you wanted to be cautious you might choose to strictly follow the stdcall convention.

How do symbols solve walking the stack with FPO in x86 debugging?

In this answer: https://stackoverflow.com/a/8646611/192359 , it is explained that when debugging x86 code, symbols allow the debugger to display the callstack even when FPO (Frame Pointer Omission) is used.
The given explanation is:
On the x86 PDBs contain FPO information, which allows the debugger to reliably unwind a call stack.
My question is: what is this information? As far as I understand, just knowing whether a function has FPO or not does not help you find the original value of the stack pointer, since that depends on runtime information.
What am I missing here?
Fundamentally, it is always possible to walk the stack with enough information [1], except in cases where the stack or execution context has been irrecoverably corrupted.
For example, even if rbp isn't used as the frame pointer, the return address is still on the stack somewhere, and you just need to know where. For a function that doesn't modify rsp (indirectly or directly) in the body of the function it would be at a simple fixed offset from rsp. For functions that modify rsp in the body of the function (i.e., that have a variable stack size), the offset from rsp might depend on the exact location in the function.
The PDB file simply contains this "side band" information which allows someone to determine the return address for any instruction in the function. Hans linked a relevant in-memory structure above - you can see that since it knows the size of the local variables and so on it can calculate the offset between rsp and the base of the frame, and hence get at the return address. It also knows how many instruction bytes are part of the "prolog" which is important because if the IP is still in that region, different rules apply (i.e., the stack hasn't been adjusted to reflect the locals in this function yet).
In 64-bit Windows, the exact function call ABI has been made a bit more concrete, and all functions generally have to provide unwind information: not in a .pdb but directly in a section included in the binary. So even without .pdb files you should be able to unwind a properly structured 64-bit Windows program. It allows any register to be used as the frame pointer, and still allows frame-pointer omission (with some restrictions). For details, start here.
[1] If this weren't true, ask yourself how the currently running function could ever return. Now, technically you could design a program which clobbers or forgets the stack in a way that it cannot return, and either never exits or uses a method like exit() or abort() to terminate. This is highly unusual and not possible outside of assembly.
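As a rough illustration of the "side band" idea, here is a small Python sketch. The table format is invented for the example (rows of code offset and bytes pushed so far) and is not the actual PDB FPO record layout; the function body and all addresses are hypothetical too.

```python
import bisect
from typing import Dict, List, Tuple

def frame_depth(unwind_rows: List[Tuple[int, int]], ip_offset: int) -> int:
    """Bytes between the stack pointer and the return-address slot when the
    instruction pointer is ip_offset bytes into the function.
    unwind_rows is sorted by code offset: (offset, bytes allocated so far)."""
    offsets = [off for off, _ in unwind_rows]
    i = bisect.bisect_right(offsets, ip_offset) - 1
    return unwind_rows[i][1]

def unwind_one_frame(stack: Dict[int, int], sp: int,
                     unwind_rows: List[Tuple[int, int]],
                     ip_offset: int, word: int = 4) -> Tuple[int, int]:
    """Return (return address, caller's stack pointer after the ret)."""
    slot = sp + frame_depth(unwind_rows, ip_offset)
    return stack[slot], slot + word

# Hypothetical 32-bit function without a frame pointer:
#   offset 0: push ebx        (1 byte)  -> 4 bytes on the stack afterwards
#   offset 1: sub  esp, 0x10  (3 bytes) -> 20 bytes on the stack afterwards
#   offset 4: ...body...
rows = [(0, 0), (1, 4), (4, 20)]
stack = {0x0012FF80: 0x00401234}   # slot holding the return address

# Stopped in the body: the stack pointer is 20 bytes below the return address.
in_body = unwind_one_frame(stack, sp=0x0012FF6C, unwind_rows=rows, ip_offset=10)
# Stopped mid-prolog, after the push but before the sub: only 4 bytes below.
in_prolog = unwind_one_frame(stack, sp=0x0012FF7C, unwind_rows=rows, ip_offset=1)
print([hex(v) for v in in_body], [hex(v) for v in in_prolog])
```

The second call shows why the prolog size matters: the same function needs a different offset depending on how far into the prolog the instruction pointer is.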

How to call library functions in shellcode

I want to generate shellcode using the following NASM code:
global _start
extern exit
section .text
_start:
xor rcx, rcx
or rcx, 10
call exit
The problem here is that I cannot use this, because the address of the exit function cannot be hard-coded. So, how do I go about using library functions without having to re-implement them using system calls?
One way that I can think of is to retrieve the address of the exit function in a pre-processing program using GetProcAddress and substitute it into the shellcode at the appropriate place.
However, this method does not generate shellcode that can be run as it is. I'm sure there must be a better way to do it.
I am not an expert on writing shellcode, but you could try to find the import address table (IAT) of your target program and use the stored function pointers to call Windows functions.
Note that you would be limited to the functions the target program uses.
Also you would have to let your shellcode calculate IAT's position relative to the process's base address due to relocations. Of course you could rely on Windows not relocating, but this might result in errors in a few cases.
Another issue is that you would have to find the target process's base address from outside.
A totally different approach would be to use raw syscalls, but they are really hard to use and risky as well, since Windows system call numbers are undocumented and change between versions.
Information on PE file structure:
https://msdn.microsoft.com/en-us/library/ms809762.aspx

Simple "Hello-World", null-free shellcode for Windows needed

I would like to test a buffer-overflow by writing "Hello World" to console (using Windows XP 32-Bit). The shellcode needs to be null-free in order to be passed by "scanf" into the program I want to overflow. I've found plenty of assembly-tutorials for Linux, however none for Windows. Could someone please step me through this using NASM? Thxxx!
Assembly opcodes are the same, so the regular tricks to produce null-free shellcodes still apply, but the way to make system calls is different.
In (32-bit) Linux you make system calls with the "int 0x80" instruction, while on Windows you must go through DLLs and make normal user-mode calls to their exported functions.
For that reason, on Windows your shellcode must either:
Hardcode the Win32 API function addresses (most likely will only work on your machine)
Use a Win32 API resolver shellcode (works on every Windows version)
If you're just learning, for now it's probably easier to just hardcode the addresses you see in the debugger. To make the calls position independent you can load the addresses in registers. For example, a call to a function with 4 arguments:
PUSH 4 ; argument #4 to the function
PUSH 3 ; argument #3 to the function
PUSH 2 ; argument #2 to the function
PUSH 1 ; argument #1 to the function
MOV EAX, 0xDEADBEEF ; put the address of the function to call
CALL EAX
Note that the arguments are pushed in reverse order. After the CALL instruction EAX contains the return value, and the stack will be just like it was before (i.e. the function pops its own arguments). The ECX and EDX registers may contain garbage, so don't rely on them keeping their values after the call.
A direct CALL instruction won't work, because its operand is encoded relative to the instruction's own address, and the shellcode doesn't know in advance where it will be loaded.
To avoid zeros in the address itself try any of the null-free tricks for x86 shellcode, there are many out there but my favorite (albeit lengthy) is encoding the values using XOR instructions:
MOV EAX, 0xDEADBEEF ^ 0xFFFFFFFF ; your value xor'ed against an arbitrary mask
XOR EAX, 0xFFFFFFFF ; the arbitrary mask
You can also try NEG EAX or NOT EAX (sign inversion and bit flipping) to see if they work; they're much cheaper (two bytes each).
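If you want to check the XOR trick by hand, a small Python sketch like the following can pick a mask and emit the bytes. The helper names (find_xor_mask, encode_load_eax) are made up for this example; the opcodes B8 (MOV EAX, imm32) and 35 (XOR EAX, imm32) are the standard long encodings, though an assembler may pick a shorter form for the XOR.

```python
import struct

def find_xor_mask(value: int) -> int:
    """Pick a mask so that neither the mask nor (value ^ mask) has a zero byte."""
    mask = 0
    for shift in (0, 8, 16, 24):
        byte = (value >> shift) & 0xFF
        # The mask byte must be non-zero and must differ from the value byte,
        # otherwise the XOR of the two would be zero.
        m = 0xFF if byte != 0xFF else 0x01
        mask |= m << shift
    return mask

def encode_load_eax(value: int) -> bytes:
    """MOV EAX, value^mask ; XOR EAX, mask  (B8 imm32, 35 imm32)."""
    mask = find_xor_mask(value)
    return (b"\xB8" + struct.pack("<I", value ^ mask) +
            b"\x35" + struct.pack("<I", mask))

def decode_load_eax(code: bytes) -> int:
    """Recompute what EAX would hold after the two instructions run."""
    (first,) = struct.unpack("<I", code[1:5])
    (second,) = struct.unpack("<I", code[6:10])
    return first ^ second

# Hypothetical function address containing zero bytes.
code = encode_load_eax(0x00401000)
print(code.hex())
```

For an address like 0x00401000 this falls back to the all-ones mask from the example above; values containing 0xFF bytes get a different mask byte in those positions so the encoded immediate stays null-free.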
You can get help on the different API functions you can call here: http://msdn.microsoft.com
The most important ones you'll need are probably the following:
WinExec(): http://msdn.microsoft.com/en-us/library/ms687393(VS.85).aspx
LoadLibrary(): http://msdn.microsoft.com/en-us/library/windows/desktop/ms684175(v=vs.85).aspx
GetProcAddress(): http://msdn.microsoft.com/en-us/library/ms683212%28v=VS.85%29.aspx
The first launches a command, the next two are for loading DLL files and getting the addresses of its functions.
Here's a complete tutorial on writing Windows shellcodes: http://www.codeproject.com/Articles/325776/The-Art-of-Win32-Shellcoding
Assembly language is defined by your processor, and assembly syntax is defined by the assembler (hence AT&T and Intel syntax). The main difference is in how you reach the operating system. DOS ran in real mode (you call the actual interrupts to do stuff, and you can use all the memory accessible to your computer, instead of just your program's), whereas both Windows XP and Linux run your program in protected mode (you only have access to memory in your program's little cubby of memory, and you can't talk to the hardware and BIOS directly). On Linux you call int 0x80 to make calls to the kernel; on Windows you call functions exported by the system DLLs instead. Anyway, the instructions in hello world type stuff would more or less be the same between Linux and Windows, as long as the processors are compatible.
To get the shellcode from the program you've made, just load it into your target system's debugger (gdb for Linux, and debug for Windows). In debug, type u to unassemble (typing ? lists the commands); the opcode bytes appear between the memory addresses and the instructions.
Just copy them all over to your text editor into one string, and maybe write a program that translates them all into their ASCII values. Not sure how to do this in gdb though...
Anyway, to make it into a buffer overflow exploit, enter aaaaa... and keep adding a's until the program crashes from the overflow. Find exactly how many a's it takes to crash it. The crash should then tell you which memory address it tried to execute, usually in the error message. If it says '6161[rest of original return address]' then you've got it (0x61, or 97 in decimal, is the ASCII code for 'a'). Now you have to use your debugger to find out where the buffer lives. Disassemble the program with your debugger and look for where scanf is called. Set a breakpoint there, run, and examine the stack. Look for all those 61's and see where they end. Then remove the breakpoint and type the number of a's you found it took, exactly. If the error message was "buffer overflow at '61[rest of original return address]'", then remove that last a, put in the address you found while examining the stack, and insert your shellcode. If all goes well, you should see your shellcode execute.
Happy hacking...

Resources