How to see the result of MASM directives such as PROC, .SETFRAME, .PUSHREG - windows

Writing x64 assembly code using MASM, we can use these directives to provide frame-unwinding information. For example, from the .SETFRAME documentation:
These directives do not generate code; they only generate .xdata and .pdata.
Since these directives don't produce any code, I cannot see their effect in the Disassembly window, so I see no difference when I write an assembly function with or without them. How can I see the result of these directives - using dumpbin or something else?
How do I write code that tests this unwinding capability? For example, I intentionally write assembly code that causes an exception, and I want to see the difference in exception-handling behavior when the function is written with or without these directives.
In my case the caller is written in C++ and can use try/catch, SSE, etc. - whatever is relevant for this situation.

Answering your question:
How can I see the result of these directives - using dumpbin or something else?
You can use dumpbin /UNWINDINFO out.exe to see the .pdata entries and unwind info (.xdata) resulting from your use of .SETFRAME.
The output will look something like the following:
00000054 00001530 00001541 000C2070
Unwind version: 1
Unwind flags: None
Size of prologue: 0x04
Count of codes: 2
Frame register: rbp
Frame offset: 0x0
Unwind codes:
04: SET_FPREG, register=rbp, offset=0x00
01: PUSH_NONVOL, register=rbp
A bit of explanation of the output:
The second hex number in the output is the function's begin address, 00001530; the third is its end address and the fourth is the address of its unwind info (together they form the RUNTIME_FUNCTION entry in .pdata).
Unwind codes describe what happens in the function prolog. In this example:
RBP is pushed to the stack
RBP is used as the frame pointer
Other functions may look like the following:
000000D8 000016D0 0000178A 000C20E4
Unwind version: 1
Unwind flags: EHANDLER UHANDLER
Size of prologue: 0x05
Count of codes: 2
Unwind codes:
05: ALLOC_SMALL, size=0x20
01: PUSH_NONVOL, register=rbx
Handler: 000A2A50
One of the main differences here is that this function has an exception handler. This is indicated by the Unwind flags: EHANDLER UHANDLER as well as the Handler: 000A2A50.

Probably your best bet is to have your asm function call another C++ function, and have your C++ function throw a C++ exception. Ideally have the code there depend on multiple values in call-preserved registers, so you can make sure they get restored. But just having unwinding find the right return addresses to get back into your caller requires correct metadata to indicate where that is relative to RSP, for any given RIP.
So create a situation where a C++ exception needs to unwind the stack through your asm function; if it works then you got the stack-unwind metadata directives correct. Specifically, try{}catch in the C++ caller, and throw in a C++ function you call from asm.
That thrower can I think be extern "C" so you can call it from asm without name mangling. Or call it via a function pointer, or just look at MSVC compiler output and copy the mangled name into asm.
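For example, a minimal C++ sketch of that setup (asm_func and do_throw are names made up here; asm_func stands for your MASM function, assembled separately, and it needs to call do_throw somewhere in its body):
#include <cstdio>
#include <stdexcept>

extern "C" void asm_func();      // your MASM function with the unwind directives
extern "C" void do_throw()       // called from inside asm_func
{
    throw std::runtime_error("unwind through the asm frame");
}

int main()
{
    try {
        asm_func();
    } catch (const std::exception& e) {
        std::puts(e.what());     // reached only if unwinding through asm_func worked
    }
}
If the unwind metadata your directives emit is wrong, the unwinder typically cannot get back to the try/catch in main and the process terminates instead of printing the message.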
Apparently Windows SEH uses the same mechanism as plain C++ exceptions, so you could potentially set up a catch for the exception delivered by the kernel in response to a memory fault from something like mov ds:[0], eax (null deref). You could put this at any point in your function to make sure the exception unwind info was correct about the stack state at every point, not just getting back into sync before a function-call.
https://learn.microsoft.com/en-us/cpp/build/exception-handling-x64?view=msvc-170&viewFallbackFrom=vs-2019 has details about the metadata.
BTW, the non-Windows (e.g. GNU/Linux) equivalent of this metadata is DWARF .cfi directives which create a .eh_frame section.
I don't know equivalent details for Windows, but I do know they use similar metadata that makes it possible to unwind the stack without relying on RBP frame pointers. This lets compilers make optimized code that doesn't waste instructions on push rbp / mov rbp,rsp and leave in function prologues/epilogues, and frees up RBP for use as a general-purpose register. (Even more useful in 32-bit code where 7 instead of 6 registers besides the stack pointer is a much bigger deal than 15 vs. 14.)
The idea is that given a RIP, you can look up the offset from RSP to the return address on the stack, and the locations of any call-preserved registers. So you can restore them and continue unwinding into the parent using that return address.
The metadata indicates where each register was saved, relative to RSP or RBP, given the current RIP as a search key. In functions that use an RBP frame pointer, one piece of metadata can indicate that. (Other metadata for each push rbx / push r12 says which call-preserved regs were saved in which order).
In functions that don't use RBP as a frame pointer, every push / pop or sub/add RSP needs metadata for which RIP it happened at, so given a RIP, stack unwinding can see where the return address is, and where those saved call-preserved registers are. (Functions that use alloca or VLAs thus must use RBP as a frame pointer.)
This is the big-picture problem that the metadata has to solve. There are a lot of details, and it's much easier to leave things up to a compiler!

Related

Why do calls to _Noreturn generate real calls instead of jumps? [duplicate]

This question already has an answer here:
Why noreturn/__builtin_unreachable prevents tail call optimization
(1 answer)
Closed 3 years ago.
I looked at
//#define _Noreturn //<= uncomment to see jmp's change to call's
_Noreturn void ext(int,char const*);
__attribute((noinline))
_Noreturn void intern(int X)
{
ext(X,"X"); //jmp unless ext is _Noreturn
}
void do_intern(void)
{
intern(0); //jmp unless intern is _Noreturn
}
int do_int_intern(void)
{
intern(0); //call in either case,
//would've expected a jmp if intern is _Noreturn
return 42; //erased if intern is _Noreturn
}
in
https://gcc.godbolt.org/z/MoT-LK and I noticed all gcc, clang, and icc generate real calls (call in x86_64) for calls to a _Noreturn function even though without the _Noreturn the call would have been a direct jump (jmp in x86_64).
Why is this?
For your examples, when compiled for x86_64, a JMP instruction couldn't simply be substituted for the CALL instruction because the stack wouldn't be properly aligned: the callee expects RSP to be 16-byte aligned before the CALL, i.e. 16-byte aligned minus 8 on entry, after the return address has been pushed.
Your example code, when compiled with -Os, generates this assembly on all the compilers in your godbolt link:
do_intern:
push rax ; ICC uses RSI here instead
xor edi, edi
call intern
To change the CALL into a JMP, the compiler would have to either add an additional instruction to push a fake return address onto the stack or undo the stack-alignment adjustment made at the top of the function:
do_intern:
push rax
xor edi, edi
push rax ; or pop rax
jmp intern
Now if the compiler were really smart, it could realize that there is no real need to align the stack at the start of the function, so there would be no need to undo it or push a fake return address:
do_intern:
xor edi, edi
jmp intern
But the compilers aren't smart enough to do this, probably because it wouldn't work in the general case where the function allocates variables on the stack, and because there's rarely anything to be gained by improving the performance of calling functions that can never return.
In the tail-call case without _Noreturn, the CALL instruction could potentially execute many times during the run of the program, so it can be worth treating as a special case. With _Noreturn, the CALL can execute at most once (unless intern ends up making a recursive call back into do_intern). Despite the similarity of the two cases, it would require new code to recognize the _Noreturn special case.
Note that for GCC at least, it turns out that recognizing the _Noreturn special case is a lot harder than I initially thought, as explained by Richard Henderson on the GCC bugs mailing list:
This is the fault of my noreturn patch -- sibcalls to noreturn
functions no longer got an edge to EXIT, which meant that the code
intended to insert sibcall_epilogue patterns didn't.
It's a quandry: we need the more accurate CFG for the "control reaches
end of non-void function" test. But then if we brute force search for
sibcalls to insert the sibcall_epilogue patterns, we'll wind up with
flow2 warning about wanting to delete dead epilogue code (the loads
for the call-saved registers are dead becase we don't return).
I think the best way to solve this is to not create sibcalls to
noreturn functions. [...]
The term "sibcalls" refer to "sibling" calls, where a tail-call can use a jump instruction instead of a call instruction.
This is the reason why, at least initially, the optimization you're expecting was never (correctly) implemented in GCC. The fact that in most cases it would be better to have more accurate backtraces when non-returning functions like abort are called, seems to be a post hoc justification, though one that's probably significantly contributed to the optimization never being implemented in GCC, as well as clang and ICC.

saving general purpose registers in switch_to() in linux 2.6

I saw the code of switch_to in the article "Evolution of the x86 context switch in Linux" in the link https://www.maizure.org/projects/evolution_x86_context_switch_linux/
Most versions of switch_to only save/restore ESP/RSP and/or EBP/RBP, not other call-preserved registers in the inline asm. But the Linux 2.2.0 version does save them in this function, because it uses software context switching instead of relying on hardware TSS stuff. Later Linux versions still do software context switching, but don't have these push / pop instructions.
Are the registers saved in another function (maybe in the schedule() function)? Or is there no need to save these registers in the kernel context?
(I know that those registers of the user context are saved in the kernel stack when the system enters kernel mode).
Linux versions before 2.2.0 use hardware task switching, where the TSS saves/restores registers for you. That's what the "ljmp %0\n\t" is doing. (ljmp is AT&T syntax for a far jmp, presumably to a task gate). I'm not really familiar with hardware TSS stuff because it's not very relevant; it's still used in modern kernels for getting RSP pointing to the kernel stack for interrupt handlers, but not for context switching between tasks.
Hardware task switching is slow, so later kernels avoid it. Linux 2.2 does save/restore the call-preserved registers manually, with push/pop before/after swapping stacks. EAX, EDX, and ECX are declared as dummy outputs ("=a" (eax), "=d" (edx), "=c" (ecx)) so the compiler knows that the old values of those registers are no longer available.
This is a sensible choice because switch_to is probably used inside a non-inline function. The caller will make a function call that eventually returns (after running another task for a while) with the call-preserved registers restored, and the call-clobbered registers clobbered, just like a regular function call. (So compiler code-gen for the function that uses the switch_to macro doesn't need to emit save/restore code outside of the inline asm). If you think about writing a whole context switch function in asm (not inline asm), you'd get this clobbering of volatile registers for free because callers expect that.
So how do later kernels avoid saving/restoring those registers in inline asm?
Linux 2.4 uses "=b" (last) as an output operand, so the compiler has to save/restore EBX in a function that uses this asm. The asm still saves/restores ESI, EDI, and EBP (as well as ESP). The text of the article notes this:
The 2.4 kernel context switch brings a few minor changes: EBX is no longer pushed/popped, but it is now included in the output of the inline assembly. We have a new input argument.
I don't see where they tell the compiler about EAX, ECX, and EDX not surviving, so that's odd. It might be a bug that they get away with by making the function noinline or something?
Linux 2.6 on i386 uses more output operands that get the compiler to handle the save/restore.
But Linux 2.6 for x86-64 introduces the trick that hands off the save/restore to the compiler easily: #define __EXTRA_CLOBBER ,"rcx","rbx","rdx","r8","r9","r10", "r11","r12","r13","r14","r15"
Notice the clobbers declaration: : "memory", "cc" __EXTRA_CLOBBER
This tells the compiler that the inline asm destroys all those registers, so the compiler will emit instructions to save/restore these registers at the start/end of whatever function switch_to ultimately inlines into.
Telling the compiler that all the registers are destroyed after a context switch solves the same problem as manually saving/restoring them with inline asm. The compiler will still make a function that obeys the calling convention.
The context switch swaps to the new task's stack, so the compiler-generated save/restore code always runs with the appropriate stack pointer. Notice that the explicit push/pop instructions inside the inline asm in Linux 2.2 and 2.4 come before / after everything else.
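To see the clobber mechanism in isolation, here is a stand-alone sketch (not kernel code; the nop and the chosen registers are just placeholders):
static inline void clobber_demo(void)
{
    // The clobber list tells the compiler that RBX and R12 are destroyed here,
    // so it saves/restores them in the enclosing function if it still needs
    // their values - the same job __EXTRA_CLOBBER does across the real switch.
    asm volatile("nop"            /* stand-in for the context-switch code */
                 : /* no outputs */
                 : /* no inputs */
                 : "rbx", "r12", "memory", "cc");
}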

writing a !address equivalent in WinAPI

Implementing !address feature of Windbg...
I am using VirtualQueryEx to query another process's memory, and using GetModuleFileName on the base addresses returned from VirtualQueryEx gives the module name.
What is left are the other, non-module regions of the process. How do I determine whether a file is mapped to a region, or whether the region represents the stack, the heap, the PEB/TEB, etc.?
Basically, how do I figure out whether a region represents the heap, the stack or the PEB? How does WinDbg do it?
One approach is to disassemble the code in the debugger extension DLL that implements !address. There is documentation in the WinDbg help file on writing an extension; you can use it to work out where the handler for !address is located, and then, browsing through the disassembly, see what functions it calls.
Windbg has support for debugging another instance of Windbg, specifically to debug an extension DLL. You can use this facility to better delve into the implementation of !address.
While the reverse engineering approach may be tedious, it will be more deterministic than theorizing how !address is implemented and trying out each theory.
To add to @Χpẘ's answer, reversing the command shouldn't really be hard, as debugger extension DLLs come with symbols (I already reversed one to explain the internal flags of the !heap command).
Note that this is just a quick overview; I haven't dug into it too deeply.
According to the !address documentation, the command lives in the exts.dll library. The command itself is implemented in Extension::address.
Two commands are handled there: a kernel-mode one (KmAnalyzeAddress) and a user-mode one (UmAnalyzeAddress).
Inside UmAnalyzeAddress, the code:
Parses the command line: UmParseCommandLine(CmdArgs &,UmFilterData &)
Checks whether the process PEB is available: IsTypeAvailable(char const *,ulong *) with "${$ntdllsym}!_PEB"
Allocates a std::list of user-mode ranges: std::list<UmRange,std::allocator<UmRange>>::list<UmRange,std::allocator<UmRange>>(void)
Starts a loop to gather the required information:
UmRangeData::GetWowState(void)
UmMapBuild
UmMapFileMappings
UmMapModules
UmMapPebs
UmMapTebsAndStacks
UmMapHeaps
UmMapPageHeaps
UmMapCLR
UmMapOthers
Finally, the results are output to the screen using UmPrintResults.
Each of the above functions can be broken down into basic components; e.g. UmMapFileMappings has the following central code:
.text:101119E0 push edi ; hFile
.text:101119E1 push offset LibFileName ; "psapi.dll"
.text:101119E6 call ds:LoadLibraryExW(x,x,x)
.text:101119EC mov [ebp+hLibModule], eax
.text:101119F2 test eax, eax
.text:101119F4 jz loc_10111BC3
.text:101119FA push offset ProcName ; "GetMappedFileNameW"
.text:101119FF push eax ; hModule
.text:10111A00 mov byte ptr [ebp+var_4], 1
.text:10111A04 call ds:GetProcAddress(x,x)
Another example: to find each stack, the code just loops through all threads, gets their TEB and calls:
.text:1010F44C push offset aNttib_stackbas ; "NtTib.StackBase"
.text:1010F451 lea edx, [ebp+var_17C]
.text:1010F457 lea ecx, [ebp+var_CC]
.text:1010F45D call ExtRemoteTyped::Field(char const *)
There is a lot of fetching from _PEB, _TEB, _HEAP and other internal structures, so it's probably not doable without going directly through those structures. So I guess some of the information returned by !address is not accessible through the usual / common APIs.
You need to determine whether the address you are interested in lies within a memory-mapped file; check out GetMappedFileName. Getting the heap and stack addresses of a process will be a little more problematic, as those ranges are dynamic and don't always lie sequentially.
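As a rough sketch of that approach (assuming hProc was opened with PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, and linking against psapi.lib):
#include <windows.h>
#include <psapi.h>
#include <cstdio>

// Walk the target's address space and print the file backing every
// MEM_MAPPED / MEM_IMAGE region; other committed regions (heap, stacks,
// PEB/TEB) have no mapped file name.
void DumpMappedRegions(HANDLE hProc)
{
    MEMORY_BASIC_INFORMATION mbi;
    unsigned char* addr = nullptr;

    while (VirtualQueryEx(hProc, addr, &mbi, sizeof(mbi)) == sizeof(mbi)) {
        if (mbi.State == MEM_COMMIT &&
            (mbi.Type == MEM_MAPPED || mbi.Type == MEM_IMAGE)) {
            wchar_t name[MAX_PATH] = L"";
            if (GetMappedFileNameW(hProc, mbi.BaseAddress, name, MAX_PATH))
                wprintf(L"%p  %ls\n", mbi.BaseAddress, name);
        }
        addr = static_cast<unsigned char*>(mbi.BaseAddress) + mbi.RegionSize;
    }
}
Note that GetMappedFileName returns a device path (\Device\HarddiskVolumeN\...) rather than a drive-letter path.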
I don't know for sure; I would start with a handle to the heap. If you can spawn/inherit the process then you more than likely can access the handle to its heap. This function looks promising: GetProcessHeap. A debug app running as admin can walk the process chain and spy on any user-level process. I don't think you will be able to access the protected memory of kernel-mode components such as file system filters, however, as they are dug down a little lower by policy.

inline assembly error: can't find a register in class 'GENERAL_REGS' while reloading 'asm'

I have an inline AT&T style assembly block, which works with XMM registers and there are no problems in Release configuration of my XCode project, however I've stumbled upon this strange error (which is supposedly a GCC bug) in Debug configuration... Can I fix it somehow? There is nothing special in assembly code, but I am using a lot of memory constraints (12 constraints), can this cause this problem?
Not a complete answer, sorry, but the comments section is too short for this ...
Can you post a sample asm("..." :::) line that demonstrates the problem ?
The use of XMM registers is not the issue; the error message indicates that GCC wanted to create code like, say:
movdqa (%rax),%xmm0
i.e. memory loads/stores through pointers held in general registers, and you specified more memory locations than there are available general-purpose registers (it's probably 12 in Debug mode because RBP and RSP are used as frame/stack pointers, RBX is likely used for the global offset table, and RAX is reserved for return values) without realizing any register re-use potential.
You might be able to eke things out by doing something like:
void *all_mem_args_tbl[16] = { memarg1, memarg2, ... };
void *trashme;
asm ("movq (%0), %1\n\t"
"movdqa (%1), %xmm0\n\t"
"movq 8(%0), %1\n\t"
"movdqa (%1), %xmm1\n\t"
...
: "r"all_mem_args_tbl : "r"(trashme) : ...);
i.e. put all the memory locations into a table that you pass as an operand, and then manage the actual general-purpose register use yourself. Each load becomes two pointer accesses (one extra through the indirection table), but whether that makes a difference is hard to say without seeing your complete assembly.
The Debug configuration uses -O0 by default. Since this flag disables optimisations, the compiler probably cannot allocate registers given the constraints specified by your inline assembly code, resulting in register starvation.
One solution is to specify a different optimisation level, e.g. -Os, which is the one used by default in the Release configuration.

Simple "Hello-World", null-free shellcode for Windows needed

I would like to test a buffer overflow by writing "Hello World" to the console (using Windows XP 32-bit). The shellcode needs to be null-free so that it can be passed via scanf into the program I want to overflow. I've found plenty of assembly tutorials for Linux, but none for Windows. Could someone please step me through this using NASM? Thxxx!
Assembly opcodes are the same, so the regular tricks to produce null-free shellcode still apply, but the way to make system calls is different.
In Linux you make system calls with the int 0x80 instruction, while on Windows you must use DLL libraries and make normal user-mode calls to their exported functions.
For that reason, on Windows your shellcode must either:
Hardcode the Win32 API function addresses (most likely will only work on your machine)
Use a Win32 API resolver shellcode (works on every Windows version)
If you're just learning, for now it's probably easier to just hardcode the addresses you see in the debugger. To make the calls position independent you can load the addresses in registers. For example, a call to a function with 4 arguments:
PUSH 4 ; argument #4 to the function
PUSH 3 ; argument #3 to the function
PUSH 2 ; argument #2 to the function
PUSH 1 ; argument #1 to the function
MOV EAX, 0xDEADBEEF ; put the address of the function to call
CALL EAX
Note that the arguments are pushed in reverse order. After the CALL instruction, EAX contains the return value and the stack is just like it was before (i.e. the function pops its own arguments). The ECX and EDX registers may contain garbage, so don't rely on them keeping their values after the call.
A direct CALL instruction won't work, because those are position dependent.
To avoid zeros in the address itself try any of the null-free tricks for x86 shellcode, there are many out there but my favorite (albeit lengthy) is encoding the values using XOR instructions:
MOV EAX, 0xDEADBEEF ^ 0xFFFFFFFF ; your value xor'ed against an arbitrary mask
XOR EAX, 0xFFFFFFFF ; the arbitrary mask
You can also try NEG EAX or NOT EAX (negation and bit-flipping, respectively) to see if they work; they're much cheaper (two bytes each).
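If you want to sanity-check a mask before hardcoding it, a tiny helper program (plain C++, not part of the shellcode, and purely illustrative) can confirm that neither the mask nor the encoded value contains a zero byte:
#include <cstdint>
#include <cstdio>

// Returns true if none of the four bytes of v is zero.
static bool zeroByteFree(uint32_t v)
{
    for (int i = 0; i < 4; ++i)
        if (((v >> (8 * i)) & 0xFF) == 0)
            return false;
    return true;
}

int main()
{
    uint32_t addr = 0x00401000;   // example value containing null bytes
    uint32_t mask = 0xFFFFFFFF;   // candidate mask
    if (zeroByteFree(mask) && zeroByteFree(addr ^ mask))
        std::printf("MOV EAX, 0x%08X\nXOR EAX, 0x%08X\n", addr ^ mask, mask);
    else
        std::printf("pick a different mask\n");
}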
You can get help on the different API functions you can call here: http://msdn.microsoft.com
The most important ones you'll need are probably the following:
WinExec(): http://msdn.microsoft.com/en-us/library/ms687393(VS.85).aspx
LoadLibrary(): http://msdn.microsoft.com/en-us/library/windows/desktop/ms684175(v=vs.85).aspx
GetProcAddress(): http://msdn.microsoft.com/en-us/library/ms683212%28v=VS.85%29.aspx
The first launches a command; the other two are for loading DLL files and getting the addresses of their exported functions.
Here's a complete tutorial on writing Windows shellcodes: http://www.codeproject.com/Articles/325776/The-Art-of-Win32-Shellcoding
Assembly language is defined by your processor, and assembly syntax is defined by the assembler (hence AT&T and Intel syntax). The main practical difference for shellcode is how you talk to the operating system: on Linux you invoke the kernel directly with int 0x80 system calls, while on Windows you call API functions exported from DLLs. Anyway, hello-world type code is more or less the same between Linux and Windows, as long as the processors are compatible.
To get the shellcode bytes out of the program you've made, load it into your target system's debugger (gdb for Linux, debug for Windows) and disassemble it; between the instructions and the memory addresses you will see the opcodes.
Copy them all over to your text editor into one string, and maybe write a small program that translates them into their character values. I'm not sure how to do this in gdb, though...
Anyway, to turn it into a buffer-overflow exploit, enter aaaa... and keep adding a's until the program crashes with a buffer-overflow error, and note exactly how many a's it took. The error message should tell you which memory address was hit; if it shows '97...97[rest of the original return address]' (97 is the decimal ASCII code for 'a'), you've found it. Now use your debugger to locate that spot: disassemble the program, find where scanf is called, set a breakpoint there, run, and examine the stack. Look for where all those a's end. Then remove the breakpoint, feed in exactly the number of a's it took to reach the return address, put the address you found while examining the stack in its place, and append your shellcode. If all goes well, you should see your shellcode execute.
Happy hacking...
