Why does Windows use RCX, RDX for pointers in a fresh x64 process, different from EAX, EBX in a newly created 32-bit process?

When I create a Windows x86 process in a suspended state (CREATE_SUSPENDED), its CONTEXT contains:
Virtual Address of Entry Point in Eax register;
Virtual Address of Process Environment Block structure in Ebx register.
But when I do the same for an x86_64 process, the CONTEXT contains:
Virtual Address of Entry Point in Rcx register (why not Rax?)
Virtual Address of PEB structure in Rdx register (why not Rbx?)
It seems logical to me to take Rax in x64 in place of Eax in x86 and Rbx in x64 in place of Ebx in x86.
But instead of Eax→Rax and Ebx→Rbx we see Eax→Rcx and Ebx→Rdx.
Also, I see that 64-bit Cheat Engine is aware of this when opening the 32-bit process (notice the migration of the values eax↔ecx and ebx↔edx).
What was the reason to move from *ax register to *cx and from *bx to *dx in 64-bit processes?
Is it somehow connected to calling conventions?
Is it related to Windows only or do other OSes also have this kind of register repurposing?
Update:
Screenshots of just created x64 process in a suspended state:

"It seems logical to me to take Rax in x64 in place of Eax in x86 and Rbx in x64 in place of Ebx in x86."
I don't see why it would be logical to assume so.
Even if, at MS, they had defined an internal ABI documenting the context of a just-created 32-bit process, the 64-bit version of it would have been designed anew, so there is no reason to assume it carries anything over from the old 32-bit ABI.
If Windows uses sysret to return to user space, a process created in a suspended state may leak the target address in rcx, because sysret loads the user-mode RIP from rcx.
Returning via other mechanisms (e.g. iret/retf), as could be the case for 32-bit code, will of course leak different data in different registers.
What you are seeing is probably an artifact of how Windows returns to user mode. I don't know exactly what the Windows kernel code to return to user mode is, but it is reasonable to assume that MS kept the same interface for 32-bit processes and that this interface was designed before sysret was widely used.
Note that at the PE entry-point rcx contains a pointer to the PEB and rdx to the entry-point (not the other way around). The former appears to be an undocumented parameter passed to the entry-point function, the latter may be just an artifact of how the entry-point is called.
In fact, a 32-bit process will find a pointer to the PEB on the stack, as the first parameter for the PE entry-point code.
Regarding other OSes, anything that is not documented to be stable is free to change at any time (including what's left in the registers). This is true in general.
As far as stability goes, passing from a 32-bit to a 64-bit implementation is a pretty big step and, again, there is no reason to keep using a very old interface (but with wider registers) instead of improving it with all the recent knowledge.
You can easily see that, for example, Linux "repurposed" the registers in the 64-bit system call ABI.
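To make the Linux comparison concrete, here is a small sketch that lists the registers used for system-call arguments in the two Linux ABIs and reports which argument slots do not simply map to the widened 32-bit register. The register lists follow the syscall(2) man page; the names SYSCALL_ABI, widen and moved_registers are made up for this illustration.

```python
# Register assignments for Linux system calls, as listed in syscall(2).
# SYSCALL_ABI, widen() and moved_registers() are illustrative names only.
SYSCALL_ABI = {
    "i386 (int 0x80)": {
        "number": "eax",
        "args": ["ebx", "ecx", "edx", "esi", "edi", "ebp"],
        "return": "eax",
    },
    "x86-64 (syscall)": {
        "number": "rax",
        "args": ["rdi", "rsi", "rdx", "r10", "r8", "r9"],
        "return": "rax",
    },
}


def widen(reg32):
    """Map a 32-bit register name to its 64-bit counterpart (eax -> rax)."""
    return "r" + reg32[1:]


def moved_registers():
    """Return (slot, old, new) for argument slots that were repurposed."""
    old = SYSCALL_ABI["i386 (int 0x80)"]["args"]
    new = SYSCALL_ABI["x86-64 (syscall)"]["args"]
    return [
        (slot, o, n)
        for slot, (o, n) in enumerate(zip(old, new), start=1)
        if widen(o) != n
    ]


if __name__ == "__main__":
    for slot, o, n in moved_registers():
        print(f"argument {slot}: {o} -> {n}")
```

Only the third argument stays in the "same" register (edx and rdx). The fourth argument uses r10 rather than rcx because the syscall instruction itself overwrites rcx with the return address, which ties in with the sysret remark above.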

Related

Values of Registers at Windows 10 Entry Point x64

I recently wrote a small program that simply displays a popup dialog box using the winapi. I started it in x64dbg debugger to see how it was compiled and learn a bit about assembly.
The first thing that I noticed was that the main thread does not start executing at the entry point of my code: it starts executing somewhere in ntdll.dll. This code seems to make several function calls before eventually calling kernel32 which calls the entry point.
At the entry point, the registers have some values already loaded. I know they must be important as zeroing them in the debugger causes my program to crash. rax seems to be loaded with the entry point, but I'm not sure what the values of the others do.
So what exactly does all of the code do before my entry point, and what values does it load into the registers?
Windows process execution starts in the NT loader which is what's happening in the ntdll.dll / kernel32.dll. If you want details on all that, you should take a look at the Windows Internals books.
With Visual C++ programs the 'entry-point' for the process is mainCRTstartup which is inside the Visual C/C++ Runtime. It initializes the CRT, deals with global initialization, then dispatches to main, wmain, WinMain, etc.
The source for the CRT can be found in a Visual Studio installation: C:\Program Files (x86)\Microsoft Visual Studio\201?\<edition>\VC\Tools\MSVC\<msvctoolset>\crt\src\vcruntime. Note this function doesn't actually take any parameters.
There is only one x64 "ABI" defined, called __fastcall, and it's documented on Microsoft Docs. Per this definition, RCX, RDX, R8, and R9 hold the first four parameters of the function (unless a parameter is a float/double, which it isn't going to be for the entry-point). RAX, RCX, RDX, R8, R9, R10, and R11 are all volatile, with RAX being the return value. Because this function doesn't take any parameters, it doesn't care what the value of any of those registers is. Any other registers, which are expected to be non-volatile, will cause problems if they are zeroed; see this page for details.
Visual C++ also has a __vectorcall convention, but this is only used for internal SIMD procedure calls and is not used for cross-process or system calls. See Microsoft Docs.
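As a quick reference for the register roles described in this answer, here is a minimal lookup sketch. The volatile set and the argument order come from the answer itself; the non-volatile set follows the Microsoft x64 calling convention documentation. The names INTEGER_ARGS, VOLATILE, NONVOLATILE and describe are invented for this example.

```python
# Integer register roles in the Windows x64 calling convention.
# All names here are illustrative; only general-purpose registers are covered.
INTEGER_ARGS = ["rcx", "rdx", "r8", "r9"]
VOLATILE = {"rax", "rcx", "rdx", "r8", "r9", "r10", "r11"}
NONVOLATILE = {"rbx", "rbp", "rdi", "rsi", "rsp", "r12", "r13", "r14", "r15"}


def describe(reg):
    """Describe how a general-purpose register is treated across a call."""
    reg = reg.lower()
    if reg in INTEGER_ARGS:
        return f"volatile, integer argument {INTEGER_ARGS.index(reg) + 1}"
    if reg == "rax":
        return "volatile, return value"
    if reg in VOLATILE:
        return "volatile"
    if reg in NONVOLATILE:
        return "non-volatile, callee must preserve"
    raise ValueError(f"not a general-purpose register: {reg}")


if __name__ == "__main__":
    for name in ("rax", "rcx", "r9", "r11", "rbx", "r15"):
        print(f"{name}: {describe(name)}")
```

Zeroing a register from the volatile group at a function boundary is harmless by definition, while zeroing one from the non-volatile group can break whatever caller was relying on it.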

What does write_cr0(read_cr0() | 0x10000) do?

I searched the web a lot but didn't find a short explanation of what write_cr0(read_cr0() | 0x10000) really does. It is related to the Linux kernel, and I'm curious about developing LKMs. I want to know what this really does and what the security issues with it are.
It is used to remove the write protection on the syscall table.
But how does it really work, and what does each part of this line do?
CR0 is one of the control registers available on x86 CPUs, which contains flags controlling CPU features related to memory protection, multitasking, paging, etc. You can find a full description in Volume 3, Section 2.5 of Intel's Software Developer's Manual.
These registers are accessed by special instructions that the compiler doesn't normally generate, so read_cr0() is a function which executes the instruction to read this register (via inline assembly) and returns the result in a general-purpose register. Likewise, write_cr0() writes to this register.
The function calls are likely to be inlined, so that the generated code would be something like
mov eax, cr0
or eax, 0x10000
mov cr0, eax
The OR with 0x10000 sets bit 16, the Write Protect bit. On early 32-bit x86 CPUs, code running at supervisor level (like the kernel) was always allowed to write all of virtual memory, regardless of whether the page was marked read-only. This bit makes that optional, so that when it is set, such accesses will cause page faults. This line of code probably follows an earlier line which temporarily cleared the bit.
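The bit arithmetic can be checked without touching a real control register. The sketch below only models CR0 as a plain integer; CR0_WP and the helper names are chosen for this example, and 0x80050033 is just a sample value with the WP bit set.

```python
# Model of the CR0 Write Protect bit arithmetic on a plain integer.
# Nothing here reads or writes the real register.
CR0_WP = 1 << 16  # bit 16, equal to 0x10000


def set_wp(cr0):
    """What write_cr0(read_cr0() | 0x10000) computes: WP forced on."""
    return cr0 | CR0_WP


def clear_wp(cr0):
    """The opposite operation: WP forced off, all other bits unchanged."""
    return cr0 & ~CR0_WP


def wp_enabled(cr0):
    """True if supervisor-mode writes to read-only pages would fault."""
    return bool(cr0 & CR0_WP)


if __name__ == "__main__":
    sample = 0x80050033            # sample value with WP set
    cleared = clear_wp(sample)     # 0x80040033
    restored = set_wp(cleared)     # back to 0x80050033
    print(hex(cleared), wp_enabled(cleared))
    print(hex(restored), wp_enabled(restored))
```

Setting the bit is idempotent, so the line in the question leaves CR0 unchanged if WP was already on.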

saving general purpose registers in switch_to() in linux 2.6

I saw the code of switch_to in the article "Evolution of the x86 context switch in Linux" in the link https://www.maizure.org/projects/evolution_x86_context_switch_linux/
Most versions of switch_to only save/restore ESP/RSP and/or EBP/RBP, not other call-preserved registers in the inline asm. But the Linux 2.2.0 version does save them in this function, because it uses software context switching instead of relying on hardware TSS stuff. Later Linux versions still do software context switching, but don't have these push / pop instructions.
Are the registers saved in another function (maybe in the schedule() function)? Or is there no need to save these registers in the kernel context?
(I know that those registers of the user context are saved in the kernel stack when the system enters kernel mode).
Linux versions before 2.2.0 use hardware task switching, where the TSS saves/restores registers for you. That's what the "ljmp %0\n\t" is doing. (ljmp is AT&T syntax for a far jmp, presumably to a task gate). I'm not really familiar with hardware TSS stuff because it's not very relevant; it's still used in modern kernels for getting RSP pointing to the kernel stack for interrupt handlers, but not for context switching between tasks.
Hardware task switching is slow, so later kernels avoid it. Linux 2.2 does save/restore the call-preserved registers manually, with push/pop before/after swapping stacks. EAX, EDX, and ECX are declared as dummy outputs ("=a" (eax), "=d" (edx), "=c" (ecx)) so the compiler knows that the old values of those registers are no longer available.
This is a sensible choice because switch_to is probably used inside a non-inline function. The caller will make a function call that eventually returns (after running another task for a while) with the call-preserved registers restored, and the call-clobbered registers clobbered, just like a regular function call. (So compiler code-gen for the function that uses the switch_to macro doesn't need to emit save/restore code outside of the inline asm). If you think about writing a whole context switch function in asm (not inline asm), you'd get this clobbering of volatile registers for free because callers expect that.
So how do later kernels avoid saving/restoring those registers in inline asm?
Linux 2.4 uses "=b" (last) as an output operand, so the compiler has to save/restore EBX in a function that uses this asm. The asm still saves/restores ESI, EDI, and EBP (as well as ESP). The text of the article notes this:
The 2.4 kernel context switch brings a few minor changes: EBX is no longer pushed/popped, but it is now included in the output of the inline assembly. We have a new input argument.
I don't see where they tell the compiler about EAX, ECX, and EDX not surviving, so that's odd. It might be a bug that they get away with by making the function noinline or something?
Linux 2.6 on i386 uses more output operands that get the compiler to handle the save/restore.
But Linux 2.6 for x86-64 introduces the trick that hands off the save/restore to the compiler easily:
#define __EXTRA_CLOBBER ,"rcx","rbx","rdx","r8","r9","r10", "r11","r12","r13","r14","r15"
Notice the clobbers declaration:
: "memory", "cc" __EXTRA_CLOBBER
This tells the compiler that the inline asm destroys all those registers, so the compiler will emit instructions to save/restore these registers at the start/end of whatever function switch_to ultimately inlines into.
Telling the compiler that all the registers are destroyed after a context switch solves the same problem as manually saving/restoring them with inline asm. The compiler will still make a function that obeys the calling convention.
The context-switch swaps to the new task's stack, so the compiler-generated save/restore code is always running with the appropriate stack pointer. Notice that the explicit push/pop instructions inside the inline asm in Linux 2.2 and 2.4 are before / after everything else.

Hooking Windows Kernel Dispatcher for System Calls

I'm trying to hook the SYSENTER dispatch function from the kernel. Over the past few days I have been studying what happens when a program executes SYSENTER to enter the kernel, and I realized that IA32_SYSENTER_EIP and IA32_SYSENTER_ESP are responsible for setting the kernel RIP and RSP after SYSENTER.
Yesterday I read what the Intel Software Developer Manuals say about SWAPGS:
SWAPGS exchanges the current GS base register value with the value contained in MSR address C0000102H (IA32_KERNEL_GS_BASE). The SWAPGS instruction is a privileged instruction intended for use by system software.
When using SYSCALL to implement system calls, there is no kernel stack
at the OS entry point. Neither is there a straightforward method to
obtain a pointer to kernel structures from which the kernel stack
pointer could be read. Thus, the kernel cannot save general purpose
registers or reference memory.
From the second paragraph ("there is no kernel stack at the OS entry point"), it seems that the OS kernel executes SWAPGS to set GS and then obtains the kernel stack pointer through it. But as I read it, on SYSENTER the kernel RIP (EIP) and RSP (ESP) are set from IA32_SYSENTER_EIP and IA32_SYSENTER_ESP, so the kernel already has its stack pointer in IA32_SYSENTER_ESP!
My Questions :
If kernel stack address should come from GS then what's the purpose of IA32_SYSENTER_ESP?
What are differences between AMD LSTAR (0xC0000082) and IA32_SYSENTER_EIP? I ask it because I saw Windows set 0xc0000082 on my Intel processor.
Is there any special problem with hooking the kernel's SYSENTER dispatcher? I ask because whenever I put a breakpoint in the Windows function responsible for dispatching SYSENTER calls (KiSystemCall64Shadow) on a remote debugging machine (not a VM), it causes a BSOD with UNEXPECTED_KERNEL_MODE_TRAP.

Simple "Hello-World", null-free shellcode for Windows needed

I would like to test a buffer-overflow by writing "Hello World" to console (using Windows XP 32-Bit). The shellcode needs to be null-free in order to be passed by "scanf" into the program I want to overflow. I've found plenty of assembly-tutorials for Linux, however none for Windows. Could someone please step me through this using NASM? Thxxx!
Assembly opcodes are the same, so the regular tricks to produce null-free shellcodes still apply, but the way to make system calls is different.
In Linux you make system calls with the "int 0x80" instruction, while on Windows you must use DLL libraries and do normal usermode calls to their exported functions.
For that reason, on Windows your shellcode must either:
Hardcode the Win32 API function addresses (most likely will only work on your machine)
Use a Win32 API resolver shellcode (works on every Windows version)
If you're just learning, for now it's probably easier to just hardcode the addresses you see in the debugger. To make the calls position independent you can load the addresses in registers. For example, a call to a function with 4 arguments:
PUSH 4 ; argument #4 to the function
PUSH 3 ; argument #3 to the function
PUSH 2 ; argument #2 to the function
PUSH 1 ; argument #1 to the function
MOV EAX, 0xDEADBEEF ; put the address of the function to call
CALL EAX
Note that the arguments are pushed in reverse order. After the CALL instruction EAX contains the return value, and the stack will be just like it was before (i.e. the function pops its own arguments). The ECX and EDX registers may contain garbage, so don't rely on them keeping their values after the call.
A direct CALL instruction won't work, because those are position dependent.
To avoid zeros in the address itself, try any of the null-free tricks for x86 shellcode. There are many out there, but my favorite (albeit lengthy) is encoding the values using XOR instructions:
MOV EAX, 0xDEADBEEF ^ 0xFFFFFFFF ; your value xor'ed against an arbitrary mask
XOR EAX, 0xFFFFFFFF ; the arbitrary mask
You can also try NEG EAX or NOT EAX (sign inversion and bit flipping) to see if they work; they're much cheaper (two bytes each).
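Whether a given mask actually works for the XOR trick above depends on the value being encoded: both the mask and the XOR'ed result must be free of zero bytes. The following sketch checks that arithmetic only; the function names and the sample addresses are made up for illustration and are not real API addresses.

```python
# Check whether a 32-bit value can be XOR-encoded with a given mask so that
# neither immediate contains a 0x00 byte. Names and values are illustrative.

def has_null_byte(value):
    """True if any of the four bytes of a 32-bit value is 0x00."""
    return 0 in (value & 0xFFFFFFFF).to_bytes(4, "little")


def xor_encode(value, mask):
    """Return value ^ mask, or None if the mask or the result has a null byte."""
    encoded = (value ^ mask) & 0xFFFFFFFF
    if has_null_byte(mask) or has_null_byte(encoded):
        return None
    return encoded


if __name__ == "__main__":
    # A value with zero bytes becomes null-free under the all-ones mask.
    print(hex(xor_encode(0x00401000, 0xFFFFFFFF)))
    # A value containing an 0xFF byte does not: 0xFF ^ 0xFF is 0x00.
    print(xor_encode(0x7C80FF12, 0xFFFFFFFF))
    # A different mask handles that case.
    print(hex(xor_encode(0x7C80FF12, 0x11111111)))
```

XOR'ing the encoded value with the same mask at run time restores the original, which is what the MOV/XOR pair does.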
You can get help on the different API functions you can call here: http://msdn.microsoft.com
The most important ones you'll need are probably the following:
WinExec(): http://msdn.microsoft.com/en-us/library/ms687393(VS.85).aspx
LoadLibrary(): http://msdn.microsoft.com/en-us/library/windows/desktop/ms684175(v=vs.85).aspx
GetProcAddress(): http://msdn.microsoft.com/en-us/library/ms683212%28v=VS.85%29.aspx
The first launches a command, the next two are for loading DLL files and getting the addresses of its functions.
Here's a complete tutorial on writing Windows shellcodes: http://www.codeproject.com/Articles/325776/The-Art-of-Win32-Shellcoding
Assembly language is defined by your processor, and assembly syntax is defined by the assembler (hence AT&T and Intel syntax). The main difference is in how you reach the operating system. Real-mode DOS programs called interrupts directly and could touch all the memory in the machine, but 32-bit Windows (including XP) runs programs in protected mode just like Linux: you only have access to your own process's little cubby of memory. On Linux you call int 0x80 to reach the kernel, while on Windows you go through the system DLLs instead of making calls to the hardware and BIOS. Anyway, apart from the system calls, hello world type stuff would more or less be the same between Linux and Windows, as long as the processors are compatible.
To get the shellcode from the program you've made, just load it into your target system's debugger (gdb for Linux, and debug for Windows). In debug, type d (or was it u? Anyway, it should say if you type h (help)), and between the instructions and the memory addresses will be the opcodes.
Just copy them all over to your text editor into one string, and maybe make a program that translates them all into their ascii values. Not sure how to do this in gdb tho...
Anyway, to make it into a bof exploit, enter aaaaa... and keep adding a's until it crashes from a buffer overflow error. Find exactly how many a's it takes to crash it. Then it should tell you what memory address that was, usually in the error message. If it says '6161[rest of original return address]' then you've got it (61 is the hex ASCII code for 'a', 97 in decimal). Now you have to use your debugger to find out where this was. Disassemble the program with your debugger and look for where scanf was called. Set a breakpoint there, run, and examine the stack. Look for all those 61's and see where they end. Then remove the breakpoint and type the number of a's you found it took (exactly that number; if the error message was "buffer overflow at '61[rest of original return address]'", then remove that last a), put in the address you found examining the stack, and insert your shellcode. If all goes well, you should see your shellcode execute.
Happy hacking...

Resources