Windows initial execution context

Once Windows has loaded an executable into memory and transferred execution to its entry point, are the values in the registers and on the stack meaningful? If so, where can I find more information about this?

Officially, the registers at the entry point of a PE file do not have defined values. You're supposed to use APIs such as GetCommandLine to retrieve the information you need. However, since the kernel function that eventually transfers control to the entry point has not changed much from the old days, some PE packers and malware have started to rely on its peculiarities. The two more or less reliable registers are:
EAX points to the entry point of the application (because the kernel function uses call eax to jump to it)
EBX points to the Process Environment Block (PEB).
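None of this is guaranteed by Windows, but as a rough illustration, here is a minimal MSVC-specific, 32-bit-only sketch that captures EAX and EBX at a custom entry point (RawEntry is a made-up name; the project would be linked with /ENTRY:RawEntry and without the CRT so that no startup code runs first):
#include <windows.h>
// Hedged sketch: relies on the undocumented convention described above,
// and on the compiler not touching EAX/EBX before the __asm block runs.
void *g_eaxAtEntry, *g_ebxAtEntry;
extern "C" void RawEntry()
{
    __asm {
        mov g_eaxAtEntry, eax   // expected: the entry point itself
        mov g_ebxAtEntry, ebx   // expected: the Process Environment Block
    }
    if (IsDebuggerPresent())
        DebugBreak();           // inspect g_eaxAtEntry / g_ebxAtEntry here
    ExitProcess(0);
}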

Chapter 5 of Windows Internals, Fifth Edition covers in detail how Windows creates a process. That should give you more information about how Windows loads an executable into memory and transfers execution to the entry point.
I found this up-to-date reference that covers how registers are used in various calling conventions on various operating systems and by various compilers. It's quite detailed, and seems comprehensive:
Agner Fog's Calling Conventions document

Related

Why does Windows randomise the PEB and TEB?

According to a report by Symantec:
The address of an operating system structure known as the Process Environment Block (PEB) is also selected randomly. The PEB randomization feature was introduced earlier in Windows XP SP2 and Windows 2003 SP1, and is also present in Windows Vista. Although implemented separately, it is also a form of address space randomization; but unlike the other ASLR features, PEB randomization occurs whether or not the executable being loaded elected to use the ASLR feature.
When a process in Windows is to be randomised, the base addresses of the stack and the heap are randomised for obvious reasons.
But why is the PEB/TEB also randomised? What is the benefit here?
The way I see it, you randomise the base of the heap and the stack so that if an attacker gains the ability to execute shellcode on behalf of the process, he cannot rely on known pointers. The attacker would get control over the IP, but he would not know where to jump next.
But what is the point with the PEB? To access the PEB you go through the FS segment (whose base is set up via the GDT) with code similar to this:
mov eax, fs:[0x30]
Since access to the PEB is already abstracted this way, once you gain code execution you can reach the PEB with a few instructions, without having to worry about its actual address, which ASLR randomised.
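For reference, here is a minimal 32-bit sketch of that abstracted access, using the MSVC intrinsic that reads a dword from fs (on x64 the equivalent would be __readgsqword(0x60)):
#include <windows.h>
#include <winternl.h>
#include <intrin.h>
#include <stdio.h>
int main()
{
    // In a 32-bit process fs:[0x30] is TEB->ProcessEnvironmentBlock, so the
    // PEB is reachable without knowing its randomised address in advance.
    PEB *peb = (PEB *)__readfsdword(0x30);
    printf("PEB is at %p\n", peb);
    printf("BeingDebugged = %u\n", (unsigned)peb->BeingDebugged);
    return 0;
}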
So, what is the benefit of PEB randomization?

WinDbg not showing register values

Basically, this is the same question that was asked here.
When performing kernel debugging of a machine running Windows 7 or older, with WinDbg version 6.2 and up, the debugger doesn't show anything in the registers window. Pressing the Customize... button results in a message box that reads Registers are not yet known.
At the same time, issuing the r command results in perfectly valid register values being printed out.
What is the reason for this behaviour, and can it be fixed?
TL;DR: I wrote an extension DLL that fixes the bug. Available here.
The Problem
To understand the problem, we first need to understand that WinDbg is basically just a frontend to Microsoft's Windows Symbolic Debugger Engine, implemented inside dbgeng.dll. Other frontends include the command-line kd.exe (kernel debugger) and cdb.exe (user-mode debugger).
The engine implements everything we expect from a debugger: working with symbol files, reading and writing memory and registers, setting breakpoints, etc. The engine then exposes all of this functionality through COM-like interfaces (they implement IUnknown but are not registered components). This allows us, for instance, to write our own debugger (like this person did).
Armed with this knowledge, we can now make an educated guess as to how WinDbg obtains the values of the registers on the target machine.
The engine exposes the IDebugRegisters interface for manipulating registers. This interface declares the GetValues method for retrieving the values of multiple registers in one go. But how does WinDbg know how many registers there are? That's why we have the GetNumberRegisters method.
So, to retrieve the values of all registers on the target, we'll have to do something like this:
Call IDebugRegisters::GetNumberRegisters to get the total number of registers.
Call IDebugRegisters::GetValues with the Count parameter set to the total number of registers, the Indices parameter set to NULL, and the Start parameter set to 0.
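In code, with the usual DebugCreate/IDebugClient setup omitted, that boils down to something like this sketch:
#include <windows.h>
#include <dbgeng.h>
#include <vector>
#include <stdio.h>
// Sketch only: 'regs' is assumed to come from QueryInterface on a client
// that is already attached to the target; error handling is trimmed.
void DumpAllRegisters(IDebugRegisters *regs)
{
    ULONG count = 0;
    regs->GetNumberRegisters(&count);              // step 1
    std::vector<DEBUG_VALUE> values(count);
    // step 2: Indices == NULL and Start == 0 requests registers 0..count-1
    HRESULT hr = regs->GetValues(count, NULL, 0, values.data());
    printf("GetValues for %lu registers returned 0x%08lX\n",
           count, (unsigned long)hr);
}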
One tiny problem, though: the second call fails with E_INVALIDARG.
Ehm, excuse me? How can it fail? Especially puzzling is the documentation for this return value:
The value of the index of one of the registers is greater than the number of registers on the target machine.
But I just asked you how many registers there are, so how can that value be out of range? Okay, let's continue reading the docs anyway, maybe something will become clear:
If the return value is not S_OK, some of the registers still might have been read. If the target was not accessible, the return type is E_UNEXPECTED and Values is unchanged; otherwise, Values will contain partial results and the registers that could not be read will have type DEBUG_VALUE_INVALID.
(Emphasis mine.)
Aha! So maybe the engine just couldn't read one of the registers! But which one? Turns out that the engine chokes on the xcr0 register. From the Intel 64 and IA-32 Architectures Software Developer’s Manual:
Extended control register XCR0 contains a state-component bitmap that specifies the user state components that software has enabled the XSAVE feature set to manage. If the bit corresponding to a state component is clear in XCR0, instructions in the XSAVE feature set will not operate on that state component, regardless of the value of the instruction mask.
Okay, so the register controls the operation of the XSAVE instruction, which saves the state of the CPU's extended features (like XMM and AVX). According to the last comment on this page, this instruction requires some support from the operating system. Although the comment states that Windows 7 (that's what the VM I was testing on was running) does support this instruction, it seems that the issue at hand is related to the OS anyway, as when the target is Windows 8 everything works fine.
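As an aside, you can inspect XCR0 from user mode yourself with the XGETBV intrinsic; the sketch below only works once the OS has actually enabled XSAVE (CR4.OSXSAVE), which is exactly the kind of operating-system dependency mentioned above:
#include <intrin.h>
#include <immintrin.h>
#include <stdio.h>
int main()
{
    int info[4];
    __cpuid(info, 1);
    // CPUID.1:ECX bit 27 (OSXSAVE) means the OS has enabled XSAVE/XGETBV.
    if (!(info[2] & (1 << 27))) {
        printf("OS has not enabled XSAVE; XGETBV would fault here\n");
        return 1;
    }
    unsigned long long xcr0 = _xgetbv(0);   // index 0 selects XCR0
    printf("XCR0 = 0x%llx (bit 0: x87, bit 1: SSE, bit 2: AVX)\n", xcr0);
    return 0;
}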
Really, it's unclear whether the bug is within the debugger engine, which reports more registers than it can retrieve values for, or within WinDbg, which refuses to show any values at all if the engine fails to produce all of them.
The Solution
We could, of course, bite the bullet and just use an older version of WinDbg for debugging older Windows versions. But where's the challenge in that?
Instead, I present to you a debugger extension that solves this problem. It does so by hooking (with the help of this library) the relevant debugger engine methods and returning S_OK if the only register that failed was xcr0. Otherwise, it propagates the failure. The extension supports runtime unload, so if you experience problems you can always disable the hooks.
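Stripped of the actual hooking machinery (not reproduced here), the core idea looks roughly like this; the names, and how the original method pointer is obtained, are placeholders:
#include <windows.h>
#include <dbgeng.h>
typedef HRESULT (STDMETHODCALLTYPE *GetValues_t)(
    IDebugRegisters *This, ULONG Count, PULONG Indices, ULONG Start,
    PDEBUG_VALUE Values);
static GetValues_t g_OrigGetValues;   // filled in by the hooking library
static HRESULT STDMETHODCALLTYPE GetValuesHook(
    IDebugRegisters *This, ULONG Count, PULONG Indices, ULONG Start,
    PDEBUG_VALUE Values)
{
    HRESULT hr = g_OrigGetValues(This, Count, Indices, Start, Values);
    if (hr == S_OK || Indices != NULL)
        return hr;
    // On partial failure the engine still fills Values and marks unreadable
    // registers as DEBUG_VALUE_INVALID. If exactly one register failed (the
    // real extension additionally checks that it is xcr0), pretend everything
    // succeeded so the frontend shows the rest.
    ULONG invalid = 0;
    for (ULONG i = 0; i < Count; i++)
        if (Values[i].Type == DEBUG_VALUE_INVALID)
            invalid++;
    return invalid == 1 ? S_OK : hr;
}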
That's it, have fun!

Can I dump/modify the content of x86 CPU cache/TLB

Can any application, or the system kernel, access or even modify the contents of the CPU cache and/or TLB?
I found a short description of the CPU cache on this website:
"No programming language has direct access to CPU cache. Reading and writing the cache is something done automatically by the hardware; there's NO way to write instructions which treat the cache as any kind of separate entity. Reads and writes to the cache happen as side-effect to all instructions that touch memory."
From this, it seems there is no way to read or write the contents of the CPU cache/TLB.
However, I have also come across information that conflicts with the above: it implies that a debug tool may be able to dump/show the contents of the CPU cache.
I'm confused, so please help me.
I got some answers from another post: dump the contents of TLB buffer of x86 CPU. Thanks adamdunson.
People could read this document about test registers, but they are only available on very old x86 machines.
Another description, from Wikipedia (https://en.wikipedia.org/wiki/Test_register):
A test register, in the Intel 80486 processor, was a register used by the processor, usually to do a self-test. Most of these registers were undocumented, and used by specialized software. The test registers were named TR3 to TR7. Regular programs don't usually require these registers to work. With the Pentium, the test registers were replaced by a variety of model-specific registers (MSRs).
Two test registers, TR6 and TR7, were provided for the purpose of testing. TR6 was the test command register, and TR7 was the test data register. These registers were accessed by variants of the MOV instruction. A test register may either be the source operand or the destination operand. The MOV instructions are defined in both real-address mode and protected mode. The test registers are privileged resources. In protected mode, the MOV instructions that access them can only be executed at privilege level 0. An attempt to read or write the test registers when executing at any other privilege level causes a general protection exception. Also, those instructions generate invalid opcode exception on any CPU newer than 80486.
In fact, I'm still hoping for similar functionality on Intel i7 or i5 processors. Unfortunately, I have not found any related documentation. If anyone has such information, please let me know.
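For what it's worth, modern Intel CPUs do let you enumerate the cache hierarchy (levels, ways, line sizes) from user mode through CPUID leaf 4, but that is descriptive information only, not a way to dump or modify the contents; a quick sketch:
#include <intrin.h>
#include <stdio.h>
int main()
{
    // CPUID leaf 4 enumerates deterministic cache parameters, one subleaf
    // per cache, until the cache type field (EAX[4:0]) reads 0.
    for (int sub = 0; ; sub++) {
        int r[4];
        __cpuidex(r, 4, sub);
        int type = r[0] & 0x1F;
        if (type == 0)
            break;
        int level      = (r[0] >> 5) & 0x7;
        int lineSize   = (r[1] & 0xFFF) + 1;
        int partitions = ((r[1] >> 12) & 0x3FF) + 1;
        int ways       = ((r[1] >> 22) & 0x3FF) + 1;
        int sets       = r[2] + 1;
        printf("L%d %s cache: %d KiB (%d-way, %d-byte lines)\n",
               level,
               type == 1 ? "data" : type == 2 ? "instruction" : "unified",
               ways * partitions * lineSize * sets / 1024, ways, lineSize);
    }
    return 0;
}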

Windows memory segmentation & OllyDbg

A few questions about Windows memory segmentation.
Every process in Windows gets its own virtual memory. Does that mean each process has its own task
(I mean its own task descriptor or task gate)?
I opened a simple exe with OllyDbg and saw that each CALL instruction to a DLL function takes me to a jump table. The jump table has jump instructions into the DLLs, like this one:
JMP DWORD PTR DS:[402058]
My question is: why does it use the data segment, and not the CS selector, for the base address?
If I open the memory map and look at what is stored at 402058, I find that it contains resources.
If I understand correctly, the addresses of the DLL functions are stored in the DS?
I noticed that the memory map is organized by owner. Shouldn't it be organized by segments, with all the code in CS, data in DS, etc.?
Thank you.
1.
A process has its own virtual address space.
I do not understand what you're referring to as a "task descriptor or task gate", but the Windows operating system holds a descriptor for each process, called the Process Control Block, which contains information about the process (such as identification, access tokens, execution state, virtual memory mapping, etc.).
A Task is a logical unit that can be used to manage a single process, or multiple processes.
Job -> Tasks
Task -> Processes
Process -> Threads
2.
In the case you mentioned, which is common for compilers, the program uses the .DATA section to store the jump table after loading the function addresses.
The reason this happens in the first place is that the compiler cannot know the DLL's base address at compile time, so the address has to be fixed up at load time to point to the function. This is known as relocation.
In order to maintain the jump table separately from the code, compilers store it in the .DATA section. This way, we can also give it write permissions (usually the .DATA segment has write permissions) and modify it as necessary without sacrificing stability and security.
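To make that concrete, here is a small C++ analogy (all names made up) of what the compiler and loader arrange: the actual pointer lives in a writable slot, and the call site only ever jumps through that slot:
#include <stdio.h>
static int RealDllFunction(int x)      // stands in for the code exported by the DLL
{
    return x * 2;
}
// Writable slot, playing the role of the table entry at an address like 402058.
static int (*ImportSlot)(int) = nullptr;
static int CallViaThunk(int x)         // plays the role of JMP DWORD PTR DS:[402058]
{
    return ImportSlot(x);
}
int main()
{
    ImportSlot = RealDllFunction;      // what the loader does when it resolves the import
    printf("%d\n", CallViaThunk(21));  // prints 42
    return 0;
}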
3.
Each module loaded in the process's virtual address space contains its own sections - that's why you see a different set of .text, .data, .reloc, etc. for each module. The "Owner" column is the module name.
P.S. Please ask one question per post - that way it will be easily accessible to other users after it is answered, and each question will likely get more accurate answers.

How does Windows protect transition into kernel mode?

How does Windows prevent a user-mode thread from arbitrarily transitioning the CPU to kernel mode?
I understand these things are true:
User-mode threads DO actually transition to kernel-mode when a system call is made through NTDLL.
The transition to kernel-mode is done through processor-specific instructions.
So what is special about these system calls through NTDLL? Why can't the user-mode thread fake it and execute the processor-specific instructions to transition to kernel mode? I know I'm missing some key piece of Windows architecture here... what is it?
You're probably thinking that a thread running in user mode is calling into Ring 0, but that's not what's actually happening. The user-mode thread causes an exception that's caught by the Ring 0 code. The user-mode thread is halted and the CPU switches to a kernel/ring 0 thread, which can then inspect the context (e.g., call stack and registers) of the user-mode thread to figure out what to do. Before sysenter/syscall existed, it really was an exception (a software interrupt) rather than a special instruction designed specifically to invoke ring 0 code.
If you take the advice of the other responses and read the Intel manuals, you'll see that syscall/sysenter don't take any parameters - the OS decides what happens. You can't call arbitrary code. WinNT uses function numbers that map to the kernel-mode function the user-mode code will execute (for example, NtOpenFile is function 75h on my Windows XP machine). The numbers change all the time; one of the jobs of NTDLL is to map a function call to a function number, put it in EAX, point EDX at the incoming parameters, and then invoke sysenter.
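If you want to see this for yourself, dumping the first bytes of an ntdll export shows the pattern (the exact bytes and the service number vary by Windows version and architecture); a hedged sketch:
#include <windows.h>
#include <stdio.h>
int main()
{
    HMODULE ntdll = GetModuleHandleW(L"ntdll.dll");
    unsigned char *stub = (unsigned char *)GetProcAddress(ntdll, "NtOpenFile");
    if (!stub)
        return 1;
    // On 32-bit Windows the stub typically starts with B8 xx xx xx xx,
    // i.e. "mov eax, <service number>", before reaching sysenter/int 2Eh.
    for (int i = 0; i < 16; i++)
        printf("%02X ", stub[i]);
    printf("\n");
    return 0;
}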
Intel CPUs enforce security using what's called 'Protection Rings'.
There are 4 of these, numbered from 0 to 3. Code running in ring 0 has the highest privileges; it can (practically) do whatever it pleases with your computer. The code in ring 3, on the other hand, is always on a tight leash; it has only limited powers to influence things. And rings 1 and 2 are currently not used for any purpose at all.
A thread running in a higher-privileged ring (such as ring 0) can transition to a lower-privileged ring (such as ring 1, 2 or 3) at will. However, the transition the other way around is strictly regulated. This is how the security of highly privileged resources (such as memory) is maintained.
Naturally, your user mode code (applications and all) runs in ring 3 while the OS's code runs in ring 0. This ensures that the user mode threads can't mess with the OS's data structures and other critical resources.
For details on how all this is actually implemented you could read this article. In addition, you may also want to go through Intel Manuals, especially Vol 1 and Vol 3A, which you can download here.
This is the story for Intel processors. I'm sure other architectures have something similar going on.
I think (I may be wrong) that the mechanism which it uses for transition is simple:
User-mode code executes a software interrupt
This (interrupt) causes a branch to a location specified in the interrupt descriptor table (IDT)
The thing that prevents user-mode code from usurping this is as follows: you need to be privileged to write to the IDT, so only the kernel is able to specify what happens when an interrupt is executed.
Code running in User Mode (Ring 3) can't arbitrarily change to Kernel Mode (Ring 0). It can only do so using special routes -- jump gates, interrupts, and sysenter vectors. These routes are highly protected and input is scrubbed so that bad data can't (shouldn't) cause bad behavior.
All of this is set up by the kernel, usually on startup. It can only be configured in Kernel Mode so User-Mode code can't modify it.
It's probably fair to say that it does it in a (relatively) similar way to what Linux does. In both cases it's going to be CPU-specific, but on x86 it is probably either a software interrupt with the INT instruction, or the SYSENTER instruction.
The advantage of looking at how Linux does it is that you can do so without a Windows source licence.
The userspace source part is here at LXR, and for the kernel-space bit, look at entry_32.S and entry_64.S.
Under Linux on x86 there are three different mechanisms, int 0x80, syscall and sysenter.
A library called the vDSO, which is built at runtime by the kernel, is called by the C library to implement the syscall function; it uses a different mechanism depending on the CPU and on which system call it is. The kernel then has handlers for those mechanisms (if they exist on the specific CPU variant).
