ESP after SEH Exception with same program on different computers - stack-overflow

Below are two articles that exploit different programs using the value of ESP after an SEH exception. The first uses a chain of POPADs to walk back to a controllable part of the stack; the other demonstrates stack pivoting, finding suitable code inside a loaded module to pivot the stack to the desired location. These exploits rely on the ESP location after the SEH exception being the same every time, on every computer with the same OS, in order to produce a reliable exploit. Otherwise 38 POPADs or a particular ROP pivot might not land inside your buffer.
My question: I have three fully updated English Windows 10 64-bit VMs, no DEP. The same version of a test program is installed on all three. The program has no Rebase, no SafeSEH, no ASLR, and is not NX compatible. After an SEH exception, the distance from ESP to the start of the controllable buffer on the stack is different on all three VMs. Why am I not able to reproduce the results of these exploits with my VMs?
PUSHAD:
www.mattandreko.com/2013/04/06/buffer-overflow-in-hexchat-294/
ROP Stack Pivot:
thesprawl.org/research/corelan-tutorial-10-exercise-solution/#pivoting-to-the-stack
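For concreteness, here is a rough sketch of how that distance can be measured (32-bit MSVC assumed; the buffer, the deliberate access violation and the filter function are purely illustrative and not taken from the linked write-ups). The address of a local variable inside the filter approximates where the stack sits while the exception is being dispatched:

#include <windows.h>
#include <stdio.h>

static char *g_buffer;

static int filter(EXCEPTION_POINTERS *ep)
{
    int probe;                                  /* lives on the dispatch-time stack */
    printf("ESP at the faulting instruction : 0x%08lx\n",
           (unsigned long)ep->ContextRecord->Esp);
    printf("stack during dispatch (approx.) : %p\n", (void *)&probe);
    printf("buffer start                    : %p\n", (void *)g_buffer);
    printf("dispatch-to-buffer distance     : %ld bytes\n",
           (long)(g_buffer - (char *)&probe));
    return EXCEPTION_EXECUTE_HANDLER;
}

int main(void)
{
    char buffer[256];                           /* stand-in for the overflowed buffer */
    g_buffer = buffer;
    __try {
        *(volatile int *)0 = 0;                 /* deliberately raise an access violation */
    } __except (filter(GetExceptionInformation())) {
        puts("exception handled");
    }
    return 0;
}

Comparing these numbers across the three VMs should show whether the dispatch-time stack itself moves or whether it is the buffer's position that varies.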

Related

How is table-based exception handling better than 32-bit Windows SEH?

In 32-bit Windows (at least with Microsoft compilers), exception handling is implemented using a stack of exception frames allocated dynamically on the call stack; the top of the exception stack is pointed to by a TIB entry. The runtime cost is a couple of PUSH/POP instructions per function that needs to handle exceptions, spilling the variables accessed by the exception handler onto the stack, and when handling an exception, a simple linked list walk.
In both 64-bit Windows and the Itanium / System V x86-64 ABI, unwinding instead uses a big sorted list describing all the functions in memory. The runtime cost is some tables per every function (not just ones involved in exception handling), complications for dynamically generated code, and when handling an exception, walking the function list once per every active function regardless of whether it has anything to do with exceptions or not.
How is the latter better than the former? I understand why the Itanium model is cheaper in the common case than the traditional UNIX one based on setjmp/longjmp, but a couple of PUSHes and POPs plus some register spillage in 32-bit Windows doesn't seem that bad, given the (seemingly) much quicker and simpler handling it provides. (IIRC, Windows API calls routinely consume kilobytes of stack space anyway, so it's not like we gain much by forcing this data out into tables.)
In addition to optimizing the happy case, perhaps there was also a concern that buffer overflow vulnerabilities could corrupt the exception-handling information kept on the stack. If this information gets corrupted, it could lead to seriously confusing failures, or maybe even cause further errors (remember that std::terminate() is called if another exception gets thrown).
Source: http://www.osronline.com/article.cfm%5earticle=469.htm
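To make the 32-bit mechanism described in the question more tangible, here is a small sketch (32-bit MSVC assumed; the struct mirrors winnt.h's EXCEPTION_REGISTRATION_RECORD) that walks the per-thread chain whose head lives in the TIB at FS:[0]. A compiler using frame-based SEH links such a record into the chain in the prologue of each function that needs a handler (push handler / push fs:[0] / mov fs:[0], esp) and unlinks it on the way out:

#include <windows.h>
#include <stdio.h>
#include <intrin.h>

struct seh_record {
    struct seh_record *next;      /* previous registration, further up the stack */
    void              *handler;   /* language-specific handler for that frame */
};

int main(void)
{
    /* FS:[0] is the first field of the TIB: the head of the SEH chain */
    struct seh_record *r = (struct seh_record *)__readfsdword(0);
    while (r != NULL && r != (struct seh_record *)(ULONG_PTR)-1)   /* chain ends at 0xFFFFFFFF */
    {
        printf("registration at %p, handler %p\n", (void *)r, r->handler);
        r = r->next;
    }
    return 0;
}

The x64 scheme has no such per-frame records; the unwinder instead consults static function tables (for example via RtlLookupFunctionEntry), which is exactly the trade-off the question asks about.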

Performance difference between system call vs function call

I quite often hear driver developers say it's good to avoid kernel-mode switches as much as possible. I couldn't understand the precise reason. To start with, my understanding is:
System calls are software interrupts. On x86 they are triggered by the sysenter instruction, which essentially behaves like a branch that takes its target from a model-specific register.
System calls don't really have to change the address space or process context.
They do, however, save registers on the process stack and switch the stack pointer to the kernel stack.
In terms of these operations a syscall works pretty much like a normal function call, though sysenter could behave like a mispredicted branch and cause an ROB flush in the processor pipeline. Even that is not really bad; it's just like any other mispredicted branch.
I heard a few people answering on Stack Overflow:
You never know how long a syscall takes. - [me] Yeah, but that's the case with any function; the amount of time it takes depends on the function.
It is often a scheduling point. - [me] A process can get rescheduled even if it runs entirely in user mode; for example, while(1); doesn't guarantee the absence of a context switch.
Where is the actual syscall cost coming from?
You don't indicate what OS you are asking about. Let me attempt an answer anyway.
The CPU instructions syscall and sysenter should not be confused with the concept of a system call and its representation in the respective OSs.
The best explanation for the difference in the overhead incurred by each respective instruction is given by reading through the Operation sections of the Intel® 64 and IA-32 Architectures Software Developer's Manual, volume 2A (for int, see page 3-392) and volume 2B (for sysenter, see page 4-463). Also don't forget to glance at iretd and sysexit while you're at it.
A casual counting of the pseudo-code for the operations yields:
408 lines for int
55 lines for sysenter
Note: Although the existing answer is right in that sysenter and syscall are not interrupts or in any way related to interrupts, older kernels in the Linux and Windows worlds used interrupts to implement their system call mechanism. On Linux this used to be int 0x80 and on Windows int 0x2E. Consequently, on those kernel versions the IDT had to be primed to provide an interrupt handler for the respective interrupt. On newer systems the sysenter and syscall instructions have indeed completely replaced the old ways. With sysenter it's the MSR (model-specific register) 0x176 that gets primed with the address of the handler for sysenter (see the reading material linked below).
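As a concrete illustration of the legacy interrupt-based path just mentioned, this is roughly what a raw 32-bit Linux system call through int 0x80 looks like (a sketch; build with gcc -m32, and note that syscall number 4 is __NR_write on i386):

int main(void)
{
    const char msg[] = "hello via int 0x80\n";
    long ret;
    __asm__ volatile (
        "int $0x80"                        /* legacy software-interrupt gate */
        : "=a" (ret)                       /* result comes back in EAX */
        : "a" (4),                         /* EAX = __NR_write (i386) */
          "b" (1),                         /* EBX = fd (stdout) */
          "c" (msg),                       /* ECX = buffer */
          "d" (sizeof msg - 1)             /* EDX = length */
        : "memory");
    return ret < 0;
}

On a current kernel the same request would normally go through sysenter (via the vDSO's __kernel_vsyscall) or, in 64-bit code, the syscall instruction - the replacement described above.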
On Windows ...
A system call on Windows, just like on Linux, results in a switch to kernel mode. The NT scheduler doesn't provide any guarantees about the time a thread is granted. It also yanks away time from threads and can even end up starving them. In general one can say that user-mode code can be preempted by kernel-mode code (with very few, very specific exceptions, which you'll certainly get to in the "advanced driver writing class"). This makes perfect sense if we look at just one example. User-mode code can be swapped out - or, for that matter, the data it's trying to access. The CPU doesn't have the slightest clue how to access pages in the swap/paging file, so an intermediate step is required. That's also why kernel-mode code must be able to preempt user-mode code. It is also the reason for one of the most prolific bug-check codes seen on Windows, mostly caused by third-party drivers: IRQL_NOT_LESS_OR_EQUAL. It means that a driver accessed pageable memory at a moment when the page fault could not be serviced.
Further reading
SYSENTER and SYSEXIT in Windows by Geoff Chappell (always worth a read in my experience!)
Sysenter Based System Call Mechanism in Linux 2.6
Windows NT platform specific discussion: How Do Windows NT System Calls REALLY Work?
Windows NT platform specific discussion: System Call Optimization with the SYSENTER Instruction
Windows Internals, 5th ed., by Russinovich et al. - pages 125 through 132.
ReactOS implementation of KiFastSystemCall
SYSENTER/SYSCALL is not a software interrupt; the whole point of those instructions is to avoid the overhead of raising an interrupt and invoking the interrupt handler.
Saving registers on the stack costs time; this is one place the syscall cost comes from.
Another part comes from the kernel-mode switch itself. It involves changing the segment registers - CS, DS, ES, FS, GS all have to be changed (it's less costly on x86-64, since segmentation is mostly unused, but you still essentially need to make a far jump into kernel code) - and it also changes the CPU's ring of execution.
To conclude: a function call is (on modern systems, where segmentation is not used) a near call, while a syscall involves a far call and a ring switch.
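If you want to see the gap rather than argue about it, a crude measurement along these lines works (a sketch, Linux assumed; syscall(SYS_getpid) is used because glibc may cache getpid() itself):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

__attribute__((noinline)) static long plain_function(void) { return 42; }

static double elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    enum { N = 1000000 };
    struct timespec t0, t1;
    volatile long sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) sink += plain_function();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("function call: %.1f ns/iter\n", elapsed_ns(t0, t1) / N);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) sink += syscall(SYS_getpid);   /* a real kernel round trip */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("system call  : %.1f ns/iter\n", elapsed_ns(t0, t1) / N);

    (void)sink;
    return 0;
}

The absolute numbers depend heavily on the CPU and on mitigations such as KPTI; the point is only the order-of-magnitude difference between the near call and the ring switch.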

If malloc's can fail, how come stack variables initialization can't (at least we don't check that)?

Would the OS send a warning to the user before some threshold is reached, and would the application then actually crash if there is not enough memory to allocate the stack (local) variables of the current function?
Yes, you would get a Stack Overflow run-time error.
Side note: There is a popular web site named after this very error!
Stack allocation can fail and there's nothing you can do about it.
On a modern OS, a significant amount of memory will be committed for the stack to begin with (on Linux it seems to be 128k or so these days), and a (usually much larger, e.g. 8M on Linux, and usually configurable) range of virtual addresses will be reserved for stack growth. If you exceed the committed part, committing more memory could fail due to an out-of-memory condition and your program will crash with SIGSEGV. If you exceed the reserved address range, your program will definitely fail, possibly catastrophically if it ends up overwriting other data just below the stack address range.
The solution is not to do insane things with the stack. Even the initial committed amount on Linux (128k) is more stack space than you should ever use. Don't use call recursion unless you have a logarithmic bound on the number of call levels, don't use gigantic automatic arrays or structures (including ones that might result from user-provided VLA dimensions), and you'll be just fine.
Note that there is no portable and no future-safe way to measure current stack usage and remaining availability, so you just have to be safe about it.
Edit: One guarantee you do have about stack allocations, at least on real-world systems (without the split-stack hack), is that stack space you've already verified you have won't magically disappear. For instance, if you have once successfully called c() from b() from a() from main(), and they're not using any VLAs that could vary in size, a second repetition of the same call pattern in the same instance of your program won't fail. You can also find tools that perform static analysis on some programs (ones without fancy use of function pointers and/or recursion) and determine the maximum amount of stack space your program can ever consume, after which you could verify at program start that you can successfully use that much space before proceeding.
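There is, however, a non-portable way to at least see how large the reservation is. On Linux (and most POSIX systems) something like this reports the limit that the 8M figure above comes from (a sketch; this is the reserved size, not current usage):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) == 0) {
        if (rl.rlim_cur == RLIM_INFINITY)
            puts("stack size limit: unlimited");
        else
            printf("stack size limit: %llu bytes\n", (unsigned long long)rl.rlim_cur);
    }
    return 0;
}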
Well... semantically speaking, there is no stack.
From the point of view of the language, automatic storage just works and dynamic storage may fail in well-determined ways (malloc returns NULL, new throws a std::bad_alloc).
Of course, implementations will usually bring up a stack to implement the automatic storage, and one that is limited in size at that. However this is an implementation detail, and need not be so.
For example, gcc -fsplit-stack lets you have a split stack that grows as needed. This technique is quite recent for C and C++ AFAIK, but languages with continuations (and thousands or millions of them) like Haskell have it built in, and Go made a point of it too.
Still, at some point the memory will get exhausted if you keep hammering at it. This is actually undefined behavior, since the Standard does not attempt to deal with it at all. In that case, typically, the OS will send a signal to the program, the program will terminate, and the stack will not get unwound.
The process would get killed by the OS if it runs out of stack space.
The exact mechanics are OS-specific. For example, running out of stack space on Linux triggers a segfault.
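A minimal way to see that failure mode is unbounded recursion (a sketch; it deliberately crashes with SIGSEGV once the reserved stack range is exhausted):

#include <stdio.h>

static unsigned long recurse(unsigned long depth)
{
    volatile char pad[4096];               /* burn roughly a page per call */
    pad[0] = (char)depth;
    if ((depth & 0xff) == 0)
        fprintf(stderr, "depth %lu\n", depth);
    return recurse(depth + 1) + pad[0];    /* not a tail call, so frames really pile up */
}

int main(void)
{
    return (int)recurse(0);
}

With the common 8 MiB default (ulimit -s 8192) this dies after roughly two thousand calls.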
While the operating system may not inform you that you're out of stack space, you can check this yourself with a bit of inline assembly (32-bit MSVC shown here):
unsigned long StackSpace()
{
    unsigned long retn = 0;
    __asm                        // 32-bit MSVC inline assembly
    {
        mov eax, esp             // current stack pointer
        sub eax, FS:[0x08]       // minus the TIB's StackLimit (lowest committed stack address)
        mov retn, eax            // bytes of committed stack remaining below ESP
    }
    return retn;
}
You can determine the meaning of the FS:[...] offsets by referring to the Windows Thread Information Block (TIB).
Edit: fixed the operands of the subtraction XD
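If raw TIB offsets feel too fragile, a documented alternative is GetCurrentThreadStackLimits (kernel32, available since Windows 8 as far as I recall) - this is my addition, not part of the original answer. It returns the reserved range of the current thread's stack, so the figure below counts reserved rather than merely committed space:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG_PTR low = 0, high = 0;
    char probe;                                    /* its address approximates the stack pointer */
    GetCurrentThreadStackLimits(&low, &high);      /* reserved range of this thread's stack */
    printf("reserved stack size : %llu bytes\n", (unsigned long long)(high - low));
    printf("roughly remaining   : %llu bytes\n",
           (unsigned long long)((ULONG_PTR)&probe - low));
    return 0;
}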

in linux kernel, the data structure thread_struct contains both field esp0 and esp, what is the difference?

This is my guess:
esp0 is initialized with the top of the kernel stack when the kernel stack is allocated, and during a process switch it is used to initialize tss->esp0, so that when the context switches from user mode to kernel mode the kernel stack can be located. esp, on the other hand, is used to save the kernel stack top of the process that is being scheduled out during a process switch.
So esp0 in a thread_struct doesn't change once initialized, while esp changes.
Is my guess right?
The thread_struct structure contains two of these ESP fields, esp0 and esp. They relate to the four fields in the tss_segment_32 structure: esp0, esp1, esp2 and esp.
These actually exist in the TSS so it's very much something from Intel rather than something from Linus et al.
As to why the TSS contains them, the numbers are logical if you know how the protection model works under x86. They are, in fact, the ring levels (except for esp which is for ring level 3 despite the fact it's not actually called esp3).
In other words, they contain the stack pointer to be used in the ring that you're executing in. Since Linux only uses ring 0 (kernel mode) and ring 3 (user mode), esp0 and esp are the only ones that need to be saved.
As an aside, I think the only OS I've ever seen use another ring was OS/2 which used ring 2 for certain I/O operations. Processes that were allowed to perform those operations had to be specially marked and the OS would run them in ring 2 to allow unfettered I/O access, without being allowed to bring down the kernel.
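For reference, this is roughly the hardware 32-bit TSS layout the answer is talking about (a sketch following the Intel SDM; the field names are illustrative, and Linux's tss_segment_32 mirrors the same hardware layout):

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

struct tss32 {
    uint32_t prev_task_link;
    uint32_t esp0, ss0;             /* stack pointer/segment loaded on entry to ring 0 */
    uint32_t esp1, ss1;             /* ring 1 - unused by Linux */
    uint32_t esp2, ss2;             /* ring 2 - unused by Linux */
    uint32_t cr3;
    uint32_t eip, eflags;
    uint32_t eax, ecx, edx, ebx;
    uint32_t esp, ebp, esi, edi;    /* general-register save area; this esp is the field the answer calls the ring-3 one */
    uint32_t es, cs, ss, ds, fs, gs;
    uint32_t ldt_selector;
    uint16_t trap_flag, iomap_base;
};

int main(void)
{
    printf("esp0 at offset %zu, esp at offset %zu\n",
           offsetof(struct tss32, esp0), offsetof(struct tss32, esp));
    return 0;
}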

Limited stack trace in Process Explorer

I have a process running under Windows Server 2003 SP2. When I check the stack trace of one of its threads, it is always limited to 9 entries. The entries are resolved correctly (I have PDBs in place), but the list is simply cut off in the middle.
Do you know of any limitation in Process Explorer?
I am assuming that you think the complete stack trace for this thread should have more than 9 entries. You don't mention whether this is a 32-bit or 64-bit OS, so I will assume 32-bit and cover 64-bit as an afterthought.
Sometimes when collecting a stack trace on 32-bit systems you cannot collect any items for the stack trace, or you can only collect a limited amount of stack frame information even though you know the callstack is deeper. The reasons for this are:
Different calling conventions put data in different places on the stack, making it hard to walk the stack. I can think of four, three in common use and one more exotic: cdecl, fastcall, stdcall, naked.
For release builds, the code optimizer may do away with frame pointers using a technique known as Frame Pointer Omission (FPO). Without the FPO data (and sometimes even with FPO data in a PDB file) you cannot successfully walk the callstack (a small sketch of such a frame-pointer walk follows at the end of this answer).
Hooks - any helper DLLs, anti-virus, debugging hooks, instrumented code, malware, etc., may mess up the callstack at some point because they've inserted their own stub code on the callstack, and that small section may not be walkable by the stack walker.
Bytecode virtual machines. Depending upon how the virtual machine is written, the VM may place trampolines on the callstack to aid its execution. These will make the stack hard to walk successfully.
Because of the variety of calling conventions on 32 bit Windows (from both Microsoft and other vendors) it is hard to work out what to expect when you move from one frame to another.
For 64 bit systems there is one calling convention specified. That makes life a lot easier. That said, you still have the issues of helper DLLs and hooks doing their own thing with the stack and that may still cause you problems when walking the stack.
I doubt there is a limitation in Process Explorer. I think the issue is just that walking the callstack for that thread is problematic because of one of the reasons I've listed above.
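To see why FPO (mentioned above) breaks things, here is the naive frame-pointer walk a stack walker effectively performs when frame pointers are present (a sketch for GCC/Clang; build with -fno-omit-frame-pointer). With frame pointers omitted, the saved-frame-pointer chain simply isn't there and the walk stops short - the same kind of truncation described in the question:

#include <stdio.h>

struct frame {
    struct frame *prev;    /* saved frame pointer of the caller */
    void         *ret;     /* return address pushed by the call */
};

static void walk(void)
{
    struct frame *f = (struct frame *)__builtin_frame_address(0);
    for (int depth = 0; f != NULL && depth < 16; ++depth) {
        printf("frame %2d: return address %p\n", depth, f->ret);
        f = f->prev;       /* follow the saved frame pointer up the stack */
    }
}

__attribute__((noinline)) static void c(void) { walk(); }
__attribute__((noinline)) static void b(void) { c(); }
__attribute__((noinline)) static void a(void) { b(); }

int main(void)
{
    a();
    return 0;
}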
