I am debugging a program using WinDbg.
At the crash site, the last two frames of call stack are:
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
0251bfe8 6031f8da npdf!ProvideCoreHFT2+0x24db0
0251c000 011eb7a5 npdf!ProvideCoreHFT2+0x5ac1a
...
I want to find out how frame 1 calls frame 0. Since the return address of frame 0 is 6031f8da, I opened the disassembly window and jumped to that location. The code there is:
...
6031f8d5 e8a6d0ffff call npdf!ProvideCoreHFT2+0x57cc0 (6031c980)
6031f8da 5f pop edi
...
My question: the call instruction right before the return address calls npdf!ProvideCoreHFT2+0x57cc0, while the function in frame 0 is actually npdf!ProvideCoreHFT2+0x24db0. Why does this inconsistency exist? How should I proceed?
Thank you very much!
I was trying to implement a stack tracer using the stack pointers RSP and RBP, but I think debuggers use an entirely different way to grab the return addresses, or maybe I am missing something. I can grab the return address of the last stack frame, but I can't get the others because I don't know the size of the other stack frames, so I can't figure out how many bytes to go back from a stack frame to reach the return address. Does anybody know how debuggers trace the stack?
It is possible to trace the stack when the code uses frame pointers. In this case ebp/rbp is used as the frame pointer and functions begin with prologs and end with epilogs.
A typical prolog looks like this:
push rbp ; save previous frame pointer
mov rbp, rsp ; initialize this function's frame pointer
A typical epilog looks like this:
mov rsp, rbp ; restore the value of rsp
pop rbp ; restore previous frame pointer value from stack
retn
Thus, anywhere in the function body (after the prolog and before the epilog), rbp points to the stack slot where the previous frame pointer is saved, and rbp+8 holds the saved return address.
To find the calling function, a debugger reads the value at [rbp+8] and looks up the function that contains this address. This can be done by searching the debugging symbols.
Next, it reads the value at [rbp] to get the frame pointer of the caller, and repeats the process until it reaches a top-level function. This is typically a system library function that starts threads.
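The walk described above can be sketched in C. This is a minimal illustration, not how any particular debugger is implemented: frame_record, walk_frames and demo_walk are hypothetical names, and the three-frame "stack" with return addresses 0x1000 to 0x3000 is made up so that the example does not depend on compiler flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two words at the base of a frame when frame pointers are in use:
   [rbp] holds the caller's saved rbp, [rbp+8] holds the return address. */
struct frame_record {
    const struct frame_record *prev; /* saved frame pointer of the caller */
    uintptr_t return_address;        /* where execution resumes in the caller */
};

/* Follow the saved-frame-pointer chain starting at fp and store up to max
   return addresses in out. Returns the number of addresses stored. */
size_t walk_frames(const struct frame_record *fp, uintptr_t *out, size_t max)
{
    size_t n = 0;
    assert(out != NULL || max == 0);
    while (fp != NULL && n < max) {
        out[n++] = fp->return_address;
        /* The stack grows down, so a caller's frame must sit at a higher
           address; anything else suggests a corrupt chain, so stop. */
        if (fp->prev != NULL && fp->prev <= fp)
            break;
        fp = fp->prev;
    }
    return n;
}

/* Hypothetical three-frame stack (made-up addresses) for demonstration.
   The frames live in one array so the pointer comparison is well defined. */
size_t demo_walk(uintptr_t *out, size_t max)
{
    static struct frame_record stack[3];
    stack[2].prev = NULL;      stack[2].return_address = 0x3000; /* outermost */
    stack[1].prev = &stack[2]; stack[1].return_address = 0x2000;
    stack[0].prev = &stack[1]; stack[0].return_address = 0x1000; /* innermost */
    return walk_frames(&stack[0], out, max);
}
```

On a live x86-64 thread the starting pointer would be the current value of rbp; with GCC or Clang, __builtin_frame_address(0) typically returns it when the code is built with -fno-omit-frame-pointer. Code compiled without frame pointers cannot be walked this way, which is why debuggers rely on unwind information in that case.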
I've got a custom implementation of detours on macOS and a test application using it, which is written in C, compiled for macOS x86_64, running on an Intel i9 processor.
The implementation works fine with a multitude of functions. However, if I detour pthread_create, I encounter strange behaviour: threads that have been spawned via a detoured pthread_create do not execute instructions. I can step through instructions one by one, but as soon as I continue, the thread does not progress. There are no mutexes or synchronisation involved and the result of the function is 0 (success). The exact same application with detours turned off works fine, so the application itself is unlikely to be the culprit.
This does not happen all the time. Sometimes the threads are fine, but at other times the test application stalls in the following state:
(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00007fff7296f55e libsystem_kernel.dylib`__ulock_wait + 10
frame #1: 0x00007fff72a325c2 libsystem_pthread.dylib`_pthread_join + 347
frame #2: 0x0000000100001186 DetoursTestApp`main + 262
frame #3: 0x00007fff7282ccc9 libdyld.dylib`start + 1
frame #4: 0x00007fff7282ccc9 libdyld.dylib`start + 1
thread #2
frame #0: 0x00007fff72a2cb7c libsystem_pthread.dylib`thread_start
Relevant memory pages have the executable flag set. The detour function that intercepts the thread creation looks like this:
static int pthread_create_detour(pthread_t* thread,
const pthread_attr_t* attr,
void* (*start_routine)(void*),
void* arg)
{
detour_count++;
pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
return original(thread, attr, start_routine, arg);
}
Where detour_original retrieves the pointer to [original function + size of function's prologue].
Tracing through the instructions, everything seems to be working correctly and pthread_create terminates successfully. Tracing the application's system calls via dtruss does show calls to
bsdthread_create(0x10DB964B0, 0x0, 0x7000080DB000) = 29646848 0
With what I have confirmed are the correct arguments.
This behaviour is only observed in release builds - debug works fine but the disassembly and execution of a detoured pthread_create and associated detours code seems to be identical in both cases.
Workarounds
I found a couple of odd workarounds for this issue that don't make much sense. Given the detour function, a number of things can be substituted into the following:
static int pthread_create_detour(pthread_t* thread,
const pthread_attr_t* attr,
void* (*start_routine)(void*),
void* arg)
{
detour_count++;
pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
<...> <== SUBSTITUTE HERE
return original(thread, attr, start_routine, arg);
}
A cache flush.
__asm__ __volatile__("" ::: "memory");
_mm_clflush(real_pthread_create);
A sleep of any duration - usleep(1)
A printf statement.
A memory allocation larger than 32768 bytes, e.g. void *data = malloc(40000);.
Cache?
All of these seem to point to a stale instruction cache. However, the Intel manual states the following:
A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction.
What's even more interesting is that those workarounds have to be executed for every new thread created, with the execution happening on the main thread, so it's very unlikely to be the cache. I have also tried putting in cache flushes at every memory write that writes instructions but that did not help. I've also written a memcpy that bypasses the cache with the use of Intel's intrinsic _mm_stream_si32 and swapped it out for every instruction memory write in my implementation without any success.
Race condition?
The next suspect in line is a race condition. However, it's not clear what would be racing, as at first there are no other threads. I put in a Fibonacci sequence calculation for a randomly generated number, and the newly spawned threads would still stall.
The question
What is causing this issue? What other mechanisms could be responsible for this?
At this point I have run out of things to check so any suggestions will be welcome.
I found that the reason why the spawned thread was not executing instructions was that the r8 register wasn't being cleared at the right time in the execution of pthread_create due to an issue with my detours implementation.
If we look at the disassembly of the function, it is split into two parts: the "head" and the "body", which is found in an internal _pthread_create function. The head does two things: it zeroes out r8 and jumps to the body:
libsystem_pthread.dylib`pthread_create:
0x7fff72a2e236 <+0>: 45 31 c0 xor r8d, r8d
0x7fff72a2e239 <+3>: e9 40 37 00 00 jmp 0x7fff72a3197e ; _pthread_create
libsystem_pthread.dylib`_pthread_create:
0x7fff72a3197e <+0>: 55 push rbp
0x7fff72a3197f <+1>: 48 89 e5 mov rbp, rsp
0x7fff72a31982 <+4>: 41 57 push r15
<...> // the rest of the 1409 instructions
My implementation would detour the internal _pthread_create function instead of the head containing the actual entry point, which meant that r8 would get cleared at the wrong time (before the detour). Since the detour function would contain some code, the execution would go something like:
pthread_create (r8 gets cleared) -> _pthread_create -> chain of jumps -> pthread_create_detour -> trampoline (containing the beginning of _pthread_create) -> _pthread_create + 6
This meant that, depending on the contents of the pthread_create_detour function, r8 would not always contain 0 when execution returned to the internal function.
It's not yet clear why having r8 set to something other than 0 before _pthread_create would not crash but instead start up a thread in a locked-up state. An important detail is that the stalled thread would have the rflags register set to 0x200, which should never be the case according to Intel's manual. This is what led me to inspect the CPU state more closely, leading to the answer.
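The jmp in the head of pthread_create shown in the disassembly above can be decoded by hand, which makes it easy to check whether a patch site is the exported entry point or the function it jumps to. The following is only a sketch: rel32_target and pthread_create_jmp_target are hypothetical helper names, and the addresses and bytes are the ones from the listing above, which will differ on other systems.

```c
#include <assert.h>
#include <stdint.h>

/* Target of a 5-byte near jmp (opcode 0xE9) or near call (opcode 0xE8).
   The signed 32-bit displacement is relative to the address of the
   following instruction, i.e. insn_addr + 5. */
uint64_t rel32_target(uint64_t insn_addr, const uint8_t insn[5])
{
    assert(insn[0] == 0xE9 || insn[0] == 0xE8);
    /* The displacement is stored little-endian after the opcode byte. */
    uint32_t raw = (uint32_t)insn[1]
                 | (uint32_t)insn[2] << 8
                 | (uint32_t)insn[3] << 16
                 | (uint32_t)insn[4] << 24;
    int32_t disp = (int32_t)raw; /* reinterpret as signed */
    return insn_addr + 5 + (uint64_t)(int64_t)disp;
}

/* The jmp at pthread_create+3 from the listing: e9 40 37 00 00. */
uint64_t pthread_create_jmp_target(void)
{
    static const uint8_t jmp[5] = { 0xE9, 0x40, 0x37, 0x00, 0x00 };
    return rel32_target(0x7fff72a2e239ULL, jmp);
}
```

For the bytes in the listing this yields 0x7fff72a3197e, the address of _pthread_create. Patching at that address rather than at 0x7fff72a2e236 means the xor r8d, r8d in the head has already executed by the time the detour runs.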
I have set a breakpoint on nt!NtWriteFile from WinDbg. I'm using kernel debugging and I want to get the user stack + kernel stack trace when a certain program (for example, notepad.exe) ends up calling this API. When the breakpoint kicks in, I do the following:
.reload /user
K
but the result is similar to this (in this case notepad.exe is the current process):
# ChildEBP RetAddr
00 8f5a8c34 76e96c73 nt!NtWriteFile
01 8f5a8c38 badb0d00 ntdll!KiFastSystemCall+0x3
02 8f5a8c3c 0320ef04 0xbadb0d00
03 8f5a8c40 00000000 0x320ef04
My questions are:
What is 0xbadb0d00? I always see this address.
Is the address 0x320ef04 the function in user land (inside notepad.exe in this case) from which the call begins? In this case, would that be the full stack trace (user stack + kernel stack)?
Is there another easier way to get this?
Thank you.
Updated:
As I read in this link (thanks to Thomas Weller), 0xbadb0d00 is used to initialize uninitialized memory in some circumstances. Now I have even more doubts. Why does the stack trace show uninitialized memory? Why does the notepad.exe stack trace not appear in the output if I'm in its context?
The Windows host I'm debugging is running 32-bit Windows 7.
I have crash dumps that have WerpReportFault() in their stack and they really don't look the way I expect them to.
My expectation
I have seen WerpReportFault() along with 0x80000003 breakpoints, and I was able to use WinDbg to re-dump with different exception pointers, taken from the second argument passed to WerpReportFault().
I'm very sure that has worked before, since I even recommended that in my answer over there. There are also other sites suggesting this technique, e.g. James Ross.
My current observations
The dumps I'm analyzing have an "ordinary exception" inside, e.g. an access violation:
0:000> .exr -1
ExceptionAddress: 53ec8b55
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 00000000
Parameter[1]: 53ec8b55
Attempt to read from address 53ec8b55
But they still have WerpReportFault() as the stack:
0:000> k
ChildEBP RetAddr
0018f25c 74c4171a ntdll!NtWaitForMultipleObjects+0x15
0018f2f8 75181a08 KERNELBASE!WaitForMultipleObjectsEx+0x100
0018f340 75184200 kernel32!WaitForMultipleObjectsExImplementation+0xe0
0018f35c 751a80ec kernel32!WaitForMultipleObjects+0x18
0018f3c8 751a7fab kernel32!WerpReportFaultInternal+0x186
0018f3dc 751a78a0 kernel32!WerpReportFault+0x70
0018f3ec 751a781f kernel32!BasepReportFault+0x20
0018f478 7295fa2e kernel32!UnhandledExceptionFilter+0x1af
Argument 2 does not seem to be a good exception pointer to be used in the .dump command.
0:000> kb
ChildEBP RetAddr Args to Child
[...]
0018f3dc 751a78a0 0018f4a0 00000001 0018f478 kernel32!WerpReportFault+0x70
[...]
Question
What causes the problems I have and how do I get around it? I know it must be possible, because !analyze -v can tell me the real call stack.
Is it due to Visual Basic 6 and the unhandled exception filter?
0018f478 7295fa2e 00000000 72a2bd04 0018f4a8 kernel32!UnhandledExceptionFilter+0x1af
0018ff80 00440fe2 00443860 7518338a 7efde000 msvbvm60!Zombie_Release+0x10fd5
I really want to have a nice call stack, since all my manual debugging and all my scripts that rely on k, !clrstack and similar commands are broken. They can't deal with WerpReportFault() on the stack.
All the dumps are 32 bit, as you can imagine from the VB6 dependency.
This problem is caused by the wrong context being active. The debugger seems to be set to the normal context record. To switch to the exception context, use .ecxr. To switch back to the normal context (the one you currently see), use .cxr.
I would really like a debugging tool that is able to visualise the current stack frame (bytes between RSP and RBP) as a block diagram.
Something like this, but with real execution values in the cells:
http://abrickshort.files.wordpress.com/2006/11/stackframe.jpg
Does such software exist? I'm using a UNIX system.
PS.
I'm aware of gdb's "examine bytes" function. That's what I use now, but I would like pretty diagrams to show my supervisor.
Cheers
GDB won't be able to give you the diagram off-the-shelf, but info frame n gives almost everything you need:
(gdb) info frame 2
Stack frame at 0x7ffff7fe3fe0:
rip = 0x3cbd806ccb in start_thread (pthread_create.c:301); saved rip 0x3cbd0e0c2d
called by frame at 0x0, caller of frame at 0x7ffff7fe3ed0
source language c.
Arglist at 0x7ffff7fe3ec8, args: arg=0x7ffff7fe4700
Locals at 0x7ffff7fe3ec8, Previous frame's sp is 0x7ffff7fe3fe0
Saved registers:
rbx at 0x7ffff7fe3fd0, rip at 0x7ffff7fe3fd8
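Turning that output into a block diagram takes little code. The sketch below is only one possible approach, not an existing tool: slot, render_frame and demo_render are hypothetical names, and the two slots are the saved-register addresses from the info frame output above, typed in by hand. To show real values in the cells, the labels would have to be filled from a memory dump of the same addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct slot {
    uint64_t address;  /* address of the 8-byte stack slot */
    const char *label; /* what the debugger reports is stored there */
};

/* Render the slots, in the order given, as a column of boxes into buf.
   Returns the number of characters written (excluding the terminator),
   or -1 if buf is too small. */
int render_frame(const struct slot *slots, size_t count, char *buf, size_t size)
{
    static const char rule[] = "               +------------------+\n";
    size_t used = 0;
    int n;

    assert(slots != NULL && buf != NULL);
    for (size_t i = 0; i < count; i++) {
        n = snprintf(buf + used, size - used, "%s0x%012llx | %-16s |\n", rule,
                     (unsigned long long)slots[i].address, slots[i].label);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    /* Closing rule under the last box. */
    n = snprintf(buf + used, size - used, "%s", rule);
    if (n < 0 || (size_t)n >= size - used)
        return -1;
    used += (size_t)n;
    return (int)used;
}

/* Slots copied from the "Saved registers" line, highest address first. */
int demo_render(char *buf, size_t size)
{
    const struct slot slots[] = {
        { 0x7ffff7fe3fd8ULL, "saved rip" },
        { 0x7ffff7fe3fd0ULL, "saved rbx" },
    };
    return render_frame(slots, sizeof slots / sizeof slots[0], buf, size);
}
```

Calling demo_render with a buffer of a few hundred bytes and printing the result gives one box per slot with its address to the left. Listing the highest address first matches the usual convention of drawing the stack growing downwards.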