Windows SuspendThread doesn't? (GetThreadContext fails)

We have a Win32 application in which one thread can stop another to inspect its state (PC, etc.) by doing SuspendThread/GetThreadContext/ResumeThread:
    if (SuspendThread((HANDLE)hComputeThread[threadId]) == (DWORD)-1) // freeze thread
        ThreadOperationFault("SuspendThread", "InterruptGranule");    // failure value is (DWORD)-1, so a signed "< 0" test never fires

    CONTEXT Context;
    Context.ContextFlags = (CONTEXT_INTEGER | CONTEXT_CONTROL);
    if (!GetThreadContext((HANDLE)hComputeThread[threadId], &Context))
        ThreadOperationFault("GetThreadContext", "InterruptGranule");
Extremely rarely, on a multicore system, GetThreadContext fails with error code 5 (the Windows system error "Access Denied").
The SuspendThread documentation seems to clearly indicate that the targeted thread is suspended if no error is returned. We are checking the return status of SuspendThread and ResumeThread; they aren't complaining, ever.
How can it be the case that I can suspend a thread, but can't access its context?
This blog post:
http://www.dcl.hpi.uni-potsdam.de/research/WRK/2009/01/what-does-suspendthread-really-do/
suggests that SuspendThread, when it returns, may only have initiated the suspension of the other thread, and that the thread hasn't actually suspended yet. In that case, I can kind of see how GetThreadContext would be problematic, but this seems like a stupid way to define SuspendThread. (How would the caller of SuspendThread know when the target thread was actually suspended?)
EDIT: I lied. I said this was for Windows.
Well, the strange truth is that I don't see this behavior under Windows XP 64 (at least not in the last week, and I don't really know what happened before that)... but we have been testing this Windows application under Wine on Ubuntu 10.x. The Wine source for the guts of GetThreadContext contains an Access Denied return on line 819, hit when an attempt to grab the thread state fails for some reason. I'm guessing, but it appears that Wine's GetThreadContext believes a thread just might not be accessible, repeatedly. Why that would be true after a SuspendThread is beyond me, but there's the code. Thoughts?
EDIT2: I lied again. I said we only saw the behavior on Wine. Nope... we have now found a Vista Ultimate system that seems to produce the same error (again, rarely). So it appears that Wine and Windows agree on an obscure case. It also appears that merely enabling the Sysinternals Process Monitor program aggravates the situation and causes the problem to appear on Windows XP 64; I suspect a Heisenbug. (Process Monitor doesn't even exist on the Wine-tasting (:-) machine or the XP 64 system I use for development.)
What on earth is it?
EDIT3: Sept 15 2010. I've added careful checking of the error return status, without otherwise disturbing the code, for SuspendThread, ResumeThread, and GetThreadContext. I haven't seen any hint of this behavior on Windows systems since I did that. I haven't gotten back to the Wine experiment.
Nov 2010: Strange. It seems that if I compile this under Visual Studio 2005, it fails on Windows Vista and 7, but not on earlier OSes. If I compile under Visual Studio 2010, it doesn't fail anywhere. One might point a finger at Visual Studio 2005, but I'm suspicious of a location-sensitive problem, and the different optimizers in VS 2005 and VS 2010 place the code at slightly different places.
Nov 2012: The saga continues. We see this failure on a number of XP and Windows 7 machines, at a pretty low rate (once every several thousand runs). Our suspend activities are applied to threads that mostly execute pure computational code but that sometimes make calls into Windows. I don't recall seeing this issue when the PC of the thread was in our computational code. Of course, I can't see the PC of the thread when it hangs, because GetThreadContext won't give it to me, so I can't directly confirm that the problem only happens when executing system calls. But all our system calls are channeled through one point, and so far the evidence is that that point was being executed when we got the hang. So the indirect evidence suggests that GetThreadContext on a thread only fails if a system call is being executed by that thread. I haven't had the energy to build a critical experiment to test this hypothesis yet.

Let me quote from Richter/Nasarre's "Windows via C/C++, 5th Edition", which may shed some light:
    DWORD SuspendThread(HANDLE hThread);

Any thread can call this function to suspend another thread (as long as you have the thread's handle). It goes without saying (but I'll say it anyway) that a thread can suspend itself but cannot resume itself. Like ResumeThread, SuspendThread returns the thread's previous suspend count. A thread can be suspended as many as MAXIMUM_SUSPEND_COUNT times (defined as 127 in WinNT.h). Note that SuspendThread is asynchronous with respect to kernel-mode execution, but user-mode execution does not occur until the thread is resumed.

In real life, an application must be careful when it calls SuspendThread because you have no idea what the thread might be doing when you attempt to suspend it. If the thread is attempting to allocate memory from a heap, for example, the thread will have a lock on the heap. As other threads attempt to access the heap, their execution will be halted until the first thread is resumed. SuspendThread is safe only if you know exactly what the target thread is (or might be) doing and you take extreme measures to avoid problems or deadlocks caused by suspending the thread.

...

Windows actually lets you look inside a thread's kernel object and grab its current set of CPU registers. To do this, you simply call GetThreadContext:

    BOOL GetThreadContext(HANDLE hThread, PCONTEXT pContext);

To call this function, just allocate a CONTEXT structure, initialize some flags (the structure's ContextFlags member) indicating which registers you want to get back, and pass the address of the structure to GetThreadContext. The function then fills in the members you've requested.

You should call SuspendThread before calling GetThreadContext; otherwise, the thread might be scheduled and the thread's context might be different from what you get back. A thread actually has two contexts: user mode and kernel mode. GetThreadContext can return only the user-mode context of a thread. If you call SuspendThread to stop a thread but that thread is currently executing in kernel mode, its user-mode context is stable even though SuspendThread hasn't actually suspended the thread yet. But the thread cannot execute any more user-mode code until it is resumed, so you can safely consider the thread suspended and GetThreadContext will work.
My guess is that GetThreadContext may fail if you call it just after SuspendThread, while the target thread is still executing in kernel mode and the kernel has the thread's context block locked at that moment.
Maybe on multicore systems, one core is handling the kernel-mode execution of the thread whose user mode was just suspended, keeping the thread's CONTEXT structure locked exactly when the other core calls GetThreadContext.
Since this behaviour is not documented, I suggest contacting Microsoft.

There are some particular problems around suspending a thread that owns a critical section. I can't find a good reference to it now, but there is one mention of it on Raymond Chen's blog and another on Chris Brumme's blog. Basically, if you are unlucky enough to call SuspendThread while the thread is holding an OS lock (e.g., the heap lock, the DllMain loader lock, etc.), then really strange things can happen. I would assume that this is the case you are running into extremely rarely.
Does retrying the call to GetThreadContext work after a processor yield like Sleep(0)?
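If it does, a minimal retry sketch might look like this (the helper name and the retry bound of 10 are my own assumptions, not anything documented):

    #include <windows.h>

    // Sketch: after SuspendThread has succeeded on hThread, retry
    // GetThreadContext with a processor yield between attempts. The
    // retry bound of 10 is an arbitrary assumption, not a documented limit.
    static BOOL GetContextWithRetry(HANDLE hThread, CONTEXT* ctx)
    {
        ctx->ContextFlags = CONTEXT_INTEGER | CONTEXT_CONTROL;
        for (int attempt = 0; attempt < 10; ++attempt) {
            if (GetThreadContext(hThread, ctx))
                return TRUE;            // got a stable context
            Sleep(0);                   // yield and let the target settle
        }
        return FALSE;                   // caller should ResumeThread and check GetLastError()
    }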

An old issue, but good to see you still kept it updated with status changes after experiencing the issue for more than another two years.
The cause of your problem is a bug in the translation layer of the x64 version of WoW64, as per:
http://social.msdn.microsoft.com/Forums/en/windowscompatibility/thread/1558e9ca-8180-4633-a349-534e8d51cf3a
There is a rather critical bug in GetThreadContext under WoW64 that makes it return stale contents, which makes it unusable in many situations. The contents are cached in user mode; this is why you may think a value is non-null when in the stale copy it is still null.
This is why it fails on newer OSes but not older ones; try running it on a 32-bit Windows 7 OS.
As for why this bug seems to happen less often with solutions built in Visual Studio 2010 / 2012: it is likely that something the compiler does mitigates most of the problem. To investigate, inspect the code generated by 2005 and 2010 and see what the differences are. For example, does the problem happen if the project is built without optimizations?
Finally, some further reading:
http://www.nynaeve.net/?p=129

Maybe a thread-safety issue. Are you sure the hComputeThread table isn't changing out from under you? Maybe the thread was exiting when you called SuspendThread? That could let the suspend succeed, but by the time you call GetThreadContext the thread is gone and the handle is invalid.

Calling SuspendThread on a thread that owns a synchronization object, such as a mutex or critical section, can lead to a deadlock if the calling thread tries to obtain a synchronization object owned by a suspended thread.
- MSDN

Related

Crash with all threads running SIGSEGV handler

We develop a user-space process running on Linux 3.4.11 on an embedded MIPS system. The process creates multiple (>10) threads using pthreads. The process has a SIGSEGV signal handler which, among other things, generates a log message which goes to our log file. As part of this flow, it acquires a semaphore (bad, I know...).
During our testing the process appeared to hang. We're currently unable to build gdb for the target platform, so I wrote a CLI tool that uses ptrace to extract the register values and USER data using PTRACE_PEEKUSR.
What surprised me was that all of our threads were inside our crash handler, trying to acquire the semaphore. This (obviously?) indicates a deadlock on the semaphore, which means that a thread died while holding it. When I dug into the stacks, it seemed that almost all of the threads (except one) were in a blocking call (recv, poll, sleep) when the signal handler started running. Manual stack reconstruction on MIPS is a pain, so we have not fully done it yet. One thread appeared to be in the middle of a malloc call, which to me indicates that it crashed due to heap corruption.
A couple of things are still unclear:
1) Assuming one thread crashed in malloc, why would all other threads be running the SIGSEGV handler? As I understand it, a SIGSEGV signal is delivered to the faulting thread, no? Does it mean that each and every one of our threads crashed?
2) Looking at the sigcontext struct for MIPS, it seems it does not contain the memory address which was accessed (badaddr). Is there another place that has it? I couldn't find it anywhere, but it seemed odd to me that it would not be available.
And of course, if anyone can suggest ways to continue the analysis, it would be appreciated!
Yes, it is likely that all of your threads crashed in turn, assuming that you have captured the thread state correctly.
siginfo_t has a si_addr member, which should give you the address of the fault. Whether your kernel fills that in is a different matter.
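For illustration, a minimal sketch of a handler that receives si_addr (the deliberate null write at the end just triggers the handler; formatting the address is left out because only async-signal-safe calls are allowed in a handler):

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    // Sketch: install a SIGSEGV handler with SA_SIGINFO so that
    // info->si_addr carries the faulting address (when the kernel fills it in).
    static void segv_handler(int sig, siginfo_t* info, void* ucontext)
    {
        (void)sig; (void)info; (void)ucontext;
        // Only async-signal-safe calls are allowed here: no printf, no
        // malloc, no semaphores (that is the deadlock described above).
        static const char msg[] = "SIGSEGV caught; faulting address is in info->si_addr\n";
        write(STDERR_FILENO, msg, sizeof msg - 1);
        _exit(1);
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        *(volatile int*)0 = 42;   // deliberate fault to trigger the handler
        return 0;
    }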
In-process crash handlers will always be unreliable. You should use an out-of-process handler, and set kernel.core_pattern to invoke it. In current kernels, it is not necessary to write the core file to disk; you can either read the core file from standard input, or just map the process memory of the zombie process (which is still available when the kernel invokes the crash handler).

What does a thread do when switching from user to kernel mode under Windows NT (recent x86 versions, Vista and Win7)?

From what I understand, a thread executing in user mode can eventually enter code that switches to kernel mode (using sysenter). But how can a thread that originated in user code execute kernel code?
E.g.: I'm calling CreateFile(); it delegates to NtCreateFile(), which in turn calls ZwCreateFile(), then KiFastSystemCall()... then sysenter... profit, kernel access?
Edit
This question:
How does Windows protect transition into kernel mode has an answer that helped me understand, see quote:
"The user mode thread is causing an exception that's caught by the Ring 0 code. The user mode thread is halted and the CPU switches to a kernel/ring 0 thread, which can then inspect the context (e.g., call stack & registers) of the user mode thread to figure out what to do." Also see this blog post, very informative: http://duartes.org/gustavo/blog/post/cpu-rings-privilege-and-protection
The short answer is that it can't.
What happens is that when you create a user-mode thread, the kernel gives it a matching kernel-mode stack. When "your" thread needs to execute some code in kernel mode, the CPU transitions to ring 0 and the same thread continues executing on its kernel-mode stack.
Disclaimer: The last time I really looked closely at this was probably with Win2K or maybe even NT4 -- but I doubt much has changed in this respect.

File operation functions return, but are not actually committed when Windows shuts down

I am working on an MFC application that can (among other things) be used to shut Windows down. When doing this, Windows of course sends WM_QUERYENDSESSION and WM_ENDSESSION to all applications, mine included. However, the problem is that my application, as part of some destructors, deletes certain files (with CFile::Remove) that have been used during the execution. I have reason to believe that the destructors are called when the application is closed by Windows (but that is hard to know for certain).
However, when Windows starts back up again, I occasionally notice that files that were supposed to be deleted are still present. This does not happen consistently, even when the execution of the program is identical (I have a script for testing this). This leads me to think that one of two things is happening: either a) the destructors are not consistently being called, or b) the Remove function returns, but the file is not actually deleted before Windows shuts down.
The only work-around I have found so far is to have the system delay the shutdown for approximately 10 seconds after my program has stopped; then the files are properly deleted. This leads me to believe that b) may be the case.
I hope someone is able to help me with this problem.
Regards
Mort
Once your program returns from WM_ENDSESSION, Windows can terminate it at any time:
If the session is being ended, this parameter is TRUE; the session can end any time after all applications have returned from processing this message.
If the session ends quickly, then it may end before your destructors run. You must do all your cleanup before returning from WM_ENDSESSION, because there is no guarantee that you will get a chance to do it afterwards.
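A minimal sketch of that cleanup at the message-handler level (plain Win32 rather than MFC; the file name is a hypothetical stand-in for the files your destructors would remove):

    #include <windows.h>

    // Sketch: do the cleanup inside WM_ENDSESSION itself, before returning,
    // because the process may be killed at any point afterwards.
    // "cleanup.tmp" is a hypothetical stand-in for the real files.
    LRESULT CALLBACK WndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
    {
        switch (msg) {
        case WM_QUERYENDSESSION:
            return TRUE;                        // allow the session to end
        case WM_ENDSESSION:
            if (wParam) {                       // the session really is ending
                DeleteFileW(L"cleanup.tmp");    // delete now; destructors may never run
            }
            return 0;
        }
        return DefWindowProc(hWnd, msg, wParam, lParam);
    }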
The problem here is that some versions of Windows report that file-handling operations have completed before they actually have. This normally isn't a problem, but when a shutdown is in progress some pending operations, including file deletes, will be abandoned.
I would suggest that you cope with this by forcing your code to wait for a confirmed deletion of the files (have a process look for the files and raise an event when they've gone) before calling for system shutdown.
If the system is properly shut down (not a sudden power loss or the like), then all the cached data is flushed. In particular, this includes flushing the global file descriptor table (or whatever it's called in your file system), which should commit the file deletion.
So the problem seems to be that the user-mode code doesn't call DeleteFile, or that the call fails (for whatever reason).
Note that there are several ways the application (process) may exit, and destructors are not always called. Automatic objects are destroyed in the context of their call stack, while global/static objects are initialized and destroyed by the CRT init/cleanup code.
Below is a short summary of the ways to terminate the process, with the consequences:
1) All process threads exit conventionally (return from their thread procedure). The OS terminates the process once it has no threads. All destructors are executed.
2) Some threads exit via ExitThread or are killed by TerminateThread. The automatic objects of those threads are not destructed.
3) The process exits via ExitProcess. Automatic objects are not destructed; global objects may be destructed (this happens if the CRT is used in a DLL).
4) The process is terminated by TerminateProcess. No destructors are called.
I suggest you check whether DeleteFile (or CFile::Remove, which wraps it) is indeed called, and also whether it succeeds; for instance, the delete will fail if the same file is still open for whatever reason.
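For example, a small sketch of that check (the helper name, path handling, and logging are my own):

    #include <windows.h>
    #include <stdio.h>

    // Sketch: confirm the delete actually happened and log why it didn't.
    // ERROR_SHARING_VIOLATION is the typical result when the file is still open.
    void RemoveWithCheck(const wchar_t* path)
    {
        if (!DeleteFileW(path)) {
            DWORD err = GetLastError();
            fwprintf(stderr, L"DeleteFile(%ls) failed: error %lu\n", path, err);
        }
    }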

Interrupt processing in Windows

I want to know which threads process device interrupts. What happens when there is an interrupt while a user-mode thread is running? Also, do other user threads get a chance to run while the system is processing an interrupt?
Kindly suggest some reference material describing how interrupts are handled by Windows.
Device interrupts themselves are (usually) processed by whatever thread had the CPU that took the interrupt, but in ring 0 and at a different protection level. This limits some of the actions an interrupt handler can take, because most of the time the current thread will not be related to the thread that is waiting for the event the interrupt is signaling.
The kernel itself is closed source, and only documented through its internal API. That API is exposed to device driver authors, and described in the driver development kits.
Some resources to get you started:
Any edition of Microsoft Windows Internals by Solomon and Russinovich. The current one seems to be the 4th edition, but even an old edition will help.
The Windows DDK, now renamed the WDK. Its documentation is available online too. Be sure to read the Kernel Mode Design Guide...
Sysinternals has tools and articles to probe at and explain the kernel's behavior. This used to be an independent site until Microsoft got tired of Mark Russinovich seeming to know more about how the kernel worked than they did. ;-)
Note that source code for many of the common device drivers is included in the DDK samples. Although the production versions are almost certainly different, reading the sample drivers can answer some questions even if you don't want to implement a driver yourself.
Like any other operating system, Windows processes interrupts in kernel mode, at an elevated Interrupt Request Level (IRQL). Any user thread or lower-IRQL kernel thread running on the same processor will be interrupted while the interrupt request is processed, and will be resumed when the interrupt processing is complete.
In order to learn more about device interrupts on Windows, you need to study device driver development. This is a niche topic; I don't think you can find many useful resources on the Web, and you may have to look for a book or a training course.
Anyway, Windows handles interrupts with Interrupt Request Levels (IRQLs) and deferred procedure calls (DPCs). An interrupt is handled in kernel mode, which runs at higher priority than user mode. A proper interrupt handler needs to react very quickly: it performs only the absolutely necessary operations and registers a deferred procedure call to run in the near future, once the system drops back below interrupt level.
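To make the ISR/DPC split concrete, here is a heavily hedged sketch of the classic WDM pattern. It compiles only against the WDK, the device-specific acknowledgement is elided, and everything except the Ke* routines is a hypothetical name:

    #include <ntddk.h>

    // Hypothetical per-device state; only the KDPC is real WDK machinery.
    // The DPC would be set up once at device initialization with:
    //   KeInitializeDpc(&state->Dpc, MyDpcRoutine, state);
    typedef struct _MY_DEVICE_STATE {
        KDPC Dpc;
    } MY_DEVICE_STATE;

    // Interrupt service routine: runs at device IRQL on whatever thread
    // happened to own the CPU, so it does the bare minimum and defers the rest.
    BOOLEAN MyInterruptService(PKINTERRUPT Interrupt, PVOID Context)
    {
        MY_DEVICE_STATE* state = (MY_DEVICE_STATE*)Context;
        UNREFERENCED_PARAMETER(Interrupt);
        /* ...silence the device here so it stops asserting the interrupt... */
        KeInsertQueueDpc(&state->Dpc, NULL, NULL);   // queue the deferred work
        return TRUE;                                 // the interrupt was ours
    }

    // Deferred procedure call: runs later at DISPATCH_LEVEL, after the ISR
    // has returned, so user threads can be scheduled again in the meantime.
    VOID MyDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
    {
        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(Arg1);
        UNREFERENCED_PARAMETER(Arg2);
        UNREFERENCED_PARAMETER(Context);
        /* ...the bulk of the interrupt processing goes here... */
    }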

How does a debugger work?

I keep wondering: how does a debugger work? Particularly the kind that can be 'attached' to an already running executable. I understand that the compiler translates code to machine language, but then how does the debugger 'know' what it is being attached to?
The details of how a debugger works will depend on what you are debugging, and what the OS is. For native debugging on Windows you can find some details on MSDN: Win32 Debugging API.
The user tells the debugger which process to attach to, either by name or by process ID. If it is a name then the debugger will look up the process ID, and initiate the debug session via a system call; under Windows this would be DebugActiveProcess.
Once attached, the debugger will enter an event loop much like for any UI, but instead of events coming from the windowing system, the OS will generate events based on what happens in the process being debugged – for example an exception occurring. See WaitForDebugEvent.
The debugger is able to read and write the target process' virtual memory, and even adjust its register values through APIs provided by the OS. See the list of debugging functions for Windows.
The debugger is able to use information from symbol files to translate from addresses to variable names and locations in the source code. The symbol file information is a separate set of APIs and isn't a core part of the OS as such. On Windows this is through the Debug Interface Access SDK.
If you are debugging a managed environment (.NET, Java, etc.) the process will typically look similar, but the details are different, as the virtual machine environment provides the debug API rather than the underlying OS.
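Putting the native pieces above together, a minimal attach-and-pump skeleton might look like this (error handling is trimmed, and the event dispatch is far smaller than a real debugger's):

    #include <windows.h>

    // Sketch of a minimal native debugger loop: attach, then pump debug
    // events until the target exits. Real debuggers dispatch on many more
    // event codes than are shown here.
    int DebugLoop(DWORD pid)
    {
        if (!DebugActiveProcess(pid))
            return 1;                                   // attach failed

        for (;;) {
            DEBUG_EVENT ev;
            if (!WaitForDebugEvent(&ev, INFINITE))
                return 1;

            DWORD status = DBG_CONTINUE;
            switch (ev.dwDebugEventCode) {
            case EXCEPTION_DEBUG_EVENT:
                // Breakpoints, access violations, and single-steps arrive here.
                if (ev.u.Exception.ExceptionRecord.ExceptionCode != EXCEPTION_BREAKPOINT)
                    status = DBG_EXCEPTION_NOT_HANDLED; // pass others to the target
                break;
            case EXIT_PROCESS_DEBUG_EVENT:
                ContinueDebugEvent(ev.dwProcessId, ev.dwThreadId, DBG_CONTINUE);
                return 0;                               // target has gone away
            }
            ContinueDebugEvent(ev.dwProcessId, ev.dwThreadId, status);
        }
    }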
As I understand it:
For software breakpoints on x86, the debugger replaces the first byte of the instruction with CC (int3). This is done with WriteProcessMemory on Windows. When the CPU gets to that instruction, and executes the int3, this causes the CPU to generate a debug exception. The OS receives this interrupt, realizes the process is being debugged, and notifies the debugger process that the breakpoint was hit.
After the breakpoint is hit and the process is stopped, the debugger looks in its list of breakpoints, and replaces the CC with the byte that was there originally. The debugger sets TF, the Trap Flag in EFLAGS (by modifying the CONTEXT), and continues the process. The Trap Flag causes the CPU to automatically generate a single-step exception (INT 1) on the next instruction.
When the process being debugged stops the next time, the debugger again replaces the first byte of the breakpoint instruction with CC, and the process continues.
I'm not sure if this is exactly how it's implemented by all debuggers, but I've written a Win32 program that manages to debug itself using this mechanism. Completely useless, but educational.
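For illustration, a hedged sketch of the byte-patching step described above, using only the APIs already named (the helper names are my own):

    #include <windows.h>

    // Sketch: plant and remove an int3 (0xCC) software breakpoint in a
    // target process. FlushInstructionCache matters on some architectures
    // and is harmless on x86.
    BOOL SetBreakpoint(HANDLE hProcess, LPVOID addr, BYTE* savedByte)
    {
        const BYTE int3 = 0xCC;
        SIZE_T n = 0;
        if (!ReadProcessMemory(hProcess, addr, savedByte, 1, &n) || n != 1)
            return FALSE;                    // remember the original byte
        if (!WriteProcessMemory(hProcess, addr, &int3, 1, &n) || n != 1)
            return FALSE;
        return FlushInstructionCache(hProcess, addr, 1);
    }

    BOOL ClearBreakpoint(HANDLE hProcess, LPVOID addr, BYTE savedByte)
    {
        SIZE_T n = 0;
        if (!WriteProcessMemory(hProcess, addr, &savedByte, 1, &n) || n != 1)
            return FALSE;                    // restore the original byte
        return FlushInstructionCache(hProcess, addr, 1);
    }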
In Linux, debugging a process begins with the ptrace(2) system call. This article has a great tutorial on how to use ptrace to implement some simple debugging constructs.
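In the same vein, a minimal ptrace attach-and-inspect sketch (this assumes an x86-64 Linux target and sufficient privileges; register layouts differ per architecture, as the MIPS question above shows):

    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>

    // Sketch: attach to a running process, wait for it to stop, dump the
    // instruction pointer, then detach and let it run again.
    int dump_rip(pid_t pid)
    {
        if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
            return 1;
        if (waitpid(pid, NULL, 0) == -1)        // the target is now stopped
            return 1;

        struct user_regs_struct regs;
        if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) == -1)
            return 1;
        printf("rip = %llx\n", (unsigned long long)regs.rip);

        ptrace(PTRACE_DETACH, pid, NULL, NULL); // resume the target
        return 0;
    }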
If you're on a Windows OS, a great resource for this would be "Debugging Applications for Microsoft .NET and Microsoft Windows" by John Robbins:
http://www.amazon.com/dp/0735615365
(or even the older edition: "Debugging Applications")
The book has a chapter on how a debugger works that includes code for a couple of simple (but working) debuggers.
Since I'm not familiar with details of Unix/Linux debugging, this stuff may not apply at all to other OS's. But I'd guess that as an introduction to a very complex subject the concepts - if not the details and APIs - should 'port' to most any OS.
I think there are two main questions to answer here:
1. How does the debugger know that an exception occurred?
When an exception occurs in a process that's being debugged, the debugger gets notified by the OS before any user exception handlers defined in the target process are given a chance to respond to the exception. If the debugger chooses not to handle this (first-chance) exception notification, the exception dispatching sequence proceeds further and the target thread is then given a chance to handle the exception if it wants to do so. If the SEH exception is not handled by the target process, the debugger is then sent another debug event, called a second-chance notification, to inform it that an unhandled exception occurred in the target process. (Source)
2. How does the debugger know how to stop on a breakpoint?
The simplified answer is: when you put a breakpoint into the program, the debugger replaces your code at that point with an int3 instruction, which is a software interrupt. As a result, the program is suspended and the debugger is notified.
Another valuable source for understanding debugging is the Intel CPU manual (Intel® 64 and IA-32 Architectures Software Developer's Manual). Volume 3A, Chapter 16 introduces the hardware support for debugging, such as special exceptions and hardware debugging registers. The following is from that chapter:
T (trap) flag, TSS — Generates a debug exception (#DB) when an attempt is made to switch to a task with the T flag set in its TSS.
I am not sure whether Windows or Linux uses this flag, but it is very interesting to read that chapter.
Hope this helps someone.
My understanding is that when you compile an application or DLL, whatever it compiles to contains symbols representing the functions and the variables.
When you have a debug build, these symbols are far more detailed than in a release build, allowing the debugger to give you more information. When you attach the debugger to a process, it looks at which functions are currently being executed and resolves all the available debugging symbols from there (since it knows what the internals of the compiled file look like, it can ascertain what might be in memory, with contents of ints, floats, strings, etc.). As the first poster said, this information and how these symbols work depend greatly on the environment and the language.
