Crash with all threads running SIGSEGV handler - debugging

We develop a user-space process running on Linux 3.4.11 in an embedded MIPS system. The process creates multiple (>10) threads using pthreads. The process has a SIGSEGV signal handler which, among other things, generates a log message which goes to our log file. As part of this flow, it acquires a semaphore (bad, I know...).
During our testing the process appeared to hang. We're currently unable to build gdb for the target platform, so I wrote a CLI tool that uses ptrace to extract the register values and USER data using PTRACE_PEEKUSR.
What surprised me was that all of our threads were inside our crash handler, trying to acquire the semaphore. This (obviously?) indicates a deadlock on the semaphore, which means that a thread died while holding it. When I dug through the stack, it seemed that almost all of the threads (all but one) were in a blocking call (recv, poll, sleep) when the signal handler started running. Manual stack reconstruction on MIPS is a pain, so we have not fully done it yet. One thread appeared to be in the middle of a malloc call, which to me indicates that it crashed due to heap corruption.
A couple of things are still unclear:
1) Assuming one thread crashed in malloc, why would all other threads be running the SIGSEGV handler? As I understand it, a SIGSEGV signal is delivered to the faulting thread, no? Does it mean that each and every one of our threads crashed?
2) Looking at the sigcontext struct for MIPS, it seems it does not contain the memory address which was accessed (badaddr). Is there another place that has it? I couldn't find it anywhere, but it seemed odd to me that it would not be available.
And of course, if anyone can suggest ways to continue the analysis, it would be appreciated!

Yes, it is likely that all of your threads crashed in turn, assuming that you have captured the thread state correctly.
siginfo_t has an si_addr member, which should give you the address of the fault; the handler receives a siginfo_t only if it is installed with SA_SIGINFO. Whether your kernel fills that in is a different matter.
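As a minimal sketch of reading si_addr, assuming a Linux/glibc target: the handler below is installed with SA_SIGINFO and records the faulting address. The names on_segv, make_inaccessible_page and probe_fault_address are made up for this illustration, and siglongjmp is used only so that the demonstration can return after the deliberate fault; a real crash handler would log the address with async-signal-safe calls and then terminate.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* for MAP_ANONYMOUS and sigaction under strict -std modes */
#endif
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

static sigjmp_buf recover_point;        /* where the demo resumes after the fault */
static void *volatile last_fault_addr;  /* written by the handler */

/* SA_SIGINFO-style handler: the kernel passes the fault address in si_addr. */
static void on_segv(int signo, siginfo_t *info, void *ucontext)
{
    (void)signo;
    (void)ucontext;
    last_fault_addr = info->si_addr;
    siglongjmp(recover_point, 1);       /* demo only: skip the faulting access */
}

/* Hypothetical helper: map one page that may be neither read nor written. */
void *make_inaccessible_page(void)
{
    void *p = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Reads one byte at addr. Returns the address reported by the kernel if the
 * read faulted with SIGSEGV, or NULL if it did not fault. */
void *probe_fault_address(const volatile char *addr)
{
    struct sigaction sa, old;

    assert(addr != NULL);
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;          /* three-argument handler */
    sa.sa_flags = SA_SIGINFO;           /* without this flag there is no siginfo_t */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return NULL;

    last_fault_addr = NULL;
    if (sigsetjmp(recover_point, 1) == 0)
        (void)*addr;                    /* deliberate read; faults on a PROT_NONE page */

    sigaction(SIGSEGV, &old, NULL);     /* restore the previous disposition */
    return last_fault_addr;
}
```

Calling probe_fault_address(make_inaccessible_page()) should return the page address itself if the kernel fills in si_addr; a NULL or unrelated value would suggest that it does not on that platform.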
In-process crash handlers will always be unreliable. You should use an out-of-process handler, and set kernel.core_pattern to invoke it. In current kernels, it is not necessary to write the core file to disk; you can either read the core file from standard input, or just map the process memory of the zombie process (which is still available when the kernel invokes the crash handler).
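A rough sketch of the reading side of such an out-of-process handler, assuming the kernel pipes the core image to the handler's standard input: the function below drains a file descriptor and checks for the ELF magic bytes. The struct core_summary type and the summarize_core name are invented for this example, and a real handler would parse or store the image rather than just count bytes.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* for POSIX declarations under strict -std modes */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Result of draining a core image from a file descriptor. */
struct core_summary {
    size_t total_bytes;    /* bytes read before end of file */
    int    has_elf_magic;  /* 1 if the stream starts with "\177ELF" */
};

/* Reads fd until EOF. Returns 0 on success, -1 on a read error. */
int summarize_core(int fd, struct core_summary *out)
{
    unsigned char buf[4096];
    unsigned char head[4];
    size_t have = 0;        /* number of header bytes captured so far */
    ssize_t n;

    assert(out != NULL);
    out->total_bytes = 0;
    out->has_elf_magic = 0;

    for (;;) {
        n = read(fd, buf, sizeof buf);
        if (n == 0)
            break;                      /* end of the core image */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, try again */
            return -1;
        }
        /* Capture the first four bytes even if they arrive in short reads. */
        for (ssize_t i = 0; i < n && have < sizeof head; i++)
            head[have++] = buf[i];
        out->total_bytes += (size_t)n;
    }

    out->has_elf_magic = (have == sizeof head && memcmp(head, "\177ELF", 4) == 0);
    return 0;
}
```

A handler built around this would call summarize_core(STDIN_FILENO, &summary) from main. Assuming it is installed at a hypothetical path such as /usr/local/bin/crash-handler, it could be registered with: sysctl -w kernel.core_pattern='|/usr/local/bin/crash-handler %p %s', where %p and %s are the kernel's specifiers for the PID and the signal number.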

Related

kernel code sleeping while holding a spinlock

Suppose that Linux driver code acquires a spinlock and, inside the critical section, a function call forces the process running on top of the driver to sleep. Given that holding a spinlock disables preemption on the relevant processor, is it possible for the process to wake up and so allow the driver code to release the spinlock?
No, it is not allowed to sleep while holding a spinlock. Code that does this is buggy.
The only way the process could be woken is if code running on another core did something to wake it up (which means that yes, it will certainly deadlock if there is only one core).
While the spinlock is held, other processes can't do anything to wake it up. You could try using semaphores for this type of switching.

How do debuggers/exceptions work on a compiled program?

A debugger makes perfect sense when you're talking about an interpreted program because instructions always pass through the interpreter for verification before execution. But how does a debugger for a compiled application work? If the instructions are already laid out in memory and running, how can I be notified that a 'breakpoint' has been reached, or that an 'exception' has occurred?
With the help of hardware and/or the operating system.
Most modern CPUs have several debug registers that can be set to trigger a CPU exception when a certain address is reached. They often also support address watchpoints, which trigger exceptions when the application reads from or writes to a specified address or address range, and single-stepping, which causes a process to execute a single instruction and throw an exception. These exceptions can be caught by a debugger attached to the program (see below).
Alternatively, some debuggers create breakpoints by temporarily replacing the instruction at the breakpoint with an interrupt or trap instruction (thereby also causing the program to raise a CPU exception). Once the breakpoint is hit, the debugger replaces it with the original instruction and single-steps the CPU past that instruction so that the program behaves normally.
As far as exceptions go, that depends on the system you're working on. On UNIX systems, debuggers generally use the ptrace() system call to attach to a process and get a first shot at handling its signals.
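To illustrate the "first shot at signals" point, here is a small Linux-only sketch, not how any particular debugger is implemented: a child asks to be traced with PTRACE_TRACEME and raises a signal, and the parent (playing the debugger) sees the child stop with that signal before it is delivered. The function name signal_seen_by_tracer is made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* for POSIX declarations under strict -std modes */
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a traced child that raises signo. Returns the signal the tracer
 * observed at the child's first stop, or -1 if the child did not stop. */
int signal_seen_by_tracer(int signo)
{
    pid_t child;
    int status = 0;
    int seen = -1;

    assert(signo > 0);
    child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Child: ask to be traced by the parent, then raise the signal. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
            _exit(2);
        raise(signo);
        _exit(0);
    }

    /* Parent: the traced child stops before the signal is delivered to it. */
    if (waitpid(child, &status, 0) < 0)
        return -1;
    if (WIFSTOPPED(status)) {
        seen = WSTOPSIG(status);
        kill(child, SIGKILL);           /* demo only: discard the child */
        waitpid(child, &status, 0);
    }
    return seen;
}
```

A real debugger would, at the point where this sketch kills the child, inspect registers and memory and then decide whether to pass the signal on or suppress it when resuming the process.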
TL;DR - low-level magic.

Spinlock not working to protect critical section on multi-core system

I have a character device driver which is causing a system deadlock on a multicore system. The write call has a critical section protected by a spin lock (spin_lock_irqsave). The ISR must obtain this lock to finish its task as well. If the ISR is called on one core while the write is executing the critical section on another, a panic occurs due to a watchdog timer detecting a hard lockup on the core for the ISR. The write process never returns to finish executing. Shouldn't the write process continue to execute on its core, release the lock which will allow the other core in its ISR to then run?
The critical section requires about 5us to complete. The hard lock occurs after 5 seconds.
I assume I'm doing something wrong but do not know what.
Appreciate any help!
Turns out the critical section was calling wait_for_completion_timeout. Even though the timeout was zero, it still slept and didn't wake up to release the spin lock if the interrupt occurred in the blocking section. Using try_wait_for_completion in this case resolved the issue.
I would have posted source but it spans many modules and has architecture abstractions for portability between operating systems. Would have been a mess.

CPU usage doesn't drop during SleepEx()

My program is a slideshow. It runs on a machine with other processes, so while it's waiting to display the next slide I call SleepEx(N, false), expecting it to reduce the amount of CPU it uses to near zero (N is between 100ms and 5000ms). On my development XP Pro machine that's exactly what happens, but on my customer's XP Home machine it registers 30-80% CPU during the SleepEx(). The code is a single thread, so whatever is using all that CPU is within the call to SleepEx(). Has anyone seen this before?
Which process is taking all that CPU? If you break into the process with a debugger - where in the stack trace is it spending time?
Try using ProcDump to create a dump of the process when it reaches that CPU spike. Then analyze the stack trace to see where it's stuck. Do this several times so that you get a good sampling of where it's spending time.
I have seen this before. You are blocking the main window's message-processing thread.
You should not call Sleep() in a single-threaded application that has a main window message-processing function. A windowed application should always process window messages without noticeable delay; otherwise it can cause a deadlock, at least for the application itself.
The consequences depend on the Windows platform, compiler settings and CPU configuration; an application built in debug mode usually has a temporary workaround. But if you run such an application compiled with release settings, it can consume one CPU core in the function that has blocked its main window message-processing thread.
The Remarks section of the MSDN description of Sleep() clearly describes this situation.
You just have to launch a new thread and call Sleep() there, so that window messages flow freely in the main thread.

Windows SuspendThread doesn't? (GetThreadContext fails)

We have a Windows32 application in which one thread can stop another to inspect its state [PC, etc.], by doing SuspendThread/GetThreadContext/ResumeThread.
    // SuspendThread returns a DWORD: the previous suspend count, or (DWORD)-1 on failure.
    // Because DWORD is unsigned, the "< 0" test below can never be true.
    if (SuspendThread((HANDLE)hComputeThread[threadId])<0) // freeze thread
        ThreadOperationFault("SuspendThread","InterruptGranule");
    CONTEXT Context, *pContext;
    Context.ContextFlags = (CONTEXT_INTEGER | CONTEXT_CONTROL);
    if (!GetThreadContext((HANDLE)hComputeThread[threadId],&Context))
        ThreadOperationFault("GetThreadContext","InterruptGranule");
Extremely rarely, on a multicore system, GetThreadContext returns error code 5 (Windows system error code "Access Denied").
The SuspendThread documentation seems to clearly indicate that the targeted thread is suspended, if no error is returned. We are checking the return status of SuspendThread and ResumeThread; they aren't complaining, ever.
How can it be the case that I can suspend a thread, but can't access its context?
This blog post, http://www.dcl.hpi.uni-potsdam.de/research/WRK/2009/01/what-does-suspendthread-really-do/, suggests that SuspendThread, when it returns, may have started the suspension of the other thread, but that thread hasn't yet suspended. In this case, I can kind of see how GetThreadContext would be problematic, but this seems like a stupid way to define SuspendThread. (How would the call of SuspendThread know when the target thread was actually suspended?)
EDIT: I lied. I said this was for Windows.
Well, the strange truth is that I don't see this behavior under Windows XP 64 (at least not in the last week and I don't really know what happened before that)... but we have been testing this Windows application under Wine on Ubuntu 10.x. The Wine source for the guts of GetThreadContext contains an Access Denied return response on line 819 when an attempt to grab the thread state fails for some reason. I'm guessing, but it appears that Wine GetThreadStatus believes that a thread just might not be accessible repeatedly. Why that would be true after a SuspendThread is beyond me, but there's the code. Thoughts?
EDIT2: I lied again. I said we only saw the behavior on Wine. Nope... we have now found a Vista Ultimate system that seems to produce the same error (again, rarely). So, it appears that Wine and Windows agree on an obscure case. It also appears that the mere enabling of the Sysinternals Process Monitor program aggravates the situation and causes the problem to appear on Windows XP 64; I suspect a Heisenbug. (The Process Monitor doesn't even exist on the Wine-tasting (:-) machine or the XP 64 system I use for development).
What on earth is it?
EDIT3: Sept 15 2010. I've added careful checking to the error return status, without otherwise disturbing the code, for SuspendThread, ResumeThread, and GetContext. I haven't seen any hint of this behavior on Windows systems since I did that. Haven't gotten back to the Wine experiment.
Nov 2010: Strange. It seems that if I compile this under VisualStudio 2005, it fails on Windows Vista and 7, but not earlier OSes. If I compile under VisualStudio 2010, it doesn't fail anywhere. One might point a finger at VisualStudio 2005, but I'm suspicious of a location-sensitive problem, and the different optimizers in VS 2005 and VS 2010 place the code in slightly different places.
Nov 2012: Saga continues. We see this failure on a number of XP and Windows 7 machines, at a pretty low rate (once every several thousand runs). Our Suspend activities are applied to threads that mostly execute pure computational code but that sometimes make calls into Windows. I don't recall seeing this issue when the PC of the thread was in our computational code. Of course, I can't see the PC of the thread when it hangs because GetContext won't give it to me, so I can't directly confirm that the problem only happens when executing system calls. But, all our system calls are channeled through one point, and so far the evidence is that point was executed when we get the hang. So the indirect evidence suggests GetContext on a thread only fails if a system call is being executed by that thread. I haven't had the energy to build a critical experiment to test this hypothesis yet.
Let me quote from Richter and Nasarre's "Windows via C/C++", 5th edition, which may shed some light:
DWORD SuspendThread(HANDLE hThread);
Any thread can call this function to suspend another thread (as long as you have the thread's handle). It goes without saying (but I'll say it anyway) that a thread can suspend itself but cannot resume itself. Like ResumeThread, SuspendThread returns the thread's previous suspend count. A thread can be suspended as many as MAXIMUM_SUSPEND_COUNT times (defined as 127 in WinNT.h). Note that SuspendThread is asynchronous with respect to kernel-mode execution, but user-mode execution does not occur until the thread is resumed.
In real life, an application must be careful when it calls SuspendThread because you have no idea what the thread might be doing when you attempt to suspend it. If the thread is attempting to allocate memory from a heap, for example, the thread will have a lock on the heap. As other threads attempt to access the heap, their execution will be halted until the first thread is resumed. SuspendThread is safe only if you know exactly what the target thread is (or might be doing) and you take extreme measures to avoid problems or deadlocks caused by suspending the thread.
...
Windows actually lets you look inside a thread's kernel object and grab its current set of CPU registers. To do this, you simply call GetThreadContext:
BOOL GetThreadContext(HANDLE hThread, PCONTEXT pContext);
To call this function, just allocate a CONTEXT structure, initialize some flags (the structure's ContextFlags member) indicating which registers you want to get back, and pass the address of the structure to GetThreadContext. The function then fills in the members you've requested.
You should call SuspendThread before calling GetThreadContext; otherwise, the thread might be scheduled and the thread's context might be different from what you get back. A thread actually has two contexts: user mode and kernel mode. GetThreadContext can return only the user-mode context of a thread. If you call SuspendThread to stop a thread but that thread is currently executing in kernel mode, its user-mode context is stable even though SuspendThread hasn't actually suspended the thread yet. But the thread cannot execute any more user-mode code until it is resumed, so you can safely consider the thread suspended and GetThreadContext will work.
My guess is that GetThreadContext may fail if you just called SuspendThread, while the thread is in kernel mode, and the kernel is locking the thread context block at this time.
Maybe on multicore systems, one core is still handling the kernel-mode execution of the thread whose user mode was just suspended, and keeps the thread's CONTEXT structure locked at exactly the moment the other core calls GetThreadContext.
Since this behaviour is not documented, I suggest contacting Microsoft.
There are some particular problems surrounding suspending a thread that owns a CriticalSection. I can't find a good reference to it now, but there is one mention of it on Raymond Chen's blog and another mention on Chris Brumme's blog. Basically, if you are unlucky enough to call SuspendThread while the thread is accessing an OS lock (e.g., heap lock, DllMain lock, etc.), then really strange things can happen. I would assume that this is the case that you are running into extremely rarely.
Does retrying the call to GetThreadContext work after a processor yield like Sleep(0)?
Old issue, but good to see that you have kept it updated with status changes while experiencing the problem for more than another 2 years.
The cause of your problem is that there is a bug in the translation layer of the x64 version of WoW64, as per:
http://social.msdn.microsoft.com/Forums/en/windowscompatibility/thread/1558e9ca-8180-4633-a349-534e8d51cf3a
There is a rather critical bug in GetThreadContext under WoW64 which makes it return stale contents, which makes it unusable in many situations. The contents are stored in user mode. This is why you think the value is not null, but in the stale contents it is still null.
This is why it fails on newer OSes but not older ones; try running it on a 32-bit Windows 7 OS.
As for why this bug seems to happen less often with solutions built with Visual Studio 2010 / 2012, it is likely that the compiler is doing something that mitigates most of the problem. To check, you should inspect the IL generated by both 2005 and 2010 and see what the differences are. For example, does the problem happen if the project is built without optimizations?
Finally, some further reading:
http://www.nynaeve.net/?p=129
Maybe a thread safety issue. Are you sure that the hComputeThread struct isn't changing out from under you? Maybe the thread was exiting when you called suspend? This may cause suspend to succeed, but by the time you call get context it is gone and the handle is invalid.
Calling SuspendThread on a thread that owns a synchronization object, such as a mutex or critical section, can lead to a deadlock if the calling thread tries to obtain a synchronization object owned by a suspended thread.
- MSDN
