TCriticalSection.LockCount and negative values [duplicate] - windows

I am debugging a deadlock issue, and the call stack shows that threads are waiting on some events.
The code uses a critical section as its synchronization primitive, and I think there is some issue there.
Also, the debugger is pointing to a critical section that is owned by some other thread, but its lock count is -2.
As per my understanding, a lock count > 0 means that the critical section is locked by one or more threads.
So is it possible that I am looking at the right critical section, the one that could be the culprit in the deadlock?
In what scenarios can a critical section have a negative lock count?

Beware: since Windows Server 2003 (for client OSes, Vista and newer) the meaning of LockCount has changed, and -2 is a completely normal value, commonly seen when a thread has entered a critical section without waiting and no other thread is waiting for the CS. See Displaying a Critical Section:
In Microsoft Windows Server 2003 Service Pack 1 and later versions of Windows, the LockCount field is parsed as follows:
The lowest bit shows the lock status. If this bit is 0, the critical section is locked; if it is 1, the critical section is not locked.
The next bit shows whether a thread has been woken for this lock. If this bit is 0, then a thread has been woken for this lock; if it is 1, no thread has been woken.
The remaining bits are the ones-complement of the number of threads waiting for the lock.
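To make the parsing concrete, here is a small decoder following the rules quoted above (a sketch, not an official API; it assumes the compiler performs an arithmetic right shift on negative values, as MSVC and GCC do):
#include <stdio.h>

void DecodeLockCount(long lockCount)
{
    int locked   = (lockCount & 1) == 0;  // bit 0: 0 = locked
    int woken    = (lockCount & 2) == 0;  // bit 1: 0 = a thread has been woken
    long waiters = ~(lockCount >> 2);     // ones-complement of the waiter count
    printf("locked=%d woken=%d waiters=%ld\n", locked, woken, waiters);
}
For LockCount = -2 (binary ...11111110) this prints locked=1 woken=0 waiters=0: the section is locked, no thread has been woken, and nobody is waiting, which is exactly the situation described above.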

I am assuming that you are talking about the CCriticalSection class in MFC. I think you are looking at the right critical section. I have found that a critical section's lock count can go negative if the number of calls to the Lock() function is less than the number of Unlock() calls. This generally happens in the following kind of code:
void f()
{
    CSingleLock lock(&m_synchronizer, TRUE); // acquires the critical section
    // Some logic here
    m_synchronizer.Unlock(); // unlocks the CCriticalSection directly
}
At first glance this code looks perfectly safe. However, note that it calls CCriticalSection's Unlock() method directly instead of CSingleLock's Unlock() method. What happens is that when the function exits, CSingleLock's destructor calls the critical section's Unlock() again, and its lock count goes negative. After this the application is in a bad state and strange things start to happen. If you are using MFC critical sections, do check for this type of problem.
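The fix is minimal: either let CSingleLock's destructor do the unlocking, or unlock through the CSingleLock object so that it knows the lock has already been released. A sketch of the corrected function:
void f()
{
    CSingleLock lock(&m_synchronizer, TRUE); // acquires the critical section
    // Some logic here
    lock.Unlock(); // unlock via CSingleLock; its destructor now knows the
                   // lock is already released and won't unlock a second time
}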

Related

KSPIN_LOCK blocks when acquiring from Driver's main thread

I have a KSPIN_LOCK which is shared between a Windows driver's main thread and some threads I created with PsCreateSystemThread. The problem is that the main thread blocks when I try to acquire the spinlock and never unblocks. I'm very confused as to why this happens; it's probably somehow connected to the fact that the main thread runs at the driver's IRQL, while the other threads run at PASSIVE_LEVEL, as far as I know.
NOTE: If I only run the main thread, acquiring/releasing the lock works just fine.
NOTE: I'm using the functions KeAcquireSpinLock and KeReleaseSpinLock to acquire/release the lock.
Here's my checklist for a "stuck" spinlock:
Make sure the spinlock was initialized with KeInitializeSpinLock. If the KSPIN_LOCK holds uninitialized garbage, then the first attempt to acquire it will likely spin forever.
Check that you're not acquiring it recursively/nested. KSPIN_LOCK does not support recursion, and if you try it, it will spin forever.
Normal spinlocks must be acquired at IRQL <= DISPATCH_LEVEL. If you need something that works at DIRQL, check out [1] and [2].
Check for leaks. If one processor acquires the spinlock, but forgets to release it, then the next processor will spin forever when trying to acquire the lock.
Ensure there are no memory-safety issues. If code randomly writes a non-zero value on top of the spinlock, that will make it appear to be acquired, and the next acquisition will spin forever.
Some of these issues can be caught easily and automatically with Driver Verifier; use it if you're not using it already. Other issues can be caught if you encapsulate the spinlock in a little helper that adds your own asserts. For example:
typedef struct _MY_LOCK {
    KSPIN_LOCK Lock;
    ULONG OwningProcessor;
    KIRQL OldIrql;
} MY_LOCK;

void MyInitialize(MY_LOCK *lock) {
    KeInitializeSpinLock(&lock->Lock);
    lock->OwningProcessor = (ULONG)-1;
}

void MyAcquire(MY_LOCK *lock) {
    ULONG current = KeGetCurrentProcessorIndex();
    NT_ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
    NT_ASSERT(current != lock->OwningProcessor); // check for recursion
    KeAcquireSpinLock(&lock->Lock, &lock->OldIrql);
    NT_ASSERT(lock->OwningProcessor == (ULONG)-1); // check lock was inited
    lock->OwningProcessor = current;
}

void MyRelease(MY_LOCK *lock) {
    NT_ASSERT(KeGetCurrentProcessorIndex() == lock->OwningProcessor);
    lock->OwningProcessor = (ULONG)-1;
    KeReleaseSpinLock(&lock->Lock, lock->OldIrql);
}
Wrappers around KSPIN_LOCK are common. The KSPIN_LOCK is like a race car that has all the optional features stripped off to maximize raw speed. If you aren't counting microseconds, you might reasonably decide to add back the heated seats and FM radio by wrapping the low-level KSPIN_LOCK in something like the above. (And with the magic of #ifdefs, you can always take the airbags out of your retail builds, if you need to.)
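For instance, a minimal sketch of that #ifdef trick (assuming the WDK's usual DBG macro; the MY_ACQUIRE name is purely illustrative):
#if DBG
// checked builds route through the asserting wrapper above
#define MY_ACQUIRE(l) MyAcquire(l)
#else
// retail builds fall through to the raw spinlock with no extra overhead
#define MY_ACQUIRE(l) KeAcquireSpinLock(&(l)->Lock, &(l)->OldIrql)
#endif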

A DLL should free heap memory only if the DLL is unloaded dynamically?

Question Purpose: Reality check on the MS docs of DllMain.
It is "common" knowledge that you shouldn't do too much in DllMain, there are definite things you must never do, some best practises.
I now stumbled over a new gem in the docs, that makes little sense to me: (emph. mine)
When handling DLL_PROCESS_DETACH, a DLL should free resources such as
heap memory only if the DLL is being unloaded dynamically (the
lpReserved parameter is NULL). If the process is terminating (the
lpvReserved parameter is non-NULL), all threads in the process except
the current thread either have exited already or have been explicitly
terminated by a call to the ExitProcess function, which might leave
some process resources such as heaps in an inconsistent state. In this
case, it is not safe for the DLL to clean up the resources. Instead,
the DLL should allow the operating system to reclaim the memory.
Since global C++ objects are cleaned up during DllMain/DLL_PROCESS_DETACH, this would imply that global C++ objects must not free any dynamic memory, because the heap may be in an inconsistent state, at least when the DLL is "linked statically" to the executable. That is certainly not what I see out there: the global C++ objects (if there are any) of various libraries, ours and third party, allocate and deallocate just fine in their destructors. (Barring other ordering bugs, of course.)
So, what specific technical problem is this warning aimed at?
Since the paragraph mentions thread termination, could there be a heap corruption problem when some threads are not cleaned up correctly?
In general, the ExitProcess API does the following:
Enter the loader lock critical section.
Lock the main process heap (the one returned by GetProcessHeap()) via HeapLock(GetProcessHeap()) (strictly speaking, via RtlLockHeap). This is a very important step for avoiding deadlock.
Terminate all threads in the process except the current one (by calling NtTerminateProcess(0, 0)).
Call LdrShutdownProcess. Inside this API, the loader walks the loaded module list and sends DLL_PROCESS_DETACH with lpvReserved non-NULL.
Finally, call NtTerminateProcess(NtCurrentProcess(), ExitCode), which terminates the process.
The problem here is that the threads are terminated at arbitrary points. For example, a thread can be in the middle of allocating or freeing memory from some heap, inside that heap's critical section, when it is terminated. As a result, if code running during DLL_PROCESS_DETACH tries to free a block from the same heap, it deadlocks trying to enter that heap's critical section (if the heap implementation uses one, of course).
Note that this does not affect the main process heap, because HeapLock is called for it before all the other threads are terminated. That is exactly the purpose of the call: it waits until every other thread has exited the process heap's critical section, and once the critical section is acquired, no other thread can enter it, because the main process heap is locked.
So when the threads are terminated after the main heap has been locked, we can be sure that none of the killed threads was inside the main heap's critical section or left its structures in an inconsistent state, thanks to the RtlLockHeap call. But this applies only to the main process heap. Any other heap in the process is not locked, so it can be in an inconsistent state during DLL_PROCESS_DETACH, or be exclusively held by an already-terminated thread.
So using HeapFree on GetProcessHeap(), or LocalFree for that matter, is safe here (although this is not documented).
Using HeapFree on any other heap is not safe if DllMain is called during process termination.
Likewise, if you have custom data structures used by several threads, they can be in an inconsistent state, because the other threads (which may have been using them) were terminated at an arbitrary point.
So this note is warning you that when the lpvReserved parameter is non-NULL (which means DllMain is being called during process termination), you need to be especially careful when cleaning up resources. In any case, all internal memory allocations will be freed by the operating system when the process dies.
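To make the rule concrete, a minimal DllMain sketch along these lines (FreeMyResources is a hypothetical cleanup routine):
#include <windows.h>

void FreeMyResources(void); // hypothetical cleanup routine

BOOL WINAPI DllMain(HINSTANCE hinstDll, DWORD fdwReason, LPVOID lpvReserved)
{
    if (fdwReason == DLL_PROCESS_DETACH)
    {
        if (lpvReserved == NULL)
        {
            // Dynamic unload (FreeLibrary): all threads are still healthy,
            // so it is safe to release heap memory and other resources.
            FreeMyResources();
        }
        // else: process termination. Other threads were killed at arbitrary
        // points and may have left heaps locked or inconsistent; do nothing
        // and let the operating system reclaim the memory.
    }
    return TRUE;
}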
As an addendum to RbMm's excellent answer, I'll add a quote from the ExitProcess documentation that does a much better job than the DllMain docs at explaining why heap operations (or any operations, really) can be compromised:
If one of the terminated threads in the process holds a lock and the
DLL detach code in one of the loaded DLLs attempts to acquire the same
lock, then calling ExitProcess results in a deadlock. In contrast, if
a process terminates by calling TerminateProcess, the DLLs that the
process is attached to are not notified of the process termination.
Therefore, if you do not know the state of all threads in your
process, it is better to call TerminateProcess than ExitProcess. Note
that returning from the main function of an application results in a
call to ExitProcess.
So it all boils down to this: if your application has "runaway" threads that may hold any lock (the CRT heap lock being a prominent example), you have a big problem during shutdown, when you need to access the same structures, e.g. the heap, that your "runaway" threads are using.
Which just goes to show that you should shut down all your threads in a controlled way.
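A minimal sketch of such a controlled shutdown (g_stopEvent, threadHandles, and threadCount are illustrative names):
// Ask the workers to stop, then wait for all of them before exiting.
SetEvent(g_stopEvent); // workers wait on / poll this event and return
WaitForMultipleObjects(threadCount, threadHandles, TRUE /*wait for all*/, INFINITE);
// Only now return from main() or call ExitProcess: no runaway thread can
// still be holding the heap lock, or any other lock.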

Pausing a kthread in Linux

Here, https://blog.packagecloud.io/eng/2017/02/06/monitoring-tuning-linux-networking-stack-sending-data/#queuing-disciplines, it is written:
As you’ll see from the previous post, the NET_TX_SOFTIRQ softirq has
the function net_tx_action registered to it. This means that there is
a kernel thread executing net_tx_action. That thread is occasionally
paused and raise_softirq_irqoff resumes it. Let’s take a look at what
net_tx_action does so we can understand how the kernel processes
transmit requests.
It is written that the kthread is occasionally paused. When is a kthread paused, and why?
How does the kthread know about work to execute? Does it poll a queue?
I think what is said there about pausing a thread is more of a figure of speech. It's not that the kthread is paused; the thread works just fine.
The body of softirq processing is in the __do_softirq() function.
There's a number of softirq types, and each softirq type is represented by a bit in a bitmask. Whenever there's work for a specific type of softirq, the corresponding bit is raised in the bitmask. __do_softirq() processes this bitmask bit by bit starting with the least significant bit, and does the work for each softirq type that has the bit set. Thus softirq types are processed in the order of priority, with the bit 0 representing the highest priority. In fact, if you look at the code you'll see that the bitmask is saved (copied) and then cleared before the processing starts, and it's the copy that is processed.
The bit for NET_TX_SOFTIRQ is raised each time a new skb is submitted to the kernel networking stack to send data out. That causes __do_softirq() to call net_tx_action() for outgoing data. If there's no data to send out, then the bit is not raised. Essentially, that's what causes the kernel softirq thread to "pause" which is just a layman way to say that there's no work for it, so net_tx_action() is not called. As soon as there's more data, the bit is raised again as data is submitted to the kernel networking stack. __do_softirq() sees that and calls net_tx_action() again.
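As an illustration only (a user-space sketch, not the real kernel code), the dispatch logic looks roughly like this:
#include <stdio.h>

// Sketch of the __do_softirq() dispatch idea: pending softirq types are
// bits in a mask, handled from bit 0 (highest priority) upward.
enum { HI_SOFTIRQ, TIMER_SOFTIRQ, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ, NR_SOFTIRQS };

static unsigned int pending_mask; // stands in for the per-CPU pending mask

static void raise_softirq(int nr) { pending_mask |= 1u << nr; }

static void do_softirq(void)
{
    unsigned int pending = pending_mask; // save a copy of the mask...
    pending_mask = 0;                    // ...and clear the live one
    for (int nr = 0; pending; nr++, pending >>= 1)
        if (pending & 1)
            printf("running action for softirq %d\n", nr);
}

int main(void)
{
    raise_softirq(NET_TX_SOFTIRQ); // an skb was queued for transmit
    do_softirq();                  // net_tx_action() would run here
    do_softirq();                  // nothing pending: the "pause"
    return 0;
}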
There's a softirq thread on each CPU. A thread runs when there's at least one pending softirq type. The threads are defined in the softirq_threads structure and started by the spawn_ksoftirqd() function.

What does the term "interrupt safe" mean?

I come across this term every now and then.
And now I really need a clear explanation as I wish to use some MPI routines that
are said not to be interrupt-safe.
I believe it's another wording for reentrant. If a function is reentrant it can be interrupted in the middle and called again.
For example:
#include <pthread.h>

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void function(void)
{
    pthread_mutex_lock(&mtx);
    /* code ... */
    pthread_mutex_unlock(&mtx);
}
This function can clearly be called by different threads (the mutex protects the code inside). But if a signal arrives after pthread_mutex_lock() succeeds and the signal handler calls the function again, it will deadlock. So it's not interrupt-safe.
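One common way to make such a function interrupt-safe on POSIX is to block signals around the critical region; a minimal sketch:
#include <pthread.h>
#include <signal.h>

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void function_safe(void)
{
    sigset_t block, old;
    sigfillset(&block);
    /* Block all signals in this thread so no handler can re-enter us
       while we hold the mutex. */
    pthread_sigmask(SIG_BLOCK, &block, &old);
    pthread_mutex_lock(&mtx);
    /* code ... */
    pthread_mutex_unlock(&mtx);
    pthread_sigmask(SIG_SETMASK, &old, NULL); /* restore the previous mask */
}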
Code that is safe from concurrent access from an interrupt is said to be interrupt-safe.
Consider a situation where your process is inside a critical section and an asynchronous event arrives, interrupting it, and the interrupting code accesses the same shared resource the process was using before being preempted.
It is a major bug if an interrupt occurs in the middle of code that is manipulating a resource and the interrupt handler can access the same resource. Locking can save you!

Are mutexes really slower?

I have read so many times, here and everywhere on the net, that mutexes are slower than critical sections/semaphores/insert-your-preferred-synchronization-method-here, but I have never seen any paper or study to back up this claim.
So, where does this idea come from? Is it a myth or a reality? Are mutexes really slower?
In the book "Multithreading Applications in Win32" by Jim Beveridge and Robert Wiener, it says: "It takes almost 100 times longer to lock an unowned mutex than it does to lock an unowned critical section because the critical section can be done in user mode without involving the kernel"
And on MSDN here it says "critical section objects provide a slightly faster, more efficient mechanism for mutual-exclusion synchronization"
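If you want to check the claim on your own machine, here is a rough micro-benchmark sketch for the uncontended case (single thread, no waiting):
#include <windows.h>
#include <stdio.h>

int main(void)
{
    const int N = 1000000;
    CRITICAL_SECTION cs;
    InitializeCriticalSection(&cs);
    HANDLE mtx = CreateMutexW(NULL, FALSE, NULL);

    LARGE_INTEGER f, t0, t1, t2;
    QueryPerformanceFrequency(&f);

    QueryPerformanceCounter(&t0);
    for (int i = 0; i < N; i++) { EnterCriticalSection(&cs); LeaveCriticalSection(&cs); }
    QueryPerformanceCounter(&t1);
    for (int i = 0; i < N; i++) { WaitForSingleObject(mtx, INFINITE); ReleaseMutex(mtx); }
    QueryPerformanceCounter(&t2);

    printf("critical section: %.0f ns per lock/unlock\n",
           (t1.QuadPart - t0.QuadPart) * 1e9 / f.QuadPart / N);
    printf("mutex:            %.0f ns per lock/unlock\n",
           (t2.QuadPart - t1.QuadPart) * 1e9 / f.QuadPart / N);

    CloseHandle(mtx);
    DeleteCriticalSection(&cs);
    return 0;
}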
I don't believe that any of the answers hit on the key point of why they are different.
Mutexes operate at the operating-system level. A named mutex exists in, and is accessible from, ANY process in the operating system (provided its ACL allows access).
Critical sections are faster because (in the uncontended case) they don't require a system call into kernel mode; however, they only work WITHIN a process, so you cannot synchronize multiple processes using a critical section. Depending on what you are trying to achieve and what your software design looks like, choose the most appropriate tool for the job.
I'll additionally point out that semaphores are distinct from mutexes/critical sections because of their count. A semaphore can be used to control multiple concurrent accesses to a resource, whereas a mutex/critical section is either being accessed or not.
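For illustration, a quick Win32 sketch of that counting behaviour:
// A semaphore with a count of 3 admits up to three concurrent holders,
// unlike a mutex/critical section, which admits exactly one.
HANDLE sem = CreateSemaphoreW(NULL, 3 /*initial count*/, 3 /*maximum count*/, NULL);
WaitForSingleObject(sem, INFINITE); // take one slot (two remain)
// ... use one of the three pooled resources ...
ReleaseSemaphore(sem, 1, NULL);     // give the slot back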
A CRITICAL_SECTION is implemented as a spinlock with a capped spin count. See the MSDN documentation for InitializeCriticalSectionAndSpinCount for an indication of this.
When the spin count has elapsed, the critical section falls back to waiting on a semaphore (or whatever kernel lock it is implemented with).
So in code it works roughly like this (not actually working code, just an example):
CRITICAL_SECTION s;

void EnterCriticalSection( CRITICAL_SECTION* s )
{
    int spin_count = s->max_count;
    while( --spin_count >= 0 )
    {
        // InterlockedExchange returns the previous value: 0 means the
        // lock was free and we have just taken it.
        if( InterlockedExchange( &s->Locked, 1 ) == 0 )
        {
            // we own the lock now
            s->OwningThread = GetCurrentThread();
            return;
        }
    }
    // spinning failed: fall back to the kernel object and wait for an unlock
    WaitForSingleObject( s->KernelLock, INFINITE );
}
So if your critical section is held for only a very short time, and the entering thread has to wait only a few 'spins' (cycles), a critical section can be very efficient. But if that is not the case, it wastes many cycles doing nothing and then falls back to a kernel synchronization object anyway.
So the tradeoff is:
Mutex: slow acquire/release, but no wasted cycles for long 'locked regions'.
CRITICAL_SECTION: fast acquire/release for unowned 'regions', but wasted cycles when the region is already owned.
Yes, critical sections are more efficient. For a very good explanation, get "Concurrent Programming on Windows".
In a nutshell: a mutex is a kernel object, so acquiring one always involves a switch into kernel mode, even if the mutex is "free". A critical section can be acquired without entering the kernel in that case, and (on a multicore/multiprocessor machine) it will even spin for a few cycles if it's blocked, to avoid the expensive context switch.
A mutex (at least in Windows) allows synchronization between different processes, not just threads, which means extra work must be done to support it. Also, as Brian pointed out, using a mutex requires a switch to kernel mode, which causes another speed hit (I believe, i.e. infer, that the kernel is required for this interprocess synchronization, but I have nothing to back that up).
Edit: You can find an explicit reference to interprocess synchronization here; for more on this topic, have a look at Interprocess Synchronization.
