Are mutexes really slower? - windows

I have read many times, here and elsewhere on the net, that mutexes are slower than critical sections/semaphores/insert-your-preferred-synchronisation-method-here, but I have never seen any paper or study to back up this claim.
So where does this idea come from? Is it a myth or reality? Are mutexes really slower?

In the book "Multithreading applications in win32" by Jim Beveridge and Robert Wiener it says: "It takes almost 100 times longer to lock an unowned mutex than it does to lock an unowned critical section because the critical section can be done in user mode without involving the kernel"
And on MSDN here it says "critical section objects provide a slightly faster, more efficient mechanism for mutual-exclusion synchronization"

I don't believe that any of the answers hit on the key point of why they are different.
Mutexes are at operating system level. A named mutex exists and is accessible from ANY process in the operating system (provided its ACL allows access from all).
Critical sections are faster because acquiring an uncontended one doesn't require a system call into kernel mode. However, they only work WITHIN a process; you cannot synchronize more than one process using a critical section. So depending on what you are trying to achieve and what your software design looks like, you should choose the most appropriate tool for the job.
I'll additionally point out that semaphores differ from mutexes/critical sections because of their count. A semaphore can be used to allow a limited number of concurrent accesses to a resource, whereas a mutex/critical section is either held or not held.
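To make the "count" point concrete, here is a minimal portable sketch of a counting semaphore built from a standard mutex and condition variable. It is not how Win32 semaphores are implemented; the class name CountingSemaphore and its methods are made up for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical illustration class, not a Win32 or Boost type.
class CountingSemaphore {
public:
    explicit CountingSemaphore(int initial) : count_(initial) {}

    // Block until a slot is free, then take it.
    void acquire() {
        std::unique_lock<std::mutex> guard(mutex_);
        cv_.wait(guard, [this] { return count_ > 0; });
        --count_;
    }

    // Take a slot only if one is free right now; never blocks on the count.
    bool try_acquire() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (count_ == 0) return false;
        --count_;
        return true;
    }

    // Give a slot back and wake one waiter, if any.
    void release() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            ++count_;
        }
        cv_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int count_;
};
```

With an initial count of 2, two holders can be inside at the same time and a third has to wait. With an initial count of 1 it behaves like a mutex without ownership tracking.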

A CRITICAL_SECTION is implemented as a spinlock with a capped spin count. See InitializeCriticalSectionAndSpinCount on MSDN for an indication of this.
When the spin count is exhausted, the critical section waits on a semaphore (or whatever kernel lock it is implemented with).
So in code it works roughly like this (not the real implementation, just an illustration; the field names are made up):
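CRITICAL_SECTION s;
void EnterCriticalSection( CRITICAL_SECTION* s )
{
    // max_count, Locked and KernelLock are illustrative fields, not the real layout
    int spin_count = s->max_count;
    while( --spin_count >= 0 )
    {
        // InterlockedExchange returns the previous value: 0 means the lock was free
        if( InterlockedExchange( &s->Locked, 1 ) == 0 )
        {
            // we own the lock now
            s->OwningThread = GetCurrentThreadId();
            return;
        }
    }
    // spin count exhausted: block on the kernel object until the owner releases it
    WaitForSingleObject( s->KernelLock, INFINITE );
}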
So if your critical section is held only for a very short time, and the entering thread only has to wait a few 'spins' (cycles), the critical section can be very efficient. But if this is not the case, the critical section wastes many cycles doing nothing, and then falls back to a kernel synchronization object.
So the tradeoff is:
Mutex: slow acquire/release, but no wasted cycles for long 'locked regions'.
CRITICAL_SECTION: fast acquire/release for unowned 'regions', but wasted cycles for owned sections.
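For anyone who wants to experiment with the spin-then-block idea outside of Win32, here is a rough portable sketch using only the C++ standard library. It is an assumption-laden toy, not how CRITICAL_SECTION is actually built: SpinThenBlockLock, max_spins and run_demo are made-up names, and unlock() always touches the internal mutex, which a real implementation avoids by tracking waiters.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical illustration class: spin a bounded number of times, then sleep.
class SpinThenBlockLock {
public:
    explicit SpinThenBlockLock(int max_spins = 4000) : max_spins_(max_spins) {}

    void lock() {
        // Fast path: try to take the flag without sleeping.
        for (int i = 0; i < max_spins_; ++i) {
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
        }
        // Slow path: sleep until unlock() signals and the flag can be taken.
        std::unique_lock<std::mutex> guard(wait_mutex_);
        wait_cv_.wait(guard, [this] {
            return !locked_.exchange(true, std::memory_order_acquire);
        });
    }

    void unlock() {
        {
            // Clearing the flag under wait_mutex_ prevents a lost wakeup.
            std::lock_guard<std::mutex> guard(wait_mutex_);
            locked_.store(false, std::memory_order_release);
        }
        wait_cv_.notify_one();
    }

private:
    const int max_spins_;
    std::atomic<bool> locked_{false};
    std::mutex wait_mutex_;
    std::condition_variable wait_cv_;
};

// Small demo: several threads increment one counter under the lock.
long run_demo(int thread_count, int iterations) {
    SpinThenBlockLock lock(100);  // small spin cap so the slow path is also used
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Raising the spin cap favours short hold times, while lowering it makes the lock behave more like a plain blocking mutex, which is the tradeoff described above.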

Yes, critical sections are more efficient. For a very good explanation, get "Concurrent Programming on Windows".
In a nutshell: a mutex is a kernel object, so there is always a transition into kernel mode when you acquire one, even if it is free. A critical section can be acquired without that transition in that case, and (on a multicore/multiprocessor machine) it will even spin for a few cycles if it's blocked, to avoid the expensive context switch.

A mutex (at least in Windows) allows for synchronization between different processes in addition to threads. This means extra work must be done to ensure this. Also, as Brian pointed out, using a mutex requires a switch to "kernel" mode, which causes another speed hit (I believe, i.e. infer, that the kernel is required for this interprocess synchronization, but I've got nothing to back me up on that).
Edit: You can find explicit reference to interprocess synchronization here and for more info on this topic, have a look at Interprocess Synchronization

Related

TCriticalSection.LockCount and negative values [duplicate]

I am debugging a deadlock issue, and the call stack shows that threads are waiting on some events.
The code uses a critical section as its synchronization primitive, and I think the issue is there.
Also, the debugger is pointing to a critical section that is owned by some other thread, but its lock count is -2.
As per my understanding, lock count > 0 means that the critical section is locked by one or more threads.
So is it possible that I am looking at the right critical section, the one that could be the culprit in the deadlock?
In what scenarios can a critical section have a negative lock count?
Beware: since Windows Server 2003 (for client OS this is Vista and newer) the meaning of LockCount has changed and -2 is a completely normal value, commonly seen when a thread has entered a critical section without waiting and no other thread is waiting for the CS. See Displaying a Critical Section:
In Microsoft Windows Server 2003 Service Pack 1 and later versions of Windows, the LockCount field is parsed as follows:
The lowest bit shows the lock status. If this bit is 0, the critical section is locked; if it is 1, the critical section is not locked.
The next bit shows whether a thread has been woken for this lock. If this bit is 0, then a thread has been woken for this lock; if it is 1, no thread has been woken.
The remaining bits are the ones-complement of the number of threads waiting for the lock.
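If you want to decode a raw value by hand, the three rules quoted above can be written as a small helper. This is only a sketch of those rules: LockCountInfo and decode_lock_count are made-up names, not part of any Windows header or debugger API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: the decoded view of a raw LockCount value.
struct LockCountInfo {
    bool locked;            // lowest bit == 0 means the section is locked
    bool thread_woken;      // next bit == 0 means a thread has been woken
    std::uint32_t waiters;  // ones-complement of the remaining bits
};

// Decode LockCount as described for Windows Server 2003 SP1 and later.
inline LockCountInfo decode_lock_count(std::int32_t lock_count) {
    const std::uint32_t raw = static_cast<std::uint32_t>(lock_count);
    LockCountInfo info;
    info.locked = (raw & 1u) == 0;
    info.thread_woken = (raw & 2u) == 0;
    info.waiters = (~raw) >> 2;
    return info;
}
```

For -2 this gives locked, no thread woken, and zero waiters, which matches the "completely normal value" described above. -1 decodes as unlocked with no waiters.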
I am assuming that you are talking about CCriticalSection class in MFC. I think you are looking at the right critical section. I have found that the critical section's lock count can go negative if the number of calls to Lock() function is less than the number of Unlock() calls. I found that this generally happens in the following type of code:
void f()
{
    CSingleLock lock(&m_synchronizer, TRUE);
    //Some logic here
    m_synchronizer.Unlock();
}
At first glance this code looks perfectly safe. However, note that I am using CCriticalSection's Unlock() method directly instead of CSingleLock's Unlock() method. What happens is that when the function exits, CSingleLock calls Unlock() on the critical section again in its destructor, and the lock count becomes negative. After this the application will be in bad shape and strange things start to happen. If you are using MFC critical sections, do check for this type of problem.
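The same rule applies to any RAII lock guard: if you need to release early, do it through the guard so that it knows it no longer owns the lock. Here is a portable sketch of the safe pattern using std::unique_lock as a stand-in for CSingleLock (this is standard C++, not MFC; g_synchronizer and g_shared_value are made-up names):

```cpp
#include <cassert>
#include <mutex>

std::mutex g_synchronizer;  // hypothetical stand-in for m_synchronizer
int g_shared_value = 0;     // hypothetical shared state

void f() {
    std::unique_lock<std::mutex> lock(g_synchronizer);  // acquires here
    ++g_shared_value;                                    // some logic here
    // Release early through the guard, not through the mutex itself.
    // The guard records that it no longer owns the mutex, so its
    // destructor will not unlock a second time.
    lock.unlock();
}
```

Calling g_synchronizer.unlock() directly instead would leave the guard believing it still owns the mutex, and its destructor would unlock again, which is the same mistake as in the MFC snippet.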

use of spin variants in network processing

I have written a kernel module that interacts with netfilter hooks.
The netfilter hooks operate in softirq context.
I am accessing a global data structure, a hash table, from softirq context as well as from process context. The process context access is due to a sysctl file being used to modify the contents of the hash table.
I am using spin_lock_irqsave.
Is this choice of spin lock API correct, in terms of performance and locking standards?
What would happen if an interrupt is scheduled on another processor while the lock is already held by process-context code on the current processor?
Firstly:
So, with all the above details I concluded that my softirqs can run concurrently on both cores.
Yes, this is correct. Your softirq handler may be executed "simultaneously on more than one CPU".
Your conclusion to use spinlocks sounds correct to me. However, this assumes that the critical section (ie., that which is executed with the spinlock held) has the following properties:
It must not sleep (for example, acquire a blocking mutex)
It should be as short as possible
Generally, if you're just updating your hash table, you should be fine here.
If an IRQ handler tries to acquire a spinlock that is held by a process context, that's fine. As long as your process context does not sleep with that lock held, the lock should be released within a short amount of time, allowing the IRQ handler to make forward progress.
I think the solution is appropriate. Softirqs run with preemption disabled anyway. To share data with process context, the process-context code must also disable both preemption and interrupts. In the case of a timer that only decrements the time stamp of an entry, it can do so atomically, i.e. the time stamp variable must be atomic. If a softirq on another core wants to acquire the spinlock while it is already held on the first core, it must wait.

CRITICAL_SECTION in boost?

is there something in boost that translates to windows CRITICAL_SECTION?
CRITICAL_SECTION is a so-called "user mode" mutex that uses spin locks instead of blocking and avoids expensive transitions to the kernel.
boost::mutex is what you want. Versions up to 1.34.1 used a Win32 critical section, but newer ones use a Win32 event and locks. I don't know why: Win32 mutexes are perfectly fine and as fast as an event (surely, he said...), unless you don't know whether you need their cross-process capability or the single-process limitation of a critical section.
That said, chances are the performance implications of locking are mainly down to losing the rest of your thread quantum, not necessarily kernel transitions.

How best to synchronize memory access shared between kernel and user space, in Windows

I can't find any function to acquire spinlock in Win32 Apis.
Is there a reason?
When I need to use spinlock, what do I do?
I know there is an InitializeCriticalSectionAndSpinCount function.
But that's not what I want.
Edit:
I want to synchronize access to memory that will be shared between kernel space and user space. The memory will be mapped.
I need to lock it when I access the data structure, and the lock will be held only very briefly.
The data structure (suppose it is a queue) manages event handles that the two sides use to interact with each other.
What synchronization mechanism should I use?
A spinlock is clearly not appropriate for user-level synchronization. From http://www.microsoft.com/whdc/driver/kernel/locks.mspx:
All types of spin locks raise the IRQL to DISPATCH_LEVEL or higher. Spin locks are the only synchronization mechanism that can be used at IRQL >= DISPATCH_LEVEL. Code that holds a spin lock runs at IRQL >= DISPATCH_LEVEL, which means that the system’s thread switching code (the dispatcher) cannot run and, therefore, the current thread cannot be pre-empted.
Imagine if it were possible to take a spin lock in user mode: Suddenly the thread would not be able to be pre-empted. So on a single-cpu machine, this is now an exclusive and real-time thread. The user-mode code would now be responsible for handling interrupts and other kernel-level tasks. The code could no longer access any paged memory, which means that the user-mode code would need to know what memory is currently paged and act accordingly. Cats and dogs living together, mass hysteria!
Perhaps a better question would be to tell us what you are trying to accomplish, and ask what synchronization method would be most appropriate.
There is a managed user-mode SpinLock as described here. Handle with care, as advised in the docs - it's easy to go badly wrong with these locks.
The only way to access this in native code is via the Win32 API you named already: InitializeCriticalSectionAndSpinCount and its siblings.

Avoiding sleep while holding a spinlock

I've recently read section 5.5.2 (Spinlocks and Atomic Context) of LDDv3 book:
Avoiding sleep while holding a lock can be more difficult; many kernel functions can sleep, and this behavior is not always well documented. Copying data to or from user space is an obvious example: the required user-space page may need to be swapped in from the disk before the copy can proceed, and that operation clearly requires a sleep. Just about any operation that must allocate memory can sleep; kmalloc can decide to give up the processor, and wait for more memory to become available unless it is explicitly told not to. Sleeps can happen in surprising places; writing code that will execute under a spinlock requires paying attention to every function that you call.
It's clear to me that spinlocks must always be held for the minimum time possible and I think that it's relatively easy to write correct spinlock-using code from scratch.
Suppose, however, that we have a big project where spinlocks are widely used.
How can we make sure that functions called from critical sections protected by spinlocks will never sleep?
Thanks in advance!
What about enabling "Sleep-inside-spinlock checking" for your kernel ? It is usually found under Kernel Debugging when you run make config. You might also try to duplicate its behavior in your code.
One thing I noticed on a lot of projects is that people seem to misuse spinlocks; they get used instead of the other locking primitives that should have been used.
A Linux spinlock only exists in multiprocessor builds (in single-processor builds the spinlock preprocessor defines are empty). Spinlocks are for short-duration locks on a multiprocessor platform.
If code fails to acquire a spinlock, it just spins the processor until the lock is free. So either another process running on a different processor must free the lock, or possibly it could be freed by an interrupt handler, but the wait event mechanism is a much better way of waiting on an interrupt.
The irqsave spinlock primitive is a tidy way of disabling/enabling interrupts so a driver can lock out an interrupt handler, but it should only be held for long enough for the process to update some variables shared with the interrupt handler; if you disable interrupts, you are not going to be scheduled.
If you need to lock out an interrupt handler, use a spinlock with irqsave.
For general kernel locking you should be using the mutex/semaphore API, which will sleep on the lock if it needs to.
To lock against code running in other processes, use mutex/semaphore.
To lock against code running in an interrupt context, use irq save/restore or spin_lock_irqsave/spin_unlock_irqrestore.
To lock against code running on other processors, use spinlocks and avoid holding the lock for long.
I hope this helps

Resources