Is using mutexes instead of critical sections harmful for the system? (Windows)

I came across a few articles discussing the differences between mutexes and critical sections.
One of the major differences I came across is that mutexes run in kernel mode whereas critical sections mainly run in user mode.
So if this is the case, aren't applications which use mutexes harmful for the system in case the application crashes?
Thanks.

Use Win32 mutex handles when you need a lock or synchronization across threads in different processes.
Use Win32 CRITICAL_SECTIONs when you need a lock between threads within the same process. They're cheaper in terms of time and don't involve a kernel system call unless there is lock contention. Critical section objects in Win32 can't span process boundaries anyway.
"Harmful" is the wrong word to use. More like "Win32 mutexes are slightly more expensive than Win32 critical sections in terms of performance". A running app that uses mutexes instead of critical sections won't likely hurt system performance. It will just run minutely slower. But depending on how often your lock is acquired and released, the difference may not even be measurable.
I forget the perf metrics I measured a long time ago. The bottom line is that the EnterCriticalSection and LeaveCriticalSection APIs are on the order of 10-100x faster than the equivalent usage of WaitForSingleObject and ReleaseMutex (on the order of 1 microsecond vs. 1 millisecond).
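For reference, here is a minimal sketch of the two usage patterns being compared (error handling omitted; the shared state being protected is hypothetical):

#include <windows.h>

static CRITICAL_SECTION g_cs;   /* user-mode lock, this process only */
static HANDLE g_mutex;          /* kernel object, can cross processes */

void init(void)
{
    InitializeCriticalSection(&g_cs);
    g_mutex = CreateMutex(NULL, FALSE, NULL);  /* unnamed, not initially owned */
}

void with_critical_section(void)
{
    EnterCriticalSection(&g_cs);   /* no kernel call unless contended */
    /* ... touch shared state ... */
    LeaveCriticalSection(&g_cs);
}

void with_mutex(void)
{
    WaitForSingleObject(g_mutex, INFINITE);  /* kernel transition every time */
    /* ... touch shared state ... */
    ReleaseMutex(g_mutex);
}

The kernel transition on every acquire/release is where the order-of-magnitude difference comes from.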

Related

semaphore on uni-processor in linux kernel

I'm trying to understand how synchronization works in the Linux kernel.
I have read that semaphores can be used for mutual exclusion, but I cannot find an example of a situation where a semaphore is needed.
So why use a semaphore on a uni-processor system?
I am assuming that you are interested in locking generally, rather than semaphores opposed to mutexes (see also "Difference between Counting and Binary Semaphores"[1]). I won't give a detailed explanation of locking, just point out a couple of things.
It usually makes sense to assume that code you might write could be executed on a multi-processor system (uni-processor is increasingly rare these days). I assume because you explicitly mentioned uni-processor that you understand that case.
The Linux kernel can be built to be fully preemptive while running kernel code[2][3]. In that case threads can be interrupted and resumed at almost any point, including, for example, in the middle of writing I/O to a device. If a thread writing I/O is interrupted and another thread accessing the same device is switched in, things will probably not work as intended (see the sketch after the references).
[1] Difference between Counting and Binary Semaphores
[2] https://kernelnewbies.org/FAQ/Preemption
[3] https://rt.wiki.kernel.org/index.php/CONFIG_PREEMPT_RT_Patch
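As a concrete illustration of why a lock is still needed on a preemptible uni-processor, here is a hedged sketch of a kernel semaphore protecting a device write; the xxx_* names and the device itself are made up:

#include <linux/init.h>
#include <linux/errno.h>
#include <linux/semaphore.h>

static struct semaphore dev_sem;

static int __init xxx_init(void)
{
    sema_init(&dev_sem, 1);   /* count 1: behaves as a sleeping mutex */
    return 0;
}

static int xxx_write_block(const char *buf, int len)
{
    if (down_interruptible(&dev_sem))   /* sleep until the device is free */
        return -ERESTARTSYS;
    /* ... multi-step device write; a preempting thread that calls in
       here now sleeps instead of interleaving its I/O with ours ... */
    up(&dev_sem);
    return len;
}

Even with a single CPU, a thread preempted in the middle of xxx_write_block() is protected: the thread that preempts it will sleep in down_interruptible() rather than corrupt the device state.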

What is the difference between monitors and other synchronization primitives

What is the actual difference between monitors and other synchronization primitives like mutexes, WinAPI events, and critical sections? It looks to me like they're quite the same thing: one thread at a time can lock the monitor, while other threads must wait for it to become free, much like with events and critical sections.
So, what is the difference? Where am I wrong?
All these synchronization primitives under Windows have similar operations (wait and signal), but the behaviour of those operations differs slightly, so the primitives are typically used differently.
A critical section has an owner thread, so it can be released (signaled) only by that owner.
Also, unlike the other primitives, the critical-section operations take a pointer instead of a HANDLE, so critical sections cannot be used with WaitForMultipleObjects and similar functions.
Mutexes are very similar to critical sections, but they are identified by a HANDLE, so they can be waited for together with other objects (using WaitForMultipleObjects).
The SignalObjectAndWait function can also be used with mutexes.
Events support a manual-reset mode, in which a successful wait does not reset the event, so several waiters can get past a single event at the same time.
Semaphores (the closest WinAPI analogue of monitors) allow a count limit above 1, meaning a code section protected by a semaphore is no longer exclusive the way it is with critical sections and mutexes.
Also, semaphores have no owner semantics, so they can be signalled by any thread. This feature is critical for some algorithms.
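A minimal sketch of those last two points, with a hypothetical worker function: a Win32 semaphore created with a count of 2 lets two threads into the protected section at once, and any thread may release it:

#include <windows.h>

static HANDLE sem;   /* created once at startup */

DWORD WINAPI worker(LPVOID arg)
{
    WaitForSingleObject(sem, INFINITE);   /* decrements the count, blocks at 0 */
    /* ... up to two workers run here concurrently ... */
    ReleaseSemaphore(sem, 1, NULL);       /* any thread may release; no owner check */
    return 0;
}

int main(void)
{
    sem = CreateSemaphore(NULL, 2, 2, NULL);   /* initial count = max count = 2 */
    /* ... spawn threads with CreateThread(..., worker, ...) ... */
    return 0;
}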

Why spinlocks don't work in uniprocessor (unicore) systems?

I know that spinlocks work by spinning, that different kernel paths exist, and that kernels are preemptive, so why don't spinlocks work in uniprocessor systems (for example, in Linux)?
If I understand your question, you're asking why spin locks are a bad idea on single core machines.
They should still work, but can be much more expensive than true thread-sleeping concurrency:
When you use a spinlock, you're essentially asserting that you don't think you will have to wait long. You are saying that you think it's better to keep your processor time slice in a busy loop than to pay the cost of sleeping your thread and context-switching to another thread or process. If you only have to wait a very short amount of time, you can sleep and be reawakened almost immediately, but the cost of going down and coming back up is more expensive than just waiting around.
This is more likely to be OK on multi-core processors, since they have much better concurrency profiles than single core processors. On multi core processors, between loop iterations, some other thread may have taken care of your prerequisite. On single core processors, it's not possible that someone else could have helped you out - you've locked up the one and only core.
The problem here is that if you wait or sleep on a lock, you hint to the system that you don't have everything you need yet, so it should go do some other stuff and come back to you later. With a spin lock, you never tell the system this, so you lock it up waiting for something else to happen - but, meanwhile, you're holding up the whole system, so something else can't happen.
The nature of a spinlock is that it does not deschedule the process - instead it spins until the process acquires the lock.
On a uniprocessor, it will either immediately acquire the lock or it will spin forever - if the lock is contended, then there will never be an opportunity for the process which currently holds the resource to give it up. Spinlocks are only useful when another process can execute while one is spinning on the lock - which means multiprocessor systems.
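To make "spins until it acquires the lock" concrete, here is a minimal user-space sketch of a test-and-set spinlock in C11 (not the kernel's actual implementation):

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    /* busy-wait: keep re-testing until the flag was previously clear */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* burning CPU the whole time we wait */
}

void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

On a non-preemptive uniprocessor, a thread stuck in that while loop monopolizes the only core, so the holder can never run to call spin_unlock().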
There are different variants of spinlock:
spin_lock_irqsave(&xxx_lock, flags);
... critical section here ..
spin_unlock_irqrestore(&xxx_lock, flags);
On a uniprocessor, spin_lock_irqsave() should be used when data needs to be shared between process context and interrupt context, since in this case IRQs also get disabled. spin_lock_irqsave() works under all circumstances, but partly because it is safe it is also fairly slow.
However, if the data only needs to be protected across different CPUs, it is better to use the version below; it is cheaper because IRQs don't get disabled:
spin_lock(&lock);
...
spin_unlock(&lock);
In uniprocessor systems, calling spin_lock_irqsave(&xxx_lock, flags); has the same effect as disabling interrupts, which provides the needed interrupt-concurrency protection without unneeded SMP protection. In multiprocessor systems, however, it covers both interrupt and SMP concurrency issues.
Spinlocks are, by their nature, intended for use on multiprocessor systems, although a uniprocessor workstation running a preemptive kernel behaves like SMP, as far as concurrency is concerned. If a nonpreemptive uniprocessor system ever went into a spin on a lock, it would spin forever; no other thread would ever be able to obtain the CPU to release the lock. For this reason, spinlock operations on uniprocessor systems without preemption enabled are optimized to do nothing, with the exception of the ones that change the IRQ masking status. Because of preemption, even if you never expect your code to run on an SMP system, you still need to implement proper locking.
Ref: Linux Device Drivers, by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman
The following two paragraphs from Operating Systems: Three Easy Pieces might be helpful:
For spin locks, in the single CPU case, performance overheads can be quite painful; imagine the case where the thread holding the lock is preempted within a critical section. The scheduler might then run every other thread (imagine there are N − 1 others), each of which tries to acquire the lock. In this case, each of those threads will spin for the duration of a time slice before giving up the CPU, a waste of CPU cycles.
However, on multiple CPUs, spin locks work reasonably well (if the number of threads roughly equals the number of CPUs). The thinking goes as follows: imagine Thread A on CPU 1 and Thread B on CPU 2, both contending for a lock. If Thread A (CPU 1) grabs the lock, and then Thread B tries to, B will spin (on CPU 2). However, presumably the critical section is short, and thus soon the lock becomes available, and is acquired by Thread B. Spinning to wait for a lock held on another processor doesn't waste many cycles in this case, and thus can be effective.

Usage of spinlock and cli together

I recently downloaded the Linux source from http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.34.1.tar.bz2 . I came across the paragraph below in the file spinlocks.txt in the linux-2.6.34.1\Documentation folder.
" it does mean that if you have some code that does
cli();
.. critical section ..
sti();
and another sequence that does
spin_lock_irqsave(flags);
.. critical section ..
spin_unlock_irqrestore(flags);
then they are NOT mutually exclusive, and the critical regions can happen
at the same time on two different CPU's. That's fine per se, but the
critical regions had better be critical for different things (ie they
can't stomp on each other). "
How can they conflict if one part of the code uses cli()/sti() and another part of the same code uses spin_lock_irqsave(flags)/spin_unlock_irqrestore(flags)?
The key part here is "on two different CPUs". Some background:
Historically on uni-processor (UP) systems the only source of concurrency was hardware interrupts. It was enough to cli/sti around the critical section to prevent an IRQ handler from messing things up.
Then there was the giant-lock design, where the kernel would effectively run on a single CPU and only one process could be in the kernel at a time (that's what the giant lock was for). Again, disabling interrupts was enough to protect the kernel from itself.
On full SMP systems, where multiple threads can be active in the kernel at the same time and interrupts can be delivered to pretty much any CPU, it's no longer enough to disable interrupts on a single processor, or to grab a single lock, alone. Both are required: disabling interrupts protects against the IRQ handler on the same CPU, while holding a lock protects against other threads entering the same critical section on a different CPU. This is exactly why spin_lock_irqsave() and spin_unlock_irqrestore() were invented.
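Here is a hedged sketch of that combination in a hypothetical driver (the xxx_* names are made up): the lock protects against other CPUs, and disabling local IRQs protects against the driver's own interrupt handler:

#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(buf_lock);   /* replaces the old cli()/sti() pattern */
static int shared_buf[16];

/* IRQ handler: may run on any CPU */
static irqreturn_t xxx_irq(int irq, void *dev)
{
    spin_lock(&buf_lock);           /* local IRQs are already off here */
    shared_buf[0]++;
    spin_unlock(&buf_lock);
    return IRQ_HANDLED;
}

/* process context: must exclude the local IRQ handler AND other CPUs */
void xxx_update(void)
{
    unsigned long flags;

    spin_lock_irqsave(&buf_lock, flags);
    shared_buf[0] = 0;
    spin_unlock_irqrestore(&buf_lock, flags);
}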

What to avoid for performance reasons in multithreaded code?

I'm currently reviewing/refactoring a multithreaded application which is supposed to be multithreaded in order to be able to use all the available cores and theoretically deliver better / superior performance (superior is the commercial term for better :P).
What are the things I should be aware when programming multithreaded applications?
I mean things that will greatly impact performance, maybe even to the point where you don't gain anything with multithreading at all but lose a lot by design complexity. What are the big red flags for multithreading applications?
Should I start questioning the locks and looking to a lock-free strategy or are there other points more important that should light a warning light?
Edit: The kind of answers I'd like are similar to the answer by Janusz: I want red warnings to look for in code. I know the application doesn't perform as well as it should, and I need to know where to start looking, what should worry me, and where I should put my efforts. I know it's kind of a general question, but I can't post the entire program, and if I could choose one section of code then I wouldn't need to ask in the first place.
I'm using Delphi 7, although the application will be ported / remade in .NET (C#) next year, so I'd rather hear comments that are applicable as general practice; if they must be language-specific, let it be to either of those two.
One thing to definitely avoid is lots of write access to the same cache lines from different threads.
For example: if you use a counter variable to count the number of items processed by all threads, this will really hurt performance because the CPU cache lines have to be synchronized whenever another CPU writes to the variable.
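A minimal sketch of the fix, assuming 64-byte cache lines and hypothetical names: give each thread its own counter and align it onto its own cache line:

#include <pthread.h>

/* BAD: adjacent counters share one cache line; every increment by one
   core invalidates the line in the other core's cache */
static long counters_shared[2];

/* BETTER: force each counter onto its own 64-byte cache line */
struct padded { _Alignas(64) long n; };
static struct padded counters_padded[2];

/* start routine for pthread_create; arg carries the thread index */
static void *worker(void *arg)
{
    long id = (long)arg;            /* 0 or 1; sketch-level argument passing */
    for (long i = 0; i < 100000000; i++)
        counters_padded[id].n++;    /* no cross-core cache-line ping-pong */
    return NULL;
}

With counters_shared instead, the same loop can run several times slower, because ownership of the shared cache line bounces between cores on every write.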
One thing that decreases performance is having two threads doing heavy hard-drive access. The drive would jump back and forth between providing data for one thread and the other, and both threads would spend their time waiting for the disk.
Something to keep in mind when locking: lock for as short a time as possible. For example, instead of this:
lock(syncObject)
{
    bool value = askSomeSharedResourceForSomeValue();
    if (value)
        DoSomethingIfTrue();
    else
        DoSomethingIfFalse();
}
Do this (if possible):
bool value = false;
lock(syncObject)
{
    value = askSomeSharedResourceForSomeValue();
}
if (value)
    DoSomethingIfTrue();
else
    DoSomethingIfFalse();
Of course, this example only works if DoSomethingIfTrue() and DoSomethingIfFalse() don't require synchronization, but it illustrates this point: locking for as short a time as possible, while maybe not always improving your performance, will improve the safety of your code in that it reduces surface area for synchronization problems.
And in certain cases, it will improve performance. Staying locked for long lengths of time means that other threads waiting for access to some resource are going to be waiting longer.
More threads than there are cores typically means that the program is not performing optimally.
So a program which spawns loads of threads is usually not designed in the best fashion. A good example of this practice is the classic socket example where every incoming connection gets its own thread to handle the connection. It is a very non-scalable way to do things: the more threads there are, the more time the OS has to spend context switching between them.
You should first be familiar with Amdahl's law.
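For reference, Amdahl's law puts a hard ceiling on what multithreading can buy you: if a fraction p of the runtime can be parallelized across n cores, the best possible speedup is

    S(n) = 1 / ((1 - p) + p / n)

For example, with p = 0.9 and n = 8 cores, S = 1 / (0.1 + 0.9/8) ≈ 4.7, not 8; and no matter how many cores you add, the speedup can never exceed 1 / (1 - p) = 10.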
If you are using Java, I recommend the book Java Concurrency in Practice; however, most of its help is specific to the Java language (Java 5 or later).
In general, reducing the amount of shared memory increases the amount of parallelism possible, and for performance that should be a major consideration.
Threading with GUIs is another thing to be aware of, but it looks like it is not relevant for this particular problem.
What kills performance is when two or more threads share the same resources. This could be an object, a file, a network connection, or a processor that both use. You cannot avoid these dependencies on shared resources, but if possible, try to minimize what is shared.
Run-time profilers may not work well with a multi-threaded application. Still, anything that makes a single-threaded application slow will also make a multi-threaded application slow. It may be an idea to run your application as a single-threaded application, and use a profiler, to find out where its performance hotspots (bottlenecks) are.
When it's running as a multi-threaded application, you can use the system's performance-monitoring tool to see whether locks are a problem. Assuming that your threads lock instead of busy-waiting, having 100% CPU across several threads is a sign that locking isn't a problem. Conversely, something that looks like 50% total CPU utilization on a dual-processor machine is a sign that only one thread is running, and so maybe your locking is a problem that's preventing more than one concurrent thread (when counting the number of CPUs in your machine, beware multi-core and hyperthreading).
Locks aren't only in your code but also in the APIs you use: e.g. the heap manager (whenever you allocate and delete memory), maybe in your logger implementation, maybe in some of the O/S APIs, etc.
Should I start questioning the locks and looking to a lock-free strategy
I always question the locks, but have never used a lock-free strategy; instead my ambition is to use locks where necessary, so that it's always threadsafe but will never deadlock, and to ensure that locks are acquired for a tiny amount of time (e.g. for no more than the amount of time it takes to push or pop a pointer on a thread-safe queue), so that the maximum amount of time that a thread may be blocked is insignificant compared to the time it spends doing useful work.
You don't mention the language you're using, so I'll make a general statement on locking. Locking is fairly expensive, especially the naive locking that is native to many languages. In many cases you are reading a shared variable (as opposed to writing it). Reading is thread-safe as long as it does not take place simultaneously with a write, yet you still have to lock it down. The most naive form of this locking treats the read and the write as the same type of operation, restricting access to the shared variable from other reads as well as writes. A reader/writer lock can dramatically improve performance: one writer, unlimited readers. On an app I've worked on, I saw a 35% performance improvement when switching to this construct. If you are working in .NET, the correct lock is ReaderWriterLockSlim.
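In .NET that's ReaderWriterLockSlim; as a language-neutral sketch, the POSIX equivalent (pthread_rwlock) looks like this:

#include <pthread.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static int shared_value;

int read_value(void)                 /* many readers may hold this at once */
{
    int v;
    pthread_rwlock_rdlock(&rw);
    v = shared_value;
    pthread_rwlock_unlock(&rw);
    return v;
}

void write_value(int v)              /* a writer waits for exclusive access */
{
    pthread_rwlock_wrlock(&rw);
    shared_value = v;
    pthread_rwlock_unlock(&rw);
}

The win comes from read-mostly workloads: concurrent readers no longer serialize behind each other, only behind the occasional writer.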
I recommend looking into running multiple processes rather than multiple threads within the same process, if it is a server application.
The benefit of dividing the work between several processes on one machine is that it is easy to increase the number of servers when more performance is needed than a single server can deliver.
You also reduce the risks involved with complex multithreaded applications where deadlocks, bottlenecks etc reduce the total performance.
There are commercial frameworks that simplify server software development when it comes to load balancing and distributed queue processing, but developing your own load-sharing infrastructure is not that complicated compared with what you will encounter in a multi-threaded application in general.
I'm using Delphi 7
You might be using COM objects, then, explicitly or implicitly; if you are, COM objects have their own complications and restrictions on threading: Processes, Threads, and Apartments.
You should first get a tool to monitor threads, specific to your language, framework, and IDE. Your own logger might do fine too (resume time, sleep time + duration). From there you can check for poorly performing threads that don't execute much or wait too long for something to happen; you might want to make the event they are waiting for occur as early as possible.
As you want to use both cores, you should check the usage of the cores with a tool that can graph the processor usage of both cores for your application only, or just make sure your computer is as idle as possible.
Besides that, you should profile your application just to make sure that the work performed within the threads is efficient, but watch out for premature optimization: there is no sense optimizing your multiprocessing if the threads themselves perform badly.
Looking for a lock-free strategy can help a lot, but it is not always possible to get your application to perform in a lock-free way.
Threads don't always equal performance.
Things are a lot better in certain operating systems than in others, but if you can have something sleep or relinquish its time until it's signaled, and avoid starting a new process for virtually everything, you'll save yourself from bogging the application down in context switching.
