Usage of spinlock and cli together - linux-kernel

I recently downloaded linux source from http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.34.1.tar.bz2 . I came across the below paragraph in the file called spinlocks.txt in linux-2.6.34.1\Documentation folder.
" it does mean that if you have some code that does
cli();
.. critical section ..
sti();
and another sequence that does
spin_lock_irqsave(flags);
.. critical section ..
spin_unlock_irqrestore(flags);
then they are NOT mutually exclusive, and the critical regions can happen
at the same time on two different CPU's. That's fine per se, but the
critical regions had better be critical for different things (ie they
can't stomp on each other). "
How can they affect each other if one part of the code uses cli()/sti() and another part of the same code uses spin_lock_irqsave(flags)/spin_unlock_irqrestore(flags)?

The key part here is "on two different CPUs". Some background:
Historically on uni-processor (UP) systems the only source of concurrency was hardware interrupts. It was enough to cli/sti around the critical section to prevent an IRQ handler from messing things up.
Then there was the giant lock design, where the kernel would effectively run on a single CPU and only one process could be in the kernel at a time (that is what the giant lock was for). Again, disabling interrupts was enough to protect the kernel from itself.
On full SMP systems, where multiple threads can be active in the kernel at the same time and interrupts can be delivered to pretty much any CPU, it is no longer enough to only disable interrupts on a single processor, or to only grab a single lock. Both are required: disabling interrupts protects against an IRQ handler on the same CPU, and holding a lock protects against other threads entering the same critical section on a different CPU. This is exactly why spin_lock_irqsave() and spin_unlock_irqrestore() were invented.
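The difference between the two mechanisms can be made concrete with a toy model. The sketch below is not kernel code: `cpu_model`, `lock_model` and all the `model_*` functions are hypothetical names invented for illustration. It only tracks a per-CPU interrupt flag and one shared lock word, and counts how many CPUs end up inside a critical section at once.

```c
#include <stdbool.h>

/* Toy model, not kernel code: every name here is made up for illustration. */
struct cpu_model {
    bool irqs_enabled;          /* per-CPU state: cli/sti only touch this */
};

struct lock_model {
    bool held;                  /* shared state: visible to every CPU */
};

/* cli() analogue: masks interrupts on the calling CPU only. */
static void model_cli(struct cpu_model *cpu) { cpu->irqs_enabled = false; }

/* sti() analogue: unmasks interrupts on the calling CPU only. */
static void model_sti(struct cpu_model *cpu) { cpu->irqs_enabled = true; }

/*
 * spin_lock_irqsave() analogue: mask local interrupts, then try to take
 * the shared lock. Returns false where a real CPU would keep spinning.
 */
static bool model_try_lock_irqsave(struct cpu_model *cpu,
                                   struct lock_model *lock, bool *saved)
{
    *saved = cpu->irqs_enabled;
    cpu->irqs_enabled = false;
    if (lock->held) {
        cpu->irqs_enabled = *saved;     /* give up and restore the flag */
        return false;
    }
    lock->held = true;
    return true;
}

/* spin_unlock_irqrestore() analogue. */
static void model_unlock_irqrestore(struct cpu_model *cpu,
                                    struct lock_model *lock, bool saved)
{
    lock->held = false;
    cpu->irqs_enabled = saved;
}

/* CPU 0 uses cli/sti, CPU 1 uses the lock: how many are inside at once? */
int cli_vs_lock_overlap(void)
{
    struct cpu_model cpu0 = { true }, cpu1 = { true };
    struct lock_model lock = { false };
    bool saved;
    int inside = 0;

    model_cli(&cpu0);                   /* CPU 0 enters; nothing shared is taken */
    inside++;
    if (model_try_lock_irqsave(&cpu1, &lock, &saved)) {
        inside++;                       /* CPU 1 enters too: the lock was free */
        model_unlock_irqrestore(&cpu1, &lock, saved);
    }
    model_sti(&cpu0);
    return inside;
}

/* Both CPUs use the same lock: how many are inside at once? */
int lock_vs_lock_overlap(void)
{
    struct cpu_model cpu0 = { true }, cpu1 = { true };
    struct lock_model lock = { false };
    bool saved0, saved1;
    int inside = 0;

    if (model_try_lock_irqsave(&cpu0, &lock, &saved0))
        inside++;                       /* CPU 0 holds the lock */
    if (model_try_lock_irqsave(&cpu1, &lock, &saved1)) {
        inside++;                       /* not reached: CPU 1 would spin */
        model_unlock_irqrestore(&cpu1, &lock, saved1);
    }
    model_unlock_irqrestore(&cpu0, &lock, saved0);
    return inside;
}
```

In this model `cli_vs_lock_overlap()` reports two CPUs inside their sections at once, because cli() on CPU 0 never touches anything CPU 1 looks at, while `lock_vs_lock_overlap()` reports one. That is the situation the quoted paragraph warns about.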

Related

Processor pipeline state preservation

Is there any situation where the state of the processor pipeline (with already decoded or prefetched instructions) is saved and subsequently reloaded on resumption after a thread sleep, context switch, interrupt, etc.? (Maybe as an optimization.)
This isn't possible for any CPU I'm aware of. There's no interface for doing it, and no conditions under which a CPU does it on its own. Dumping a huge amount of internal CPU state to RAM would take more cycles than it would save. Having the OS keep track of the variable-size chunks of RAM needed for this would just make the overhead worse.
If anything were worth saving, BTW, it would be the results of already-executed instructions that can't retire yet because of a load that missed in cache. (All the common out-of-order execution designs for mainstream ISAs use in-order retirement to support precise exceptions. Out-of-order retirement with checkpointing / rollback on exceptions and mispredicts has been proposed. Search for "kilo-instruction processor", IIRC.)
(flawed idea): An aggressive out-of-order design could avoid wasting too much work on context switches by delaying the write of the interrupt-return address when an external interrupt arrives, i.e. it could pretend that the interrupt came in later than it did by allowing some instructions already in the pipeline to keep executing. If the user-space instruction pointer isn't needed until the interrupt handler returns, the CPU could avoid clearing the pipeline.
Hrm, this has the major difficulty that register values on entry into the interrupt handler also depend on the architectural state, so this probably can't work.
This definitely can't work for interrupts generated by user-space, because that fixes the return address.
This isn't an issue for threads that put themselves to sleep while waiting on a spinlock with monitor / mwait or something. mwait presumably doesn't take effect until it retires, and it won't retire until all previous work has been done. It would defeat the intended purpose for the CPU to be aggressive about speculatively executing past mwait, I think. Or maybe mwait doesn't even flush the pipeline, and just saves power.
The idea has been proposed, but you'd need a much denser memory technology which is only now becoming available. See this paper for example:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6489970
Basically, they propose pipes composed of a new set of latches & registers based on memristors (resistive non-volatile memory components), that can hold multiple values corresponding to multiple threads. Control logic can then tell all latches which thread should be active, and allow simultaneous context switching throughout the entire pipe.
Keep in mind that this only refines the granularity to the latch level. Modern CPUs with simultaneous multithreading can already have different threads active in different units without context switches, simply through arbitration. Other units with inherent parallelism may already handle multiple threads per cycle (e.g. multi-ported ALUs).

Is using mutexes over CS harmful for the system?

I came across a few articles discussing the differences between mutexes and critical sections.
One of the major differences I came across is that mutexes run in kernel mode, whereas critical sections mainly run in user mode.
If that is the case, aren't applications that use mutexes harmful to the system if the application crashes?
Thanks.
Use Win32 mutex handles when you need a lock or synchronization across threads in different processes.
Use Win32 CRITICAL_SECTIONs when you need to have a lock between threads within the same process. It's cheaper as far as time and doesn't involve a kernel system call unless there is lock contention. Critical Section objects in Win32 can't span process boundaries anyway.
"Harmful" is the wrong word to use. More like "Win32 mutexes are slightly more expensive than Win32 critical sections in terms of performance". A running app that uses mutexes instead of critical sections is unlikely to hurt system performance. It will just run minutely slower. And depending on how often your lock is acquired and released, the difference may not even be measurable.
I forget the exact perf metrics I measured a long time ago. The bottom line is that the EnterCriticalSection and LeaveCriticalSection APIs are on the order of 10-100x faster than the equivalent usage of WaitForSingleObject and ReleaseMutex when the lock is uncontended, because they avoid the transition into the kernel.

Are there any performance penalties for running SMP enabled Linux kernel on a Uni processor (ARM Cortex A8 based SOC)?

This is a two-fold question that arose from my observation that I am running an SMP-enabled Linux kernel on our ARM Cortex-A8 based SoC. The first part is about the performance (memory space/CPU time) difference between SMP and non-SMP Linux kernels on a uniprocessor system. Does any difference exist?
The second part is about the use of spinlocks. AFAIK spinlocks are no-ops on a uniprocessor. Since there is only one CPU and only one process runs on it at a time, there is no other process to busy-loop against. So for synchronization I just need to disable interrupts to protect my critical section. Is this understanding correct?
Ignore portability of drivers factor for this discussion.
A large amount of synchronisation code in the kernel compiles away to almost nothing in uni-processor kernels, which matches the behaviour you describe. Performance of an n-way system is definitely not n times that of a single CPU, and the scaling gets worse as the number of CPUs increases.
You should continue to write your driver using the synchronisation mechanisms for SMP systems, safe in the knowledge that you'll get the correct single-processor behaviour when the kernel is configured for uni-processor.
Disabling interrupts globally is like taking a sledge-hammer to a nut - maybe just disabling pre-emption on the current CPU is enough - which the spinlock does even on uni-processor systems.
If you've not already done so, take a look at Chapter 5 of Linux Device Drivers 3rd Edition - there are a variety of spinlock options depending on the circumstance.
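How the same lock call can cost almost nothing on a uni-processor build can be sketched in user space. This is an illustration under stated assumptions, not the kernel's implementation: `MODEL_SMP` stands in for the kernel's CONFIG_SMP option, `model_preempt_count` stands in for the preemption counter, and every `model_*` name is hypothetical. The UP branch corresponds to a preemptible uni-processor kernel, where the lock only disables preemption.

```c
#include <stdatomic.h>

/* Hypothetical switch standing in for the kernel's CONFIG_SMP option. */
#ifndef MODEL_SMP
#define MODEL_SMP 0
#endif

/* Hypothetical stand-in for the kernel's preemption counter. */
static int model_preempt_count;

typedef struct {
#if MODEL_SMP
    atomic_flag locked;         /* SMP: a real lock word other CPUs can see */
#else
    char unused;                /* UP: no lock word needed at all */
#endif
} model_spinlock_t;

#if MODEL_SMP
#define MODEL_SPINLOCK_INIT { ATOMIC_FLAG_INIT }
#else
#define MODEL_SPINLOCK_INIT { 0 }
#endif

static void model_spin_lock(model_spinlock_t *l)
{
    model_preempt_count++;      /* "disable preemption" in both builds */
#if MODEL_SMP
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;                       /* another CPU holds the lock: spin */
#else
    (void)l;                    /* UP: nobody else can hold it, nothing to do */
#endif
}

static void model_spin_unlock(model_spinlock_t *l)
{
#if MODEL_SMP
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
#else
    (void)l;
#endif
    model_preempt_count--;      /* "re-enable preemption" */
}

/* Returns the preemption count observed inside the critical section. */
int model_count_inside_section(void)
{
    model_spinlock_t lock = MODEL_SPINLOCK_INIT;
    int inside;

    model_spin_lock(&lock);
    inside = model_preempt_count;   /* critical section */
    model_spin_unlock(&lock);
    return inside;
}
```

The driver-side code (`model_spin_lock` / `model_spin_unlock`) is identical in both builds; only the macro bodies change, which is why writing for SMP costs nothing extra on UP.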
Since you are running a kernel compiled in SMP mode on a uniprocessor system, you clearly won't get any benefit in terms of speed or memory.
The Linux kernel uses locking extensively for synchronization. In uniprocessor mode there may theoretically be no need for locking, but there are many cases where it is still necessary, so use locking where it is needed, though not as much as in SMP.
You should also know that spinlocks are implemented by a set of macros; some prevent concurrency with IRQ handlers while others do not. Spinlocks are suitable for protecting small pieces of code that are intended to run for a very short time.
As for your second question: you are trying to replace spinlocks by disabling interrupts in uniprocessor mode, but in non-preemptible UP (uniprocessor) kernels the spinlock macros already evaluate to empty macros (or, for some of them, to macros that just disable/enable interrupts). UP kernels with preemption enabled use spinlocks to disable preemption. For most purposes, preemption can be thought of as equivalent to SMP. So in UP kernels the spinlocks you use will be just empty macros, and I think it is better to use them.
There are basically four techniques for synchronization: 1) non-preemptability, 2) atomic operations, 3) interrupt disabling, 4) locks.
Since you propose disabling interrupts for synchronization, remember that, because of its simplicity, interrupt disabling is used by kernel functions for implementing a critical region.
This technique does not always prevent kernel control path interleaving.
The critical section should be short, because any communication between the CPU and I/O is blocked while a kernel control path is running in this section.
So if you need synchronization on a uniprocessor, use a semaphore.

Why spinlocks don't work in uniprocessor (unicore) systems?

I know that spinlocks work by spinning, that different kernel paths exist, and that kernels are preemptive, so why don't spinlocks work on uniprocessor systems (for example, in Linux)?
If I understand your question, you're asking why spin locks are a bad idea on single core machines.
They should still work, but can be much more expensive than true thread-sleeping concurrency:
When you use a spinlock, you're essentially asserting that you don't think you will have to wait long. You are saying that you think it's better to keep the processor time slice with a busy loop than to pay the cost of sleeping your thread and context-switching to another thread or process. If you only have to wait a very short amount of time, you could sleep and be reawakened almost immediately, but the cost of going down and up is higher than just waiting around.
This is more likely to be OK on multi-core processors, since they have much better concurrency profiles than single core processors. On multi core processors, between loop iterations, some other thread may have taken care of your prerequisite. On single core processors, it's not possible that someone else could have helped you out - you've locked up the one and only core.
The problem here is that if you wait or sleep on a lock, you hint to the system that you don't have everything you need yet, so it should go do some other stuff and come back to you later. With a spin lock, you never tell the system this, so you lock it up waiting for something else to happen - but, meanwhile, you're holding up the whole system, so something else can't happen.
The nature of a spinlock is that it does not deschedule the process - instead it spins until the process acquires the lock.
On a uniprocessor, it will either immediately acquire the lock or it will spin forever - if the lock is contended, then there will never be an opportunity for the process which currently holds the resource to give it up. Spinlocks are only useful when another process can execute while one is spinning on the lock - which means multiprocessor systems.
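The "immediately acquire or spin forever" behaviour can be seen with a small user-space sketch. This is a minimal test-and-set lock built on C11 `atomic_flag`, not the kernel's spinlock; the `toy_*` names are hypothetical, and the spin is bounded by `max_spins` only so that the demonstration terminates, which a real spinlock would not do.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal test-and-set lock; hypothetical names, not the kernel's type. */
typedef struct {
    atomic_flag locked;
} toy_spinlock;

/*
 * Spin at most max_spins times. A real spinlock has no such limit;
 * the bound exists only so this demonstration can return.
 */
static bool toy_spin_lock_bounded(toy_spinlock *l, long max_spins)
{
    for (long i = 0; i < max_spins; i++) {
        if (!atomic_flag_test_and_set_explicit(&l->locked,
                                               memory_order_acquire))
            return true;        /* flag was clear: we now hold the lock */
    }
    return false;               /* still held after max_spins attempts */
}

static void toy_spin_unlock(toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Uncontended case: the very first attempt succeeds. */
bool uncontended_acquire(void)
{
    toy_spinlock l = { ATOMIC_FLAG_INIT };
    bool got = toy_spin_lock_bounded(&l, 1);

    if (got)
        toy_spin_unlock(&l);
    return got;
}

/*
 * Contended case on one CPU: the holder took the lock and cannot run
 * again while we spin, so no number of iterations ever succeeds.
 */
bool contended_acquire_single_cpu(long max_spins)
{
    toy_spinlock l = { ATOMIC_FLAG_INIT };

    (void)toy_spin_lock_bounded(&l, 1);   /* the "holder" takes the lock */
    return toy_spin_lock_bounded(&l, max_spins);
}
```

Because nothing else runs between iterations in this single-threaded model, `contended_acquire_single_cpu()` fails however large `max_spins` is, which mirrors a non-preemptible uniprocessor where the holder never gets the CPU back.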
there are different versions of spinlock:
spin_lock_irqsave(&xxx_lock, flags);
... critical section here ..
spin_unlock_irqrestore(&xxx_lock, flags);
On a uniprocessor, spin_lock_irqsave() should be used when data needs to be shared between process context and interrupt context, because in this case IRQs also get disabled. spin_lock_irqsave() works under all circumstances, but partly because it is so safe it is also fairly slow.
However, when data only needs to be protected across different CPUs, it is better to use the versions below; these are cheaper because IRQs don't get disabled:
spin_lock(&lock);
...
spin_unlock(&lock);
In uniprocessor systems calling spin_lock_irqsave(&xxx_lock, flags); has the same effect as disabling interrupts which will provide the needed interrupt concurrency protection without unneeded SMP protection. However, in multiprocessor systems this covers both interrupt and SMP concurrency issues.
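The `flags` argument in these calls is what distinguishes the irqsave/irqrestore pair from a bare cli/sti pair: the previous interrupt state is remembered and put back, rather than interrupts being unconditionally re-enabled. The following is a toy model of a single CPU's interrupt-enable flag, with hypothetical `toy_*` names, showing why that matters when sections nest.

```c
#include <stdbool.h>

/* Toy model of one CPU's interrupt-enable flag; not kernel code. */
static bool toy_irqs_enabled = true;

static void toy_cli(void) { toy_irqs_enabled = false; }
static void toy_sti(void) { toy_irqs_enabled = true; }

/* irqsave analogue: remember the old state, then disable. */
static bool toy_irq_save(void)
{
    bool flags = toy_irqs_enabled;

    toy_irqs_enabled = false;
    return flags;
}

/* irqrestore analogue: put back whatever was saved. */
static void toy_irq_restore(bool flags) { toy_irqs_enabled = flags; }

/*
 * Nested cli/sti: returns the interrupt state the outer section sees
 * after the inner section has finished.
 */
bool outer_state_after_inner_sti(void)
{
    bool seen;

    toy_irqs_enabled = true;
    toy_cli();                  /* outer section begins */
    toy_cli();                  /* inner section begins */
    toy_sti();                  /* inner section ends: interrupts are back on */
    seen = toy_irqs_enabled;    /* outer section is now unprotected */
    toy_sti();                  /* outer section ends */
    return seen;
}

/* Nested save/restore: same question, using the flags-based pair. */
bool outer_state_after_inner_restore(void)
{
    bool outer, inner, seen;

    toy_irqs_enabled = true;
    outer = toy_irq_save();     /* outer section begins, saved state: on */
    inner = toy_irq_save();     /* inner section begins, saved state: off */
    toy_irq_restore(inner);     /* inner section ends: still off */
    seen = toy_irqs_enabled;    /* outer section is still protected */
    toy_irq_restore(outer);     /* outer section ends: back on */
    return seen;
}
```

With cli/sti the outer section finds interrupts enabled as soon as the inner one ends; with save/restore they stay disabled until the outermost restore.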
Spinlocks are, by their nature, intended for use on multiprocessor systems, although a uniprocessor workstation running a preemptive kernel behaves like SMP, as far as concurrency is concerned. If a nonpreemptive uniprocessor system ever went into a spin on a lock, it would spin forever; no other thread would ever be able to obtain the CPU to release the lock. For this reason, spinlock operations on uniprocessor systems without preemption enabled are optimized to do nothing, with the exception of the ones that change the IRQ masking status. Because of preemption, even if you never expect your code to run on an SMP system, you still need to implement proper locking.
Ref: Linux Device Drivers
by Jonathan Corbet, Alessandro Rubini, Greg Kroah-Hartman
The following two paragraphs from Operating Systems: Three Easy Pieces might be helpful:
For spin locks, in the single CPU case, performance overheads can be quite painful; imagine the case where the thread holding the lock is pre-empted within a critical section. The scheduler might then run every other thread (imagine there are N − 1 others), each of which tries to acquire the lock. In this case, each of those threads will spin for the duration of a time slice before giving up the CPU, a waste of CPU cycles.
However, on multiple CPUs, spin locks work reasonably well (if the number of threads roughly equals the number of CPUs). The thinking goes as follows: imagine Thread A on CPU 1 and Thread B on CPU 2, both contending for a lock. If Thread A (CPU 1) grabs the lock, and then Thread B tries to, B will spin (on CPU 2). However, presumably the critical section is short, and thus soon the lock becomes available, and is acquired by Thread B. Spinning to wait for a lock held on another processor doesn't waste many cycles in this case, and thus can be effective.
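The single-CPU worst case in the first quoted paragraph is simple arithmetic: each of the N − 1 waiting threads burns one full time slice before the holder runs again. A small sketch, where the function name and the millisecond unit are assumptions made for illustration:

```c
/*
 * Worst case from the quoted passage: the lock holder is pre-empted and
 * each of the other (n_threads - 1) threads spins for one full time
 * slice before the holder is scheduled again.
 */
long wasted_spin_ms(long n_threads, long slice_ms)
{
    if (n_threads < 2)
        return 0;               /* nobody else is there to spin */
    return (n_threads - 1) * slice_ms;
}
```

For example, with 5 threads and an assumed 10 ms time slice, `wasted_spin_ms(5, 10)` gives 40 ms of pure spinning before the holder can make progress.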

What happens when kernel code is interrupted?

I am reading Operating System Concepts (Silberschatz,Galvin,Gagne), 6th edition, chapter 20.
I understand that Linux kernel code is non-preemptible (before version 2.6). But it can be interrupted by hardware interrupts. What happens if the kernel was in the middle of a critical section when the interrupt occurred, and the interrupt handler executed the same critical section too?
From what I read in the book:
The second protection scheme that
Linux uses applies to critical
sections that occur in the interrupt service routines. The basic tool is
the processor interrupt control
hardware...
OK, this scheme is used when an ISR has a critical section. But it will only disable further interrupts. What about the kernel code which was interrupted by this interrupt in the first place?
But it will only disable further interrupts. What about the kernel code which was interrupted
by this interrupt in the first place?
If the interrupt handler and other kernel code need access to the same data, you need to protect against that, which is usually done with a spinlock. Great care must be taken: you don't want to introduce a deadlock, and you must ensure such a spinlock is not held for too long. For spinlocks used in a hardware interrupt handler, you have to disable interrupts on that processor whilst holding the lock, which in Linux is done with the function spin_lock_irqsave().
(whilst a bit outdated, you can read about the concept here)
The kernel code which was interrupted by this interrupt in the first place gets interrupted.
This is why writing interrupt handlers is such a painful task: they can't do anything that would endanger the correctness of the main stream of execution.
For example, the way Apple's xnu kernel handles most kinds of device interrupts is to capture the information in the interrupt to a record in memory, add that record to a queue, and then resume normal execution; the kernel then picks up interrupts from the queue some time later (in the scheduler's main loop, I assume). That way, the interrupt handler only interacts with the rest of the system through the interrupt queue, and there is little danger of it causing trouble.
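The record-and-defer pattern described above can be sketched as a small ring buffer. This is not xnu's code: the `irq_event` layout, the `evq_*` names and the capacity are all hypothetical, and the sketch is single-threaded, so it leaves out the atomics and memory barriers a real handler running concurrently with the consumer would need.

```c
#include <stdbool.h>
#include <stddef.h>

#define EVQ_SIZE 8              /* hypothetical capacity; one slot stays empty */

/* Hypothetical record of what an interrupt reported. */
struct irq_event {
    int source;
    int payload;
};

struct irq_queue {
    struct irq_event slots[EVQ_SIZE];
    size_t head;                /* next slot to read (consumer side) */
    size_t tail;                /* next slot to write (handler side) */
};

/* "Interrupt handler" side: only records the event, does no real work. */
bool evq_push(struct irq_queue *q, struct irq_event ev)
{
    size_t next = (q->tail + 1) % EVQ_SIZE;

    if (next == q->head)
        return false;           /* queue full: the event is dropped */
    q->slots[q->tail] = ev;
    q->tail = next;
    return true;
}

/* Deferred side: called later from normal (non-interrupt) context. */
bool evq_pop(struct irq_queue *q, struct irq_event *out)
{
    if (q->head == q->tail)
        return false;           /* queue empty */
    *out = q->slots[q->head];
    q->head = (q->head + 1) % EVQ_SIZE;
    return true;
}

/* Drains the queue, standing in for the later processing pass. */
int evq_drain_sum(struct irq_queue *q)
{
    struct irq_event ev;
    int sum = 0;

    while (evq_pop(q, &ev))
        sum += ev.payload;      /* the "real work" happens here, not in the handler */
    return sum;
}
```

The handler side touches only `tail` and the slot it writes, and the consumer side touches only `head` and the slot it reads, which is what keeps the interaction between the two narrow.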
There is a bit of middle ground; on many architectures (including the x86), it is possible for privileged code to mask interrupts, so that they won't cause interruption. That can be used to protect passages of code which really shouldn't be interrupted. However, those architectures typically also have non-maskable interrupts, which ignore the masking, so interruption still has to be considered.
