I'm currently about to write aio support for a kernel module I wrote for communication with a fast device (125 MB/s) in an embedded environment. To get started, I wanted to look at a few examples on how to approach this, so I ran grep -ri aio_write . in a 3.13 kernel.
However, upon further inspection, I noticed that most if not all fs drivers actually perform aio_write and aio_read synchronously. At most, they make use of vectored inputs to save on buffer allocation.
The only driver I was able to find, which actually supports asynchronous aio, is the gadget USB driver, drivers/usb/gadget/inode.c
Does this observation mean I'm likely not to benefit from writing my own aio routines?
Are libaio, librt considered good enough for my target bandwidth?
Is this feature too new or very controversial and likely to be gone in a few kernels' time?
I already noticed that pinning multiple user pages and DMA'ing them out to my peripheral is actually slower than just copy_from_user them to a PAGE_SIZE kmalloc'd buffer and loop a few times.
I fear that I might be putting a lot of time and effort into aio when it would actually end up slower than my simplistic read/write approach of copying user buffers over and DMA'ing them out synchronously and blocking until the DMA is done.
I'd be grateful for some experienced users' insights into this to help me evaluate aio.
I came across few articles talking about differences between Mutexes and Critical sections.
One of the major differences which I came across is , Mutexes run in kernel mode whereas Critical sections mainly run in user mode.
So if this is the case then arent the applications which use mutexes harmful for the system in case the application crashes?
Thanks.
Use Win32 Mutexes handles when you need to have a lock or synchronization across threads in different processes.
Use Win32 CRITICAL_SECTIONs when you need to have a lock between threads within the same process. It's cheaper as far as time and doesn't involve a kernel system call unless there is lock contention. Critical Section objects in Win32 can't span process boundaries anyway.
"Harmful" is the wrong word to use. More like "Win32 mutexes are slightly more expensive that Win32 Critical Sections in terms of performance". A running app that uses mutexes instead of critical sections won't likely hurt system performance. It will just run minutely slower. But depending on how often your lock is acquired and released, the difference may not even be measurable.
I forget the perf metrics I did a long time ago. The bottom line is that EnterCriticalSection and LeaveCriticalSection APIs are on the order of 10-100x faster than the equivalent usage of WaitForSingleObject and ReleaseMutex. (on the order of 1 microsecond vs 1 millisecond).
This is a two fold question that raised from my trivial observation that I am running a SMP enabled Linux on our ARM-Cortex 8 based SoC. First part is about performance (memory space/CPU time) difference between SMP and NON-SMP Linux kernel on a Uni processor system. Does any difference exits?
Second part is about use of Spinlock. AFAIK spinklock are noop in case uni-processor. Since there is only one CPU and only one process will be running on it (at a time ) there is no other process for busy-looping. So for synchronization I just need to disable interrupt for protecting my critical section. Is this understanding of mine correct?
Ignore portability of drivers factor for this discussion.
A large amount of synchronisation code in the kernel compiles way to almost nothing in uni-processor kernels which descries the behaviour you describe. Performance of n-way system is definitely not 2n - and gets worse as the number of CPUs.
You should continue to write your driver with using synchronisation mechanisms for SMP systems - safe in the knowledge that you'll get the correct single-processor case when the kernel is configured for uni-processor.
Disabling interrupts globally is like taking a sledge-hammer to a nut - maybe just disabling pre-emption on the current CPU is enough - which the spinlock does even on uni-processor systems.
If you've not already done so, take a look at Chapter 5 of Linux Device Drivers 3rd Edition - there are a variety of spinlock options depending on the circumstance.
As you have stated that you are running the linux kernel as compiled in SMP mode on Uni-processor system so it's clear that you'll not get any benefit in terms of speed & memory.
As the linux-kernel uses extensive locking for synchronization. But it Uni-Processor mode there may be no need of locking theoretically but there are many cases where its necessary so try to use Locking where its needed but not as much as in SMP.
but you should know it well that Spinlocks are implemented by set of macros, some prevent concurrency with IRQ handlers while the
other ones not.Spinlocks are suitable to protect small pieces of code which are intended to run
for a very short time.
As of your second question, you are trying to remove spinlocks by disabling interrupts for Uni-Processor mode but Spinlock macros are in non-preemptible UP(Uni-Processor) kernels evaluated to empty macros(or some of them to macros just disabling/enabling interrupts). UP kernels with
preemption enabled use spinlocks to disable preemption. For most purposes, pre-emption can be tought of as SMP equivalent. so in UP kernels if you use Spinlocks then they will be just empty macro & i think it will be better to use it.
there are basically four technique for synchronization as..1->Nonpreemptability,2->Atomic Operations,3->Interrupt Disabling,4->Locks.
but as you are saying to disable interrupt for synchronization then remember Because of its simplicity, interrupt disabling is used by kernel functions for implementing a critical region.
This technique does not always prevent kernel control path interleaving.
Critical section should be short because any communication between CPU and I/O is blocked while a kernel control path is running in this section.
so if you need synchronization in Uni-Processor then use semaphore.
I know that spinlocks work with spining, different kernel paths exist and Kernels are preemptive, so why spinlocks don't work in uniprocessor systems? (for example, in Linux)
If I understand your question, you're asking why spin locks are a bad idea on single core machines.
They should still work, but can be much more expensive than true thread-sleeping concurrency:
When you use a spinlock, you're essentially asserting that you don't think you will have to wait long. You are saying that you think it's better to maintain the processor time slice with a busy loop than the cost of sleeping your thread and context-shifting to another thread or process. If you have to wait a very short amount of time, you can sleep and be reawakened almost immediately, but the cost of going down and up is more expensive than just waiting around.
This is more likely to be OK on multi-core processors, since they have much better concurrency profiles than single core processors. On multi core processors, between loop iterations, some other thread may have taken care of your prerequisite. On single core processors, it's not possible that someone else could have helped you out - you've locked up the one and only core.
The problem here is that if you wait or sleep on a lock, you hint to the system that you don't have everything you need yet, so it should go do some other stuff and come back to you later. With a spin lock, you never tell the system this, so you lock it up waiting for something else to happen - but, meanwhile, you're holding up the whole system, so something else can't happen.
The nature of a spinlock is that it does not deschedule the process - instead it spins until the process acquires the lock.
On a uniprocessor, it will either immediately acquire the lock or it will spin forever - if the lock is contended, then there will never be an opportunity for the process which currently holds the resource to give it up. Spinlocks are only useful when another process can execute while one is spinning on the lock - which means multiprocessor systems.
there are different versions of spinlock:
spin_lock_irqsave(&xxx_lock, flags);
... critical section here ..
spin_unlock_irqrestore(&xxx_lock, flags);
In Uni processor spin_lock_irqsave() should be used when data needs to shared between process context and interrupt context, as in this case IRQ also gets disabled. spin_lock_irqsave() work under all circumstances, but partly because they are safe they are also fairly slow.
However, in case data needs to be protected across different CPUs then it is better to use below versions, these are cheaper ones as IRQs dont get disabled in this case:
spin_lock(&lock);
...
spin_unlock(&lock);
In uniprocessor systems calling spin_lock_irqsave(&xxx_lock, flags); has the same effect as disabling interrupts which will provide the needed interrupt concurrency protection without unneeded SMP protection. However, in multiprocessor systems this covers both interrupt and SMP concurrency issues.
Spinlocks are, by their nature, intended for use on multiprocessor systems, although a uniprocessor workstation running a preemptive kernel behaves like SMP, as far as concurrency is concerned. If a nonpreemptive uniprocessor system ever went into a spin on a lock, it would spin forever; no other thread would ever be able to obtain the CPU to release the lock. For this reason, spinlock operations on uniprocessor systems without preemption enabled are optimized to do nothing, with the exception of the ones that change the IRQ masking status. Because of preemption, even if you never expect your code to run on an SMP system, you still need to implement proper locking.
Ref:Linux device drivers
By Jonathan Corbet, Alessandro Rubini, Greg Kroah-Hartma
Find the following two paragraph in Operating System Three Easy Pieces that might be helpful:
For spin locks, in the single CPU case, performance overheads can be
quite painful; imagine the case where the thread holding the lock is
pre-empted within a critical section. The scheduler might then run
every other thread (imagine there are N − 1 others), each of which
tries to ac- quire the lock. In this case, each of those threads will
spin for the duration of a time slice before giving up the CPU, a
waste of CPU cycles.
However, on multiple CPUs, spin locks work
reasonably well (if the number of threads roughly equals the number of
CPUs). The thinking goes as follows: imagine Thread A on CPU 1 and
Thread B on CPU 2, both contending for a lock. If Thread A (CPU 1)
grabs the lock, and then Thread B tries to, B will spin (on CPU 2).
However, presumably the crit- ical section is short, and thus soon the
lock becomes available, and is ac- quired by Thread B. Spinning to
wait for a lock held on another processor doesn’t waste many cycles in
this case, and thus can be effective
How can I let threads in a kernel module communicate? I'm writing a kernel module and my architecture is going to use three threads that need to communicate. So far, my research has led me to believe the only way is using shared memory (declaring global variables) and locking mechanisms to synchronize reading/writing between the threads. There's rather scarce material on this out there.
Is there any other way I might take into consideration? What's the most common, the standard in the kernel code?
You don't say what operating system you're programming on. I'll assume Linux, which is the most common unix system.
There are several good books on Linux kernel programming. Linux Device Drivers is available online as well as on paper. Chapter 5 deals with concurrency; you can jump in directly to chapter 5 though it would be best to skim through at least chapters 1 and 3 first. Subsequent chapters have relevant sections as well (in particular wait queues are discussed in chapter 6).
The Linux kernel concurrency model is built on shared variables. There is a large range of synchronization methods: atomic integer variables, mutual exclusion locks (spinlocks for nonblocking critical sections, semaphores for blocking critical sections), reader-writer locks, condition variables, wait queues, …