Linux kernel bottom half priority VS kthread priority - linux-kernel

I have a kernel module with one running kthread and one threaded IRQ whose bottom half is implemented neither as a workqueue nor as a tasklet; I simply do all the processing I need directly in it. Both the kthread and the bottom half have to access the SPI bus, so a mutex is used to protect the bus access.
The problem I'm facing is that the kthread stays blocked waiting for the mutex, even though the bottom-half thread keeps locking and unlocking it repeatedly over a fairly long period. It seems as if the bottom half has a much higher priority than my kthread. Your help is appreciated.
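For reference, a minimal sketch of the setup being described (names are hypothetical, error handling omitted):

#include <linux/interrupt.h>
#include <linux/kthread.h>
#include <linux/mutex.h>
#include <linux/delay.h>

static DEFINE_MUTEX(spi_bus_lock);            /* protects the shared SPI bus access */

/* threaded-IRQ bottom half: all processing done directly here */
static irqreturn_t demo_irq_thread_fn(int irq, void *data)
{
    mutex_lock(&spi_bus_lock);
    /* ... long sequence of SPI transactions ... */
    mutex_unlock(&spi_bus_lock);
    return IRQ_HANDLED;
}

/* kthread that also needs the SPI bus */
static int demo_kthread_fn(void *data)
{
    while (!kthread_should_stop()) {
        mutex_lock(&spi_bus_lock);            /* observed to stay blocked here */
        /* ... SPI access ... */
        mutex_unlock(&spi_bus_lock);
        msleep(10);
    }
    return 0;
}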

Related

Interrupt Nested, Sequencing

I am reading the Linux kernel documents and I have these questions (x86_64 arch):
When the PIC sends an interrupt to the CPU, will that specific interrupt be disabled until the acknowledgement comes back from the CPU? If that is the case, why do we need local_irq_disable() in the ISR?
Related to the above: if the CPU is processing an interrupt in its ISR and the same device sends three more interrupts to the CPU, how is this handled? Will they be serialised in some buffer (if yes, where)?
Does the x86 architecture support priority-based interrupts?
The PIC is a very old interrupt controller; today interrupts are mostly delivered through MSIs or through the APIC hierarchy.
The matter is actually more complicated with the IRQ routing, virtualization and so on.
I won't discuss these.
The interrupt priority concept still exists (though a bit simplified) and it works like this:
When an interrupt request is received by the interrupt controller, all the lower priority interrupts are masked and the interrupt is sent to the CPU.
What actually happens is that interrupts are ordered by their request number, with lower numbers having higher priority (0 has more priority than 1).
When any request line is toggled or asserted, the interrupt controller will scan the status of each request line from the number 0 up to the last one.
It stops as soon as it finds a line that is asserted or that is marked (with the use of a secondary register) as in processing.
This way, if request line 2 is asserted first and request line 4 afterwards, the interrupt controller won't serve the latter request until the first one is "done", because line 2 stops the scanning.
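The behaviour just described can be modelled with a toy scan loop (purely illustrative, not real PIC code):

/* toy model of the fixed-priority scan: irr = request bits, isr = in-service bits */
int pick_next_irq(unsigned char irr, unsigned char isr)
{
    for (int line = 0; line < 8; line++) {
        if (isr & (1 << line))   /* a higher-priority IRQ is still in service: stop */
            return -1;
        if (irr & (1 << line))   /* first asserted line wins */
            return line;
    }
    return -1;                   /* nothing pending */
}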
So local_irq_disable may be used to disable all interrupts, including those with higher priority.
AFAIK, this function should be rarely used today. It is a very simple, but inefficient, way to make sure no other code can run (potentially altering common structures).
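When it is needed at all, the usual pattern saves and restores the interrupt state rather than disabling unconditionally (illustrative fragment):

unsigned long flags;

local_irq_save(flags);       /* mask interrupts on this CPU, remember the old state */
/* ... touch data also used by an interrupt handler on this CPU ... */
local_irq_restore(flags);    /* put the interrupt state back as it was */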
In general, there needs to be some coordination between the ISR and the device to avoid losing interrupts.
Some devices require the software to write to a special register to let them know it is able to process the next interrupt. This way the device may implement an internal queue of notifications.
The keyboard controller works kind of like this: if you don't read the scancodes fast enough, you simply lose them.
If the device fires interrupts at will and too frequently, the interrupt controller can buffer the requests so they don't get lost.
Both the PIC and the LAPIC can buffer at most one request while another one is in progress (they basically use the fact that they have a request register and an in-progress register for each interrupt).
So in the case of three interrupts in a row, one is surely lost. If the interrupt controller couldn't deliver the first one to the CPU because a higher priority interrupt was in progress, then two will be lost.
In general, the software doesn't expect the interrupt controller to buffer any requests.
So you shouldn't find code that relies on this (after all, the only numbers in CS are 0, 1, and infinity; 2 doesn't exist as far as the software is concerned).
The x86, as a CPU core, doesn't support priorities when dealing with interrupts. If interrupts are not masked and a hardware interrupt arrives, it is served. It's up to the software and the interrupt controller to prioritize interrupts.
The PIC and LAPIC (and so the MSIs and the IOAPIC) both give interrupts a priority, so for all practical purposes the x86 supports a priority-based interrupt mechanism.
Note however that giving interrupts a priority is not necessarily good: it's hard to tell if a network packet is more important than a keystroke.
So Linux has the guideline to do as little work as possible in the ISR and instead to queue the rest of the work to be processed asynchronously out of the ISR.
This may mean simply returning from the ISR after handing the rest of the work to a work function, so that other interrupts are not blocked.
In the vast majority of cases, only a small portion of code needs to be run in a critical section, a condition where no other interrupt should occur, so the general approach is to return the EOI to the interrupt controller and unmask the interrupt in the CPU as early as possible and write the code so that it can be interrupted.
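One common way to express that split in Linux is a short hard-IRQ handler that defers the heavy work to a work item; the sketch below is illustrative, with made-up names:

#include <linux/interrupt.h>
#include <linux/workqueue.h>

static struct work_struct demo_work;     /* INIT_WORK(&demo_work, demo_work_fn) in init code */

static void demo_work_fn(struct work_struct *work)
{
    /* the bulk of the processing runs here, outside the hard-IRQ path */
}

static irqreturn_t demo_irq_handler(int irq, void *dev_id)
{
    /* acknowledge the device, read only what must be read right now */
    schedule_work(&demo_work);           /* defer the rest */
    return IRQ_HANDLED;
}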
If one does need to keep other interrupts away for performance reasons, the approach usually taken is to split the interrupts across different cores so the load stays within the required metrics.
Before multi-core systems were widespread, having too many interrupts would effectively slow down some operations.
I guess it would be possible to load a driver that denied other interrupts for the sake of its own performance, but that is a form of QoS/real-time requirement that is up to the user to settle.

Control Block Processes

Whenever a process is moved into the waiting state, I understand that the CPU moves on to another process. But if a process in the waiting state still needs to issue a request to another I/O resource, doesn't that computation require processing? I'm assuming there is a small part of the processor dedicated to helping with the computation of the I/O request, to move data back and forth?
I hope this question makes sense lol.
IO operations are actually tasks for peripheral devices to do some work. Usually you set up the task by writing data to special areas of memory that belong to the device. The device monitors changes in that small area and starts executing the task. So the CPU does not need to do anything while the operation is in progress and can switch to another program. When the IO is completed, an interrupt is usually triggered. This is a special hardware mechanism which pauses the currently executing program at an arbitrary place and switches to a special subroutine, which decides what to do next. There are other designs as well; for example, a device may set a special flag somewhere in its memory region and the OS must check it from time to time.
The problem is that these IO operations are usually quite small, such as sending one byte over a COM port, so the CPU would be interrupted far too often. You can't achieve high speed that way. This is where DMA comes in handy: it is a special coprocessor (or part of the peripheral device) which has direct access to RAM and can feed big blocks of memory into devices. So it can process megabytes of data without interrupting the CPU.
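As a toy illustration of the idea (register layout and names are entirely made up), kicking off an IO operation is essentially a write to a memory-mapped device register, after which the CPU is free to do other work:

#include <linux/io.h>

#define CMD_REG    0x00          /* invented: write a command here to start the device */
#define STATUS_REG 0x04          /* invented: device sets a "done" bit here */

static void __iomem *dev_base;   /* mapped earlier with ioremap() */

static void start_io(u32 cmd)
{
    writel(cmd, dev_base + CMD_REG);   /* hand the task to the device */
    /* the CPU can now run other code; completion is reported later,
       either by an interrupt or by polling the STATUS_REG "done" bit */
}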

Why spinlocks don't work in uniprocessor (unicore) systems?

I know that spinlocks work by spinning, that different kernel paths exist, and that kernels are preemptive, so why don't spinlocks work in uniprocessor systems (for example, in Linux)?
If I understand your question, you're asking why spin locks are a bad idea on single core machines.
They should still work, but can be much more expensive than true thread-sleeping concurrency:
When you use a spinlock, you're essentially asserting that you don't think you will have to wait long. You are saying that you think it's better to hold on to your processor time slice with a busy loop than to pay the cost of sleeping your thread and context-switching to another thread or process. If you only have to wait a very short time, you can sleep and be reawakened almost immediately, but the cost of going to sleep and waking up again is more expensive than just waiting around.
This is more likely to be OK on multi-core processors, since they have much better concurrency profiles than single core processors. On multi core processors, between loop iterations, some other thread may have taken care of your prerequisite. On single core processors, it's not possible that someone else could have helped you out - you've locked up the one and only core.
The problem here is that if you wait or sleep on a lock, you hint to the system that you don't have everything you need yet, so it should go do some other stuff and come back to you later. With a spin lock, you never tell the system this, so you lock it up waiting for something else to happen - but, meanwhile, you're holding up the whole system, so something else can't happen.
The nature of a spinlock is that it does not deschedule the process - instead it spins until the process acquires the lock.
On a uniprocessor, it will either immediately acquire the lock or it will spin forever - if the lock is contended, then there will never be an opportunity for the process which currently holds the resource to give it up. Spinlocks are only useful when another process can execute while one is spinning on the lock - which means multiprocessor systems.
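A stripped-down, toy spinlock (not the kernel's implementation) makes this visible: the acquire path is nothing but a busy loop, so on a non-preemptive uniprocessor the holder can never run to release the lock while another task spins.

typedef struct { volatile int locked; } toy_spinlock_t;

static void toy_spin_lock(toy_spinlock_t *l)
{
    /* burn CPU until the holder clears the flag; on a non-preemptive
       uniprocessor this loop never ends if the lock is already held */
    while (__sync_lock_test_and_set(&l->locked, 1))
        ;
}

static void toy_spin_unlock(toy_spinlock_t *l)
{
    __sync_lock_release(&l->locked);
}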
There are different versions of spinlocks:
spin_lock_irqsave(&xxx_lock, flags);
... critical section here ..
spin_unlock_irqrestore(&xxx_lock, flags);
On a uniprocessor, spin_lock_irqsave() should be used when data needs to be shared between process context and interrupt context, because in this case IRQs also get disabled. spin_lock_irqsave() works under all circumstances, but partly because it is safe it is also fairly slow.
However, if data only needs to be protected across different CPUs, then it is better to use the versions below; these are cheaper because IRQs don't get disabled:
spin_lock(&lock);
...
spin_unlock(&lock);
In uniprocessor systems, calling spin_lock_irqsave(&xxx_lock, flags); has the same effect as disabling interrupts, which provides the needed interrupt concurrency protection without unneeded SMP protection. In multiprocessor systems, however, it covers both interrupt and SMP concurrency issues.
Spinlocks are, by their nature, intended for use on multiprocessor systems, although a uniprocessor workstation running a preemptive kernel behaves like SMP, as far as concurrency is concerned. If a nonpreemptive uniprocessor system ever went into a spin on a lock, it would spin forever; no other thread would ever be able to obtain the CPU to release the lock. For this reason, spinlock operations on uniprocessor systems without preemption enabled are optimized to do nothing, with the exception of the ones that change the IRQ masking status. Because of preemption, even if you never expect your code to run on an SMP system, you still need to implement proper locking.
Ref: Linux Device Drivers, by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman
The following two paragraphs from Operating Systems: Three Easy Pieces might be helpful:
For spin locks, in the single CPU case, performance overheads can be quite painful; imagine the case where the thread holding the lock is pre-empted within a critical section. The scheduler might then run every other thread (imagine there are N − 1 others), each of which tries to acquire the lock. In this case, each of those threads will spin for the duration of a time slice before giving up the CPU, a waste of CPU cycles.
However, on multiple CPUs, spin locks work reasonably well (if the number of threads roughly equals the number of CPUs). The thinking goes as follows: imagine Thread A on CPU 1 and Thread B on CPU 2, both contending for a lock. If Thread A (CPU 1) grabs the lock, and then Thread B tries to, B will spin (on CPU 2). However, presumably the critical section is short, and thus soon the lock becomes available, and is acquired by Thread B. Spinning to wait for a lock held on another processor doesn't waste many cycles in this case, and thus can be effective.

How to reserve a core for one thread on windows?

I am working on a very time sensitive application which polls a region of shared memory taking action when it detects a change has occurred. Changes are rare but I need to minimize the time from change to action. Given the infrequency of changes I think the CPU cache is getting cold. Is there a way to reserve a core for my polling thread so that it does not have to compete with other threads for either cache or CPU?
Thread affinity alone (SetThreadAffinityMask) will not be enough. It does not reserve a CPU core; it does the opposite: it binds the thread to only the cores that you specify (which is not the same thing!).
By constraining the CPU affinity, you reduce the likelihood that your thread will run. If another thread with higher priority runs on the same core, your thread will not be scheduled until that other thread is done (this is how Windows schedules threads).
Without constraining affinity, your thread has a chance of being migrated to another core (taking the last time it was run as metric for that decision). Thread migration is undesirable if it happens often and soon after the thread has run (or while it is running) but it is a harmless, beneficial thing if a couple of dozen milliseconds have passed since it was last scheduled (caches will have been overwritten by then anyway).
You can "kind of" assure that your thread will run by giving it a higher priority class (no guarantee, but high likelihood). If you then use SetThreadAffinityMask as well, you have a reasonable chance that the cache is always warm on most common desktop CPUs (which luckily are normally VIPT and PIPT). For the TLB, you will probably be less lucky, but there's nothing you can do about it.
The problem with a high priority thread is that it will starve other threads because scheduling is implemented so it serves higher priority classes first, and as long as these are not satisfied, lower classes get zero. So, the solution in this case must be to block. Otherwise, you may impair the system in an unfavorable way.
Try this:
create a semaphore and share it with the other process
set priority to THREAD_PRIORITY_TIME_CRITICAL
block on the semaphore
in the other process, after writing data, call SignalObjectAndWait on the semaphore with a timeout of 1 (or even zero timeout)
if you want, you can experiment binding them both to the same core
This gives you a thread that will be the first (or among the first) to get CPU time when it is ready, but that is normally not running at all.
When the writer thread calls SignalObjectAndWait, it atomically signals and blocks (even if it waits for "zero time" that is enough to reschedule). The other thread will wake from the Semaphore and do its work. Thanks to its high priority, it will not be interrupted by other "normal" (that is, non-realtime) threads. It will keep hogging CPU time until done, and then block again on the semaphore. At this point, SignalObjectAndWait returns.
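A rough sketch of the waiting side of that recipe (error handling omitted; handle and name choices are illustrative only):

#include <windows.h>

static HANDLE sem;   /* e.g. CreateSemaphoreW(NULL, 0, 1, L"Local\\demo_change_sem") */

static DWORD WINAPI poller(LPVOID arg)
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    /* optional: SetThreadAffinityMask(GetCurrentThread(), 1 << 2); */
    for (;;) {
        WaitForSingleObject(sem, INFINITE);   /* blocked: uses no CPU at all */
        /* a change was signalled: react here as quickly as possible */
    }
    return 0;
}

/* writer process, right after updating the shared memory:
       SignalObjectAndWait(sem, hSomethingToWaitOn, 0, FALSE);
   which signals and blocks atomically, as described above */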
Using the Task Manager, you can set the "affinity" of processes.
You would have to set the affinity of your time-critical app to core 4, and the affinity of all the other processes to cores 1, 2, and 3. Assuming four cores of course.
You could call the SetProcessAffinityMask on every process but yours with a mask that excludes just the core that will "belong" to your process, and use it on your process to set it to run just on this core (or, even better, SetThreadAffinityMask just on the thread that does the time-critical task).
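On a four-core machine the mask juggling could look roughly like this (sketch only; the handles are placeholders and error checking is omitted):

DWORD_PTR reserved = 1 << 3;             /* core 4 "belongs" to the time-critical thread */
DWORD_PTR others   = 0x0F & ~reserved;   /* cores 1-3 for everything else */

SetProcessAffinityMask(hSomeOtherProcess, others);       /* repeat for every other process */
SetThreadAffinityMask(hTimeCriticalThread, reserved);    /* pin just the critical thread */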
Given the infrequency of changes I think the CPU cache is getting cold.
That sounds very strange.
Let's assume your polling thread and the writing thread are on different cores.
The polling thread will be reading the shared memory address and so will be caching the data. That cache line is probably marked as exclusive. Then the writing thread finally writes; first it reads the cache line of memory in (so that line is now marked as shared on both cores) and then it writes. Writing causes the polling thread's CPU cache line to be marked as invalid. The polling thread then comes to read again; if it reads while the writing thread still has the data cached, it will read from the second core's cache, invalidating its cache line and taking ownership for itself. There's a lot of bus-traffic overhead to do this.
Another issue is that the writing thread, if it doesn't write often, will almost certainly lose the TLB entry for the page with the shared memory address. Recalculating the physical address is a long, slow process. Since the polling thread polls often, that page is possibly always in that core's TLB; and in that sense you might well do better, in latency terms, to have both threads on the same core. (Although if they're both compute-intensive, they might interfere destructively and that cost could be much higher - I can't know, as I don't know what the threads are doing.)
One thing you could do is use a hyperthread on the writing thread core; if you know early on you're going to write, get the hyperthread to read the shared memory address. This will load the TLB and cache while the writing thread is still busy computing, giving you parallelism.
The Win32 function SetThreadAffinityMask() is what you are looking for.

Usage of spinlock and cli together

I recently downloaded the Linux source from http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.34.1.tar.bz2 . I came across the paragraph below in the file spinlocks.txt in the linux-2.6.34.1/Documentation folder.
" it does mean that if you have some code that does
cli();
.. critical section ..
sti();
and another sequence that does
spin_lock_irqsave(flags);
.. critical section ..
spin_unlock_irqrestore(flags);
then they are NOT mutually exclusive, and the critical regions can happen
at the same time on two different CPU's. That's fine per se, but the
critical regions had better be critical for different things (ie they
can't stomp on each other). "
How can they stomp on each other if one part of the code uses cli()/sti() and another part of the same code uses spin_lock_irqsave(flags)/spin_unlock_irqrestore(flags)?
The key part here is "on two different CPUs". Some background:
Historically on uni-processor (UP) systems the only source of concurrency was hardware interrupts. It was enough to cli/sti around the critical section to prevent an IRQ handler from messing things up.
Then there was the giant-lock design, where the kernel would effectively run on a single CPU and only one process could be in the kernel at a time (that is what the giant lock was for). Again, disabling interrupts was enough to protect the kernel from itself.
On full SMP systems, where multiple threads can be active in the kernel at the same time and interrupts can be delivered to pretty much any CPU, it's no longer enough to disable interrupts on a single processor, or to only grab a single lock. Both are required: disabling interrupts protects against the IRQ handler on the same CPU, and holding a lock protects against other threads entering the same critical section on a different CPU. This is exactly why spin_lock_irqsave() and spin_unlock_irqrestore() were invented.
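A hedged sketch of the resulting pattern (data and handler names are made up): the interrupt handler takes the plain lock, while process-context code takes the same lock with local interrupts disabled, covering both sources of concurrency.

#include <linux/interrupt.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);
static unsigned int demo_shared;         /* touched by both contexts */

static irqreturn_t demo_isr(int irq, void *dev_id)
{
    spin_lock(&demo_lock);               /* local IRQs are already off in the handler */
    demo_shared++;
    spin_unlock(&demo_lock);
    return IRQ_HANDLED;
}

static void demo_process_context(void)
{
    unsigned long flags;

    spin_lock_irqsave(&demo_lock, flags);    /* take the lock and disable local IRQs */
    demo_shared = 0;
    spin_unlock_irqrestore(&demo_lock, flags);
}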
