Interrupt scheduling and handling in Linux - linux-kernel

Let's say we are getting 100 interrupts from a network device, 50 interrupts from USB, 25 interrupts from an SPI device, and 25 interrupts from I2C.
They arrive in the following repeating sequence:
5 Net - 4 USB - 2 SPI - 2 I2C, and then the same sequence repeats.
The top-level handler can dispatch a device-specific handler to service the interrupt.
Now the processor will interrupt the running task as soon as it gets the network device's interrupt. After completing the top half of the network device's interrupt handler, it has to execute the top halves of USB, SPI and I2C.
The same sequence will be followed after completing the first set. When will the interrupted task wake again? Does the interrupted task wait until all 100 interrupts are serviced by their respective device-specific handlers? How are interrupts distributed among cores on multi-core systems, given that hundreds of thousands of interrupts will have to be serviced?
As far as I know, while executing an interrupt handler the processor is in interrupt context, so there won't be any context switching. As the different ISRs will have to service hundreds of thousands of interrupts, will the processor always be in interrupt context?

When will the interrupted task wake again?
When interrupts are cleared and the scheduler decides to give this task processor time.
Does the interrupted task wait until all 100 interrupts are serviced by their respective device-specific handlers?
You described only four IRQ sources (some network device, USB, SPI, I2C). So if all IRQ lines are asserted and enabled, then the CPUs which handle these IRQs will switch to the specific interrupt handlers. If the interrupt is still asserted after the handler returns, the CPU servicing it will branch to the interrupt handler again and again until the interrupt is cleared. On a multi-CPU system with 5 CPUs, 4 may execute interrupt handlers for your devices simultaneously, while the remaining one executes your task. So your task may not be interrupted at all. Or it may wait forever for the CPU, on a single-CPU system, when the interrupt handler is badly written and never clears the IRQ line.
How are interrupts distributed among cores on multi-core systems, given that hundreds of thousands of interrupts will have to be serviced?
I think it is best explained here: multi-core CPU interrupts.
As the different ISRs will have to service hundreds of thousands of interrupts, will the processor always be in interrupt context?
It will stay in interrupt context for as long as the IRQ is enabled and keeps being triggered. If you need to, you can simply disable the IRQ line and return the CPU to the scheduler.
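A minimal sketch of that last idea, assuming a hypothetical driver (my_dev_irq and my_dev_handler are made-up names): the handler masks its own line so the CPU can go back to running tasks, and some deferred path re-enables it later.

    #include <linux/interrupt.h>

    /* Hypothetical driver: my_dev_irq would be filled in at probe time. */
    static unsigned int my_dev_irq;

    static irqreturn_t my_dev_handler(int irq, void *dev_id)
    {
        /* Stop this line from firing again; the _nosync variant is safe
         * to call from within the handler itself. */
        disable_irq_nosync(irq);

        /* ... acknowledge the device ... */

        /* Some deferred path (work item, timer, kernel thread) must call
         * enable_irq(my_dev_irq) later, or the line stays masked forever. */
        return IRQ_HANDLED;
    }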

Related

Interrupt Nested, Sequencing

I am reading the Linux kernel documents and I have these questions (x86_64 arch):
When the PIC sends an interrupt to the CPU, will that disable that specific interrupt until the acknowledgement comes from the CPU? If that is the case, why do we need to call local_irq_disable() in the ISR?
Related to the above question: say the CPU is processing an interrupt in its ISR and three more interrupts are sent by the same device to the CPU. How is that going to be handled? Will they be serialised in some buffer (if yes, where)?
Does the x86 architecture support priority-based interrupts?
The PIC is a very old interrupt controller, today interrupts are mostly delivered through MSI or through the APIC hierarchy.
The matter is actually more complicated with the IRQ routing, virtualization and so on.
I won't discuss these.
The interrupt priority concept still exists (though a bit simplified) and it works like this:
When an interrupt request is received by the interrupt controller, all the lower priority interrupts are masked and the interrupt is sent to the CPU.
What actually happens is that interrupts are ordered by their request number, with lower numbers having higher priority (0 has more priority than 1).
When any request line is toggled or asserted, the interrupt controller will scan the status of each request line from the number 0 up to the last one.
It stops as soon as it finds a line that is asserted or that is marked (with the use of a secondary register) as being in processing.
This way, if request line 2 is asserted first and then request line 4 is, the interrupt controller won't serve the latter request until the first one is "done", because line 2 stops the scanning.
So local_irq_disable may be used to disable all interrupts, including those with higher priority.
AFAIK, this function should be rarely used today. It is a very simple, but inefficient, way to make sure no other code can run (potentially altering common structures).
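For reference, a minimal sketch of the usual pattern, using local_irq_save()/local_irq_restore() rather than a bare local_irq_disable() so the previous interrupt state is preserved (shared_counter is just a made-up variable shared with an ISR on the same CPU):

    #include <linux/irqflags.h>

    static unsigned long shared_counter;    /* made-up data touched by an ISR too */

    static void bump_counter(void)
    {
        unsigned long flags;

        local_irq_save(flags);      /* mask interrupts on this CPU, remember old state */
        shared_counter++;           /* keep the critical section short; no sleeping    */
        local_irq_restore(flags);   /* restore the previous interrupt state            */
    }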
In general, there needs to be some coordination between the ISR and the device to avoid losing interrupts.
Some devices require the software to write to a special register to let them know it is able to process the next interrupt. This way the device may implement an internal queue of notifications.
The keyboard controller works kind of like this: if you don't read the scancodes fast enough, you simply lose them.
If the device fires interrupts at will and too frequently, the interrupt controller can buffer the requests so they don't get lost, but only to a very limited extent.
Both the PIC and the LAPIC can buffer at most one request while another one is in progress (they basically use the fact that they have a request register and an in-progress register for each interrupt).
So in the case of three interrupts in a row, one is surely lost. If the interrupt controller couldn't deliver the first one to the CPU because a higher priority interrupt was in progress, then two will be lost.
In general, the software doesn't expect the interrupt controller to buffer any request.
So you shouldn't find code that relies on this (after all, the only numbers in CS are 0, 1, and infinity; so 2 doesn't exist as far as the software is concerned).
The x86, as a CPU core, doesn't support priority when dealing with interrupts. If interrupts are not masked and a hardware interrupt arrives, it is served. It's up to the software and the interrupt controller to prioritize interrupts.
The PIC and LAPIC (and so the MSIs and the IOAPIC) both give interrupts a priority, so for all practical purposes the x86 supports a priority-based interrupt mechanism.
Note, however, that giving interrupts priority is not necessarily good; it's hard to tell whether a network packet is more important than a keystroke.
So Linux has the guideline of doing as little work as possible in the ISR and instead queueing the rest of the work to be processed asynchronously outside the ISR.
This may mean just returning from the ISR and deferring the rest to a work function, in order not to block other interrupts.
In the vast majority of cases, only a small portion of code needs to run in a critical section (a condition where no other interrupt should occur), so the general approach is to send the EOI to the interrupt controller and unmask interrupts in the CPU as early as possible, and to write the code so that it can be interrupted.
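As an illustration of that guideline, a hedged sketch (device and function names are made up) of a top half that only acknowledges the hardware and defers the real work to a workqueue:

    #include <linux/interrupt.h>
    #include <linux/workqueue.h>

    static struct work_struct my_work;          /* bottom half, runs in process context */

    static void my_work_fn(struct work_struct *work)
    {
        /* The bulk of the processing goes here; sleeping is allowed. */
    }

    static irqreturn_t my_top_half(int irq, void *dev_id)
    {
        /* Acknowledge/quiesce the device as quickly as possible, then defer. */
        schedule_work(&my_work);
        return IRQ_HANDLED;
    }

    /* Somewhere in probe/init (error handling omitted):
     *   INIT_WORK(&my_work, my_work_fn);
     *   request_irq(irq, my_top_half, 0, "mydev", dev);
     */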
In case one needs to stop other interrupts for performance reasons, the approach usually taken is to spread the interrupts across different cores so the load stays within the required metrics.
Before multi-core systems were widespread, having too many interrupts would effectively slow down some operations.
I guess it would be possible to load a driver that would starve other interrupts for its own performance, but that is a form of QoS/real-time requirement that is up to the user to settle.

How do different kernel subsystems share CPU time

Processes in userspace are scheduled by the kernel scheduler to get processor time, but how do the different kernel tasks get CPU time? I mean, when no userspace process requires CPU time (so the CPU is idle, executing NOP instructions) but some kernel subsystem needs to carry out some task regularly, are timers and other hardware and software interrupts the common methods of getting CPU time in kernel space?
It's pretty much the same scheduler. The only difference I can think of is that kernel code has much more control over execution flow. For example, there is a direct call to the scheduler, schedule().
Also, in the kernel you have three execution contexts: hardware interrupt, softirq/bh and process. In hard (and probably soft) interrupt context you can't sleep, so scheduling is not done while executing code in those contexts.
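One common way this shows up in driver code is memory allocation; a hedged sketch (the helper name is made up) that picks the allocation flags based on the current context:

    #include <linux/hardirq.h>
    #include <linux/slab.h>

    /* Hypothetical helper: allocate a buffer from any context.
     * In hardirq/softirq context we must not sleep, so GFP_ATOMIC is used;
     * in process context GFP_KERNEL may sleep and is preferred. */
    static void *alloc_buffer(size_t size)
    {
        gfp_t flags = in_interrupt() ? GFP_ATOMIC : GFP_KERNEL;

        return kmalloc(size, flags);
    }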

Why do we need Interrupt context?

I am having doubts about why exactly we need interrupt context. Everything tells me what its properties are, but nothing explains why we came up with this concept.
Another doubt related to the same concept: if we are not disabling interrupts in the interrupt handler, then what is the use of running this interrupt handler code in interrupt context?
The interrupt context is fundamentally different from the process context:
It is not associated with a process; a specific process does not serve interrupts, the kernel does. Even if a process is interrupted, it has no bearing on any parameters of the interrupt itself or on the routine that will serve it. It follows that, at the very least, interrupt context must be conceptually different from process context.
Additionally, if an interrupt were to be serviced in a process context, and (re-) scheduled some work at a later time, what context would that run in? The original process may not even exist at that later time. Ergo, we need some context which is independent from processes for a practical reason.
Interrupt handling must be fast; your interrupt handler has interrupted (d'oh) some other code. Significant work should be pushed outside the interrupt handler, onto the "bottom half" (see the sketch after this answer). It is unacceptable to block a process for work which is not even remotely its concern, either in user or in kernel space.
Disabling the interrupt line is something you can (actually could, before 2.6.36) request when registering your ISR. Recall that a handler can serve interrupts on multiple CPUs simultaneously, and can thus race with itself. Non-maskable interrupts (NMIs) cannot be disabled.
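One mechanism Linux offers for pushing work onto the bottom half is a threaded IRQ; a minimal, hypothetical sketch (names are made up) where the hard handler just acknowledges the device and the threaded handler does the heavy lifting in a sleepable context:

    #include <linux/interrupt.h>

    static irqreturn_t quick_check(int irq, void *dev_id)
    {
        /* Runs in hard interrupt context: ack the device and decide
         * whether the threaded handler needs to run. */
        return IRQ_WAKE_THREAD;
    }

    static irqreturn_t slow_work(int irq, void *dev_id)
    {
        /* Runs in a dedicated kernel thread: may sleep, take mutexes, etc. */
        return IRQ_HANDLED;
    }

    /* At probe time (error handling omitted):
     *   request_threaded_irq(irq, quick_check, slow_work, 0, "mydev", dev);
     */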
Why do we need Interrupt context?
First, what do we mean by interrupt context? A context is usually a state. There are two separate concepts of state.
CPU context
Every CPU architecture has a mechanism for handling interrupts. There may be a single interrupt vector called for every system interrupt, or the CPU/hardware may be capable of dispatching to a particular address based on the interrupt source. There are also mechanisms for masking/unmasking interrupts. Each interrupt may be masked individually, or there may be a global mask for the entire CPU(s). Finally, there is the actual CPU state. Some CPUs may have separate stacks, register sets and CPU modes, implying certain memory and other privileges. Your question is about Linux in general, which must handle all of these cases.
Linux context
Generally, all of the architectures have a separate kernel stack, process context (a la ps) and VM (virtual memory) context for each process. The VM has different privileges for user and kernel modes. In order for the kernel to run all the time, it must remain mapped for all processes on a device. A kernel thread is a special case that doesn't care so much about the VM, because it is privileged and can access all kernel memory. However, it does have a separate stack and process context. User registers are typically stored on the kernel stack when exceptions happen. Exceptions include at least page faults, system calls and interrupts. These items may nest. E.g., you may call write() from user space, and while the kernel is transferring a user buffer, it may page fault to read some swapped-out user space data. The page fault may in turn have to service an interrupt.
Interrupt recursion
Linux generally wants you to leave interrupts masked, as the VM, the exceptions and process management (context and context switching) have to work together. In order to keep things simple for the VM, the kernel stack and process context are generally rooted in a single 4k (or 8k) area which is a single VM page. This page is always mapped. Typically, all CPUs will switch from interrupt mode to system mode when servicing an interrupt and use the same kernel stack as all other exceptions. The stack is small, so allowing recursion (and large stack allocations) can blow up the stack, resulting in stack overflows at the kernel level. This is bad.
Atomicity
Many kernel structures need to stay consistent over multiple bus cycles; e.g., a linked list must update both the prev and next node links when adding an element. A typical mechanism for this is to mask interrupts, to ensure the code is atomic. Some CPUs may allow bus locking, but this is not universal. The context switching code must also be atomic. A consequence of an interrupt is typically rescheduling. E.g., a kernel interrupt handler may have acked a disk controller and started a write operation. Then a kernel thread may be scheduled to write more buffered data from the original user space write().
Interrupts occurring at any time can break some subsystem's assumptions of atomic behavior. Instead of allowing interrupts to use the subsystem, they are prohibited from using it.
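To make the linked-list example concrete, here is a hedged sketch (list and function names are made up) of how kernel code typically protects such an update against both local interrupts and other CPUs, using a spinlock:

    #include <linux/list.h>
    #include <linux/spinlock.h>

    static LIST_HEAD(pending_list);             /* made-up list shared with an ISR */
    static DEFINE_SPINLOCK(pending_lock);

    struct pending_item {
        struct list_head node;
        /* ... payload ... */
    };

    static void queue_item(struct pending_item *item)
    {
        unsigned long flags;

        /* Masks interrupts locally and takes the lock, so neither an interrupt
         * on this CPU nor another CPU can see the list with only one of the
         * prev/next pointers updated. */
        spin_lock_irqsave(&pending_lock, flags);
        list_add_tail(&item->node, &pending_list);
        spin_unlock_irqrestore(&pending_lock, flags);
    }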
Summary
Linux must handle three things: the current process execution context, the current virtual memory layout and hardware requests. They all need to work together. As interrupts may happen at any time, they occur in any process context. Using sleep(), etc. in an interrupt would put random processes to sleep. Allowing large stack allocations in an interrupt could blow up the limited stack. These design choices limit what can happen in a Linux interrupt handler. Various configuration options can allow re-entrant interrupts, but this is often CPU-specific.
A benefit of keeping the top half (now the main interrupt handler) small is that interrupt latency is reduced. Busy work should be done in a kernel thread. An interrupt service routine that would need to unmask interrupts is already somewhat anti-social to the Linux ecosystem. That work should be put in a kernel thread.
The Linux interrupt context really doesn't exist, in some sense. It is only a CPU interrupt, which may happen in any process context. The Linux interrupt context is actually a set of coding limitations that follow as a consequence of this.
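A tiny, hedged illustration of that last point (the handler name is made up): whichever task happened to be running is still visible as current inside the handler, which is exactly why the handler must not sleep or allocate large stacks on its behalf.

    #include <linux/interrupt.h>
    #include <linux/printk.h>
    #include <linux/sched.h>

    static irqreturn_t demo_handler(int irq, void *dev_id)
    {
        /* 'current' is whatever task the interrupt happened to land on;
         * it has nothing to do with this device. */
        pr_info("IRQ %d fired on top of %s (pid %d)\n",
                irq, current->comm, current->pid);
        return IRQ_HANDLED;
    }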

task gate, interrupt gate, call gate

I have been trying to read more about the different gates in the x86 architecture. If I understand correctly, interrupt and trap gates are used for hardware and software interrupt handling respectively.
Whereas the CALL gate is probably no longer used, having been replaced by SYSENTER and SYSEXIT.
I was wondering how task gates are used (I know they are used for hardware task switching). What does that exactly mean? Does a hardware task refer to an OS task/process? Or is it more like switching between two different instances of an operating system (maybe on servers)?
On a side note, can it happen that some interrupts are handled in user mode? (Can we handle the divide-by-zero exception in user mode? If so, does that mean the IDT handler entry for divide-by-zero contains an address from user space?)
Thanks
Everything you might want to know about interrupts and gates is in the Intel developer manual, volume 3. In short:
Task gates were originally designed as a CPU-mediated method of performing task switching; the CPU can automatically record the state of the process during the task switching operation. These are not typically used in modern operating systems; the OS usually does the state-saving operations on its own.
At least in Linux, all interrupt handlers are in kernel space and execute at ring 0. If you want to handle a divide-by-zero exception, you register a userspace signal handler for SIGFPE; the kernel-space interrupt handler raises the SIGFPE signal, indirectly triggering the userspace handler code (the userspace code is executed after returning from the interrupt handler).
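For completeness, a small userspace sketch of that flow: the CPU's #DE exception enters the kernel's ring-0 handler, which delivers SIGFPE to the process. Returning normally from the handler would re-execute the faulting instruction, so this sketch jumps past it instead.

    /* Catching integer divide-by-zero via SIGFPE in userspace. */
    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>

    static sigjmp_buf recover;

    static void fpe_handler(int sig)
    {
        siglongjmp(recover, 1);     /* skip past the faulting instruction */
    }

    int main(void)
    {
        struct sigaction sa;
        volatile int zero = 0;

        sa.sa_handler = fpe_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGFPE, &sa, NULL);

        if (sigsetjmp(recover, 1) == 0)
            printf("%d\n", 1 / zero);   /* #DE -> kernel -> SIGFPE -> fpe_handler */
        else
            printf("recovered from divide-by-zero\n");

        return 0;
    }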
The state of affairs is that only interrupt and trap gates were ever in actual use, and they remain in use now. In theory, both of them can be used for software as well as hardware event handling. The only difference between them is that a call through an interrupt gate automatically disables further interrupts, which can be useful in some cases of hardware interrupt handling.
By default people try to use trap gates, because unnecessary interrupt disabling is a bad thing: it increases interrupt handling latency and increases the probability of losing an interrupt.
Call gates have never been in actual use. They are an inconvenient and suboptimal way to implement system calls. Instead of call gates, most operating systems use a trap gate (int 0x80 in Linux and int 0x2E in Windows) or the sysenter/sysexit and syscall/sysret instructions.
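As a hedged illustration of the trap-gate path on 32-bit Linux (i386 syscall numbers; build with gcc -m32), a write() issued directly through int 0x80:

    /* Invoke write(1, msg, len) directly through the int 0x80 trap gate. */
    static const char msg[] = "hello via int 0x80\n";

    int main(void)
    {
        long ret = 4;                       /* __NR_write on i386 */

        __asm__ volatile ("int $0x80"
                          : "+a" (ret)      /* in: syscall number, out: return value */
                          : "b" (1),        /* fd = stdout */
                            "c" (msg),      /* buffer      */
                            "d" (sizeof(msg) - 1)
                          : "memory");

        return ret == sizeof(msg) - 1 ? 0 : 1;
    }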
Task gates have never been in actual use either. They are a suboptimal, inconvenient and limited feature, if not outright ugly. Instead, operating systems usually implement task switching on their own by switching kernel-mode task stacks.
Initially, Intel provided hardware support for multitasking by introducing the TSS (Task State Segment) and the task gate. With these features, the processor is able to automatically store the state of one task and restore the state of another in reply to a request coming from hardware or software. A software request can be made by issuing a call or jmp instruction with a TSS selector or task gate selector as the instruction operand. A hardware request can be made by the hardware trapping into the task gate in the appropriate IDT entry. But as I've already mentioned, no one really uses this. Instead, operating systems use only one TSS for all tasks (a TSS must be used in any case, because during a control transfer from a less privileged segment to a more privileged segment the CPU switches stacks, and it takes the address of the stack for the more privileged segment from the TSS) and perform task switches manually.
In theory, interrupts and exceptions can be handled in user mode (ring 3), but in practice it is not useful, and operating systems handle all such events on the kernel side (in ring 0). The reason is simple: interrupt and exception handlers must always reside in memory and be accessible from any address space. The kernel part of the address space is shared and identical across the address spaces of all tasks in the system, but the user part of the address space is tied to a particular task. If you wanted to handle exceptions in user mode, you would be forced to reprogram the IDT on each task switch, which would introduce a significant performance penalty. If you wanted to handle interrupts the same way, you would be forced to share the interrupt handlers between all tasks at the same addresses. As an unwanted consequence, any task in the system would be able to corrupt the handlers.

What happens when kernel code is interrupted?

I am reading Operating System Concepts (Silberschatz,Galvin,Gagne), 6th edition, chapter 20.
I understand that Linux kernel code is non-preemptible (before the 2.6 version). But it can be interrupted by hardware interrupts. What happens if the kernel was in the middle of a critical section when the interrupt occurred, and the interrupt handler also executed that critical section?
From what I read in the book:
The second protection scheme that Linux uses applies to critical sections that occur in the interrupt service routines. The basic tool is the processor interrupt control hardware...
OK, this scheme is used when an ISR has a critical section. But it will only disable further interrupts. What about the kernel code which was interrupted by this interrupt in the first place?
But it will only disable further interrupts. What about the kernel code which was interrupted by this interrupt in the first place?
If the interrupt handler and other kernel code need access to the same data, you need to protect against that, which is usually done with a spinlock. Great care must be taken: you don't want to introduce a deadlock, and you must ensure the spinlock is not held for too long. For spinlocks used in a hardware interrupt handler, you have to disable interrupts on that processor whilst holding the lock, which in Linux is done with the function spin_lock_irqsave().
(whilst a bit outdated, you can read about the concept here)
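A hedged sketch of that pattern (the lock, counter and function names are made up), showing both sides of the lock:

    #include <linux/interrupt.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(dev_lock);
    static unsigned int dev_events;      /* made-up data shared with the ISR */

    /* Hard-IRQ side: interrupts are already off on this CPU, so a plain
     * spin_lock() is enough here. */
    static irqreturn_t dev_isr(int irq, void *dev_id)
    {
        spin_lock(&dev_lock);
        dev_events++;
        spin_unlock(&dev_lock);
        return IRQ_HANDLED;
    }

    /* Process-context side: must also mask local interrupts while holding
     * the lock, otherwise the ISR could fire on this CPU and spin forever
     * on the lock we already hold (a self-deadlock). */
    static unsigned int read_and_clear_events(void)
    {
        unsigned long flags;
        unsigned int n;

        spin_lock_irqsave(&dev_lock, flags);
        n = dev_events;
        dev_events = 0;
        spin_unlock_irqrestore(&dev_lock, flags);
        return n;
    }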
The kernel code which was interrupted by this interrupt in the first place gets interrupted.
This is why writing interrupt handlers is such a painful task: they can't do anything that would endanger the correctness of the main stream of execution.
For example, the way Apple's xnu kernel handles most kinds of device interrupts is to capture the information from the interrupt into a record in memory, add that record to a queue, and then resume normal execution; the kernel then picks up interrupts from the queue some time later (in the scheduler's main loop, I assume). That way, the interrupt handler only interacts with the rest of the system through the interrupt queue, and there is little danger of it causing trouble.
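A hedged, generic sketch of that queue idea (not xnu's actual code): a fixed-size single-producer/single-consumer ring; a real implementation would need proper atomics/memory barriers.

    #include <stddef.h>

    #define IRQ_QUEUE_SIZE 64                 /* must be a power of two */

    struct irq_record {
        int source;                           /* which device raised it */
        unsigned int status;                  /* snapshot of its status */
    };

    static struct irq_record irq_queue[IRQ_QUEUE_SIZE];
    static volatile unsigned int irq_head, irq_tail;   /* head: producer, tail: consumer */

    /* Called from the interrupt handler: capture the event and return quickly. */
    static int irq_queue_push(int source, unsigned int status)
    {
        unsigned int head = irq_head;

        if (head - irq_tail == IRQ_QUEUE_SIZE)
            return -1;                        /* queue full, event dropped */

        irq_queue[head % IRQ_QUEUE_SIZE] = (struct irq_record){ source, status };
        irq_head = head + 1;
        return 0;
    }

    /* Called later from normal execution (e.g. the scheduler's main loop). */
    static int irq_queue_pop(struct irq_record *out)
    {
        unsigned int tail = irq_tail;

        if (tail == irq_head)
            return -1;                        /* nothing pending */

        *out = irq_queue[tail % IRQ_QUEUE_SIZE];
        irq_tail = tail + 1;
        return 0;
    }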
There is a bit of middle ground; on many architectures (including the x86), it is possible for privileged code to mask interrupts, so that they won't cause interruption. That can be used to protect passages of code which really shouldn't be interrupted. However, those architectures typically also have non-maskable interrupts, which ignore the masking, so interruption still has to be considered.
