Interrupt Nested, Sequencing - linux-kernel

I am reading the Linux kernel documents and I have these questions (x86_64 arch):
When the PIC sends an interrupt to the CPU, will that disable that specific interrupt until the acknowledgement comes from the CPU? If that is the case, why do we need to call local_irq_disable() in the ISR?
Related to the above question: say the CPU is processing an interrupt in its ISR and three interrupts are sent to the CPU by the same device. How is this going to be handled? Will they be serialised in some buffer (if yes, where)?
Does the x86 architecture support priority-based interrupts?

The PIC is a very old interrupt controller; today interrupts are mostly delivered through MSI or through the APIC hierarchy.
The matter is actually more complicated with the IRQ routing, virtualization and so on.
I won't discuss these.
The interrupt priority concept still exists (though a bit simplified) and it works like this:
When an interrupt request is received by the interrupt controller, all the lower priority interrupts are masked and the interrupt is sent to the CPU.
What actually happens is that interrupts are ordered by their request number, with lower numbers having higher priority (0 has more priority than 1).
When any request line is toggled or asserted, the interrupt controller will scan the status of each request line from number 0 up to the last one.
It stops as soon as it finds a line that is asserted or that is marked (with the use of a secondary register) as being serviced.
This way, if request line 2 is asserted first and then request line 4 is, the interrupt controller won't serve the latter request until the first one is "done", because line 2 stops the scanning.
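To illustrate the scan described above, here is a small software model of that 8259-style priority logic (this is only an illustration, not real driver code):

    /* Illustrative model only: how an 8259-style PIC picks the next request.
     * 'irr' holds the pending-request bits, 'isr' the in-service bits;
     * a lower line number means a higher priority. */
    static int pic_pick_line(unsigned char irr, unsigned char isr)
    {
        int line;

        for (line = 0; line < 8; line++) {
            unsigned char bit = 1u << line;

            if (isr & bit)
                return -1;      /* a higher-priority line is still in service: stop */
            if (irr & bit)
                return line;    /* deliver this request to the CPU */
        }
        return -1;              /* nothing pending */
    }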
So local_irq_disable may be used to disable all interrupts, including those with higher priority.
AFAIK, this function should rarely be used today. It is a very simple, but inefficient, way to make sure no other code can run (potentially altering shared structures).
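For reference, the usual shape of such a critical section is sketched below (a minimal sketch; in driver code a spinlock variant such as spin_lock_irqsave() is normally preferred):

    #include <linux/irqflags.h>

    static void touch_shared_state(void)
    {
        unsigned long flags;

        local_irq_save(flags);      /* mask all interrupts on this CPU;
                                       local_irq_disable() is the
                                       unconditional variant */
        /* ... briefly touch data that an ISR on this CPU also touches ... */
        local_irq_restore(flags);   /* put the previous interrupt state back */
    }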
In general, there needs to be some coordination between the ISR and the device to avoid losing interrupts.
Some devices require the software to write to a special register to let them know it is able to process the next interrupt. This way the device may implement an internal queue of notifications.
The keyboard controller works kind of like this, if you don't read the scancodes fast enough, you simply lose them.
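As a hedged sketch of that idea (the legacy i8042 data port 0x60 is shown only for illustration; the real driver lives in drivers/input/serio/i8042.c):

    #include <linux/interrupt.h>
    #include <linux/io.h>

    static unsigned char last_scancode;     /* consumed by a bottom half elsewhere */

    static irqreturn_t kbd_isr(int irq, void *dev_id)
    {
        /* Reading the data port (0x60) tells the legacy 8042 controller the
         * scancode has been consumed; read it too slowly and later ones are lost. */
        last_scancode = inb(0x60);
        return IRQ_HANDLED;
    }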
If the device fires interrupts at will and too frequently, the interrupt controller can buffer the requests so they don't get lost.
Both the PIC and the LAPIC can buffer at most one request while another one is in progress (they basically use the fact that they have a request register and an in-progress register for each interrupt).
So in the case of three interrupts in a row, one is surely lost. If the interrupt controller couldn't deliver the first one to the CPU because a higher priority interrupt was in progress, then two will be lost.
In general, the software doesn't expect the interrupt controller to buffer any request.
So you shouldn't find code that relies on this (after all, the only numbers in CS are 0, 1, and infinity, so 2 doesn't exist as far as the software is concerned).
The x86, as a CPU core, doesn't support priorities when dealing with interrupts. If interrupts are not masked and a hardware interrupt arrives, it is served. It's up to the software and the interrupt controller to prioritize interrupts.
The PIC and LAPIC (and so the MSIs and the IOAPIC) both give interrupts a priority, so for all practical purposes the x86 supports a priority-based interrupt mechanism.
Note however that giving interrupts priorities is not necessarily good; it's hard to tell if a network packet is more important than a keystroke.
So Linux has the guideline to do as little work as possible in the ISR and instead to queue the rest of the work to be processed asynchronously out of the ISR.
This may simply mean returning from the ISR and deferring the rest to a work function, in order not to block other interrupts.
In the vast majority of cases, only a small portion of code needs to run in a critical section, a condition where no other interrupt should occur, so the general approach is to send the EOI to the interrupt controller and unmask the interrupt in the CPU as early as possible, and to write the code so that it can be interrupted.
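For illustration, a minimal sketch of that "do little, defer the rest" pattern using a work item (all names here are invented):

    #include <linux/interrupt.h>
    #include <linux/workqueue.h>

    static void deferred_work_fn(struct work_struct *work)
    {
        /* The bulk of the processing runs here, in process context,
         * with the interrupt line already re-enabled. */
    }
    static DECLARE_WORK(deferred_work, deferred_work_fn);

    static irqreturn_t quick_isr(int irq, void *dev_id)
    {
        /* Ack the device as early as possible ... */
        schedule_work(&deferred_work);  /* ... and defer the rest */
        return IRQ_HANDLED;
    }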
If one needs to keep other interrupts out for performance reasons, the approach usually taken is to spread the interrupts across different cores so the load stays within the required metrics.
Before multi-core systems were widespread, having too many interrupts would effectively slow down some operations.
I guess it would be possible to load a driver that would starve other interrupts for the sake of its own performance, but that is a form of QoS/real-time requirement that is up to the user to settle.

Related

When an ISR is running, what happens to the interrupts on that particular IRQ line? Would they be lost or stored so they can be processed at a later point?

While an interrupt service routine is being handled, that particular IRQ line is disabled. So what happens when a device registered on the same IRQ line raises an interrupt? Is that interrupt lost, or stored so it can be processed at a later point?
Could someone kindly explain?
Thanks in advance.
In general, the interrupt is lost. That is, unless the device driver can deduce that a missed interrupt occurred, for example by regularly inspecting device registers related to interrupt status.
Many, if not most, device drivers do not do that. It is almost always better to handle the interrupt expeditiously and return from interrupt so the next interrupt can be handled sooner.
A reasonable goal is to limit the ISR code path to less than a dozen (preferably even fewer) lines of simple source code. This is easily achieved by servicing whatever needs servicing: usually a few transfers from/to device registers, marking a task blocked on that I/O as ready, and returning. Of course, the rest of the driver (the non-ISR portions) may have to do a little more work to accomplish such ISR efficiency, but that is good driver design IMHO.
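A sketch of what such a short ISR might look like (register offsets, bit names and the struct are all hypothetical, just to show the "service, mark ready, return" shape):

    #include <linux/interrupt.h>
    #include <linux/io.h>
    #include <linux/wait.h>

    #define REG_STATUS  0x00        /* hypothetical register offsets */
    #define REG_ACK     0x04
    #define IRQ_PENDING 0x01        /* hypothetical status bit */

    struct my_dev {                 /* hypothetical driver state */
        void __iomem *regs;
        wait_queue_head_t waitq;    /* initialised in probe() */
        bool data_ready;
    };

    static irqreturn_t short_isr(int irq, void *dev_id)
    {
        struct my_dev *dev = dev_id;
        u32 status = readl(dev->regs + REG_STATUS);

        if (!(status & IRQ_PENDING))
            return IRQ_NONE;                    /* not ours (shared line) */

        writel(status, dev->regs + REG_ACK);    /* let the device raise the next one */
        dev->data_ready = true;
        wake_up_interruptible(&dev->waitq);     /* mark the blocked reader ready */
        return IRQ_HANDLED;
    }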
I have discussed this with many device driver engineers who claim that having the ISR do more work on the spot (rather than deferring it to thread-based processing) can help improve latency and system performance. I remain unconvinced that this assertion is ever true.
Check out my answer here: On x86, when the OS disables interrupts, do they vanish, or do they queue and 'wait' for interrupts to come back on?
The interrupts on that particular IRQ line are lost. So the ISR routine should execute as quickly as possible so that such a scenario doesn't arise. That's why we moved to the top-half/bottom-half approach (tasklets, workqueues) and now to threaded IRQs.
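With threaded IRQs the split is declared at registration time; a minimal sketch, with placeholder names:

    #include <linux/interrupt.h>

    static irqreturn_t my_quick_check(int irq, void *dev_id)
    {
        /* Hard IRQ part: confirm the interrupt and kick the thread. */
        return IRQ_WAKE_THREAD;
    }

    static irqreturn_t my_thread_fn(int irq, void *dev_id)
    {
        /* Heavy processing runs here, in a schedulable kernel thread. */
        return IRQ_HANDLED;
    }

    static int my_request_irq(unsigned int irq, void *dev)
    {
        /* IRQF_ONESHOT keeps the line masked until the thread finishes. */
        return request_threaded_irq(irq, my_quick_check, my_thread_fn,
                                    IRQF_ONESHOT, "mydev", dev);
    }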

Why do we need Interrupt context?

I am having doubts: why exactly do we need an interrupt context? Everything tells me what its properties are, but nothing explains why we came up with this concept.
Another doubt related to the same concept: if we are not disabling interrupts in the interrupt handler, then what is the use of running this interrupt handler code in interrupt context?
The interrupt context is fundamentally different from the process context:
It is not associated with a process; a specific process does not serve interrupts, the kernel does. Even if a process is interrupted, it has no bearing on any parameters of the interrupt itself or on the routine that will serve it. It follows that, at the very least, interrupt context must be conceptually different from process context.
Additionally, if an interrupt were to be serviced in a process context, and (re-) scheduled some work at a later time, what context would that run in? The original process may not even exist at that later time. Ergo, we need some context which is independent from processes for a practical reason.
Interrupt handling must be fast; your interrupt handler has interrupted (d'oh) some other code. Significant work should be pushed outside the interrupt handler, onto the "bottom half". It is unacceptable to block a process for work which is not even remotely its concern, either in user or in kernel space.
Disabling the interrupt line is something you can (actually could, before 2.6.36) request when registering your ISR. Recall that a handler can serve interrupts on multiple CPUs simultaneously and can thus race with itself. Non-Maskable Interrupts (NMIs) cannot be disabled.
Why do we need Interrupt context?
First, what do we mean by interrupt context? A context is usually a state. There are two separate concepts of state.
CPU context
Every CPU architecture has a mechanism for handling interrupts. There may be a single interrupt vector called for every system interrupt, or the CPU/hardware may be capable of dispatching to a particular address based on the interrupt source. There are also mechanisms for masking/unmasking interrupts. Each interrupt may be masked individually, or there may be a global mask for the entire CPU(s). Finally, there is the actual CPU state. Some architectures have separate stacks, register sets, and CPU modes, implying certain memory and other privileges. Your question is about Linux in general, and it must handle all cases.
Linux context
Generally, all of the architectures have a separate kernel stack, process context (a la ps) and VM (virtual memory) context for each process. The VM has different privileges for user and kernel modes. In order for the kernel to run all the time, it must remain mapped for all processes on a device. A kernel thread is a special case that doesn't care so much about the VM, because it is privileged and can access all kernel memory. However, it does have a separate stack and process context. User registers are typically stored on the kernel stack when exceptions happen. Exceptions include at least page faults, system calls and interrupts. These items may nest. I.e., you may call write() from user space, and while the kernel is transferring a user buffer, it may page fault to read some swapped-out user-space data; while servicing the page fault, an interrupt may also have to be serviced.
Interrupt recursion
Linux generally wants you to leave interrupts masked, as the VM, the exceptions, and process management (context and context switching) have to work together. In order to keep things simple for the VM, the kernel stack and process context are generally rooted in a single 4k (or 8k) area, which is only one or two VM pages. This area is always mapped. Typically, all CPUs will switch from interrupt mode to system mode when servicing an interrupt and use the same kernel stack as all other exceptions. The stack is small, so allowing recursion (and large stack allocations) can blow up the stack, resulting in stack overflows at the kernel level. This is bad.
Atomicity
Many kernel structures need to stay consistent over multiple bus cycles; e.g., a linked list must update both the prev and next node links when adding an element. A typical mechanism to do this is to mask interrupts, to ensure the code is atomic. Some CPUs may allow bus locking, but this is not universal. The context-switching code must also be atomic. A consequence of an interrupt is typically a reschedule. E.g., a kernel interrupt handler may have acked a disk controller and started a write operation; then a kernel thread may be scheduled to write more buffered data from the original user-space write().
Interrupts occurring at any time can break some subsystem's assumptions of atomic behavior. Instead of allowing the interrupt handler to use that subsystem, it is simply prohibited from using it.
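In Linux this is typically expressed with a spinlock that also masks interrupts; a minimal sketch of the linked-list case (list and lock names are illustrative):

    #include <linux/list.h>
    #include <linux/spinlock.h>

    static LIST_HEAD(pending);
    static DEFINE_SPINLOCK(pending_lock);

    struct item {
        struct list_head node;
        /* payload ... */
    };

    /* Called from process context; must also be safe against an interrupt
     * handler that walks the same list. */
    static void queue_item(struct item *it)
    {
        unsigned long flags;

        spin_lock_irqsave(&pending_lock, flags);    /* masks IRQs on this CPU */
        list_add_tail(&it->node, &pending);         /* prev/next updated atomically
                                                       with respect to the handler */
        spin_unlock_irqrestore(&pending_lock, flags);
    }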
Summary
Linux must handle three things: the current process execution context, the current virtual memory layout, and hardware requests. They all need to work together. As interrupts may happen at any time, they occur in an arbitrary process context. Using sleep(), etc. in an interrupt would put a random process to sleep. Allowing large stack allocations in an interrupt could blow up the limited stack. These design choices limit what can happen in a Linux interrupt handler. Various configuration options can allow re-entrant interrupts, but this is often CPU specific.
A benefit of keeping the top half (now the main interrupt handler) small is that interrupt latency is reduced. Busy work should be done in a kernel thread. An interrupt service routine that would need to unmask interrupts is already somewhat antisocial to the Linux ecosystem; that work should be put in a kernel thread.
The Linux interrupt context really doesn't exist in some sense. It is only a CPU interrupt, which may happen in any process context. The Linux interrupt context is actually a set of coding limitations that follow as a consequence of this.

task gate, interrupt gate, call gate

I have been trying to read more about the different gates in the x86 architecture. If I understand correctly, interrupt and trap gates are used for h/w and s/w interrupt handling respectively.
The CALL gate, on the other hand, is probably no longer used, having been replaced by SYSENTER and SYSEXIT.
I was wondering how task gates are used (I know they are used for h/w task switching). What does that mean exactly? Does a h/w task refer to an OS task/process? Or is it more like switching between two different instances of an operating system (maybe on servers)?
On a side note, can it happen that some of the interrupts are handled in user mode? (Can we handle the divide-by-zero interrupt in user mode? If so, does that mean the IDT handler entry for divide-by-zero contains an address from user space?)
Thanks
Everything you might want to know about interrupts and gates is in the Intel developer manual, volume 3. In short:
Task gates were originally designed as a CPU-mediated method of performing task switching; the CPU can automatically record the state of the process during the task switching operation. These are not typically used in modern operating systems; the OS usually does the state-saving operations on its own.
At least in Linux, all interrupt handlers are in kernel space and execute at ring 0. If you want to handle a divide-by-zero exception, you register a userspace signal handler for SIGFPE; the kernel-space interrupt handler raises the SIGFPE signal, indirectly triggering the userspace handler code (the userspace code is executed after returning from the interrupt handler).
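For illustration, a user-space program might catch that signal roughly like this (plain C; the only kernel involvement is delivering SIGFPE):

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static void fpe_handler(int sig)
    {
        /* Only async-signal-safe calls here; returning would simply
         * re-execute the faulting divide, so exit instead. */
        static const char msg[] = "caught SIGFPE (divide error)\n";
        write(STDERR_FILENO, msg, sizeof(msg) - 1);
        _exit(1);
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = fpe_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGFPE, &sa, NULL);

        volatile int zero = 0;
        return 1 / zero;    /* CPU raises #DE; the kernel's ring-0 handler
                               turns it into SIGFPE for this process */
    }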
The state of affairs is that only interrupt and trap gates were ever in actual use, and they remain in use now. In theory, both of them can be used for s/w as well as h/w event handling. The only difference between them is that an interrupt gate automatically prohibits further interrupts, which can be useful in some cases of hardware interrupt handling.
By default, people try to use trap gates, because unnecessary interrupt disabling is a bad thing: it increases interrupt handling latency and increases the probability of losing an interrupt.
Call gates have never been in actual use. They are an inconvenient and suboptimal way to implement system calls. Instead of call gates, most operating systems use a trap gate (int 0x80 in Linux and int 0x2E in Windows) or the sysenter/sysexit and syscall/sysret instructions.
Task gates have never been in actual use either. They are a suboptimal, inconvenient and limited feature, if not outright ugly. Instead, operating systems usually implement task switching on their own by switching kernel-mode task stacks.
Initially, Intel delivered hardware support for multitasking through the introduction of the TSS (Task State Segment) and the task gate. With these features, the processor is able to automatically store the state of one task and restore the state of another in reply to a request coming from h/w or s/w. An s/w request can be made by issuing a call or jmp instruction with a TSS selector or task gate selector as the instruction operand. A h/w request can be made by the hardware trapping into the task gate in the appropriate IDT entry. But as I've already mentioned, no one really uses this. Instead, operating systems use only one TSS for all tasks (a TSS must be used in any case, because during a control transfer from a less privileged segment to a more privileged one the CPU switches stacks, and it takes the address of the stack for the more privileged segment from the TSS) and perform task switches manually.
In theory, interrupts and exceptions can be handled in user mode (ring 3), but in practice it is not useful, and operating systems handle all such events on the kernel side (in ring 0). The reason is simple: interrupt and exception handlers must always reside in memory and be accessible from any address space. The kernel part of the address space is shared and identical across the address spaces of all tasks in the system, but the user part of the address space is tied to a particular task. If you wanted to handle exceptions in user mode, you would be forced to reprogram the IDT on each task switch, which would introduce a significant performance penalty. If you wanted to handle interrupts the same way, you would be forced to share interrupt handlers between all tasks at the same addresses; as an unwanted consequence, any task in the system would be able to corrupt the handlers.

What happens when kernel code is interrupted?

I am reading Operating System Concepts (Silberschatz, Galvin, Gagne), 6th edition, chapter 20.
I understand that Linux kernel code is non-preemptible (before version 2.6). But it can be interrupted by hardware interrupts. What happens if the kernel was in the middle of a critical section, an interrupt occurred, and its handler also entered the critical section?
From what I read in the book:
The second protection scheme that Linux uses applies to critical sections that occur in the interrupt service routines. The basic tool is the processor interrupt control hardware...
Ok, this scheme is used when an ISR has a critical section. But it will only disable further interrupts. What about the kernel code which was interrupted by this interrupt in the first place?
But it will only disable further interrupts. What about the kernel code which was interrupted by this interrupt in the first place?
If the interrupt handler and other kernel code need access to the same data, you need to protect against that, which is usually done with a spinlock. Great care must be taken: you don't want to introduce a deadlock, and you must ensure such a spinlock is not held for too long. For spinlocks used in a hardware interrupt handler you have to disable interrupts on that processor whilst holding the lock, which in Linux is done with the function spin_lock_irqsave().
(whilst a bit outdated, you can read about the concept here)
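A minimal sketch of that two-sided pattern (names invented): the handler takes the plain lock, while process-context code must use the irqsave variant to avoid deadlocking against the handler on the same CPU:

    #include <linux/interrupt.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(dev_lock);
    static unsigned int dev_events;             /* shared with the handler */

    static irqreturn_t dev_isr(int irq, void *dev_id)
    {
        spin_lock(&dev_lock);                   /* IRQs already off on this CPU */
        dev_events++;
        spin_unlock(&dev_lock);
        return IRQ_HANDLED;
    }

    /* Process-context kernel code touching the same data. */
    static unsigned int read_and_clear_events(void)
    {
        unsigned long flags;
        unsigned int n;

        /* A plain spin_lock() here could deadlock if the IRQ fired on this
         * CPU while the lock is held, so interrupts must be disabled too. */
        spin_lock_irqsave(&dev_lock, flags);
        n = dev_events;
        dev_events = 0;
        spin_unlock_irqrestore(&dev_lock, flags);
        return n;
    }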
The kernel code which was interrupted by this interrupt in the first place gets interrupted.
This is why writing interrupt handlers is such a painful task: they can't do anything that would endanger the correctness of the main stream of execution.
For example, the way Apple's xnu kernel handles most kinds of device interrupts is to capture the information from the interrupt into a record in memory, add that record to a queue, and then resume normal execution; the kernel then picks up interrupts from the queue some time later (in the scheduler's main loop, I assume). That way, the interrupt handler only interacts with the rest of the system through the interrupt queue, and there is little danger of it causing trouble.
There is a bit of middle ground; on many architectures (including the x86), it is possible for privileged code to mask interrupts, so that they won't cause interruption. That can be used to protect passages of code which really shouldn't be interrupted. However, those architectures typically also have non-maskable interrupts, which ignore the masking, so interruption still has to be considered.

tasklet advantage in userspace application

Got some doubts with the bottom half. Here, I consider tasklets only.
Also, I consider a non-preemptible kernel only.
Suppose an Ethernet driver in which RX interrupt processing does some 10 function calls (bad programming :) ).
Now, looking at it from a performance perspective: if 9 of the function calls can be moved to a tasklet and only 1 needs to be made in the interrupt handler, can I really get good performance in a TCP read application?
Or, in other words: when there is a switch to the user-space application, all 9 function calls of the scheduled tasklets will have been made; in effect, the user-space application will be able to get the packet and its data only after all the scheduled tasklets are completed? Correct?
I understand that by having a bottom half we are enabling all interrupts, but I have a doubt whether the application that relies on the interrupt actually gains anything by having the entire 10 functions in the interrupt handler itself rather than in the bottom half.
In short, by using a tasklet, do I gain a performance improvement in the user-space application here?
Since tasklets are not queued but scheduled, i.e. several hardware interrupts posting the same tasklet might result in a single tasklet function invocation, you would be able to save up to 90% of the processing in extreme cases.
On the other hand there's already a high-priority soft IRQ for net-rx.
In my experience on fast machines, moving work from the handler to the tasklet does not make the machine run faster. I've added macros in the handler that can turn my schedule_tasklet() call into a call to the tasklet function itself, and it's easy to benchmark both ways and see the difference.
But it's important that interrupt handlers finish quickly. As Nikolai mentioned, you might benefit if your device likes to interrupt a lot, but most high-bandwidth devices have interrupt mitigation hardware that makes this a less serious problem.
Using tasklets is the way that core kernel people are going to do things, so all else being equal, it's probably best to follow their lead, especially if you ever want to see your driver accepted into the mainline kernel.
I would also note that calling lots of functions isn't necessarily bad practice; modern branch predictors can make branch-heavy code run just as fast as non-branch-heavy code. Far more important in my opinion are the potential cache effects of having to do half the job now, and then half the job later.
A tasklet does not run in the context of the user process. If your ISR schedules a tasklet, it will run immediately after your ISR is done, but with interrupts enabled. The benefit of this is that your packet processing is not preventing additional interrupts.
In your TCP example, the hardware hands off the packet to the network stack and your driver is done; the net stack handles waking up the process etc., so there is really no way for the hw's driver to execute in the process context of the data's recipient, because the hw doesn't even know who that is.
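For reference, the classic (pre-5.9 API) tasklet pattern the answers above describe looks roughly like this (names invented):

    #include <linux/interrupt.h>

    static void rx_tasklet_fn(unsigned long data)
    {
        /* The 9 deferred calls would run here, with the IRQ line re-enabled. */
    }
    static DECLARE_TASKLET(rx_tasklet, rx_tasklet_fn, 0);

    static irqreturn_t eth_rx_isr(int irq, void *dev_id)
    {
        /* 1 urgent call here (e.g. ack the NIC), then defer the rest. */
        tasklet_schedule(&rx_tasklet);  /* multiple schedules before it runs
                                           coalesce into a single invocation */
        return IRQ_HANDLED;
    }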

Resources