I understand that modern operating systems provide APIs for debugging. When a debugger process asks the kernel to set a breakpoint on a machine-code instruction of another process, the kernel replaces the first byte of the instruction with an opcode that causes an interrupt.
The interrupt handler would halt the process, save the registers and notify the debugging process.
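For concreteness, here is a minimal sketch (assuming Linux/x86 and ptrace, with all error handling omitted) of how a debugger could plant such a breakpoint; 0xCC is the one-byte int3 opcode that raises the breakpoint exception:

```c
#include <sys/ptrace.h>
#include <sys/types.h>

/* Plant an int3 (0xCC) breakpoint at 'addr' in the traced process 'pid'.
 * Returns the original word so the debugger can restore it later. */
long set_breakpoint(pid_t pid, long addr)
{
    /* Read the original instruction word at the target address. */
    long orig = ptrace(PTRACE_PEEKTEXT, pid, (void *)addr, 0);

    /* Overwrite the lowest byte with the int3 opcode (0xCC). */
    long patched = (orig & ~0xFFL) | 0xCC;
    ptrace(PTRACE_POKETEXT, pid, (void *)addr, (void *)patched);

    return orig;
}
```

When the patched location is reached, the CPU raises the breakpoint exception, the kernel stops the tracee, and the debugger is notified (e.g. via waitpid() reporting a SIGTRAP); it can then restore the saved byte and single-step over the original instruction.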
What I do not understand is what exactly happens on out-of-order execution processors. The interrupt instruction could be executed before its predecessors or after its successors, and thus, at the time of the interrupt, the registers and memory would contain wrong values.
That is why all ordered events such as interrupts, faults, exceptions, etc., are always handled at the commit point in out-of-order processors, where the original program order is restored and the correct machine state can be captured. This means that you may know of a pending event but still delay handling it.
Note that actions visible to the external world, such as stores to memory, are also handled past this stage, so you can never observe the speculative internal state of an out-of-order core (well, except via side-channel attack methods...), and any interrupt or breakpoint will also be ordered correctly with respect to them.
Related
I understand that instructions can be re-ordered by the processor in addition to compilers.
I have a few questions that I can not get my head around.
Say we have three instructions:
Program order
S1
S2
S3
After re-ordering by the processor, order becomes (for whatever reason):
S3
S2
S1
So when the processor executes S1 (in the program order), what would be the value of the Program Counter?
If Windows (or another OS) context-switches the thread out and schedules it on another processor, how would the other processor know which instruction to execute next? (Is it guaranteed to make the same re-orderings?)
Is a memory fence (for example, a full fence created by an atomic compare-and-swap instruction) on one processor still valid after the thread is scheduled onto another processor?
Any ideas on this are highly appreciated.
There is an instruction pointer associated with each instruction.
Although instructions may be executed out of order, they always complete in order. When an interrupt or fault occurs, all instructions preceding the saved IP address have been completed. The results of any subsequent instructions are discarded. When execution resumes, it starts at the saved address.
The steps taken by the OS to schedule a thread on another processor include fencing operations on both processors, so when the thread resumes on the new processor, all preceding operations are fully fenced (whether or not any explicit fences exist in the code of the thread).
Unlike static compile-time ordering, out-of-order execution preserves the illusion of running instructions in program order, including the situation seen by an interrupt handler. Current CPUs don't rename the privilege level, so they generally roll back to a consistent state as part of taking an exception or interrupt rather than keeping un-executed instructions in flight. (See: When an interrupt occurs, what happens to instructions in the pipeline?)
This also means that interrupts are delivered strictly between instructions, not in the middle of one; the CPU doesn't interrupt an assembly instruction while it is operating (except for "interruptible" instructions like rep movsb that logically work as multiple instructions, or vpgatherdd, which has documented semantics for a page fault in one of the gather operands).
Memory ordering as observed by other cores is another matter, and can differ from program order even on an in-order CPU. (Can a speculatively executed CPU branch contain opcodes that access RAM?)
The kernel code for a context switch needs to include a strong enough barrier for a thread to see its own stores in program order when it resumes on another core. Generally just release/acquire sync is sufficient (and you already need something like that for the kernel on the other core to restore register values). Maybe also an sfence to make that apply even for NT stores on x86.
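As an illustration of that release/acquire handoff, here is a plain C11-atomics sketch (userspace code, not the actual kernel implementation; the structure and function names are made up):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-thread state published by the old CPU and consumed
 * by the new CPU; the layout is illustrative only. */
struct saved_context {
    unsigned long regs[16];
    atomic_bool   ready;
};

/* Old CPU: save registers, then publish with release semantics so all
 * preceding stores become visible before 'ready' reads as true elsewhere. */
void publish_context(struct saved_context *ctx, const unsigned long *regs)
{
    for (int i = 0; i < 16; i++)
        ctx->regs[i] = regs[i];
    atomic_store_explicit(&ctx->ready, true, memory_order_release);
}

/* New CPU: wait with acquire semantics, guaranteeing the register values
 * (and the thread's earlier ordinary stores) are visible before use. */
void resume_context(struct saved_context *ctx, unsigned long *regs)
{
    while (!atomic_load_explicit(&ctx->ready, memory_order_acquire))
        ;  /* spin; a real kernel would block or use a proper wait queue */
    for (int i = 0; i < 16; i++)
        regs[i] = ctx->regs[i];
}
```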
I am reading the Linux kernel documents and I have these questions (x86_64 arch):
When the PIC sends an interrupt to the CPU, will that disable that specific interrupt until the acknowledgement comes from the CPU? If that is the case, why do we need local_irq_disable() in the ISR?
Related to the above question: say the CPU is processing an interrupt in its ISR and 3 interrupts are sent by the same device to the CPU; how is this going to be handled? Will they be serialised in some buffer (if yes, where)?
Does the x86 architecture support priority-based interrupts?
The PIC is a very old interrupt controller; today interrupts are mostly delivered through MSI or through the APIC hierarchy.
The matter is actually more complicated with the IRQ routing, virtualization and so on.
I won't discuss these.
The interrupt priority concept still exists (though a bit simplified) and it works like this:
When an interrupt request is received by the interrupt controller, all the lower priority interrupts are masked and the interrupt is sent to the CPU.
What actually happens is that interrupts are ordered by their request number, with lower numbers having higher priority (0 has more priority than 1).
When any request line is toggled or asserted, the interrupt controller will scan the status of each request line from the number 0 up to the last one.
It stops as soon as it finds a line that is asserted or that is marked (with the use of a secondary register) as being in processing.
This way, if request line 2 is asserted first and then request line 4 is, the interrupt controller won't serve this last request until the first one is "done", because line 2 stops the scanning (a sketch of this scan follows below).
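Here is a sketch of that fixed-priority scan, assuming an 8259-style pair of registers (IRR for pending requests, ISR for requests in service); this only illustrates the logic described above, not the real hardware:

```c
#include <stdint.h>

/* Scan from line 0 upward and stop at the first line that is either
 * requesting service or still in service.  Returns the line to deliver
 * to the CPU, or -1 if nothing should be delivered right now. */
int next_line_to_serve(uint8_t irr, uint8_t isr)
{
    for (int line = 0; line < 8; line++) {
        uint8_t mask = 1u << line;
        if (isr & mask)
            return -1;       /* a higher-priority line is still in service */
        if (irr & mask)
            return line;     /* deliver this request to the CPU */
    }
    return -1;               /* no pending requests */
}
```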
So local_irq_disable may be used to disable all interrupts, including those with higher priority.
AFAIK, this function should be rarely used today. It is a very simple, but inefficient, way to make sure no other code can run (potentially altering common structures).
In general, there needs to be some coordination between the ISR and the device to avoid losing interrupts.
Some devices require the software to write to a special register to let them know it is able to process the next interrupt. This way the device may implement an internal queue of notifications.
The keyboard controller works kind of like this, if you don't read the scancodes fast enough, you simply lose them.
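For illustration, here is a bare-metal style sketch of that coordination using the legacy PS/2 controller ports (status port 0x64, data port 0x60); reading the data port is what frees the controller's single-byte buffer for the next scancode:

```c
#include <stdint.h>

#define KBD_STATUS_PORT 0x64   /* bit 0: output buffer full */
#define KBD_DATA_PORT   0x60

static inline uint8_t inb(uint16_t port)
{
    uint8_t val;
    __asm__ volatile ("inb %1, %0" : "=a"(val) : "Nd"(port));
    return val;
}

/* Drain whatever scancodes are pending.  If the ISR is too slow to get
 * here, the controller's buffer is overwritten and the scancode is
 * simply lost, as noted above. */
void kbd_isr_body(void)
{
    while (inb(KBD_STATUS_PORT) & 0x01) {
        uint8_t scancode = inb(KBD_DATA_PORT);
        (void)scancode;        /* queue it for the bottom half here */
    }
}
```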
If the device fires interrupts at will and too frequently, the interrupt controller can buffer the requests so they don't get lost.
Both the PIC and the LAPIC can buffer at most one request while another one is in progress (they basically use the fact that they have a request register and an in-progress register for each interrupt).
So in the case of three interrupts in a row, one is surely lost. If the interrupt controller couldn't deliver the first one to the CPU because a higher priority interrupt was in progress, then two will be lost.
In general, the software doesn't expect the interrupt controller to buffer any request.
So you shouldn't find code that relies on this (after all, the only numbers in CS are 0, 1, and infinity, so 2 doesn't exist as far as the software is concerned).
The x86, as a CPU core, doesn't support priority when dealing with interrupts. If interrupts are not masked and a hardware interrupt arrives, it is served. It's up to the software and the interrupt controller to prioritise interrupts.
The PIC and LAPIC (and so the MSIs and the IOAPIC) both give interrupts a priority, so for all practical purposes the x86 supports a priority-based interrupt mechanism.
Note however that giving interrupts a priority is not necessarily good: it's hard to tell whether a network packet is more important than a keystroke.
So Linux has the guideline to do as little work as possible in the ISR and instead queue the rest of the work to be processed asynchronously outside the ISR.
This may mean just acknowledging the interrupt in the ISR and deferring the rest to a work function, in order not to block other interrupts.
In the vast majority of cases, only a small portion of code needs to run in a critical section (a condition where no other interrupt should occur), so the general approach is to send the EOI to the interrupt controller and unmask the interrupt in the CPU as early as possible, and to write the code so that it can be interrupted, as in the sketch below.
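Here is a minimal Linux-driver-style sketch of that "small top half, deferred bottom half" approach (the device-specific acknowledgement is hypothetical and left as a comment):

```c
#include <linux/interrupt.h>
#include <linux/workqueue.h>

static struct work_struct mydev_work;

/* Bottom half: runs later in process context with interrupts enabled,
 * so it may sleep, allocate memory, etc. */
static void mydev_do_work(struct work_struct *work)
{
    /* ... the bulk of the processing goes here ... */
}

/* Top half: keep it short.  Acknowledge the device and defer the rest. */
static irqreturn_t mydev_isr(int irq, void *dev_id)
{
    /* mydev_ack_irq(dev_id);  hypothetical device-specific acknowledgement */
    schedule_work(&mydev_work);
    return IRQ_HANDLED;
}

static int mydev_setup_irq(int irq, void *dev)
{
    INIT_WORK(&mydev_work, mydev_do_work);
    return request_irq(irq, mydev_isr, IRQF_SHARED, "mydev", dev);
}
```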
In case one needs to hold off other interrupts for performance reasons, the approach usually taken is to split the interrupts across different cores so the load stays within the required metrics.
Before multi-core systems were widespread, having too many interrupts would effectively slow down some operations.
I guess it would be possible to load a driver that denies other interrupts for its own performance, but that is a form of QoS/real-time requirement that is up to the user to settle.
I am having doubts: why exactly do we need an interrupt context? Everything I read tells me what its properties are, but nothing explains why we came up with this concept.
Another doubt related to the same concept: if we are not disabling interrupts in the interrupt handler, then what is the use of running this interrupt handler code in interrupt context?
The interrupt context is fundamentally different from the process context:
It is not associated with a process; a specific process does not serve interrupts, the kernel does. Even if a process is interrupted, it has no influence over any parameter of the interrupt itself or of the routine that will serve it. It follows that, at the very least, interrupt context must be conceptually different from process context.
Additionally, if an interrupt were serviced in a process context and (re-)scheduled some work for a later time, what context would that work run in? The original process may not even exist at that later time. Ergo, we need some context that is independent of processes, for a practical reason.
Interrupt handling must be fast; your interrupt handler has interrupted (d'oh) some other code. Significant work should be pushed outside the interrupt handler, onto the "bottom half". It is unacceptable to block a process for work which is not even remotely its concern, either in user or in kernel space.
Disabling interrupts is something you can (actually could, before 2.6.36) request when registering your ISR. Recall that a handler can serve interrupts on multiple CPUs simultaneously and can thus race with itself. Non-Maskable Interrupts (NMIs) cannot be disabled.
Why do we need Interrupt context?
First, what do we mean by interrupt context? A context is usually a state. There are two separate concepts of state.
CPU context
Every CPU architecture has a mechanism for handling interrupts. There may be a single interrupt vector called for every system interrupt, or the CPU/hardware may be capable of dispatching the CPU to a particular address based on the interrupt source. There are also mechanisms for masking/unmasking interrupts. Each interrupt may be masked individually, or there may be a global mask for the entire CPU(s). Finally, there is the actual CPU state. Some CPUs have separate stacks, register sets, and CPU modes, implying certain memory and other privileges. Your question is about Linux in general, which must handle all cases.
Linux context
Generally, all of the architectures have a separate kernel stack, process context (a la ps), and VM (virtual memory) context for each process. The VM has different privileges for user and kernel modes. In order for the kernel to run all the time, it must remain mapped for all processes on a device. A kernel thread is a special case that doesn't care so much about the VM, because it is privileged and can access all kernel memory. However, it does have a separate stack and process context. User registers are typically stored on the kernel stack when exceptions happen. Exceptions are at least page faults, system calls and interrupts. These items may nest; i.e., you may call write() from user space, and while the kernel is transferring a user buffer it may page fault to read some swapped-out user-space data. The page fault may in turn have to service an interrupt.
Interrupt recursion
Linux generally wants you to leave interrupts masked, as the VM, the exceptions, and process management (context and context switching) have to work together. In order to keep things simple for the VM, the kernel stack and process context are generally rooted in a single 4k (or 8k) area, which is a single VM page. This page is always mapped. Typically, all CPUs will switch from interrupt mode to system mode when servicing an interrupt and use the same kernel stack as all other exceptions. The stack is small, so allowing recursion (and large stack allocations) can blow up the stack, resulting in stack overflows at the kernel level. This is bad.
Atomicity
Many kernel structures need to stay consistent over multiple bus cycles; i.e., a linked list must update both the prev and next node links when adding an element. A typical mechanism to do this may be to mask interrupts, to ensure the code is atomic. Some CPUs may allow bus locking, but this is not universal. The context-switching code must also be atomic. A consequence of an interrupt is typically a reschedule; i.e., a kernel interrupt handler may have acked a disk controller and started a write operation, and then a kernel thread may be scheduled to write more buffered data from the original user-space write().
Interrupts occurring at any time can break some sub-system's assumptions of atomic behavior. Instead of allowing interrupts to use the sub-system, they are prohibited from using it.
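As a sketch of the masking technique just described (Linux-style code; note that masking only protects against interrupts on the local CPU, so on SMP data shared across CPUs also needs a spinlock, as in the spin_lock_irqsave() example later on this page):

```c
#include <linux/irqflags.h>
#include <linux/list.h>

static LIST_HEAD(pending);

/* Insert a node whose prev/next links must be updated atomically with
 * respect to this CPU's interrupt handlers.  Masking interrupts around
 * the pointer updates keeps an ISR from seeing a half-linked list. */
void add_pending(struct list_head *node)
{
    unsigned long flags;

    local_irq_save(flags);      /* mask interrupts on this CPU */
    list_add_tail(node, &pending);
    local_irq_restore(flags);   /* restore the previous interrupt state */
}
```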
Summary
Linux must handle three things: the current process execution context, the current virtual memory layout, and hardware requests. They all need to work together. As interrupts may happen at any time, they occur in any process context. Using sleep(), etc. in an interrupt would put random processes to sleep. Allowing large stack allocations in an interrupt could blow up the limited stack. These design choices limit what can happen in a Linux interrupt handler. Various configuration options can allow re-entrant interrupts, but this is often CPU specific.
A benefit of keeping the top half (now the main interrupt handler) small is that interrupt latency is reduced. Busy work should be done in a kernel thread. An interrupt service routine that needs to unmask interrupts is already somewhat anti-social to the Linux eco-system; that work should be put in a kernel thread.
The Linux interrupt context really doesn't exist in some sense. It is only a CPU interrupt which may happen in any process context. The Linux interrupt context is actually a set of coding limitations that happen as a consequence of this.
I have been trying to read more about the different gates in the x86 architecture. If I understand correctly, interrupt and trap gates are used for hw and sw interrupt handling respectively.
The CALL gate, on the other hand, is probably no longer used, having been replaced by SYSENTER and SYSEXIT.
I was wondering how task gates are used (I know they are used for hw task switching). What does that exactly mean? Does a hw task refer to an OS task/process, or is it more like switching between two different instances of an operating system (maybe on servers)?
On a side note, can it happen that some interrupts are handled in user mode? (Can we handle the divide-by-zero exception in user mode? If so, does that mean the IDT handler entry for divide-by-zero contains an address from user space?)
Thanks
Everything you might want to know about interrupts and gates is in the Intel developer manual, volume 3. In short:
Task gates were originally designed as a CPU-mediated method of performing task switching; the CPU can automatically record the state of the process during the task switching operation. These are not typically used in modern operating systems; the OS usually does the state-saving operations on its own.
At least in Linux, all interrupt handlers are in kernel space and execute at ring 0. If you want to handle a divide-by-zero exception, you register a userspace signal handler for SIGFPE; the kernel-space interrupt handler raises the SIGFPE signal, indirectly triggering the userspace handler code (the userspace code is executed after returning from the interrupt handler).
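A minimal userspace sketch of that mechanism (ordinary POSIX signal handling, nothing kernel-specific):

```c
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Runs in ring 3 after the kernel's ring-0 divide-error handler has
 * raised SIGFPE and returned to user mode. */
static void on_fpe(int sig)
{
    (void)sig;
    write(2, "caught SIGFPE\n", 14);   /* async-signal-safe */
    _exit(1);   /* returning would just re-execute the faulting divide */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fpe;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGFPE, &sa, NULL);

    volatile int zero = 0;
    return 1 / zero;                   /* integer divide by zero -> #DE -> SIGFPE */
}
```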
The state of affairs is that only interrupt and trap gates were ever in actual use, and they remain in use now. In theory, both of them can be used for s/w as well as h/w event handling. The only difference between them is that a call through an interrupt gate automatically disables further interrupts, which can be useful in some cases of hardware interrupt handling.
By default, people try to use trap gates, because unnecessary interrupt disabling is a bad thing: it increases interrupt handling latencies and increases the probability of losing an interrupt.
Call gates have never been in actual use. They are an inconvenient and suboptimal way to implement system calls. Instead of call gates, most operating systems use a trap gate (int 0x80 in Linux and int 0x2E in Windows) or the sysenter/sysexit and syscall/sysret instructions.
Task gates have never been in actual use either. They are a suboptimal, inconvenient and limited feature, if not outright ugly. Instead, operating systems usually implement task switching on their own by switching kernel-mode task stacks.
Initially, Intel provided hardware support for multitasking by introducing the TSS (Task State Segment) and the task gate. With these features, the processor is able to automatically store the state of one task and restore the state of another in reply to a request coming from hw or sw. A sw request can be made by issuing a call or jmp instruction with a TSS selector or task gate selector as the instruction operand. A hw request can be made by the hardware trapping into the task gate in the appropriate IDT entry. But as I've already mentioned, nobody really uses this. Instead, operating systems use only one TSS for all tasks (a TSS must be used in any case, because during a control transfer from a less privileged segment to a more privileged segment the CPU switches stacks and takes the address of the stack for the more privileged segment from the TSS) and perform task switches manually.
In theory, interrupts and exceptions can be handled in user mode (ring 3), but in practice it is not useful, and operating systems handle all such events on the kernel side (in ring 0). The reason is simple: interrupt and exception handlers must always reside in memory and be accessible from any address space. The kernel part of the address space is shared and identical across the address spaces of all tasks in the system, but the user part of the address space is tied to a particular task. If you wanted to handle exceptions in user mode, you would be forced to reprogram the IDT on each task switch, which would introduce a significant performance penalty. If you wanted to handle interrupts the same way, you would be forced to share interrupt handlers between all tasks at the same addresses; as an unwanted consequence, any task in the system would be able to corrupt the handler.
I am reading Operating System Concepts (Silberschatz,Galvin,Gagne), 6th edition, chapter 20.
I understand that Linux kernel code is non-preemptible (before version 2.6). But it can be interrupted by hardware interrupts. What happens if the kernel was in the middle of a critical section, an interrupt occurred, and the interrupt handler also executed the critical section?
From what I read in the book:
The second protection scheme that Linux uses applies to critical sections that occur in the interrupt service routines. The basic tool is the processor interrupt control hardware...
OK, this scheme is used when an ISR has a critical section. But it will only disable further interrupts. What about the kernel code which was interrupted by this interrupt in the first place?
But it will only disable further interrupts. What about the kernel code which was interrupted by this interrupt in the first place?
If the interrupt handler and other kernel code need access to the same data, you need to protect against that, which is usually done with a spinlock. Great care must be taken: you don't want to introduce a deadlock, and you must ensure such a spinlock is not held for too long. For spinlocks used in a hardware interrupt handler, you have to disable interrupts on that processor whilst holding the lock, which in Linux is done with the function spin_lock_irqsave().
(whilst a bit outdated, you can read about the concept here)
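A sketch of that pattern (illustrative driver-style code; the shared event counter is made up):

```c
#include <linux/interrupt.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(dev_lock);
static unsigned int dev_events;

/* Process-context kernel code touching the shared counter.  Taking the
 * lock AND disabling local interrupts prevents the ISR below from
 * firing here and spinning forever on dev_lock. */
unsigned int read_and_clear_events(void)
{
    unsigned long flags;
    unsigned int n;

    spin_lock_irqsave(&dev_lock, flags);
    n = dev_events;
    dev_events = 0;
    spin_unlock_irqrestore(&dev_lock, flags);
    return n;
}

/* Hardware interrupt handler touching the same data. */
static irqreturn_t dev_isr(int irq, void *dev_id)
{
    spin_lock(&dev_lock);   /* local interrupts are already off in the handler */
    dev_events++;
    spin_unlock(&dev_lock);
    return IRQ_HANDLED;
}
```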
The kernel code which was interrupted by this interrupt in the first place gets interrupted.
This is why writing interrupt handlers is such a painful task: they can't do anything that would endanger the correctness of the main stream of execution.
For example, the way Apple's xnu kernel handles most kinds of device interrupts is to capture the information from the interrupt into a record in memory, add that record to a queue, and then resume normal execution; the kernel then picks up interrupts from the queue some time later (in the scheduler's main loop, I assume). That way, the interrupt handler only interacts with the rest of the system through the interrupt queue, and there is little danger of it causing trouble.
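A generic sketch of that pattern (not actual xnu code): a preallocated single-producer/single-consumer ring buffer, where the producer is the interrupt handler and the consumer runs later in normal execution:

```c
#include <stdatomic.h>
#include <stdint.h>

#define QUEUE_SIZE 256       /* power of two, preallocated: no allocation in the ISR */

struct irq_record {
    uint32_t source;         /* which device / vector fired */
    uint64_t data;           /* whatever the handler captured */
};

static struct irq_record queue[QUEUE_SIZE];
static atomic_uint head, tail;   /* single producer (ISR), single consumer */

/* Interrupt handler: capture and enqueue, nothing else. */
void isr_enqueue(uint32_t source, uint64_t data)
{
    unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h - t == QUEUE_SIZE)
        return;                                 /* full: drop (or count) the event */
    queue[h % QUEUE_SIZE] = (struct irq_record){ source, data };
    atomic_store_explicit(&head, h + 1, memory_order_release);
}

/* Called later from normal execution (e.g. a scheduler or worker loop). */
int drain_one(struct irq_record *out)
{
    unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&head, memory_order_acquire);
    if (t == h)
        return 0;                               /* empty */
    *out = queue[t % QUEUE_SIZE];
    atomic_store_explicit(&tail, t + 1, memory_order_release);
    return 1;
}
```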
There is a bit of middle ground; on many architectures (including the x86), it is possible for privileged code to mask interrupts, so that they won't cause interruption. That can be used to protect passages of code which really shouldn't be interrupted. However, those architectures typically also have non-maskable interrupts, which ignore the masking, so interruption still has to be considered.