About preemptive and non-preemptive kernels

Here is my question about preemptive and non-preemptive kernels. Since interrupt handling is implemented in the kernel, does that imply that nested interrupts can only happen in a preemptive kernel?

No. "Preemptive" versus "non-preemptive" refers to kernel code being preempted by code not running in interrupt context. Interrupts are special: even "non-preemptive" kernels typically allow kernel code to be preempted by interrupt handlers (and often even allow nested interrupts).


What is the relation between reentrant kernel and preemptive kernel?
If a kernel is preemptive, must it be reentrant? (I guess yes)
If a kernel is reentrant, must it be preemptive? (I am not sure)
I have read https://stackoverflow.com/a/1163946, but I am not sure whether there is a relation between the two concepts.
I guess my questions are about operating system concepts in general. But if it matters, I am mostly interested in the Linux kernel; I encountered the two concepts while reading Understanding the Linux Kernel.
What is reentrant kernel:
As the name suggests, a reentrant kernel is one which allows multiple processes to be executing in kernel mode at any given point of time, without causing any consistency problems among the kernel data structures.
What is kernel preemption:
Kernel preemption is a method used mainly in monolithic and hybrid kernels where all or most device drivers are run in kernel space, whereby the scheduler is permitted to forcibly perform a context switch (i.e. preemptively schedule, on behalf of a runnable and higher priority process) on a driver or other part of the kernel during its execution, rather than co-operatively waiting for the driver or kernel function (such as a system call) to complete its execution and return control of the processor to the scheduler.
Can I imagine a preemptive kernel which is not reentrant? Hardly, but I can. Consider an example: some thread performs a system call. On entering the kernel it takes a big kernel lock and disables all interrupts except the scheduler timer IRQ. After that, the thread is preempted in the kernel by the scheduler. Now we may switch to another userspace thread. That thread does some work in userspace, then enters the kernel, tries to take the big kernel lock, sleeps, and so on. In practice such a design seems unimplementable, because of the huge latency caused by disabling interrupts over long intervals.
Can I imagine a reentrant kernel which is not preemptive? Why not? Just use cooperative scheduling in the kernel. Thread 1 enters the kernel and calls thread_yield() after some time. Thread 2 enters the kernel, does its own work, and may or may not call thread_yield() itself. There is nothing special here.
As for the Linux kernel, it is fully reentrant; kernel preemption can be configured via CONFIG_PREEMPT. Voluntary preemption and several other options are also available.
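The preemption model is a build-time choice. A sketch of the relevant kernel configuration options (option names as they appear in mainline kernels; availability varies by version):

```
# Preemption model (choose one)
CONFIG_PREEMPT_NONE=y        # no forced preemption; server workloads
# CONFIG_PREEMPT_VOLUNTARY=y # explicit preemption points; desktop default
# CONFIG_PREEMPT=y           # (almost) fully preemptible kernel; low latency
```

With CONFIG_PREEMPT_NONE, kernel code runs until it returns to userspace or blocks; with CONFIG_PREEMPT, the scheduler may preempt kernel code anywhere preemption is not explicitly disabled.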

How the kernel's different subsystems share CPU time

Processes in userspace are scheduled by the kernel scheduler to get processor time, but how do the different kernel tasks get CPU time? I mean, when no userspace process is requiring CPU time (so the CPU is idle, executing NOP instructions) but some kernel subsystem needs to carry out a task regularly, are timers and other hardware and software interrupts the common methods for getting CPU time in kernel space?
It's pretty much the same scheduler. The only difference I can think of is that kernel code has much more control over the execution flow. For example, it can call the scheduler directly via schedule().
Also, in the kernel you have three execution contexts: hardware interrupt, softirq/bottom half, and process. In hard (and probably soft) interrupt context you can't sleep, so scheduling is not done while executing code in those contexts.

Why do we need Interrupt context?

I have a doubt: why exactly do we need an interrupt context? Everything I read tells me what its properties are, but nothing explains why we came up with this concept.
Another doubt related to the same concept: if we are not disabling interrupts in the interrupt handler, then what is the use of running the handler code in interrupt context?
The interrupt context is fundamentally different from the process context:
It is not associated with a process; the kernel, not any specific process, serves interrupts. Even if a process is interrupted, it has no influence over any parameters of the interrupt itself or the routine that will serve it. It follows that, at the very least, interrupt context must be conceptually different from process context.
Additionally, if an interrupt were to be serviced in a process context, and (re-) scheduled some work at a later time, what context would that run in? The original process may not even exist at that later time. Ergo, we need some context which is independent from processes for a practical reason.
Interrupt handling must be fast; your interrupt handler has interrupted (d'oh) some other code. Significant work should be pushed outside the interrupt handler, onto the "bottom half". It is unacceptable to block a process for work which is not even remotely its concern, either in user or in kernel space.
Disabling interrupts while your handler runs is something you could (before 2.6.36) request when registering your ISR. Recall that a handler can serve interrupts on multiple CPUs simultaneously, and can thus race with itself. Non-Maskable Interrupts (NMIs) cannot be disabled.
Why do we need Interrupt context?
First, what do we mean by interrupt context? A context is usually a state. There are two separate concepts of state.
CPU context
Every CPU architecture has a mechanism for handling interrupts. There may be a single interrupt vector called for every system interrupt, or the CPU/hardware may be capable of dispatching to a particular address based on the interrupt source. There are also mechanisms for masking/unmasking interrupts. Each interrupt may be masked individually, or there may be a global mask for the entire CPU(s). Finally, there is actual CPU state. Some architectures have separate stacks, register sets, and CPU modes, implying certain memory and other privileges. Your question is about Linux in general, which must handle all of these cases.
Linux context
Generally all of the architectures have a separate kernel stack, a process context (as seen in ps), and a VM (virtual memory) context for each process. The VM has different privileges for user and kernel modes. For the kernel to run at all times, it must remain mapped in every process on a device. A kernel thread is a special case that doesn't care so much about the VM, because it is privileged and can access all kernel memory. It does, however, have a separate stack and process context. User registers are typically stored on the kernel stack when exceptions happen. Exceptions include at least page faults, system calls, and interrupts. These items may nest: e.g., you may call write() from user space, and while the kernel is transferring a user buffer, it may page fault to read some swapped-out user space data. The page fault may in turn have to service an interrupt.
Interrupt recursion
Linux generally wants you to leave interrupts masked, because the VM, the exceptions, and process management (context and context switching) have to work together. To keep things simple for the VM, the kernel stack and process context are generally rooted in a single 4k (or 8k) area which is a single VM page. This page is always mapped. Typically, all CPUs switch from interrupt mode to system mode when servicing an interrupt, and use the same kernel stack as all other exceptions. The stack is small, so allowing recursion (or large stack allocations) can blow the stack, resulting in stack overflows at the kernel level. This is bad.
Atomicity
Many kernel structures need to stay consistent over multiple bus cycles; e.g., a linked list must update both the prev and next node links when adding an element. A typical mechanism for this is to mask interrupts, ensuring the code is atomic. Some CPUs allow bus locking, but this is not universal. The context-switching code must also be atomic. A common consequence of an interrupt is rescheduling: e.g., a kernel interrupt handler may have acked a disk controller and started a write operation; then a kernel thread may be scheduled to write more buffered data from the original user space write().
Interrupts occurring at any time can break some subsystem's assumptions of atomic behavior. Instead of allowing interrupts to use such a subsystem, they are prohibited from using it.
Summary
Linux must handle three things: the current process execution context, the current virtual memory layout, and hardware requests. They all need to work together. As interrupts may happen at any time, they occur in an arbitrary process context. Using sleep(), etc. in an interrupt would put random processes to sleep. Allowing large stack allocations in an interrupt could blow up the limited stack. These design choices limit what can happen in a Linux interrupt handler. Various configuration options can allow re-entrant interrupts, but this is often CPU specific.
A benefit of keeping the top half (now the main interrupt handler) small is that interrupt latency is reduced. Busy work should be done in a kernel thread. An interrupt service routine that would need to unmask interrupts is already somewhat anti-social to the Linux ecosystem; that work should be put in a kernel thread.
The Linux interrupt context really doesn't exist in some sense. It is only a CPU interrupt which may happen in any process context. The Linux interrupt context is actually a set of coding limitations that happen as a consequence of this.

task gate, interrupt gate, call gate

I have been trying to read more about the different gates in the x86 architecture. If I understand correctly, interrupt and trap gates are used for hardware and software interrupt handling respectively.
The call gate, on the other hand, is probably no longer used, having been replaced by SYSENTER and SYSEXIT.
I was wondering how task gates are used (I know they are for hardware task switching). What does that mean exactly? Does a hardware task refer to an OS task/process? Or is it more like switching between two different instances of an operating system (maybe on servers)?
On a side note, can some interrupts be handled in user mode? (Can we handle the divide-by-zero exception in user mode? If so, does that mean the IDT entry for divide-by-zero contains an address from user space?)
Thanks
Everything you might want to know about interrupts and gates is in the Intel developer manual, volume 3. In short:
Task gates were originally designed as a CPU-mediated method of performing task switching; the CPU can automatically record the state of the process during the task switching operation. These are not typically used in modern operating systems; the OS usually does the state-saving operations on its own.
At least in Linux, all interrupt handlers are in kernel space and execute at ring 0. If you want to handle a divide-by-zero exception, you register a userspace signal handler for SIGFPE; the kernel-space interrupt handler raises the SIGFPE signal, indirectly triggering the userspace handler code (the userspace code is executed after returning from the interrupt handler).
The state of affairs is that only interrupt and trap gates were ever in actual use, and they remain in use now. In theory, both of them can be used for software and for hardware event handling. The only difference between them is that an interrupt-gate call automatically disables further interrupts, which can be useful in some cases of hardware interrupt handling.
By default, people try to use trap gates, because unnecessary interrupt disabling is a bad thing: it increases interrupt handling latency and increases the probability of losing an interrupt.
Call gates have never been in actual use. They are an inconvenient and suboptimal way to implement system calls. Instead of call gates, most operating systems use a trap gate (int 0x80 on Linux, int 0x2E on Windows) or the sysenter/sysexit and syscall/sysret instructions.
Task gates have never been in actual use either. They are a suboptimal, inconvenient, and limited feature, if not outright ugly. Instead, operating systems usually implement task switching themselves, by switching kernel-mode task stacks.
Initially, Intel delivered hardware support for multitasking by introducing the TSS (Task State Segment) and the task gate. With these features, the processor is able to automatically store the state of one task and restore the state of another in reply to a request coming from hardware or software. A software request can be made by issuing a call or jmp instruction with a TSS selector or task-gate selector as the operand. A hardware request occurs when the hardware traps through a task gate in the appropriate IDT entry. But, as already mentioned, no one really uses this. Instead, operating systems use only one TSS for all tasks (a TSS must be used in any case, because during a control transfer from a less privileged segment to a more privileged one, the CPU switches stacks and fetches the address of the more privileged stack from the TSS) and perform task switches manually.
In theory, interrupts and exceptions can be handled in user mode (ring 3), but in practice this is not useful, and operating systems handle all such events on the kernel side (in ring 0). The reason is simple: interrupt and exception handlers must always reside in memory and be accessible from any address space. The kernel part of the address space is shared and identical across all tasks in the system, but the user part is tied to a particular task. If you wanted to handle exceptions in user mode, you would be forced to reprogram the IDT on each task switch, introducing a significant performance penalty. If you wanted to handle interrupts the same way, you would be forced to share the interrupt handlers between all tasks at the same addresses; as an unwanted consequence, any task in the system could corrupt a handler.

What happens when kernel code is interrupted?

I am reading Operating System Concepts (Silberschatz,Galvin,Gagne), 6th edition, chapter 20.
I understand that Linux kernel code was non-preemptible (before version 2.6). But it can be interrupted by hardware interrupts. What happens if the kernel was in the middle of a critical section when the interrupt occurred, and the interrupt handler also executes that critical section?
From what I read in the book:
The second protection scheme that Linux uses applies to critical sections that occur in interrupt service routines. The basic tool is the processor interrupt control hardware...
Ok, this scheme is used when an ISR has a critical section. But it will only disable further interrupts. What about the kernel code which was interrupted by this interrupt in the first place?
But it will only disable further interrupts. What about the kernel code which was interrupted by this interrupt in the first place?
If the interrupt handler and other kernel code need access to the same data, you need to protect against that, which is usually done with a spinlock. Great care must be taken: you don't want to introduce a deadlock, and you must ensure the spinlock is not held for too long. For spinlocks used in a hardware interrupt handler, you have to disable interrupts on that processor while holding the lock, which in Linux is done with the function spin_lock_irqsave().
The kernel code which was interrupted by this interrupt in the first place gets interrupted.
This is why writing interrupt handlers is such a painful task: they can't do anything that would endanger the correctness of the main stream of execution.
For example, the way Apple's xnu kernel handles most kinds of device interrupts is to capture the information from the interrupt into a record in memory, add that record to a queue, and then resume normal execution; the kernel then picks up interrupts from the queue some time later (in the scheduler's main loop, I assume). That way, the interrupt handler only interacts with the rest of the system through the interrupt queue, and there is little danger of it causing trouble.
There is a bit of middle ground; on many architectures (including the x86), it is possible for privileged code to mask interrupts, so that they won't cause interruption. That can be used to protect passages of code which really shouldn't be interrupted. However, those architectures typically also have non-maskable interrupts, which ignore the masking, so interruption still has to be considered.