Nested interrupts in unicore processor? - linux-kernel

Assume a low-priority interrupt has occurred on a unicore processor,
which leads to execution of its ISR with the current IRQ disabled.
Meanwhile, a high-priority interrupt occurs.
Will the current ISR get pre-empted and control be given to the high-priority ISR?
If yes, then after serving the high-priority ISR, will control be given back to the low-priority ISR?
If interrupts are served with the scheduler disabled, who takes care of switching from the low-priority ISR to the high-priority ISR and vice versa?

Related

Can I increase a thread irq priority

I wrote a device driver to receive data from hardware. The logic is: when the device's data is ready, it sends an IRQ to the Linux host. The communication interface is SPI. However, the machine's SPI controller driver cannot be used to read the data directly in the hard IRQ handler, because it may sleep.
So I use "request_threaded_irq" to create a threaded IRQ and put the data-read function into the bottom (threaded) handler.
Unfortunately, I found that the bottom half of the IRQ has a large and unstable delay, varying from tens of microseconds to hundreds of microseconds.
My question is: is there a way to keep the delay below a given value, such as by increasing the priority of the IRQ thread?
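
For reference, here is a minimal sketch of the setup described above, with hypothetical device and function names: a hard-IRQ half that only wakes the IRQ thread, and a threaded half where the sleeping SPI transfer is allowed. The thread created by request_threaded_irq is a normal kernel thread (named irq/<nr>-<devname>) running with a SCHED_FIFO policy, so its priority can be raised from userspace with chrt if a tighter latency bound is needed.

    #include <linux/interrupt.h>
    #include <linux/kernel.h>
    #include <linux/spi/spi.h>

    /* Hypothetical driver state; the names are placeholders. */
    struct mydev {
        struct spi_device *spi;
        int irq;
    };

    /* Hard-IRQ half: runs in interrupt context, must not sleep. */
    static irqreturn_t mydev_hardirq(int irq, void *data)
    {
        /* Quiesce/acknowledge the device here if needed, then defer. */
        return IRQ_WAKE_THREAD;
    }

    /* Threaded half: runs in process context, so sleeping SPI calls are fine. */
    static irqreturn_t mydev_thread(int irq, void *data)
    {
        struct mydev *dev = data;
        u8 buf[64];

        if (spi_read(dev->spi, buf, sizeof(buf)))   /* may sleep */
            pr_err("mydev: SPI read failed\n");
        return IRQ_HANDLED;
    }

    static int mydev_setup_irq(struct mydev *dev)
    {
        /* IRQF_ONESHOT keeps the line masked until the thread finishes. */
        return request_threaded_irq(dev->irq, mydev_hardirq, mydev_thread,
                                    IRQF_ONESHOT | IRQF_TRIGGER_RISING,
                                    "mydev", dev);
    }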

Interrupt Nested, Sequencing

I am reading the Linux kernel documents and I have these questions (x86_64 arch):
When the PIC sends an interrupt to the CPU, does that disable that specific interrupt until the acknowledgement comes from the CPU? If that is the case, why do we need local_irq_disable() in the ISR?
Related to the above question: if the CPU is processing an interrupt in its ISR and the same device sends 3 more interrupts to the CPU, how is this going to be handled? Will they be serialized in some buffer (if yes, where)?
Does the x86 architecture support priority-based interrupts?
The PIC is a very old interrupt controller; today interrupts are mostly delivered through MSIs or through the APIC hierarchy.
The matter is actually more complicated with the IRQ routing, virtualization and so on.
I won't discuss these.
The interrupt priority concept still exists (though a bit simplified) and it works like this:
When an interrupt request is received by the interrupt controller, all the lower priority interrupts are masked and the interrupt is sent to the CPU.
What actually happens is that interrupts are ordered by their request number, with lower numbers having higher priority (0 has higher priority than 1).
When any request line is toggled or asserted, the interrupt controller scans the status of each request line from number 0 up to the last one.
It stops as soon as it finds a line that is asserted or marked (with the use of a secondary register) as being in service.
This way, if request line 2 is asserted first and request line 4 afterwards, the interrupt controller won't serve the latter request until the first one is "done", because line 2 stops the scan.
So local_irq_disable may be used to disable all interrupts, including those with higher priority.
AFAIK, this function should be rarely used today. It is a very simple, but inefficient, way to make sure no other code can run (and potentially alter shared structures).
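
For contrast, a minimal sketch (placeholder names) of the usual alternative: protect only the short critical section with spin_lock_irqsave(), which disables interrupts on the local CPU just long enough to touch the shared data, rather than disabling everything with local_irq_disable().

    #include <linux/spinlock.h>

    /* Hypothetical state shared between an ISR and regular kernel code. */
    static DEFINE_SPINLOCK(stats_lock);
    static unsigned long packets_seen;

    void account_packet(void)
    {
        unsigned long flags;

        /* Interrupts are off on this CPU only for these few instructions. */
        spin_lock_irqsave(&stats_lock, flags);
        packets_seen++;
        spin_unlock_irqrestore(&stats_lock, flags);
    }
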
In general, there needs to be some coordination between the ISR and the device to avoid losing interrupts.
Some devices require the software to write to a special register to let them know it is able to process the next interrupt. This way the device may implement an internal queue of notifications.
The keyboard controller works kind of like this: if you don't read the scancodes fast enough, you simply lose them.
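
As an illustration of that read-to-acknowledge coordination, a simplified sketch of a legacy i8042-style keyboard ISR (the port numbers are the standard legacy ones; the handler is reduced to the bare pattern): reading the data port both returns the scancode and frees the controller's one-byte buffer for the next key.

    #include <linux/interrupt.h>
    #include <linux/io.h>
    #include <linux/kernel.h>

    #define KBD_STATUS_PORT 0x64
    #define KBD_DATA_PORT   0x60
    #define KBD_OBF         0x01    /* output buffer full */

    static irqreturn_t kbd_isr(int irq, void *dev_id)
    {
        unsigned char scancode;

        if (!(inb(KBD_STATUS_PORT) & KBD_OBF))
            return IRQ_NONE;                /* nothing pending, not ours */

        scancode = inb(KBD_DATA_PORT);      /* this read is the acknowledgement */
        pr_debug("scancode %#x\n", scancode);
        return IRQ_HANDLED;
    }
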
If the device fires interrupts at will and too frequently, the interrupt controller can buffer the requests so they don't get lost.
Both the PIC and the LAPIC can buffer at most one request while another one is in progress (they basically use the fact that they have a request register and an in-progress register for each interrupt).
So in the case of three interrupts in a row, one is surely lost. If the interrupt controller couldn't deliver the first one to the CPU because a higher priority interrupt was in progress, then two will be lost.
In general, the software doesn't expect the interrupt controller to buffer any request.
So you shouldn't find code that relies on this (after all, the only numbers in CS are 0, 1, and infinity, so 2 doesn't exist as far as the software is concerned).
The x86, as a CPU core, doesn't support priorities when dealing with interrupts. If interrupts are not masked and a hardware interrupt arrives, it is served. It's up to the software and the interrupt controller to prioritize interrupts.
The PIC and LAPIC (and so the MSIs and the IOAPIC) both give interrupts a priority, so for all practical purposes the x86 supports a priority-based interrupt mechanism.
Note, however, that giving interrupts a priority is not necessarily good: it's hard to tell whether a network packet is more important than a keystroke.
So Linux has the guideline to do as little work as possible in the ISR and instead to queue the rest of the work to be processed asynchronously out of the ISR.
This may mean simply returning from the ISR after queuing a work function, in order not to block other interrupts.
In the vast majority of cases, only a small portion of code needs to run in a critical section, i.e. a window where no other interrupt should occur, so the general approach is to send the EOI to the interrupt controller and unmask interrupts in the CPU as early as possible, and to write the code so that it can be interrupted.
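
A minimal sketch of that guideline (the device, names and register offset are hypothetical): the top half only acknowledges the device and queues the real work, which then runs later in process context with interrupts enabled.

    #include <linux/interrupt.h>
    #include <linux/io.h>
    #include <linux/kernel.h>
    #include <linux/workqueue.h>

    struct mydev {
        void __iomem *regs;
        struct work_struct rx_work;
    };

    /* Bottom half: the bulk of the processing, interruptible, may sleep. */
    static void mydev_rx_work(struct work_struct *work)
    {
        struct mydev *dev = container_of(work, struct mydev, rx_work);

        /* ... process the received data here ... */
        (void)dev;
    }

    /* Top half: acknowledge, defer, return as quickly as possible. */
    static irqreturn_t mydev_isr(int irq, void *data)
    {
        struct mydev *dev = data;

        writel(1, dev->regs + 0x10);        /* hypothetical "IRQ ack" register */
        schedule_work(&dev->rx_work);
        return IRQ_HANDLED;
    }
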
In case one needs to hold off other interrupts for performance reasons, the approach usually taken is to spread the interrupts across different cores so the load stays within the required metrics.
Before multi-core systems were widespread, having too many interrupts would effectively slow down some operations.
I guess it would be possible to load a driver that denies other interrupts for the sake of its own performance, but that is a form of QoS/real-time requirement that is up to the user to settle.

Interrupt scheduling and handling in linux

Let's say we are getting 100 interrupts from a net device, 50 interrupts from USB, 25 interrupts from an SPI device, and 25 interrupts from I2C.
They arrive in the following sequence, which then repeats:
5 Net - 4 USB - 2 SPI - 2 I2C.
The top-level handler can dispatch a device-specific handler to service the interrupt.
Now the processor will interrupt the running task as soon as it gets the net device's interrupt. On completing the top half of the net device's interrupt handler, it has to execute the top halves of USB, SPI and I2C.
And the same sequence is followed after completing the first set. When will the interrupted task wake again? Does the interrupted task wait until all 100 interrupts are serviced by their respective device-specific handlers? How are interrupts shared across cores in a multi-core system, given that hundreds of thousands of interrupts will have to be serviced?
As far as I know, when executing an interrupt handler the processor is in interrupt context, so there won't be any context switching. Since different ISRs will have to service hundreds of thousands of interrupts, will the processor always be in interrupt context?
When will the interrupted task wake again?
When interrupts are cleared and the scheduler decides to give this task processor time.
Does the interrupted task wait until all 100 interrupts are serviced by their respective device-specific handlers?
You described only 4 IRQ sources (some net device, USB, SPI, I2C). So if all IRQ lines are high and enabled, then the CPUs which handle these IRQs will switch to the specific interrupt handlers. If an interrupt is still asserted after its handler runs, the CPU servicing it will branch to the interrupt handler again and again until the interrupt is cleared. On a multi-CPU system with 5 CPUs, 4 may execute the interrupt handlers for your devices simultaneously while the remaining one executes your task, so your task may not be interrupted at all. Or it may wait forever for the CPU on a single-CPU system, if the interrupt handler is badly written and never clears the IRQ line.
How are interrupts shared across cores in a multi-core system, given that hundreds of thousands of interrupts will have to be serviced?
I think it is best explained here: multi-core CPU interrupts.
Since different ISRs will have to service hundreds of thousands of interrupts, will the processor always be in interrupt context?
It will stay in interrupt context for as long as the IRQ is enabled and keeps being triggered. You can just disable the IRQ line and return the CPU to the scheduler if you need to.
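
For example, a sketch (hypothetical names) of that mask-and-defer pattern: the ISR disables its own line and schedules deferred work, which drains the device and only then re-enables the line, so a noisy device cannot pin the CPU in interrupt context.

    #include <linux/interrupt.h>
    #include <linux/kernel.h>
    #include <linux/workqueue.h>

    struct mydev {
        int irq;
        struct work_struct work;
    };

    /* Deferred work: empty the device's queue, then unmask the line again. */
    static void mydev_work(struct work_struct *work)
    {
        struct mydev *dev = container_of(work, struct mydev, work);

        /* ... read and clear the device's pending events here ... */
        enable_irq(dev->irq);
    }

    static irqreturn_t mydev_isr(int irq, void *data)
    {
        struct mydev *dev = data;

        /* Mask our own line; the scheduler gets the CPU back immediately. */
        disable_irq_nosync(dev->irq);
        schedule_work(&dev->work);
        return IRQ_HANDLED;
    }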

Is it a good practice to set interrupt affinity and io handling thread affinity to the same core?

I am trying to understand irq affinity and its impact to system performance.
I went through why-interrupt-affinity-with-multiple-cores-is-not-such-a-good-thing, and learned that the NIC IRQ affinity should be set to a core other than the one handling the network data, so that the data handling is not interrupted by incoming IRQs.
I doubt this: if the data is handled on a different core than the IRQ core, there will be more cache misses when retrieving the network data from the kernel. So I actually believe that setting the IRQ affinity to the same core as the thread handling the incoming data will improve performance thanks to fewer cache misses.
I am trying to come up with some verification code, but before I present any results, am I missing something?
IRQ affinity is a double-edged sword. In my experience it can improve performance, but only in a very specific configuration with a pre-defined workload. As far as your question is concerned (consider only the RX path), typically when a NIC card interrupts one of the cores, in the majority of cases the interrupt handler does not do much except trigger a mechanism (bottom half, tasklet, kernel thread or networking stack thread) to process the incoming packet in some other context. If the same packet-processing core is also handling the interrupts (and the ISR does not do much), it is bound to lose some cache benefit due to context switches and may see more cache misses. How big the impact is depends on a variety of other factors.
In NIC drivers, the affinity is typically aligned to each RX queue [separating the processing of each RX queue across different cores], which provides a bigger performance benefit.
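
A sketch of what that per-queue pinning could look like from a driver; the NIC and helper names are hypothetical, and the same effect can be achieved from userspace by writing a CPU mask to /proc/irq/<N>/smp_affinity.

    #include <linux/cpumask.h>
    #include <linux/interrupt.h>

    /* Pin each RX queue's interrupt to its own CPU so the ISR, the softirq
     * processing and the consuming thread share a cache. */
    static void mynic_spread_irqs(const int *queue_irqs, int num_queues)
    {
        int q;

        for (q = 0; q < num_queues; q++) {
            int cpu = q % num_online_cpus();

            irq_set_affinity_hint(queue_irqs[q], cpumask_of(cpu));
        }
    }
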
Controlling interrupt affinity has quite a few useful applications. A few that spring to mind:
isolating cores - preventing I/O load from spreading onto mission-critical CPU cores, or onto cores reserved for RT-priority work (scheduled softirqs could starve on those)
increasing system performance - throughput and latency - by keeping interrupts on the relevant NUMA node on multi-CPU systems
CPU efficiency - e.g. dedicating a CPU and a NIC channel with its interrupt to a single application will squeeze more out of that CPU, about 30% more in heavy-traffic use cases.
This might require flow steering to make incoming traffic target that channel.
(Note: without affinity the app might seem to deliver more throughput, but it would spill load onto many cores.) The gain comes from cache locality and from avoiding context switches.
latency - setting up the application as above will halve the latency (from 8 µs to 4 µs on a modern system)

What is the kernel timer system and how is it related to the scheduler?

I'm having a hard time understanding this.
How does the scheduler know that a certain period of time has passed?
Does it use some sort of syscall or interrupt for that?
What's the point of using the constant HZ instead of seconds?
What does the system timer have to do with the scheduler?
How does the scheduler know that a certain period of time has passed?
The scheduler consults the system clock.
Does it use some sort of syscall or interrupt for that?
Since the system clock is updated frequently, it suffices for the scheduler to just read its current value. The scheduler is already in kernel mode so there is no syscall interface involved in reading the clock.
Yes, there are timer interrupts that trigger an ISR, an interrupt service routine, which reads hardware registers and advances the current value of the system clock.
What's the point of using the constant HZ instead of seconds?
Once upon a time there was significant cost to invoking the ISR, and on each invocation it performed a certain amount of bookkeeping, such as checking whether the scheduler quantum had expired and firing TCP RTO retransmit timers. The hardware had limited flexibility and could only invoke the ISR at fixed intervals, e.g. every 10 ms if HZ is 100.
Higher HZ values made it more likely the ISR would run and find there was nothing to do, that no events had occurred since the previous run, in which case the ISR represented pure overhead, cycles stolen from a foreground user task. Lower HZ values would hurt dispatch latency, leading to sluggish network and interactive response times. The HZ tuning tradeoff tended to wind up somewhere near 100 or 1000 on practical hardware.
APIs that reported system clock time could only do so in units of ticks, where each ISR invocation advanced the clock by one tick, so callers needed to know the value of HZ in order to convert from ticks to S.I. units. Modern systems perform network tasks on a separately scheduled TCP kernel thread and may support tickless kernels, which discard many of these outdated assumptions.
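
A small sketch of how this looks from kernel code (the status register and its READY bit are hypothetical): time is read in jiffies, which advance by one per tick, and helpers such as msecs_to_jiffies() and time_after() hide the actual value of HZ from the caller.

    #include <linux/delay.h>
    #include <linux/errno.h>
    #include <linux/io.h>
    #include <linux/jiffies.h>

    static int wait_for_ready(void __iomem *status_reg)
    {
        unsigned long timeout = jiffies + msecs_to_jiffies(100);    /* 100 ms */

        while (!(readl(status_reg) & 0x1)) {        /* hypothetical READY bit */
            if (time_after(jiffies, timeout))
                return -ETIMEDOUT;
            usleep_range(100, 200);                 /* sleep, don't busy-wait */
        }
        return 0;
    }
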
What does the system timer have to do with the scheduler?
The scheduler runs when the system timer fires an interrupt.
The nature of a pre-emptive scheduler is that it can pause "spinning" usermode code, e.g. while (1) {}, and manipulate the run queue, even on a single-core system.
Additionally, the scheduler runs when a process voluntarily gives up its time slice, e.g. when issuing syscalls or taking page faults.
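
Kernel code can also yield voluntarily. A sketch with a hypothetical workload: a long-running loop calls cond_resched() so other runnable tasks get the CPU without waiting for the next timer tick to force a preemption.

    #include <linux/sched.h>

    static void crunch_table(unsigned long *table, unsigned long entries)
    {
        unsigned long i;

        for (i = 0; i < entries; i++) {
            table[i] *= 2;              /* stand-in for real per-entry work */
            if ((i & 1023) == 0)
                cond_resched();         /* voluntary scheduling point */
        }
    }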
