How the kernel's different subsystems share CPU time - linux-kernel

Processes in userspace are scheduled by the kernel scheduler to get processor time, but how do the different kernel tasks get CPU time? I mean, when no userspace process is requesting CPU time (so the CPU is idle, executing NOP instructions) but some kernel subsystem needs to carry out some task regularly, are timers and other hardware and software interrupts the common methods of getting CPU time in kernel space?

It's pretty much the same scheduler. The only difference I can think of is that kernel code has much more control over execution flow. For example, kernel code can call the scheduler directly via schedule().
Also, in the kernel you have three execution contexts: hardware interrupt, softirq/bottom half, and process. In hard and soft interrupt context you can't sleep, so no scheduling is done while code is executing in those contexts.


Flame graph (perf record) cannot display accurate CPU idle usage

When the CPU usage is 60%, I use flame graphs (perf record) to capture the CPU usage. Why is the 40% of idle-related stack usage not displayed in the flame graphs? The idle stack usually shows up as less than 5%.
For flame graphs, the point is normally to measure where a process spends CPU time while it's running, not which blocking functions it calls that make it sleep, or where it gets scheduled out and sleeps when it doesn't want to.
I capture performance for one CPU, not for one process. According to the operating system design, if there is no active task on the CPU, the CPU calls an idle waiting function. For example, Linux often calls schedule_idle until it is interrupted by a new task. Therefore, I expect to find schedule_idle in the flame graph, accounting for 40% of the CPU usage.
Perf events like cycles don't increment when the clock is halted (e.g. cycles is cpu_clk_unhalted.thread_p or similar). If you really wanted to see time spent idle, you might be able to disable idle power saving to get Linux to just spin in a loop instead of using x86 monitor/mwait or even basic hlt to put the CPU into a C-state where the clock doesn't tick.
Or run your code pinned to one logical core, and on the other logical core, pin a task that runs the pause instruction in a loop. So the physical core's clock keeps ticking for the core you're counting events for.
You should still get counts for cpu_clk_unhalted.thread_any ([Core cycles when at least one thread on the physical core is not in halt state]) when recording that event on the logical core with your task, even when that logical core is asleep.
And you can also record counts for cpu_clk_unhalted.thread to count cycles when this (hardware) thread aka logical core isn't halted, to know how much CPU time you actually used. (Or use the software event task-clock for that.)
Use perf list to see events available on your CPU, and read their descriptions carefully.
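Because cycle-based samples do not accumulate while a core is halted, the idle share is easier to read from the kernel's own accounting than from a flame graph. Below is a minimal sketch, assuming Linux and the usual /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal, ...). The function names are made up for illustration, and counting iowait as idle is a choice made here, not something the answer above prescribes.

```python
import time


def cpu_times(stat_text, cpu="cpu0"):
    """Return (idle_ticks, total_ticks) for one 'cpuN' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == cpu:
            values = [int(v) for v in fields[1:]]
            # Field order: user nice system idle iowait irq softirq steal ...
            # guest/guest_nice are already contained in user/nice, so only
            # the first eight fields are summed for the total.
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            return idle, sum(values[:8])
    raise ValueError(f"no line for {cpu!r}")


def idle_fraction(before_text, after_text, cpu="cpu0"):
    """Idle share of one CPU between two /proc/stat snapshots (0.0 to 1.0)."""
    idle0, total0 = cpu_times(before_text, cpu)
    idle1, total1 = cpu_times(after_text, cpu)
    elapsed = total1 - total0
    return (idle1 - idle0) / elapsed if elapsed else 0.0


def sample_idle(cpu="cpu0", interval=1.0):
    """Take two live snapshots `interval` seconds apart (Linux only)."""
    with open("/proc/stat") as f:
        before = f.read()
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = f.read()
    return idle_fraction(before, after, cpu)
```

On a Linux machine, `sample_idle("cpu0")` gives the idle share of that logical CPU over one second, which can then be set against what the flame graph shows for the same core.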

What is the relation between reentrant kernel and preemptive kernel?

If a kernel is preemptive, must it be reentrant? (I guess yes)
If a kernel is reentrant, must it be preemptive? (I am not sure)
I have read about both, but I am not sure whether there is a relation between the two concepts.
I guess my questions are about operating system concepts in general. But if it matters, I am interested mostly in Linux kernel, and encounter the two concepts when reading Understanding the Linux Kernel.
What is reentrant kernel:
As the name suggests, a reentrant kernel is the one which allows
multiple processes to be executing in the kernel mode at any given
point of time and that too without causing any consistency problems
among the kernel data structures.
What is kernel preemption:
Kernel preemption is a method used mainly in monolithic and hybrid
kernels where all or most device drivers are run in kernel space,
whereby the scheduler is permitted to forcibly perform a context
switch (i.e. preemptively schedule; on behalf of a runnable and higher
priority process) on a driver or other part of the kernel during its
execution, rather than co-operatively waiting for the driver or kernel
function (such as a system call) to complete its execution and return
control of the processor to the scheduler.
Can I imagine a preemptive kernel that is not reentrant? Hardly, but I can. Consider an example: some thread performs a system call. On entering the kernel it takes a big kernel lock and disables all interrupts except the scheduler timer IRQ. This thread is then preempted inside the kernel by the scheduler, and we switch to another userspace thread. That thread does some work in userspace, then enters the kernel, tries to take the big kernel lock, sleeps, and so on. In practice this design looks unworkable because of the huge latency caused by disabling interrupts for long intervals.
Can I imagine a reentrant kernel that is not preemptive? Why not? Just use cooperative scheduling in the kernel. Thread 1 enters the kernel and calls thread_yield() after some time. Thread 2 enters the kernel and does its own work; maybe it calls thread_yield() too, maybe not. There is nothing special here.
As for the Linux kernel, it is fully reentrant, and kernel preemption can be enabled with CONFIG_PREEMPT. Voluntary preemption (CONFIG_PREEMPT_VOLUNTARY) and several other options are also available.
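The "reentrant but not preemptive" case can be illustrated with a toy model. This is only a sketch: Python generators stand in for threads executing in kernel mode, and `yield` plays the role of the hypothetical thread_yield() from the answer. Both tasks are "inside the kernel" at the same time (reentrancy), yet a switch happens only when a task gives up the CPU itself (no preemption).

```python
from collections import deque


def worker(name, steps, log):
    """A toy 'thread in kernel mode': one step of work, then a voluntary yield."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # stands in for thread_yield()


def run_cooperative(tasks):
    """Round-robin over tasks; control changes hands only when a task yields."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run until the task's next voluntary yield
            ready.append(task)  # still has work: back of the queue
        except StopIteration:
            pass                # task finished, drop it


log = []
run_cooperative([worker("A", 2, log), worker("B", 3, log)])
print(log)
```

If a worker looped without ever reaching `yield`, `run_cooperative` would never get control back, which is exactly the weakness of a non-preemptive design.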

How to improve scheduling and interrupt latency

How to improve scheduler and interrupt latency:
Embedded system based on a 10-core MIPS64 processor
9 cores run SMP linux. kernel version
We have a process with real-time requirements that has to complete certain tasks within 1 ms. Under maximum load the work may take 800 µs.
This process starts processing after receiving a GPIO interrupt (a 1 ms periodic interrupt provided by an FPGA and handled by a kernel driver).
Until then, it makes an ioctl call to the GPIO driver and is put to sleep in the driver with wait_event_interruptible() (a kernel function, not a system call).
The GPIO ISR then calls wake_up() to wake this process.
To prevent other processes from taking CPU time away from this process, we run it on an "isolcpus" core.
We have set the priority of this process to be the highest among user threads, as below:
Priority: 80, Scheduling type:SCHED_FIFO
threadSetRtPriority(SCHED_FIFO, 80);
All /proc/sys/kernel/sched_ parameter values are default. We haven't fine tuned them
Sometimes we see that the ISR has called wake_up(), but the process is scheduled only 350 µs later.
This is a long time, given that our processor runs at 1.25 GHz.
This large scheduling latency puzzles us, as we have already dedicated the core to this process by using "isolcpus".
We also profile the maximum CPU cycle count between consecutive 1 ms GPIO ISR calls. This maximum corresponds to more than 1.5 ms.
This large interrupt latency is also a concern for us, as it eats into the time available for the process to finish its work within the 1 ms boundary.
Please help us with suggestions for reducing the interrupt and scheduling latency.
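For reference, the setup described in the question (SCHED_FIFO at priority 80, pinned to an isolated core) can be expressed with standard Linux scheduling calls. The sketch below uses Python's `os` wrappers for them; it is not the asker's threadSetRtPriority() helper, whose implementation is not shown, and the function name `set_rt_priority` is invented for illustration. It needs root or CAP_SYS_NICE to succeed.

```python
import os


def set_rt_priority(priority=80, cpu=None):
    """Try to switch the calling process to SCHED_FIFO; return True on success."""
    if not hasattr(os, "sched_setscheduler"):
        return False  # platform without the Linux scheduling API
    lo = os.sched_get_priority_min(os.SCHED_FIFO)
    hi = os.sched_get_priority_max(os.SCHED_FIFO)
    if not lo <= priority <= hi:
        return False  # outside the valid real-time priority range
    try:
        if cpu is not None:
            # Pin to one core, e.g. the core listed in isolcpus=
            os.sched_setaffinity(0, {cpu})
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except OSError:
        # Typically EPERM without root/CAP_SYS_NICE, or EINVAL for a bad CPU
        return False
```

A call such as `set_rt_priority(80, cpu=9)` would mirror the question's configuration, assuming core 9 is the isolated one (the question does not say which core is isolated).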
The standard Linux kernel does not provide hard real-time guarantees. A level of real-time determinism can be achieved with the RT_Preempt (PREEMPT_RT) patch. It still requires careful design, and is no substitute for an RTOS for critical real-time requirements.
I have been working with a Linux 4.8 preempt-rt kernel, which has the RT_Preempt patch applied from this repo: linux kernel 4.8 preempt-rt, and have some promising results!
I benchmarked both the preempt-rt and the non-preempt-rt Linux kernels by running cyclictest and found that the maximum latency with the preempt-rt kernel came down to 61 µs, against 2025 µs with the non-preempt kernel, which might help in your case as well.
The results have clearly tempted me to use the preempt-rt kernel, as there is an overwhelming difference in maximum latency between the two. I have documented the results here: sachin-mokashi-linux-preempt-rt, in case it is of help to you!
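The kind of measurement cyclictest performs can be sketched in a few lines: sleep until a periodic deadline and record how late the wake-up actually was. The following is only an illustration of the idea, not cyclictest itself. The function name and parameters are assumptions, and the Python interpreter adds overhead of its own, so the numbers are not comparable with the figures quoted above.

```python
import time


def measure_wakeup_latency(interval_us=1000, loops=200):
    """Sleep until periodic deadlines; return (min, avg, max) lateness in µs."""
    interval_ns = interval_us * 1000
    next_deadline = time.monotonic_ns() + interval_ns
    samples = []
    for _ in range(loops):
        delay_ns = next_deadline - time.monotonic_ns()
        if delay_ns > 0:
            time.sleep(delay_ns / 1e9)
        now = time.monotonic_ns()
        # Lateness = how long after the deadline we were actually running again
        samples.append(max(0, now - next_deadline) / 1000)
        next_deadline += interval_ns
    return min(samples), sum(samples) / len(samples), max(samples)


if __name__ == "__main__":
    lo, avg, hi = measure_wakeup_latency()
    print(f"min {lo:.1f} us, avg {avg:.1f} us, max {hi:.1f} us")
```

Running it once under the default policy and once under SCHED_FIFO on an isolated core gives a rough feel for how much the scheduling setup changes the maximum.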

Is there some sort of hardware support required for the implementation of the scheduler?

The state of the process at any given time consists of the processes in execution, right? So say there are 4 userspace programs running on the processors at the moment. After each time slice, I assume control has to pass to the scheduler so that the appropriate process can be scheduled next. What initiates this transfer of control? To me it seems there has to be some kind of special timer/register in hardware that keeps count of the time taken by the current process, since the process itself has no mechanism to keep track of how long it has executed. Is my intuition right?
First of all, this answer concerns the x86 architecture only.
There are different kinds of schedulers: preemptive and non-preemptive (cooperative).
Preemptive schedulers preempt the execution of a process, that is, they initiate a context switch and then jump to another process. On x86 this can be done in hardware using a TSS (Task State Segment), although most modern OSes, Linux included, switch context in software. The process is stopped and another one is started.
Cooperative schedulers do not stop processes. They rely on the process giving up the CPU in favor of the scheduler, also called "yielding", similar to user-level threads without kernel support.
A process can lose the CPU in two ways: as the result of some I/O-bound action, or while it is actively using the CPU.
Imagine you sent some instructions to the FPU. It takes some time until it's finished. Instead of sitting around idly, you could do something else while the FPU performs its calculations! So, as the result of an I/O operation, the scheduler switches to another process, possibly resuming with the preempted process right after the FPU is done.
However, regular preemption, as required by many scheduling algorithms, can only be implemented with some interruption mechanism that fires at a certain frequency, independently of the process. A timer chip was deemed suitable, and with the IBM 5150 (a.k.a. IBM PC), released in 1981, an x86 system was delivered incorporating, inter alia, an Intel 8088 (the 8086 variant with an 8-bit external data bus), an Intel 8255 PPI (which also served the keyboard interface; the 8042 keyboard controller only arrived with the AT), the Intel 8259 PIC (Programmable Interrupt Controller), and the Intel 8253 PIT (Programmable Interval Timer).
The i8253 was connected, like a few other peripheral devices, to the i8259. About 18.2 times per second it raised IRQ 0 on the PIC, which signalled an interrupt to the CPU; after the acknowledge cycle the CPU was interrupted and a handler was executed.
That very handler could contain scheduling code, which decides on the next process to execute [1].
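The tick rate mentioned above can be checked with a little arithmetic. The PIT is clocked at one twelfth of the PC's 14.31818 MHz crystal, and its 16-bit counter divides that by at most 65536, which is the BIOS default. The helper names below are made up for illustration.

```python
PIT_INPUT_HZ = 14_318_180 / 12  # about 1.193182 MHz
MAX_DIVISOR = 65536             # 16-bit counter; a programmed 0 means 65536


def pit_rate_hz(divisor=MAX_DIVISOR):
    """Interrupt rate on IRQ 0 for a given counter divisor."""
    return PIT_INPUT_HZ / divisor


def divisor_for(target_hz):
    """Nearest usable divisor for a desired tick rate."""
    return max(1, min(MAX_DIVISOR, round(PIT_INPUT_HZ / target_hz)))


print(f"default tick: {pit_rate_hz():.4f} Hz, "
      f"period {1000 / pit_rate_hz():.3f} ms")
print("divisor for a 1000 Hz tick:", divisor_for(1000))
```

With the default divisor this gives roughly 18.2 ticks per second, i.e. one tick about every 55 ms. An OS that wants a finer scheduling tick programs a smaller divisor.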
Of course, we (most of us) are living in the 21st century by now, and nobody uses an IBM PC or one of its derivatives like the XT or AT any more. The PIC has given way to the more sophisticated APIC (e.g. the Intel 82093AA I/O APIC) to handle multiple processors/cores and for general improvement, but the PIT has remained the same, I think, maybe in the shape of some integrated version like the Intel AIP.
Cooperative schedulers do not need a regular interrupt and therefore no special hardware support (except maybe for hardware-supported multitasking). The process yields the CPU deliberately, and if it doesn't, you have a problem. This is why few OSes actually use cooperative schedulers: a process that never yields can monopolize the CPU, which is a big reliability and security hole.
[1] Note, however, that OSes on the 8086 (mostly DOS) didn't have a real scheduler. The x86 architecture only gained native hardware support for multitasking with protected mode, which was introduced with the 80286 and extended to 32 bits with the 80386 (SX, DX, and so on). I just wanted to stress that the IBM 5150 was the first x86 system with a timer chip (and, of course, the first IBM PC altogether).
Systems running an OS with a preemptive scheduler (i.e. all those in common use) are, IME, all provided with a hardware timer interrupt that causes a driver to run and can change the set of running threads.
Such a timer interrupt is very useful for providing timeouts for system calls, sleep() functionality and other time-related functions. It can also help share out the available CPU amongst ready threads when the system is overloaded, or the thread/s run on it are CPU-intensive, and so the number of ready threads exceeds the number of cores available to run them.
It is quite possible to implement a preemptive scheduler without any hardware timer, allowing the set of running threads to be rescheduled upon software interrupts (system calls) from threads that are already running, and upon all the other interrupts raised on I/O completion by the peripheral drivers for disk, NIC, KB, mouse etc. I've never seen it done though - the timer functionality is too useful :)

How does kernel code run on SMP machines

How does the kernel code run on SMP machines? I know that module (driver) code can run on several processors/cores, but is this also the case for the core kernel code?
Drivers are part of the kernel, whether they are modular or built in, so what holds for driver code holds for core kernel code as well.
It is the scheduler that assigns tasks (processes/threads) to each CPU/core.
The scheduler is a single software entity that runs on every CPU/core and dispatches everything else (kernel code, its drivers, kernel threads, system calls, apps, ...).
Every process is run by the scheduler according to the scheduling algorithm in use.
It is the scheduler that decides which process is supposed to run on which CPU/core.
Example: consider a round-robin scheduler. It keeps a time slice for every process that enters the ready queue (RQ). If the scheduler finds a processor/core idle and there are processes in the RQ, it gives one of them to the idle core and starts a timer that raises an interrupt when the time slice expires. The interrupt handler then invokes the scheduler again, which picks the next process from the RQ.
Thus, at any point in time, all the processors can be kept busy running tasks, achieving high throughput, provided there are enough tasks to run.
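The round-robin example can be turned into a small simulation. This is a deliberately simplified sketch, not how the Linux scheduler works: all cores switch in lock-step at the end of each time slice, there is a single shared ready queue, and the names (`round_robin`, `bursts`) are invented for illustration.

```python
from collections import deque


def round_robin(bursts, cores=2, time_slice=2):
    """Simulate round-robin scheduling on several cores.

    bursts: dict mapping task name -> CPU time units it still needs.
    Returns a list of rounds; each round lists (core, task, ran_for).
    """
    ready = deque(bursts.items())
    timeline = []
    while ready:
        this_round = []
        requeue = []
        for core in range(cores):
            if not ready:
                break  # fewer ready tasks than cores: this core stays idle
            name, remaining = ready.popleft()
            ran = min(time_slice, remaining)
            this_round.append((core, name, ran))
            if remaining > ran:
                # Time slice expired before the task finished: back to the RQ
                requeue.append((name, remaining - ran))
        ready.extend(requeue)
        timeline.append(this_round)
    return timeline


for i, rnd in enumerate(round_robin({"A": 3, "B": 2, "C": 1})):
    print(f"round {i}: {rnd}")
```

In the first round both cores are busy with A and B. A's slice expires, so it goes back to the ready queue behind C, and the second round runs C and the rest of A in parallel.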
