Understanding the `scheduler_tick()` function of Linux Kernel - linux-kernel

To my knowledge, I understood that the scheduler_tick() function periodically gets called by the timer interrupt with the frequency of HZ.
I'm new to kernel development, so these might be naive questions but I'm trying to understand the execution of the scheduler_tick() function. I'd appreciate your help on these questions:
It is possible for the timer to invoke this function again before the previous call returns?
Is it called per-core? (with per-core timer interrupt)
If the HZ frequency is high, the scheduler executes a process for longer than its timeslice. If the HZ is low, I presume a significant overhead due to frequent interruptions and invocation of this function. Is this assumption correct?

Answering your main questions and comments at the same time:
It's not possible for scheduler code to be called while already scheduling, as it runs with interrupts disabled.
Yes, each core has its own runqueue and does its own scheduling.
It's the opposite: high HZ -> scheduler will run more frequently (since HZ represents the number of scheduling ticks per second); low HZ -> scheduler will run less frequently.
Scheduling on every single core simultaneously is not needed, each CPU has its own timer and does its own timer interrupts regardless of sync with other CPUs. Scheduler ticks don't happen at the same time on all CPUs (furthermore CPUs could be idle and not ticking at all in a "tickless" kernel, configured with CONFIG_NOHZ). Tickless kernels make things a bit tricker, for more info check out Documentation/timers/NO_HZ.txt and "How are jiffies incremented in a tickless kernel?".
As per what trigger_load_balance() does and how, see Documentation/scheduler/sched-domains.txt, which explains it more or less in detail. In particular, trigger_load_balance() doesn't do the actual load balancing, it just checks if it's needed and if so raises a soft IRQ (SCHED_SOFTIRQ), which is then handled by run_rebalance_domains(), which calls rebalance_domains(), which does the actual work. The latter function indeed uses RCU/locking, as you can see by the calls to functions such as spin_trylock() and rcu_read_lock().

Related

OS thread scheduling and cpu usage relations

As I know, for threads scheduling, Linux implements a fair scheduler and Windows implements the Round-robin (RR) schedulers: each thread has a time slice for its execution (correct me if I'm wrong).
I wonder, is the CPU usage related to the thread scheduling?
For example: there are 2 threads executing at the same time, and the time slice for system is 15ms. The cpu has only 1 core.
Thread A needs 10ms to finish the job and then sleep 5ms, run in a loop.
Thread B needs 5ms to finish the job and then sleep 10ms, also in a loop.
Will the CPU usage be 100%?
How is the thread scheduled? Will thread A use up all its time and then schedule out?
One More Scenario:
If I got a thread A running, that is then blocked by some condition (e.g network). Will the CPU at 100% affect the wakeup time of this thread? For example, a thread B may be running in this time window, will the thread A be preempted by the OS?
As i know that Linux implements a fair scheduler and Windows System
implements the Round-robin (RR) schedulers for threads scheduling,
Both Linux and Windows use priority-based, preemptive thread schedulers. Fairness matters but it's not, strictly speaking, the objective. Exactly how these scheduler work depends on the version and the flavor (client vs. server) of the system. Generally, thread schedulers are designed to maximize responsiveness and alleviate scheduling hazards such as inversion and starvation. Although some scheduling decisions are made in a round-robin fashion, there are situations in which the scheduler may insert the preempted thread at the front of the queue rather than at the back.
each thread has a time slice for its execution.
The time slice (or quantum) is really more like a guideline than a rule. It's important to understand that a time slice is divisible and it equals some variable number of clock cycles. The scheduler charges CPU usage in terms of clock cycles, not time slices. A thread may run for more than a time slice (e.g., a time slice and a half). A thread may also voluntarily relinquish the rest of its time slice. This is possible because the only way for a thread to relinquish its time slice is by performing a system call (sleep, yield, acquire lock, request synchronous I/O). All of these are privileged operations that cannot be performed in user-mode (otherwise, a thread can go to sleep without telling the OS!). The scheduler can change the state of the thread from "ready" to "waiting" and schedule some other ready thread to run. If a thread relinquishes the rest of its time slice, it will not be compensated the next time it is scheduled to run.
One particularly interesting case is when a hardware interrupt occurs while a thread is running. In this case, the processor will automatically switch to the interrupt handler, forcibly preempting the thread even if its time slice has not finished yet. In this case, the thread will not be charged for the time it takes to handle the interrupt. Note that the interrupt handler would be indeed utilizing the CPU. By the way, the overhead of context switching itself is also not charged towards any time slice. Moreover, on Windows, the fact that a thread is running in user-mode or kernel-mode by itself does not have an impact on its priority or time slice. On Linux, the scheduler is invoked at specific places in the kernel to avoid starvation (kernel preemption implemented in Linux 2.5+).
So the CPU usage will be 100%? And how is the thread scheduled? Will
thread A use up all its time and then schedule out?
It's easy to answer these questions now. When a thread goes to sleep, the other gets scheduled. Note that this happens even if the threads have different priorities.
If i got a thread running, and blocked by some
condition(e.g network). Will the CPU 100% will affect the wakeup time
of this thread? For example, another thread may running in its time
window and will not schedule out by the OS?
Linux and Windows schedulers implement techniques to enable threads that are waiting on I/O operations to "wake up quickly" and get higher chances of being scheduled soon. For example, on Windows, the priority of a thread waiting on an I/O operation may be boosted a bit when the I/O operation completes. This means that it can preempt another running thread before finishing its time slice, even if both threads had the same priorities initially. When a boosted-priority thread wakes up, its original priority is restored.
So the CPU usage will be 100%?
Ideally speaking, the answer would be yes and by ideally I mean , you are not considering the time wasted in doing performing a context switch. Practically , the CPU utilization is increased by keeping it busy all of the time but still there is some amount of time that is wasted in doing a context switch(the time it takes to switch from one process or thread to another).
But I would say that in your case the time constraints of both threads are aligned perfectly to have maximum CPU utilization.
And how is the thread scheduled? Will thread A use up all its time and
then schedule out?
Well it really depends, in most modern operating systems implementations , if there is another process in the ready queue, the current process is scheduled out as soon as it is done with CPU , regardless of whether it still has time quantum left. So yeah if you are considering a modern OS design then the thread A is scheduled out right after 10ms.

What is the kernel timer system and how is it related to the scheduler?

I'm having a hard time understanding this.
How does the scheduler know that a certain period of time has passed?
Does it use some sort of syscall or interrupt for that?
What's the point of using the constant HZ instead of seconds?
What does the system timer have to do with the scheduler?
How does the scheduler know that a certain period of time has passed?
The scheduler consults the system clock.
Does it use some sort of syscall or interrupt for that?
Since the system clock is updated frequently, it suffices for the scheduler to just read its current value. The scheduler is already in kernel mode so there is no syscall interface involved in reading the clock.
Yes, there are timer interrupts that trigger an ISR, an interrupt service routine, which reads hardware registers and advances the current value of the system clock.
What's the point of using the constant HZ instead of seconds?
Once upon a time there was significant cost to invoking the ISR, and on each invocation it performed a certain amount of bookkeeping, such as looking for scheduler quantum expired and firing TCP RTO retransmit timers. The hardware had limited flexibility and could only invoke the ISR at fixed intervals, e.g. every 10ms if HZ is 100. Higher HZ values made it more likely the ISR would run and find there is nothing to do, that no events had occurred since the previous run, in which case the ISR represented overhead, cycles stolen from a foreground user task. Lower HZ values would impact dispatch latency, leading to sluggish network and interactive response times. The HZ tuning tradeoff tended to wind up somewhere near 100 or 1000 for practical hardware systems. APIs that reported system clock time could only do so in units of ticks, where each ISR invocation would advance the clock by one tick. So callers would need to know the value of HZ in order to convert from tick units to S.I. units. Modern systems perform network tasks on a separately scheduled TCP kernel thread, and may support tickless kernels which discard many of these outdated assumptions.
What does the system timer have to do with the scheduler?
The scheduler runs when the system timer fires an interrupt.
The nature of a pre-emptive scheduler is it can pause "spinning" usermode code, e.g. while (1) {}, and manipulate the run queue, even on a single-core system.
Additionally, the scheduler runs when a process voluntarily gives up its time slice, e.g. when issuing syscalls or taking page faults.

How does a system wide profiler (e.g. perf) correlate counters with instructions?

I'm trying to understand how a system wide profiler works. Let's take linux perf as example. For a certain profiling time it can provide:
Various aggregated hadware performance counters
Time spent and hardware counters (e.g. #instructions) for each user space process and kernel space function
Information about context switches
etc.
The first thing I'm almost sure about is that the report is just an estimation of what's really happening. So I think there's some kernel module that launches software interrupts at a certain sampling rate. The lower the sampling rate, the lower the profiler overhead. The interrupt can read the model specific registers that store the performance counters.
The next part is to correlate the counters with the software that's running on the machine. That's the part I don't understand.
So where does the profiler gets its data from?
Can you interrogate for example the task scheduler to find out what was running when you interrupted him? Won't that affect the
execution of the scheduler (e.g. instead of continuing the
interrupted function it will just schedule another one, making the
profiler result not accurate). Is the list of task_struct objects available?
How can profilers even correlate HW
metrics even at instruction level?
So I think there's some kernel module that launches software interrupts at a certain sampling rate.
Perf is not module, it is part of the Linux kernel, implemented in
kernel/events/core.c and for every supported architecture and cpu model, for example arch/x86/kernel/cpu/perf_event*.c. But Oprofile was a module, with similar approach.
Perf generally works by asking PMU (Performance monitoring unit) of CPU to generate interrupt after N events of some hardware performance counter (Yokohama, slide 5 "• Interrupt when threshold reached: allows sampling"). Actually it may be implemented as:
select some PMU counter
initialize it to -N, where N is the sampling period (we want interrupt after N events, for example, after 2 millions of cycles perf record -c 2000000 -e cycles, or some N computed and tuned by perf when no extra option is set or -F is given)
set this counter to wanted event, and ask PMU to generate interrupt on overflow (ARCH_PERFMON_EVENTSEL_INT). It will happen after N increments of our counter.
All modern Intel chips supports this, for example, Nehalem: https://software.intel.com/sites/default/files/76/87/30320 - Nehalem Performance Monitoring Unit Programming Guide
EBS - Event Based Sampling. A technique in which counters are pre-loaded with a large negative count, and they are configured to interrupt the processor on overflow. When the counter overflows the interrupt service routine capture profiling data.
So, when you use hardware PMU, there is no additional work at timer interrupt with special reading of hardware PMU counters. There is some work to save/restore PMU state at task switch, but this (*_sched_in/*_sched_out of kernel/events/core.c) will not change PMU counter value for current thread nor will export it to user-space.
There is a handler: arch/x86/kernel/cpu/perf_event.c: x86_pmu_handle_irq which finds the overflowed counter and calls perf_sample_data_init(&data, 0, event->hw.last_period); to record the current time, IP of last executed command (it can be inexact because of out-of-order nature of most Intel microarchitetures, there is limited workaround for some events - PEBS, perf record -e cycles:pp), stacktrace data (if -g was used in record), etc. Then handler resets the counter value to the -N (x86_perf_event_set_period, wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask); - note the minus before left)
The lower the sampling rate, the lower the profiler overhead.
Perf allows you to set target sampling rate with -F option, -F 1000 means around 1000 irq/s. High rates are not recommended due to high overhead. Ten years ago Intel VTune recommended not more than 1000 irq/s (http://www.cs.utah.edu/~mhall/cs4961f09/VTune-1.pdf "Try to get about a 1000 samples per second per logical CPU."), perf usually don't allow too high rate for non-root (autotuned to lower rate when "perf interrupt took too long" - check in your dmesg; also check sysctl -a|grep perf, for example kernel.perf_cpu_time_max_percent=25 - which means that perf will try to use not more then 25 % of CPU)
Can you interrogate for example the task scheduler to find out what was running when you interrupted him?
No. But you can enable tracepoint at sched_switch or other sched event (list all available in sched: perf list 'sched:*'), and use it as profiling event for the perf. You can even ask perf to record stacktrace at this tracepoint:
perf record -a -g -e "sched:sched_switch" sleep 10
Won't that affect the execution of the scheduler
Enabled tracepoint will make add some perf event sampling work to the function with tracepoint
Is the list of task_struct objects available?
Only via ftrace...
Information about context switches
This is software perf event, just call to perf_sw_event with PERF_COUNT_SW_CONTEXT_SWITCHES event from sched/core.c (indirectly). Example of direct call - migration software event: kernel/sched/core.c set_task_cpu(): p->se.nr_migrations++; perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
PS: there are good slides on perf, ftrace and other profiling and tracing subsystems in Linux by Gregg: http://www.brendangregg.com/linuxperf.html
This is pretty much answers all three of your questions.
Profiling consits of two types: Counting and sampling. Counting measures the
overall
number
of events during the entire execution without offering any insight
regarding
the
instructions or functions that
generated
them
. On
the other hand,
sampling gives a correlation of
the events to the code
through captured samples of the Instruction Pointer
.
When sampling, the
kernel instructs the processor to issue an interrupt when
a chosen
event counter exceeds a
threshold. T
his interrupt is caught by the kernel and the sampled data
including the Instruction
Pointer
value are stored into a ring buffer. The buffer is polled periodically by the userspace
perf tool and its contents
written to disk.
In post processing, the Instruction Pointer is matched to
addresses in binary files, which can be translated into function names and such
Refer http://openlab.web.cern.ch/sites/openlab.web.cern.ch/files/technical_documents/TheOverheadOfProfilingUsingPMUhardwareCounters.pdf

Why does linux kernel need idle thread?

Rather than "do nothing" if there is nothing to do (including SMP), why linux kernel runs idle thread?
When the scheduler decides to switch to the idle task, at this point, the dynamic tick begins to work, by disabling periodic tick until the next timer expires. The tick will be reenabled after this time span or when an interrupt occurs at some time.
In the mean time, the CPU is going to a well-deserved sleep, in an architecture-specific way, therefore saving your power. Take a look at the definition of cpu_idle() in arch/x86/kernel/process.c.
/*
* The idle thread. There's no useful work to be
* done, so just try to conserve power and have a
* low exit latency (ie sit in a loop waiting for
* somebody to say that they'd like to reschedule)
*/
void cpu_idle(void)
What do you mean by "do nothing"??
When the CPU is powered up there is a rather long list of things that happen. Once powered up the CPU cannot "do nothing". It has to do something because there is a voltage and a periodic clock signal. You can power it down again and do ABSOLUTELY nothing but then you have to go through the long list of things to get a steady clock signal when you need it again.
So the idle thread is a thread that does the bare minimum. i.e. if multiplying two floating point numbers required the least number of cycles and the least number of electronic circuitry; then the idle thread would be multiplying two floats all the time. In addition as Wang said the Linux kernel (in some configurations) monitors when cores start executing idle thread and also switches them to a lower frequency, disabling any sort of periodic OS house keeping. This results in a bit of latency when the core is needed but then there is much less power being used.

Windows thread-switching latency after IO completion - microseconds or milliseconds

I am trying to determine approximate time delay (Win 7, Vista, XP) to switch threads when an IO operation completes.
What I (think I) know is that:
a) Thread contex switches are themselves computationally very fast. (By very fast, I mean typically way under 1ms, maybe even under 1us? - assuming a relatively fast, unloaded machine etc.)
b) Round robin time slice quantums are on the order of 10-15ms.
What I can't seem to find is information about the typical latency time from a (high priority) thread becoming active/signaled - via, say, a synchronous disk write completing - and that thread actually running again.
E.g., I have read in at least one place that all inactive threads remain asleep until ~10ms system quantum expires and then (assuming they are ready to go), they all get reactivated almost synchronously together. But in another place I read that the delay between when a thread completes an I/O operation and when it becomes active/signaled and runs again is measured in microseconds, not milliseconds.
My context for asking is related to capture and continuous streaming write to a RAID array of SSDs, from a high speed camera, where unless I can start a new write after a prior one has finished in well under 1ms (it would be best if under 1/10ms, on average), it will be problematic.
Any information regarding this issue would be most appreciated.
Thanks,
David
Thread context switches cost between 2,000 and 10,000 cpu cycles, so a handful of microseconds.
An I/O completion is fast when a thread is blocking on the synchronization handle that signals completion. That makes the Windows thread scheduler temporarily boost the thread priority. Which in turn makes it likely (but not guaranteed) to be chosen as the thread that gets the processor love. So that's typically microseconds, not milliseconds.
Do note that disk writes normally go through the file system cache. Which makes the WriteFile() call a simple memory-to-memory copy that doesn't block the thread. This runs at memory bus speeds, 5 gigabytes per second and up. Data is then written to the disk in a lazy fashion, the thread isn't otherwise involved or delayed by that. You'll only get slow writes when the file system cache is filled to capacity and you don't use overlapped I/O. Which is certainly a possibility if you write video streams. The amount of RAM makes a great deal of difference. And SSD controllers are not made the same. Nothing you can reason out up front, you'll have to test.

Resources