How to improve scheduler and interrupt latency:
Background:
Embedded system based on a 10-core MIPS64 processor.
9 cores run SMP Linux, kernel version 2.6.32.27.
We have a process with real-time performance requirements that has to complete certain tasks within 1 ms. Under maximum load conditions this may take 800 µs.
This process starts processing after receiving a GPIO interrupt (a 1 ms interrupt provided by an FPGA, implemented as a kernel driver).
Until then it makes an ioctl() call into the GPIO driver and is put to sleep on a wait queue (e.g. via wait_event_interruptible()).
The GPIO ISR then calls wake_up() to wake this process.
To prevent other processes from hogging the CPU, we run this process on a core isolated with "isolcpus".
We have set its priority to the highest among user threads, as below:
Priority: 80, scheduling policy: SCHED_FIFO
threadSetRtPriority(SCHED_FIFO, 80);
All /proc/sys/kernel/sched_* parameter values are at their defaults; we haven't fine-tuned them.
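For reference, a minimal userspace sketch of this setup, assuming threadSetRtPriority() is just a thin wrapper around sched_setscheduler(); the explicit CPU pinning and mlockall() are additions that commonly accompany SCHED_FIFO, and the core number used is an assumption:

    /* Hypothetical sketch of the setup described above: SCHED_FIFO priority 80,
     * pinned to the isolated core, with memory locked so the 1 ms path never
     * page-faults. Core number 9 is assumed; error handling is minimal. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 80 };
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)   /* what threadSetRtPriority() presumably does */
            perror("sched_setscheduler");

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(9, &set);                                  /* the isolcpus-reserved core (assumed id) */
        if (sched_setaffinity(0, sizeof(set), &set) != 0)  /* isolcpus alone does not pin the task */
            perror("sched_setaffinity");

        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)       /* avoid page faults on the RT path */
            perror("mlockall");

        /* ... open the GPIO device and ioctl() to wait for the 1 ms interrupt ... */
        return 0;
    }

Note that isolcpus only keeps other tasks off the core; the RT process still has to be affined to it explicitly.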
Problem:
Sometimes we see that the ISR has called wake_up(), but the process is scheduled only after 350 µs.
This is a long time, given that our processor runs at 1.25 GHz.
This large scheduling latency puzzles us, as we have already isolated a core exclusively for this process using "isolcpus".
We profile the maximum CPU cycle count between consecutive 1 ms GPIO ISR calls. This maximum is more than 1.5 ms.
This large interrupt latency is also a concern, as it eats into the time available for the process to finish its work within the 1 ms boundary.
Please help us with inputs to reduce the interrupt and scheduling latencies.
The standard Linux kernel does not provide hard real-time guarantees. A degree of real-time determinism can be achieved with the PREEMPT_RT patch. It still requires careful design, and it is no substitute for an RTOS where the real-time requirements are critical.
I have been working on Linux kernel 4.8 preempt-rt, which has the PREEMPT_RT patch applied, from this repo: linux kernel 4.8 preempt-rt, and have some promising results!
I benchmarked both the preempt-rt and non-preempt-rt Linux kernels by running the rt-tests cyclictest benchmark and found that the max latency with the preempt-rt kernel came down to 61 µs, as against 2025 µs with the non-preempt kernel, which might well help your case.
The results have clearly tempted me to use the preempt-rt kernel, as there is an overwhelming difference in max latency between the two. I have documented the results here: sachin-mokashi-linux-preempt-rt, in case it might be of help to you!
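For context, here is a minimal C sketch of the kind of measurement cyclictest performs: sleep until an absolute deadline under SCHED_FIFO and record how late each wakeup is. The interval and iteration count are arbitrary choices.

    /* Minimal cyclictest-style measurement: sleep until an absolute
     * CLOCK_MONOTONIC deadline and record how late the wakeup was. */
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    #define NSEC_PER_SEC 1000000000L
    #define INTERVAL_NS  1000000L            /* 1 ms period, like the GPIO tick above */

    static void ts_add(struct timespec *t, long ns)
    {
        t->tv_nsec += ns;
        while (t->tv_nsec >= NSEC_PER_SEC) {
            t->tv_nsec -= NSEC_PER_SEC;
            t->tv_sec++;
        }
    }

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 80 };
        sched_setscheduler(0, SCHED_FIFO, &sp);      /* needs root or CAP_SYS_NICE */

        struct timespec next, now;
        long max_lat_ns = 0;
        clock_gettime(CLOCK_MONOTONIC, &next);

        for (int i = 0; i < 100000; i++) {
            ts_add(&next, INTERVAL_NS);
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            clock_gettime(CLOCK_MONOTONIC, &now);
            long lat = (now.tv_sec - next.tv_sec) * NSEC_PER_SEC
                     + (now.tv_nsec - next.tv_nsec);
            if (lat > max_lat_ns)
                max_lat_ns = lat;
        }
        printf("max wakeup latency: %ld us\n", max_lat_ns / 1000);
        return 0;
    }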
To my understanding, the scheduler_tick() function is called periodically by the timer interrupt, with a frequency of HZ.
I'm new to kernel development, so these might be naive questions, but I'm trying to understand the execution of the scheduler_tick() function. I'd appreciate your help with these questions:
Is it possible for the timer to invoke this function again before the previous call returns?
Is it called per core (with a per-core timer interrupt)?
If the HZ frequency is high, the scheduler executes a process for longer than its timeslice; if HZ is low, I presume there is significant overhead due to frequent interruption and invocation of this function. Is this assumption correct?
Answering your main questions and comments at the same time:
It's not possible for scheduler code to be called while already scheduling, as it runs with interrupts disabled.
Yes, each core has its own runqueue and does its own scheduling.
It's the opposite: high HZ -> scheduler will run more frequently (since HZ represents the number of scheduling ticks per second); low HZ -> scheduler will run less frequently.
Scheduling on every single core simultaneously is not needed; each CPU has its own timer and does its own timer interrupts regardless of sync with other CPUs. Scheduler ticks don't happen at the same time on all CPUs (furthermore, CPUs could be idle and not ticking at all in a "tickless" kernel, configured with CONFIG_NOHZ). Tickless kernels make things a bit trickier; for more info, check out Documentation/timers/NO_HZ.txt and "How are jiffies incremented in a tickless kernel?".
As for what trigger_load_balance() does and how, see Documentation/scheduler/sched-domains.txt, which explains it more or less in detail. In particular, trigger_load_balance() doesn't do the actual load balancing; it just checks whether it's needed and, if so, raises a soft IRQ (SCHED_SOFTIRQ), which is then handled by run_rebalance_domains(), which calls rebalance_domains(), which does the actual work. The latter function indeed uses RCU/locking, as you can see from the calls to functions such as spin_trylock() and rcu_read_lock().
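A compilable toy sketch (not the real kernel code) of that check-then-defer flow, just to make the sequence concrete; the jiffy values and balance interval below are made up:

    /* Toy model of the flow described above: the per-CPU tick only *checks*
     * whether balancing is due and raises a soft IRQ; the expensive domain
     * walk happens later, in softirq context. */
    #include <stdbool.h>
    #include <stdio.h>

    static bool sched_softirq_raised;
    static unsigned long jiffies, next_balance = 4;

    static void raise_softirq_sched(void)        /* stands in for raise_softirq(SCHED_SOFTIRQ) */
    {
        sched_softirq_raised = true;
    }

    static void trigger_load_balance_toy(void)   /* cheap check, runs from the timer tick */
    {
        if (jiffies >= next_balance)
            raise_softirq_sched();
    }

    static void rebalance_domains_toy(void)      /* the actual (deferred) work */
    {
        printf("jiffy %lu: walk sched domains, move tasks if imbalanced\n", jiffies);
        next_balance = jiffies + 4;              /* arbitrary next-balance interval */
    }

    static void run_rebalance_domains_toy(void)  /* softirq handler */
    {
        rebalance_domains_toy();
    }

    int main(void)
    {
        for (jiffies = 0; jiffies < 12; jiffies++) {
            trigger_load_balance_toy();          /* as if called from scheduler_tick() */
            if (sched_softirq_raised) {          /* "later", when softirqs are processed */
                sched_softirq_raised = false;
                run_rebalance_domains_toy();
            }
        }
        return 0;
    }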
The state of the system at any given time consists of the processes in execution, right? Say there are 4 userspace programs running on the processors at the moment. After each time slice, I assume control has to pass to the scheduler so that the appropriate process can be scheduled next. What initiates this transfer of control? It seems to me that there has to be some kind of special timer/register in hardware that keeps track of the time the current process has run, since the process itself has no mechanism to keep track of how long it has executed... Is my intuition right?
First of all, this answer concerns the x86 architecture only.
There are different kinds of schedulers: preemptive and non-preemptive (cooperative).
Preemptive schedulers preempt the execution of a process, that is, initiate a context switch using a TSS (Task State Segment), which then performs a jump to another process. The process is stopped and another one is started.
Cooperative schedulers do not stop processes. They rely on the processes themselves giving up the CPU in favor of the scheduler, also called "yielding", similar to user-level threads without kernel support.
Preemption can be accomplished in two ways: as the result of some I/O-bound action, or while the CPU is busy with purely CPU-bound work.
Imagine you sent some instructions to the FPU. It takes some time until it's finished. Instead of sitting around idly, you could do something else while the FPU performs its calculations! So, as the result of an I/O operation, the scheduler switches to another process, possibly resuming the preempted process right after the FPU is done.
However, regular preemption, as required by many scheduling algorithms, can only be implemented with some interruption mechanism that fires with a certain frequency, independently of the process. A timer chip was deemed suitable, and with the IBM 5150 (a.k.a. IBM PC) released in 1981, an x86 system was delivered incorporating, inter alia, an Intel 8088, a keyboard controller, the Intel 8259 PIC (Programmable Interrupt Controller), and the Intel 8253 PIT (Programmable Interval Timer).
The i8253 was connected, like a few other peripheral devices, to the i8259. Roughly 18.2 times a second it raised an interrupt to the PIC on IRQ 0, and after acknowledgement and so on, the CPU was interrupted and a handler executed.
That very handler could contain scheduling code, which decides on the next process to execute¹.
Of course, most of us are living in the 21st century by now, and nobody is using an IBM PC or one of its derivatives like the XT or AT. The PIC has given way to the more sophisticated Intel 82093AA I/O APIC to handle multiple processors/cores and for general improvement, but the PIT has remained much the same, I think, perhaps in the shape of some integrated version like the Intel AIP.
Cooperative schedulers do not need a regular interrupt and therefore no special hardware support (except maybe for hardware-supported multitasking). The process yields the CPU deliberately, and if it doesn't, you have a problem. That is the reason why few OSes actually use cooperative schedulers: it poses a big security hole.
¹ Note, however, that OSes on the 8086/8088 (mostly DOS) didn't have a real scheduler. The x86 architecture only gained native hardware support for multitasking with the advent of one of the 80386 versions (SX, DX, and so on). I just wanted to stress that the IBM 5150 was the first x86 system with a timer chip (and, of course, the first PC altogether).
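To make the cooperative "yield in favor of the scheduler" model concrete, here is a toy round-robin written with POSIX ucontext; it is purely illustrative and corresponds to no real OS scheduler:

    /* Toy cooperative "scheduler": tasks run until they explicitly yield back
     * to a round-robin loop via swapcontext(). If a task never yielded,
     * nothing could stop it; that is exactly the weakness noted above. */
    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t sched_ctx, task_ctx[2];
    static char stacks[2][16384];

    static void task_body(int id)
    {
        for (int step = 0; step < 3; step++) {
            printf("task %d, step %d\n", id, step);
            swapcontext(&task_ctx[id], &sched_ctx);   /* voluntary yield */
        }
    }

    int main(void)
    {
        for (int id = 0; id < 2; id++) {
            getcontext(&task_ctx[id]);
            task_ctx[id].uc_stack.ss_sp = stacks[id];
            task_ctx[id].uc_stack.ss_size = sizeof stacks[id];
            task_ctx[id].uc_link = &sched_ctx;        /* where to go if the task returns */
            makecontext(&task_ctx[id], (void (*)(void))task_body, 1, id);
        }

        /* naive round-robin: give each task one turn per round */
        for (int round = 0; round < 3; round++)
            for (int id = 0; id < 2; id++)
                swapcontext(&sched_ctx, &task_ctx[id]);

        return 0;
    }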
Systems running an OS with a preemptive scheduler (i.e. all those in common use) are, in my experience, all provided with a hardware timer interrupt that causes a driver to run and can change the set of running threads.
Such a timer interrupt is very useful for providing timeouts for system calls, sleep() functionality, and other time-related features. It also helps share the available CPU among ready threads when the system is overloaded, or when the threads running on it are CPU-intensive, so that the number of ready threads exceeds the number of cores available to run them.
It is quite possible to implement a preemptive scheduler without any hardware timer, rescheduling the set of running threads upon software interrupts (system calls) from threads that are already running, plus all the other interrupts raised on I/O completion by the peripheral drivers for disk, NIC, keyboard, mouse, etc. I've never seen it done, though; the timer functionality is too useful :)
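As a userspace analogue of that hardware tick, the sketch below uses setitimer() to deliver a periodic SIGALRM that interrupts busy work, which is the same kind of hook a preemptive scheduler uses to reconsider which thread should run next; the 10 ms period is an arbitrary stand-in for a HZ=100 tick:

    /* Periodic "tick" in userspace: SIGALRM fires every 10 ms regardless of
     * what the main loop is doing. A kernel uses its timer interrupt the same
     * way, except its handler may decide to switch to another thread. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>

    static volatile sig_atomic_t ticks;

    static void on_tick(int sig)
    {
        (void)sig;
        ticks++;                     /* a kernel would check timeslices / wake sleepers here */
    }

    int main(void)
    {
        struct sigaction sa;
        sa.sa_handler = on_tick;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGALRM, &sa, NULL);

        struct itimerval it = {
            .it_interval = { .tv_sec = 0, .tv_usec = 10000 },   /* 10 ms period */
            .it_value    = { .tv_sec = 0, .tv_usec = 10000 },
        };
        setitimer(ITIMER_REAL, &it, NULL);

        while (ticks < 100)          /* "run a task" for about one second of ticks */
            ;                        /* the busy loop keeps being interrupted by the timer */

        printf("received %d timer ticks\n", (int)ticks);
        return 0;
    }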
I'm trying to understand how a system-wide profiler works. Let's take Linux perf as an example. For a certain profiling period it can provide:
Various aggregated hardware performance counters
Time spent and hardware counters (e.g. number of instructions) for each user-space process and kernel-space function
Information about context switches
etc.
The first thing I'm almost sure about is that the report is just an estimate of what's really happening. So I think there's some kernel module that launches software interrupts at a certain sampling rate. The lower the sampling rate, the lower the profiler overhead. The interrupt can read the model-specific registers that store the performance counters.
The next part is to correlate the counters with the software that's running on the machine. That's the part I don't understand.
So where does the profiler get its data from?
Can you interrogate, for example, the task scheduler to find out what was running when you interrupted it? Won't that affect the execution of the scheduler (e.g. instead of continuing the interrupted function, it will just schedule another one, making the profiler result inaccurate)? Is the list of task_struct objects available?
How can profilers correlate HW metrics even at the instruction level?
So I think there's some kernel module that launches software interrupts at a certain sampling rate.
Perf is not a module; it is part of the Linux kernel, implemented in kernel/events/core.c and, for every supported architecture and CPU model, in files such as arch/x86/kernel/cpu/perf_event*.c. Oprofile, by contrast, was a module with a similar approach.
Perf generally works by asking the PMU (Performance Monitoring Unit) of the CPU to generate an interrupt after N events of some hardware performance counter (Yokohama, slide 5: "Interrupt when threshold reached: allows sampling"). In practice it may be implemented as:
select some PMU counter
initialize it to -N, where N is the sampling period (we want an interrupt after N events, for example after 2 million cycles: perf record -c 2000000 -e cycles, or some N computed and tuned by perf when no extra option is set or -F is given)
program this counter for the wanted event, and ask the PMU to generate an interrupt on overflow (ARCH_PERFMON_EVENTSEL_INT). The overflow will happen after N increments of our counter.
All modern Intel chips support this; for example, Nehalem: https://software.intel.com/sites/default/files/76/87/30320 - Nehalem Performance Monitoring Unit Programming Guide
EBS - Event Based Sampling. A technique in which counters are pre-loaded with a large negative count, and they are configured to interrupt the processor on overflow. When the counter overflows, the interrupt service routine captures profiling data.
So, when you use the hardware PMU, there is no additional work at the timer interrupt to read hardware PMU counters. There is some work to save/restore PMU state at a task switch, but this (*_sched_in/*_sched_out in kernel/events/core.c) will neither change the PMU counter value for the current thread nor export it to user space.
There is a handler, x86_pmu_handle_irq in arch/x86/kernel/cpu/perf_event.c, which finds the overflowed counter and calls perf_sample_data_init(&data, 0, event->hw.last_period); to record the current time, the IP of the last executed instruction (it can be inexact because of the out-of-order nature of most Intel microarchitectures; there is a limited workaround for some events: PEBS, perf record -e cycles:pp), stacktrace data (if -g was used in record), etc. Then the handler resets the counter value to -N (x86_perf_event_set_period, wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask); note the minus before left).
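The same pre-load/overflow-interrupt mechanism is reachable from userspace through the perf_event_open() syscall. Below is a hedged sketch of configuring such a sampling event; the period, event choice and ring-buffer size are arbitrary, and error handling is minimal:

    /* Sampling setup in the spirit of the description above: ask the kernel to
     * take a sample (with the instruction pointer) every 2,000,000 cycles and
     * deposit it in an mmap'ed ring buffer, as the perf tool does. */
    #include <linux/perf_event.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;   /* like perf record -e cycles */
        attr.sample_period = 2000000;             /* like perf record -c 2000000 */
        attr.sample_type = PERF_SAMPLE_IP;        /* record the instruction pointer */
        attr.disabled = 1;
        attr.exclude_kernel = 1;                  /* keeps it usable without root on most setups */

        int fd = syscall(SYS_perf_event_open, &attr, 0 /* this process */,
                         -1 /* any cpu */, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        /* the kernel writes samples into this ring buffer; the perf tool mmaps
         * it the same way and drains it into perf.data */
        size_t len = (1 + 8) * sysconf(_SC_PAGESIZE);   /* 1 metadata page + 2^3 data pages */
        void *ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (ring == MAP_FAILED) { perror("mmap"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (volatile long i = 0; i < 100000000; i++)
            ;                                      /* burn cycles to trigger overflows */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        struct perf_event_mmap_page *meta = ring;
        printf("bytes of sample data produced: %llu\n",
               (unsigned long long)meta->data_head);
        return 0;
    }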
The lower the sampling rate, the lower the profiler overhead.
Perf allows you to set a target sampling rate with the -F option; -F 1000 means around 1000 interrupts per second. High rates are not recommended due to high overhead. Ten years ago Intel VTune recommended no more than 1000 interrupts per second (http://www.cs.utah.edu/~mhall/cs4961f09/VTune-1.pdf: "Try to get about a 1000 samples per second per logical CPU."), and perf usually doesn't allow too high a rate for non-root users (it is autotuned to a lower rate when "perf interrupt took too long" appears; check your dmesg; also check sysctl -a | grep perf, for example kernel.perf_cpu_time_max_percent=25, which means perf will try to use no more than 25% of the CPU).
Can you interrogate, for example, the task scheduler to find out what was running when you interrupted it?
No. But you can enable a tracepoint at sched_switch or another sched event (list all available with: perf list 'sched:*') and use it as a profiling event for perf. You can even ask perf to record a stacktrace at this tracepoint:
perf record -a -g -e "sched:sched_switch" sleep 10
Won't that affect the execution of the scheduler
An enabled tracepoint adds some perf event sampling work to the function containing the tracepoint.
Is the list of task_struct objects available?
Only via ftrace...
Information about context switches
This is a software perf event: just a call to perf_sw_event with the PERF_COUNT_SW_CONTEXT_SWITCHES event from sched/core.c (indirectly). An example of a direct call is the migration software event, in kernel/sched/core.c set_task_cpu(): p->se.nr_migrations++; perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
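For completeness, the same kind of software event can also be read in plain counting mode from userspace via perf_event_open(); a small sketch (the usleep() loop is only there to force some context switches):

    /* Count context switches for the calling thread over a short interval. */
    #include <linux/perf_event.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_SOFTWARE;
        attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
        attr.disabled = 1;

        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (int i = 0; i < 100; i++)
            usleep(1000);                         /* each sleep yields the CPU, forcing a switch */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        unsigned long long count = 0;
        read(fd, &count, sizeof(count));
        printf("context switches: %llu\n", count);
        return 0;
    }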
PS: there are good slides on perf, ftrace and other profiling and tracing subsystems in Linux by Gregg: http://www.brendangregg.com/linuxperf.html
This pretty much answers all three of your questions.
Profiling consists of two types: counting and sampling. Counting measures the overall number of events during the entire execution, without offering any insight into which instructions or functions generated them. Sampling, on the other hand, correlates the events with the code through captured samples of the instruction pointer.
When sampling, the kernel instructs the processor to issue an interrupt when a chosen event counter exceeds a threshold. This interrupt is caught by the kernel, and the sampled data, including the instruction pointer value, are stored in a ring buffer. The buffer is polled periodically by the userspace perf tool and its contents written to disk.
In post-processing, the instruction pointer is matched to addresses in binary files, which can be translated into function names and the like.
Refer to http://openlab.web.cern.ch/sites/openlab.web.cern.ch/files/technical_documents/TheOverheadOfProfilingUsingPMUhardwareCounters.pdf
Does anybody know how Heterogeneous Multi-Processing (HMP) scheduling is implemented in the Linux kernel scheduler?
This has been implemented in the kernel supplied with the ODROID-XU3 board.
(https://github.com/hardkernel/linux.git -b odroidxu3-3.10.y-android)
I roughly know that it calculates the load of a given process and, based on that load, reschedules it onto a faster or slower CPU.
I'm looking for a more detailed explanation and, if possible, the location in the code of the functions that implement this functionality.
Code:
Check out the source code under #ifdef CONFIG_SCHED_HMP, mainly within kernel/sched/core.c.
A (not so) brief overview:
big.LITTLE CPUs can be configured in 2 modes of operation:
IKS – In Kernel Switcher (also known as CPU Migration)
GTS - Global Task Scheduling (also known as big.LITTLE MP)
GTS is the heterogeneous mode of operation, i.e. HMP.
At the most abstract level, HMP is currently supported by simply extending DVFS and SMP load balancing. Both of these are made fully aware of the performance advantage of the big cores (over the LITTLE cores) and schedule high-priority, CPU-intensive, foreground tasks accordingly.
Dynamic voltage and frequency scaling (DVFS) is used to adapt to instantaneous changes in required performance. The migration modes of big.LITTLE extend this concept by enabling a transition to "big" CPU cores above the highest DVFS operating point of the LITTLE cores. The migration takes about 30 microseconds. By contrast, the DVFS driver evaluates the performance of the OS and the individual cores typically every 50 milliseconds, although some implementations sample slightly more frequently. It takes about 100 microseconds to change voltage and frequency. Because the time taken to migrate a CPU or a cluster is shorter than the DVFS change time, and an order of magnitude shorter than the OS evaluation period for DVFS changes, big.LITTLE transitions allow the processors to run at lower operating points more frequently and, moreover, remain completely invisible to the user.
In the Global Task Scheduling model, the DVFS mechanisms are still in operation, but the operating system kernel scheduler is aware of the big and LITTLE cores in the system and seeks to load balance high performance threads to high performance cores, and low performance or memory bound threads to the high efficiency cores. This is similar to SMP load balancers today, that automatically balance threads across the cores available in the system, and idle unused cores. In big.LITTLE Global Task Scheduling, the same mechanism is in operation, but the OS keeps track of the load history of each thread and uses that history plus real-time performance sampling to balance threads appropriately among big and LITTLE cores.
Reference: community.arm.com : Ten Things to Know About big.LITTLE
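This is not the CONFIG_SCHED_HMP code itself, but a toy sketch of the decision it is described as making: compare a task's tracked load against up/down thresholds and place it on the big or LITTLE cluster accordingly. The thresholds and load values below are invented:

    /* Toy up/down-migration decision in the spirit of HMP load tracking. */
    #include <stdio.h>

    enum cluster { CLUSTER_LITTLE, CLUSTER_BIG };

    struct task {
        const char *name;
        unsigned int load;            /* 0..1023, in the spirit of per-entity load tracking */
        enum cluster placed_on;
    };

    /* made-up thresholds; the real ones are tunables of the HMP patch set */
    #define HMP_UP_THRESHOLD   700
    #define HMP_DOWN_THRESHOLD 256

    static void hmp_place(struct task *t)
    {
        if (t->load > HMP_UP_THRESHOLD)
            t->placed_on = CLUSTER_BIG;       /* up-migrate heavy tasks */
        else if (t->load < HMP_DOWN_THRESHOLD)
            t->placed_on = CLUSTER_LITTLE;    /* down-migrate light tasks */
        /* otherwise leave it where it is (hysteresis) */
    }

    int main(void)
    {
        struct task tasks[] = {
            { "ui-render",  900, CLUSTER_LITTLE },
            { "mp3-decode", 120, CLUSTER_BIG },
            { "background", 400, CLUSTER_LITTLE },
        };
        for (unsigned i = 0; i < sizeof tasks / sizeof tasks[0]; i++) {
            hmp_place(&tasks[i]);
            printf("%-12s load=%4u -> %s\n", tasks[i].name, tasks[i].load,
                   tasks[i].placed_on == CLUSTER_BIG ? "big" : "LITTLE");
        }
        return 0;
    }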
That is, if the CPU core spends most of its time waiting for data from RAM or the L3 cache (cache misses), but the system is real-time (real-time thread priority), and the thread is pinned (affinity) to the core and runs without thread/context switches, what CPU core load (usage) should be shown on a modern x86_64?
That is, is CPU usage shown as decreasing only when time is accounted to the idle task?
And, if anyone knows, is the behavior different in this case for other processors: ARM, Power[PC], SPARC?
Clarification: CPU usage as shown in the standard Task Manager on Windows.
A hardware thread (logical core) that's stalled on a cache miss can't be doing anything else, so it still counts as busy for the purposes of task-managers / CPU time accounting / OS process scheduler time-slices / stuff like that.
This is true across all architectures.
Without hyperthreading, "hardware thread" / "logical core" are the same as a "physical core".
MorphCore or other on-the-fly switching between hyperthreading and a more powerful single core could create a difference between a thread that keeps many execution units busy and a thread that is blocked on cache misses much of the time.
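One way to see this for yourself: the pointer-chasing loop below stalls on cache misses most of the time, yet its CPU time tracks wall-clock time almost one-to-one, and top shows it near 100%. Buffer size and iteration count are arbitrary:

    /* Dependent loads over a buffer far larger than L3: the core is mostly
     * stalled, but every stalled cycle is still charged as CPU time. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1u << 25)              /* 32M entries of size_t: ~256 MB */

    static double sec(clockid_t id)
    {
        struct timespec ts;
        clock_gettime(id, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        size_t *next = malloc(N * sizeof *next);
        if (!next) return 1;

        /* Sattolo-style shuffle: one long cycle, so nearly every load misses */
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        double w0 = sec(CLOCK_MONOTONIC), c0 = sec(CLOCK_PROCESS_CPUTIME_ID);
        size_t p = 0;
        for (long i = 0; i < 20000000L; i++)
            p = next[p];              /* dependent loads: mostly memory stalls */
        double w1 = sec(CLOCK_MONOTONIC), c1 = sec(CLOCK_PROCESS_CPUTIME_ID);

        printf("wall %.2f s, cpu %.2f s (p=%zu)\n", w1 - w0, c1 - c0, p);
        free(next);
        return 0;
    }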
I don't get the link between the OS CPU usage statistics and the optimal use of the pipeline. I think they are uncorrelated as the OS doesn't measure the pipeline load.
I'm writing this in the hope that Peter Cordes can help me understand it better and as a continuation of the comments.
User programs relinquish control to the OS very often: when they need input from the user or when they are done with a signal/message. GUI programs are basically just big loops, and at each iteration control is given to the OS until the next message.
When the OS has control, it schedules other threads/tasks and, if no other actions are needed, just enters the idle process (long ago a tight loop, now a sleep state) until the next interrupt. This is the idle time.
Time spent in an ISR processing user input is considered idle time by any OS. A cache miss there would still be considered idle time.
A heavy program takes more time to complete the work for a given message, thereby returning control to the OS, say, 2 times per second instead of 20.
If the OS measures that in the last second it got control for only 20 ms, then the CPU usage is (1000 - 20) / 1000 = 98%.
This has nothing to do with the optimal use of the CPU architecture; as said, stalls can occur in the OS code and still be part of the idle-time statistic.
CPU utilization at the pipeline level is not what is measured, and it is orthogonal to the OS statistics.
CPU usage is meant for sysadmins: it is a measure of the load you put on a system, not a measure of how efficiently a program's assembly was generated.
Sysadmins can't help with that, but measuring how often the OS got control back (without preempting) is a measure of how much load a program is putting on the system. And sysadmins can definitely terminate heavy programs.
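On Linux, that "usage is whatever was not idle" bookkeeping can be observed directly: the sketch below samples the aggregate cpu line of /proc/stat twice and reports the non-idle fraction of ticks, which is essentially what top and task managers display:

    /* CPU usage over one second, computed from /proc/stat idle accounting. */
    #include <stdio.h>
    #include <unistd.h>

    struct cpu_times { unsigned long long idle, total; };

    static int read_cpu(struct cpu_times *t)
    {
        unsigned long long v[10] = {0};
        FILE *f = fopen("/proc/stat", "r");
        if (!f) return -1;
        /* cpu  user nice system idle iowait irq softirq steal guest guest_nice */
        int n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                       &v[0], &v[1], &v[2], &v[3], &v[4],
                       &v[5], &v[6], &v[7], &v[8], &v[9]);
        fclose(f);
        if (n < 4) return -1;
        t->idle = v[3] + v[4];                     /* idle + iowait */
        t->total = 0;
        for (int i = 0; i < 10; i++) t->total += v[i];
        return 0;
    }

    int main(void)
    {
        struct cpu_times a, b;
        if (read_cpu(&a)) return 1;
        sleep(1);
        if (read_cpu(&b)) return 1;
        double idle = (double)(b.idle - a.idle);
        double total = (double)(b.total - a.total);
        printf("CPU usage over 1 s: %.1f%%\n", 100.0 * (1.0 - idle / total));
        return 0;
    }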