rcu_sched vs rcu_preempt CPU stalls - linux-kernel

I have a basic query regarding RCU CPU stalls. Sometimes the message is INFO: rcu_sched self-detected stall on CPU, and sometimes it is INFO: rcu_preempt detected stall on CPUs/task. What is the difference between the two?
Does it depend on which RCU flavor is enabled in the kernel config? For instance, with CONFIG_PREEMPT=y, will the stall warnings read "INFO: rcu_preempt"?
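For illustration, here is a rough sketch (not real kernel source) of the idea behind the prefix: the stall warning simply carries the name of the RCU flavor built into the kernel, and that flavor is selected by preemption-related config options such as CONFIG_PREEMPT / CONFIG_PREEMPT_RCU.

    /* Illustrative sketch only -- not actual kernel code. It mimics how the
     * stall-warning prefix follows the RCU flavor chosen at build time.
     * Build with -DCONFIG_PREEMPT_RCU to emulate a preemptible kernel.
     */
    #include <stdio.h>

    int main(void)
    {
    #ifdef CONFIG_PREEMPT_RCU
        const char *flavor = "rcu_preempt";   /* preemptible RCU */
    #else
        const char *flavor = "rcu_sched";     /* classic, non-preemptible RCU */
    #endif
        printf("INFO: %s detected stall on CPUs/tasks ...\n", flavor);
        return 0;
    }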

Related

Flame graph (perf record) cannot display accurate CPU idle usage

When CPU usage is 60%, a flame graph (perf record) is used to capture the CPU usage. Why is the remaining 40% of idle-related stack usage not displayed in the flame graph? The idle stack usually accounts for less than 5%.
For flame graphs, the point is normally to measure where a process spends CPU time while it's running, not which blocking functions it calls that make it sleep, or where it gets scheduled out and sleeps when it doesn't want to.
I capture performance for one CPU, not one process. According to operating system design, if there is no runnable task on a CPU, the CPU calls an idle-wait function; for example, Linux often calls schedule_idle until it is interrupted by a new task. Therefore I expected schedule_idle to show up in the flame graph and account for 40% of the CPU usage.
Perf events like cycles don't increment when the clock is halted (e.g. cycles is cpu_clk_unhalted.thread_p or similar). If you really wanted to see time spent idle, you might be able to disable idle power saving to get Linux to just spin in a loop instead of using x86 monitor/mwait or even basic hlt to put the CPU into a C-state where the clock doesn't tick.
Or run your code pinned to one logical core, and on the other logical core, pin a task that runs the pause instruction in a loop. So the physical core's clock keeps ticking for the core you're counting events for.
You should still get counts for cpu_clk_unhalted.thread_any ([Core cycles when at least one thread on the physical core is not in halt state]) when recording that event on the logical core with your task, even when that logical core is asleep.
And you can also record counts for cpu_clk_unhalted.thread to count cycles when this (hardware) thread aka logical core isn't halted, to know how much CPU time you actually used. (Or use the software event task-clock for that.)
Use perf list to see events available on your CPU, and read their descriptions carefully.
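If you want to try the sibling-hyperthread trick described above, here is a minimal sketch (assuming x86 and Linux; logical core 1 is just a placeholder for the sibling of the core you are profiling) of a task that spins on the pause instruction so the physical core's clock never halts:

    // Minimal sketch of the pause-loop spinner suggested above.
    // Assumptions: x86, Linux, and that logical core 1 is the hyperthread
    // sibling of the core being profiled (adjust to your topology).
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <immintrin.h>          /* _mm_pause() */

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(1, &set);           /* pin to the sibling logical core */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        for (;;)
            _mm_pause();            /* spin politely; the core stays unhalted */
    }

Run it alongside your workload while recording cpu_clk_unhalted.thread_any on the core under test.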

High CPU/Memory use by the vmcompute process

We are running a Windows 2019 cluster in AWS ECS.
From time to time the instances run into problems with high CPU and memory usage that is not related to container usage.
When checking the instances, we can see that the vmcompute process has spiked its memory usage (commit) to up to 90% of system memory, with an average CPU usage of at least 30-40%.
I fail to understand why this is happening and whether it is a real issue, or whether the memory and CPU usage will decrease once more load is put onto the containers.

Is it a good practice to set interrupt affinity and io handling thread affinity to the same core?

I am trying to understand irq affinity and its impact to system performance.
I went through why-interrupt-affinity-with-multiple-cores-is-not-such-a-good-thing, and learned that NIC IRQ affinity should be set to a core other than the one handling the network data, so that the data handling is not interrupted by incoming IRQs.
I doubt this: if the data is handled on a different core than the one taking the IRQ, we will get more cache misses when retrieving the network data from the kernel. So I actually believe that setting the IRQ affinity to the same core as the thread handling the incoming data will improve performance, due to fewer cache misses.
I am trying to come up with some verification code, but before I present any results: am I missing something?
IRQ affinity is a double-edged sword. In my experience it can improve performance, but only in a very specific configuration with a pre-defined workload. As far as your question is concerned (consider only the RX path): typically when a NIC interrupts one of the cores, in the majority of cases the interrupt handler does not do much, except trigger a mechanism (bottom half, tasklet, kernel thread, or networking-stack thread) to process the incoming packet in some other context. If the same core that processes packets also handles the interrupts (even though the ISR does little), it is bound to lose some cache benefit due to the context switches and may see more cache misses. How big the impact is depends on a variety of other factors.
In NIC drivers, each RX queue is typically given its own core affinity (spreading RX-queue processing across different cores), which provides more of a performance benefit.
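As a concrete illustration of steering an interrupt, a minimal sketch that writes a CPU mask to /proc/irq/<N>/smp_affinity; the IRQ number 42 and the mask are hypothetical placeholders, and in practice this is usually done from a shell or by an irqbalance policy:

    /* Minimal sketch: pin a (hypothetical) NIC RX-queue IRQ to CPU 2 by
     * writing a bitmask to /proc/irq/<N>/smp_affinity. Requires root.
     */
    #include <stdio.h>

    int main(void)
    {
        const int irq = 42;        /* hypothetical IRQ number of the RX queue */
        const char *mask = "4";    /* CPU 2 -> bitmask 0b100 */
        char path[64];

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        FILE *f = fopen(path, "w");
        if (!f) {
            perror("fopen");
            return 1;
        }
        fprintf(f, "%s\n", mask);
        fclose(f);
        return 0;
    }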
Controlling interrupt affinity has quite a few useful applications. A few that spring to mind:
isolating cores - preventing I/O load from spreading onto mission-critical CPU cores, or onto cores reserved for RT-priority work (scheduled softirqs could starve on those)
increasing system performance - throughput and latency - by keeping interrupts on the relevant NUMA node on multi-CPU systems
CPU efficiency - e.g. dedicating a CPU and a NIC channel with its interrupt to a single application will squeeze more out of that CPU, about 30% more in heavy-traffic use cases (see the sketch after this list).
This might require flow steering to make incoming traffic target the channel.
Note: without affinity the app might seem to deliver more throughput, but it would spill load onto many cores. The gains come from cache locality and avoided context switches.
latency - setting up the application as above can halve the latency (from 8 µs to 4 µs on a modern system)
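The application-side half of that setup might look like the following minimal sketch, which pins the packet-processing thread to the same core the NIC channel's interrupt was steered to (core 2 here is only an example, not taken from the answer above):

    /* Minimal sketch: pin the processing thread to the CPU that also
     * receives the NIC channel's interrupt, for cache locality.
     * The core number (2) is a placeholder.
     */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    static void pin_current_thread(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (err != 0)
            fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
    }

    int main(void)
    {
        pin_current_thread(2);      /* same CPU as the IRQ above */
        /* ... run the receive/processing loop here ... */
        return 0;
    }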

Intentionally high CPU usage, GCD, QOS_CLASS_BACKGROUND, and spindump

I am developing a program that happens to use a lot of CPU cycles to do its job. I have noticed that it, and other CPU-intensive tasks like iMovie import/export or Grapher Examples, will trigger a spindump report, logged in Console:
1/21/16 12:37:30.000 PM kernel[0]: process iMovie[697] thread 22740 caught burning CPU! It used more than 50% CPU (Actual recent usage: 77%) over 180 seconds. thread lifetime cpu usage 91.400140 seconds, (87.318264 user, 4.081876 system) ledger info: balance: 90006145252 credit: 90006145252 debit: 0 limit: 90000000000 (50%) period: 180000000000 time since last refill (ns): 116147448571
1/21/16 12:37:30.881 PM com.apple.xpc.launchd[1]: (com.apple.ReportCrash[705]) Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.ReportCrash
1/21/16 12:37:30.883 PM ReportCrash[705]: Invoking spindump for pid=697 thread=22740 percent_cpu=77 duration=117 because of excessive cpu utilization
1/21/16 12:37:35.199 PM spindump[423]: Saved cpu_resource.diag report for iMovie version 9.0.4 (1634) to /Library/Logs/DiagnosticReports/iMovie_2016-01-21-123735_cudrnaks-MacBook-Pro.cpu_resource.diag
I understand that high CPU usage may be associated with software errors, but some operations simply require high CPU usage. It seems a waste of resources to watch-dog and report processes/threads that are expected to use a lot of CPU.
In my program, I use four serial GCD dispatch queues, one for each core of the i7 processor. I have tried using QOS_CLASS_BACKGROUND, and spin dump recognizes this:
Primary state: 31 samples Non-Frontmost App, Non-Suppressed, Kernel mode, Thread QoS Background
The fan spins much more slowly when using QOS_CLASS_BACKGROUND instead of QOS_CLASS_USER_INITIATED, and the program takes about 2x longer to complete. As a side issue, Activity Monitor still reports the same % CPU usage and even longer total CPU Time for the same task.
Based on Apple's Energy Efficiency documentation, QOS_CLASS_BACKGROUND seems to be the proper choice for something that takes a long time to complete:
Work takes significant time, such as minutes or hours.
So why then does it still complain about using a lot of CPU time? I've read about methods to disable spindump, but these methods disable it for all processes. Is there a programmatic way to tell the system that this process/thread is expected to use a lot of CPU, so don't bother watch-dogging it?
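For reference, a minimal sketch (compile with clang on macOS, which supports blocks) of the kind of setup described above: a serial GCD queue created with the QOS_CLASS_BACKGROUND attribute. The queue label and the work block are placeholders, not the asker's code.

    /* Minimal sketch of a serial GCD queue at background QoS. */
    #include <dispatch/dispatch.h>
    #include <stdio.h>

    int main(void)
    {
        dispatch_queue_attr_t attr =
            dispatch_queue_attr_make_with_qos_class(DISPATCH_QUEUE_SERIAL,
                                                    QOS_CLASS_BACKGROUND, 0);
        dispatch_queue_t queue = dispatch_queue_create("com.example.worker", attr);
        dispatch_group_t group = dispatch_group_create();

        dispatch_group_async(group, queue, ^{
            /* long-running, CPU-bound work would go here */
            printf("background work done\n");
        });

        dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
        return 0;
    }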

How the kernel's different subsystems share CPU time

Processes in userspace are scheduled by the kernel scheduler to get processor time, but how do the different kernel tasks get CPU time? I mean, when no userspace process is requesting CPU time (so the CPU is idle, executing NOP instructions), but some kernel subsystem needs to carry out a task regularly, are timers and other hardware and software interrupts the common methods of getting CPU time in kernel space?
It's pretty much the same scheduler. The only difference I can think of is that kernel code has much more control over execution flow; for example, it can call the scheduler directly via schedule().
Also, in the kernel you have three execution contexts: hardware interrupt, softirq/bottom half, and process. In hard (and probably soft) interrupt context you can't sleep, so no scheduling is done while executing code in that context.
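A minimal kernel-module sketch (my own illustration, not from the answer above) of one common way a kernel subsystem gets CPU time: a kthread does its periodic work and then sleeps, letting the ordinary scheduler run everything else in between. The module name and the interval are placeholders.

    /* Minimal sketch: a kthread that does periodic housekeeping and
     * voluntarily sleeps between runs.
     */
    #include <linux/module.h>
    #include <linux/kthread.h>
    #include <linux/delay.h>
    #include <linux/err.h>

    static struct task_struct *worker;

    static int worker_fn(void *data)
    {
        while (!kthread_should_stop()) {
            /* ... the subsystem's regular work would go here ... */
            msleep_interruptible(1000);   /* give the CPU back; scheduler runs others */
        }
        return 0;
    }

    static int __init demo_init(void)
    {
        worker = kthread_run(worker_fn, NULL, "demo_worker");
        return IS_ERR(worker) ? PTR_ERR(worker) : 0;
    }

    static void __exit demo_exit(void)
    {
        kthread_stop(worker);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");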
