Flame graph(perf record) cannot display accurate CPU idle usage - performance

When the CPU usage is 60%, the flame graphs(perf record) is used to capture the CPU usage. Why is 40% idle-related stack usage not displayed in the flame graphs? The usage of the idle stack is often less than 5%.

For flame graphs, the point is normally to measure where a process spends CPU time while it's running, not which blocking functions it calls that make it sleep, or where it gets scheduled out and sleeps when it doesn't want to.
I capture performance for one cpu processor, not one process. According to the operating system design, if there is no active task on the CPU, the CPU calls an idle waiting function. For example, Linux often calls schedule_idle until it is interrupted by a new task. Therefore, it is expected that the schedule_idle can be found in flame gragh and it consumes 40% of the cpu usage.
Perf events like cycles don't increment when the clock is halted (e.g. cycles is cpu_clk_unhalted.thread_p or similar). If you really wanted to see time spend idle, you might be able to disable idle power saving to get Linux to just spin in a loop instead of using x86 monitor/mwait or even basic hlt to put the CPU into a C-state where the clock doesn't tick.
Or run your code pinned to one logical core, and on the other logical core, pin a task that runs the pause instruction in a loop. So the physical core's clock keeps ticking for the core you're counting events for.
You should still get counts for cpu_clk_unhalted.thread_any ([Core cycles when at least one thread on the physical core is not in halt state]) when recording that event on the logical core with your task, even when that logical core is asleep.
And you can also record counts for cpu_clk_unhalted.thread to count cycles when this (hardware) thread aka logical core isn't halted, to know how much CPU time you actually used. (Or use the software event task-clock for that.)
Use perf list to see events available on your CPU, and read their descriptions carefully.

Related

Should CPU time always be identical between executions of same code?

My understanding of CPU time is that it should always be the same between every execution, on a same machine. It should require an identical amount of cpu cycles every time.
But I'm running some tests now, of executing a basic echo "Hello World", and it's giving me 0.003 to 0.005 seconds.
Is my understanding of CPU time wrong, or there's an issue in my measurement?
Your understanding is completely wrong. Real-world computers running modern OSes on modern CPUs are not simple, theoretical abstractions. There are all kinds of factors that can affect how much CPU time code requires to execute.
Consider memory bandwidth. On a typical modern machine, all the tasks running on the machine's cores are competing for access to the system memory. If the code is running at the same time code on another core is using lots of memory bandwidth, that may result in accesses to RAM taking more clock cycles.
Many other resources are shared as well, such as caches. Say the code is frequently interrupted to let other code run on the core. That will mean that the code will frequently find the cache cold and take lots of cache misses. That will also result in the code taking more clock cycles.
Let's talk about page faults as well. The code itself may be in memory or it may not be when the code starts running. Even if the code is in memory, you may or may not take soft page faults (to update the operating system's tracking of what memory is being actively used) depending on when that page last took a soft page fault or how long ago it was loaded into RAM.
And your basic hello world program is doing I/O to the terminal. The time that takes can depend on what else is interacting with the terminal at the time.
The biggest effects on modern systems include:
virtual memory lazily paging in code and data from disk if it's not hot in pagecache. (First run of a program tends to have a lot more startup overhead.)
CPU frequency isn't fixed. (idle / turbo speeds. grep MHz /proc/cpuinfo).
CPU caches can be hot or not
(for very short intervals) an interrupt randomly happening or not in your timed region.
So even if cycles were fixed (which they very much are not), you wouldn't see equal times.
Your assumption is not totally wrong, but it only applies to core clock cycles for individual loops, and only to cases that don't involve any memory access. (e.g. data already hot in L1d cache, code already hot in L1i cache inside a CPU core). And assuming no interrupt happens while the timed loop is running.
Running a whole program is a much larger scale of operation and will involve shared resources (and possible contention for them) like access to main memory. And as #David pointed out, a write system call to print a string on a terminal emulator - that communication with another process can be slow and involves waking up another process, if your program ends up waiting for it. Redirecting to /dev/null or a regular file would remove that, or just closing stdout like ./hello >&- would make your write system call return -EBADF (on Linux).
Modern CPUs are very complex beasts. You presumably have an Intel or AMD x86-64 CPU with out-of-order execution, and a dozen or so buffers for incoming / outgoing cache lines, allowing it to track about that many outstanding cache misses (memory-level parallelism). And 2 levels of private cache per core, and a shared L3 cache. Good luck predicting an exact number of clock cycles for anything but the most controlled conditions.
But yes, if you do control the condition, the same small loop will typically run at the same number of core clock cycles per iteration.
However, even that's not always the case. I've seen cases where the same loop seems to have have two stable states for how the CPU schedules instructions. Different entry condition quirks can lead to an ongoing speed difference over millions of loop iterations.
I've seen this occasionally when microbenchmarking stuff on modern Intel CPUs like Sandybridge and Skylake. It's usually not clear exactly what the two stable states are, and what exactly is causing the bottleneck, even with the help of performance counters and https://agner.org/optimize
In one case I remember, an interrupt tended to get the loop into the efficient mode of execution. #BeeOnRope was measuring slow cycles/iteration using or RDPMC for a short interval (or maybe RDTSC with core clock fixed = TSC reference clocks), while I was measuring it running faster by using a really large repeat count and just using perf stat on the whole program (which was a static executable with just that one loop written by hand in asm). And #Bee was able to repro my results by increasing the iteration count so an interrupt would happen inside the timed region, and returning from the interrupt tended to get the CPU out of that non-optimal uop-scheduling pattern, whatever it was.

Why will "while true" use 100% of CPU resources?

I ran the following Java code on a Linux server:
while (true) {
int a = 1+2;
}
It caused one CPU core to reach 100% usage. I'm confused about this, because I learnt that CPUs deal with tasks by time splitting, meaning that the CPU will do one task in one time slot (CPU time range scheduler). If there are 10 time slots, the while true task should have at most use 10% CPU usage, because the other 90% would be assigned to other tasks. So why is it 100%?
If your CPU usage is not at 100%, then the process can use as much as it wants (up to 100%), until other processes are requesting the use of the resource. The process scheduler attempts to maximize CPU usage, and never leave a process starved of CPU time if there are no other processes that need it.
So your while loop will use 100% of available idle CPU resources, and will only begin to use less when other CPU intensive processes are started up. (if you're using Linux/Unix, you could observe this with top by starting up your while loop, then starting another CPU intensive process and watch the % CPU drop for the process with the loop).
In a multitasking OS the CPU time is split between execution flows (processes, threads) - that's true. So what happens here - the loop is being executed until some point when clock interruption occurs which is used by OS to schedule next execution "piece" from other process or thread. But as soon as your code doesn't reach specific points (input/output or other system calls which can switch the process into the "waiting" state, such as waiting for a synchronization objects sleep operations, etc.) the process keeps being in the "running" state which tells scheduler to keep it in the execution queue and reschedule its execution at the first best opportunity. If there are some "competitive" processes which keep their "running" state for long period of time as well the CPU usage will be shared between your and all these processes, otherwise your process execution will rescheduled immediately which will cause continuous high CPU consumption anyway.

How does a system wide profiler (e.g. perf) correlate counters with instructions?

I'm trying to understand how a system wide profiler works. Let's take linux perf as example. For a certain profiling time it can provide:
Various aggregated hadware performance counters
Time spent and hardware counters (e.g. #instructions) for each user space process and kernel space function
Information about context switches
etc.
The first thing I'm almost sure about is that the report is just an estimation of what's really happening. So I think there's some kernel module that launches software interrupts at a certain sampling rate. The lower the sampling rate, the lower the profiler overhead. The interrupt can read the model specific registers that store the performance counters.
The next part is to correlate the counters with the software that's running on the machine. That's the part I don't understand.
So where does the profiler gets its data from?
Can you interrogate for example the task scheduler to find out what was running when you interrupted him? Won't that affect the
execution of the scheduler (e.g. instead of continuing the
interrupted function it will just schedule another one, making the
profiler result not accurate). Is the list of task_struct objects available?
How can profilers even correlate HW
metrics even at instruction level?
So I think there's some kernel module that launches software interrupts at a certain sampling rate.
Perf is not module, it is part of the Linux kernel, implemented in
kernel/events/core.c and for every supported architecture and cpu model, for example arch/x86/kernel/cpu/perf_event*.c. But Oprofile was a module, with similar approach.
Perf generally works by asking PMU (Performance monitoring unit) of CPU to generate interrupt after N events of some hardware performance counter (Yokohama, slide 5 "• Interrupt when threshold reached: allows sampling"). Actually it may be implemented as:
select some PMU counter
initialize it to -N, where N is the sampling period (we want interrupt after N events, for example, after 2 millions of cycles perf record -c 2000000 -e cycles, or some N computed and tuned by perf when no extra option is set or -F is given)
set this counter to wanted event, and ask PMU to generate interrupt on overflow (ARCH_PERFMON_EVENTSEL_INT). It will happen after N increments of our counter.
All modern Intel chips supports this, for example, Nehalem: https://software.intel.com/sites/default/files/76/87/30320 - Nehalem Performance Monitoring Unit Programming Guide
EBS - Event Based Sampling. A technique in which counters are pre-loaded with a large negative count, and they are configured to interrupt the processor on overflow. When the counter overflows the interrupt service routine capture profiling data.
So, when you use hardware PMU, there is no additional work at timer interrupt with special reading of hardware PMU counters. There is some work to save/restore PMU state at task switch, but this (*_sched_in/*_sched_out of kernel/events/core.c) will not change PMU counter value for current thread nor will export it to user-space.
There is a handler: arch/x86/kernel/cpu/perf_event.c: x86_pmu_handle_irq which finds the overflowed counter and calls perf_sample_data_init(&data, 0, event->hw.last_period); to record the current time, IP of last executed command (it can be inexact because of out-of-order nature of most Intel microarchitetures, there is limited workaround for some events - PEBS, perf record -e cycles:pp), stacktrace data (if -g was used in record), etc. Then handler resets the counter value to the -N (x86_perf_event_set_period, wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask); - note the minus before left)
The lower the sampling rate, the lower the profiler overhead.
Perf allows you to set target sampling rate with -F option, -F 1000 means around 1000 irq/s. High rates are not recommended due to high overhead. Ten years ago Intel VTune recommended not more than 1000 irq/s (http://www.cs.utah.edu/~mhall/cs4961f09/VTune-1.pdf "Try to get about a 1000 samples per second per logical CPU."), perf usually don't allow too high rate for non-root (autotuned to lower rate when "perf interrupt took too long" - check in your dmesg; also check sysctl -a|grep perf, for example kernel.perf_cpu_time_max_percent=25 - which means that perf will try to use not more then 25 % of CPU)
Can you interrogate for example the task scheduler to find out what was running when you interrupted him?
No. But you can enable tracepoint at sched_switch or other sched event (list all available in sched: perf list 'sched:*'), and use it as profiling event for the perf. You can even ask perf to record stacktrace at this tracepoint:
perf record -a -g -e "sched:sched_switch" sleep 10
Won't that affect the execution of the scheduler
Enabled tracepoint will make add some perf event sampling work to the function with tracepoint
Is the list of task_struct objects available?
Only via ftrace...
Information about context switches
This is software perf event, just call to perf_sw_event with PERF_COUNT_SW_CONTEXT_SWITCHES event from sched/core.c (indirectly). Example of direct call - migration software event: kernel/sched/core.c set_task_cpu(): p->se.nr_migrations++; perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
PS: there are good slides on perf, ftrace and other profiling and tracing subsystems in Linux by Gregg: http://www.brendangregg.com/linuxperf.html
This is pretty much answers all three of your questions.
Profiling consits of two types: Counting and sampling. Counting measures the
overall
number
of events during the entire execution without offering any insight
regarding
the
instructions or functions that
generated
them
. On
the other hand,
sampling gives a correlation of
the events to the code
through captured samples of the Instruction Pointer
.
When sampling, the
kernel instructs the processor to issue an interrupt when
a chosen
event counter exceeds a
threshold. T
his interrupt is caught by the kernel and the sampled data
including the Instruction
Pointer
value are stored into a ring buffer. The buffer is polled periodically by the userspace
perf tool and its contents
written to disk.
In post processing, the Instruction Pointer is matched to
addresses in binary files, which can be translated into function names and such
Refer http://openlab.web.cern.ch/sites/openlab.web.cern.ch/files/technical_documents/TheOverheadOfProfilingUsingPMUhardwareCounters.pdf

What exactly is CPU load if instructions are executed one at a time?

I know this question has been asked many times in many different manners, but it's still not clear for me what the CPU load % means.
I'll start explaining how I perceive the concepts now (of course, I might, and sure will, be wrong):
A CPU core can only execute one instruction at a time. It will not execute the next instruction until it finishes executing the current one.
Suppose your box has one single CPU with one single core. Parallel computing is hence not possible. Your OS's scheduler will pick up a process, set the IP to the entry point, and send that instruction to the CPU. It won't move to the next instruction until the CPU finishes executing the current instruction. After a certain amount of time it will switch to another process, and so on. But it will never switch to another process if the CPU is currently executing an instruction. It will wait until the CPU becomes free to switch to another process. Since you only have one single core, you can't have two processes executing simultaneously.
I/O is expensive. Whenever a process wants to read a file from the disk, it has to wait until the disk accomplishes its task, and the current process can't execute its next instruction until then. The CPU is not doing anything while the disk is working, and so our OS will switch to another process until the disk finishes its job in order not to waste time.
Following these principles, I've come myself to the conclusion that CPU load at a given time can only be one of the following two values:
0% - Idle. CPU is doing nothing at all.
100% - Busy. CPU is currently executing an instruction.
This is obviously false as taskmgr reports %1, 12%, 15%, 50%, etc. CPU usage values.
What does it mean that a given process, at a given time, is utilizing 1% of a given CPU core (as reported by taskmgr)? While that given process is executing, what happens with the 99%?
What does it mean that the overall CPU usage is 19% (as reported by Rainmeter at the moment)?
If you look into the task manager on Windows there is Idle process, that does exactly that, it just shows amount of cycles not doing anything useful. Yes, CPU is always busy, but it might be just running in a loop waiting for useful things to come.
Since you only have one single core, you can't have two processes
executing simultaneously.
This is not really true. Yes, true parallelism is not possible with single core, but you can create illusion of one with preemptive multitasking. Yes, it is impossible to interrupt instruction, but it is not a problem because most of the instructions require tiny amount of time to finish. OS shares time with time slices, which are significantly longer than execution time of single instruction.
What does it mean that a given process, at a given time, is utilizing 1% of a given CPU core
Most of the time applications are not doing anything useful. Think of application that waits for user to click a button to start processing something. This app doesn't need CPU, so it sleeps most of the time, or every time it gets time slice it just goes into sleep (see event loop in Windows). GetMessage is blocking, so it means that thread will sleep until message arrives. So what CPU load really means? So imagine the app receives some events or data to do things, it will do operations instead of sleeping. So if it utilizes X% of CPU means that over sampling period of time that app used X% of CPU time. CPU time usage is average metric.
PS: To summarize concept of CPU load, think of speed (in terms of physics). There are instantaneous and average speeds, so speaking of CPU load, there also are instantaneous and average measurements. Instantaneous is always equal to either 0% or 100%, because at some point of time process either uses CPU or not. If process used 100% of CPU in the course of 250ms and didn't use for next 750ms then we can say that process loaded CPU for 25% with sampling period of 1 second (average measurement can only be applied with certain sampling period).
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages
A single-core CPU is like a single lane of traffic. Imagine you are a bridge operator ... sometimes your bridge is so busy there are cars lined up to cross. You want to let folks know how traffic is moving on your bridge. A decent metric would be how many cars are waiting at a particular time. If no cars are waiting, incoming drivers know they can drive across right away. If cars are backed up, drivers know they're in for delays.
This is basically what CPU load is. "Cars" are processes using a slice of CPU time ("crossing the bridge") or queued up to use the CPU. Unix refers to this as the run-queue length: the sum of the number of processes that are currently running plus the number that are waiting (queued) to run.
Also see: http://en.wikipedia.org/wiki/Load_(computing)

Which one will workload(usage) of the CPU-Core if there is a persistent cache-miss, will be 100%?

That is, if the core processor most of the time waiting for data from RAM or cache-L3 with cache-miss, but the system is a real-time (real-time thread priority), and the thread is attached (affinity) to the core and works without switching thread/context, what kind of load(usage) CPU-Core should show on modern x86_64?
That is, CPU usage is displayed as decrease only when logged in Idle?
And if anyone knows, if the behavior is different in this case for other processors: ARM, Power[PC], Sparc?
Clarification: shows CPU-usage in standard Task manager in OS-Windows
A hardware thread (logical core) that's stalled on a cache miss can't be doing anything else, so it still counts as busy for the purposes of task-managers / CPU time accounting / OS process scheduler time-slices / stuff like that.
This is true across all architectures.
Without hyperthreading, "hardware thread" / "logical core" are the same as a "physical core".
Morphcore / other on-the-fly changing between hyperthreading and a more powerful single core could make there be a difference between a thread that keeps many execution units busy, vs. a thread that is blocked on cache misses a lot of the time.
I don't get the link between the OS CPU usage statistics and the optimal use of the pipeline. I think they are uncorrelated as the OS doesn't measure the pipeline load.
I'm writing this in the hope that Peter Cordes can help me understand it better and as a continuation of the comments.
User programs relinquish control to OS very often: when they need input from user or when
they are done with the signal/message. GUI program are basically just big loops and at
each iteration control is given to the OS until the next message.
When the OS has the control it schedules others threads/tasks and if not other actions
are needed just enter the idle process (long time ago a tight loop, now a sleep state)
until the next interrupt. This is the Idle Time.
Time spent on an ISR processing user input is considered idle time by any OS.
An a cache miss there would be still considered idle time.
A heavy program takes more time to complete the work for a given message thereby returning
control to OS say 2 times in a second instead of
20.
If the OS measures that in the last second, it got control for 20ms only then the
CPU usage is (1000-20)/1000 = 98%.
This has nothing to do with the optimal use of the CPU architecture, as said stalls can
occur in the OS code and still be part of the Idle time statistic.
The CPU utilization at pipeline level is not what is measured and it is orthogonal to the
OS statistics.
CPU usage is meant to be used by sysadmin, it is a measure of the load you put on a system,
it is not the measure of how efficiently the assembly of a program was generated.
Sysadmins can't help with that, but measuring how often the OS got the control back (without
preempting) is a measure of how much load a program is putting on the system.
And sysadmins can definitively do terminate heavy programs.

Resources