How to analyze perf.data of perf sched record - linux-kernel

I have collected some perf data using :
perf sched record -g
I need to analyze the perf.data generated by this command.
I am using following command for analysis:
perf report
I see multiple sched events:
62K sched:sched_switch
0 sched:sched_stat_wait
0 sched:sched_stat_sleep
0 sched:sched_stat_iowait
120K sched:sched_stat_runtime
10 sched:sched_process_fork
31K sched:sched_wakeup
10 sched:sched_wakeup_new
873 sched:sched_migrate_task
After I open one of the events, I see something like :
+ 80.00% 0.00% ksmtuned bash [.] make_child
+ 80.00% 0.00% ksmtuned libc-2.17.so [.] __libc_fork
+ 80.00% 0.00% ksmtuned [kernel.kallsyms] [k] stub_clone
+ 80.00% 0.00% ksmtuned [kernel.kallsyms] [k] sys_clone
+ 80.00% 80.00% ksmtuned [kernel.kallsyms] [k] do_fork
+ 10.00% 0.00% bash bash [.] make_child
I am not able to interpret this information.
Following are the questions I have :
1) What do the first two columns of % values represent?
2) Why don't the % values add up to 100%?
3) What is the significance of these numbers?

perf sched record is a special variant of perf, and I think it is more correct to use the perf sched subcommands to analyze the resulting perf.data file.
There is a man page for perf sched: http://man7.org/linux/man-pages/man1/perf-sched.1.html
'perf sched record <command>' to record the scheduling events
of an arbitrary workload.
'perf sched latency' to report the per task scheduling latencies
and other scheduling properties of the workload.
'perf sched script' to see a detailed trace of the workload that
was recorded...
'perf sched replay' to simulate the workload that was recorded
via perf sched record. (this is done by starting up mockup threads
that mimic the workload based on the events in the trace. These
threads can then replay the timings (CPU runtime and sleep patterns)
of the workload as it occurred when it was recorded - and can repeat
it a number of times, measuring its performance.)
'perf sched map' to print a textual context-switching outline of
workload captured via perf sched record. Columns stand for
individual CPUs, and the two-letter shortcuts stand for tasks that
are running on a CPU. A '*' denotes the CPU that had the event, and
a dot signals an idle CPU.
There is also a letter from 2009 which describes the perf sched functionality: https://lwn.net/Articles/353295/ "[Announce] 'perf sched': Utility to capture, measure and analyze scheduler latencies and behavior". The recommended way to analyze perf sched record results is perf sched latency, not perf report:
... experimental version of a utility
that tries to meet this ambitious goal: the new 'perf sched' family of
tools that uses performance events to objectively characterise arbitrary
workloads from a scheduling and latency point of view.
'perf sched' has five sub-commands currently:
perf sched record # low-overhead recording of arbitrary workloads
perf sched latency # output per task latency metrics
perf sched map # show summary/map of context-switching
perf sched trace # output finegrained trace
perf sched replay # replay a captured workload using simulated threads
Desktop users would generally use 'perf sched record' to capture a trace
(which creates perf.data), and 'perf sched latency' tool to check
latencies (which analyzes the trace in perf.data). The other tools too
use an already recorded perf.data.
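Putting that together, a typical workflow for the question above would be something like the sketch below (the workload and the sort key are placeholders, not taken from the original question):
# Record scheduling events while running a workload
perf sched record -- sleep 10
# Per-task latency summary (max delay, average delay, number of switches)
perf sched latency --sort max
# Textual per-CPU context-switch map
perf sched map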


Why does `native_write_msr` dominate my profiling result?

I have a program that runs on a multi-threaded framework with Linux kernel 4.18 and an Intel CPU. I ran perf record -p pid -g -e cycles:u --call-graph lbr -F 99 -- sleep 20 to collect stack traces and generate a flame graph.
My program was running under a low workload, so the time spent on futex_wait is expected. But the top of the stack is a kernel function native_write_msr. According to What does native_write_msr in kernel do? and https://elixir.bootlin.com/linux/v4.18/source/arch/x86/include/asm/msr.h#L103, this function is used for performance counters. I have disabled the tracepoint in native_write_msr.
And pidstat -p pid 1 told me that the system CPU usage is quite low.
05:44:34 PM UID PID %usr %system %guest %CPU CPU Command
05:44:35 PM 1001 67441 60.00 4.00 0.00 64.00 11 my_profram
05:44:36 PM 1001 67441 58.00 7.00 0.00 65.00 11 my_profram
05:44:37 PM 1001 67441 61.00 3.00 0.00 64.00 11 my_profram
My questions are
Why does native_write_msr appear so many times in the stack traces (as a result, it occupies about 80% of the flame graph)? Is it a blocking operation, or does it release the CPU when called?
Why is the system CPU usage relatively low compared to the flame graph? According to the graph, 80% of the CPU time should belong to %system instead of %usr.
Any help is appreciated. If I missed any useful information, please comment.
Thank you very much!
From the flame graph, you can see that the native_write_msr function is called by the function schedule. When a running process is removed from a core (because it is migrated to another core or stopped by the scheduler to run another process), the scheduler needs to dump the process's perf data and clean up its perf configuration, so that the perf data of different processes does not get mixed up. The scheduler may need to write to an MSR in this step, thus calling native_write_msr. So native_write_msr is called so many times because scheduling or core migration happens very frequently.
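To check this explanation against your own workload, you could count the scheduler activity of the process with perf's software events, which are always available; a minimal sketch (the pid and duration are placeholders):
# Count context switches and migrations for the process over 10 seconds
perf stat -e context-switches,cpu-migrations -p <pid> -- sleep 10
A high context-switch or migration rate here would match the explanation above.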

"perf" alternate performance management tools that can run on Google Cloud Platform

I tried using the "perf" command in my Ubuntu virtual machine hosted on GCP. But when I run:
sudo perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./AppName
Output:
Performance counter stats for './AppName':
490.299513 task-clock (msec) # 0.081 CPUs utilized
<not supported> cycles
<not supported> instructions
<not supported> cache-references
<not supported> cache-misses
6.036963754 seconds time elapsed
Except for task-clock, all the others are shown as "not supported".
I want the cache-references and cache-misses data. So is there any alternative to "perf"?
I reproduced the same scenario in my project on Ubuntu and Red Hat, using different kinds of machines (N1, N2), and got the same output as you provided:
root@perf-test-ubuntu:~# perf stat -e task-clock,cycles,instructions,cache-references,cache-misses
^C
Performance counter stats for 'system wide':
44450.342564 task-clock (msec) # 1.000 CPUs utilized
<not supported> cycles
<not supported> instructions
<not supported> cache-references
<not supported> cache-misses
44.470234233 seconds time elapsed
I found some useful links [1][2]; it seems the Linux perf tool by default tries to use hardware performance monitoring counters. When your OS is virtualized, you have no direct access to all counters; several virtualization solutions may allow access to some basic counters, if configured. [1]
The data you want is in hardware-based Performance Monitoring Counters. These are typically NOT emulated by virtual machine environments, because they add overhead and are generally meaningless given that the underlying caching statistics are maintained on a per core/package basis, while the workload can be scheduled to run on a variety of cores, each with its own hardware-level statistics. [2]
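As a practical fallback inside a VM, perf's software events are implemented in the kernel and do not need the hardware PMU, so they still work; a minimal sketch (the event list is only an illustration, it cannot substitute for cache-references/cache-misses):
perf stat -e task-clock,context-switches,cpu-migrations,page-faults ./AppName
For real cache-miss numbers you would need a VM type or hypervisor configuration that exposes the hardware counters, as the links below explain.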
Hope this information is helpful for you.
[1] Linux perf events profiling in Google Compute Engine not working
[2] https://www.researchgate.net/post/Why_doesnt_perf_report_cache-refernces_cache-misses

Different profiling modes for different cores using perf

I have the following questions regarding perf.
a) Is it possible to run different profiling modes on different cores simultaneously? E.g. Core 0 with event-based sampling (sampling every N events) and Core 1 with free-running counter-based sampling?
b) In case a) is not possible, is it then possible to get a snapshot of the PMU counters on the other cores (Core 1) for every sample (overflow at N events) on Core 0?
P.S.: The platform is an RPi 3B+ based on the Arm Cortex-A53.
It is possible to operate different profiling modes on different cores of the CPU simultaneously.
perf also has a processor-wide mode wherein all threads running on the designated processors are monitored. Counts and samples are thus aggregated per CPU/core.
-C, --cpu=
Count only on the list of CPUs provided. Multiple CPUs can be
provided as a comma-separated list with no space: 0,1. Ranges of
CPUs are specified with -: 0-2. In per-thread mode, this option
is ignored. The -a option is still necessary to activate
system-wide monitoring. Default is to count on all CPUs.
Running both the free-running counting mode and the sampling mechanism of perf simultaneously is possible on different cores of the CPU, like below.
E.g. for CPU 0:
perf stat --cpu 0 -B dd if=/dev/zero of=/dev/null count=1000000
and for CPU 1:
perf record --cpu 1 sleep 20
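Both commands can also be started together from one shell, so that the counting and the sampling really do run at the same time; a minimal sketch based on the two commands above (workload and duration are placeholders):
# Free-running counters on CPU 0 while dd runs, sampling on CPU 1 for 20 s
perf stat --cpu 0 -B dd if=/dev/zero of=/dev/null count=1000000 &
perf record --cpu 1 sleep 20
wait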

What can be the reason for CPU load to NOT scale linearly with the number of traffic processing workers?

We are writing a Front End that is supposed to process a large volume of traffic (in our case it is Diameter traffic, but that may be irrelevant to the question). As a client connects, its server socket gets assigned to one of the Worker processes, which perform all the actual traffic processing. In other words, the Workers do all the work, and more Workers should be added when more clients get connected.
One would expect the CPU load per message to be the same for different numbers of Workers, because the Workers are totally independent and serve different sets of client connections. Yet our tests show that it takes more CPU time per message as the number of Workers grows.
To be more precise, the CPU load depends on the TPS (Transactions or Request-Responses per second) as follows.
For 1 Worker:
60K TPS - 16%, 65K TPS - 17%... i.e. ~0.26% CPU per KTPS
For 2 Workers:
80K TPS - 29%, 85K TPS - 30%... i.e. ~0.35% CPU per KTPS
For 4 Workers:
85K TPS - 33%, 90K TPS - 37%... i.e. ~0.41% CPU per KTPS
What is the explanation for this? Workers are independent processes and there is no inter-process communication between them. Also each Worker is single-threaded.
The programming language is C++.
This effect is observed on any hardware close to this one: 2 Intel Xeon CPUs, 4-6 cores, 2-3 GHz.
OS: RedHat Linux (RHEL) 5.8, 6.4
CPU load measurements are done using mpstat and top.
If either the size of the program code used by a worker or the size of the data processed by a worker (or both) is not small, the reason could be the reduced effectiveness of the various caches: the locality-over-time of how a single worker accesses its program code and/or its data is disturbed by other workers intervening.
The effect can be quite complicated to understand, because:
it depends massively on the structure of your code's computations,
modern CPUs have about three levels of cache,
each cache has a different size,
some caches are local to one core, others are not,
how often the workers intervene depends on your operating system's scheduling strategy
which gets even more complicated if there are multiple cores,
unless your programming language's run-time system also intervenes,
in which case it is more complicated still,
your network interface is a computer of its own and has a cache, too,
and probably more.
Caveat: Given the relatively coarse granularity of process scheduling, the effect of this ought not to be as large as it is, I think.
But then: Have you looked up how "percent of CPU" is even defined?
Until you reach CPU saturation on your machine you cannot be sure that the effect is actually as large as it looks. And when you do reach saturation, it may not be the CPU at all that is the bottleneck here, so are you sure you need to care about CPU load?
I completely agree with @Lutz Prechelt. Here I just want to add a method for investigating the issue, and the answer is perf.
Perf is a performance analysis tool in Linux which collects both kernel and userspace events and provides some nice metrics. It has been widely used in my team to find bottlenecks in CPU-bound applications.
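Output such as the block below typically comes from a perf stat run with the detailed-counters option enabled; a hedged sketch of the kind of invocation (the exact flags behind this output are not shown in the original, so this is an assumption):
# -d adds cache-related counters (L1-dcache, LLC) to the default event set
perf stat -d ./cache_line_test 0 1 2 3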
The output of perf looks like this:
Performance counter stats for './cache_line_test 0 1 2 3':
1288.050638 task-clock # 3.930 CPUs utilized
185 context-switches # 0.144 K/sec
8 cpu-migrations # 0.006 K/sec
395 page-faults # 0.307 K/sec
3,182,411,312 cycles # 2.471 GHz [39.95%]
2,720,300,251 stalled-cycles-frontend # 85.48% frontend cycles idle [40.28%]
764,587,902 stalled-cycles-backend # 24.03% backend cycles idle [40.43%]
1,040,828,706 instructions # 0.33 insns per cycle
# 2.61 stalled cycles per insn [51.33%]
130,948,681 branches # 101.664 M/sec [51.48%]
20,721 branch-misses # 0.02% of all branches [50.65%]
652,263,290 L1-dcache-loads # 506.396 M/sec [51.24%]
10,055,747 L1-dcache-load-misses # 1.54% of all L1-dcache hits [51.24%]
4,846,815 LLC-loads # 3.763 M/sec [40.18%]
301 LLC-load-misses # 0.01% of all LL-cache hits [39.58%]
It outputs your cache miss rate, which makes it easy to tune your program and see the effect.
I wrote an article about cache line effects and perf; you can read it for more details.
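Applied to the question above, one way to test the cache-locality hypothesis would be to compare the miss rates of a single Worker process as the number of Workers grows; a minimal sketch (the pid and duration are placeholders):
# Attach to one Worker and sample its cache behaviour for 10 seconds
perf stat -e instructions,cache-references,cache-misses -p <worker_pid> -- sleep 10
If the cache-miss rate per instruction rises with the Worker count, the locality explanation fits.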

_spin_unlock_irqrestore() has very high sampling rate in my kvm, why?

I ran a SPECjbb benchmark in my KVM virtual machine. It shows a drastic drop in throughput between Warehouse 2 and Warehouse 3 (the difference between them is just one additional concurrent task).
Then I used perf in my guest virtual machine. It shows that _spin_unlock_irqrestore has a very high sampling rate.
Events: 31K cycles
74.89% [kernel] [k] _spin_unlock_irqrestore
7.36% perf-1968.map [.] 0x7f84b913e064
6.82% [kernel] [k] __do_softirq
6.39% [kernel] [k] handle_IRQ_event
...
It seems that only 7.36% of the CPU time is spent running my Java program. Why is _spin_unlock_irqrestore's sampling rate so high? And what does it do?
It's bad reporting by perf, not cycles consumed by _spin_unlock_irqrestore.
When IRQs are disabled, perf's interrupts are not processed. Instead, they're processed when interrupts are re-enabled. When perf's interrupt handler looks at the instruction pointer, to see what code was running, it finds the function that enabled interrupts - quite often it's _spin_unlock_irqrestore.
So all you know is that the cycles were consumed by code that had interrupts disabled, and enabled them using _spin_unlock_irqrestore.
If you can get perf to use NMI (non-maskable interrupts), it could solve this problem.
I know that it can be done with oprofile (perf's predecessor) by changing the makefile, but don't know about perf.
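On bare metal, one way to reduce this kind of skid is to request a precise event, which uses the CPU's hardware-assisted sampling (e.g. Intel PEBS) instead of a regular interrupt; whether this works inside a KVM guest depends on the hypervisor exposing a virtual PMU, so treat the sketch below as an assumption to verify:
# :pp asks for a more precise instruction pointer at each sample
perf record -e cycles:pp -p <pid> -- sleep 10
perf report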
