Performance monitoring with perf


Disclaimer: I am new to perf and still trying to learn the ins/outs.
I have an executable that is running on my target system running Linux. I would like to use perf in order to profile/monitor its performance over time. For argument's sake, I am trying to prove that my CPU utilization, which is currently measured with top and collectD, can be replaced by monitoring it via perf, which would result in much more granular data.
Since we are trying to get a plot over time I have been using perf record -e cycles -p <pid>. Afterwards I can get the data to display via perf report.
Questions:
Just displaying the data with perf report shows me a summary of the entire data set that I took, correct?
If I were to run perf report -D I get a dump of all data. (Just as a side question the timestamp is uptime in ns correct?) Now I would assume that sample is based on the frequency that could be set in perf record correct? I have run into issues by taking the time delta of the timestamp and it appears to be recorded at a random interval.
Once I dump the data there is nothing in here that really shouts out "this is your count!!" So the assumption was that the "period" field from the dump is the raw count. Is this true? Meaning that if period = 100, I could assume that for that interval, my program used 100 cycles? Additionally, I am starting to get the feeling that this is not just for the application but for each library or kernel call that the program makes. I.e. if malloc is called, a different event will be recorded outlining the cycles taken by that call. So overall, how can I derive the duration of an event, the number of cycles from the event, and which event it actually was from this field, to get a true measure of CPU utilization?
If this application of perf is not what it was intended to do, then I would also like to know why not. Additionally, I think this same type of analysis would be useful for all of the other types of statistics, since you can then pinpoint in time when an anomaly occurred in your running code. Just for reference, I am running perf against top collecting at 1s. I am doing this since I want to compare top output to perf output. Any insight would be helpful since, as I said, I am still learning and new to this powerful tool.
(Linux Kernel version: 3.10.82)

ANSWER #1
Yes mostly. perf report does show you a summary of the trace collected. Samples collected by perf record are saved into a binary file called, by default, perf.data. The perf report command reads this file and generates a concise execution profile. By default, samples are sorted by functions with the most samples first. However, you can do much more detailed profiling also using this report.
ANSWER #2
You should ideally use perf script -D to get a trace of all data. By default perf script displays the timestamp with microsecond resolution. In kernels newer than the one you specify, a command-line switch (--ns) lets you display the time in nanoseconds as well. Here is the source:
Timestamp
It is quite hard to tell this without looking at what kind of "deltas" are you getting. Remember the period of collecting samples is usually tuned. There are two ways of specifying the rate at which to collect samples --
You can use perf record -c (c for count) to specify the period at which to collect samples. This means that for every c occurrences of the event you are measuring, you get one sample. You can then modify the sampling period and test various values. For example, with -c 2, the counter overflows at every two occurrences of the event you are measuring and a sample is recorded.
The other way to express the sampling period is to specify the average rate of samples per second (frequency), which you can do using perf record -F. So perf record -F 1000 will record around 1000 samples per second, and these samples will be generated when the hardware/PMU counter corresponding to the event overflows. This means that the kernel dynamically adjusts the sampling period, so the samples arrive at irregular moments.
You can see for yourself in code here:
How perf dynamically updates time
ANSWER #3
Why not? Ideally you should get the number of event samples collected if you do a perf report and just do a deeper analysis. Also, when you do a perf record and finish recording samples, you get a notification on the command line about the number of samples collected for the event you measured. (This may not be available in the kernel version you use; I would suggest you switch to a newer Linux version if possible!) The number of samples should be the raw count, not the period.
If your period is 100 - it means that for the whole duration of the trace, perf recorded every 100th event. That means, if a total of 1000 events happened for the trace duration, perf approximately collected event 1, 100, 200, 300...1000.
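To make the relationship between samples and the period concrete, here is a minimal Python sketch. It assumes you have already pulled the period value of each sample out of the raw dump into a list; the function name and the numbers are made up for illustration and are not part of perf.

```python
def estimate_event_count(periods):
    """Estimate the total number of events from per-sample periods.

    Each sample stands for roughly `period` occurrences of the event,
    so the estimate is simply the sum of the periods.
    """
    return sum(periods)


# Fixed period (perf record -c 100): 10 samples stand for about 1000 events.
fixed_periods = [100] * 10
print(estimate_event_count(fixed_periods))    # 1000

# Frequency mode (perf record -F): the period varies from sample to sample.
varying_periods = [1200, 950, 1010, 1100]     # hypothetical values
print(estimate_event_count(varying_periods))  # 4260
```

The result is an estimate, not an exact count, for the reasons given in the rest of this answer.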
Yes the samples recorded are not only from the application. In fact, you can use switches like this : perf record -e <event-name:u> or <event-name:k> (u for userspace and k for kernel) to record events. Additionally perf records samples from shared libraries as well. (Please consult the perf man-page for more details).
As I said previously, perf report should be an ideal tool to calculate the number of samples of the cycles event recorded by perf. The number of events collected/recorded is not exact because it is simply not possible for the hardware to record every cycle event. This is because recording and preparing the details of all the events requires the kernel to maintain a ring buffer, which gets written to as and when the counter overflows. This writing to the buffer happens via interrupts. They take up a fraction of CPU time; this time is lost, and events that occur while the CPU is busy servicing interrupts are not recorded. You can still get a really good estimate from perf, though.
CONCLUSION
perf does essentially what it is intended to do, given the limitations of the hardware resources we currently have at hand. I would suggest going through the man pages for each command to understand it better.
QUESTIONS
I assume you are looking at perf report. I also assume you are talking about the overhead % in perf report. Theoretically, it can be considered to be an arrangement of data from the highest to least occurrence as you specified. But, there are many underlying details that you need to consider and understand to properly make sense of the output. It represents which function has the most overhead (in terms of the number of events that occurred in that function ). There is also a parent-child relationship, based on which function calls which function, between all the functions and their overheads. Please use the Perf Report link to understand more.
As you know already events are being sampled, not counted. So you cannot accurately get the number of events, but you will get the number of samples and based on the tuned frequency of collecting samples, you will also get the raw count of the number of events ( Everything should be available to you with the perf report output ).
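Since the original goal was a plot over time, one possible way to get there is to bucket the samples into fixed time windows and sum the periods per window. This is only a sketch: it assumes you have parsed (timestamp, period) pairs out of the dump yourself, that the timestamps are in nanoseconds, and all names and values below are hypothetical.

```python
from collections import defaultdict


def events_per_window(samples, window_ns):
    """Sum sample periods into fixed-width time windows.

    samples: iterable of (timestamp_ns, period) pairs.
    Returns {window_start_ns: estimated_event_count}, sorted by time.
    """
    buckets = defaultdict(int)
    for timestamp_ns, period in samples:
        window_start = (timestamp_ns // window_ns) * window_ns
        buckets[window_start] += period
    return dict(sorted(buckets.items()))


# Hypothetical samples: (timestamp in ns, period in cycles).
samples = [
    (1_000_000_100, 500_000),
    (1_400_000_000, 700_000),
    (2_100_000_000, 300_000),
]

# One-second windows, comparable to top refreshing every 1s.
for start, cycles in events_per_window(samples, 1_000_000_000).items():
    print(start, cycles)
```

Dividing each window's cycle estimate by the cycles available in that window (clock rate times window length) would give a rough utilization figure, as long as the clock rate is known and stable.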

Related

Benchmarking - How to count number of instructions sent to CPU to find consumed MIPS

Consider I have a software and want to study its behavior using a black-box approach. I have a 3.0GHz CPU with 2 sockets and 4 cores. As you know, in order to find out instructions per second (IPS) we have to use the following formula:
IPS = sockets*(cores/sockets)*clock*(instructions/cycle)
At first, I wanted to find the number of instructions per cycle for my specific algorithm. Then I realised it's almost impossible to count it using a black-box approach and I would need to do an in-depth analysis of the algorithm.
But now, I have two questions: Regardless of what kind of software is running on my machine and its CPU usage, is there any way to count the number of instructions per second sent to the CPU (millions of instructions per second (MIPS))? And is it possible to find the type of instruction set (add, compare, in, jump, etc)?
Any piece of script or tool recommendation would be appreciated (in any language).

perf stat --all-user ./my_program on Linux will use CPU performance counters to record how many user-space instructions it ran and how many core clock cycles it took. It also reports how much CPU time it used, and calculates the average instructions per core clock cycle for you, e.g.
3,496,129,612 instructions:u # 2.61 insn per cycle
It calculates IPC for you; this is usually more interesting than instructions per second. uops per clock is usually even more interesting in terms of how close you are to maxing out the front-end, though. You can manually calculate MIPS from instructions and task-clock. For most other events perf prints a comment with a per-second rate.
(If you don't use --all-user, you can use perf stat -e task-clock:u,instructions:u , ... to have those specific events count in user-space only, while other events can count always, including inside interrupt handlers and system calls.)
But see How to calculate MIPS using perf stat for more detail on instructions / task-clock vs. instructions / elapsed_time if you do actually want total or average MIPS across cores, and counting sleep or not.
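As a rough illustration of the manual calculation, here is a small Python sketch. The instruction count is the one from the example output above; the cycle count and the two time values are hypothetical numbers picked only to make the arithmetic visible.

```python
def ipc(instructions, cycles):
    """Instructions per core clock cycle."""
    return instructions / cycles


def mips(instructions, seconds):
    """Millions of instructions per second over the given time base."""
    return instructions / seconds / 1e6


instructions = 3_496_129_612   # from the perf stat line above
cycles = 1_339_513_261         # hypothetical, chosen to give about 2.61 IPC
task_clock_s = 0.35            # hypothetical CPU time (task-clock), in seconds
elapsed_s = 0.50               # hypothetical wall-clock time, in seconds

print(f"IPC: {ipc(instructions, cycles):.2f}")
print(f"MIPS while on-CPU: {mips(instructions, task_clock_s):.0f}")
print(f"MIPS over elapsed time: {mips(instructions, elapsed_s):.0f}")
```

Which time base you divide by is exactly the task-clock vs. elapsed-time distinction mentioned above: the first ignores time spent sleeping, the second includes it.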
For an example output from using it on a tiny microbenchmark loop in a static executable, see Can x86's MOV really be "free"? Why can't I reproduce this at all?
How can I get real-time information at run-time
Do you mean from within the program, to profile only part of it? There's a perf API where you can do perf_event_open or something. Or use a different library for direct access to the HW perf counters.
perf stat is great for microbenchmarking a loop that you've isolated into a stand-alone program that just runs the hot loop for a second or so.
Or maybe you mean something else. perf stat -I 1000 ... ./a.out will print counter values every 1000 ms (1 second), to see how program behaviour changes in real time with whatever time window you want (down to 10ms intervals).
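If you want to post-process the interval output rather than read it by eye, perf stat can emit separator-delimited lines with -x. The sketch below parses such text in Python. The column order it assumes (timestamp, value, unit, event name, ...) is my reading of the CSV FORMAT section of the perf-stat man page for the plain -I case and may differ with other options or perf versions; the sample lines are invented.

```python
import csv
import io


def parse_interval_csv(text):
    """Parse `perf stat -I <ms> -x,` style text into (time_s, event, value) rows.

    Assumed columns: timestamp, counter value, unit, event name, ...
    Lines whose value is not numeric (e.g. "<not counted>") are skipped.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) < 4 or fields[0].lstrip().startswith("#"):
            continue
        try:
            time_s = float(fields[0])
            value = float(fields[1])
        except ValueError:
            continue
        rows.append((time_s, fields[3], value))
    return rows


# Invented sample in the assumed layout.
sample = """\
1.000231547,2610000000,,instructions,1000100000,100.00,2.61,insn per cycle
1.000231547,1000000000,,cycles,1000100000,100.00,,
2.000512345,<not counted>,,instructions,0,100.00,,
"""

for row in parse_interval_csv(sample):
    print(row)
```

From rows like these you can build a per-interval time series for each event and plot it.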
sudo perf top is system-wide, slightly like Unix top
There's also perf record --timestamp to record a timestamp with each event sample. perf report -D might be useful along with this. See http://www.brendangregg.com/perf.html, he mentions something about -T (--timestamp). I haven't really used this; I mostly isolate single loops I'm tuning into a static executable I can run under perf stat.
And is it possible to find the type of instruction set (add, compare, in, jump, etc)?
Intel x86 CPUs at least have a counter for branch instructions, but other types aren't differentiated, other than FP instructions. This is probably common to most architectures that have perf counters at all.
For Intel CPUs, there's ocperf.py, a wrapper for perf with symbolic names for more microarchitectural events. (Update: plain perf now knows the names of most uarch-specific counters so you don't need ocperf.py anymore.)
perf stat -e task-clock,cycles,instructions,fp_arith_inst_retired.128b_packed_single,fp_arith_inst_retired.scalar_double,uops_executed.x87 ./my_program
It's not designed to tell you what instructions are running, you can already tell that from tracing execution. Most instructions are fully pipelined, so the interesting thing is which ports have the most pressure. The exception is the divide/sqrt unit: there's a counter for arith.divider_active: "Cycles when divide unit is busy executing divide or square root operations. Accounts for integer and floating-point operations". The divider isn't fully pipelined, so a new divps or sqrtps can't always start even if no older uops are ready to execute on port 0. (http://agner.org/optimize/)
Related: linux perf: how to interpret and find hotspots for using perf to identify hotspots. Especially using top-down profiling you have perf sample the call-stack to see which functions make a lot of expensive child calls. (I mention this in case that's what you really wanted to know, rather than instruction mix.)
Related:
How do I determine the number of x86 machine instructions executed in a C program?
How to characterize a workload by obtaining the instruction type breakdown?
How do I monitor the amount of SIMD instruction usage
For exact dynamic instruction counts, you might use an instrumentation tool like Intel PIN, if you're on x86. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool.
perf stat counts for the instructions:u hardware event should also be more or less exact, and are in practice very repeatable across runs of the same program doing the same work.
On recent Intel CPUs, there's HW support for recording which way conditional / indirect branches went, so you can reconstruct exactly which instructions ran in which order, assuming no self-modifying code and that you can still read any JIT buffers. Intel PT.
Sorry I don't know what the equivalents are on AMD CPUs.

How to monitor CPU frequency reliably (in Linux)?

I am trying to monitor the CPU operating frequencies for individual cores. I am not sure what the correct way is to monitor the CPU frequency, both from the kernel level and the hardware level, reliably and with low overhead.
I would highly appreciate it if someone could answer a couple of questions that I have.
Let's say I am running an application by pinning it to a core. I would like to monitor what frequency it demands during its execution phase (start to end) and capture it. I want the accurate frequency that it demands at the hardware level (perhaps from the MSRs).
I am not sure what the accurate way to capture this is. Is there a way? Are there any tools or commands via which I can read the frequency value directly from the MSRs?
I have tried a couple of options, but I am not sure if they reflect the correct frequency:
NOTE: I am trying to sample the core's frequency every 10ms, 20ms, 30ms, ..... and so on.
from the kernel level:
I was reading a sysfs file:
cat /sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_cur_freq
I am not sure if the above gives the correct frequency value every 10ms, 20ms, etc. Is there any overhead associated with reading this file at a 10ms interval?
Then I tried the turbostat command, but it does not tell me the correct frequency for a particular core over a specified time interval; it tells me the busy% etc. I am looking for an accurate frequency for the sampling interval that I specify.
Questions:
What's the best and most reliable way to monitor CPU frequency from a systems perspective with very low overhead?
What's the minimum sampling interval that I can use to monitor CPU frequency? (I know this depends on the CPU governor.) I am currently assuming, and interested in, the ondemand governor being set for the core whose frequency I am trying to monitor.
It would be a great help if someone could guide me.

I assume that you want to track all changes between CPU frequencies on an application-event basis, not the summary that those sysfs files provide:
cpufreq has a tracepoint for this, trace_cpu_frequency(),
and you can add custom trace points to track your application's events, e.g. by writing messages to trace/trace_marker.
After execution you can see all the recorded events, including the cpufreq tracepoints and your trace marks.
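A minimal sketch of the trace_marker part in Python, in case it helps. The default path below is an assumption (tracefs is often mounted under /sys/kernel/debug/tracing or /sys/kernel/tracing), writing to it normally needs root, and the helper names are my own.

```python
from contextlib import contextmanager

# Assumed tracefs location; adjust to wherever tracefs is mounted on your system.
TRACE_MARKER = "/sys/kernel/debug/tracing/trace_marker"


def write_marker(message, path=TRACE_MARKER):
    """Write one message line into the trace buffer via trace_marker."""
    with open(path, "w") as marker:
        marker.write(message + "\n")


@contextmanager
def marked_region(name, path=TRACE_MARKER):
    """Bracket a block of application code with begin/end trace marks."""
    write_marker(f"begin {name}", path)
    try:
        yield
    finally:
        write_marker(f"end {name}", path)


# Usage (as root, with tracing enabled), for example:
#
#     with marked_region("hot-loop"):
#         run_workload()   # hypothetical application function
```

The marks then show up in the trace interleaved with the cpufreq events, so you can see which frequency changes happened inside the marked region.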

How does perf work?

I'm using perf to get an idea of the overhead each function of my program imposes on the total execution time. For that, I use cpu-cycles event:
perf record -e cpu-cycles -c 10000 <binary-with-arguments>
When I look at the output, I see some percentages associated with each function. But what doesn't make sense to me is a case like this: function A is called within function B and nowhere else. But the overhead percentage I get for function A is higher than B. If B calls A, that means B should include A's overhead. Or am I missing something here?

The perf command you are using only samples your program without recording any call stack information. With perf report you get the number of samples falling into each function, independently of their calling relations.
You can use the --call-graph option to get a tree when using perf report:
perf record -e cpu-cycles --call-graph dwarf -c 10000 <binary-with-arguments>
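To illustrate the difference between the two views, here is a toy Python sketch that computes "self" and "inclusive" sample counts from call stacks. It is not how perf report is implemented; the stacks are invented and listed with the outermost caller first.

```python
from collections import Counter


def self_and_inclusive(stacks):
    """Count samples per function, both self and inclusive.

    stacks: list of call stacks, outermost caller first; the last entry
    is the function that was executing when the sample was taken.
    """
    self_counts = Counter()
    inclusive_counts = Counter()
    for stack in stacks:
        self_counts[stack[-1]] += 1
        for function in set(stack):  # count each function once per sample
            inclusive_counts[function] += 1
    return self_counts, inclusive_counts


# Invented samples: A is only ever called from B.
stacks = [["main", "B", "A"]] * 7 + [["main", "B"]] * 2 + [["main"]]

self_counts, inclusive_counts = self_and_inclusive(stacks)
print("self:", self_counts["A"], self_counts["B"])                 # 7 2
print("inclusive:", inclusive_counts["A"], inclusive_counts["B"])  # 7 9
```

Without call stacks only the "self" numbers exist, which is why A can show a higher percentage than B even though B is its only caller.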

Perf uses the Model Specific Registers of your CPU for measurements like cycles or branch-misses.
A special part called the PMU (Performance Monitoring Unit) counts all kinds of events.
So if you measure just a few events of your program, there is practically no overhead, because the CPU's PMU works independently of the actual computation.
If you request more events than your PMU has counter registers, the measurement cycles through the events. Perf annotates this with [XX %].
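When events are multiplexed like this, the reported count is an extrapolation: the raw count is scaled by the ratio of the time the event was enabled to the time it was actually running on a counter. A small Python sketch of that scaling, with a function name of my own and made-up numbers:

```python
def scale_count(raw_count, time_enabled, time_running):
    """Extrapolate a multiplexed counter to the full enabled time.

    Returns None if the event never got onto a hardware counter.
    """
    if time_running == 0:
        return None
    return raw_count * time_enabled / time_running


# The event was on a counter for 1 of the 4 time units it was enabled,
# so it would be annotated with about 25 %.
raw, enabled, running = 1_000_000, 4, 1
print(scale_count(raw, enabled, running))  # 4000000.0
print(f"{100 * running / enabled:.0f} %")  # 25 %
```

The smaller the percentage, the more the number relies on the assumption that the program behaved the same while the event was not being counted.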

How does a system wide profiler (e.g. perf) correlate counters with instructions?

I'm trying to understand how a system wide profiler works. Let's take linux perf as example. For a certain profiling time it can provide:
Various aggregated hardware performance counters
Time spent and hardware counters (e.g. #instructions) for each user space process and kernel space function
Information about context switches
etc.
The first thing I'm almost sure about is that the report is just an estimation of what's really happening. So I think there's some kernel module that launches software interrupts at a certain sampling rate. The lower the sampling rate, the lower the profiler overhead. The interrupt can read the model specific registers that store the performance counters.
The next part is to correlate the counters with the software that's running on the machine. That's the part I don't understand.
So where does the profiler get its data from?
Can you interrogate, for example, the task scheduler to find out what was running when you interrupted it? Won't that affect the execution of the scheduler (e.g. instead of continuing the interrupted function it will just schedule another one, making the profiler result inaccurate)? Is the list of task_struct objects available?
How can profilers even correlate HW metrics at the instruction level?

So I think there's some kernel module that launches software interrupts at a certain sampling rate.
Perf is not a module; it is part of the Linux kernel, implemented in kernel/events/core.c and, for every supported architecture and CPU model, in files such as arch/x86/kernel/cpu/perf_event*.c. Oprofile, however, was a module with a similar approach.
Perf generally works by asking the PMU (Performance Monitoring Unit) of the CPU to generate an interrupt after N events of some hardware performance counter (Yokohama, slide 5 "• Interrupt when threshold reached: allows sampling"). Actually it may be implemented as:
select some PMU counter
initialize it to -N, where N is the sampling period (we want an interrupt after N events, for example after 2 million cycles with perf record -c 2000000 -e cycles, or some N computed and tuned by perf when no extra option is set or -F is given)
set this counter to the wanted event, and ask the PMU to generate an interrupt on overflow (ARCH_PERFMON_EVENTSEL_INT). It will happen after N increments of our counter.
All modern Intel chips support this, for example, Nehalem: https://software.intel.com/sites/default/files/76/87/30320 - Nehalem Performance Monitoring Unit Programming Guide
EBS - Event Based Sampling. A technique in which counters are pre-loaded with a large negative count, and they are configured to interrupt the processor on overflow. When the counter overflows the interrupt service routine capture profiling data.
So, when you use hardware PMU, there is no additional work at timer interrupt with special reading of hardware PMU counters. There is some work to save/restore PMU state at task switch, but this (*_sched_in/*_sched_out of kernel/events/core.c) will not change PMU counter value for current thread nor will export it to user-space.
There is a handler, arch/x86/kernel/cpu/perf_event.c: x86_pmu_handle_irq, which finds the overflowed counter and calls perf_sample_data_init(&data, 0, event->hw.last_period); to record the current time, the IP of the last executed instruction (it can be inexact because of the out-of-order nature of most Intel microarchitectures; there is a limited workaround for some events: PEBS, perf record -e cycles:pp), stacktrace data (if -g was used in record), etc. Then the handler resets the counter value to -N (x86_perf_event_set_period, wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask); - note the minus before left)
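The "preload with -N and wait for overflow" trick is easy to check with a few lines of Python. This is only an arithmetic sketch of the masking step, not kernel code; the 48-bit counter width is an assumption and varies between CPUs.

```python
COUNTER_BITS = 48                    # assumed counter width
MASK = (1 << COUNTER_BITS) - 1       # plays the role of cntval_mask


def preload_value(period):
    """Value written to the counter so it overflows after `period` events."""
    return (-period) & MASK


def events_until_overflow(counter_value):
    """How many increments are left before the counter wraps around."""
    return (MASK + 1) - counter_value


period = 2_000_000                   # as in perf record -c 2000000
start = preload_value(period)
print(hex(start))                    # 0xffffffe17b80
print(events_until_overflow(start))  # 2000000
```

So after every overflow interrupt the handler only has to write the masked negative period back to get the next sample N events later.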
The lower the sampling rate, the lower the profiler overhead.
Perf allows you to set a target sampling rate with the -F option; -F 1000 means around 1000 irq/s. High rates are not recommended due to high overhead. Ten years ago Intel VTune recommended not more than 1000 irq/s (http://www.cs.utah.edu/~mhall/cs4961f09/VTune-1.pdf "Try to get about a 1000 samples per second per logical CPU."). perf usually doesn't allow too high a rate for non-root users (it is autotuned to a lower rate when "perf interrupt took too long" appears; check your dmesg, and also check sysctl -a|grep perf, for example kernel.perf_cpu_time_max_percent=25, which means that perf will try to use not more than 25 % of CPU).
Can you interrogate for example the task scheduler to find out what was running when you interrupted him?
No. But you can enable tracepoint at sched_switch or other sched event (list all available in sched: perf list 'sched:*'), and use it as profiling event for the perf. You can even ask perf to record stacktrace at this tracepoint:
perf record -a -g -e "sched:sched_switch" sleep 10
Won't that affect the execution of the scheduler
An enabled tracepoint adds some perf event sampling work to the function that contains the tracepoint.
Is the list of task_struct objects available?
Only via ftrace...
Information about context switches
This is software perf event, just call to perf_sw_event with PERF_COUNT_SW_CONTEXT_SWITCHES event from sched/core.c (indirectly). Example of direct call - migration software event: kernel/sched/core.c set_task_cpu(): p->se.nr_migrations++; perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
PS: there are good slides on perf, ftrace and other profiling and tracing subsystems in Linux by Gregg: http://www.brendangregg.com/linuxperf.html

This pretty much answers all three of your questions.
Profiling consists of two types: counting and sampling. Counting measures the overall number of events during the entire execution without offering any insight regarding the instructions or functions that generated them. On the other hand, sampling gives a correlation of the events to the code through captured samples of the Instruction Pointer. When sampling, the kernel instructs the processor to issue an interrupt when a chosen event counter exceeds a threshold. This interrupt is caught by the kernel and the sampled data including the Instruction Pointer value are stored into a ring buffer. The buffer is polled periodically by the userspace perf tool and its contents written to disk. In post processing, the Instruction Pointer is matched to addresses in binary files, which can be translated into function names and such.
Refer to http://openlab.web.cern.ch/sites/openlab.web.cern.ch/files/technical_documents/TheOverheadOfProfilingUsingPMUhardwareCounters.pdf
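The post-processing step (matching a sampled Instruction Pointer to a function) can be pictured as a sorted lookup in a symbol table. A toy Python sketch under that assumption; the addresses and names are invented, and a real tool also has to handle load addresses, multiple binaries and missing symbols.

```python
import bisect


def build_resolver(symbols):
    """Return a function mapping an instruction pointer to a symbol name.

    symbols: list of (start_address, size, name) tuples.
    """
    table = sorted(symbols)
    starts = [start for start, _, _ in table]

    def resolve(ip):
        # Find the last symbol that starts at or before ip.
        index = bisect.bisect_right(starts, ip) - 1
        if index < 0:
            return None
        start, size, name = table[index]
        return name if ip < start + size else None

    return resolve


# Invented symbol table: (start address, size, name).
symbols = [(0x1000, 0x40, "main"), (0x1040, 0x80, "parse"), (0x2000, 0x20, "helper")]
resolve = build_resolver(symbols)

print(resolve(0x1050))  # parse
print(resolve(0x10C0))  # None (falls in the gap after parse)
```

Counting how many sampled IPs resolve to each name is what produces the per-function percentages in a report.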

dotTrace - what profiling settings should I use for my desktop app?

When using dotTrace, I have to pick a profiling mode and a time measurement method. Profiling modes are:
Tracing
Line-by-line
Sampling
And time measurement methods are:
Wall time (performance counter)
Thread time
Wall time (CPU instruction)
Tracing and line-by-line can't use thread time measurement. But that still leaves me with seven different combinations to try. I've now read the dotTrace help pages on these well over a dozen times, and I remain no more knowledgeable than I started out about which one to pick.
I'm working on a WPF app that reads Word docs, extracts all the paragraphs and styles, and then loops through that extracted content to pick out document sections. I'm trying to optimize this process. (Currently it takes well over an hour to complete, so I'm trying to profile it for a given length of time rather than until it finishes.)
Which profiling and time measurement types would give me the best results? Or if the answer is "It depends", then what does it depend on? What are the pros and cons of a given profiling mode or time measurement method?

Profiling types:
Sampling: fastest but least accurate profiling-type, minimum profiler overhead. Essentially equivalent to pausing the program many times a second and viewing the stacktrace; thus the number of calls per method is approximate. Still useful for identifying performance bottlenecks at the method-level.
Snapshots captured in sampling mode occupy a lot less space on disk (I'd say 5-6 times less space).
Use for initial assessment or when profiling a long-running application (which sounds like your case.)
Tracing: Records the duration taken for each method. App under profiling runs slower but in return, dotTrace shows exact number of calls of each function, and function timing info is more accurate. This is good for diving into details of a problem at the method-level.
Line-by-line: Profiles the program on a per-line basis. Largest resource hog but most fine-grained profiling results. Slows the program way down. The preferred tactic here is to initially profile using another type, and then hand-pick functions for line-by-line profiling.
As for meter kinds, I think they are described quite well in Getting started with dotTrace Performance by the great Hadi Hariri.
Wall time (CPU Instruction): This is the simplest and fastest way to measure wall time (that is, the time we observe on a wall clock). However, on some older multi-core processors this may produce incorrect results due to the cores timers being desynchronized. If this is the case, it is recommended to use Performance Counter.
Wall time (Performance Counter): Performance counters is part of the Windows API and it allows taking time samples in a hardware-independent way. However, being an API call, every measure takes substantial time and therefore has an impact on the profiled application.
Thread time: In a multi-threaded application concurrent threads contribute to each other's wall time. To avoid such interference we can use thread time meter which makes system API calls to get the amount of time given by the OS scheduler to the thread. The downsides are that taking thread time samples is much slower than using CPU counter and the precision is also limited by the size of quantum used by thread scheduler (normally 10ms). This mode is only supported when the Profiling Type is set to Sampling
However they don't differ too much.
I'm not a wizard in profiling myself but in your case I'd start with sampling to get a list of functions that take ridiculously long to execute, and then I'd mark them for line-by-line profiling.
