How to monitor CPU frequency reliably (in Linux)? - linux-kernel

I am trying to monitor the CPU operating frequency for individual cores. I am not sure what the correct way is to monitor the CPU frequency reliably, both from the kernel level and from the hardware level, with low overhead.
I would highly appreciate it if someone could answer a couple of questions that I have.
Let's say I am running an application pinned to a core. I would like to monitor the frequency it demands during its execution (start to end) and capture it. I want the accurate frequency that it demands at the hardware level (from MSRs, perhaps).
I am not sure what the accurate way to capture this is. Is there a way? Are there any tools or commands via which I can read the frequency value directly from the MSRs?
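A minimal sketch of reading the frequency straight from the MSRs - the APERF/MPERF-ratio technique that turbostat uses internally. It assumes an Intel CPU, root privileges, the msr kernel module loaded (modprobe msr, so that /dev/cpu/N/msr exists), and a nominal TSC frequency that you fill in for your own machine:

    /* Sketch: estimate core 2's effective frequency from APERF/MPERF deltas.
     * effective_freq ~= nominal_TSC_freq * dAPERF / dMPERF over the window. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define IA32_MPERF 0xE7   /* counts at nominal TSC rate while in C0 */
    #define IA32_APERF 0xE8   /* counts at actual clock rate while in C0 */

    static uint64_t read_msr(int fd, uint32_t reg)
    {
        uint64_t val;
        if (pread(fd, &val, sizeof(val), reg) != sizeof(val)) {
            perror("pread");
            exit(1);
        }
        return val;
    }

    int main(void)
    {
        const double tsc_mhz = 2400.0;   /* assumption: nominal frequency of this machine */
        int fd = open("/dev/cpu/2/msr", O_RDONLY);
        if (fd < 0) { perror("open /dev/cpu/2/msr"); return 1; }

        uint64_t m0 = read_msr(fd, IA32_MPERF), a0 = read_msr(fd, IA32_APERF);
        usleep(10 * 1000);               /* 10 ms sampling window */
        uint64_t m1 = read_msr(fd, IA32_MPERF), a1 = read_msr(fd, IA32_APERF);

        double ratio = (double)(a1 - a0) / (double)(m1 - m0);
        printf("core 2 effective frequency over 10 ms: %.0f MHz\n", tsc_mhz * ratio);
        close(fd);
        return 0;
    }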
I have tried a couple of options, but I am not sure whether they reflect the correct frequency:
NOTE: I am trying to sample the core's frequency every 10 ms, 20 ms, 30 ms, and so on.
From the kernel level:
I was reading a sysfs file:
cat /sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_cur_freq
I am not sure whether the above gives the correct frequency value every 10 ms, 20 ms, etc. Is there any overhead associated with reading this file at a 10 ms interval?
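For reference, a minimal sampling loop over that sysfs file, using the 10 ms interval from the question. Note that cpuinfo_cur_freq usually requires root; scaling_cur_freq in the same directory is world-readable on most systems and is often suggested as a cheaper alternative:

    /* Sketch: sample cpu2's current frequency from sysfs every 10 ms. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_cur_freq";
        for (int i = 0; i < 100; i++) {            /* 100 samples ~= 1 s of data */
            FILE *f = fopen(path, "r");
            if (!f) { perror("fopen"); return 1; }
            unsigned long khz;
            if (fscanf(f, "%lu", &khz) == 1)
                printf("%3d ms: %lu kHz\n", i * 10, khz);
            fclose(f);
            usleep(10 * 1000);                     /* 10 ms interval */
        }
        return 0;
    }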
Then I used the turbostat command, but this does not tell me the correct frequency for a particular core over a specified time interval; rather, it tells me the busy%, etc. I am looking for an accurate frequency for the sampling interval that I specify.
Questions:
What is the best and most reliable way to monitor CPU frequency from a systems perspective with very low overhead?
What is the minimum sampling interval I can use to monitor CPU frequency (I know this depends on the CPU governor)? I am currently assuming, and interested in, the ondemand governor being set for the core whose frequency I am trying to monitor.
It would be a great help if someone could guide me.

I assume that you want to track all CPU frequency changes on a per-event basis for your application, not the summary which those sysfs files provide:
cpufreq has tracepoints you can follow - trace_cpu_frequency(),
and you can add additional custom trace points to mark your application's events - e.g. write messages to trace/trace_marker.
After execution you can see all the recorded events, including the cpufreq tracepoints and your trace marks.
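A minimal sketch of the trace_marker side, assuming tracefs is mounted at /sys/kernel/debug/tracing and that the cpu_frequency tracepoint has been enabled first (echo 1 > /sys/kernel/debug/tracing/events/power/cpu_frequency/enable):

    /* Sketch: drop custom markers into the ftrace buffer around a phase of
     * the application, so they interleave with power:cpu_frequency events
     * in the trace output. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void trace_mark(int fd, const char *msg)
    {
        if (fd >= 0)
            write(fd, msg, strlen(msg));
    }

    int main(void)
    {
        int fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);
        if (fd < 0)
            perror("open trace_marker");

        trace_mark(fd, "my_app: phase start\n");
        /* ... the workload you want to correlate with frequency changes ... */
        sleep(1);
        trace_mark(fd, "my_app: phase end\n");

        if (fd >= 0)
            close(fd);
        return 0;
    }

Afterwards, reading /sys/kernel/debug/tracing/trace shows the frequency-change events and your markers on one timeline.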

Related

Performance monitoring with perf

Disclaimer: I am new to perf and still trying to learn the ins/outs.
I have an executable that is running on my target system running Linux. I would like to use perf in order to profile/monitor its performance over time. For argument's sake, I am trying to prove that the CPU utilization currently measured from top and collectD can instead be monitored via perf, which will result in much more granular data.
Since we are trying to get a plot over time, I have been using perf record -e cycles -p <pid>. Afterwards I can display the data via perf report.
Questions:
Just displaying the data with perf report shows me a summary of the entire data set that I took, correct?
If I were to run perf report -D, I would get a dump of all the data. (Just as a side question: the timestamp is uptime in ns, correct?) Now I would assume that the samples are based on the frequency that can be set in perf record, correct? I have run into issues when taking the time delta of the timestamps; they appear to be recorded at random intervals.
Once I dump the data, there is nothing in it that really shouts out "this is your count!!", so my assumption was that the "period" field from the dump is the raw count. Is this true? Meaning that if period = 100, can I assume that for that interval my program used 100 cycles? Additionally, I am starting to get the feeling that this covers not just the application but each library or kernel call that the program makes, i.e. if malloc is called, a different event will be recorded outlining that call's cycles. So overall, how can I derive the duration of an event, the number of cycles for the event, and which event it actually was from this field, to get a true measure of CPU utilization?
If this application of perf is not what it was intended for, then I would also like to know why not. Additionally, I think this same type of analysis would be useful for all of the other types of statistics, since you could then pinpoint in time when an anomaly occurred in your running code. Just for reference, I am running perf against top, collecting at 1 s, since I want to compare top output to perf output. Any insight would be helpful since, as I said, I am still learning and new to this powerful tool.
(Linux Kernel version: 3.10.82)
ANSWER #1
Yes, mostly. perf report does show you a summary of the trace collected. Samples collected by perf record are saved into a binary file called, by default, perf.data. The perf report command reads this file and generates a concise execution profile. By default, samples are sorted by function, with the functions that have the most samples first. However, you can also do much more detailed profiling using this report.
ANSWER #2
You should ideally use perf script -D to get a trace of all the data. The timestamp is in microseconds, although in kernels newer than the one you specify, a command-line switch (-ns) lets you display the time in nanoseconds as well. Here is the source -
Timestamp
It is quite hard to tell without looking at what kind of "deltas" you are getting. Remember that the sampling period is usually tuned. There are two ways of specifying the rate at which to collect samples:
You can use perf record's -c option (c for count) to specify the period at which to collect samples. This means that for every c occurrences of the event that you are measuring, you will have one sample. You can then modify the sampling period and test various values: each time the counter reaches c occurrences of the event being measured, it overflows and a sample is recorded.
The other way to express the sampling period is to specify the average rate of samples per second (frequency), which you can do using perf record -F. So perf record -F 1000 will record around 1000 samples per second, and these samples are generated when the hardware/PMU counter corresponding to the event overflows. This means that the kernel dynamically adjusts the sampling period, and you will get samples at seemingly random moments.
You can see for yourself in code here:
How perf dynamically updates time
ANSWER #3
Why not? Ideally you should get the number of event samples collected if you do a perf report and then do a deeper analysis. Also, when perf record finishes recording samples, you get a notification on the command line about the number of samples collected for the event you measured. (This may not be available in the kernel version you use; I would suggest switching to a newer Linux version if possible!) The number of samples is the raw count - not the period.
If your period is 100, it means that for the whole duration of the trace, perf recorded every 100th event. That means, if a total of 1000 events happened during the trace, perf approximately collected events 1, 100, 200, 300, ..., 1000.
Yes, the samples recorded are not only from the application. In fact, you can use switches like perf record -e <event-name>:u or <event-name>:k (u for user space and k for kernel) to restrict where events are recorded. Additionally, perf records samples from shared libraries as well. (Please consult the perf man page for more details.)
As I said previously, perf report should be the ideal tool to calculate the number of samples of the cycles event recorded by perf. The number of events collected/recorded is not exact, because it is simply not possible for the hardware to record all cycle events. Recording and preparing details of all the events requires the kernel to maintain a ring buffer which gets written to periodically as the counter overflows. This writing to the buffer happens via interrupts, which take up a fraction of CPU time; that time is lost and could have been used to record events, which are missed while the CPU is busy servicing interrupts. Even then, perf gives you a really good estimate.
CONCLUSION
perf does essentially what it is intended to do, given the limitations of the hardware resources we currently have at hand. I would suggest going through the man pages for each command to understand them better.
QUESTIONS
I assume you are looking at perf report, and I also assume you are talking about the overhead % in perf report. Theoretically, it can be considered an arrangement of the data from the highest to the lowest number of occurrences, as you specified. But there are many underlying details that you need to consider and understand to properly make sense of the output. It represents which function has the most overhead (in terms of the number of events that occurred in that function). There is also a parent-child relationship between all the functions and their overheads, based on which function calls which. Please use the Perf Report link to understand more.
As you already know, events are sampled, not counted, so you cannot get the exact number of events. You will, however, get the number of samples, and based on the tuned frequency of collecting samples you can also estimate the raw count of events (everything should be available to you in the perf report output).

How does a system wide profiler (e.g. perf) correlate counters with instructions?

I'm trying to understand how a system-wide profiler works. Let's take Linux perf as an example. For a certain profiling time it can provide:
Various aggregated hardware performance counters
Time spent and hardware counters (e.g. #instructions) for each user space process and kernel space function
Information about context switches
etc.
The first thing I'm almost sure about is that the report is just an estimation of what's really happening. So I think there's some kernel module that launches software interrupts at a certain sampling rate. The lower the sampling rate, the lower the profiler overhead. The interrupt can read the model specific registers that store the performance counters.
The next part is to correlate the counters with the software that's running on the machine. That's the part I don't understand.
So where does the profiler get its data from?
Can you interrogate, for example, the task scheduler to find out what was running when you interrupted it? Won't that affect the execution of the scheduler (e.g. instead of continuing the interrupted function it will just schedule another one, making the profiler result inaccurate)? Is the list of task_struct objects available?
How can profilers even correlate HW metrics at the instruction level?
So I think there's some kernel module that launches software interrupts at a certain sampling rate.
Perf is not a module; it is part of the Linux kernel, implemented in kernel/events/core.c and, for every supported architecture and CPU model, in files such as arch/x86/kernel/cpu/perf_event*.c. Oprofile, by contrast, was a module with a similar approach.
Perf generally works by asking the PMU (performance monitoring unit) of the CPU to generate an interrupt after N events of some hardware performance counter (Yokohama, slide 5: "Interrupt when threshold reached: allows sampling"). In practice it may be implemented as:
select some PMU counter
initialize it to -N, where N is the sampling period (we want an interrupt after N events - for example, after 2 million cycles with perf record -c 2000000 -e cycles, or some N computed and tuned by perf when no extra option is set or -F is given)
program this counter for the wanted event, and ask the PMU to generate an interrupt on overflow (ARCH_PERFMON_EVENTSEL_INT). This will happen after N increments of our counter.
All modern Intel chips support this; for example, Nehalem: https://software.intel.com/sites/default/files/76/87/30320 - Nehalem Performance Monitoring Unit Programming Guide:
EBS - Event Based Sampling. A technique in which counters are pre-loaded with a large negative count, and they are configured to interrupt the processor on overflow. When the counter overflows the interrupt service routine captures profiling data.
So, when you use the hardware PMU, there is no additional work at each timer interrupt to do a special read of the hardware PMU counters. There is some work to save/restore PMU state at task switch, but this (*_sched_in/*_sched_out of kernel/events/core.c) will not change the PMU counter value for the current thread nor export it to user space.
There is a handler, arch/x86/kernel/cpu/perf_event.c: x86_pmu_handle_irq, which finds the overflowed counter and calls perf_sample_data_init(&data, 0, event->hw.last_period); to record the current time, the IP of the last executed instruction (it can be inexact because of the out-of-order nature of most Intel microarchitectures; there is a limited workaround for some events - PEBS, perf record -e cycles:pp), stacktrace data (if -g was used in record), etc. Then the handler resets the counter value back to -N (x86_perf_event_set_period, wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask); - note the minus before left).
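For illustration, the same preload-and-overflow mechanism is exposed to user space through the perf_event_open(2) syscall. A minimal sketch that mirrors perf record -c 2000000 -e cycles:u for the current process (it only reads back the aggregate count; consuming the individual samples would additionally require mmap()ing the event's ring buffer, and non-root use may require lowering /proc/sys/kernel/perf_event_paranoid):

    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size           = sizeof(attr);
        attr.type           = PERF_TYPE_HARDWARE;
        attr.config         = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period  = 2000000;   /* PMU interrupt every 2M cycles */
        attr.sample_type    = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME;
        attr.disabled       = 1;
        attr.exclude_kernel = 1;         /* userspace only, like cycles:u */

        /* pid = 0 (this process), cpu = -1 (any CPU), no group, no flags */
        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        for (volatile long i = 0; i < 100000000; i++)   /* some work to measure */
            ;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count;
        read(fd, &count, sizeof(count));
        printf("cycles counted: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }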
The lower the sampling rate, the lower the profiler overhead.
Perf allows you to set the target sampling rate with the -F option; -F 1000 means around 1000 irq/s. High rates are not recommended due to high overhead. Ten years ago Intel VTune recommended no more than 1000 irq/s (http://www.cs.utah.edu/~mhall/cs4961f09/VTune-1.pdf, "Try to get about a 1000 samples per second per logical CPU."). perf usually doesn't allow too high a rate for non-root users (it is autotuned to a lower rate when "perf interrupt took too long" appears - check your dmesg; also check sysctl -a | grep perf, for example kernel.perf_cpu_time_max_percent=25, which means that perf will try to use no more than 25% of the CPU).
Can you interrogate for example the task scheduler to find out what was running when you interrupted him?
No. But you can enable a tracepoint at sched_switch or another sched event (list all that are available with perf list 'sched:*'), and use it as a profiling event for perf. You can even ask perf to record a stacktrace at this tracepoint:
perf record -a -g -e "sched:sched_switch" sleep 10
Won't that affect the execution of the scheduler
An enabled tracepoint adds some perf event sampling work to the function containing the tracepoint.
Is the list of task_struct objects available?
Only via ftrace...
Information about context switches
This is a software perf event - just a call to perf_sw_event with the PERF_COUNT_SW_CONTEXT_SWITCHES event from sched/core.c (indirectly). An example of a direct call is the migration software event in kernel/sched/core.c, set_task_cpu(): p->se.nr_migrations++; perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
PS: there are good slides on perf, ftrace and other profiling and tracing subsystems in Linux by Gregg: http://www.brendangregg.com/linuxperf.html
This pretty much answers all three of your questions.
Profiling consists of two types: counting and sampling. Counting measures the overall number of events during the entire execution without offering any insight regarding the instructions or functions that generated them. On the other hand, sampling correlates events to the code through captured samples of the Instruction Pointer.
When sampling, the kernel instructs the processor to issue an interrupt when a chosen event counter exceeds a threshold. This interrupt is caught by the kernel, and the sampled data, including the Instruction Pointer value, are stored into a ring buffer. The buffer is polled periodically by the userspace perf tool and its contents written to disk.
In post-processing, the Instruction Pointer is matched to addresses in binary files, which can be translated into function names and such.
Refer to http://openlab.web.cern.ch/sites/openlab.web.cern.ch/files/technical_documents/TheOverheadOfProfilingUsingPMUhardwareCounters.pdf

dotTrace - what profiling settings should I use for my desktop app?

When using dotTrace, I have to pick a profiling mode and a time measurement method. Profiling modes are:
Tracing
Line-by-line
Sampling
And time measurement methods are:
Wall time (performance counter)
Thread time
Wall time (CPU instruction)
Tracing and line-by-line can't use thread time measurement. But that still leaves me with seven different combinations to try. I've now read the dotTrace help pages on these well over a dozen times, and I remain no more knowledgeable than I started out about which one to pick.
I'm working on a WPF app that reads Word docs, extracts all the paragraphs and styles, and then loops through that extracted content to pick out document sections. I'm trying to optimize this process. (Currently it takes well over an hour to complete, so I'm trying to profile it for a given length of time rather than until it finishes.)
Which profiling and time measurement types would give me the best results? Or if the answer is "It depends", then what does it depend on? What are the pros and cons of a given profiling mode or time measurement method?
Profiling types:
Sampling: fastest but least accurate profiling type, minimum profiler overhead. Essentially equivalent to pausing the program many times a second and viewing the stacktrace; thus the number of calls per method is approximate. Still useful for identifying performance bottlenecks at the method level.
Snapshots captured in sampling mode occupy a lot less space on disk (I'd say 5-6 times less space).
Use for initial assessment or when profiling a long-running application (which sounds like your case.)
Tracing: records the duration taken by each method. The app under profiling runs slower, but in return dotTrace shows the exact number of calls of each function, and the function timing info is more accurate. This is good for diving into the details of a problem at the method level.
Line-by-line: Profiles the program on a per-line basis. Largest resource hog but most fine-grained profiling results. Slows the program way down. The preferred tactic here is to initially profile using another type, and then hand-pick functions for line-by-line profiling.
As for meter kinds, I think they are described quite well in Getting started with dotTrace Performance by the great Hadi Hariri.
Wall time (CPU Instruction): This is the simplest and fastest way to measure wall time (that is, the time we observe on a wall clock). However, on some older multi-core processors this may produce incorrect results due to the cores' timers being desynchronized. If this is the case, it is recommended to use Performance Counter.
Wall time (Performance Counter): Performance counters are part of the Windows API and allow taking time samples in a hardware-independent way. However, being an API call, every measurement takes substantial time and therefore has an impact on the profiled application.
Thread time: In a multi-threaded application, concurrent threads contribute to each other's wall time. To avoid such interference we can use the thread time meter, which makes system API calls to get the amount of time given by the OS scheduler to the thread. The downsides are that taking thread time samples is much slower than using the CPU counter, and the precision is also limited by the size of the quantum used by the thread scheduler (normally 10 ms). This mode is only supported when the Profiling Type is set to Sampling.
However, they don't differ too much.
I'm not a wizard in profiling myself but in your case I'd start with sampling to get a list of functions that take ridiculously long to execute, and then I'd mark them for line-by-line profiling.

CPU consumption equivalent for harddisk scanning

I would like my software that scans the disk structure to work in the background, but lowering the priority of the thread that scans the disk structure doesn't work. I mean, you still have the feeling of the computer working hard and even freezing, even if your program consumes only 1 percent of the processor time. Is it possible to implement a "hard disk time consumption" equivalent of CPU consumption in Win32?
Since Vista you can lower your IO priority, which is separate from CPU priority.
http://msdn.microsoft.com/en-us/library/ms686219(VS.85).aspx
SetPriorityClass(GetCurrentProcess(), PROCESS_MODE_BACKGROUND_BEGIN)
For XP, 2003 and older, you'd have to find some other way to throttle your disk activity, like using Sleep() often.
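A minimal sketch of the Vista-and-later call (background mode lowers the process's CPU, I/O and memory priority together; the same pattern works per-thread with SetThreadPriority and THREAD_MODE_BACKGROUND_BEGIN/END):

    /* Sketch: run a disk scan at background priority on Vista and later. */
    #include <windows.h>
    #include <stdio.h>

    static void scan_disk(void)
    {
        /* ... enumerate directories, read files, etc. ... */
    }

    int main(void)
    {
        if (!SetPriorityClass(GetCurrentProcess(), PROCESS_MODE_BACKGROUND_BEGIN))
            fprintf(stderr, "background mode unavailable (error %lu)\n", GetLastError());

        scan_disk();

        SetPriorityClass(GetCurrentProcess(), PROCESS_MODE_BACKGROUND_END);
        return 0;
    }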
Disk accesses are typically measured by a few different metrics: transfers per second (which can be broken down into reads/writes) and data read or written per second. If you want to limit the impact of your disk-scanning application, one way to do this would be to track one (or both) of these metrics, determine a reasonable cap, and periodically sleep your thread for some time period. Nothing you can do to CPU scheduling is going to be effective at accomplishing this task except in the most diaphanous, indirect way.
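A sketch of that rate-capping idea, with an arbitrary 10 MB/s budget and 64 KiB read chunk chosen purely for illustration:

    /* Sketch: cap a scanning thread's read bandwidth by sleeping whenever it
     * runs ahead of its bytes-per-second budget. */
    #include <windows.h>
    #include <stdio.h>

    #define CHUNK    (64 * 1024)
    #define MAX_BPS  (10 * 1024 * 1024)        /* allow roughly 10 MB/s */

    static void throttled_scan(const char *path)
    {
        static char buf[CHUNK];
        FILE *f = fopen(path, "rb");
        if (!f) return;

        DWORD start = GetTickCount();
        unsigned long long total = 0;
        size_t n;

        while ((n = fread(buf, 1, CHUNK, f)) > 0) {
            total += n;
            /* ... hash/inspect buf here ... */
            /* If we have read more than the budget allows so far, sleep it off. */
            DWORD elapsed_ms = GetTickCount() - start;
            DWORD budget_ms  = (DWORD)(total * 1000 / MAX_BPS);
            if (budget_ms > elapsed_ms)
                Sleep(budget_ms - elapsed_ms);
        }
        fclose(f);
    }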

What is the best way to measure "spare" CPU time on a Linux based system

For some of the customers that we develop software for, we are required to "guarantee" a certain amount of spare resources (memory, disk space, CPU). Memory and disk space are simple, but CPU is a bit more difficult.
One technique that we have used is to create a process that consumes a guaranteed amount of CPU time (say 2.5 seconds every 5 seconds). We run this process at highest priority in order to guarantee that it runs and consumes all of its required CPU cycles.
If our normal applications are able to run at an acceptable level of performance and can pass all of their functionality tests while the spare time process is running as well, then we "assume" that we have met our commitment for spare CPU time.
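A minimal sketch of such a spare-time process, using the 2.5 s per 5 s figures above. It burns wall-clock time in a busy loop, so it only approximates CPU time well if it really runs at the highest priority (e.g. started under a real-time scheduling class with chrt):

    /* Sketch: consume roughly 2.5 s of CPU out of every 5 s window. */
    #include <time.h>
    #include <unistd.h>

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        for (;;) {
            double start = now_sec();
            while (now_sec() - start < 2.5)   /* busy-spin for 2.5 s */
                ;
            while (now_sec() - start < 5.0)   /* idle for the rest of the window */
                usleep(10 * 1000);
        }
        return 0;
    }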
I'm sure that there are other techniques for doing the same thing, and would like to learn about them.
So this may not be exactly the answer you're looking for, but if all you want to do is make sure your application doesn't exceed certain limits on resource consumption, and you're running on Linux, you can customize /etc/security/limits.conf (it may be a different file on your distro of choice) to force the limits on a particular user and only run the process under that user. This is of course assuming that you have that level of control over your client's production environment.
If I understand correctly, your concern is whether the application also runs while a given percentage of the processing power is not available.
The most incontrovertible approach is to use underpowered hardware for your testing. If the processor in your setup allows it, you can downclock it online. The Linux kernel gives you an easy interface for doing this; see /sys/devices/system/cpu/cpu0/cpufreq/. There are also a number of GUI applications available for this.
If your processor isn't capable of changing clock speed online, you can do it the hard way and select a smaller multiplier in your BIOS.
I think you get the idea. If it runs at 1600 MHz instead of 2400 MHz, you can guarantee 33% of spare CPU time.
SAR is a standard *nix process that collects information about the operational use of system resources. It also has a command line tool that allows you to create various reports, and it's common for the data to be persisted in a database.
With a multi-core/processor system you could use Affinity to your advantage.
