What's the difference between the perf command and perfmon2 / libpfm4? - linux-kernel

Recently I have been digging into performance counters. I googled and found perfmon2 and libpfm4 (http://perfmon2.sourceforge.net/), and I also found the perf command (https://perf.wiki.kernel.org/index.php/Main_Page), which ships with the Linux kernel source code.
I have played with both libpfm4 and the perf command, and libpfm4 seems to be able to provide only CPU cycle counts or instruction counts.
I can't find any example code or a runnable example of how to retrieve information like L1-dcache-loads, which seems obtainable using perf. I looked it up on Stack Overflow and found a question discussing the relationship between the perf command and libpfm4: Using Hardware Performance Counters in Linux. People said the author of libpfm4 was angry with one of perf's contributors, Ingo, but later on he actually helped review perf's code.
So can someone explain the relationship between perfmon2 / libpfm4 and the perf command? And can I retrieve information like L1-dcache events using libpfm4, just as I can with the perf command? Thank you very much!

The perf command provides a subset of common performance counter events to measure, such as processor clock cycles, instruction counts, and cache event metrics. However, most processors provide many other implementation-specific hardware events, such as floating-point operations and microarchitectural events (e.g. stalls due to hardware resource limits). To access those implementation-specific events one needs to use raw events in perf (http://lxr.linux.no/#linux+v3.6/tools/perf/Documentation/perf-record.txt#L33), which can be tedious. libpfm4 provides a mapping mechanism to refer to those implementation-specific hardware events by name. libpfm4 is used by PAPI; you might take a look at how PAPI uses libpfm4 to access those implementation-specific events (http://icl.cs.utk.edu/projects/papi/).
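To answer the second part of the question directly: yes. Here is a minimal sketch (my example, not from the answer above) of how libpfm4 turns an event name into the perf_event_attr encoding that the perf_event_open() syscall takes. The event string below is one of libpfm4's generic cache-event names and is an assumption on my part; run the showevtinfo example program bundled with libpfm4 to list the names valid on your machine. Compile with gcc pfm_encode.c -lpfm.

#include <perfmon/pfmlib_perf_event.h>  /* libpfm4 */
#include <string.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    struct perf_event_attr attr;
    pfm_perf_encode_arg_t arg;
    /* Event name is an example; showevtinfo lists your CPU's names. */
    const char *ev = argc > 1 ? argv[1]
                              : "perf::PERF_COUNT_HW_CACHE_L1D:READ:ACCESS";

    if (pfm_initialize() != PFM_SUCCESS) {
        fprintf(stderr, "pfm_initialize failed\n");
        return 1;
    }

    memset(&attr, 0, sizeof(attr));
    memset(&arg, 0, sizeof(arg));
    arg.attr = &attr;           /* libpfm4 fills this perf_event_attr in */
    arg.size = sizeof(arg);

    if (pfm_get_os_event_encoding(ev, PFM_PLM3, PFM_OS_PERF_EVENT,
                                  &arg) != PFM_SUCCESS) {
        fprintf(stderr, "cannot encode %s\n", ev);
        return 1;
    }

    /* attr is now ready to hand to the perf_event_open() syscall. */
    printf("%s -> type=%u config=0x%llx\n",
           ev, attr.type, (unsigned long long)attr.config);
    return 0;
}

perf's L1-dcache-loads is the kernel's generic cache-event alias for the same counter, so once you have the encoding you can read L1 d-cache counts through a perf_event_open() file descriptor just like perf does.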

Related

How does perf record (or other profilers) pick which instruction to count as costing time?

Recently, I found out that perf (or pprof) may, in the disassembly view, show instruction timing next to a line that didn't actually take that time. The instruction which actually took the time is before it. I know the vague explanation that this happens due to instruction pipelining in the CPU. However, I would like to find out the following:
Is there a more detailed explanation of this effect?
Is it documented in perf or pprof? I haven't found any references.
Is there a way to obtain correctly placed timings?
(A quick, not super detailed answer; a more detailed one would be good if someone wants to write one.)
perf just uses the CPU's own hardware performance counters, which can be put into a mode where they record an event when the counter counts down to zero or up to a threshold, either by raising an interrupt or by writing an event into a buffer in memory (with PEBS precise events; perf requests those when you suffix an event with :p or :pp). That event will include a code address that the CPU picked to associate with the event (i.e. the point at which the interrupt was raised), even for events like cycles which, unlike instructions, don't inherently have a specific instruction associated with them. The out-of-order exec back-end can have a couple hundred instructions in flight when the counter wraps, but it has to pick exactly one for any given sample.
Generally the CPU "blames" the instruction that was waiting for a slow-to-produce result, not the one producing it; this is especially true for cache-miss loads.
For an example with Intel x86 CPUs, see Why is this jump instruction so expensive when performing pointer chasing?
which also appears to depend on the effect of letting the last instruction in the ROB retire when an interrupt is raised. (Intel CPUs at least do seem to do that; it makes sense for ensuring forward progress even with a potentially slow instruction.)
In general there can be "skew", where the blame lands on a later instruction than the one actually taking the time, possibly with different causes. (Perhaps especially for uncore events, since they happen asynchronously to the core clock.)
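As a concrete illustration of the mechanism (my sketch, not part of the original answer): when you open a sampling event through the perf_event_open() syscall yourself, the attr.precise_ip field is what requests PEBS-style precise samples with reduced skid; it is the same thing perf selects for the :p / :pp modifiers. This just demonstrates the attribute; a real profiler would mmap a ring buffer and parse the sample records.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_period = 100000;        /* one sample every 100k cycles */
    attr.sample_type = PERF_SAMPLE_IP;  /* record the sampled code address */
    attr.precise_ip = 2;                /* request zero skid (PEBS-style);
                                           0 would allow arbitrary skid */
    attr.exclude_kernel = 1;
    attr.disabled = 1;

    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0)
        perror("perf_event_open (precise sampling may be unsupported here)");
    else
        puts("opened a precise cycles sampling event");
    return 0;
}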
Other related Q&As with interesting examples and other details:
Inconsistent `perf annotate` memory load/store time reporting
Linux perf reporting cache misses for unexpected instruction
https://travisdowns.github.io/blog/2019/08/20/interrupts.html - some experiments into which instructions tend to get counts on Skylake.

Benchmarking - How to count number of instructions sent to CPU to find consumed MIPS

Consider that I have some software and want to study its behavior using a black-box approach. I have a 3.0 GHz CPU with 2 sockets and 4 cores. As you know, in order to find instructions per second (IPS) we have to use the following formula:
IPS = sockets * (cores per socket) * clock * (instructions per cycle)
At first, I wanted to find the number of instructions per cycle for my specific algorithm. Then I realised it's almost impossible to count using a black-box approach, and I would need to do an in-depth analysis of the algorithm.
But now I have two questions: regardless of what kind of software is running on my machine and its CPU usage, is there any way to count the number of instructions per second sent to the CPU, i.e. millions of instructions per second (MIPS)? And is it possible to find the breakdown by instruction type (add, compare, in, jump, etc.)?
Any piece of script or tool recommendation would be appreciated (in any language).
perf stat --all-user ./my_program on Linux will use CPU performance counters to record how many user-space instructions it ran and how many core clock cycles it took, as well as how much CPU time it used, and will calculate average instructions per core clock cycle for you, e.g.:
3,496,129,612 instructions:u # 2.61 insn per cycle
It calculates IPC for you; this is usually more interesting than instructions per second. uops per clock is usually even more interesting in terms of how close you are to maxing out the front-end, though. You can manually calculate MIPS from instructions and task-clock. For most other events perf prints a comment with a per-second rate.
(If you don't use --all-user, you can use perf stat -e task-clock:u,instructions:u,... to have those specific events count only in user-space, while other events can count all the time, including inside interrupt handlers and system calls.)
But see How to calculate MIPS using perf stat for more detail on instructions / task-clock vs. instructions / elapsed_time if you do actually want total or average MIPS across cores, and counting sleep or not.
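As a worked example (with hypothetical numbers, not taken from the output above): if perf stat reported 3,496,129,612 instructions and 1,200.5 ms of task-clock, that is 3,496.1 M / 1.2005 s ≈ 2,912 million instructions per second of CPU time, i.e. about 2,912 MIPS while the program was actually running on a core. If the run took 2.0 s of wall-clock time (perhaps because it slept or waited), the average over the whole run would instead be 3,496.1 M / 2.0 s ≈ 1,748 MIPS.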
For an example output from using it on a tiny microbenchmark loop in a static executable, see Can x86's MOV really be "free"? Why can't I reproduce this at all?
How can I get real-time information at run-time?
Do you mean from within the program, to profile only part of it? There's a kernel API for that: the perf_event_open(2) system call. Or use a library that gives direct access to the HW perf counters.
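For instance, here is a minimal sketch (my example, not from the answer) of counting instructions and core cycles around one region of your own code with perf_event_open(), opening the two counters as a group so they cover exactly the same interval:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static int open_counter(uint64_t config, int group_fd)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = config;
    attr.disabled = (group_fd == -1);  /* only the group leader starts disabled */
    attr.exclude_kernel = 1;           /* count user-space only, like :u */
    attr.read_format = PERF_FORMAT_GROUP;
    return syscall(__NR_perf_event_open, &attr, 0, -1, group_fd, 0);
}

int main(void)
{
    int lead = open_counter(PERF_COUNT_HW_CPU_CYCLES, -1);
    int insn = open_counter(PERF_COUNT_HW_INSTRUCTIONS, lead);
    if (lead < 0 || insn < 0) { perror("perf_event_open"); return 1; }

    ioctl(lead, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(lead, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

    volatile uint64_t x = 0;           /* the region you want to measure */
    for (uint64_t i = 0; i < 100000000; i++) x += i;

    ioctl(lead, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    /* With PERF_FORMAT_GROUP, read() returns nr followed by one value
       per event, in the order the events were opened. */
    uint64_t buf[3];
    if (read(lead, buf, sizeof(buf)) != sizeof(buf)) { perror("read"); return 1; }
    printf("cycles=%llu instructions=%llu IPC=%.2f\n",
           (unsigned long long)buf[1], (unsigned long long)buf[2],
           (double)buf[2] / (double)buf[1]);
    return 0;
}

This is essentially what perf stat does for the whole process, just scoped to one region by your own enable/disable calls.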
perf stat is great for microbenchmarking a loop that you've isolated into a stand-alone program that just runs the hot loop for a second or so.
Or maybe you mean something else. perf stat -I 1000 ... ./a.out will print counter values every 1000 ms (1 second), to see how program behaviour changes in real time with whatever time window you want (down to 10ms intervals).
sudo perf top is system-wide, slightly like Unix top
There's also perf record --timestamp to record a timestamp with each event sample. perf report -D might be useful along with this. See http://www.brendangregg.com/perf.html, he mentions something about -T (--timestamp). I haven't really used this; I mostly isolate single loops I'm tuning into a static executable I can run under perf stat.
And is it possible to find the type of instruction set (add, compare, in, jump, etc)?
Intel x86 CPUs at least have a counter for branch instructions, but other types aren't differentiated, other than FP instructions. This is probably common to most architectures that have perf counters at all.
For Intel CPUs, there's ocperf.py, a wrapper for perf with symbolic names for more microarchitectural events. (Update: plain perf now knows the names of most uarch-specific counters so you don't need ocperf.py anymore.)
perf stat -e task_clock,cycles,instructions,fp_arith_inst_retired.128b_packed_single,fp_arith_inst_retired.scalar_double,uops_executed.x87 ./my_program
It's not designed to tell you what instructions are running; you can already tell that from tracing execution. Most instructions are fully pipelined, so the interesting thing is which ports have the most pressure. The exception is the divide/sqrt unit: there's a counter for arith.divider_active: "Cycles when divide unit is busy executing divide or square root operations. Accounts for integer and floating-point operations". The divider isn't fully pipelined, so a new divps or sqrtps can't always start even if no older uops are ready to execute on port 0. (http://agner.org/optimize/)
Related: linux perf: how to interpret and find hotspots, on using perf to identify hotspots, especially top-down profiling where you have perf sample the call-stack to see which functions make a lot of expensive child calls. (I mention this in case that's what you really wanted to know, rather than instruction mix.)
Related:
How do I determine the number of x86 machine instructions executed in a C program?
How to characterize a workload by obtaining the instruction type breakdown?
How do I monitor the amount of SIMD instruction usage
For exact dynamic instruction counts, you might use an instrumentation tool like Intel PIN, if you're on x86. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool.
perf stat counts for the instructions:u hardware event should also be more or less exact, and in practice are very repeatable across runs of the same program doing the same work.
On recent Intel CPUs, there's HW support for recording which way conditional / indirect branches went, so you can reconstruct exactly which instructions ran in which order (assuming no self-modifying code, and that you can still read any JIT buffers): Intel PT. perf can record it with perf record -e intel_pt//u.
Sorry I don't know what the equivalents are on AMD CPUs.

Using perf to monitor memory access of every CPU

I'm trying to use the Linux perf tool to sample the memory accesses in my program. Specifically, I'm using perf to monitor the read/write accesses of every CPU in a NUMA system.
Now, I can monitor every single CPU's read and write memory accesses, but I also need to know whether each access is to local memory or to remote memory.
I have used perf list to go through the events list, but I only found some events about socket-level memory access.
Questions
Is there any way to get every single CPU's remote memory accesses when using perf?
Is there a better option than perf ?
Yes, the PMU unit in your CPU can probably do what you want through the various uncore counters - in particular they can count the various offcore responses for non-local memory access. This blog post is a reasonable starting point.
The main problem is that the perf tool, which is tied to a specific kernel version, will often lag behind in its support of modern processors [1], especially when it comes to uncore and NUMA-related events [2].
To work around that, you can use Andi Kleen's pmu-tools, which provides an ocperf wrapper script that uses whatever underlying perf you have on your system but with up-to-date event ids downloaded directly from Intel. That will usually give you access to the uncore events you need.
Of course, even when you get that working, these events are often very tough to interpret, especially because the mental model you have of demand-memory requests is complicated by a ton of factors such as prefetch behavior, request-for-ownership, accesses that "hit" a line-buffer that is still in the process of being filled, and so on.
[1] Both because adding new processors/events takes some time, and especially because the tool is tied to the kernel: you likely aren't on a bleeding-edge kernel, so even though mainline perf might have support, you are stuck with the perf version associated with your kernel.
[2] Probably because most kernel developers, like developers in general, aren't working on NUMA systems.

How can I perform a low-level analysis of a performance degradation?

For example, I have a large linear function (1 basic block, ~1000 instructions)
which is called many times. After some fiddling with compiler options I've got
an unexpected 10% performance degradation on Cortex-A57. Presumably it is due to
slightly different instruction scheduling. I'd like to investigate the problem
deeper and find out what instruction combination causes unnecessary pipeline
stalls. But I have no idea how I could do that. I guess I need a very detailed
execution trace to understand what happens, though I'm not sure if it is
possible to get such a trace.
So, the question is: What tools can I use to investigate such low-level
performance problems? How can I determine what prevents the CPU from executing
the maximum number of instructions every cycle?
PS I'm mostly interested in Cortex-A57 cores, but I'd appreciate useful
information on any other core or even a different architecture.
PPS The function accesses memory, but almost all memory accesses are expected to
hit the cache. This assumption is confirmed by perf stat -e r42,r43
(the L1D_CACHE_REFILL_LD and L1D_CACHE_REFILL_ST events).
Tools: I'm most familiar with Intel compilers and tools but notice there are several similar tools out there for the ARM ecosystem. Here are some techniques I recommend.
USE YOUR COMPILER
It has many options that can give you a very good idea of what is going on.
Disable any optimizations (compiler option) while compiling your original code. This will tell you if the issue is related to code generation optimizations.
Do a before and after ASM dump, and compare. You may find code differences that you already know are suspect.
Make sure you are not including any debugging information. Debug instrumentation inserts checkpoints and other things that can potentially impact the performance of your code. These bits of code will also change how the code moves through the pipeline.
Change the compiler options one at a time to identify if the issue is related to data or code alignment enforcement, etc. I'm sure you've already done this but am mentioning it for completeness.
Enable any compiler performance-monitoring options that can be dumped to a log file. A lot of useful information can be found in compiler log files. On the other hand, they also contain info that can only be interpreted by those that live on a higher plane of existence, i.e. compiler writers.
USE A TOOL THAT DUMPS PMU EVENTS
I saw quite a few out there. My apologies for not giving references but you can do a simple search "tool arm pmu events". These can be extremely sophisticated and powerful, e.g. Intel VTune, or very basic and still very powerful, e.g. the command line SEP for x86.
Take a look at the performance events (PMU events) available to you and figure out which events you want to monitor. You can get these events from the ARM Cortex-A57 processor tech reference (Chapter 11, Performance Monitoring Unit).
USE A PMU DUMPING SDK
Use an SDK that has functions for acquiring the ARM PMU events. These SDKs provide you with APIs for selecting and acquiring PMU events, giving you very precise control. Inserting this monitoring code may impact the execution of your code, so be careful of its placement. Again, you can find plenty of such SDKs out there with a simple search.
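If you end up rolling your own instead of using a packaged SDK, the Linux perf_event_open() syscall can program raw PMU event numbers straight from the TRM chapter mentioned above. A minimal sketch (my example, not from the answer; 0x42 is the L1D_CACHE_REFILL_LD event number the question itself used as r42 on Cortex-A57):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x42;       /* L1D_CACHE_REFILL_LD on Cortex-A57 (TRM ch. 11) */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... the code under test goes here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t refills;
    if (read(fd, &refills, sizeof(refills)) != sizeof(refills)) {
        perror("read"); return 1;
    }
    printf("L1D load refills: %llu\n", (unsigned long long)refills);
    return 0;
}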
STUDY UP ON PIPELINE DEBUGGING (IF YOU ARE REALLY INTO THIS TYPE OF STUFF)
Find a good architectural description of the pipeline, including reservation stations, # of ALUs, etc.
Find a good reference on how to figure out what is going on in the pipeline. Here's an example for x86. ARM is a different beast, but x86 articles will give you the basics (and more) of what you need to analyze and what you can do with what you find.
Good luck. Pipeline debugging can be fun but time consuming.

Kernel module profilers

I want to profile some modules (for example, the network subsystem module).
Can we profile the time / CPU utilization of a function in a kernel module?
I heard about some profilers such as:
perf for system-wide profiling
valgrind -- application level
Is there any profiler to best suit for my use case above?
I really appreciate your time, thanks
You had it right! perf is the tool for you. Since you're going to profile a kernel module, there's no point in using userland tools such as Valgrind.
Usually when monitoring software you care about how much time your system spends in each function; this can be achieved with perf top, which will give you a good estimate of how much time your system is spending in each function.
Functions that you're spending a lot of time in can be very good pointers for optimization.
I'm not sure I understand the time / cpu model you require, but I think the above should meet your needs.
You can read more about how to use perf here.
[EDIT]
Like @myaut said, there are other kernel profiling tools. While I have very good experience with perf and I disagree with @myaut about the quality of the results, it is well worth mentioning some of the other tools. If you're just interested in getting the job done, perf will do just fine, but if you want to learn about other profiling tools and their abilities, I found a nice reference here.
(...Don't forget to kindly mark @myaut's or my answer as accepted if we helped you...)
I doubt that profiling by itself will reveal useful results; you would need the function to be called very often or to spend significant time in it. Otherwise you will get a very small amount of data, since perf profiles all modules.
If you want to measure the real time spent executing a function, I suggest you look at SystemTap:
stap -e 'global tms;
         probe kernel.function("dev_queue_xmit") {
             tms[cpu()] = local_clock_ns();
         }
         probe kernel.function("dev_queue_xmit").return {
             println(local_clock_ns() - tms[cpu()]);
         }'
This script saves the local CPU time in nanoseconds into the tms associative array on entry to dev_queue_xmit(). When the CPU leaves dev_queue_xmit(), the second probe computes the delta. Note that if the task is switched to another CPU while inside dev_queue_xmit(), the results can be messy.
To measure times for a whole module, replace kernel.function("dev_queue_xmit") with module("NAME").function("*"), but attaching to many functions may affect performance. You may also use get_cycles() instead of local_clock_ns() to get CPU cycles.
