Event-based sampling with the perf userland tool and PEBS - linux-kernel

I'm doing event-based sampling with the perf userland tool. The objective is to find out where certain performance-impacting events, such as branch misses and cache misses, occur in a larger system I'm working on.
Now, something like
perf record -a -e branch-misses:pp -- sleep 5
works perfectly: the PEBS counting mode triggered by the 'pp' modifier is really accurate when collecting the IP in the samples.
Unfortunately, when I try to do the same for cache-misses, i.e.
perf record -a -e cache-misses:pp -- sleep 5 # [1]
I get
Error: sys_perf_event_open() syscall returned with 22 (Invalid argument). /bin/dmesg may provide additional information.
Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?
dmesg | grep "perf\|pmu" shows nothing useful AFAICT. I'm also pretty sure that the kernel was compiled with CONFIG_PERF_EVENTS=y, because both the branch-misses:pp command above and
perf record -a -e cache-misses -- sleep 5 # [2]
work; the problem with [2] is that the collected samples are not very accurate, which hurts my profiles.
Any hints on what could be going on here?

It turns out that the specific event the generic cache-misses maps to does not support PEBS. An alternative is to use one of the events that PEBS does support (see the list for the Nehalem architecture here), with an appropriate mask to narrow it down. Specifically, one could use MEM_LOAD_RETIRED:LLC_MISS, although that event does not seem to be accurate in all cases.
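As a rough sketch of what "an event with an appropriate mask" can look like on the command line: perf accepts raw events written as r followed by the umask and the event code in hex, and the :pp modifier can be appended as before. The helper name raw_event is made up for illustration, and the encoding used below (event code 0xCB, umask 0x10 for MEM_LOAD_RETIRED:LLC_MISS) is an assumption based on Intel's Nehalem event tables, so check it against the documentation for your CPU.

```shell
# raw_event EVENT_CODE UMASK: print a perf raw event spec (r<umask><event>).
# Hypothetical helper; the numbers passed below are an assumption for Nehalem.
raw_event() {
    printf 'r%02x%02x\n' "$2" "$1"
}

# Build the spec and only print the resulting command, so this is safe to
# try on a machine without PEBS support.
spec=$(raw_event 0xcb 0x10)
echo "perf record -a -e ${spec}:pp -- sleep 5"
```

With the assumed encoding this prints perf record -a -e r10cb:pp -- sleep 5.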


Which tool for finding the reason for latency peaks on embedded Linux?

Ideally, I would have a debug tool that runs in the background and tells me the name of the process or driver that breaks my system's latency requirement. Which tool is suitable? Do you have a short example of its usage for the following case?
Test case:
The oscilloscope measures the time between the trigger of a GPIO input and the response on a GPIO output. Usually the response time is 150µs. I trigger every 25ms.
My Linux user-space test program uses poll() and read()+write() to mirror the detected input signal back to an output.
The Linux kernel is patched with the PREEMPT_RT patch.
Over a span of hours, I see response-time peaks of up to 20ms.
The most practical approach is to
switch on tracing in the kernel configuration and build the kernel with these options:
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_STACK_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_DEBUG_FS=y
Then run your application under the trace-cmd tool until the weird behaviour shows up:
trace-cmd start -b 10000 -e 'sched_wakeup*' -e sched_switch -e gpio_value -e irq_handler_entry -e irq_handler_exit /tmp/myUserApplication
Afterwards, stop tracing and extract the buffers into a trace.dat file:
trace-cmd stop
trace-cmd extract
Load that trace.dat file in KernelShark and analyse the CPUs, threads, interrupts, kworker threads and user-space threads. This makes it easy to see what is blocking the system.
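Before rebuilding anything, it may be worth checking which of the tracing options listed above are already enabled in the running kernel. The following is a minimal sketch: the helper name missing_trace_options is invented here, and the config path /boot/config-$(uname -r) is an assumption that holds on many desktop distributions. Embedded systems often expose the config as /proc/config.gz instead, in which case it has to be decompressed to a file first.

```shell
# missing_trace_options CONFIG_FILE: print each tracing option from the list
# above that is not set to "y" in the given kernel config file.
missing_trace_options() {
    for opt in CONFIG_FTRACE CONFIG_FUNCTION_TRACER CONFIG_FUNCTION_GRAPH_TRACER \
               CONFIG_SCHED_TRACER CONFIG_FTRACE_SYSCALLS CONFIG_STACK_TRACER \
               CONFIG_DYNAMIC_FTRACE CONFIG_FUNCTION_PROFILER CONFIG_DEBUG_FS; do
        grep -q "^${opt}=y\$" "$1" || echo "$opt"
    done
}

# Assumed location of the running kernel's config; adjust for your system.
cfg="/boot/config-$(uname -r)"
if [ -r "$cfg" ]; then
    missing_trace_options "$cfg"
else
    echo "no readable config at $cfg" >&2
fi
```

An empty output means every listed option is already set, so no rebuild is needed for the trace-cmd session.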

failing to attach eBPF `kretprobes` to `napi_poll()` with bcc tools

The idea is to use argdist to measure the latency of napi_poll(), which returns the number of packets processed (called work). The ratio of napi_poll()'s execution latency to the number of packets processed would give me the average time taken to process each packet, in the form of a histogram.
I am using the following command:
argdist -H 'r:c:napi_poll():u64:$latency/$retval#avg time per packet (ns)'
which ends up giving me the error Failed to attach BPF to kprobe, and in dmesg I get a message like Could not insert probe at napi_poll+0: -2
I am just curious why I cannot attach kretprobes to napi_poll() when a similar trick works with net_rx_action()?
Most of the time the Failed to attach BPF to kprobe error is caused by an inlined function. As explained in the Kprobes documentation (section Kprobes Features and Limitations), Kprobes will fail to attach if the target function was inlined. Since napi_poll is static, it might have been inlined at compile time.
You can check in kernel symbols if napi_poll was inlined or not:
$ cat /boot/System.map-`uname -r` | grep " napi_poll"
$
$ cat /boot/System.map-`uname -r` | grep " net_rx_action"
ffffffff817d8110 t net_rx_action
On my system, napi_poll is inlined while net_rx_action is not.
There are several workarounds for this problem, depending on your goal.
If you don't mind recompiling your kernel, you can mark napi_poll with the kernel's noinline attribute to ensure it is not inlined.
If you can't change your kernel, the usual workaround is to find a calling function of napi_poll that provides the same information. A function called by napi_poll can also work if it provides enough information and is not inlined itself.
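When hunting for a caller or callee to probe instead, the System.map check above can be wrapped in a small helper and run over several candidate names. This is only a sketch: has_symbol is a hypothetical name, and it assumes the usual "address type name" column layout shared by System.map and /proc/kallsyms. It matches exact names only, so compiler-generated clones with suffixes such as .constprop.0 or .isra.0 will not be reported as present.

```shell
# has_symbol NAME [SYMBOL_FILE]: succeed if NAME appears as a symbol of its own.
# SYMBOL_FILE defaults to /proc/kallsyms; a System.map file has the same
# "address type name" columns.
has_symbol() {
    awk -v name="$1" '$3 == name { found = 1 } END { exit !found }' "${2:-/proc/kallsyms}"
}

# Check the two functions discussed above against the running kernel.
for fn in napi_poll net_rx_action; do
    if has_symbol "$fn"; then
        echo "$fn: symbol present, candidate kprobe target"
    else
        echo "$fn: no symbol found (possibly inlined)"
    fi
done
```

Passing a second argument, for example /boot/System.map-`uname -r`, checks a symbol file on disk instead of the running kernel.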

Recompile Linux Kernel not to use specific CPU register

I'm doing an experiment that writes the loop index into CPU register R11. I build it with gcc -ffixed-r11 to tell the compiler not to use that register, and then use perf to measure it.
But when I check the report (using perf script), the R11 value in most records is not what I expected. It should be a number sequence like 1, 2, 3 or 1, 4, 7, etc., but it is actually just a few fixed values (possibly overwritten by system calls?).
How can I make perf record the value I set in the register in my program? Or must I recompile the whole kernel with -ffixed-r11 to achieve this?
Thanks everyone.
You should not try to recompile the kernel when you just want to sample a register with perf. As I understand it, the kernel has its own set of registers and will not overwrite user R11. However, the syscall interface uses some fixed registers which can't be changed (can you try a different register?), and there are often glibc gateways to syscalls which may use some additional registers (they are user-space code, not kernel code, and are often generated or written in assembler). You may try using gdb to watch the register for changes and find out who modifies it. The question "gdb: breakpoint when register will have value 0xffaa" shows how: run gdb ./program, then use the gdb commands start; watch $r11; continue; where.
Two weeks ago there was a question, "perf-report show value of CPU register", about sampling register values with perf:
I follow this document and using perf record with --intr-regs=ax,bx,r15, trying to log additional CPU register information with PEBS record.
While that question was about x86 and PEBS, ARM may have --intr-regs implemented too. Check the output of perf record --intr-regs=\? (man perf-record: "To list the available registers use --intr-regs=\?") to find the support status and register names.
To print the registers, use the perf script -F ip,sym,iregs command. There is an example in one of the Linux commits:
# perf record --intr-regs=AX,SP usleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (8 samples) ]
# perf script -F ip,sym,iregs | tail -5
ffffffff8105f42a native_write_msr_safe AX:0xf SP:0xffff8802629c3c00
ffffffff8105f42a native_write_msr_safe AX:0xf SP:0xffff8802629c3c00
ffffffff81761ac0 _raw_spin_lock AX:0xffff8801bfcf8020 SP:0xffff8802629c3ce8
ffffffff81202bf8 __vma_adjust_trans_huge AX:0x7ffc75200000 SP:0xffff8802629c3b30
ffffffff8122b089 dput AX:0x101 SP:0xffff8802629c3c78
#
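To follow a single register across samples, the perf script output can be filtered down to one value per line. The following is a minimal sketch that assumes the three-part layout shown above (address, symbol, then NAME:value pairs); reg_values is a made-up helper name, and samples whose symbol is unresolved or printed differently would need a more careful parser.

```shell
# reg_values REG: read `perf script -F ip,sym,iregs` text on stdin and print
# "symbol value" for the requested register (e.g. AX, SP, R11).
reg_values() {
    awk -v reg="$1" '{
        for (i = 3; i <= NF; i++)
            if (index($i, reg ":") == 1)
                print $2, substr($i, length(reg) + 2)
    }'
}

# Sample lines copied from the output above.
reg_values AX <<'EOF'
ffffffff8105f42a native_write_msr_safe AX:0xf SP:0xffff8802629c3c00
ffffffff81761ac0 _raw_spin_lock AX:0xffff8801bfcf8020 SP:0xffff8802629c3ce8
ffffffff8122b089 dput AX:0x101 SP:0xffff8802629c3c78
EOF
```

In real use the input would come from a pipe, for example perf script -F ip,sym,iregs | reg_values R11, provided R11 is among the names reported by perf record --intr-regs=\? on that machine.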
If you need a cycle-accurate profile of down-to-the-metal CPU activity, then perf is not the right tool: it is at best an approximation, because it only samples the program at select points. See this video on perf by Clang developer Chandler Carruth.
Instead, you should single step through the program in order to monitor exactly what is happening to the registers. Or you could program your system bare metal without an OS, but that is probably outside the scope here.

QEMU-KVM and Perf Statistics

I have some VMs running on an IBM Power8 using QEMU-KVM, and I want to get statistics about LLC misses.
How can I do that in order to get statistics for each VM separately?
Do you want this data for the whole VM or for one application running in a VM?
I tested it on an Ubuntu 15.04 image under QEMU-KVM, and I am able to get it using perf. In this case, I am getting the LLC stats for a gzip operation. Take a look:
$ perf stat -e LLC-loads,LLC-load-misses gzip -9 /tmp/vmlinux
Performance counter stats for 'gzip -9 /tmp/vmlinux':
263,653 LLC-loads
10,753 LLC-load-misses # 4.08% of all LL-cache hits
4.006553608 seconds time elapsed
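The percentage perf prints next to LLC-load-misses appears to be simply the miss count divided by the load count. When collecting the two counters per VM or per application in a script, the same ratio can be recomputed from the raw numbers. A minimal sketch, with miss_ratio being a hypothetical helper name and the inputs taken from the output above:

```shell
# miss_ratio LOADS MISSES: print misses as a percentage of loads, two decimals.
miss_ratio() {
    awk -v loads="$1" -v misses="$2" 'BEGIN {
        if (loads == 0) { print "n/a"; exit }
        printf "%.2f\n", 100 * misses / loads
    }'
}

# Counter values from the gzip run above.
miss_ratio 263653 10753
```

This prints 4.08, matching the figure in the perf stat output.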
For more detailed/explanatory content about some POWER events, refer to these documents:
Comprehensive PMU Event Reference – POWER7
Commonly Used Metrics for Performance Analysis – POWER7
The former is more of a reference, and the latter is more of a tutorial (including a section about the cache/memory hierarchy with hits/misses).
Those should be listed in: https://www.power.org/events/Power7

perf record: can I specify multiple events and use different sample-after value for each of them

I'm trying to use the perf tool from the Linux kernel package to measure several raw PMU events. The perf-record manpage lists an "-l" option (Scale counter values), which would be useful in my case because I want to know the total counter value, not just the sample count. However, -l does not seem to be recognized; is this expected? How can I get a total count?
My other question is: how can I specify multiple events and use a different sample-after value for each of them? Something like perf record -c 10000,2000000,2000000 -e r2d4,r03c,r0c0
thank you
In the example I am showing, I have already installed libpfm4 so that perf knows the user-friendly event names. There is a rather clumsy syntax that allows the sampling period to be set per event:
levinth@ubuntu18-2:~$ perf record -e cpu/inst_retired.any_p,period=2000000/,cpu/cpu_clk_unhalted.thread_any,period=3000000/ -- sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data ]
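For the raw events in the question, the same per-event period syntax can be spelled out with explicit event and umask fields. The following is a sketch under assumptions: event_with_period is an invented helper, the cpu PMU is assumed to expose event and umask format fields (as it does on x86), and r2d4, r03c and r0c0 are read as umask followed by event code. The script only prints the command rather than running it.

```shell
# event_with_period EVENT_CODE UMASK PERIOD: print one cpu/.../ event term
# carrying its own sample-after value.
event_with_period() {
    printf 'cpu/event=0x%02x,umask=0x%02x,period=%d/' "$1" "$2" "$3"
}

# Raw events and sample-after values from the question: r2d4, r03c, r0c0.
events="$(event_with_period 0xd4 0x02 10000),$(event_with_period 0x3c 0x00 2000000),$(event_with_period 0xc0 0x00 2000000)"
echo "perf record -e $events -- sleep 5"
```

For total counts rather than samples, perf stat with the same event list (for example perf stat -e r2d4,r03c,r0c0) is probably the more direct tool.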
