An isolated-CPU cache-misses observation

I have an isolated CPU on core 12, and I execute the following perf command:
sudo perf stat -e DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK,ITLB_MISSES.MISS_CAUSES_A_WALK,DTLB_STORE_MISSES.MISS_CAUSES_A_WALK,cache-references,cache-misses -C 12 sleep 10
Performance counter stats for 'CPU(s) 12':
0 DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK
4 ITLB_MISSES.MISS_CAUSES_A_WALK
0 DTLB_STORE_MISSES.MISS_CAUSES_A_WALK
210 cache-references
176 cache-misses # 83.810 % of all cache refs
Usually I use perf to measure cache-misses for a process; how should I interpret this: cache-misses for a CPU core?
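For comparison, the per-process form I normally use looks like this (a minimal sketch; <pid> is a placeholder for the process id):
sudo perf stat -e cache-references,cache-misses -p <pid> sleep 10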

Related

Query a `perf.data` file for the total raw execution time of a symbol

I used perf to generate a perf file with perf record ./application. perf report shows me various things about it. How can I show the total time it took to run the application, and the total time to run a specific "symbol"/function? perf seems to often show percentages, but I want raw time, and I want "inclusive" time, i.e. including children.
perf v4.15.18 on Ubuntu Linux 18.04
perf is a statistical (sampling) profiler (in its default perf record mode), which means it has no exact timestamps for function entry and exit (tracing is required for exact data). Perf asks the OS kernel to generate interrupts thousands of times per second (4 kHz for the hardware PMU if -e cycles is supported, less for the software event -e cpu-clock). Every interrupt of program execution is recorded as a sample which contains the EIP (current instruction pointer), the pid (process/thread id), and a timestamp. When the program runs for several seconds there will be thousands of samples, and perf report can generate histograms from them: which parts of the program code (which functions) were executed more often than others. You will get a general overview, e.g. that some functions took around 30% of the program's execution time while others took 5%.
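For example, a hedged sketch of choosing the sampling event and rate explicitly (./application is a placeholder for your binary):
$ perf record -F 4000 -e cycles ./application   # hardware PMU event, sampled at ~4 kHz
$ perf record -e cpu-clock ./application        # software timer event when PMU access is unavailable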
perf report does not compute the total program execution time (it may estimate it by comparing the timestamps of the first and last samples, but that is not exact if there were off-CPU periods). But it does estimate the total event count (it is printed in the first line of the interactive TUI and is listed in the text output):
$ perf report |grep approx
# Samples: 1K of event 'cycles'
# Event count (approx.): 844373507
There is a perf report -n option which adds a "number of samples" column next to the percent column.
Samples: 1K of event 'cycles', Event count (approx.): 861416907
Overhead Samples Command Shared Object Symbol
42.36% 576 bc bc [.] _bc_rec_mul
37.49% 510 bc bc [.] _bc_shift_addsub.isra.3
14.90% 202 bc bc [.] _bc_do_sub
0.89% 12 bc bc [.] bc_free_num
But samples are not taken at equal intervals and they are less exact than the computed overhead (every sample may have a different weight). I recommend running perf stat ./application to get the real total running time and the total hardware counts for your application. It works best when your application has a stable running time (use perf stat -r 5 ./application to have the variation estimated by the tool, shown as "+- 0.28%" in the last column).
To include children, stack traces must be sampled at every interrupt. They are not sampled in the default perf record mode. This sampling is turned on with the -g or --call-graph dwarf options: perf record -g ./application or perf record --call-graph dwarf ./application. It is not simple to use correctly for preinstalled libraries or applications in Linux (as most distributions strip debug information from packages), but it can be used for your own applications compiled with debug information. The default -g, which is the same as --call-graph fp, requires that all code is compiled with the -fno-omit-frame-pointer gcc option; the non-default --call-graph dwarf is more reliable. With a correctly prepared program and libraries, a single-threaded application, and long enough stack samples (8 KB is the default; change it with --call-graph dwarf,65536), perf report should show around 99% for the _start and main functions (including children).
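For your own code, a minimal compile-and-record sketch (the file names and the 64 KB stack-sample size are placeholders, assuming gcc):
$ gcc -O2 -g -fno-omit-frame-pointer -o application application.c
$ perf record --call-graph dwarf,65536 ./application
$ perf report --children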
bc calculator compiled with -fno-omit-frame-pointer:
bc-no-omit-frame$ echo '3^123456%3' | perf record -g bc/bc
bc-no-omit-frame$ perf report
Samples: 1K of event 'cycles:uppp', Event count (approx.): 811063902
Children Self Command Shared Object Symbol
+ 98.33% 0.00% bc [unknown] [.] 0x771e258d4c544155
+ 98.33% 0.00% bc libc-2.27.so [.] __libc_start_main
+ 98.33% 0.00% bc bc [.] main
bc calculator with dwarf call graph:
$ echo '3^123456%3' | perf record --call-graph dwarf bc/bc
$ perf report
Samples: 1K of event 'cycles:uppp', Event count (approx.): 898828479
Children Self Command Shared Object Symbol
+ 98.42% 0.00% bc bc [.] _start
+ 98.42% 0.00% bc libc-2.27.so [.] __libc_start_main
+ 98.42% 0.00% bc bc [.] main
bc without debug info has incorrect call graph handling by perf in -g (fp) mode (no 99% for main):
$ cp bc/bc bc.strip
$ strip -d bc.strip
$ echo '3^123456%3' | perf record --call-graph fp ./bc.strip
Samples: 1K of event 'cycles:uppp', Event count (approx.): 841993392
Children Self Command Shared Object Symbol
+ 43.94% 43.94% bc.strip bc.strip [.] _bc_rec_mul
+ 39.73% 39.73% bc.strip bc.strip [.] _bc_shift_addsub.isra.3
+ 11.27% 11.27% bc.strip bc.strip [.] _bc_do_sub
+ 0.92% 0.92% bc.strip libc-2.27.so [.] malloc
Sometimes perf report --no-children can be useful to disable sorting on self+children overhead (it will sort by "self" overhead instead), for example when the call graph was not fully captured.

Bad disk performance after moving from Ubuntu to CentOS 7

A relatively old Dell R620 server (32 cores / 128 GB RAM) was working perfectly for years with Ubuntu. Plain OS install, no virtualization.
2 system disks in mirror (XFS)
6 RAID 5 disks for /var (XFS)
The server is used for a nightly check of a MySQL Xtrabackup file.
Before the reformat and move to CentOS 7 the process would finish by 08:00; now it is still running late at noon.
99% of the job is opening a large tar.gz file.
htop: there are only two processes doing something:
1. gzip -d : about 20% CPU
2. tar zxf Xtrabackup.tar.gz : about 4-7% CPU
iotop: it's steady at around 3 MB/s (read) / 20-25 MB/s (write), which is about 25% of what I would expect at minimum.
Memory: used 1 GB of 128 GB
The server is fully updated (OS / HW / firmware, including the disks' firmware).
iDRAC shows no problems.
Bottom line: the server is not working hard (to say the least), but performance is way off.
Any ideas would be appreciated.
vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 2 0 469072 0 130362040 0 0 57 341 0 0 0 0 98 2 0
0 2 0 456916 0 130374568 0 0 3328 24576 1176 3241 2 1 94 4 0
You have blocked processes and also I/O operations (around 20 MB/s), which tells me you have a few processes concurrently accessing the disk. What you can do to improve the performance is, instead of
tar zxf Xtrabackup.tar.gz
use
gzip -dc Xtrabackup.tar.gz | tar xvf -
The second form adds parallelism and can benefit from multiple processors. You can also benefit from increasing the pipe (FIFO) buffer (see the sketch below); check this answer for some ideas.
Also consider tuning the filesystem where the output files of tar are stored.
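A hedged illustration of the pipe-buffer suggestion, assuming the pv utility is available (the 128M buffer size is only an example):
# decompress to stdout, keep up to 128 MB in flight in the pipe, then extract
gzip -dc Xtrabackup.tar.gz | pv -B 128M | tar xvf -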

Disable timer interrupt on Linux kernel

I want to disable the timer interrupt on some of the cores (1-2) on my machine, which is an x86 running CentOS 7 with the RT patch. Both cores are isolated with nohz_full (you can see the cmdline below), but the timer interrupt continues to interrupt the real-time processes running on core 1 and core 2.
1. uname -r
3.10.0-693.11.1.rt56.632.el7.x86_64
2. cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-693.11.1.rt56.632.el7.x86_64 \
root=/dev/mapper/centos-root ro crashkernel=auto \
rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet \
default_hugepagesz=2M hugepagesz=2M hugepages=1024 \
intel_iommu=on isolcpus=1-2 irqaffinity=0 intel_idle.max_cstate=0 \
processor.max_cstate=0 idle=mwait tsc=perfect rcu_nocbs=1-2 rcu_nocb_poll \
nohz_full=1-2 nmi_watchdog=0
3. cat /proc/interrupts
CPU0 CPU1 CPU2
0: 29 0 0 IO-APIC-edge timer
.....
......
NMI: 0 0 0 Non-maskable interrupts
LOC: 835205157 308723100 308384525 Local timer interrupts
SPU: 0 0 0 Spurious interrupts
PMI: 0 0 0 Performance monitoring interrupts
IWI: 0 0 0 IRQ work interrupts
RTR: 0 0 0 APIC ICR read retries
RES: 347330843 309191325 308417790 Rescheduling interrupts
CAL: 0 935 935 Function call interrupts
TLB: 320 22 58 TLB shootdowns
TRM: 0 0 0 Thermal event interrupts
THR: 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 Machine check exceptions
MCP: 2 2 2 Machine check polls
CPUs/Clocksource:
4. lscpu | grep CPU.s
CPU(s): 3
On-line CPU(s) list: 0-2
NUMA node0 CPU(s): 0-2
5. cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
Thanks a lot for any help.
Moses
Even with nohz_full= you get some ticks on the isolated CPUs:
Some process-handling operations still require the occasional scheduling-clock tick. These operations include calculating CPU load, maintaining sched average, computing CFS entity vruntime, computing avenrun, and carrying out load balancing. They are currently accommodated by scheduling-clock tick every second or so. On-going work will eliminate the need even for these infrequent scheduling-clock ticks.
(Documentation/timers/NO_HZ.txt, cf. (Nearly) full tickless operation in 3.10 LWN, 2013)
Thus, you have to check the rate of the local timer, e.g.:
$ perf stat -a -A -e irq_vectors:local_timer_entry sleep 120
(while your isolated threads/processes are running)
Also, nohz_full= is only effective if there is just one runnable task on each isolated core. You can check that with e.g. ps -L -e -o pid,tid,user,state,psr,cmd and cat /proc/sched_debug.
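A hedged one-liner to list only the tasks currently placed on the isolated CPUs 1-2 (psr is the 5th column in this output format):
$ ps -L -e -o pid,tid,user,state,psr,cmd | awk '$5 == 1 || $5 == 2'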
Perhaps you need to move some (kernel) tasks to your house-keeping core, e.g.:
# tuna -U -t '*' -c 0 -m
You can get more insights into what timers are still active by looking at /proc/timer_list.
Another method to investigate the causes of possible interruptions is to use the function tracer (ftrace). See also Reducing OS jitter due to per-cpu kthreads for some examples.
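A hedged ftrace sketch for spotting remaining tick activity on the isolated cores (assuming debugfs is mounted at /sys/kernel/debug; the cpumask value 6 selects CPUs 1-2):
# cd /sys/kernel/debug/tracing
# echo 6 > tracing_cpumask
# echo 'timer:hrtimer_expire_entry' > set_event
# echo 1 > tracing_on; sleep 10; echo 0 > tracing_on
# cat trace | head -40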
I see nmi_watchdog=0 in your kernel parameters, but you don't disable the soft watchdog. Perhaps this is another timer tick source that would show up with ftrace.
You can disable all watchdogs with nowatchdog.
Btw, some of your kernel parameters seem to be off:
tsc=perfect - do you mean tsc=reliable? The 'perfect' value isn't documented in the kernel docs
idle=mwait - do you mean idle=poll? Again, I can't find the 'mwait' value in the kernel docs
intel_iommu=on - what's the purpose of this?

Can Linux perf compare per-thread performance?

I know perf can profile a single process or a single thread using perf stat -p pid/tid or perf top -p pid/tid.
But I want to profile per-thread within a process, compare events, and find which thread has high consumption, in order to optimize it. Can perf do this? If not, which tools can do that?
thanks.
There was a proposed patch to add a --per-thread option to perf stat (and with interval mode -I 1000 it is possible to see the current counters every second for every thread): https://lwn.net/Articles/649001/ "perf stat: Introduce --per-thread option", From: Jiri Olsa, Date: Tue, 23 Jun 2015
adding the possibility to display stat data per thread.
Allowing following commands and output:
$ perf stat -e cycles,instructions --per-thread -p 30190,30242 ^C
Performance counter stats for process id '30190,30242':
cat-30190 0 cycles
yes-30242 3,842,525,421 cycles
cat-30190 0 instructions
yes-30242 10,370,817,010 instructions
1.143155657 seconds time elapsed
Also works under interval mode:
$ perf stat -e cycles,instructions --per-thread -p 30190,30242 -I 1000
# time comm-pid counts unit events
1.000073435 cat-30190 89,058 cycles
1.000073435 yes-30242 3,360,786,902 cycles (100.00%)
1.000073435 cat-30190 14,066 instructions
1.000073435 yes-30242 9,069,937,462 instructions
2.000204830 cat-30190 0 cycles
2.000204830 yes-30242 3,351,667,626 cycles
2.000204830 cat-30190 0 instructions
2.000204830 yes-30242 9,045,796,885 instructions
^C
2.771286639 cat-30190 0 cycles
2.771286639 yes-30242 2,593,884,166 cycles
2.771286639 cat-30190 0 instructions
2.771286639 yes-30242 7,001,171,191 instructions
Available in here:
git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
perf/per_thread
Yes, of course.
You could use the perf_event_open() system call to open performance counters, and then use ioctl()/prctl() to control the counters and read() to collect them.
You can check the Linux man pages for all the details.
Have you seen this question?
How to profile multi-threaded C++ application on Linux?
I think you could start with valgrind:
http://valgrind.org/docs/manual/cl-manual.html
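For a per-thread breakdown with Callgrind, a hedged sketch (./application is a placeholder; --separate-threads writes one callgrind.out file per thread):
$ valgrind --tool=callgrind --separate-threads=yes ./application
$ callgrind_annotate callgrind.out.<pid>-<threadID>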

How do I see a Unix job's performance (execution time and CPU resources) after execution?

I have an incoming task that requires me to do performance testing on a Unix job. Is there something in Unix that would tell me a job's performance (execution time and CPU resources)? I plan to do a before-and-after comparison.
You can use the Linux profiling tool perf, e.g.:
perf stat ls
In my computer:
Performance counter stats for 'ls':
2.066571 task-clock # 0.804 CPUs utilized
1 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
267 page-faults # 0.129 M/sec
2,434,744 cycles # 1.178 GHz [57.78%]
1,384,929 stalled-cycles-frontend # 56.88% frontend cycles idle [52.01%]
1,035,939 stalled-cycles-backend # 42.55% backend cycles idle [98.96%]
1,894,339 instructions # 0.78 insns per cycle
# 0.73 stalled cycles per insn
370,865 branches # 179.459 M/sec
14,280 branch-misses # 3.85% of all branches
0.002569026 seconds time elapsed
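For a before-and-after comparison, a hedged sketch that repeats the run to average out noise (./my_job.sh is a placeholder for your job):
$ perf stat -r 5 ./my_job.sh
perf then reports the mean of each counter together with a "+- x.xx%" variation column, so the two configurations can be compared with less run-to-run noise.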
