Can Linux perf compare per - thread performance? - performance

I know perf can profile single progress or single thread use perf stat -p tid/pid or perf top -p tid/pid.
But I want to profile per-thread in a progress, and compare event, get which thread is high consumption, then to optimize it. Can perf do this ? if not, which tools can do that ?
thanks.

There was proposed patch to add --per-thread option to perf stat (and with interval mode -I 1000 it is possible to see current counters every second for every thread ): https://lwn.net/Articles/649001/ "perf stat: Introduce --per-thread option" From: Jiri Olsa, Date: Tue, 23 Jun 2015
adding the possibility to display stat data per thread.
Allowing following commands and output:
$ perf stat -e cycles,instructions --per-thread -p 30190,30242 ^C
Performance counter stats for process id '30190,30242':
cat-30190 0 cycles
yes-30242 3,842,525,421 cycles
cat-30190 0 instructions
yes-30242 10,370,817,010 instructions
1.143155657 seconds time elapsed
Also works under interval mode:
$ perf stat -e cycles,instructions --per-thread -p 30190,30242 -I
1000
# time comm-pid counts unit events
1.000073435 cat-30190 89,058 cycles
1.000073435 yes-30242 3,360,786,902 cycles (100.00%)
1.000073435 cat-30190 14,066 instructions
1.000073435 yes-30242 9,069,937,462 instructions
2.000204830 cat-30190 0 cycles
2.000204830 yes-30242 3,351,667,626 cycles
2.000204830 cat-30190 0 instructions
2.000204830 yes-30242 9,045,796,885 instructions
^C
2.771286639 cat-30190 0 cycles
2.771286639 yes-30242 2,593,884,166 cycles
2.771286639 cat-30190 0 instructions
2.771286639 yes-30242 7,001,171,191 instructions
Available in here:
git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
perf/per_thread

Yes. Of course.
You could use perf_event_open() system call to open performance counters.
and then use proctl/ioctl to read counters.
You could check linux man pages for all the details.

Do you see this question?
How to profile multi-threaded C++ application on Linux?
I think you could start with valgrind:
http://valgrind.org/docs/manual/cl-manual.html

Related

Query a `perf.data` file for the total raw execution time of a symbol

I used perf to generate a perf file with perf record ./application. perf report shows me various things about it. How can I show the total time it took to run the application, and the total time to run a specific "symbol"/function? perf seems to often show percentages, but I want raw time, and it want "inclusive" time, i.e. including children.
perf v4.15.18 on Ubuntu Linux 18.04
perf is statistical (sampling) profiler (in its default perf record mode), and it means it have no exact timestamps on function entry and exit (tracing is required for exact data). Perf asks OS kernel to generate interrupts thousands times per second (4 kHz for hardware PMU if -e cycles supported, less for software event -e cpu-clock). Every interrupt of program execution is recorded as sample which contains EIP (current instruction pointer), pid (process/thread id), timestamp of current time. When program runs for several seconds, there will be thousands of samples, and perf report can generate histograms from them: which parts of program code (which functions) were executed more often than other. You will get generic overview that some functions did take around 30% of program execution time while other - 5%.
perf report does not compute total program execution time (it may estimate it by comparing timestamps of first and last sample, but it is not exact if there were off-CPU periods). But it does estimate total event count (it is printed in first line in interactive TUI and is listed in text output):
$ perf report |grep approx
# Samples: 1K of event 'cycles'
# Event count (approx.): 844373507
There is perf report -n option which adds column "number of samples" next to percent column.
Samples: 1K of event 'cycles', Event count (approx.): 861416907
Overhead Samples Command Shared Object Symbol
42.36% 576 bc bc [.] _bc_rec_mul
37.49% 510 bc bc [.] _bc_shift_addsub.isra.3
14.90% 202 bc bc [.] _bc_do_sub
0.89% 12 bc bc [.] bc_free_num
But samples are taken not at same intervals and they are less exact than computed overhead (every sample may have different weight). I will recommend you to run perf stat ./application to have real total running time and total hardware counts for your application. It is better when your application has stable running time (do perf stat -r 5 ./application to have variation estimated by tool as "+- 0.28%" in last column)
To include children functions stack traces must be sampled at every interrupt. They are not sampled in default perf record mode. This sampling is turned on with -g or --call-graph dwarf options: perf record -g ./application or perf record --call-graph dwarf ./application. It is not simple to use it correctly for preinstalled libraries or applications in Linux (as most distributions strip debug information from packages), but can be used for your own applications compiled with debug information. The default -g which is same as --call-graph fp requires that all code is compiled with -fno-omit-frame-pointer gcc option, and non-default --call-graph dwarf is more reliable. With correctly prepared program and libraries, single-threaded application, and long enough stack size samples (8KB is default, change with --call-graph dwarf,65536), perf report should show around 99% for _start and main functions (including children).
bc calculator compiled with -fno-omit-frame-pointer:
bc-no-omit-frame$ echo '3^123456%3' | perf record -g bc/bc
bc-no-omit-frame$ perf report
Samples: 1K of event 'cycles:uppp', Event count (approx.): 811063902
Children Self Command Shared Object Symbol
+ 98.33% 0.00% bc [unknown] [.] 0x771e258d4c544155
+ 98.33% 0.00% bc libc-2.27.so [.] __libc_start_main
+ 98.33% 0.00% bc bc [.] main
bc calculator with dwarf call graph:
$ echo '3^123456%3' | perf record --call-graph dwarf bc/bc
$ perf report
Samples: 1K of event 'cycles:uppp', Event count (approx.): 898828479
Children Self Command Shared Object Symbol
+ 98.42% 0.00% bc bc [.] _start
+ 98.42% 0.00% bc libc-2.27.so [.] __libc_start_main
+ 98.42% 0.00% bc bc [.] main
bc without debug info has incorrect call graph handling by perf in -g (fp) mode (no 99% for main):
$ cp bc/bc bc.strip
$ strip -d bc.strip
$ echo '3^123456%3' | perf record --call-graph fp ./bc.strip
Samples: 1K of event 'cycles:uppp', Event count (approx.): 841993392
Children Self Command Shared Object Symbol
+ 43.94% 43.94% bc.strip bc.strip [.] _bc_rec_mul
+ 39.73% 39.73% bc.strip bc.strip [.] _bc_shift_addsub.isra.3
+ 11.27% 11.27% bc.strip bc.strip [.] _bc_do_sub
+ 0.92% 0.92% bc.strip libc-2.27.so [.] malloc
Sometimes perf report --no-children can be useful to disable sorting on self+children overhead (will sort by "self" overhead), for example when call graph was not fully captured.

An isolation cpu cache-misses observation

I have an isolation cpu in cpu core 12 , and I execute the following perf command :
sudo perf stat -e DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK,ITLB_MISSES.MISS_CAUSES_A_WALK,DTLB_STORE_MISSES.MISS_CAUSES_A_WALK,cache-references,cache-misses -C 12 sleep 10
Performance counter stats for 'CPU(s) 12':
0 DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK
4 ITLB_MISSES.MISS_CAUSES_A_WALK
0 DTLB_STORE_MISSES.MISS_CAUSES_A_WALK
210 cache-references
176 cache-misses # 83.810 % of all cache refs
Usually I use perf cache-misses for a process , how to interpret this : cache-misses for a cpu core ?!

Disable timer interrupt on Linux kernel

I want to disable the timer interrupt on some of the cores (1-2) on my machine which is a x86 running centos 7 with rt patch, both cores are isolated cores with nohz_full, (you can see the cmdline) but timer interrupt continues to interrupt the real time process which are running on core1 and core2.
1. uname -r
3.10.0-693.11.1.rt56.632.el7.x86_64
2. cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-693.11.1.rt56.632.el7.x86_64 \
root=/dev/mapper/centos-root ro crashkernel=auto \
rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet \
default_hugepagesz=2M hugepagesz=2M hugepages=1024 \
intel_iommu=on isolcpus=1-2 irqaffinity=0 intel_idle.max_cstate=0 \
processor.max_cstate=0 idle=mwait tsc=perfect rcu_nocbs=1-2 rcu_nocb_poll \
nohz_full=1-2 nmi_watchdog=0
3. cat /proc/interrupts
CPU0 CPU1 CPU2
0: 29 0 0 IO-APIC-edge timer
.....
......
NMI: 0 0 0 Non-maskable interrupts
LOC: 835205157 308723100 308384525 Local timer interrupts
SPU: 0 0 0 Spurious interrupts
PMI: 0 0 0 Performance monitoring interrupts
IWI: 0 0 0 IRQ work interrupts
RTR: 0 0 0 APIC ICR read retries
RES: 347330843 309191325 308417790 Rescheduling interrupts
CAL: 0 935 935 Function call interrupts
TLB: 320 22 58 TLB shootdowns
TRM: 0 0 0 Thermal event interrupts
THR: 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 Machine check exceptions
MCP: 2 2 2 Machine check polls
CPUs/Clocksource:
4. lscpu | grep CPU.s
CPU(s): 3
On-line CPU(s) list: 0-2
NUMA node0 CPU(s): 0-2
5. cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
Thanks a lot for any help.
Moses
Even with nohz_full= you get some ticks on the isolated CPUs:
Some process-handling operations still require the occasional scheduling-clock tick. These operations include calculating CPU load, maintaining sched average, computing CFS entity vruntime, computing avenrun, and carrying out load balancing. They are currently accommodated by scheduling-clock tick every second or so. On-going work will eliminate the need even for these infrequent scheduling-clock ticks.
(Documentation/timers/NO_HZ.txt, cf. (Nearly) full tickless operation in 3.10 LWN, 2013)
Thus, you have to check the rate of the local timer, e.g.:
$ perf stat -a -A -e irq_vectors:local_timer_entry sleep 120
(while your isolated threads/processes are running)
Also, nohz_full= is only effective if there is just one runnable task on each isolated core. You can check that with e.g. ps -L -e -o pid,tid,user,state,psr,cmd and cat /proc/sched_debug.
Perhaps you need to move some (kernel) tasks to you house-keeping core, e.g.:
# tuna -U -t '*' -c 0-4 -m
You can get more insights into what timers are still active by looking at /proc/timer_list.
Another method to investigate causes for possible interruption is to use the functional tracer (ftrace). See also Reducing OS jitter due to per-cpu kthreads for some examples.
I see nmi_watchdog=0 in your kernel parameters, but you don't disable the soft watchdog. Perhaps this is another timer tick source that would show up with ftrace.
You can disable all watchdogs with nowatchdog.
Btw, some of your kernel parameters seem to be off:
tsc=perfect - do you mean tsc=reliable? The 'perfect' value isn't documented in the kernel docs
idle=mwait - do you mean idle=poll? Again, I can't find the 'mwait' value in the kernel docs
intel_iommu=on - what's the purpose of this?

slurm jobs are pending but resources are available

I'm having some trouble with resource allocation in the sense that according to how I understood
the documentation and applied that to the config file I am expecting some behavior that does not happen.
Here is the relevant excerpt from the config file:
60 SchedulerType=sched/backfill
61 SchedulerParameters=bf_continue,bf_interval=45,bf_resolution=90,max_array_tasks=1000
62 #SchedulerAuth=
63 #SchedulerPort=
64 #SchedulerRootFilter=
65 SelectType=select/cons_res
66 SelectTypeParameters=CR_CPU_Memory
67 FastSchedule=1
...
102 NodeName=cn_burebista Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
103 PartitionName=main_compute Nodes=cn_burebista Shared=YES Default=YES MaxTime=76:00:00 State=UP
According to the above I have the backfill scheduler enabled with CPUs and Memory configured as
resources. I have 56 CPUs and 256GB of RAM in my resource pool. I would expect that he backfill
scheduler attempts to allocate the resources in order to fill as much of the cores as possible if there
are multiple processes asking for more resources than available. In my case I have the following queue:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2361 main_comp training mc PD 0:00 1 (Resources)
2356 main_comp skrf_ori jh R 58:41 1 cn_burebista
2357 main_comp skrf_ori jh R 44:13 1 cn_burebista
Jobs 2356 and 2357 are asking for 16 CPUs each, job 2361 is asking for 20 CPUs, meaning in total 52 CPUs
As seen from above job 2361(which is started by a different user) is marked as pending due to lack of resources although there are plenty of CPUs and memory available. "scontrol show nodes cn_burebista" gives me the following:
NodeName=cn_burebista Arch=x86_64 CoresPerSocket=14
CPUAlloc=32 CPUErr=0 CPUTot=56 CPULoad=21.65
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=cn_burebista NodeHostName=cn_burebista Version=16.05
OS=Linux RealMemory=256000 AllocMem=64000 FreeMem=178166 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
BootTime=2018-03-09T12:04:52 SlurmdStartTime=2018-03-20T10:35:50
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
I'm going through the documentation again and again but I cannot figure out what am I doing wrong ...
Why do I have the above situation? What should I change to my config to make this work?
Similar(not the same situation) question asked here but no answer
EDIT:
This is part of my script for the task:
3 # job parameters
4 #SBATCH --job-name=training_carlib
5 #SBATCH --output=training_job_%j.out
6
7 # needed resources
8 #SBATCH --ntasks=1
9 #SBATCH --cpus-per-task=20
10 #SBATCH --export=ALL
17 export OMP_NUM_THREADS=20
18 srun ./super_awesome_app
As it can be seen the request is made for 1 task per node and 20 CPUs per task. As the scheduler is configured to consider CPUs as resources and not cores and I ask explicitly for CPUs in the script why would the job ask for cores? This is my reference document.
EDIT 2:
Here's the output from the suggested command:
JobId=2383 JobName=training_carlib
UserId=mcetateanu(1000) GroupId=mcetateanu(1001) MCS_label=N/A
Priority=4294901726 Nice=0 Account=(null) QOS=(null)
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=3-04:00:00 TimeMin=N/A
SubmitTime=2018-03-27T10:30:38 EligibleTime=2018-03-27T10:30:38
StartTime=2018-03-28T10:27:36 EndTime=2018-03-31T14:27:36 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=main_compute AllocNode:Sid=zalmoxis:23690
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=cn_burebista
NumNodes=1 NumCPUs=20 NumTasks=1 CPUs/Task=20 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/train_classifier.sh
WorkDir=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier
StdErr=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/training_job_2383.out
StdIn=/dev/null
StdOut=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/training_job_2383.out
Power=
In your configuration, Slurm cannot allocate two jobs on two hardware threads of the same core. In your example, Slurm would thus need at least 10 cores completely free to start your job.
Also, if the default block:cyclic task affinity configuration is used, Slurm cycles over sockets to distribute tasks in a node.
So what is happening is the following I believe:
Job 2356 submitted, being allocated 16 physical cores because of the default task distribution
Job 2357 submitted, being allocated 2 hardware threads on 8 physical cores, overriding default task distribution to get the job to run
Job 2361 submitted, waiting for at least 10 physical cores to become available.
You can get the exact CPU numbers allocated to a job using
scontrol show -dd job <jobid>
To configure Slurm in a way that it considers hardware threads exactly as if they were core, you need indeed to define
SelectTypeParameters=CR_CPU_Memory
but you also need to specify CPUs directly in the node definition
NodeName=cn_burebista CPUs=56 RealMemory=256000 State=UNKNOWN
and not let Slurm compute CPUs from Sockets, CoresPerSocket, and ThreadsPerCore.
See the section about ThreadsPerCore in the slurm.conf manpage section about node definition.

Oracle Database CPU Usage on AIX

I want to find the CPU process usage for all Oracle processes on an AIX box.
On Solaris I can do the following:
prstat -n 400 -c -s cpu -p 9013 1 1
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
9013 oracle 3463M 2928M sleep 53 0 0:00:35 0.9% oracle/2
Total: 1 processes, 2 lwps, load averages: 2.25, 2.32, 2.40
This basically reports the CPU usage for a given process ID (in this case 9013). Given a list of all Oracle PID’s I can use this command to get the CPU usage for each one, sum them up and hey presto I have my Oracle database CPU usage.
How can I get the same with AIX?
Thanks
You can try nmon or topas, which will show the current %CPU. You might also want to look into using WLM to create a class for all the Oracle processes, then use wlmstat to see the CPU usage for that class. That would save you the trouble of adding them up manually.

Resources