Also, how much faster would you estimate Rust compilation might be if its compiler were written from scratch (like Go) instead of using LLVM?
The most time-consuming parts of Rust compilation are generally the optimisation passes and final codegen, though there are probably degenerate situations in which other parts are the bottleneck. For example, building rust-csv with -Z time-passes yields this for the final csv crate:
time: 0.000; rss: 174MB monomorphization_collector_root_collections
time: 0.002; rss: 53MB parse_crate
time: 0.000; rss: 53MB attributes_injection
time: 0.000; rss: 53MB recursion_limit
time: 0.000; rss: 53MB plugin_loading
time: 0.000; rss: 53MB plugin_registration
time: 0.000; rss: 53MB pre_AST_expansion_lint_checks
time: 0.000; rss: 56MB crate_injection
time: 0.000; rss: 57MB pre_AST_expansion_lint_checks
time: 0.000; rss: 57MB pre_AST_expansion_lint_checks
time: 0.000; rss: 58MB pre_AST_expansion_lint_checks
time: 0.000; rss: 58MB pre_AST_expansion_lint_checks
time: 0.072; rss: 178MB monomorphization_collector_graph_walk
time: 0.000; rss: 60MB pre_AST_expansion_lint_checks
time: 0.009; rss: 178MB partition_and_assert_distinct_symbols
time: 0.000; rss: 178MB find_cgu_reuse
time: 0.000; rss: 60MB pre_AST_expansion_lint_checks
time: 0.001; rss: 60MB pre_AST_expansion_lint_checks
time: 0.000; rss: 61MB pre_AST_expansion_lint_checks
time: 0.000; rss: 61MB pre_AST_expansion_lint_checks
time: 0.014; rss: 184MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.1)
time: 0.010; rss: 187MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.2)
time: 0.096; rss: 90MB expand_crate
time: 0.000; rss: 90MB check_unused_macros
time: 0.096; rss: 90MB macro_expand_crate
time: 0.000; rss: 90MB maybe_building_test_harness
time: 0.001; rss: 90MB AST_validation
time: 0.000; rss: 90MB maybe_create_a_macro_crate
time: 0.015; rss: 190MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.11)
time: 0.001; rss: 94MB complete_gated_feature_checking
time: 0.132; rss: 94MB configure_and_expand
time: 0.000; rss: 94MB prepare_outputs
time: 0.008; rss: 192MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.3)
time: 0.028; rss: 99MB hir_lowering
time: 0.003; rss: 99MB early_lint_checks
time: 0.001; rss: 101MB setup_global_ctxt
time: 0.000; rss: 101MB dep_graph_tcx_init
time: 0.001; rss: 101MB create_global_ctxt
time: 0.022; rss: 195MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.0)
time: 0.000; rss: 104MB looking_for_entry_point
time: 0.000; rss: 104MB looking_for_plugin_registrar
time: 0.000; rss: 104MB looking_for_derive_registrar
time: 0.031; rss: 107MB misc_checking_1
time: 0.023; rss: 113MB type_collecting
time: 0.001; rss: 113MB impl_wf_inference
time: 0.018; rss: 199MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.14)
time: 0.213; rss: 201MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.1)
time: 0.000; rss: 126MB unsafety_checking
time: 0.000; rss: 126MB orphan_checking
time: 0.052; rss: 126MB coherence_checking
time: 0.177; rss: 203MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.3)
time: 0.014; rss: 203MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.15)
time: 0.036; rss: 204MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.10)
time: 0.202; rss: 205MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.0)
time: 0.172; rss: 205MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.14)
time: 0.335; rss: 205MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.2)
time: 0.006; rss: 205MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.8)
time: 0.326; rss: 206MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.11)
time: 0.157; rss: 131MB wf_checking
time: 0.012; rss: 206MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.6)
time: 0.007; rss: 206MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.5)
time: 0.004; rss: 206MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.7)
time: 0.036; rss: 131MB item_types_checking
time: 0.463; rss: 206MB codegen_to_LLVM_IR
time: 0.000; rss: 206MB assert_dep_graph
time: 0.000; rss: 206MB serialize_dep_graph
time: 0.547; rss: 206MB codegen_crate
time: 0.010; rss: 166MB free_global_ctxt
time: 0.003; rss: 166MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.12)
time: 0.052; rss: 166MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.7)
time: 0.075; rss: 167MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.5)
time: 0.005; rss: 167MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.4)
time: 0.033; rss: 167MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.12)
time: 0.216; rss: 167MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.15)
time: 0.004; rss: 167MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.13)
time: 0.005; rss: 168MB LLVM_module_optimize_function_passes(bstr.9t899r4h-cgu.9)
time: 0.197; rss: 170MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.10)
time: 0.040; rss: 171MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.13)
time: 0.140; rss: 171MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.6)
time: 0.055; rss: 171MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.9)
time: 0.088; rss: 171MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.4)
time: 0.222; rss: 171MB LLVM_module_optimize_module_passes(bstr.9t899r4h-cgu.8)
time: 0.049; rss: 192MB LLVM_lto_optimize(bstr.9t899r4h-cgu.8)
time: 0.047; rss: 193MB LLVM_lto_optimize(bstr.9t899r4h-cgu.0)
time: 0.117; rss: 197MB LLVM_lto_optimize(bstr.9t899r4h-cgu.14)
time: 0.108; rss: 198MB LLVM_lto_optimize(bstr.9t899r4h-cgu.11)
time: 0.131; rss: 199MB LLVM_lto_optimize(bstr.9t899r4h-cgu.15)
time: 0.161; rss: 199MB LLVM_lto_optimize(bstr.9t899r4h-cgu.1)
time: 0.115; rss: 203MB LLVM_lto_optimize(bstr.9t899r4h-cgu.10)
time: 0.315; rss: 207MB LLVM_lto_optimize(bstr.9t899r4h-cgu.2)
time: 0.574; rss: 148MB item_bodies_checking
time: 0.844; rss: 148MB type_check_crate
time: 0.015; rss: 208MB LLVM_lto_optimize(bstr.9t899r4h-cgu.6)
time: 0.118; rss: 209MB LLVM_lto_optimize(bstr.9t899r4h-cgu.4)
time: 0.043; rss: 210MB LLVM_lto_optimize(bstr.9t899r4h-cgu.5)
time: 0.030; rss: 148MB match_checking
time: 0.014; rss: 210MB LLVM_lto_optimize(bstr.9t899r4h-cgu.12)
time: 0.019; rss: 150MB liveness_and_intrinsic_checking
time: 0.049; rss: 150MB misc_checking_2
time: 0.087; rss: 210MB LLVM_lto_optimize(bstr.9t899r4h-cgu.3)
time: 0.050; rss: 210MB LLVM_lto_optimize(bstr.9t899r4h-cgu.13)
time: 0.023; rss: 210MB LLVM_lto_optimize(bstr.9t899r4h-cgu.7)
time: 0.031; rss: 210MB LLVM_lto_optimize(bstr.9t899r4h-cgu.9)
time: 1.256; rss: 213MB LLVM_passes(crate)
time: 0.000; rss: 213MB join_worker_thread
time: 0.812; rss: 213MB finish_ongoing_codegen
time: 0.000; rss: 213MB serialize_work_products
time: 0.000; rss: 213MB link_binary_check_files_are_writeable
time: 0.003; rss: 213MB link_rlib
time: 0.000; rss: 213MB link_binary_remove_temps
time: 0.004; rss: 213MB link_binary
time: 0.004; rss: 213MB link_crate
time: 0.000; rss: 213MB llvm_dump_timing_file
time: 0.817; rss: 213MB link
time: 2.690; rss: 213MB total
As you can see, the items with large time counts are pretty much all LLVM optimisation passes. That's why cargo check is so useful: it just type-checks and stops there, so it's generally much, much faster than a full codegen, even a debug one.
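If you want to reproduce this kind of breakdown, or see the cargo check difference for yourself, the invocations look roughly like this (a minimal sketch; -Z flags need a nightly toolchain, and the numbers will obviously depend on your machine):
cargo +nightly rustc -- -Z time-passes   # per-pass timings for the final crate only
cargo clean && time cargo check          # type-check only
cargo clean && time cargo build          # full (debug) codegen through LLVM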
AFAIK this is a mix of advanced optimisations being plain expensive, and rustc historically generating large, complex IR and leaving a bit of a mess for LLVM to untangle. And LLVM itself is heavy and not exactly lightning-fast.
I believe this is slowly improving through a mix of better codegen and adding or moving optimisation passes to MIR, plus ongoing, extensive effort at chipping away at the issue from various angles.
Also, how much faster would you estimate Rust compilation might be if its compiler were written from scratch (like Go) instead of using LLVM?
Well, if you want fast-and-inefficient, that's pretty much the value proposition of using Cranelift as the debug backend, so you can get numbers there.
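For reference, trying the Cranelift backend on a recent nightly looks roughly like the sketch below (the component name and flags have changed over time, so check the rustc_codegen_cranelift README for the current instructions):
rustup component add rustc-codegen-cranelift-preview --toolchain nightly
CARGO_PROFILE_DEV_CODEGEN_BACKEND=cranelift cargo +nightly build -Z codegen-backend
That gives you a debug build that skips LLVM entirely, which is exactly the fast-and-inefficient trade-off described above.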
But getting the optimisations in is much more expensive in both person-hours and actual CPU time.
And "debug rust" is extremely slow, to the extent that it's one of the first thing people check when somebody asks why their rust is slower than their python, 95% of the time it's because they were compiling in debug.
As Calvin Weng's recent series at PingCAP notes, Rust's model and purpose rely very much on heavy optimisations in order to fulfill its goals and promises.
Related
I've been working on my setup lately and have been trying to determine where my 2.3s terminal load times are coming from. I'm fairly new to Linux performance testing in general, but I have determined a few things.
The first thing I should mention is that terminal is a shell script with the following contents:
#!/bin/sh
# Set the X11 scale factor, then forward all arguments to alacritty
WINIT_X11_SCALE_FACTOR=1.5 alacritty "$@"
The stats on launching the terminal program (alacritty) and its shell (zsh -l):
> perf stat -r 10 -d terminal -e $SHELL -slc exit
Performance counter stats for 'terminal -e /usr/bin/zsh -slc exit' (10 runs):
602.55 msec task-clock # 0.261 CPUs utilized ( +- 1.33% )
957 context-switches # 1.532 K/sec ( +- 0.42% )
92 cpu-migrations # 147.298 /sec ( +- 1.89% )
68,150 page-faults # 109.113 K/sec ( +- 0.13% )
2,188,445,151 cycles # 3.504 GHz ( +- 0.17% )
3,695,337,515 instructions # 1.70 insn per cycle ( +- 0.08% )
791,333,786 branches # 1.267 G/sec ( +- 0.06% )
14,007,258 branch-misses # 1.78% of all branches ( +- 0.09% )
10,893,173,535 slots # 17.441 G/sec ( +- 0.13% )
3,574,546,556 topdown-retiring # 30.5% Retiring ( +- 0.11% )
2,888,937,632 topdown-bad-spec # 24.0% Bad Speculation ( +- 0.41% )
3,125,577,758 topdown-fe-bound # 27.1% Frontend Bound ( +- 0.16% )
2,189,183,796 topdown-be-bound # 18.4% Backend Bound ( +- 0.47% )
924,852,782 L1-dcache-loads # 1.481 G/sec ( +- 0.07% )
38,308,478 L1-dcache-load-misses # 4.16% of all L1-dcache accesses ( +- 0.09% )
3,445,566 LLC-loads # 5.517 M/sec ( +- 0.20% )
725,990 LLC-load-misses # 20.97% of all LL-cache accesses ( +- 0.36% )
2.30683 +- 0.00331 seconds time elapsed ( +- 0.14% )
The stats on launching just the shell (zsh):
Performance counter stats for '/usr/bin/zsh -i -c exit' (10 runs):
1,548.56 msec task-clock # 0.987 CPUs utilized ( +- 3.28% )
525 context-switches # 323.233 /sec ( +- 21.17% )
16 cpu-migrations # 9.851 /sec ( +- 11.33% )
90,616 page-faults # 55.791 K/sec ( +- 2.63% )
6,559,830,564 cycles # 4.039 GHz ( +- 3.18% )
11,317,955,247 instructions # 1.68 insn per cycle ( +- 3.69% )
2,351,473,571 branches # 1.448 G/sec ( +- 3.46% )
46,539,165 branch-misses # 1.91% of all branches ( +- 1.31% )
32,783,001,655 slots # 20.184 G/sec ( +- 3.18% )
10,776,867,769 topdown-retiring # 32.5% Retiring ( +- 3.28% )
5,729,353,491 topdown-bad-spec # 18.2% Bad Speculation ( +- 6.90% )
11,083,567,578 topdown-fe-bound # 33.3% Frontend Bound ( +- 2.34% )
5,458,201,823 topdown-be-bound # 15.9% Backend Bound ( +- 4.51% )
3,180,211,376 L1-dcache-loads # 1.958 G/sec ( +- 3.10% )
126,282,947 L1-dcache-load-misses # 3.85% of all L1-dcache accesses ( +- 2.37% )
14,347,257 LLC-loads # 8.833 M/sec ( +- 1.48% )
2,386,047 LLC-load-misses # 16.33% of all LL-cache accesses ( +- 0.77% )
1.5682 +- 0.0550 seconds time elapsed ( +- 3.51% )
The stats on launching the shell (zsh) with zmodload zsh/zprof:
num calls time self name
-----------------------------------------------------------------------------------
1) 31 78.54 2.53 77.09% 50.07 1.62 49.14% antigen
2) 2 23.24 11.62 22.81% 15.93 7.96 15.63% compinit
3) 2 7.31 3.66 7.18% 7.31 3.66 7.18% compaudit
4) 1 8.27 8.27 8.12% 7.29 7.29 7.16% _autoenv_source
5) 1 6.93 6.93 6.80% 6.93 6.93 6.80% detect-clipboard
6) 1 5.18 5.18 5.08% 5.18 5.18 5.08% _autoenv_hash_pair
7) 1 2.49 2.49 2.45% 2.45 2.45 2.41% _zsh_highlight_load_highlighters
8) 2 1.01 0.51 0.99% 1.01 0.51 0.99% _autoenv_stack_entered_contains
9) 10 0.91 0.09 0.89% 0.91 0.09 0.89% add-zsh-hook
10) 1 0.94 0.94 0.92% 0.87 0.87 0.85% _autoenv_stack_entered_add
11) 1 0.85 0.85 0.84% 0.85 0.85 0.84% async_init
12) 1 0.49 0.49 0.49% 0.49 0.49 0.48% _zsh_highlight__function_callable_p
13) 1 0.45 0.45 0.44% 0.45 0.45 0.44% colors
14) 3 0.38 0.13 0.37% 0.35 0.12 0.35% add-zle-hook-widget
15) 6 0.34 0.06 0.34% 0.34 0.06 0.34% is-at-least
16) 2 15.14 7.57 14.86% 0.27 0.13 0.26% _autoenv_chpwd_handler
17) 1 5.46 5.46 5.36% 0.26 0.26 0.26% _autoenv_authorized_env_file
18) 1 0.23 0.23 0.22% 0.23 0.23 0.22% regexp-replace
19) 11 0.19 0.02 0.19% 0.19 0.02 0.19% _autoenv_debug
20) 2 0.10 0.05 0.10% 0.10 0.05 0.10% wrap_clipboard_widgets
21) 16 0.09 0.01 0.09% 0.09 0.01 0.09% compdef
22) 1 0.08 0.08 0.08% 0.08 0.08 0.08% (anon) [/home/nate-wilkins/.antigen/bundles/zsh-users/zsh-autosuggestions/zsh-autosuggestions.zsh:458]
23) 2 0.05 0.02 0.05% 0.05 0.02 0.05% bashcompinit
24) 1 0.06 0.06 0.06% 0.04 0.04 0.04% _autoenv_stack_entered_remove
25) 1 5.50 5.50 5.40% 0.03 0.03 0.03% _autoenv_check_authorized_env_file
26) 1 0.04 0.04 0.04% 0.03 0.03 0.03% complete
27) 1 0.88 0.88 0.87% 0.03 0.03 0.03% async
28) 1 0.03 0.03 0.02% 0.03 0.03 0.02% (anon) [/usr/share/zsh/functions/Misc/add-zle-hook-widget:28]
29) 2 0.01 0.00 0.01% 0.01 0.00 0.01% env_default
30) 1 0.01 0.01 0.01% 0.01 0.01 0.01% _zsh_highlight__is_function_p
31) 1 0.00 0.00 0.00% 0.00 0.00 0.00% _zsh_highlight_bind_widgets
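For reference, a zprof table like the one above is typically produced by bracketing the shell startup in ~/.zshrc (a sketch; my exact hook points may differ slightly):
# first line of ~/.zshrc
zmodload zsh/zprof
# ... rest of the startup configuration ...
# last line of ~/.zshrc: print the profile once startup finishes
zprof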
Lastly I have a perf run with a corresponding flamegraph:
perf-run --out alacritty --command "terminal -e $SHELL -slc exit"
But I'm not sure how to interpret the flamegraph since it seems to have everything in it and not just the command that was run.
So my question is:
What is taking up the most time in my terminal setup, and is there another approach I could use to better determine where the problem is coming from?
I have some Fortran code that, when compiled with gfortran, is faster than when compiled with ifort. What I usually find on the internet is the opposite case...
I tried running Intel VTune to identify hotspots that differ between the executables, but I couldn't make sense of the differences.
I'm not sure what can cause this difference. Here is the perf output:
gfortran:
Performance counter stats for 'build/gnuRelease/NBODY inputFile temp' (10 runs):
2,489.36 msec task-clock:u # 0.986 CPUs utilized ( +- 0.21% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
589 page-faults:u # 0.237 K/sec ( +- 0.05% )
10,678,130,527 cycles:u # 4.290 GHz ( +- 0.20% )
31,102,858,644 instructions:u # 2.91 insn per cycle ( +- 0.00% )
3,537,572,458 branches:u # 1421.078 M/sec ( +- 0.00% )
566,054 branch-misses:u # 0.02% of all branches ( +- 5.14% )
2.5235 +- 0.0150 seconds time elapsed ( +- 0.59% )
ifort:
Performance counter stats for 'build/ifortRelease/NBODY inputFile temp' (10 runs):
2,834.44 msec task-clock:u # 0.978 CPUs utilized ( +- 0.14% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
2,600 page-faults:u # 0.917 K/sec ( +- 0.01% )
12,146,500,211 cycles:u # 4.285 GHz ( +- 0.14% )
36,441,911,065 instructions:u # 3.00 insn per cycle ( +- 0.00% )
2,936,917,079 branches:u # 1036.154 M/sec ( +- 0.00% )
339,226 branch-misses:u # 0.01% of all branches ( +- 3.74% )
2.8991 +- 0.0165 seconds time elapsed ( +- 0.57% )
The page-faults metric caught my eye, but I'm not sure what it means...
UPDATE:
gfortran version: 10.2.0
ifort version: 19.1.3.304
Intel Xeon(R)
UPDATE:
A similar example: Puzzling performance difference between ifort and gfortran
From that example:
When the complex IF statement is removed, gfortran takes about 4 times as much time (10-11 seconds). This is to be expected since the statement approximately throws out about 75% of the numbers, avoiding to do the SQRT on them. On the other hand, ifort only uses slightly more time. My guess is that something goes wrong when ifort tries to optimize the IF statement.
This seems to be relevant to my case too.
I want to use ftrace to trace a specific process's wakeup latency.
But ftrace will only record the max latency, and set_ftrace_pid doesn't help.
Does anybody know how to do that?
Thank you very much.
You can use the tool I wrote to trace a specific process's wakeup latency:
https://gitee.com/openeuler-competition/summer2021-42
This tool supports analyzing the overall scheduling latency of the system from ftrace raw data.
Adding to the suggestion by @Qeole, you can also use the perf sched utility to obtain a much more detailed trace of a process's wakeup latency. While eBPF tools like runqlat will give you a higher-level overview, perf sched will help you capture all scheduler events and thereby observe and inspect the wakeup latencies of a process in much more detail. Note that running perf sched to monitor a long-running, computationally intensive process comes with its own overhead issues.
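For completeness, runqlat comes from the BCC collection; a quick per-process run-queue latency histogram might look like the sketch below (the tool name and install path vary by distribution, and my_process is just a placeholder):
# Debian/Ubuntu ship the BCC tools with a -bpfcc suffix; other distros use
# /usr/share/bcc/tools/runqlat or plain `runqlat`.
sudo runqlat-bpfcc -p "$(pidof my_process)" 5 1   # one 5-second histogram for that PID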
You first need to run perf sched record. From the man page:
'perf sched record <command>' records the scheduling events of an arbitrary workload.
For example, say you want to trace the wakeup latencies of the command ls.
sudo perf sched record ls
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.453 MB perf.data (562 samples) ]
You will see that a perf.data file is generated in the directory where the command was run. This file contains all of the raw scheduler events, and the commands below help make sense of them.
You can run perf sched latency to obtain per-task latency summaries, including the number of context switches per task and the average and maximum delay.
sudo perf sched latency
-----------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at |
-----------------------------------------------------------------------------------------------------------------
migration/4:35 | 0.000 ms | 1 | avg: 0.003 ms | max: 0.003 ms | max at: 231259.727951 s
kworker/u16:0-p:6962 | 0.103 ms | 20 | avg: 0.003 ms | max: 0.035 ms | max at: 231259.729314 s
ls:7118 | 1.752 ms | 1 | avg: 0.003 ms | max: 0.003 ms | max at: 231259.727898 s
alsa-sink-Gener:3133 | 0.000 ms | 1 | avg: 0.003 ms | max: 0.003 ms | max at: 231259.729321 s
Timer:5229 | 0.035 ms | 1 | avg: 0.002 ms | max: 0.002 ms | max at: 231259.729625 s
AudioIP~ent RPC:7597 | 0.040 ms | 1 | avg: 0.002 ms | max: 0.002 ms | max at: 231259.729698 s
MediaTimer #1:7075 | 0.025 ms | 1 | avg: 0.002 ms | max: 0.002 ms | max at: 231259.729651 s
gnome-terminal-:4989 | 0.254 ms | 24 | avg: 0.001 ms | max: 0.003 ms | max at: 231259.729358 s
MediaPl~back #3:7098 | 0.034 ms | 1 | avg: 0.001 ms | max: 0.001 ms | max at: 231259.729670 s
kworker/u16:2-p:5987 | 0.144 ms | 32 | avg: 0.001 ms | max: 0.002 ms | max at: 231259.729193 s
perf:7114 | 3.503 ms | 1 | avg: 0.001 ms | max: 0.001 ms | max at: 231259.729656 s
kworker/u16:1-p:7112 | 0.184 ms | 52 | avg: 0.001 ms | max: 0.001 ms | max at: 231259.729201 s
chrome:5713 | 0.067 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s
-----------------------------------------------------------------------------------------------------------------
TOTAL: | 6.141 ms | 137 |
---------------------------------------------------
You can see the ls process, as well as the perf process itself, among all the other processes that were running while the perf sched record command was active.
You can run perf sched timehist to obtain a detailed summary of the individual scheduler events.
sudo perf sched timehist
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
--------------- ------ ------------------------------ --------- --------- ---------
231259.726350 [0005] <idle> 0.000 0.000 0.000
231259.726465 [0005] chrome[5713] 0.000 0.000 0.114
231259.727447 [0005] <idle> 0.114 0.000 0.981
231259.727513 [0005] chrome[5713] 0.981 0.000 0.066
231259.727898 [0004] <idle> 0.000 0.000 0.000
231259.727951 [0004] perf[7118] 0.000 0.002 0.052
231259.727958 [0002] perf[7114] 0.000 0.000 0.000
231259.727960 [0000] <idle> 0.000 0.000 0.000
231259.727964 [0004] migration/4[35] 0.000 0.002 0.013
231259.729193 [0006] <idle> 0.000 0.000 0.000
231259.729201 [0002] <idle> 0.000 0.000 1.242
231259.729201 [0003] <idle> 0.000 0.000 0.000
231259.729216 [0002] kworker/u16:1-p[7112] 0.006 0.001 0.005
231259.729219 [0002] <idle> 0.005 0.000 0.002
231259.729222 [0002] kworker/u16:1-p[7112] 0.002 0.000 0.002
231259.729222 [0006] <idle> 0.001 0.000 0.007
The wait time is the time the task spent waiting to be woken up, and the sch delay is the time between the wakeup and the task actually starting to run.
You can filter the timehist output by PID; since the PID of the ls command was 7118 (you can see this in the perf sched latency output), run:
sudo perf sched timehist -p 7118
Samples do not have callchains.
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
--------------- ------ ------------------------------ --------- --------- ---------
231259.727951 [0004] perf[7118] 0.000 0.002 0.052
231259.729657 [0000] ls[7118] 0.009 0.000 1.697
Now, in order to observe the wakeup events for this process, add the -w command-line switch to the previous command:
sudo perf sched timehist -p 7118 -w
Samples do not have callchains.
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
--------------- ------ ------------------------------ --------- --------- ---------
231259.727895 [0002] perf[7114] awakened: perf[7118]
231259.727948 [0004] perf[7118] awakened: migration/4[35]
231259.727951 [0004] perf[7118] 0.000 0.002 0.052
231259.729190 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729199 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729207 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729209 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729212 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729218 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729221 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729223 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729226 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729231 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729233 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729237 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729240 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729242 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
-------------------------------------- # some other events here
231259.729548 [0000] ls[7118] awakened: kworker/u16:0-p[6962]
231259.729553 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729555 [0000] ls[7118] awakened: kworker/u16:0-p[6962]
231259.729557 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729562 [0000] ls[7118] awakened: kworker/u16:0-p[6962]
231259.729564 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729655 [0000] ls[7118] awakened: perf[7114]
231259.729657 [0000] ls[7118] 0.009 0.000 1.697
The kworker threads interrupt the initial execution of perf and its child process ls at 231259.729190 s. You can see that the perf process eventually gets woken up, to actually execute at 231259.729655 s, after all of the kernel worker threads have done some work. You can get a more detailed per-CPU visualization of the above timehist details using the command below:
sudo perf sched timehist -p 7118 -wV
Samples do not have callchains.
time cpu 012345678 task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
--------------- ------ --------- ------------------------------ --------- --------- ---------
231259.727895 [0002] perf[7114] awakened: perf[7118]
231259.727948 [0004] perf[7118] awakened: migration/4[35]
231259.727951 [0004] s perf[7118] 0.000 0.002 0.052
231259.729190 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729199 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729207 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729209 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729212 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
-------------------------------------------------- # some other events here
231259.729562 [0000] ls[7118] awakened: kworker/u16:0-p[6962]
231259.729564 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729655 [0000] ls[7118] awakened: perf[7114]
231259.729657 [0000] s ls[7118] 0.009 0.000 1.697
The CPU visualization column ("012345678") shows an "s" for context-switch events, indicating that first CPU 4 and then CPU 0 context-switched to the ls process.
Note: You can supplement the above information with the output of the remaining perf sched commands, namely perf sched script and perf sched map.
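For instance (a quick sketch reusing the perf.data recorded earlier):
sudo perf sched script   # dump the raw scheduler events behind the summaries above
sudo perf sched map      # per-CPU map of which task ran where over time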
I have a file like below:
0.000 -0.001 0.017 (F) -0.001 af mclk rdctrlp1/timer/ircb%clk {ec0crb0o2ab1n03x5}
0.027 0.026 0.012 0.002 (F) 0.026 af mclk rdctrlp1/timer/ircb%clkout {ec0crb0o2ab1n03x5} ORGATE
0.001 0.027 0.013 (F) 0.027 af mclk rdctrlp1/timer/iclkout_inv%clk {ec0cinv00ab1n12x5}
0.011 0.037 0.010 0.007 (R) 0.037 af mclk rdctrlp1/timer/iclkout_inv%clkout {ec0cinv00ab1n12x5} NOTGATE
0.001 0.038 0.010 (R) 0.038 af mclk rdctrlp1/clksdlgen/i01%clk {ec0ceb000ab2n02x4}
0.026 0.064 0.005 0.001 (R) 0.064 af mclk rdctrlp1/clksdlgen/i01%clkout {ec0ceb000ab2n02x4} BUFFER
0.000 0.064 0.006 (R) 0.064 af mclk rdctrlp1/clksdlgen/i0invd%clk {ec0cinv00ab2n02x5}
0.006 0.070 0.005 0.001 (F) 0.070 af mclk rdctrlp1/clksdlgen/i0invd%clkout {ec0cinv00ab2n02x5} NOTGATE
0.000 0.070 0.005 (F) 0.070 af mclk rdctrlp1/clksdlgen/inand0dft%clk {ec0cnan02ab3n02x5}
0.011 0.081 0.012 0.002 (R) 0.081 af mclk rdctrlp1/clksdlgen/inand0dft%clkout {ec0cnan02ab3n02x5} NANDGATE
I am using the code below to match these kinds of lines and process them further:
pattern="^\s+(-?\d(\.\d+)?)\s+(-?\d(\.\d+)?).+?\((R|F)\).+?(a|b)(.)\s"
if [[ $line =~ $pattern ]]
then
arc_type="${BASH_REMATCH[7]}"_"${BASH_REMATCH[5]}"
delay="${BASH_REMATCH[1]}"
It doesn't work, and I'm not sure why. Below is a regex that works fine in the same script:
if [[ $line =~ "#(.+?)\s,\s.+?ip%(.+?)\s->>\s.+?ip%(.+?)\s,\s" ]]
I'm using ruby-prof to figure out where my CPU time is going in a small 2D game engine I'm building in Ruby. Everything looks normal here aside from the main Kernel#` entry. The Ruby docs would suggest that this is a function for getting the STDOUT of a command run in a subshell:
Measure Mode: wall_time
Thread ID: 7966920
Fiber ID: 16567620
Total: 7.415271
Sort by: self_time
%self total self wait child calls name
28.88 2.141 2.141 0.000 0.000 476 Kernel#`
10.72 1.488 0.795 0.000 0.693 1963500 Tile#draw
9.35 0.693 0.693 0.000 0.000 1963976 Gosu::Image#draw
6.67 7.323 0.495 0.000 6.828 476 Gosu::Window#_tick
1.38 0.102 0.102 0.000 0.000 2380 Gosu::Font#draw
0.26 4.579 0.019 0.000 4.560 62832 *Array#each
0.15 0.011 0.011 0.000 0.000 476 Gosu::Window#caption=
0.09 6.873 0.007 0.000 6.867 476 PlayState#draw
0.07 0.005 0.005 0.000 0.000 476 String#gsub
0.06 2.155 0.004 0.000 2.151 476 GameWindow#memory_usage
0.06 4.580 0.004 0.000 4.576 1904 Hash#each
0.04 0.003 0.003 0.000 0.000 476 String#chomp
0.04 0.038 0.003 0.000 0.035 476 Gosu::Window#protected_update
0.04 0.004 0.003 0.000 0.001 3167 Gosu::Window#button_down?
0.04 0.005 0.003 0.000 0.002 952 Enumerable#map
0.03 0.015 0.003 0.000 0.012 476 Player#update
0.03 4.596 0.002 0.000 4.593 476 <Module::Gosu>#scale
0.03 0.002 0.002 0.000 0.000 5236 Fixnum#to_s
0.03 7.326 0.002 0.000 7.324 476 Gosu::Window#tick
0.03 0.003 0.002 0.000 0.001 952 Player#coord_facing
0.03 4.598 0.002 0.000 4.597 476 <Module::Gosu>#translate
0.02 0.002 0.002 0.000 0.000 952 Array#reject
Any suggestions as to why this might be happening? I'm fairly confident that I'm not using it in my code, unless it's being called indirectly somehow. I'm not sure where to start looking for that sort of thing.
I've solved my problem. Though it wasn't exactly clear to me from the Ruby documentation I linked in the question, the source of the problem was how ruby-prof categorizes what runs inside the #{} shortcut, also known as 'string interpolation': I had semi-intensive debugging logic being executed within these shortcuts.
Turning off my debugging text solved the problem.