Why does ruby-prof list "Kernel#`" as a resource hog?

I'm using ruby-prof to figure out where my CPU time is going for a small 2D game engine I'm building in Ruby. Everything looks normal aside from the Kernel#` entry at the top. The Ruby docs suggest that this is a method that returns the STDOUT of a command run in a subshell:
Measure Mode: wall_time
Thread ID: 7966920
Fiber ID: 16567620
Total: 7.415271
Sort by: self_time
%self total self wait child calls name
28.88 2.141 2.141 0.000 0.000 476 Kernel#`
10.72 1.488 0.795 0.000 0.693 1963500 Tile#draw
9.35 0.693 0.693 0.000 0.000 1963976 Gosu::Image#draw
6.67 7.323 0.495 0.000 6.828 476 Gosu::Window#_tick
1.38 0.102 0.102 0.000 0.000 2380 Gosu::Font#draw
0.26 4.579 0.019 0.000 4.560 62832 *Array#each
0.15 0.011 0.011 0.000 0.000 476 Gosu::Window#caption=
0.09 6.873 0.007 0.000 6.867 476 PlayState#draw
0.07 0.005 0.005 0.000 0.000 476 String#gsub
0.06 2.155 0.004 0.000 2.151 476 GameWindow#memory_usage
0.06 4.580 0.004 0.000 4.576 1904 Hash#each
0.04 0.003 0.003 0.000 0.000 476 String#chomp
0.04 0.038 0.003 0.000 0.035 476 Gosu::Window#protected_update
0.04 0.004 0.003 0.000 0.001 3167 Gosu::Window#button_down?
0.04 0.005 0.003 0.000 0.002 952 Enumerable#map
0.03 0.015 0.003 0.000 0.012 476 Player#update
0.03 4.596 0.002 0.000 4.593 476 <Module::Gosu>#scale
0.03 0.002 0.002 0.000 0.000 5236 Fixnum#to_s
0.03 7.326 0.002 0.000 7.324 476 Gosu::Window#tick
0.03 0.003 0.002 0.000 0.001 952 Player#coord_facing
0.03 4.598 0.002 0.000 4.597 476 <Module::Gosu>#translate
0.02 0.002 0.002 0.000 0.000 952 Array#reject
Any suggestions as to why this might be happening? I'm fairly confident that I'm not using it in my code - unless it's being called indirectly somehow. Not sure where to start looking for that sort of thing.

I've solved my problem. Though it wasn't exactly clear to me from the Ruby documentation I linked in the question, the source of the problem is how ruby-prof categorizes the usage of the #{} shortcut, also known as 'string interpolation'. I had semi-intensive debugging logic being executed within these shortcuts; note in the profile that GameWindow#memory_usage is called 476 times for 2.155s total, almost exactly matching the 476 calls and 2.141s attributed to Kernel#`, so the per-frame memory-usage debug text is what was spawning the subshells.
Turning off my debugging text solves my problem.
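For a sense of scale, every Kernel#` call forks a shell and execs the command, which dwarfs any in-process work. A rough bash sketch of what 476 such calls cost (assuming the debug text shelled out to something like ps; timings vary by machine):
# 476 command substitutions, one per frame, mirroring the 476 Kernel#`
# calls in the profile above. Each iteration pays a fork+exec.
time for i in $(seq 476); do
    rss=$(ps -o rss= -p $$)   # hypothetical per-frame memory-usage probe
done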

Related

AWK comparing two columns in each of two files that have headers

I have two files:
temp_bandstructure.dat has the following format
# spin band kx ky kz E(MF) E(QP) Delta E kn E(MF)5dp
# (Cartesian coordinates) (eV) (eV) (eV) (eV)
1 22 0.00000 0.00000 0.00000 -3.021665798 -4.022414204 -1.000748406 1 -3.02167
1 22 0.00850 0.00000 0.00000 -3.026245712 -4.027334803 -1.001089091 2 -3.02625
1 22 0.01699 0.00000 0.00000 -3.039924052 -4.061680485 -1.021756433 3 -3.03992
1 22 0.00000 0.00000 0.00000 -3.021665798 -4.022414204 -1.000748406 4 -3.02167
1 29 0.00000 0.00000 0.00000 -1.344238286 -2.629257334 -1.285019048 1 -1.34424
mf_pband.dat has 46 header rows and more data rows than temp_bandstructure.dat. The extra rows are not useful and should not make their way into the final output.
#header row
#header row
3 0.02000 -3.03993 0.984 0.000 0.010 0.011 0.000 0.000 0.010 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.426 0.000 0.001 0.000 0.426 0.000
2 0.01000 -3.02624 0.982 0.000 0.009 0.011 0.000 0.000 0.009 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.427 0.000 0.000 0.000 0.427 0.000
4 0.00000 -3.02167 0.982 0.000 0.009 0.011 0.000 0.000 0.009 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.427 0.000 0.000 0.000 0.427 0.000
1 0.00000 -3.02167 0.982 0.000 0.009 0.011 0.000 0.000 0.009 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.427 0.000 0.000 0.000 0.427 0.000
1 0.00000 -1.34424 0.994 0.000 0.000 0.046 0.000 0.000 0.000 0.046 0.000 0.000 0.004 0.263 0.000 0.000 0.004 0.263 0.000 0.000 0.000 0.000 0.000 0.018 0.000 0.000 0.002 0.149 0.000 0.000 0.000 0.000 0.000 0.018 0.000 0.000 0.002 0.149 0.000 0.000 0.000 0.002 0.013 0.000 0.000 0.002 0.013
1 0.00000 -55.55593 0.998 0.000 0.001 0.000 0.000 0.000 0.001 0.000 0.000 0.000 0.003 0.000 0.000 0.000 0.003 0.000 0.000 0.490 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.492 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.002 0.000 0.000 0.000 0.002 0.000 0.000 0.000
I have a nested for loop that compares columns 1 and 3 of every row in mf_pband.dat against columns 9 and 10 of every row in temp_bandstructure.dat. If the numbers match within 0.00001, the script prints the entire row of mf_pband.dat to a cache file.
For example, the script should be able to match rows 4, 2, 1, 3, 5 of mf_pband.dat with rows 1, 2, 3, 4, 5 of temp_bandstructure.dat, giving the output
1 22 0.00000 0.00000 0.00000 -3.021665798 -4.022414204 -1.000748406 1 -3.02167 1 0.00000 -3.02167 0.982 0.000 0.009 0.011 0.000 0.000 0.009 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.427 0.000 0.000 0.000 0.427 0.000
1 22 0.00850 0.00000 0.00000 -3.026245712 -4.027334803 -1.001089091 2 -3.02625 2 0.01000 -3.02624 0.982 0.000 0.009 0.011 0.000 0.000 0.009 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.427 0.000 0.000 0.000 0.427 0.000
1 22 0.01699 0.00000 0.00000 -3.039924052 -4.061680485 -1.021756433 3 -3.03992 3 0.02000 -3.03993 0.984 0.000 0.010 0.011 0.000 0.000 0.010 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.426 0.000 0.001 0.000 0.426 0.000
1 22 0.00000 0.00000 0.00000 -3.021665798 -4.022414204 -1.000748406 4 -3.02167 4 0.00000 -3.02167 0.982 0.000 0.009 0.011 0.000 0.000 0.009 0.011 0.000 0.000 0.005 0.014 0.000 0.000 0.005 0.014 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.021 0.000 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.427 0.000 0.000 0.000 0.427 0.000
1 29 0.00000 0.00000 0.00000 -1.344238286 -2.629257334 -1.285019048 1 -1.34424 1 0.00000 -1.34424 0.994 0.000 0.000 0.046 0.000 0.000 0.000 0.046 0.000 0.000 0.004 0.263 0.000 0.000 0.004 0.263 0.000 0.000 0.000 0.000 0.000 0.018 0.000 0.000 0.002 0.149 0.000 0.000 0.000 0.000 0.000 0.018 0.000 0.000 0.002 0.149 0.000 0.000 0.000 0.002 0.013 0.000 0.000 0.002 0.013
The extra row 6 of mf_pband.dat does not make it into the final output, as it does not have a match.
I wrote a working for loop that gets the job done, but at a very slow pace:
kmax=207
bandmin=$(awk 'NR==3 { print $2 }' bandstructure.dat)
bandmax=$(tac bandstructure.dat | awk 'NR==1 { print $2 }')
nband=$((bandmax - bandmin + 1))
nheader=46
for ((i=3; i<=kmax*nband+2; i++)); do
    kn=$(awk -v i=$i 'NR==i { print $9 }' temp_bandstructure.dat)
    emf=$(awk -v i=$i 'NR==i { print $10 }' temp_bandstructure.dat)
    for ((j=nheader+1; j<=kmax*nband+nheader; j++)); do
        kn_mf_pband=$(awk -v j=$j 'NR==j { print $1 }' mf_pband.dat)
        emf_mf_pband=$(awk -v j=$j 'NR==j { print $3 }' mf_pband.dat)
        if [ "$kn" = "$kn_mf_pband" ] &&
           (( $(echo "$emf - $emf_mf_pband <= 0.00001" | bc -l) )) &&
           (( $(echo "$emf_mf_pband - $emf <= 0.00001" | bc -l) ))
        then
            awk -v j=$j 'NR==j' mf_pband.dat >> temp_copying_cache.dat
            echo $i $j $kn $kn_mf_pband $emf $emf_mf_pband
            break
        fi
    done
done
Now I'm trying to use AWK arrays to speed up the process. Drawing inspiration from Socowi and here, I managed to write the following to replace the for loops. However, I am not sure how to reference the arrays with the correct syntax.
awk -v nheader=$nheader 'NR==FNR && NR>nheader { a[NR-nheader]=$1; b[NR-nheader]=$3; c[NR-nheader]=$0 next }
FNR>2 { d[NR-2]=$9; e[Nr-2]=$10 }(a == d) && (abs(b - e) <= 0.00001){ print $0, c[$1] }' mf_pband.dat temp_bandstructure.dat > temp_copying_cache.dat
Can anyone tell me how the correct syntax should be?
Update:
Developing on @EdMorton's solution, I have managed the following code, which uses NR as the array indices to overcome the issue of repeated values in $9. However, something is not right and the code currently produces no output.
awk -v nheader=$nheader '
/^#/ { next }
NR==FNR { rec[NR]=$0; k[NR]=$9; val[NR]=$10; next }
($1 == k[NR]) && (abs(val[NR] - $3) <= 0.0001) { print rec[NR], $0 }
function abs(x) { return (x<0 ? -x : x) }
' temp_bandstructure.dat mf_pband.dat > temp_copying_cache.dat
Your code ...
awk -v nheader=$nheader '
/^#/ { next }
NR==FNR { rec[NR]=$0; k[NR]=$9; val[NR]=$10; next }
($1 == k[NR]) && (abs(val[NR] - $3) <= 0.0001) { print rec[NR], $0 }
function abs(x) { return (x<0 ? -x : x) }
' temp_bandstructure.dat mf_pband.dat > temp_copying_cache.dat
... does not print anything because you assumed NR to be the line number (or something like that) in the current file, when it actually is the total number of lines processed so far. After the first file is processed, NR keeps on incrementing.
Assume your first file has 99 rows; then you initialize k[] and val[] for indices 1 to 99. But then in the second file, you access k[] and val[] at the uninitialized indices 100, 101, … . The default value of an uninitialized variable is 0, so your checks $1 == k[NR] and abs(val[NR] - $3) <= 0.0001 fail, because $1 and $3 are never 0 in your file mf_pband.dat.
You could use FNR to access the line number in the current file, but then your script still wouldn't do what you want. You would only compare line 1 from the first file to line 1 from the second file and so on, when you actually wanted …
compares column 1 and 3 of every row in mf_pband.dat against column 9 and 10 of every row in temp_bandstructure.dat
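To see the NR/FNR difference concretely, here is a minimal two-file demonstration (throwaway files f1 and f2):
printf 'a\nb\n' > f1
printf 'c\nd\n' > f2
awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' f1 f2
# f1 NR=1 FNR=1
# f1 NR=2 FNR=2
# f2 NR=3 FNR=1   <- NR keeps counting, FNR restarts for each file
# f2 NR=4 FNR=2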
Maybe the following works for you. It exploits the fact that the numbers in mf_pband.dat $3 and temp_bandstructure.dat $10 have a precision of 0.00001, which is also the allowed delta.
awk -v d=0.00001 -v CONVFMT=%.5f '
/^#/ { next }
NR==FNR { a[$1,$3]=a[$1,$3+d]=a[$1,$3-d]=$0; next }
($9 SUBSEP $10 in a) { print $0, a[$9,$10] }
' mf_pband.dat temp_bandstructure.dat
The CONVFMT=%.5f ensures that when calculating $3+d and $3-d the results are always converted to strings with 5 decimal places, the precision of d and of the numbers in your file. Without it, awk's default CONVFMT of %.6g would have turned -55.55593+0.00001 into the string -55.5559, which could never match a 5-decimal value from the other file.
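You can see the CONVFMT effect in isolation; only the second conversion keeps all five decimal places:
awk 'BEGIN { x = -55.55593 + 0.00001; print x "" }'
# -55.5559    (default CONVFMT is %.6g, six significant digits)
awk -v CONVFMT=%.5f 'BEGIN { x = -55.55593 + 0.00001; print x "" }'
# -55.55592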

Assistance with bash script to correctly pull the required column and set proper header using filename regex

I have a folder with multiple files named 1000T.quant.sf, 1000G.quant.sf, 1001T.quant.sf, 1001G.quant.sf, and so on. The script I wrote needs a fix in its header generation. It pulls the first column once, then loops over all the files in the directory, pulling column 5 from each to build an overall matrix. The problem is generating the column headers properly: I want each header to be the string before .quant.sf in the filename, but currently I end up with a double header. How can I resolve this?
Snippet:
cut -f 1 "$(ls *quant.sf | head -1)" > tmp
for x in *quant.sf; do
    printf "\t" >> tsamples
    printf "$(echo $x | cut -d. -f 1)" >> tsamples
    cut -f 5 $x | paste tmp - > tmp2
    mv tmp2 tmp
done
echo "" >> tsamples
cat tsamples tmp > transcipts.numreads
rm tsamples tmp
Current output
1001G 1001T 1005G 1005T 1006G
Name NumReads NumReads NumReads NumReads NumReads
ENST00000456328.2 12.090 0.000 0.000 0.000 1.545
ENST00000450305.2 0.000 0.000 0.000 0.000 0.000
ENST00000488147.1 620.145 204.533 451.949 250.643 437.618
ENST00000619216.1 0.000 0.000 0.000 0.000 0.000
ENST00000473358.1 0.000 3.680 0.000 1.000 0.000
ENST00000469289.1 4.990 0.000 0.000 0.000 0.000
ENST00000607096.1 0.000 0.000 0.000 0.000 0.000
ENST00000417324.1 0.000 0.000 0.000 0.000 0.000
Desired output:
Name 1001G 1001T 1005G 1005T 1006G
ENST00000456328.2 12.090 0.000 0.000 0.000 1.545
ENST00000450305.2 0.000 0.000 0.000 0.000 0.000
ENST00000488147.1 620.145 204.533 451.949 250.643 437.618
ENST00000619216.1 0.000 0.000 0.000 0.000 0.000
ENST00000473358.1 0.000 3.680 0.000 1.000 0.000
ENST00000469289.1 4.990 0.000 0.000 0.000 0.000
ENST00000607096.1 0.000 0.000 0.000 0.000 0.000
ENST00000417324.1 0.000 0.000 0.000 0.000 0.000
One input file contents:
$ head 1005T.salmon_quant.sf
Name Length EffectiveLength TPM NumReads
ENST00000456328.2 1657 1441.000 0.000000 0.000
ENST00000450305.2 632 417.000 0.000000 0.000
ENST00000488147.1 1351 1170.738 4.987413 250.643
ENST00000619216.1 68 69.000 0.000000 0.000
ENST00000473358.1 712 512.539 0.045452 1.000
ENST00000469289.1 535 323.000 0.000000 0.000
ENST00000607096.1 138 18.000 0.000000 0.000
ENST00000417324.1 1187 971.000 0.000000 0.000
ENST00000461467.1 590 376.000 0.000000 0.000
Initialize tsamples with the Name heading. Then when you're processing the file contents, skip the first line with tail -n +2.
printf "Name" >tsamples
tail -n +2 "$(ls *quant.sf | head -1)" | cut -f 1 > tmp
for file in *quant.sf; do
    printf '\t%s' "${file%%.*}" >> tsamples
    tail -n +2 "$file" | cut -f 5 | paste tmp - > tmp2
    mv tmp2 tmp
done
echo "" >> tsamples
cat tsamples tmp > transcipts.numreads
rm tsamples tmp
You can also use bash's %% parameter expansion operator to remove everything from the first . onward, rather than piping to cut.
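For example, with a hypothetical filename:
f=1005T.quant.sf
echo "${f%%.*}"    # prints 1005T; %% removes the longest suffix matching '.*'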
Another variant, using an associative array:
#!/bin/bash
header='Name\t\t'                # initiate the header
declare -A row                   # associative array 'row' stores one row of data per name
for file in *.sf; {
    header+="\t${file%%.*}"      # build the header by appending the first part of each filename
    while read line; do          # read the file line by line
        [[ $line =~ ame ]] && continue   # skip the header line (matches "Name")
        raw=( $line )            # split the line into the 'raw' array
        name=${raw[0]}           # get the name
        data=${raw[4]}           # get the data (column 5, NumReads)
        row[$name]+="\t$data"    # append the data to that name's row
    done < $file
}
header+='\n'                     # append a trailing newline to the header
printf $header                   # print the header
for key in "${!row[@]}"; {       # print
    printf "$key${row[$key]}\n"  # all
}                                # rows
Testing:
$ ./ex
Name 1005G 1005T
ENST00000607096.1 0.000 4.000
ENST00000450305.2 0.000 3.000
ENST00000473358.1 1.000 1.000
ENST00000417324.1 0.000 0.000
ENST00000619216.1 0.000 3.000
ENST00000456328.2 0.000 1.000
ENST00000461467.1 0.000 0.000
ENST00000488147.1 250.643 230.43
ENST00000469289.1 0.000 0.000

Docker 50% performance hit on cpu intensive code

I'm pretty new at using docker or any containers, so please be gentle if I've missed something obvious that everyone else already knows.
I've searched everywhere I can think of, but haven't seen this issue addressed.
I'm trying to evaluate the performance cost of running a benchmark in docker, and I discovered surprisingly large differences that don't make sense to me. I created a simple Docker image with this Dockerfile:
FROM ubuntu:18.04
RUN apt -y -q update && apt -y -q install python3 vim strace linux-tools-common \
linux-tools-4.15.0-74-generic linux-cloud-tools-4.15.0-74-generic
ADD . /workspace
WORKDIR /workspace
And I've got a simple python script for testing:
$ cat cpu-test.py
#!/usr/bin/env python3
import math
from time import time
N = range(10)
N_i = range(1_000)
N_j = range(1_000)
x = 1
start = time()
for _ in N:
for i in N_i:
for j in N_j:
x += -1**j * math.sqrt(i)/max(j,2)
stop = time()
print(stop-start)
and then I compare running it normally to running in a container:
$ ./cpu-test.py
4.077672481536865
$ docker run -it --rm cpu:test ./cpu-test.py
6.113868236541748
$
I was investigating it using perf, which led me to the discovery that I needed --privileged to run perf inside the container, but then the performance gap disappeared:
$ docker run -it --rm --privileged cpu:test ./cpu-test.py
4.1469762325286865
$
Searching for anything to do with docker and --privileged mostly turns up litanies of reasons why I shouldn't use privileged because of security considerations; I haven't found anything about severe performance effects on mundane code.
Using perf to compare the with/without privilege runs, they look quite different:
With privilege, the top 5 in the perf report are:
7.26% docker docker [.] runtime.mapassign_faststr
6.21% docker docker [.] runtime.mapaccess2
6.12% docker [kernel] [k] 0xffffffff880015e0
5.37% docker [kernel] [k] 0xffffffff87faac87
4.92% docker docker [.] runtime.retake
while running without privilege results in:
11.11% docker docker [.] runtime.evacuate_faststr
8.14% docker docker [.] runtime.scanobject
7.18% docker docker [.] runtime.mallocgc
5.10% docker docker [.] runtime.mapassign
4.44% docker docker [.] runtime.growslice
I don't know if that is meaningful though, as I'm not at all familiar with the code of the docker runtime.
Am I doing something wrong? Or is there some special knob I need to turn?
Thanks
Adding --security-opt seccomp:unconfined to the docker run command improves the performance of the Python program. seccomp is a Linux kernel feature that restricts the actions available inside the container by allowing or disallowing certain system calls to the host. This reduces the container's access to the host and, in security terminology, helps reduce the container's attack surface. The default seccomp profile disables 44 system calls for running containers, including perf_event_open; with --security-opt seccomp:unconfined, all system calls are enabled for the running container.
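As a quick check, you can confirm that the daemon is applying the default seccomp profile (output shape varies by Docker version):
docker info --format '{{.SecurityOptions}}'
# [name=apparmor name=seccomp,profile=default]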
Since adding seccomp:unconfined lets the Python program run almost 1.5x-2x faster, the first point of analysis is to look at strace output and see whether any system calls are slowing things down when that flag is not added.
Output with --security-opt seccomp:unconfined flag
strace -c -f -S name docker run -it --rm --security-opt seccomp:unconfined cpu:test ./cpu-test.py
5.4090752601623535
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
2.00 0.000194 32 6 6 access
0.11 0.000011 11 1 arch_prctl
0.33 0.000032 11 3 brk
0.00 0.000000 0 1 capget
0.10 0.000010 1 16 clone
0.64 0.000062 4 17 close
0.00 0.000000 0 5 2 connect
0.00 0.000000 0 1 epoll_create1
0.00 0.000000 0 14 2 epoll_ctl
0.22 0.000021 0 62 epoll_pwait
0.29 0.000028 28 1 execve
0.00 0.000000 0 8 fcntl
0.67 0.000065 8 8 fstat
68.87 0.006687 22 310 24 futex
0.02 0.000002 2 1 getgid
0.00 0.000000 0 3 getpeername
0.00 0.000000 0 2 getpid
0.00 0.000000 0 1 getrandom
0.00 0.000000 0 3 getsockname
0.10 0.000010 1 17 gettid
0.02 0.000002 1 2 getuid
0.00 0.000000 0 5 1 ioctl
0.00 0.000000 0 1 lseek
5.83 0.000566 7 84 mmap
2.12 0.000206 5 39 mprotect
0.35 0.000034 2 14 munmap
0.00 0.000000 0 12 9 newfstatat
1.43 0.000139 10 14 openat
0.13 0.000013 13 1 prlimit64
10.21 0.000991 10 102 pselect6
0.55 0.000053 2 34 10 read
0.00 0.000000 0 1 readlinkat
3.14 0.000305 3 120 rt_sigaction
0.36 0.000035 1 53 rt_sigprocmask
0.04 0.000004 4 1 sched_getaffinity
2.04 0.000198 5 42 sched_yield
0.18 0.000017 1 17 set_robust_list
0.03 0.000003 3 1 set_tid_address
0.00 0.000000 0 3 setsockopt
0.22 0.000021 1 34 sigaltstack
0.00 0.000000 0 5 socket
0.00 0.000000 0 7 write
------ ----------- ----------- --------- --------- ----------------
100.00 0.009709 1072 54 total
Output without --security-opt seccomp:unconfined flag
strace -c -f -S name docker run -it --rm cpu:test ./cpu-test.py
8.161764860153198
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
0.08 0.000033 6 6 6 access
0.04 0.000015 15 1 arch_prctl
0.02 0.000007 2 3 brk
0.00 0.000000 0 1 capget
0.22 0.000087 6 15 clone
0.26 0.000102 6 17 close
0.04 0.000015 3 5 2 connect
0.00 0.000000 0 1 epoll_create1
0.14 0.000054 4 14 2 epoll_ctl
2.31 0.000916 23 40 epoll_pwait
0.00 0.000000 0 1 execve
0.00 0.000000 0 8 fcntl
0.07 0.000027 3 8 fstat
72.00 0.028580 99 290 21 futex
0.01 0.000002 2 1 getgid
0.01 0.000002 1 3 getpeername
0.00 0.000000 0 2 getpid
0.00 0.000000 0 1 getrandom
0.01 0.000002 1 3 getsockname
0.10 0.000039 2 16 gettid
0.01 0.000002 1 2 getuid
0.01 0.000005 1 5 1 ioctl
0.00 0.000000 0 1 lseek
1.33 0.000529 7 80 mmap
0.72 0.000284 8 37 mprotect
0.31 0.000125 8 15 munmap
0.07 0.000026 2 12 9 newfstatat
0.20 0.000080 6 14 openat
0.01 0.000003 3 1 prlimit64
20.04 0.007954 42 189 pselect6
0.21 0.000085 3 34 10 read
0.00 0.000000 0 1 readlinkat
0.46 0.000182 2 120 rt_sigaction
0.52 0.000207 4 50 rt_sigprocmask
0.01 0.000004 4 1 sched_getaffinity
0.27 0.000108 5 20 sched_yield
0.11 0.000045 3 16 set_robust_list
0.01 0.000003 3 1 set_tid_address
0.01 0.000002 1 3 setsockopt
0.32 0.000127 4 32 sigaltstack
0.02 0.000008 2 5 socket
0.09 0.000035 5 7 write
------ ----------- ----------- --------- --------- ----------------
100.00 0.039695 1082 51 total
Nothing significant yet.
So the next thing that was to be analyzed was the Python program itself.
All of the commands below to obtain execution-time profiles were run 5 times, and one run from that sample was chosen; there was very minimal variation in the timings.
The containers were run in the background and then exec'd into.
Output of doing a profile on the Python program running inside container with --security-opt seccomp:unconfined flag
docker exec -it cpu-test-seccomp bash
root@133453c7ccc6:/workspace# python3 -m cProfile ./cpu-test.py
7.339433908462524
20000069 function calls in 7.340 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:103(release)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:143(__init__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:147(__enter__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:151(__exit__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:176(cb)
2 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:222(_verbose_message)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:232(_requires_builtin_wrapper)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:307(__init__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:311(__enter__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:318(__exit__)
4 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:321(<genexpr>)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:369(__init__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:416(parent)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:424(has_location)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:433(spec_from_loader)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:504(_init_module_attrs)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:564(module_from_spec)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:58(__init__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:651(_load_unlocked)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:707(find_spec)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:728(create_module)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:736(exec_module)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:753(is_package)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:78(acquire)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:843(__enter__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:847(__exit__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:870(_find_spec)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:936(_find_and_load_unlocked)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:966(_find_and_load)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
1 5.540 5.540 7.340 7.340 cpu-test.py:3(<module>)
3 0.000 0.000 0.000 0.000 {built-in method _imp.acquire_lock}
1 0.000 0.000 0.000 0.000 {built-in method _imp.create_builtin}
1 0.000 0.000 0.000 0.000 {built-in method _imp.exec_builtin}
1 0.000 0.000 0.000 0.000 {built-in method _imp.is_builtin}
3 0.000 0.000 0.000 0.000 {built-in method _imp.release_lock}
2 0.000 0.000 0.000 0.000 {built-in method _thread.allocate_lock}
2 0.000 0.000 0.000 0.000 {built-in method _thread.get_ident}
1 0.000 0.000 0.000 0.000 {built-in method builtins.any}
1 0.000 0.000 7.340 7.340 {built-in method builtins.exec}
4 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
5 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
10000000 1.228 0.000 1.228 0.000 {built-in method builtins.max}
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
10000000 0.571 0.000 0.571 0.000 {built-in method math.sqrt}
2 0.000 0.000 0.000 0.000 {built-in method time.time}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
2 0.000 0.000 0.000 0.000 {method 'rpartition' of 'str' objects}
Output of doing a profile on the Python program running inside container with no --security-opt flag
docker exec -it cpu-test-no-seccomp bash
root@500724539bd0:/workspace# python3 -m cProfile ./cpu-test.py
11.848757982254028
20000069 function calls in 11.849 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:103(release)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:143(__init__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:147(__enter__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:151(__exit__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:176(cb)
2 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:222(_verbose_message)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:232(_requires_builtin_wrapper)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:307(__init__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:311(__enter__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:318(__exit__)
4 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:321(<genexpr>)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:369(__init__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:416(parent)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:424(has_location)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:433(spec_from_loader)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:504(_init_module_attrs)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:564(module_from_spec)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:58(__init__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:651(_load_unlocked)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:707(find_spec)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:728(create_module)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:736(exec_module)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:753(is_package)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:78(acquire)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:843(__enter__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:847(__exit__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:870(_find_spec)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:936(_find_and_load_unlocked)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:966(_find_and_load)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
1 8.654 8.654 11.849 11.849 cpu-test.py:3(<module>)
3 0.000 0.000 0.000 0.000 {built-in method _imp.acquire_lock}
1 0.000 0.000 0.000 0.000 {built-in method _imp.create_builtin}
1 0.000 0.000 0.000 0.000 {built-in method _imp.exec_builtin}
1 0.000 0.000 0.000 0.000 {built-in method _imp.is_builtin}
3 0.000 0.000 0.000 0.000 {built-in method _imp.release_lock}
2 0.000 0.000 0.000 0.000 {built-in method _thread.allocate_lock}
2 0.000 0.000 0.000 0.000 {built-in method _thread.get_ident}
1 0.000 0.000 0.000 0.000 {built-in method builtins.any}
1 0.000 0.000 11.849 11.849 {built-in method builtins.exec}
4 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
5 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
10000000 2.155 0.000 2.155 0.000 {built-in method builtins.max}
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
10000000 1.039 0.000 1.039 0.000 {built-in method math.sqrt}
2 0.000 0.000 0.000 0.000 {built-in method time.time}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
2 0.000 0.000 0.000 0.000 {method 'rpartition' of 'str' objects}
The timings are slightly high in both cases because of the profiling overhead. But two things are noticeable here:
The built-in math.sqrt and builtins.max functions show almost 1.5-2x differences in their execution time; this difference gets pronounced since these functions are called 10000000 times.
The resulting overall execution time is slower without the flag, as can be seen from the builtins.exec execution times.
To understand this phenomenon better, the math.sqrt and max calls were removed. The line below in cpu-test.py
x += -1**j * math.sqrt(i)/max(j,2)
was changed to
x += 1
and the import math line was removed too, removing the overhead of the import statement.
With --security-opt seccomp:unconfined
root@133453c7ccc6:/workspace# python3 -m cProfile ./cpu-test.py
0.7199039459228516
8 function calls in 0.720 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
1 0.720 0.720 0.720 0.720 cpu-test.py:4(<module>)
1 0.000 0.000 0.720 0.720 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
2 0.000 0.000 0.000 0.000 {built-in method time.time}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
With no --security-opt seccomp:unconfined
root@500724539bd0:/workspace# python3 -m cProfile ./cpu-test.py
1.0778992176055908
8 function calls in 1.078 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
1 1.078 1.078 1.078 1.078 cpu-test.py:4(<module>)
1 0.000 0.000 1.078 1.078 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
2 0.000 0.000 0.000 0.000 {built-in method time.time}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Also, doing a perf record of ./cpu-test.py after starting the container with the --privileged flag, and then doing a perf report, we can see:
Samples: 20K of event 'cycles:ppp', Event count (approx.): 17551108136
Overhead Command Shared Object Symbol
14.56% python3 python3.6 [.] 0x0000000000181c0b
11.65% python3 python3.6 [.] _PyEval_EvalFrameDefault
5.75% python3 python3.6 [.] PyDict_GetItem
3.43% python3 python3.6 [.] PyDict_SetItem
1.69% python3 python3.6 [.] 0x0000000000181e45
1.68% python3 python3.6 [.] 0x0000000000181c23
1.59% python3 python3.6 [.] 0x00000000001705c9
1.54% python3 python3.6 [.] 0x0000000000181a88
1.54% python3 python3.6 [.] 0x0000000000181bfa
1.48% python3 python3.6 [.] 0x0000000000181c56
1.48% python3 python3.6 [.] 0x0000000000181c71
1.42% python3 python3.6 [.] 0x0000000000181c42
1.37% python3 python3.6 [.] 0x0000000000181c8a
1.28% python3 python3.6 [.] 0x0000000000181c01
1.09% python3 python3.6 [.] _PyObject_GC_New
0.96% python3 python3.6 [.] PyNumber_Multiply
0.63% python3 python3.6 [.] PyLong_AsDouble
0.59% python3 python3.6 [.] PyObject_GetAttr
0.57% python3 python3.6 [.] 0x00000000000c4df9
0.57% python3 python3.6 [.] 0x0000000000165808
0.56% python3 python3.6 [.] PyObject_RichCompare
0.53% python3 python3.6 [.] PyNumber_Negative
Most of the time is spent in _PyEval_EvalFrameDefault, which is a fair indication that most of the time is spent by the interpreter executing the byte code.
It would be fair to assume that adding --security-opt seccomp:unconfined speeds up the interpreter's execution of the bytecode. Confirming this would require a bit of digging around the Python internals.
Note that the disassembled output is the same in both cases, running with --security-opt seccomp:unconfined as well as with the default seccomp profile.
From this link:
When the operator executes docker run --privileged, Docker will enable
access to all devices on the host as well as set some configuration in
AppArmor or SELinux to allow the container nearly all the same access
to the host as processes running outside containers on the host.
Additional information about running with --privileged is available on
the Docker Blog.
I suspect there are security restrictions that are practically disabled when running in privileged mode. These restrictions tend to carry a performance cost when enabled, a cost accepted for the sake of maintaining reasonable security. That cost becomes very visible when running CPU-intensive tasks like the one in your example.
As described by Yusuke Endoh in his blog, the heavy slowdown in scripting languages such as Python, Perl and Ruby running under docker (and containerd) with the default seccomp profile appears to come from the kernel's mitigation for a vulnerability called Speculative Store Bypass.
The mitigation suppresses indirect branch prediction (STIBP), which makes most code much slower, as measured by Phoronix. The mitigation was added in kernel 4.20, but was quickly disabled by default due to its impact on performance.
To see if the mitigation is on for your kernel, run:
$ cat /sys/devices/system/cpu/vulnerabilities/spec_store_bypass
Mitigation: Speculative Store Bypass disabled via prctl and seccomp
As it turns out, running a program under seccomp enables the infamous STIBP mitigation, making some workloads, such as scripting languages, up to two times slower.
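You can confirm whether a given process is actually running with the mitigation forced on by inspecting its /proc status (the pid below is hypothetical; field names depend on kernel version):
grep -E 'Seccomp|Speculation' /proc/1234/status
# Seccomp:    2                                       <- a seccomp filter is active
# Speculation_Store_Bypass:  thread force mitigated   <- SSB mitigation applied to this thread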
Solutions to the problem (at the cost of security) include running with --security-opt seccomp:unconfined or adding kernel parameters to disable the relevant mitigation:
docker run --security-opt seccomp:unconfined
Booting the kernel with: mitigations=off
Booting the kernel with: spectre_v2_user=off spec_store_bypass_disable=off
Which solution is best depends entirely on the kind of things running on the system. If you trust the code in the docker container completely but have untrusted VMs running on the system, you may opt for --security-opt seccomp:unconfined. Otherwise I would back Linus Torvalds and use the kernel parameters to disable the mitigation.
Other possible slowdowns (though not as significant) related to seccomp which one may still experience if running an old kernel/seccomp:
seccomp_rule_add was slow between 2.4 and 2.5
seccomp filters ran in O(n) time instead of O(1) prior to kernel 5.11
With seccomp 2.5, Linux kernel 5.12 and mitigations=off I am seeing above 99% of native performance with docker, compared to less than 80% of native performance with the default mitigation settings on CentOS/Fedora.
Even though this question is now kind of old, I thought it would still help some people to share our solution.
It seems that the real cause of the problem here is the libseccomp2 version installed on your machine. By upgrading it (apt-get install --only-upgrade libseccomp2), you can improve your application's performance and avoid setting --security-opt seccomp:unconfined when running your container.
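On Debian/Ubuntu you can check the installed version before and after upgrading (the version shown is only an example; 2.5+ avoids the slow paths mentioned above):
dpkg -s libseccomp2 | grep '^Version'
# Version: 2.5.1-1
sudo apt-get install --only-upgrade libseccomp2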

How to trace a specific process's wakeup latency?

I want to use ftrace to trace a specific process's wakeup latency.
But ftrace's wakeup tracer only records the max latency.
And set_ftrace_pid doesn't help here.
Does anybody know how to do that?
Thank you very much.
You can use the tool I wrote to trace a specific process's wakeup latency:
https://gitee.com/openeuler-competition/summer2021-42
It supports analyzing the overall scheduling latency of the system from raw ftrace data.
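For reference, plain ftrace can also follow a single pid here: set_ftrace_pid only affects the function tracer, but the sched event filters apply per event. A minimal sketch (assumes root, a mounted tracefs, and a hypothetical pid 1234; the wakeup latency is the timestamp delta between each sched_wakeup and the following sched_switch to the task):
cd /sys/kernel/tracing                                 # or /sys/kernel/debug/tracing
echo 0 > tracing_on; echo > trace                      # stop and clear
echo 'pid == 1234' > events/sched/sched_wakeup/filter
echo 'next_pid == 1234' > events/sched/sched_switch/filter
echo 1 > events/sched/sched_wakeup/enable
echo 1 > events/sched/sched_switch/enable
echo 1 > tracing_on; sleep 1; echo 0 > tracing_on      # capture for one second
cat trace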
Adding to the suggestion by @Qeole, you can also use the perf sched utility to obtain a much more detailed trace of a process's wakeup latency. While eBPF tools like runqlat give you a higher-level overview, perf sched helps you capture all scheduler events and thereby observe and inspect the wakeup latencies of a process in much more detail. Note that running perf sched to monitor a long-running, computationally intensive process comes with its own overhead.
You first need to run perf sched record -
From the man-page,
'perf sched record <command>' records the scheduling events of an arbitrary workload.
For eg. say you want to trace the wakeup latencies of the command ls.
sudo perf sched record ls
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.453 MB perf.data (562 samples) ]
You will see that in the same directory where the command was run, a perf.data file will be generated. This file will contain all of the raw scheduler events, and the commands below will help to make sense of all these scheduler events.
You can run perf sched latency to obtain per-task latency summaries, including details of the number of context switches per task, average and maximum delay.
sudo perf sched latency
-----------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at |
-----------------------------------------------------------------------------------------------------------------
migration/4:35 | 0.000 ms | 1 | avg: 0.003 ms | max: 0.003 ms | max at: 231259.727951 s
kworker/u16:0-p:6962 | 0.103 ms | 20 | avg: 0.003 ms | max: 0.035 ms | max at: 231259.729314 s
ls:7118 | 1.752 ms | 1 | avg: 0.003 ms | max: 0.003 ms | max at: 231259.727898 s
alsa-sink-Gener:3133 | 0.000 ms | 1 | avg: 0.003 ms | max: 0.003 ms | max at: 231259.729321 s
Timer:5229 | 0.035 ms | 1 | avg: 0.002 ms | max: 0.002 ms | max at: 231259.729625 s
AudioIP~ent RPC:7597 | 0.040 ms | 1 | avg: 0.002 ms | max: 0.002 ms | max at: 231259.729698 s
MediaTimer #1:7075 | 0.025 ms | 1 | avg: 0.002 ms | max: 0.002 ms | max at: 231259.729651 s
gnome-terminal-:4989 | 0.254 ms | 24 | avg: 0.001 ms | max: 0.003 ms | max at: 231259.729358 s
MediaPl~back #3:7098 | 0.034 ms | 1 | avg: 0.001 ms | max: 0.001 ms | max at: 231259.729670 s
kworker/u16:2-p:5987 | 0.144 ms | 32 | avg: 0.001 ms | max: 0.002 ms | max at: 231259.729193 s
perf:7114 | 3.503 ms | 1 | avg: 0.001 ms | max: 0.001 ms | max at: 231259.729656 s
kworker/u16:1-p:7112 | 0.184 ms | 52 | avg: 0.001 ms | max: 0.001 ms | max at: 231259.729201 s
chrome:5713 | 0.067 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s
-----------------------------------------------------------------------------------------------------------------
TOTAL: | 6.141 ms | 137 |
---------------------------------------------------
You can see the process ls, as well as the process perf being present among all the other processes that co-existed at the same time while the perf sched record command was being run.
You can run perf sched timehist to obtain a detailed summary of the individual scheduler events.
sudo perf sched timehist
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
--------------- ------ ------------------------------ --------- --------- ---------
231259.726350 [0005] <idle> 0.000 0.000 0.000
231259.726465 [0005] chrome[5713] 0.000 0.000 0.114
231259.727447 [0005] <idle> 0.114 0.000 0.981
231259.727513 [0005] chrome[5713] 0.981 0.000 0.066
231259.727898 [0004] <idle> 0.000 0.000 0.000
231259.727951 [0004] perf[7118] 0.000 0.002 0.052
231259.727958 [0002] perf[7114] 0.000 0.000 0.000
231259.727960 [0000] <idle> 0.000 0.000 0.000
231259.727964 [0004] migration/4[35] 0.000 0.002 0.013
231259.729193 [0006] <idle> 0.000 0.000 0.000
231259.729201 [0002] <idle> 0.000 0.000 1.242
231259.729201 [0003] <idle> 0.000 0.000 0.000
231259.729216 [0002] kworker/u16:1-p[7112] 0.006 0.001 0.005
231259.729219 [0002] <idle> 0.005 0.000 0.002
231259.729222 [0002] kworker/u16:1-p[7112] 0.002 0.000 0.002
231259.729222 [0006] <idle> 0.001 0.000 0.007
The wait time refers to the time the task spent waiting to be woken up, and the sch delay is the time between the task being woken up and it actually getting scheduled onto a CPU.
You can filter the timehist output by pid; the pid of the ls command was 7118 (you can see this in the perf sched latency output).
sudo perf sched timehist -p 7118
Samples do not have callchains.
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
--------------- ------ ------------------------------ --------- --------- ---------
231259.727951 [0004] perf[7118] 0.000 0.002 0.052
231259.729657 [0000] ls[7118] 0.009 0.000 1.697
Now, in order to observe the wakeup events for this process, you can add the -w command-line switch to the previous command:
sudo perf sched timehist -p 7118 -w
Samples do not have callchains.
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
--------------- ------ ------------------------------ --------- --------- ---------
231259.727895 [0002] perf[7114] awakened: perf[7118]
231259.727948 [0004] perf[7118] awakened: migration/4[35]
231259.727951 [0004] perf[7118] 0.000 0.002 0.052
231259.729190 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729199 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729207 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729209 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729212 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729218 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729221 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729223 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729226 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729231 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729233 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729237 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729240 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729242 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
-------------------------------------- # some other events here
231259.729548 [0000] ls[7118] awakened: kworker/u16:0-p[6962]
231259.729553 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729555 [0000] ls[7118] awakened: kworker/u16:0-p[6962]
231259.729557 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729562 [0000] ls[7118] awakened: kworker/u16:0-p[6962]
231259.729564 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729655 [0000] ls[7118] awakened: perf[7114]
231259.729657 [0000] ls[7118] 0.009 0.000 1.697
The kworker threads interrupt the initial execution of perf and its child process ls at timestamp 231259.729190. You can see that the process eventually gets woken up to actually execute at 231259.729655, after all of the kernel worker threads have done some work. You can get a more detailed per-CPU visualization of the above timehist details using the command below:
sudo perf sched timehist -p 7118 -wV
Samples do not have callchains.
time cpu 012345678 task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
--------------- ------ --------- ------------------------------ --------- --------- ---------
231259.727895 [0002] perf[7114] awakened: perf[7118]
231259.727948 [0004] perf[7118] awakened: migration/4[35]
231259.727951 [0004] s perf[7118] 0.000 0.002 0.052
231259.729190 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729199 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729207 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
231259.729209 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729212 [0000] ls[7118] awakened: kworker/u16:2-p[5987]
-------------------------------------------------- # some other events here
231259.729562 [0000] ls[7118] awakened: kworker/u16:0-p[6962]
231259.729564 [0000] ls[7118] awakened: kworker/u16:1-p[7112]
231259.729655 [0000] ls[7118] awakened: perf[7114]
231259.729657 [0000] s ls[7118] 0.009 0.000 1.697
The CPU visualization column ("012345678") shows "s" for context-switch events, indicating that first CPU 4 and then CPU 0 context-switched to the traced process.
Note : You can supplement the above information with outputs from the remaining commands of perf sched, namely perf sched script and perf sched map.
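For reference, those two take no extra options for a basic view:
sudo perf sched script   # dumps the raw recorded events, one line per event
sudo perf sched map      # per-CPU view: one column per CPU, rows are context switches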

Issue with rematch in bash regex

I have a file like below:
0.000 -0.001 0.017 (F) -0.001 af mclk rdctrlp1/timer/ircb%clk {ec0crb0o2ab1n03x5}
0.027 0.026 0.012 0.002 (F) 0.026 af mclk rdctrlp1/timer/ircb%clkout {ec0crb0o2ab1n03x5} ORGATE
0.001 0.027 0.013 (F) 0.027 af mclk rdctrlp1/timer/iclkout_inv%clk {ec0cinv00ab1n12x5}
0.011 0.037 0.010 0.007 (R) 0.037 af mclk rdctrlp1/timer/iclkout_inv%clkout {ec0cinv00ab1n12x5} NOTGATE
0.001 0.038 0.010 (R) 0.038 af mclk rdctrlp1/clksdlgen/i01%clk {ec0ceb000ab2n02x4}
0.026 0.064 0.005 0.001 (R) 0.064 af mclk rdctrlp1/clksdlgen/i01%clkout {ec0ceb000ab2n02x4} BUFFER
0.000 0.064 0.006 (R) 0.064 af mclk rdctrlp1/clksdlgen/i0invd%clk {ec0cinv00ab2n02x5}
0.006 0.070 0.005 0.001 (F) 0.070 af mclk rdctrlp1/clksdlgen/i0invd%clkout {ec0cinv00ab2n02x5} NOTGATE
0.000 0.070 0.005 (F) 0.070 af mclk rdctrlp1/clksdlgen/inand0dft%clk {ec0cnan02ab3n02x5}
0.011 0.081 0.012 0.002 (R) 0.081 af mclk rdctrlp1/clksdlgen/inand0dft%clkout {ec0cnan02ab3n02x5} NANDGATE
I am using the code below to match these kinds of lines for further processing:
pattern="^\s+(-?\d(\.\d+)?)\s+(-?\d(\.\d+)?).+?\((R|F)\).+?(a|b)(.)\s"
if [[ $line =~ $pattern ]]
then
arc_type="${BASH_REMATCH[7]}"_"${BASH_REMATCH[5]}"
delay="${BASH_REMATCH[1]}"
It doesn't work, and I'm not sure why. Below is a regex that works fine in the same script:
if [[ $line =~ "#(.+?)\s,\s.+?ip%(.+?)\s->>\s.+?ip%(.+?)\s,\s" ]]
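For reference, bash's =~ operator matches POSIX extended regular expressions, which support neither the Perl shortcuts \d and \s nor lazy quantifiers like .+?, so the first pattern above can never match. A rough ERE translation to start from (a sketch only; on the sample lines the greedy .* still lands on the single (R)/(F) token and the af field, but verify it against the full file):
pattern='^[[:space:]]+(-?[0-9](\.[0-9]+)?)[[:space:]]+(-?[0-9](\.[0-9]+)?).*\((R|F)\).*(a|b)(.)[[:space:]]'
if [[ $line =~ $pattern ]]
then
    arc_type="${BASH_REMATCH[7]}_${BASH_REMATCH[5]}"
    delay="${BASH_REMATCH[1]}"
fi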
