How to debug an increase in context switches - performance

I am profiling two versions of a large application using Linux perf. There is a reproducible performance degradation in one of the versions. The problem affects a run that takes about 10 minutes to complete, and running
perf stat
one can see that there is a large difference in the number of context switches:
2759681,344820 task-clock (msec) # 4,089 CPUs utilized
1.976.068 context-switches # 0,716 K/sec
288.370 cpu-migrations # 0,104 K/sec
1.065.076 page-faults # 0,386 K/sec
9.600.316.147.196 cycles # 3,479 GHz
9.608.308.311.681 instructions # 1,00 insn per cycle
1.847.613.212.847 branches # 669,502 M/sec
29.342.163.081 branch-misses # 1,59% of all branches
674,891697479 seconds time elapsed
versus
3045676,296012 task-clock (msec) # 4,093 CPUs utilized
22.156.426 context-switches # 0,007 M/sec
385.364 cpu-migrations # 0,127 K/sec
1.066.383 page-faults # 0,350 K/sec
10.505.321.454.387 cycles # 3,449 GHz
9.723.994.869.100 instructions # 0,93 insn per cycle
1.869.145.049.594 branches # 613,704 M/sec
30.241.815.060 branch-misses # 1,62% of all branches
744,170941002 seconds time elapsed
Running
perf record -e context-switches -ag -T
gives the following top entries:
Children Self Samples Command Shared Object Symbol
+ 44,06% 44,06% 170846 swapper [kernel.kallsyms] [k] schedule_idle
+ 33,07% 33,07% 127004 Thread (pooled) [kernel.kallsyms] [k] schedule
versus
Children Self Samples Command Shared Object Symbol
+ 49,02% 49,02% 958827 swapper [kernel.kallsyms] [k] schedule_idle
+ 43,96% 43,96% 855603 Thread (pooled) [kernel.kallsyms] [k] schedule
So the number of samples differs by almost an order of magnitude. My question is: how can I investigate this further, given that I have access to the source code of both versions, but it is large and I don't know much about it?
Update:
The problem was a lock. I could spot it by running gdb, breaking in the middle of the execution, catching the system calls, and printing the backtrace:
(gdb) catch syscall
(gdb) bt
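(gdb can also catch a single syscall, e.g. catch syscall futex, which avoids stopping on unrelated calls.)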
The information reported by perf was this:
Children Self Command Shared Object Symbol
- 49,02% 49,02% swapper [kernel.kallsyms] [k] schedule_idle
   - secondary_startup_64
      - 42,82% start_secondary
           cpu_startup_entry
           do_idle
           schedule_idle
      + 6,20% x86_64_start_kernel
- 43,96% 43,96% Thread (pooled) [kernel.kallsyms] [k] schedule
   - 43,32% syscall
      - 43,32% entry_SYSCALL_64_after_hwframe
           do_syscall_64
           sys_futex
           do_futex
           futex_wait
           futex_wait_queue_me
           schedule
which is not very informative since it does not print who made the system call. Stepping with GDB worked in my case but can be tedious. Do you know any tracing tool, or options to perf, which would help in such cases? There's a list of tools on Brendan Gregg's blog http://www.brendangregg.com/blog/2015-07-08/choosing-a-linux-tracer.html , but I don't have much experience with them.
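One option that may help here (assuming the kernel exposes the syscall tracepoints) is to record the futex tracepoint itself together with call graphs, for example
perf record -e syscalls:sys_enter_futex -g -- ./application
so that perf report attributes each futex entry to the user-space call chain that issued it (adding --call-graph dwarf can help if the application was built without frame pointers). That points at the contended lock without single-stepping in gdb.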

Related

jupyter notebook %%time doesn't measure cpu time of %%sh commands?

When I run Python code in a JupyterLab (v3.4.3) IPython notebook (v8.4.0) and use the %%time cell magic, both CPU time and wall time are reported.
%%time
for i in range(10000000):
    a = i*i
CPU times: user 758 ms, sys: 121 µs, total: 758 ms
Wall time: 757 ms
But when the same computation is performed using the %%sh magic to run a shell script, the cpu time results are nonsense.
%%time
%%sh
python -c "for i in range(10000000): a = i*i"
CPU times: user 6.14 ms, sys: 12.5 ms, total: 18.6 ms
Wall time: 920 ms
The docs for %time do say "Time execution of a Python statement or expression.", but this still surprised me because I had assumed that the shell script would run in a Python subprocess and could thus also be measured. So, what's going on here? Is this a bug, or just a known caveat of using %%sh?
I know I can use the shell builtin time or /usr/bin/time to get similar output, but this is a bit cumbersome for multiple lines of shell. Is there a better workaround?
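One possible workaround (a sketch, assuming the shell lines can be launched from Python via subprocess rather than %%sh): resource.getrusage(resource.RUSAGE_CHILDREN) accumulates the CPU time of terminated child processes, so sampling it before and after running the script captures the subprocess CPU time that %%time misses.

import resource
import subprocess

# CPU time already accumulated by previously terminated children
before = resource.getrusage(resource.RUSAGE_CHILDREN)

# run the shell command in a subprocess (same computation as above)
subprocess.run(['sh', '-c', 'python -c "for i in range(10000000): a = i*i"'], check=True)

# the difference is the user/system CPU time consumed by the child
after = resource.getrusage(resource.RUSAGE_CHILDREN)
print(f"child user CPU: {after.ru_utime - before.ru_utime:.3f}s, "
      f"child sys CPU: {after.ru_stime - before.ru_stime:.3f}s")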

Query a `perf.data` file for the total raw execution time of a symbol

I used perf to generate a perf file with perf record ./application. perf report shows me various things about it. How can I show the total time it took to run the application, and the total time to run a specific "symbol"/function? perf seems to often show percentages, but I want raw time, and I want "inclusive" time, i.e. including children.
perf v4.15.18 on Ubuntu Linux 18.04
perf is a statistical (sampling) profiler (in its default perf record mode), which means it has no exact timestamps of function entry and exit (tracing is required for exact data). perf asks the OS kernel to generate interrupts thousands of times per second (4 kHz for the hardware PMU if -e cycles is supported, less for the software event -e cpu-clock). Every interrupt of program execution is recorded as a sample which contains the EIP (current instruction pointer), the pid (process/thread id), and a timestamp. When the program runs for several seconds, there will be thousands of samples, and perf report can generate histograms from them: which parts of the program code (which functions) were executed more often than others. You will get a generic overview that some functions took around 30% of the program execution time while others took 5%.
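For example, you can make the sampling explicit by choosing the event and frequency yourself (both are standard perf record options): perf record -e cpu-clock -F 1000 ./application collects roughly 1000 samples per second of on-CPU execution.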
perf report does not compute the total program execution time (it may estimate it by comparing the timestamps of the first and last sample, but that is not exact if there were off-CPU periods). It does estimate the total event count (it is printed in the first line of the interactive TUI and is listed in the text output):
$ perf report |grep approx
# Samples: 1K of event 'cycles'
# Event count (approx.): 844373507
There is a perf report -n option which adds a "number of samples" column next to the percent column.
Samples: 1K of event 'cycles', Event count (approx.): 861416907
Overhead Samples Command Shared Object Symbol
42.36% 576 bc bc [.] _bc_rec_mul
37.49% 510 bc bc [.] _bc_shift_addsub.isra.3
14.90% 202 bc bc [.] _bc_do_sub
0.89% 12 bc bc [.] bc_free_num
But samples are not taken at equal intervals and they are less exact than the computed overhead (every sample may have a different weight). I recommend running perf stat ./application to get the real total running time and the total hardware counts for your application. This works best when your application has a stable running time (do perf stat -r 5 ./application to have the variation estimated by the tool, shown as "+- 0.28%" in the last column).
To include children, function stack traces must be sampled at every interrupt. They are not sampled in the default perf record mode. This sampling is turned on with the -g or --call-graph dwarf option: perf record -g ./application or perf record --call-graph dwarf ./application. It is not simple to use correctly for preinstalled libraries or applications in Linux (most distributions strip debug information from packages), but it can be used for your own applications compiled with debug information. The default -g, which is the same as --call-graph fp, requires that all code be compiled with the -fno-omit-frame-pointer gcc option; the non-default --call-graph dwarf is more reliable. With a correctly prepared program and libraries, a single-threaded application, and long enough stack samples (8 KB is the default; change it with --call-graph dwarf,65536), perf report should show around 99% for the _start and main functions (children included).
bc calculator compiled with -fno-omit-frame-pointer:
bc-no-omit-frame$ echo '3^123456%3' | perf record -g bc/bc
bc-no-omit-frame$ perf report
Samples: 1K of event 'cycles:uppp', Event count (approx.): 811063902
Children Self Command Shared Object Symbol
+ 98.33% 0.00% bc [unknown] [.] 0x771e258d4c544155
+ 98.33% 0.00% bc libc-2.27.so [.] __libc_start_main
+ 98.33% 0.00% bc bc [.] main
bc calculator with dwarf call graph:
$ echo '3^123456%3' | perf record --call-graph dwarf bc/bc
$ perf report
Samples: 1K of event 'cycles:uppp', Event count (approx.): 898828479
Children Self Command Shared Object Symbol
+ 98.42% 0.00% bc bc [.] _start
+ 98.42% 0.00% bc libc-2.27.so [.] __libc_start_main
+ 98.42% 0.00% bc bc [.] main
bc with debug info stripped is handled incorrectly by perf's -g (fp) call-graph mode (no 99% for main):
$ cp bc/bc bc.strip
$ strip -d bc.strip
$ echo '3^123456%3' | perf record --call-graph fp ./bc.strip
Samples: 1K of event 'cycles:uppp', Event count (approx.): 841993392
Children Self Command Shared Object Symbol
+ 43.94% 43.94% bc.strip bc.strip [.] _bc_rec_mul
+ 39.73% 39.73% bc.strip bc.strip [.] _bc_shift_addsub.isra.3
+ 11.27% 11.27% bc.strip bc.strip [.] _bc_do_sub
+ 0.92% 0.92% bc.strip libc-2.27.so [.] malloc
Sometimes perf report --no-children can be useful to disable sorting on self+children overhead (it will sort by "self" overhead instead), for example when the call graph was not fully captured.
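To get back to the raw-time part of the question: with call graphs recorded, a rough approximation (an estimate, not an exact measurement) is to multiply a symbol's Children percentage from perf report by the total running time reported by perf stat; for a single-threaded, CPU-bound run this gives the inclusive time spent in that function and its children.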

An isolated-cpu cache-misses observation

I have an isolated cpu (core 12), and I execute the following perf command:
sudo perf stat -e DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK,ITLB_MISSES.MISS_CAUSES_A_WALK,DTLB_STORE_MISSES.MISS_CAUSES_A_WALK,cache-references,cache-misses -C 12 sleep 10
Performance counter stats for 'CPU(s) 12':
0 DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK
4 ITLB_MISSES.MISS_CAUSES_A_WALK
0 DTLB_STORE_MISSES.MISS_CAUSES_A_WALK
210 cache-references
176 cache-misses # 83.810 % of all cache refs
Usually I use perf to measure cache-misses for a process; how should I interpret this: cache-misses for a whole cpu core?

How do I see a Unix Job's performance (execution time and cpu resource) after execution?

I have an incoming task that requires me to do performance testing on a Unix job. Is there something in Unix that would tell me a job's performance (execution time and CPU resources)? I plan to do a before-and-after comparison.
You can use the Linux profiling tool perf, e.g.:
perf stat ls
In my computer:
Performance counter stats for 'ls':
2.066571 task-clock # 0.804 CPUs utilized
1 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
267 page-faults # 0.129 M/sec
2,434,744 cycles # 1.178 GHz [57.78%]
1,384,929 stalled-cycles-frontend # 56.88% frontend cycles idle [52.01%]
1,035,939 stalled-cycles-backend # 42.55% backend cycles idle [98.96%]
1,894,339 instructions # 0.78 insns per cycle
# 0.73 stalled cycles per insn
370,865 branches # 179.459 M/sec
14,280 branch-misses # 3.85% of all branches
0.002569026 seconds time elapsed
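In that output, "seconds time elapsed" is the wall-clock execution time and "task-clock" is the CPU time the job actually consumed. If perf is not available, the shell built-in time or GNU /usr/bin/time (for example /usr/bin/time -v ./job, where ./job stands for your command) also report elapsed, user, and system CPU time after the job finishes.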

Making use of all available RAM in a Haskell program?

I have 8 GB of RAM, but Haskell programs seemingly can only use 1.3 GB.
I'm using this simple program to determine how much memory a GHC program can allocate:
import System.Environment
import Data.Set as Set
main = do
  args <- getArgs
  let n = (read $ args !! 0) :: Int
      s = Set.fromList [0..n]
  do
    putStrLn $ "min: " ++ (show $ findMin s)
    putStrLn $ "max: " ++ (show $ findMax s)
Here's what I'm finding:
running ./mem.exe 40000000 +RTS -s succeeds and reports 1113 MB total memory in use
running ./mem.exe 42000000 +RTS -s fails with out of memory error
running ./mem.exe 42000000 +RTS -s -M4G errors out with -M4G: size outside allowed range
running ./mem.exe 42000000 +RTS -s -M3.9G fails with out of memory error
Monitoring the process via the Windows Task Manager shows that the max memory usage is about 1.2 GB.
My system: Win7, 8 GB RAM, Haskell Platform 2011.04.0.0, ghc 7.0.4.
I'm compiling with: ghc -O2 mem.hs -rtsopts
How can I make use of all of my available RAM? Am I missing something obvious?
Currently, on Windows, GHC is a 32-bit program; I think a 64-bit GHC for Windows is supposed to become available when 7.6 comes out.
One consequence of that is that on Windows you can't use more than 4G minus one block of memory, since the maximum allowed as a size parameter is HS_WORD_MAX:
decodeSize(rts_argv[arg], 2, BLOCK_SIZE, HS_WORD_MAX) / BLOCK_SIZE;
With 32-bit Words, HS_WORD_MAX = 2^32-1.
That explains
running ./mem.exe 42000000 +RTS -s -M4G errors out with -M4G: size outside allowed range
since decodeSize() decodes 4G as 2^32.
This limitation will also remain after upgrading your GHC, until a 64-bit GHC for Windows is finally released.
For a 32-bit process, the user-mode virtual address space is limited to 2 or 4 GB (depending on the status of the IMAGE_FILE_LARGE_ADDRESS_AWARE flag); cf. Memory limits for Windows Releases.
Now, you are trying to construct a Set containing 42 million 4-byte Ints. A Data.Set.Set has five words of overhead per element (constructor, size, left and right subtree pointers, pointer to the element), so the Set will take up about 0.94 GiB of memory (1.008 'metric' GB). But the process uses about twice that or more (it needs space for garbage collection, at least the size of the live heap).
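Spelling that estimate out (under the assumptions above: 32-bit words and 4-byte Ints): (5 words × 4 bytes + 4 bytes) × 42,000,000 elements ≈ 1.008 × 10^9 bytes ≈ 0.94 GiB, and roughly double that once the copying collector needs room to copy the live heap.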
Running the programme on my 64-bit linux, with input 21000000 (to make up for the twice as large Ints and pointers), I get
$ ./mem +RTS -s -RTS 21000000
min: 0
max: 21000000
31,330,814,200 bytes allocated in the heap
4,708,535,032 bytes copied during GC
1,157,426,280 bytes maximum residency (12 sample(s))
13,669,312 bytes maximum slop
2261 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 59971 colls, 0 par 2.73s 2.73s 0.0000s 0.0003s
Gen 1 12 colls, 0 par 3.31s 10.38s 0.8654s 8.8131s
INIT time 0.00s ( 0.00s elapsed)
MUT time 12.12s ( 13.33s elapsed)
GC time 6.03s ( 13.12s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 18.15s ( 26.45s elapsed)
%GC time 33.2% (49.6% elapsed)
Alloc rate 2,584,429,494 bytes per MUT second
Productivity 66.8% of total user, 45.8% of total elapsed
but top reports only 1.1g of memory use; top, and presumably the Task Manager, reports only the live heap.
So it seems IMAGE_FILE_LARGE_ADDRESS_AWARE is not set, your process is limited to an address space of 2 GB, and the 42-million-element Set needs more than that, unless you specify a maximum or suggested heap size that is smaller:
$ ./mem +RTS -s -M1800M -RTS 21000000
min: 0
max: 21000000
31,330,814,200 bytes allocated in the heap
3,551,201,872 bytes copied during GC
1,157,426,280 bytes maximum residency (12 sample(s))
13,669,312 bytes maximum slop
1154 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 59971 colls, 0 par 2.70s 2.70s 0.0000s 0.0002s
Gen 1 12 colls, 0 par 4.23s 4.85s 0.4043s 3.3144s
INIT time 0.00s ( 0.00s elapsed)
MUT time 11.99s ( 12.00s elapsed)
GC time 6.93s ( 7.55s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 18.93s ( 19.56s elapsed)
%GC time 36.6% (38.6% elapsed)
Alloc rate 2,611,793,025 bytes per MUT second
Productivity 63.4% of total user, 61.3% of total elapsed
Setting the maximal heap size below what it would use naturally actually lets it fit in hardly more than the space needed for the Set, at the price of a slightly longer GC time; suggesting a heap size with -H1800M lets it finish using only
1831 MB total memory in use (0 MB lost due to fragmentation)
So if you specify a maximal heap size below 2GB (but large enough for the Set to fit), it should work.
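For example (an untested guess at a suitable value; anything between the Set's roughly 1 GB footprint and the 2 GB address-space limit should do): ./mem.exe 42000000 +RTS -s -M1800M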
The default heap size is unlimited.
Using GHC 7.2 on a 64-bit Windows XP machine, I can allocate more by explicitly setting a larger heap size:
$ ./A 42000000 +RTS -s -H1.6G
min: 0
max: 42000000
32,590,763,756 bytes allocated in the heap
3,347,044,008 bytes copied during GC
714,186,476 bytes maximum residency (4 sample(s))
3,285,676 bytes maximum slop
1651 MB total memory in use (0 MB lost due to fragmentation)
and
$ ./A 42000000 +RTS -s -H1.7G
min: 0
max: 42000000
32,590,763,756 bytes allocated in the heap
3,399,477,240 bytes copied during GC
757,603,572 bytes maximum residency (4 sample(s))
3,281,580 bytes maximum slop
1754 MB total memory in use (0 MB lost due to fragmentation)
even:
$ ./A 42000000 +RTS -s -H1.85G
min: 0
max: 42000000
32,590,763,784 bytes allocated in the heap
3,492,115,128 bytes copied during GC
821,240,344 bytes maximum residency (4 sample(s))
3,285,676 bytes maximum slop
1909 MB total memory in use (0 MB lost due to fragmentation)
That is, I can allocate up to the Windows XP 2G process limit. I imagine on Win 7 you won't have such a low limit (this table suggests either 4G or 192G), so just ask for as much as you need (and use a more recent GHC).
