Simpleperf doesn't unwind the stack

I'm attempting to profile an Android NDK r14b Clang-based application with Google's simpleperf sampling profiler. The recorded call-stack samples aren't actually unwound -- only the top frame of the call stack seems to be recorded, so the profiling reports aren't very useful. I've specified -fno-omit-frame-pointer in most of the code, but this seems to make no difference.
What am I missing? Is there a more current profiler for Android NDK projects I should be using?

If you are doing frame-pointer-based unwinding (the --call-graph fp option), please use the aarch64 architecture, because arm has mixed arm/thumb code and can't be unwound reliably even if you use -fno-omit-frame-pointer everywhere.
If you are doing dwarf-based unwinding (the -g or --call-graph dwarf option), -fno-omit-frame-pointer doesn't matter, and you should instead ship shared libraries containing debug info in the APK.
It is also possible that the unwinding stops at Java code. To unwind through Java code, you need it fully compiled into native code and must use dwarf-based unwinding.
Finally, you can use the app_profiler.py script included in NDK r14b. It tries to handle the details for you: fully compiling the Java code and downloading libraries with debug info to the device. It is also easy to inspect and adjust if it doesn't work well in your environment.
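For reference, the two unwinding modes correspond to record invocations along these lines (a sketch only; the pid and duration are placeholders, and -g is simpleperf's shorthand for --call-graph dwarf):
simpleperf record --call-graph fp -p <pid> --duration 10 -o perf_fp.data
simpleperf record -g -p <pid> --duration 10 -o perf_dwarf.data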

There are some simpleperf options I've found I need to specify (or avoid) to make it more likely that I get the expected call graph.
If I specify '-a --cpu 1' for instance, then the binary I'm profiling won't even appear in the call graph.
For instance, if I do (where perf_test.x mostly spins for 1 second on cpu 1):
simpleperf record -g -a -e cpu-cycles --cpu 1 ./perf_test.x -C 1 -w bw -t 1
simpleperf report -g caller
then perf_test.x won't appear at all (for me) in the output.
So drop the --cpu x option if you are using it.
Also, a high sampling rate increases the overhead. The run below uses the (current) default sampling rate of 4000 samples/sec.
simpleperf record -g -a -e cpu-cycles -F 4000 ./perf_test.x -C 1 -w bw -t 1
simpleperf report -g caller
The report above shows simpleperf itself as the top process, accounting for 40-70% of the samples.
Reducing the sampling rate:
simpleperf record -g -a -e cpu-cycles -F 1000 ./perf_test.x -C 1 -w bw -t 1
simpleperf report -g caller
brings perf_test.x up to the top of the report, and the first simpleperf entry drops to 24% of total samples.
Hope this is helpful.

Related

What is the difference between "-c opt" and "--copt=-O3" in Bazel build (or GCC)

I'm learning GCC and Bazel. I want to enable all optimizations when using Bazel to build a project that requires the best performance.
Then I found -c opt, which sets the compilation mode to optimized, without debug information.
And --copt=-O3 sets the optimization level to -O3; there are also -O2, -Os, etc.
I'm confused by these two options.
What is the difference between -c opt and --copt=-O3?
Do they imply each other, so that I only need to pass one of them to bazel build?
--copt is for passing arguments to the compiler.
-c is a short form of --compilation-mode.
Its effect is described in the user-manual:
It sets compiler options (e.g. -c opt implies -O2 -DNDEBUG)
There are different output directories per compilation mode, so you can switch between debug and optimized builds without full recompilation.
So usually, -c opt is enough. If you want the behaviour of -c opt but with a different optimization level, you combine the two options, as in -c opt --copt=-O3; the compiler then gets both -O2 and -O3, and the last one wins.
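For example (the target label //your:target is just a placeholder):
bazel build -c opt //your:target
bazel build -c opt --copt=-O3 //your:target
The first build is the plain optimized mode (roughly -O2 -DNDEBUG); in the second, the compiler sees both -O2 and -O3 and the later -O3 takes effect.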
And watch out, there is a third similar option:
--config=configname is for selecting a configuration. You can have a .bazelrc that defines default options. Not all of them are always active; some apply only if you activate them with the --config=configname command-line option. Now, opt is a popular config name, so if you have a .bazelrc that contains
build:opt --copt=-O3
then bazel build --config=opt has the same effect as bazel build --copt=-O3

GCC, compare the effect of using -O2 and using all the optimization flags it turns on

The GCC 5.4 documentation says:
-O2 turns on all optimization flags specified by -O. It also turns on the following optimization flags:
-fthread-jumps
-falign-functions -falign-jumps
-falign-loops -falign-labels
-fcaller-saves
-fcrossjumping
-fcse-follow-jumps, etc
It would seem that using -O2 should have the same effect on the performance of the test programs as using all 83 optimization flags that -O2 turns on in GCC 5.4.0.
However, I compared the running times of the executables test1 and test2, obtained with
gcc-5.4 -O2 test.c -o test1
and
gcc-5.4 -fauto-inc-dec -fbranch-count-reg -fcombine-stack-adjustments -fcompare-elim ... -fthread-jumps -falign-functions ... (all the 83 flags) test.c -o test2
I tested 20 randomly generated C programs, running each test case 100,000 times to make sure the running-time measurements are accurate enough. The result is that using -O2 is on average about 60% faster than using all 83 flags.
I am really confused why the effect of using -O2 is not equivalent to using all the optimization flags it turns on.
I must have misunderstood something, but I couldn't find an explanation. I'd appreciate any help. Thanks a lot.
It is a common gotcha. In order to enable (or disable) specific optimizations, you must first enable the optimizer in general, i.e. use one of the -O... flags other than -O0 (or just -O, which is equivalent to -O1).
The optimisation level also affects decisions in other parts of the compiler besides determining which passes get run. These decisions are made during mandatory processes like transforming between internal representations of the code, register allocation, etc., so the optimisation level is not exactly equivalent to a set of switches enabling every compiler pass.
Look at this thread for some discussion on this topic.
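One way to see this for yourself, as a sketch: GCC's -Q --help=optimizers option prints which optimizations are actually enabled for a given set of flags, so you can compare what the individual -f switches enable against -O2:
gcc-5.4 -Q --help=optimizers -fthread-jumps -falign-functions > flags_only.txt
gcc-5.4 -Q --help=optimizers -O2 > o2.txt
diff flags_only.txt o2.txt
The diff should show that the -f flags on their own leave most of the optimizer machinery in its -O0 state.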

Avrdude .hex with Fuses

I have used a makefile to build my code and I have produced an ELF file.
To make it understandable for my ATtiny85, I usually use avr-objcopy -O ihex -R .eeprom main.elf main_all.hex. This gives me a hex file containing the fuse settings. I flash the hex file with avrdude -p t85 -c avrispmkII -P usb -U flash:w:main_all.hex.
I am using an avrispmkII connected via a working and tested SPI.
This time I got an error.
ERROR: address 0x820003 out of range
I guess the problem is that I've been playing with the fuses in the code. According to Contiki compile error, "ERROR: address 0x820003 out of range at line 1740 of...",
I've noticed that you can make avr-objcopy create a hex file without the fuses:
avr-objcopy -O ihex -R .eeprom -R .fuse main.elf main_ohne.hex
This works, and the ATtiny85 now flashes completely normally.
Now the real question.
How do I still get the fuses on the attiny85?
Is there any way to see which fuses I am setting to which values before I actually set them? I ask beforehand because I have no experience with 12 V (high-voltage) programming, and I'm not sure this avrispmkII even supports it (yes, I should check the datasheet to see whether it can).
My main concern is to get the fuses onto the ATtiny. I am a graduate electrical engineer who programs in his spare time, so I'm fine with links for further reading and the magic command.
(Rough translation from the German original)
You can set the fuse bytes on the avrdude command line; see the example below.
There are only 3 fuse bytes on the ATtiny: low, high, and extended. They can be found on p. 148 of the datasheet.
Just compute the fuse setting as a hex number and include -U switches like
-U efuse:w:0xff:m -U hfuse:w:0x89:m -U lfuse:w:0x2e:m
for the extended, high, and low fuses.
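A sketch of a full invocation, using your programmer settings from above (the fuse values shown are the ATtiny85 factory defaults and are only placeholders; compute your own with the datasheet or a fuse calculator):
avrdude -p t85 -c avrispmkII -P usb -U lfuse:r:-:h -U hfuse:r:-:h -U efuse:r:-:h
avrdude -p t85 -c avrispmkII -P usb -U flash:w:main_ohne.hex -U lfuse:w:0x62:m -U hfuse:w:0xdf:m -U efuse:w:0xff:m
The first command only reads the current fuse bytes and prints them as hex, so you can see what is set before changing anything; the second flashes the program and writes all three fuses in one go.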

Dump IR after each LLVM optimization (each pass), both LLVM IR passes and backend debugging

I want to find some debugging options for Clang/LLVM which work like GCC's -fdump-tree-all-all, -fdump-rtl-all, and -fdump-ipa-all-all.
Basically, I want to have an LLVM IR dump before and after each optimization pass. Also, it can be useful to have all dumps of the AST from Clang and all phases of code generation (backend phases, Selection DAG, ISEL-SDNode, register allocation, and MCInsts).
I was only able to find Clang's -ccc-print-phases, but it just prints high-level phase names, e.g., preprocess-compile-assemble-link, and no dump of the IR.
There is also the "Life of an instruction in LLVM" article with its -cc1 -ast-dump option to dump Clang ASTs, but I want more, especially for code generation.
It seems that you've already discovered how to do dumps on the Clang AST level and LLVM IR level. For code generation, the following are useful:
-debug gives a detailed textual dump of instruction selection and later stages. Also, the -view*-dags options show (pop up) the DAGs:
llc -help-hidden | grep dags
Output:
-view-dag-combine-lt-dags - Pop up a window to show dags before the post legalize types dag combine pass
-view-dag-combine1-dags - Pop up a window to show dags before the first dag combine pass
-view-dag-combine2-dags - Pop up a window to show dags before the second dag combine pass
-view-isel-dags - Pop up a window to show isel dags as they are selected
-view-legalize-dags - Pop up a window to show dags before legalize
-view-legalize-types-dags - Pop up a window to show dags before legalize types
-view-misched-dags - Pop up a window to show MISched dags after they are processed
-view-sched-dags - Pop up a window to show sched dags as they are processed
-view-sunit-dags - Pop up a window to show SUnit dags after they are processed
These may not show up if you haven't configured and compiled LLVM with Graphviz support.
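As a rough sketch of how these are typically driven (assuming an assertions-enabled build of llc for the -debug output and Graphviz for the -view* options; test.c is a placeholder):
clang -O2 -S -emit-llvm test.c -o test.ll
llc -debug test.ll -o test.s 2> codegen.log
llc -view-isel-dags test.ll -o /dev/null
The first llc invocation writes a textual trace of instruction selection and the later machine passes to codegen.log; the second pops up a Graphviz window with the ISel DAG for each function.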
This is not exactly what you asked, but to see which passes are applied, you can do:
clang test.c -Ofast -march=core-avx2 -mllvm -debug-pass=Arguments
You will see something like:
Pass Arguments: -datalayout -notti -basictti -x86tti -targetlibinfo -jump-instr-table-info -targetpassconfig -no-aa -tbaa -scoped-noalias -basicaa -collector-metadata -machinemoduleinfo -machine-branch-prob -jump-instr-tables -verify -verify-di -domtree -loops -loop-simplify -scalar-evolution -iv-users -loop-reduce -gc-lowering -unreachableblockelim -consthoist -partially-inline-libcalls -codegenprepare -verify-di -stack-protector -verify -domtree -loops -branch-prob -machinedomtree -expand-isel-pseudos -tailduplication -opt-phis -machinedomtree -slotindexes -stack-coloring -localstackalloc -dead-mi-elimination -machinedomtree -machine-loops -machine-trace-metrics -early-ifcvt -machinelicm -machine-cse -machine-sink -peephole-opts -dead-mi-elimination -processimpdefs -unreachable-mbb-elimination -livevars -machinedomtree -machine-loops -phi-node-elimination -twoaddressinstruction -slotindexes -liveintervals -simple-register-coalescing -misched -machine-block-freq -livedebugvars -livestacks -virtregmap -liveregmatrix -edge-bundles -spill-code-placement -virtregrewriter -stack-slot-coloring -machinelicm -edge-bundles -prologepilog -machine-block-freq -branch-folder -tailduplication -machine-cp -postrapseudos -machinedomtree -machine-loops -post-RA-sched -gc-analysis -machine-block-freq -block-placement2 -stackmap-liveness -machinedomtree -machine-loops
I am using llvm-gcc-4.2 on Mac OS X v10.8 (Mountain Lion) and -fdump-tree-all works.
gcc -fdump-tree-all -o test file1.c file2.c file1.h -I .

Is there a method/function to get the code size of a C program compiled using GCC compiler? (may vary when some optimization technique is applied)

Can I measure the code size with the help of an fseek() function and store it to a shell variable?
Is it possible to extract the code size, compilation time, and execution time using MILEPOST GCC or a GNU profiler tool? If yes, how can I store them in shell variables?
Since my aim is to find the best set of optimization techniques on the basis of compilation time, execution time, and code size, I am expecting some function that can return these parameters.
MyPgm=/root/Project/Programs/test.c
gcc -Wall -O1 -fauto-inc-dec $MyPgm -o output
time -f "%e" -o Output.log ./output
while read line;
do
    echo -e "$line";
    Val=$line
done < Output.log
This will store the execution time in the variable Val. Similarly, I want to get the values of code size as well as compilation time.
I would prefer a way to accomplish this without using an external program!
For code size on Linux, you can use the size command in a terminal:
$ size file-name.out
It will give the sizes of the different sections. Use the text section for the code size; you can add data and bss if you want to count global data as well.
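The output looks roughly like this (the numbers are illustrative only):
$ size output
   text    data     bss     dec     hex filename
  13424     672      32   14128    3730 output
If you want the text figure in a shell variable, one sketch (assuming GNU size's default Berkeley output format) is:
CodeSize=$(size output | awk 'NR==2 {print $1}')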
You can use the size(1) command http://www.linuxmanpages.com/man1/size.1.php
Or open the ELF file, walk over the section headers, and sum the sizes of all sections with type SHT_PROGBITS and the SHF_EXECINSTR flag set.
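A rough shell sketch of the same idea, assuming GNU readelf and gawk (executable sections show an X in readelf's flags column, and the Size column is hexadecimal):
readelf -S -W ./output | awk '/PROGBITS/ && $(NF-3) ~ /X/ { total += strtonum("0x" $(NF-5)) } END { print total }'
Counting fields from the end sidesteps the fact that the [Nr] column sometimes splits into two fields; strtonum is a gawk extension.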
On non-Linux / non-GNU-utils systems (where you may have neither GNU size nor readelf), the nm program can be used to dump symbol information (including sizes) from object files (libraries / executables). The syntax is slightly system-dependent:
OpenGroup manpage for nm (the "portable subset")
Linux/BSD manpage for nm (GNU version)
Solaris manpage for nm
AIX manpage for nm
nm usage on HP/UX (this says "PA-RISC" but the utility is present / usable on Itanium)
Windows: Doesn't have nm as such, but see: Microsoft equivalent of the nm command
Unfortunately, while the utility is available almost everywhere, its output format is not as portable as could be, so some system-specific scripting is necessary.
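As one concrete instance (GNU nm syntax; other systems need different flags and field positions), summing the sizes of the symbols in the text section gives a rough code-size figure:
nm --print-size --size-sort --radix=d ./output | awk '$3 ~ /^[Tt]$/ { sum += $2 } END { print sum }'
This counts only symbols that have a recorded size, so it usually comes out somewhat smaller than the full size of the .text section.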
