How does linux's perf utility understand stack traces? - linux-kernel

Linux's perf utility is famously used by Brendan Gregg to generate flame graphs for C/C++, JVM code, Node.js code, etc.
Does the Linux kernel natively understand stack traces? Where can I read more about how a tool is able to introspect into stack traces of processes, even if processes are written in completely different languages?

There is a short introduction to stack traces in perf by Gregg:
http://www.brendangregg.com/perf.html
4.4 Stack Traces
Always compile with frame pointers. Omitting frame pointers is an evil compiler optimization that breaks debuggers, and sadly, is often the default. Without them, you may see incomplete stacks from perf_events ... There are two ways to fix this: either using dwarf data to unwind the stack, or returning the frame pointers.
Dwarf
Since about the 3.9 kernel, perf_events has supported a workaround for missing frame pointers in user-level stacks: libunwind, which uses dwarf. This can be enabled using "-g dwarf".
... compiler optimizations (-O2), which in this case has omitted the frame pointer. ... recompiling .. with -fno-omit-frame-pointer:
Languages that are not C-style may use a different frame format, or may omit frame pointers too:
4.3. JIT Symbols (Java, Node.js)
Programs that have virtual machines (VMs), like Java's JVM and node's v8, execute their own virtual processor, which has its own way of executing functions and managing stacks. If you profile these using perf_events, you'll see symbols for the VM engine .. perf_events has JIT support to solve this, which requires the VM to maintain a /tmp/perf-PID.map file for symbol translation.
Note that Java may not show full stacks to begin with, due to hotspot on x86 omitting the frame pointer (just like gcc). On newer versions (JDK 8u60+), you can use the -XX:+PreserveFramePointer option to fix this behavior, ...
Gregg's blog post about Java and stack traces:
http://techblog.netflix.com/2015/07/java-in-flames.html (see "Fixing Frame Pointers": fixed in some JDK 8 versions and in JDK 9 by adding an option at program start)
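To make the /tmp/perf-PID.map mechanism quoted above more concrete, here is a minimal sketch of how a JIT could format one entry. It assumes the format described in the kernel's perf documentation (one symbol per line: START SIZE symbolname, with START and SIZE in hex without a 0x prefix). The function names format_perf_map_entry and format_perf_map_path are made up for this illustration and are not part of any VM or of perf.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format one map entry as "START SIZE symbolname\n" (hex, no 0x prefix).
 * Returns the number of characters snprintf would have written. */
int format_perf_map_entry(char *out, size_t out_len,
                          uint64_t start, uint64_t size, const char *name)
{
    /* One entry per line, so the symbol name must not contain a newline. */
    assert(out != NULL && name != NULL && strchr(name, '\n') == NULL);
    return snprintf(out, out_len, "%" PRIx64 " %" PRIx64 " %s\n",
                    start, size, name);
}

/* Build the file name perf looks for: /tmp/perf-<pid>.map */
int format_perf_map_path(char *out, size_t out_len, long pid)
{
    assert(out != NULL);
    return snprintf(out, out_len, "/tmp/perf-%ld.map", pid);
}
```

A VM would append one such line each time it emits a compiled function, so that perf can translate sampled addresses in JIT-generated code back to names.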
Now, your questions:
How does linux's perf utility understand stack traces?
The perf utility basically (in early versions) just parses data returned by the Linux kernel's "perf_events" subsystem (sometimes just called "events"), which is accessed through the perf_event_open syscall. For call stack traces there are the sample_type options PERF_SAMPLE_CALLCHAIN and PERF_SAMPLE_STACK_USER:
sample_type
PERF_SAMPLE_CALLCHAIN
Records the callchain (stack backtrace).
PERF_SAMPLE_STACK_USER (since Linux 3.7)
Records the user level stack, allowing stack unwinding.
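As a rough illustration of how a tool requests call chains, the sketch below fills a perf_event_attr with PERF_SAMPLE_CALLCHAIN and opens the event through syscall(2), since glibc provides no perf_event_open wrapper. The helper names init_callchain_attr and open_callchain_event, and the choice of the software CPU clock as the event, are assumptions made for this example; reading the samples back from the mmap ring buffer is left out.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Fill *attr for periodic CPU-clock sampling that records the call chain. */
void init_callchain_attr(struct perf_event_attr *attr, unsigned long long period)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_SOFTWARE;
    attr->config = PERF_COUNT_SW_CPU_CLOCK;
    attr->sample_period = period;
    /* Ask the kernel to attach a stack backtrace to every sample. */
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_CALLCHAIN;
    attr->exclude_kernel = 1;   /* user-space frames only */
    attr->exclude_hv = 1;
    attr->disabled = 1;         /* enable later via ioctl */
}

/* Open the event for one process. Returns a file descriptor, or -1 with
 * errno set (for example when perf_event_paranoid forbids it). */
int open_callchain_event(pid_t pid, unsigned long long period)
{
    struct perf_event_attr attr;
    init_callchain_attr(&attr, period);
    /* cpu = -1: any CPU; group_fd = -1: no event group; flags = 0 */
    return (int)syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0UL);
}
```

Whether the returned samples actually contain useful user-space frames still depends on the architecture-specific unwinder discussed next.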
Does the Linux kernel natively understand stack traces?
It may or may not, depending on your CPU architecture and on whether support has been implemented for it. The function that samples the call chain (that is, reads the call stack of a live process) is defined in the architecture-independent part of the kernel as __weak with an empty body:
http://lxr.free-electrons.com/source/kernel/events/callchain.c?v=4.4#L26
__weak void perf_callchain_kernel(struct perf_callchain_entry *entry,
                                  struct pt_regs *regs)
{
}

__weak void perf_callchain_user(struct perf_callchain_entry *entry,
                                struct pt_regs *regs)
{
}
In the 4.4 kernel, the user-space callchain sampler is overridden in the architecture-dependent part of the kernel for x86/x86_64, ARC, SPARC, ARM/ARM64, Xtensa, Tilera TILE, PowerPC, and Imagination Meta:
http://lxr.free-electrons.com/ident?v=4.4;i=perf_callchain_user
arch/x86/kernel/cpu/perf_event.c, line 2279
arch/arc/kernel/perf_event.c, line 72
arch/sparc/kernel/perf_event.c, line 1829
arch/arm/kernel/perf_callchain.c, line 62
arch/xtensa/kernel/perf_event.c, line 339
arch/tile/kernel/perf_event.c, line 995
arch/arm64/kernel/perf_callchain.c, line 109
arch/powerpc/perf/callchain.c, line 490
arch/metag/kernel/perf_callchain.c, line 59
Reading the call chain from the user stack may be non-trivial for some architectures and/or some modes.
Which CPU architecture do you use? Which languages and VMs are involved?
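For intuition about what a frame-pointer based sampler does, here is a simplified, self-contained model. It is not the kernel's code: the struct frame_record layout (saved caller frame pointer followed by the return address) and the function walk_frame_chain are assumptions for this sketch, and the kernel additionally has to copy each record safely from user memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame record on a frame-pointer ABI: the saved frame
 * pointer of the caller, followed by the return address into the caller. */
struct frame_record {
    const struct frame_record *next_fp;
    uintptr_t ret_addr;
};

/* Follow the chain starting at fp and store up to max return addresses.
 * The walk stops at a NULL link, or when the next record is not at a
 * higher address (the stack grows down, so callers live higher); the
 * second check guards against loops and corrupted frames. */
size_t walk_frame_chain(const struct frame_record *fp, uintptr_t *out, size_t max)
{
    size_t n = 0;
    assert(out != NULL || max == 0);
    while (fp != NULL && n < max) {
        out[n++] = fp->ret_addr;
        if (fp->next_fp != NULL && (uintptr_t)fp->next_fp <= (uintptr_t)fp)
            break;
        fp = fp->next_fp;
    }
    return n;
}
```

This also shows why omitted frame pointers break the approach: if a function uses the frame pointer register as a general-purpose register, the chain simply leads nowhere useful.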
Where can I read more about how a tool is able to introspect into stack traces of processes, even if processes are written in completely different languages?
You may try gdb and/or the debuggers for your language, the backtrace function of libc, or the read-only unwinding support in libunwind (there is a local backtrace example in libunwind, show_backtrace()).
They may have better support for frame parsing, or better integration with the language's virtual machine or with unwind info. If gdb (with the backtrace command) or other debuggers can't get stack traces from the running program, there may be no way of getting a stack trace at all.
If they can get the call trace but perf can't (even after recompiling with -fno-omit-frame-pointer for C/C++), it may be possible to add support for that combination of architecture and frame format to perf_events and perf.
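As a quick check of whether in-process unwinding works at all for a given binary, glibc's backtrace() can be called from inside the program. The sketch below uses only backtrace() and backtrace_symbols() from <execinfo.h>; the wrapper name print_local_backtrace is made up for this example, and the code assumes a glibc-based system.

```c
#include <assert.h>
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

/* Capture the current call stack and print one line per frame to out.
 * Returns the number of frames captured. */
int print_local_backtrace(FILE *out)
{
    void *frames[64];
    char **symbols;
    int n, i;

    assert(out != NULL);
    n = backtrace(frames, 64);
    symbols = backtrace_symbols(frames, n);  /* one malloc'ed array */
    if (symbols == NULL)
        return n;                            /* addresses captured, no names */
    for (i = 0; i < n; i++)
        fprintf(out, "#%d %s\n", i, symbols[i]);
    free(symbols);
    return n;
}
```

Function names in the output are usually only resolved for exported symbols, so linking with -rdynamic tends to give more readable results.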
There are several blogs with some info about generic backtracing problems and solutions:
http://eli.thegreenplace.net/2015/programmatic-access-to-the-call-stack-in-c/ - local backtrace with libunwind
http://codingrelic.geekhold.com/2009/05/pre-mortem-backtracing.html - gcc's __builtin_return_address(N) vs glibc's backtrace() vs libunwind's local backtrace
http://lucumr.pocoo.org/2014/10/30/dont-panic/ - backtrace and unwinding in Rust
https://github.com/gperftools/gperftools/wiki/gperftools'-stacktrace-capturing-methods-and-their-issues - the same backtracing problem in the gperftools software-timer based profiler library
Dwarf support for perf_events/perf:
https://lwn.net/Articles/499116/ - [RFCv4 00/16] perf: Add backtrace post dwarf unwind, May 2012
https://lwn.net/Articles/507753/ - [PATCHv7 00/17] perf: Add backtrace post dwarf unwind, Jul 2012
https://wiki.linaro.org/LEG/Engineering/TOOLS/perf-callstack-unwinding - Dwarf unwinding on ARM 7/8 for perf
https://wiki.linaro.org/KenWerner/Sandbox/libunwind#libunwind_ARM_unwind_methods - non-dwarf methods too

Related

Difference of "Use of Stack Memory After Return" between native arm64 and native Intel/rosetta2 x86_64

I have an odd message using my code base (C/C++ & Swift).
The code itself is way too big to post, but I wanted to hear what people think could be the reason.
I run the same code natively on my M1 Apple Silicon chip without any issues. I have all diagnostics turned on:
The fun begins when I use it on an Intel based Mac and/or under Rosetta2. (All systems are Big Sur).
Vithanco(83162,0x20400de00) malloc: enabling scribbling to detect mods to free blocks
Vithanco(83162,0x20400de00) malloc: nano zone abandoned due to inability to preallocate reserved vm space.
applicationDidFinishLaunching
objc[83162]: Class _NSZombie_NSSimpleRegularExpressionCheckingResult is implemented in both ?? (0x60400017ab90) and ?? (0x60400016ffd0). One of the two will be used. Which one is undefined.
=================================================================
==83162==ERROR: AddressSanitizer: stack-use-after-return on address 0x0001105fee00 at pc 0x000101fbd30f bp 0x000308d4eb70 sp 0x000308d4eb68
WRITE of size 8 at 0x0001105fee00 thread T0
==83162==WARNING: invalid path to external symbolizer!
==83162==WARNING: Failed to use and restart external symbolizer!
#0 0x101fbd30e in textfont_dict_open+0x44e (/Users/(deleted)/Library/Developer/Xcode/DerivedData/...-gwcenzuufsseezetprookmoioioy/Build/Products/Debug/.../Contents/MacOS/Vithanco:x86_64+0x1012b630e)
#1 0x1026f3036 in loadGraphvizLibraries+0x156 (/Users/(deleted)/Library/Developer/Xcode/DerivedData/Vithanco-gwcenzuufsseezetprookmoioioy/Build/Products/Debug/Vithanco.app/Contents/MacOS/Vithanco:x86_64+0x1019ec036)
#2 0x1026f618c in globalinit_33_2FCABEB9B9698DE37811B48DE0525A0F_func0+0xc (/Users/(deleted)/Library/Developer/Xcode/DerivedData/Vithanco-gwcenzuufsseezetprookmoioioy/Build/Products/Debug/Vithanco.app/Contents/MacOS/Vithanco:x86_64+0x1019ef18c)
#3 0x1102400af in _dispatch_client_callout+0x7 (/usr/lib/system/introspection/libdispatch.dylib:x86_64+0x40af)
There is a lot more to come on the error stack, but not much of use.
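I was just wondering: what could be the cause? Why would the same code run into a Use of Stack Memory After Return on only one architecture? The same code previously ran fine on Intel. So, would this be a macOS issue, a compiler issue, or something else?
The cause turned out to be a wrong declaration. I used this declaration:
extern struct _dt_s textfont_dict_open(GVC_t * gvc);
instead of
extern struct _dt_s * textfont_dict_open(GVC_t * gvc);
Interesting how the two architectures led to very different outcomes, even though I never used the return value of the function. A likely explanation (not verified against this code base): on x86_64 a large struct returned by value is written through a hidden pointer that the caller passes as the first argument, which shifts gvc into the second argument register, whereas arm64 passes that hidden pointer in a dedicated register (x8) and leaves the ordinary arguments where the callee expects them.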

SEH on Windows, call stack traceback is gone

I am reading this article about SEH on Windows,
and here is the source code of myseh.cpp.
I debugged myseh.cpp. I set two breakpoints: one at printf("Hello from an exception handler\n"); on line 24 and one at DWORD handler = (DWORD)_except_handler; on line 36.
Then I ran it and it broke at line 36. I saw the stack trace as follows.
As execution continued, an access violation occurred because of mov [eax], 1.
Then it broke at line 24. I saw the stack trace as follows.
It was the same thread, but the frame of main was gone and _except_handler was shown instead. ESP also jumped from 0018f6c8 to 0018ef34, which is a big gap.
After the exception was handled:
I know that _except_handler must run in user mode rather than kernel mode.
After _except_handler returned, the thread switched to ring 0, the Windows kernel set EAX in the CONTEXT to &scratch, and then it returned to ring 3, so the thread continued running.
I am curious about the mechanism Windows uses to deal with exceptions:
Why was the frame of main gone?
Why did ESP jump from 0018f6c8 to 0018ef34 (such a big gap)? Do those ESP addresses belong to the same thread's stack? Did the kernel play some tricks on ESP in ring 3? If so, why did it choose the address 0018ef34 for the handler callback's frame? Many thanks!
You are using the default debugger settings, not good enough to see all the details. They were chosen to help you focus on your own code and get the debug session started as quickly as possible.
The [External Code] block tells you that there are parts of the stack frame that do not belong to code that you have written. They don't, they belong to the operating system. Use Tools > Options > Debugging > General and untick the "Enable Just My Code" option.
The [Frames below might be incorrect...] warning tells you that the debugger doesn't have accurate PDBs to correctly walk the stack. Use Tools > Options > Debugging > Symbols and tick the "Microsoft Symbol Servers" option and choose a cache location. The debugger will now download the PDBs you need to debug through the operating system DLLs. Might take a while, it is only done once.
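You can reason out the big ESP change (0018f6c8 - 0018ef34 = 0x794, i.e. 1940 bytes): the exception information is placed on the same thread's stack before the handler is called, and the CONTEXT structure alone is quite large.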
After these changes you ought to now see something resembling:
ConsoleApplication1942.exe!_except_handler(_EXCEPTION_RECORD * ExceptionRecord, void * EstablisherFrame, _CONTEXT * ContextRecord, void * DispatcherContext) Line 22 C++
ntdll.dll!ExecuteHandler2#20() Unknown
ntdll.dll!ExecuteHandler#20() Unknown
ntdll.dll!_KiUserExceptionDispatcher#8() Unknown
ConsoleApplication1942.exe!main() Line 46 C++
ConsoleApplication1942.exe!invoke_main() Line 64 C++
ConsoleApplication1942.exe!__scrt_common_main_seh() Line 255 C++
ConsoleApplication1942.exe!__scrt_common_main() Line 300 C++
ConsoleApplication1942.exe!mainCRTStartup() Line 17 C++
kernel32.dll!#BaseThreadInitThunk#12() Unknown
ntdll.dll!__RtlUserThreadStart() Unknown
ntdll.dll!__RtlUserThreadStart#8() Unknown
Recorded on Win10 version 1607 and VS2015 Update 2. This isn't the correct way to write SEH handlers; find a better example in this post.

Why gfortran does not give symbolic backtrace?

After I ran my Fortran code, compiled with gfortran using the -g option, I got the following error:
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7F2EE30E57D7
#1 0x7F2EE30E5DDE
#2 0x7F2EE2820D3F
#3 0x7F2EE2DEC913
#4 0x408A97 in __aerosols_MOD_moment_logn at aerosols.f90:45
#5 0x408A02 in __aerosols_MOD_set_aerosol at aerosols.f90:78 (discriminator 20)
#6 0x6D357B in __test_cases_2d_MOD_standard_2d_cases at test_cases_2d.f90:210
#7 0x67E9FC in __set_profiles_MOD_read_profiles_standard at set_profiles.f90:118
#8 0x463BF8 in __main_MOD_main_loop at main.f90:48
#9 0x401F05 in kid at KiD.f90:17
Floating point exception (core dumped)
I do not understand why the first four backtrace entries give no information about where the error occurred. I tried addr2line to resolve the addresses, but it does not give any information either. How can I find out what these frames are?
The symbolic backtraces printed by gfortran are not done by gdb, but rather by addr2line. The problem is that addr2line inspects the binary on disk and not the program image in memory. Thus for shared libraries, which are loaded into memory at some random offset (for security reasons), addr2line cannot translate the addresses into symbol names and thus the gfortran backtrace mechanism falls back to printing the addresses.
You can work around this by compiling statically, allowing addr2line to translate addresses in libgfortran, the gfortran runtime library. Usually the first few stack frames are from the libgfortran backtrace printing functionality, in any case.
I do not understand why the first four backtrace entries give no information about where the error occurred.
The stack trace you got is from some kind of internal Fortran error reporting mechanism, and not from GDB as your question implies. That mechanism is likely not handling shared libraries (note that all the "missing" frames are very far from application frames -- the missing frames are likely in a shared library).
Solution: run the program under GDB, and use where command. GDB knows how to read symbol info for shared libraries, and is likely to give you the missing info.
There are a few ways you can wind up with some stack frames that don't have useful information.
One way is if your program has a bug and trashes the stack. In this case I would suggest turning to valgrind to find the problem.
Another way is if the code in question was compiled without debuginfo. Sometimes you may still get some information here, but not always. In this case the solution is to recompile the code with -g.
A third way is if your program contains a just-in-time compiler and the execution stops in JITted code. I suspect this isn't your issue, given that you're working in FORTRAN.
One way to tell where the code may have come from is to use info shared or info proc mappings, and search through the list of addresses to see where the PC values from the offending frames fit in. (Yes, it's unfortunate to do this by hand.) If the PC fits into one of the maps listed, then you know where to look to fix the -g problem. If it doesn't fit anywhere, then most likely the stack is trashed.
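The manual lookup described above can be sketched in a few lines. The struct mapping type and both function names are made up for this illustration; the fields mirror the start address, end address and file offset columns that /proc/PID/maps (or info proc mappings) prints. The computed value is the position inside the backing file, which for a typical shared library is what you would then feed to a symbolizer together with that file, though strictly speaking such tools expect the link-time virtual address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of a process memory map (hypothetical type for this sketch). */
struct mapping {
    uint64_t start;        /* first address of the mapping */
    uint64_t end;          /* one past the last address */
    uint64_t file_offset;  /* offset of the mapping within the file */
    const char *path;      /* backing file */
};

/* Return the mapping that contains pc, or NULL if none does
 * (which hints at a trashed stack). */
const struct mapping *find_mapping(const struct mapping *maps, size_t count,
                                   uint64_t pc)
{
    size_t i;
    assert(maps != NULL || count == 0);
    for (i = 0; i < count; i++)
        if (pc >= maps[i].start && pc < maps[i].end)
            return &maps[i];
    return NULL;
}

/* Translate a runtime address into an offset within the backing file. */
uint64_t file_relative_offset(const struct mapping *m, uint64_t pc)
{
    assert(m != NULL && pc >= m->start && pc < m->end);
    return pc - m->start + m->file_offset;
}
```

For example, if a library were mapped at 0x7F2EE30D0000 (an invented base address), frame #0 at 0x7F2EE30E57D7 would correspond to offset 0x157D7 in that library.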

How Kernel stack is used in case of different processor mode in ARM architecture?

As I understand it, every process has a user stack and a kernel stack. Apart from that, there is a stack for every mode in the ARM architecture. So I want to know how the different stacks and stack pointers work in the ARM modes. Also, when will the kernel stack associated with the process be used?
... when will the kernel stack associated with the process be used?
When you make a system call. Say you want to get the IP address of an interface: the kernel, just like any other program, needs some stack space to prepare what you asked for. So a corresponding kernel stack is used once you switch to the kernel side of a system call.
How different stack and stack pointer works in ARM modes?
ARM defines a few hardware modes to handle different events in the system. For example, the CPU may unexpectedly hit an illegal (undefined) instruction. In that case execution switches to a different mode and the CPU needs to be told how to proceed. Since handling this gracefully usually requires some stack space, you need a separate stack for that mode. ARM provides a separate (banked) stack pointer register per mode, so when you switch to a different hardware mode you don't overwrite the previous mode's stack pointer.
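The banking described above can be modelled with a small table. This is only a host-side illustration, not ARM or kernel code: the type and function names are invented for the sketch, and it covers just the seven classic modes, where User and System share one stack pointer and FIQ, IRQ, Supervisor, Abort and Undefined each have their own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum arm_mode { MODE_USR, MODE_FIQ, MODE_IRQ, MODE_SVC,
                MODE_ABT, MODE_UND, MODE_SYS };

enum sp_bank { BANK_USR, BANK_FIQ, BANK_IRQ, BANK_SVC,
               BANK_ABT, BANK_UND, BANK_COUNT };

/* Toy CPU model: one stack pointer slot per bank plus the current mode. */
struct arm_cpu_model {
    uint32_t banked_sp[BANK_COUNT];
    enum arm_mode mode;
};

/* Map a mode to the stack pointer bank it uses. */
enum sp_bank sp_bank_for_mode(enum arm_mode mode)
{
    switch (mode) {
    case MODE_FIQ: return BANK_FIQ;
    case MODE_IRQ: return BANK_IRQ;
    case MODE_SVC: return BANK_SVC;
    case MODE_ABT: return BANK_ABT;
    case MODE_UND: return BANK_UND;
    case MODE_USR:
    case MODE_SYS:
    default:       return BANK_USR;  /* User and System share r13 */
    }
}

/* The stack pointer the CPU sees in its current mode. */
uint32_t *current_sp(struct arm_cpu_model *cpu)
{
    assert(cpu != NULL);
    return &cpu->banked_sp[sp_bank_for_mode(cpu->mode)];
}
```

Writing the stack pointer while in IRQ mode therefore leaves the Supervisor mode value untouched, which is what lets each mode be given its own stack during early boot.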
The kernel stack is kept separate from the process's user stack. The kernel uses it to keep track of its own function calls while it services the system calls invoked by processes. Since a system call handles kernel data structures, its stack cannot be kept on the process's user stack: the process could then access private kernel data structures, which would be harmful to the kernel.

STM32F103 TRACESWO debug configuration

I'm working with the following MCU: STM32F103RBT6, a 32-bit ARM Cortex-M3. The development board is the STM32-H103.
My project aims to give an approximate current consumption per executed instruction. For that I need to configure the SWO to generate packets containing the PC (program counter), all the generated interrupts, and the timestamp.
Can anyone help me?
Thanks.
TRACESWO is available to you on your board if you use SWD (Serial Wire Debug) rather than JTAG. Section 31 of the STM32F103 reference manual describes the debug support for the part.
The trace information available via TRACESWO is very limited and not instruction level. You require more expensive debug hardware to access the full instruction-level trace capability.
Full trace capability is a 6-wire interface in addition to the standard JTAG or SWD debug interface. The STM32-H103 board's debug connector does not provide these pins, though they may be available on the extension headers, since they are multiplexed with other functions.
TRACE SWO example,
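#include <stdint.h>
/* Device header (assumed here: stm32f10x.h from the ST Standard Peripheral
   Library); it pulls in the CMSIS core header that defines ITM_SendChar(). */
#include "stm32f10x.h"

/* Send length bytes from pcBuff over ITM stimulus port 0 (visible on SWO).
   Returns the number of bytes handed to the ITM. */
int SwdWrite(char *pcBuff, unsigned long length)
{
    int xBytesSent = 0;

    while (length)
    {
        ITM_SendChar((uint32_t)(*pcBuff));
        length--;
        pcBuff++;
        xBytesSent++;
    }
    /* The original returned xBytesSent++, where the increment had no effect. */
    return xBytesSent;
}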

Resources