Why does gfortran not give a symbolic backtrace? - debugging

After running my Fortran code compiled with gfortran using the -g option, I get the following error:
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7F2EE30E57D7
#1 0x7F2EE30E5DDE
#2 0x7F2EE2820D3F
#3 0x7F2EE2DEC913
#4 0x408A97 in __aerosols_MOD_moment_logn at aerosols.f90:45
#5 0x408A02 in __aerosols_MOD_set_aerosol at aerosols.f90:78 (discriminator 20)
#6 0x6D357B in __test_cases_2d_MOD_standard_2d_cases at test_cases_2d.f90:210
#7 0x67E9FC in __set_profiles_MOD_read_profiles_standard at set_profiles.f90:118
#8 0x463BF8 in __main_MOD_main_loop at main.f90:48
#9 0x401F05 in kid at KiD.f90:17
Floating point exception (core dumped)
I do not understand why the first four backtrace frames do not give any information about the error location. I tried addr2line on those addresses, but it does not give useful information either. How can I recover the missing frames?

The symbolic backtraces printed by gfortran are not done by gdb, but rather by addr2line. The problem is that addr2line inspects the binary on disk and not the program image in memory. Thus for shared libraries, which are loaded into memory at some random offset (for security reasons), addr2line cannot translate the addresses into symbol names and thus the gfortran backtrace mechanism falls back to printing the addresses.
You can work around this by compiling statically, allowing addr2line to translate addresses in libgfortran, the gfortran runtime library. Usually the first few stack frames are from the libgfortran backtrace printing functionality, in any case.
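A minimal sketch of such a static build, assuming the files named in the backtrace make up the whole program (adjust to your real build system):
gfortran -g -static aerosols.f90 test_cases_2d.f90 set_profiles.f90 main.f90 KiD.f90 -o kid
If fully static linking is not practical, -static-libgfortran (which links only the Fortran runtime statically) may already help, since frames #0 and #1 are most likely inside libgfortran.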

I do not understand why the first four backtrace frames do not give any information about the error location.
The stack trace you got is from some kind of internal Fortran error reporting mechanism, not from GDB as your question implies. That mechanism is likely not handling shared libraries (note that all the "missing" frames have addresses far away from the application frames -- they are most likely inside a shared library).
Solution: run the program under GDB and use the where command. GDB knows how to read symbol info for shared libraries and is likely to give you the missing info.
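A minimal session, assuming the executable is called kid as in the backtrace:
gdb ./kid
(gdb) run
(gdb) where
GDB stops the program when the SIGFPE is delivered, and where (or backtrace) then prints every frame with symbol names, including the frames that live in shared libraries.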

There are a few ways you can wind up with some stack frames that don't have useful information.
One way is if your program has a bug and trashes the stack. In this case I would suggest turning to valgrind to find the problem.
Another way is if the code in question was compiled without debuginfo. Sometimes you may still get some information here, but not always. In this case the solution is to recompile the code with -g.
A third way is if your program contains a just-in-time compiler and the execution stops in JITted code. I suspect this isn't your issue, given that you're working in FORTRAN.
One way to tell where the code may have come from is to use info shared or info proc mappings, and search through the list of addresses to see where the PC values from the offending frames fit in. (Yes, it's unfortunate to do this by hand.) If the PC fits into one of the mappings listed, then you know where to look to fix the -g problem. If it doesn't fit anywhere, then most likely the stack is trashed.
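For example, after the program has stopped at the crash (or after loading a core file), something along these lines shows which mapping a suspicious frame belongs to; the frame number here is illustrative:
(gdb) info shared
(gdb) info proc mappings
(gdb) frame 2
(gdb) info symbol $pc
info symbol translates the frame's PC into the nearest known symbol plus offset, which is a quick way to check whether a "missing" frame falls inside one of the listed mappings.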

Related

what are the two numbers for an instruction location in objdump of a kernel module? [duplicate]

Consider the following Linux kernel dump stack trace; e.g., you can trigger a panic from the kernel source code by calling panic("debugging a Linux kernel panic");:
[<001360ac>] (unwind_backtrace+0x0/0xf8) from [<00147b7c>] (warn_slowpath_common+0x50/0x60)
[<00147b7c>] (warn_slowpath_common+0x50/0x60) from [<00147c40>] (warn_slowpath_null+0x1c/0x24)
[<00147c40>] (warn_slowpath_null+0x1c/0x24) from [<0014de44>] (local_bh_enable_ip+0xa0/0xac)
[<0014de44>] (local_bh_enable_ip+0xa0/0xac) from [<0019594c>] (bdi_register+0xec/0x150)
In unwind_backtrace+0x0/0xf8 what does +0x0/0xf8 stand for?
How can I see the C code of unwind_backtrace+0x0/0xf8?
How to interpret the panic's content?
It's just an ordinary backtrace; the functions are listed in reverse call order (each one was called by the function listed after it, so bdi_register at the bottom is the outermost caller shown):
unwind_backtrace+0x0/0xf8
warn_slowpath_common+0x50/0x60
warn_slowpath_null+0x1c/0x24
local_bh_enable_ip+0xa0/0xac
bdi_register+0xec/0x150
bdi_register+0xec/0x150 is the symbol + offset/length; there's more information about that, and about how you can debug a kernel oops, in Understanding a Kernel Oops. There's also this excellent tutorial on Debugging the Kernel.
Note: as suggested below by Eugene, you may want to try addr2line first; it still needs an image with debugging symbols though. For example:
addr2line -e vmlinux_with_debug_info 0019594c(+offset)
Here are two alternatives for addr2line. Assuming you have the proper target's toolchain, you can do one of the following:
Use objdump:
Locate your vmlinux or the .ko file under the kernel root directory, then disassemble the object file:
objdump -dS vmlinux > /tmp/kernel.s
Open the generated assembly file, /tmp/kernel.s, with a text editor such as vim. Go to unwind_backtrace+0x0/0xf8, i.e. search for the address of unwind_backtrace + the offset. You have now located the problematic part in your source code.
Use gdb:
IMO, an even more elegant option is to use the one and only gdb. Assuming you have the suitable toolchain on your host machine:
Run gdb <path-to-vmlinux>.
Execute in gdb's prompt: list *(unwind_backtrace+0x10).
For additional information, you may check out the following resources:
Kernel Debugging Tricks.
Debugging The Linux Kernel Using Gdb
In unwind_backtrace+0x0/0xf8, what does the +0x0/0xf8 stand for?
The first number (+0x0) is the offset from the beginning of the function (unwind_backtrace in this case). The second number (0xf8) is the total length of the function. Given these two pieces of information, if you already have a hunch about where the fault occurred this might be enough to confirm your suspicion (you can tell (roughly) how far along in the function you were).
To get the exact source line of the corresponding instruction (generally better than hunches), use addr2line or the other methods in other answers.
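Applied to the trace above, and assuming you have a vmlinux built with debug info, a single gdb command gives the exact source line for the outermost frame:
gdb vmlinux
(gdb) list *(bdi_register+0xec)
The same works with addr2line if you first resolve bdi_register to an absolute address (e.g. from System.map) and add the 0xec offset to it.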

What is RUST_BACKTRACE supposed to tell me?

My program is panicking, so I followed its advice to run with RUST_BACKTRACE=1, and I got this (just a small snippet):
1: 0x800c05b5 - std::sys::imp::backtrace::tracing::imp::write::hf33ae72d0baa11ed
at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:42
2: 0x800c22ed - std::panicking::default_hook::{{closure}}::h59672b733cc6a455
at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/panicking.rs:351
A panic stops the whole program, so how can I figure out which line it's panicking on?
Is this line telling me there is a problem at line 42 and line 351?
The whole backtrace is in this image; I felt it would be too messy to copy and paste it here.
I've never heard of a stack trace or a back trace. I'm compiling with warnings, but I don't know what debugging symbols are.
What is a stack trace?
If your program panics, you encountered a bug and would like to fix it; a stack trace wants to help you here. When the panic happens, you would like to know the cause of the panic (the function in which the panic was triggered). But the function directly triggering the panic is usually not enough to really see what's going on. Therefore we also print the function that called the previous function... and so on. We trace back all function calls leading to the panic up to main() which is (pretty much) the first function being called.
What are debug symbols?
When the compiler generates the machine code, it pretty much only needs to emit instructions for the CPU. The problem is that it's virtually impossible to quickly see which Rust function a set of instructions came from. Therefore the compiler can insert additional information into the executable that is ignored by the CPU, but is used by debugging tools.
One important part is file locations: the compiler annotates which instruction came from which file and which line. This also means that we can later see where a specific function is defined. If we don't have debug symbols, we can't.
In your stack trace you can see a few file locations:
1: 0x800c05b5 - std::sys::imp::backtrace::tracing::imp::write::hf33ae72d0baa11ed
at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:42
The Rust standard library is shipped with debug symbols. As such, we can see where the function is defined (gcc_s.rs line 42).
If you compile in debug mode (rustc or cargo build), debug symbols are activated by default. If you, however, compile in release mode (rustc -O or cargo build --release), debug symbols are disabled by default as they increase the executable size and... usually aren't important for the end user. You can tweak whether or not you want debug symbols in your Cargo.toml in a specific profile section with the debug key.
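For example, to keep debug symbols in an optimized build, you can add this to your Cargo.toml (shown here for the release profile):
[profile.release]
debug = true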
What are all these strange functions?!
When you first look at a stack trace you might be confused by all the strange function names you're seeing. Don't worry, this is normal! You are interested in what part of your code triggered the panic, but the stack trace shows all functions somehow involved. In your example, you can ignore the first 9 entries: those are just functions handling the panic and generating the exact message you are seeing.
Entry 10 is still not your code, but might be interesting as well: the panic was triggered in the index() function of Vec<T> which is called when you use the [] operator. And finally, entry 11 shows a function you defined. But you might have noticed that this entry is missing a file location... the above section describes how to fix that.
What to do with a stack trace? (tl;dr)
Activate debug symbols if you haven't already (e.g. just compile in debug mode).
Ignore any functions from std and core at the top of the stack trace.
Look at the first function you defined, find the corresponding location in your file and fix the bug.
If you haven't already, change all camelCase function and method names to snake_case to stick to the community-wide style guide.

Are hashtagged error messages useful in debugging?

I sometimes encounter error messages while executing a Fortran/C program. For example, after running my present Fortran program I got the following message in my screen output:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x101f584f2
#1 0x101f58cae
#2 0x7fff88661f19
#3 0x101e7984c
#4 0x101e7a8dd
#5 0x101e7b16f
#6 0x101e7cab3
Segmentation fault: 11
I am worried whether the hashtagged symbols mean anything to a debugger. Can one exploit those symbols using gdb or valgrind? If yes, how do I backtrace?
PS. There is a similar post where tmyklebu says, "You may (or may not) be able to feed them through addr2line to get function names and line numbers out of them." But they don't say how to do that.
I am worried whether the hashtagged symbols mean anything to a debugger.
These messages are not hashtagged. Here the # symbol simply stands for the (frame) number.
Besides, there is no reason to feed this output to the debugger. If you ran the program under a debugger and then used the (gdb) where command, you would get similar output with additional info (symbol names, possibly file/line info). But since you didn't, you now need to use tools other than a debugger (e.g. addr2line).
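For completeness, this is roughly how you would feed those addresses to addr2line, assuming the crashing binary is ./a.out and was built with -g (the -f flag also prints function names). Frames that live in shared libraries, or in a position-independent executable loaded at a randomized base, will still come back as ??:
addr2line -f -e ./a.out 0x101e7984c 0x101e7a8dd 0x101e7b16f 0x101e7cab3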

How to debug stack-overwriting errors with Valgrind?

I just spent some time chasing down a bug that boiled down to the following. Code was erroneously overwriting the stack, and I think it wrote over the return address of the function call. Following the return, the program would crash and the stack would be corrupted. Running the program in valgrind would return an error such as:
vex x86->IR: unhandled instruction bytes: 0xEA 0x3 0x0 0x0
==9222== valgrind: Unrecognised instruction at address 0x4e925a8.
I figure this is because the return jumped to a random location containing stuff that was not valid x86 opcodes. (Though I am somewhat suspicious that this address 0x4e925a8 happened to be in an executable page; I imagine valgrind would throw a different error if this weren't the case.)
I am certain that the problem was of the stack-overwriting type, and I've since fixed it. Now I am trying to think how I could catch errors like this more effectively. Obviously, valgrind can't warn me if I rewrite data on the stack, but maybe it can catch when someone writes over a return address on the stack. In principle, it can detect when something like 'push EIP' happens (so it can flag where the return addresses are on the stack).
I was wondering if anyone knows if Valgrind, or anything else can do that? If not, can you comment on other suggestions regarding debugging errors of this type efficiently.
If the problem happens deterministically enough that you can point out a particular function that has its stack smashed (in one repeatable test case), you could, in gdb:
Break at entry to that function
Find where the return address is stored (on x86 it's at a fixed offset from %ebp, which holds the value %esp had at function entry; I am not sure of the exact offset).
Add watchpoint to that address. You have to issue the watch command with calculated number, not an expression, because with an expression gdb would try to re-evaluate it after each instruction instead of setting up a trap and that would be extremely slow.
Let the function run to completion.
I have not yet worked with the python support available in gdb7, but it should allow automating this.
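A rough sketch of those steps for a hypothetical function suspect_fn, assuming 32-bit x86 where the saved return address sits at %ebp + 4 once the prologue has run:
(gdb) break suspect_fn
(gdb) run
(gdb) print/x $ebp + 4
(gdb) watch *(int *) 0xbffff73c
(gdb) continue
The print gives the address where the return address is stored; plug that literal number (0xbffff73c here is made up) into the watch command. The watchpoint then fires the moment anything overwrites the saved return address.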
In general, Valgrind's detection of overflows in stack and global variables is weak to non-existent. Arguably, Valgrind is the wrong tool for that job.
If you are on one of the supported platforms, building with -fmudflap and linking with -lmudflap will give you much better results for these kinds of errors. Additional docs here.
Update:
Much has changed in the 6 years since this answer. On Linux, the tool to find stack (and heap) overflows is AddressSanitizer, supported by recent versions of GCC and Clang.
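A typical invocation, using a made-up file name (the flags are the standard AddressSanitizer ones for GCC and Clang):
gcc -g -fsanitize=address -fno-omit-frame-pointer myprog.c -o myprog
./myprog
On a stack buffer overflow the sanitizer aborts immediately with a report pointing at the offending source line, instead of letting the program run on with a corrupted stack.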

How does the gcc option -fstack-check exactly work?

My program crashed when I added the options -fstack-check and -fstack-protector. __stack_chk_fail is called in the backtrace.
So how can I find out where the problem is? What does -fstack-check really check?
The gcc documentation seems too huge to find the answer in.
After checking the assembly output:
I think -fstack-check adds code that writes 0 at an offset from the stack pointer, to test whether the program touches an invalid address; the program crashes if it does,
e.g. mov $0x0,-0x928(%esp)
-fstack-check: if the two feature macros STACK_CHECK_BUILTIN and STACK_CHECK_STATIC_BUILTIN are left at their default of 0, it just inserts a NULL byte every 4kb (one page) as the stack grows.
By default only one probe, but when the stack can grow by more than one page (which is the most dangerous case), one every 4KB. Linux > 2.6 has only one small page of gap between the stack and the heap, which can lead to stack-gap attacks, known since 2005.
See What exception is raised in C by GCC -fstack-check option for assembly.
It is enabled in gcc at least since 2.95.3, in clang since 3.6.
__stack_chk_fail is called by the code that -fstack-protector inserts; it verifies an inserted stack canary value which might be overwritten by a simple stack overflow, e.g. by recursion.
"`-fstack-protector' emits extra code to check for buffer overflows, such as stack
smashing attacks. This is done by adding a guard variable to
functions with vulnerable objects. This includes functions that
call alloca, and functions with buffers larger than 8 bytes. The
guards are initialized when a function is entered and then checked
when the function exits. If a guard check fails, an error message
is printed and the program exits"
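To find out where the canary check failed, one straightforward approach (a sketch, assuming your program was built with -g) is to break on the failure handler itself and look one frame up:
gdb ./myprog
(gdb) break __stack_chk_fail
(gdb) run
(gdb) backtrace
Frame #1 of that backtrace is the function whose canary was clobbered, which is usually the function containing the overflowing buffer.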
GCC Options That Control Optimization
GCC extension for protecting applications from stack-smashing attacks
Smashing The Stack For Fun And Profit
I hope this will give some clue.

Resources