I was immediately suspicious of the crash. A Floating Point Exception in a method whose only arithmetic was a "divide by sizeof(short)".
I looked at the stack crawl and saw that the offset into the method was "+91". Then I examined a disassembly of that method and confirmed that the Program Counter was in fact foobar at the time of the crash: the disassembly showed instructions at +90 and +93, but nothing at +91.
This method consists of 32-bit x86 instructions and is called very frequently over the life of the application. The crash has been reported 3 times.
How does this happen? How do I set a debugging trap for the situation?
Generally, when you fault in the middle of an instruction it's due to bad flow control (i.e., a broken jump, call, or ret), an overflow, a bad dereference, or debug symbols that are out of sync, making the stack trace show incorrect info. Your first step is to reproduce the error reliably every time, otherwise you'll have trouble trapping it. From there I'd run it in a debugger, force the conditions that make it explode, then examine the (call) stack and registers to see whether they hold valid values.
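As a hypothetical illustration of the bad-flow-control case (my own sketch, not the asker's code), a function pointer that gets nudged off an instruction boundary produces exactly this "PC between instructions" symptom:

/* Hypothetical sketch: a corrupted function pointer lands execution one
 * byte past a real instruction boundary. The CPU decodes whatever bytes
 * start there as an unrelated instruction stream, so the crash offset
 * reported later falls "between" the instructions in the disassembly,
 * and the signal (SIGILL, SIGSEGV, even SIGFPE) can be misleading. */
static int halve(int n)
{
    return n / (int)sizeof(short);
}

int main(void)
{
    int (*fn)(int) = halve;

    /* Simulate the corruption; deliberately non-portable and undefined. */
    fn = (int (*)(int))((char *)fn + 1);

    return fn(42);   /* may crash at a bogus offset, or compute garbage */
}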
Related
I'm debugging a program which reports:
Thread 1 "test.out" received signal SIGSEGV, Segmentation fault.
I then ran the program under GDB and found that it jumps into a SignalHandler function right after a call instruction, call 0x401950.
I checked the call destination and the registers rax, rdi and rsi (the inputs to the call), but found nothing strange.
I haven't met this situation before. My guess is that a soft interrupt was raised because of some exception and then scheduled after the instruction, so the actual problem may have occurred earlier.
Now I need to find out where the exception is raised so that I can fix it, but I don't have any clue how.
So I came here to ask if anyone could help me with that.
One big sorry for not being able to show the code, since it is a company asset.
Thanks to anyone who helps!
I then ran the program under GDB and found that it jumps into a SignalHandler function right after a call instruction, call 0x401950.
You didn't say which processor and OS you are using. Guessing Linux and x86_64, you have a stack overflow.
The CALL instruction pushes the return address onto the stack, and this push will generate SIGSEGV if your stack is exhausted.
You can confirm this guess by using (gdb) where (which is likely to show very deep recursion, though other reasons for stack exhaustion are also possible), and by looking at the value of RSP (which should be just below page boundary).
Depending on how the stack size is set up, using ulimit -s unlimited before invoking the program may work around this crash (though you really ought to fix the root cause by some other mechanism).
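If it helps to reproduce the symptom in isolation, here is a minimal sketch (my own example, not the asker's code) that faults on the push performed by CALL once the stack is gone:

/* Minimal stack-exhaustion sketch: each call consumes a page-sized
 * frame plus a pushed return address. Once the recursion hits the
 * stack's guard page, the push done by CALL raises SIGSEGV, and gdb
 * stops "on" the call instruction itself. */
#include <stdio.h>

static long deep(long n)
{
    volatile char pad[4096];        /* burn roughly a page per frame */
    pad[0] = (char)n;
    return pad[0] + deep(n + 1);    /* unbounded, non-tail recursion */
}

int main(void)
{
    printf("%ld\n", deep(0));
    return 0;
}

Under gdb, where then shows a very deep pile of identical frames, and info registers rsp shows a value just below a page boundary, which is the confirmation described above.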
I'm working on Ethernet code on an STM32F429 ARM Cortex-M4 device and running into a situation where I'm getting a MemManage exception whose cause is proving very difficult to track down. From what I understand, the MemManage exception is caused by some violation of the MPU, such as trying to execute code in the protected register space at 0xE0000000 and above. The Cortex-M4 documentation I've read indicates the reason for the exception should be captured in the MMFSR register bits, and that the address of the error may be captured in the MMFAR register in certain circumstances.
What's frustrating me is that the MemManage exception is being generated with all bits in the MMFSR register zero. I'm executing a breakpoint instruction just as the exception handler is entered, so I'm pretty sure the MMFSR is not being accidentally cleared. Furthermore, nowhere in my code am I even using the MPU, and it should be in its default state on power-up. Finally, I can purposely create a MemManage exception elsewhere in my code, and the MMFSR bits correctly identify the issue I triggered. Unwinding the stack from the exception, the only unusual thing about the PC is that it's in the middle of code that is called early on to initialize the RTOS, but that should not be executing later when the exception occurs. I'm trying to determine how the PC got to the value it did, but it's proving difficult to isolate.
Does someone have some ideas as to why the MemManage exception might occur without the MMFSR bits being set? Or suggestions for techniques to better understand the circumstances in my code just before the exception occurs?
My instinct (not necessarily accurate!) is that something's not right here. There's no reason that the MemManage exception should not accurately log the reason for its invocation, and your mention of the PC having been somewhere it shouldn't have been suggests that whatever's wrong went wrong well before the exception entry. On that basis I think you'll learn more by identifying where the exception takes place than by trying to deduce the cause from the exception type.
I'd start by checking the value in LR at the point you've identified that the exception takes place. This won't necessarily tell you where the PC corruption took place, but it'll tell you where the last BL was issued prior to the problem, so it might help put bounds on where the problem might be. You might also find it helpful to check the exception state bits in the PSR ([8-0]) to confirm the type of the fault. (MemManage is 0x004.)
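One way to get at the LR, PC and PSR values mentioned above is to park in a small fault handler that grabs the stacked exception frame. The sketch below assumes GCC/Clang inline assembly; MemManage_Handler is the usual CMSIS startup-file name, and fault_capture is my own hypothetical helper.

#include <stdint.h>

/* Capture helper (hypothetical name). 'stacked' points at the exception
 * frame pushed by the hardware: r0-r3, r12, lr, pc, xPSR. */
void fault_capture(uint32_t *stacked)
{
    volatile uint32_t lr  = stacked[5];  /* return address of the last BL in the faulting context */
    volatile uint32_t pc  = stacked[6];  /* address the fault was taken from */
    volatile uint32_t psr = stacked[7];  /* xPSR of the interrupted code */

    volatile uint32_t ipsr;
    __asm volatile("mrs %0, ipsr" : "=r"(ipsr));
    /* ipsr == 4 means we really arrived here via a MemManage exception;
     * ipsr == 0 means this "handler" was reached by an ordinary call,
     * in which case the stacked[] values above are not an exception frame. */

    (void)lr; (void)pc; (void)psr; (void)ipsr;
    for (;;) { }                         /* sit here and inspect in the debugger */
}

__attribute__((naked)) void MemManage_Handler(void)
{
    __asm volatile(
        "tst lr, #4          \n"   /* EXC_RETURN bit 2: was MSP or PSP in use? */
        "ite eq              \n"
        "mrseq r0, msp       \n"
        "mrsne r0, psp       \n"
        "b fault_capture     \n");
}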
I finally tracked down the issue. It was code executing a callback function within a structure, but the structure pointer was a null pointer. The offset of the callback function within the structure corresponded to the offset of the MemManage exception handler in the vector table from address zero. Thus, the MemManage handler was not being called via an exception, but rather via a simple function call. This was why the stack looked confusing to me -- I was expecting to see an exception stack frame rather than a simple function call stack frame.
The clue for me was the exception state bits in the PSR ([8-0]) being all zeros (thanks to the suggestion from cooperised), which indicates my MemManage "exception" was not actually being entered as an exception. I then backtracked from there to understand what code was responsible for calling the handler as a function. My flawed assumption was that the only way the MemManage handler could be reached was via an exception -- with the PSR value and the non-exception stack frame being the major clues that I was ignoring.
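For anyone who hits the same thing, here is a stripped-down reconstruction of that failure mode (hypothetical structure and names, not the original code):

/* A callback invoked through a NULL structure pointer. The function
 * pointer is loaded from address 0 + offsetof(struct eth_driver, on_rx).
 * On a Cortex-M part, byte offset 0x10 of the vector table is vector #4,
 * i.e. the address of MemManage_Handler, so the handler ends up being
 * called as a plain function, with no exception frame and IPSR == 0. */
#include <stdint.h>

struct eth_driver {
    uint32_t id;
    uint32_t flags[3];
    void   (*on_rx)(void);   /* offset 16 (0x10): same slot as the MemManage vector */
};

void handle_rx(struct eth_driver *drv)
{
    /* Bug: no NULL check. On many Cortex-M devices address 0 is readable
     * (flash is aliased there), so the load itself does not fault. */
    drv->on_rx();
}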
Double-check that the exception you're getting is actually MemManage and not something else (e.g. if you're using a shared handler for several exception types). Another possibility is that you're getting an imprecise fault and the information about the original fault has been discarded. From the FreeRTOS debugging guide:
ARM Cortex-M faults can be precise or imprecise. If the IMPRECISERR bit (bit 2) is set in the BusFault Status Register (or BFSR, which is byte accessible at address 0xE000ED29) then the fault is imprecise.
...
In the above example, turning off write buffering by setting the DISDEFWBUF bit (bit 1) in the Auxiliary Control Register (or ACTLR) will result in the imprecise fault becoming a precise fault, which makes the fault easier to debug, albeit at the cost of slower program execution.
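If you want to try the DISDEFWBUF suggestion, here is a minimal sketch using raw register access; the 0xE000E008 address and bit position are the documented Cortex-M3/M4 values, but check your device's CMSIS headers, where the register appears as SCnSCB->ACTLR.

#include <stdint.h>

/* Auxiliary Control Register on Cortex-M3/M4. */
#define ACTLR            (*(volatile uint32_t *)0xE000E008u)
#define ACTLR_DISDEFWBUF (1u << 1)

/* Call early in startup while debugging: stores become slower, but an
 * imprecise bus fault turns into a precise one, so the reported PC is
 * the instruction that actually faulted. */
void debug_make_bus_faults_precise(void)
{
    ACTLR |= ACTLR_DISDEFWBUF;
}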
I am currently debugging some assembly code using GDB and am stuck on the following problem. Somehow or other, I've ended up at a bogus instruction address, probably because either I called a bogus function pointer, or I mangled the return address on the parent stack frame.
GDB is fantastic and stops the program exactly when it detects this has happened. However, what it doesn't tell me is the instruction address that sent me to this bogus address. So now I am stuck. I know that I am now at a bogus address, but I have no way of knowing how I got here. What I think I need is a list of the last n values that $rip has taken on. But I cannot find any way of doing that in GDB's documentation and am pretty sure it is not possible.
So I would appreciate it if anyone else had any great tips on low-level debugging they could share. Thanks!
-Patrick
I think GDB's tracepoints might help:
https://sourceware.org/gdb/onlinedocs/gdb/Tracepoints.html
When your code crashes, a trap is raised and the instruction pointer jumps through the trap table; which entry is used depends on the trap that was raised.
You want to determine which instruction causes this trap, so you can execute the command backtrace (bt) to show the last functions executed before the jump to the trap table. When you have identified the function, execute it step by step to identify the instruction that causes the error.
If you are using a target with gdb in remote mode, you need to give gdb a full symbol table so that it is aware of all of the code's symbols.
I just spent some time chasing down a bug that boiled down to the following. Code was erroneously overwriting the stack, and I think it wrote over the return address of the function call. Following the return, the program would crash and stack would be corrupted. Running the program in valgrind would return an error such as:
vex x86->IR: unhandled instruction bytes: 0xEA 0x3 0x0 0x0
==9222== valgrind: Unrecognised instruction at address 0x4e925a8.
I figure this is because the return jumped to a random location containing bytes that were not valid x86 opcodes. (Though I am somewhat suspicious that this address 0x4e925a8 happened to be in an executable page; I imagine valgrind would throw a different error if that weren't the case.)
I am certain that the problem was of the stack-overwriting type, and I've since fixed it. Now I am trying to think how I could catch errors like this more effectively. Obviously, valgrind can't warn me if I rewrite data on the stack, but maybe it can catch when someone writes over a return address on the stack. In principle, it can detect when something like 'push EIP' happens (so it can flag where the return addresses are on the stack).
I was wondering if anyone knows if Valgrind, or anything else can do that? If not, can you comment on other suggestions regarding debugging errors of this type efficiently.
If the problem happens deterministically enough that you can point out a particular function whose stack gets smashed (in one repeatable test case), you could, in gdb (a worked sketch follows below):
Break at entry to that function
Find where the return address is stored. On x86 it is at a fixed offset from %ebp (which holds the value %esp had at function entry); with the standard prologue the saved return address sits at %ebp + 4.
Add a watchpoint on that address. You have to issue the watch command with the calculated number, not an expression, because with an expression gdb would try to re-evaluate it after each instruction instead of setting up a hardware trap, and that would be extremely slow.
Let the function run to completion.
I have not yet worked with the python support available in gdb7, but it should allow automating this.
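A worked sketch of that recipe on a toy victim function (my own hypothetical code and example addresses; use whatever gdb actually prints for your binary):

/* Toy victim: the memcpy overruns `buf` and clobbers the saved return
 * address, so a watchpoint on that stack slot fires at the exact
 * instruction doing the smashing.
 *
 * Example gdb session (32-bit x86, frame pointer kept):
 *   (gdb) break victim
 *   (gdb) run
 *   (gdb) print $ebp + 4            # slot holding the return address
 *   $1 = (void *) 0xffffd00c        # example value only
 *   (gdb) watch *(int *) 0xffffd00c
 *   (gdb) continue                  # stops inside memcpy, at the smash
 */
#include <string.h>

void victim(const char *src, size_t n)
{
    char buf[16];
    memcpy(buf, src, n);             /* bug: n never checked against sizeof buf */
}

int main(void)
{
    char big[64];
    memset(big, 'A', sizeof big);
    victim(big, sizeof big);         /* smashes victim's return address */
    return 0;
}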
In general, Valgrind's detection of overflows in stack and global variables is weak to non-existent. Arguably, Valgrind is the wrong tool for that job.
If you are on one of the supported platforms, building with -fmudflap and linking with -lmudflap will give you much better results for these kinds of errors; see the mudflap documentation for additional details.
Update:
Much has changed in the 6 years since this answer. On Linux, the tool to find stack (and heap) overflows is AddressSanitizer, supported by recent versions of GCC and Clang.
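For completeness, a minimal sketch of what that looks like (my example, not code from the question):

/* stack_smash.c: build with
 *     gcc -g -fsanitize=address -o stack_smash stack_smash.c
 * Instead of a mysterious crash at a bogus return address later on,
 * AddressSanitizer reports a stack-buffer-overflow at the offending
 * write, with a full stack trace. */
#include <string.h>

int main(void)
{
    char buf[16];
    memset(buf, 'A', sizeof buf + 1);   /* one byte past the end of buf */
    return buf[0];
}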
How can I recognize that the call stack shown by the debugger when my program crashes may be wrong and misleading? For example, when the debugger says the following frames may be missing or incorrect, what does that actually mean? Also, what does the + number after the function name in the call stack mean:
kernel32!LoadLibrary + 0x100 bytes
Should this number be important to me, and is it true that if this number is big the call stack may be incorrect?
Sorry if I am asking something trivial and obvious.
Thank you all
Generally, you can trust your callstack to be correct.
However, if you re-throw exceptions explicitly instead of allowing them to bubble up the callstack naturally, the actual error can be hidden from the stack trace.
To start with the second one: kernel32!LoadLibrary + 0x100 bytes means that the address in that frame lies 0x100 bytes past the start of LoadLibrary, the nearest symbol the debugger could find; apparently there was no symbolic information identifying the exact location. This in itself is no reason for the call stack to be corrupted.
A call stack may be corrupted if functions overwrite values on the stack (e.g., by a buffer overflow). This would likely show up as something like '0x41445249' (if it were my name doing the overwriting) as a call address, i.e. something outside your program's memory ranges.
A way to diagnose the cause of your crash is to set breakpoints on the functions identified by the call stack, or to use your debugger to backtrace (depending on debugger and system). It is interesting to find out what arguments were passed in the calls; pointers are generally a good place to start (NULL pointers, uninitialized pointers). Good luck.