How to debug stack-overwriting errors with Valgrind? - stack-overflow

I just spent some time chasing down a bug that boiled down to the following. Code was erroneously overwriting the stack, and I think it wrote over the return address of the function call. Following the return, the program would crash and the stack would be corrupted. Running the program in Valgrind would return an error such as:
vex x86->IR: unhandled instruction bytes: 0xEA 0x3 0x0 0x0
==9222== valgrind: Unrecognised instruction at address 0x4e925a8.
I figure this is because the return jumped to a random location containing bytes that were not valid x86 opcodes. (Though I am somewhat suspicious that the address 0x4e925a8 happened to be in an executable page. I imagine Valgrind would throw a different error if this weren't the case.)
I am certain that the problem was of the stack-overwriting type, and I've since fixed it. Now I am trying to think how I could catch errors like this more effectively. Obviously, valgrind can't warn me if I rewrite data on the stack, but maybe it can catch when someone writes over a return address on the stack. In principle, it can detect when something like 'push EIP' happens (so it can flag where the return addresses are on the stack).
I was wondering if anyone knows whether Valgrind, or anything else, can do that. If not, can you suggest other ways to debug errors of this type efficiently?

If the problem happens deterministically enough that you can point to a particular function that has its stack smashed (in one repeatable test case), you could, in gdb:
1. Break at the entry to that function.
2. Find where the return address is stored. On x86 it is relative to %ebp, which holds the value %esp had right after the prologue pushed the caller's %ebp. With that standard prologue the return address is at 4(%ebp); at the very first instruction of the function, before the prologue runs, it is at (%esp).
3. Add a watchpoint on that address. You have to issue the watch command with the calculated number, not an expression, because with an expression gdb would try to re-evaluate it after each instruction instead of setting up a trap, and that would be extremely slow.
4. Let the function run to completion.
I have not yet worked with the python support available in gdb7, but it should allow automating this.
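The address arithmetic in the steps above could be sketched as follows. This is only an illustration, assuming 32-bit x86 and a function that uses the usual `push %ebp; mov %esp,%ebp` prologue; the helper names `return_address_slot` and `watch_command` are made up for this example and are not part of gdb.

```python
def return_address_slot(frame_pointer: int, word_size: int = 4) -> int:
    """Address holding the saved return address.

    Assumes the standard frame-pointer prologue has already run, so the
    return address sits one machine word above the saved frame pointer.
    """
    return frame_pointer + word_size


def watch_command(frame_pointer: int, word_size: int = 4) -> str:
    """Build a gdb `watch` command on a literal address.

    A literal address (rather than an expression involving $ebp) lets gdb
    set up a trap instead of re-evaluating the expression at every step.
    """
    slot = return_address_slot(frame_pointer, word_size)
    ctype = "unsigned int" if word_size == 4 else "unsigned long"
    return f"watch *({ctype} *) {slot:#x}"


if __name__ == "__main__":
    # Hypothetical %ebp value read at the breakpoint.
    print(watch_command(0xBFFFF4A8))
```

Inside gdb's Python interpreter, the resulting string could presumably be passed to `gdb.execute()` after reading the frame pointer at the breakpoint; outside gdb the helpers only do the arithmetic.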

In general, Valgrind's detection of overflows in stack and global variables is weak to non-existent. Arguably, Valgrind is the wrong tool for that job.
If you are on one of the supported platforms, building with -fmudflap and linking with -lmudflap will give you much better results for these kinds of errors. Additional docs here.
Update:
Much has changed in the 6 years since this answer. On Linux, the tool to find stack (and heap) overflows is AddressSanitizer, supported by recent versions of GCC and Clang.

Related

Jump into `SignalHandler` right after a call instruction

I'm debugging a program which reports:
Thread 1 "test.out" received signal SIGSEGV, Segmentation fault.
I then ran the program under gdb and found that it jumps into a SignalHandler function right after a call instruction, call 0x401950.
I checked the call destination, rax, rdi and rsi (the inputs of the call), but found nothing strange.
I haven't met this situation before. I guess a soft interrupt is created because of some exception and is then scheduled after the instruction, so the actual problem may occur earlier.
Now I need to find out where the exception is raised so that I can fix it, but I do not have any clue how.
Therefore I came to ask if anyone could help me with that.
Sorry for not showing the code; it is a company asset.
Thanks to anyone who helps!
I then gdbed the program and found out that the program jump into a SignalHandler function right after a call instruction call 0x401950.
You didn't say which processor and OS you are using. Guessing Linux and x86_64, you have a stack overflow.
The CALL instruction pushes the return address onto the stack, and this operation will generate SIGSEGV if your stack is exhausted.
You can confirm this guess by using (gdb) where (which is likely to show very deep recursion, though other reasons for stack exhaustion are also possible), and by looking at the value of RSP (which should be just below a page boundary).
Depending on how the stack size is set up, using ulimit -s unlimited before invoking the program may work around this crash (though you really ought to fix the root cause by some other mechanism).
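The RSP check mentioned above can be sketched as a small calculation. This is a rough illustration, assuming 4096-byte pages and the 8-byte return address that CALL pushes in 64-bit mode; the function names are invented for this example.

```python
PAGE_SIZE = 4096  # assumption: typical page size on x86-64 Linux
WORD_SIZE = 8     # bytes pushed by CALL in 64-bit mode


def offset_in_page(rsp: int) -> int:
    """How many bytes RSP sits above the start of its page."""
    return rsp % PAGE_SIZE


def push_leaves_page(rsp: int) -> bool:
    """True if the word pushed by CALL would start in a lower page than RSP's.

    When the stack is exhausted, that lower page is typically unmapped or a
    guard page, so the push is what faults.
    """
    return (rsp - WORD_SIZE) // PAGE_SIZE != rsp // PAGE_SIZE


if __name__ == "__main__":
    rsp = 0x7FFC12345000  # hypothetical value from `info registers rsp`
    print(hex(offset_in_page(rsp)), push_leaves_page(rsp))
```

A small `offset_in_page` together with a very deep backtrace is consistent with stack exhaustion, though it does not prove it.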

When we get a runtime error in a Swift project, why does Xcode send us to thread output in assembly language? What's the point?

As you know, when something goes wrong while running a Swift project in Xcode, we are sent to the thread section of the debug navigator and are faced with some assembly code like this:
I am wondering whether there is any reference, tutorial or tool for understanding this code. There should be a reason that we are sent to it.
Let me be clear: I know how to fix the errors, but it bothers me when I do not understand something like this. I want to know what this code is and how we can use it, or at least understand it.
Thanks :)
Original question: what language is that? That's AT&T syntax assembly language for x86-64. See https://stackoverflow.com/tags/x86/info for manuals from Intel and other resources, and https://stackoverflow.com/tags/att/info for how AT&T syntax differs from the Intel syntax used in most manuals. (I think the x86 tag wiki has a few AT&T syntax tutorials.) Most AT&T-syntax disassemblers have an Intel-syntax mode, too, so you can use that if you want asm that matches Intel's manuals.
What's the point?
The point is so you can debug your program if you know asm. Or you can show the asm to someone who does understand it, or include it in a bug report.
Did you compile without debug symbols? Or did it crash in library code without symbols? It's normal for debuggers to show you asm if it can't show you source, or if you ask for asm.
If you have debug symbols for your own code, you can at least backtrace into parent functions for which you do have source. (Unless the stack is corrupted.)
Did your program fault on that instruction highlighted in pink? That's a bit odd, since it's loading from static data (a RIP-relative load means the address is a link-time constant).
Did you maybe munmap or mprotect that page of your program's data or text segment so a load would fault? Normally you only get faults when an addressing mode involves a pointer.
(The call *0x1234(%rip) right before it is calling through a function pointer, though. The function pointer is stored in memory, and code-fetch after the call executes would fault if it pointed to an unmapped or non-executable page.) But your first image shows you got a SIGABRT, not SIGSEGV, so it's more likely that the program deliberately aborted after failing an assertion.
I believe majority of swift coders don't know asm
There's nothing more useful a debugger can do without debug symbols and source files.
Also keep in mind that the majority of debugger authors do know asm, so for them it is an obviously-useful feature / behaviour. They know that many people won't be able to benefit from it, but that some will.
Asm is what's really running on the machine. Without asm, you couldn't find wrong-code compiler bugs, etc. etc. As far as software bugs, there is no lower level than asm, so it's not some arbitrary choice of some lower-level layer to stop at.
(Unless there's also a bug in your disassembler or debugger, in which case you need to check the hex machine code.)

How do symbols solve walking the stack with FPO in x86 debugging?

In this answer: https://stackoverflow.com/a/8646611/192359 , it is explained that when debugging x86 code, symbols allow the debugger to display the callstack even when FPO (Frame Pointer Omission) is used.
The given explanation is:
On the x86 PDBs contain FPO information, which allows the debugger to reliably unwind a call stack.
My question is what's this information? As far as I understand, just knowing whether a function has FPO or not does not help you finding the original value of the stack pointer, since that depends on runtime information.
What am I missing here?
Fundamentally, it is always possible to walk the stack with enough information [1], except in cases where the stack or execution context has been irrecoverably corrupted.
For example, even if rbp isn't used as the frame pointer, the return address is still on the stack somewhere, and you just need to know where. For a function that doesn't modify rsp (indirectly or directly) in the body of the function, it would be at a simple fixed offset from rsp. For functions that modify rsp in the body of the function (i.e., that have a variable stack size), the offset from rsp might depend on the exact location in the function.
The PDB file simply contains this "side band" information, which allows someone to determine the return address for any instruction in the function. Hans linked a relevant in-memory structure above. You can see that since it knows the size of the local variables and so on, it can calculate the offset between rsp and the base of the frame, and hence get at the return address. It also knows how many instruction bytes are part of the "prolog", which is important because if the IP is still in that region, different rules apply (i.e., the stack hasn't been adjusted to reflect the locals in this function yet).
In 64-bit Windows, the exact function call ABI has been made a bit more concrete, and all functions generally have to provide unwind information: not in a .pdb but directly in a section included in the binary. So even without .pdb files you should be able to unwind a properly structured 64-bit Windows program. It allows any register to be used as the frame pointer, and still allows frame-pointer omission (with some restrictions). For details, start here.
[1] If this weren't true, ask yourself how the currently running function could ever return. Now, technically you could design a program which clobbers or forgets the stack in a way that it cannot return, and either never exits or uses a method like exit() or abort() to terminate. This is highly unusual and not possible outside of assembly.
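To make the idea of "side band" information concrete, here is a toy sketch of how a debugger might use per-function frame data to find the return address without a frame pointer. It is not the actual PDB format: the `FrameInfo` fields and both function names are assumptions chosen for illustration, and the prolog case is deliberately simplified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameInfo:
    start: int            # address of the function's first instruction
    prolog_size: int      # number of code bytes in the prolog
    locals_size: int      # bytes of locals reserved by the prolog
    saved_regs_size: int  # bytes of callee-saved registers pushed by the prolog


def find_return_slot(info: FrameInfo, ip: int, sp: int) -> int:
    """Stack address that holds the return address for this frame."""
    if ip - info.start < info.prolog_size:
        # Simplification: pretend nothing has been pushed yet. This is exact
        # only at the first instruction; a real unwinder tracks each prolog step.
        return sp
    return sp + info.locals_size + info.saved_regs_size


def unwind_one_frame(info, ip, sp, memory, word_size=4):
    """Return (caller_ip, caller_sp); `memory` maps stack addresses to words.

    Ignores any arguments the callee itself pops on return.
    """
    slot = find_return_slot(info, ip, sp)
    return memory[slot], slot + word_size


if __name__ == "__main__":
    info = FrameInfo(start=0x401000, prolog_size=6,
                     locals_size=0x20, saved_regs_size=8)
    stack = {0x1028: 0x402222}  # hypothetical stack contents
    caller_ip, caller_sp = unwind_one_frame(info, 0x401010, 0x1000, stack)
    print(hex(caller_ip), hex(caller_sp))
```

The point is only that the offset from the stack pointer to the return address is a static property of the code location, so it can be recorded at build time and looked up at debug time.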

How do I debug a jump to a bad address?

I am currently debugging some assembly code using GDB and am stuck on the following problem. Somehow or other, I've ended up at a bogus instruction address, probably because either I called a bogus function pointer, or I mangled the return address on the parent stack frame.
GDB is fantastic and stops the program exactly when it detects this has happened. However, what it doesn't tell me is the instruction address that sent me to this bogus address. So now I am stuck. I know that I am now at a bogus address, but I have no way of knowing how I got here. What I think I need is a list of the last n values that $rip has taken on. But I cannot find any way of doing that in GDB's documentation and am pretty sure it is not possible.
So I would appreciate it if anyone else had any great tips on low-level debugging they could share. Thanks!
-Patrick
I think GDB's tracepoints might help:
https://sourceware.org/gdb/onlinedocs/gdb/Tracepoints.html
When your code crashes, a trap is raised and the instruction pointer jumps to the trap table; where it jumps depends on which trap was raised.
You want to determine which instruction caused this trap, so you can execute the backtrace (bt) command to show the most recently executed functions before the jump to the trap table. Once you have identified the function, step through it to identify the instruction that causes the error.
If you are using a target with gdb in remote mode, you need to give gdb a full symbol table so that it is aware of all the symbols in the code.
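The "last n values of $rip" that the question asks for amounts to a bounded history buffer. A minimal sketch of that data structure is below; the `PcHistory` class is hypothetical, and feeding it real values would require single-stepping the program (for example from a debugger script), which is slow.

```python
from collections import deque


class PcHistory:
    """Keeps only the most recent `depth` program-counter values."""

    def __init__(self, depth: int = 16):
        # deque with maxlen silently drops the oldest entry when full
        self._pcs = deque(maxlen=depth)

    def record(self, pc: int) -> None:
        self._pcs.append(pc)

    def last(self) -> list:
        """Oldest-to-newest list of the recorded values still retained."""
        return list(self._pcs)


if __name__ == "__main__":
    history = PcHistory(depth=3)
    for pc in (0x400100, 0x400104, 0x400109, 0x40010E, 0xDEADBEEF):
        history.record(pc)
    # The entry just before the bogus address is where the bad jump came from.
    print([hex(pc) for pc in history.last()])
```

With such a history, the value recorded immediately before the bogus address identifies the instruction that transferred control there.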

How to debug a foobar Program Counter

I was immediately suspicious of the crash. A Floating Point Exception in a method whose only arithmetic was a "divide by sizeof(short)".
I looked at the stack crawl & saw that the offset into the method was "+91". Then I examined a disassembly of that method & confirmed that the Program Counter was in fact foobar at the time of the crash. The disassembly showed instructions at +90 and +93 but not +91.
This is a method, 32-bit x86 instructions, that gets called very frequently in the life of the application. This crash has been reported 3 times.
How does this happen? How do I set a debugging trap for the situation?
Generally, when you fault in the middle of an instruction, it's due to bad flow control (i.e., a broken jump, call or ret), an overflow, a bad dereference, or your debug symbols being out of sync so that the stack trace shows incorrect info. Your first step is to reproduce the error reliably every time, or else you'll have trouble trapping it. From there I'd just run it in a debugger, force the conditions that make it explode, then examine the (call) stack and registers to see whether they hold valid values.
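The check the asker did by eye, comparing the crash offset against the disassembly, can be sketched as a lookup over instruction start offsets. The offsets 90 and 93 come from the question; the earlier offsets and the function names are made up for illustration.

```python
from bisect import bisect_right


def enclosing_instruction(offset: int, starts: list) -> int:
    """Start offset of the instruction that contains `offset`.

    `starts` must be the sorted start offsets taken from a disassembly.
    """
    i = bisect_right(starts, offset)
    if i == 0:
        raise ValueError("offset precedes the first instruction")
    return starts[i - 1]


def is_instruction_boundary(offset: int, starts: list) -> bool:
    """True if `offset` is exactly where some instruction begins."""
    return enclosing_instruction(offset, starts) == offset


if __name__ == "__main__":
    starts = [85, 88, 90, 93]  # 85 and 88 are hypothetical
    print(is_instruction_boundary(91, starts), enclosing_instruction(91, starts))
```

An offset that fails this check means the program counter landed inside an instruction, which points at corrupted control flow or mismatched symbols rather than at the arithmetic in the source.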

Resources