I am currently debugging some assembly code using GDB and am stuck on the following problem. Somehow or other, I've ended up at a bogus instruction address, probably because either I called a bogus function pointer, or I mangled the return address on the parent stack frame.
GDB is fantastic and stops the program exactly when it detects this has happened. However, what it doesn't tell me is the instruction address that sent me to this bogus address. So now I am stuck. I know that I am now at a bogus address, but I have no way of knowing how I got here. What I think I need is a list of the last n values that $rip has taken on. But I cannot find any way of doing that in GDB's documentation and am pretty sure it is not possible.
So I would appreciate it if anyone else had any great tips on low-level debugging they could share. Thanks!
-Patrick
I think GDB's tracepoints might help:
https://sourceware.org/gdb/onlinedocs/gdb/Tracepoints.html
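A minimal sketch of what that might look like, assuming a remote target whose stub supports tracepoints (the address is just a placeholder for a suspect call site):

    (gdb) trace *0x401000          # tracepoint at the suspect call site
    (gdb) actions
    > collect $rip                 # snapshot the instruction pointer at each hit
    > end
    (gdb) tstart
    (gdb) continue
    (gdb) tstop
    (gdb) tfind start              # then walk the collected snapshots
    (gdb) print $rip

Note that tracepoints only work against targets (such as gdbserver) that implement them.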
When your code crashes, a trap is raised and the instruction pointer jumps into the trap table; which entry it jumps to depends on the trap that was raised.
You want to determine which instruction caused the trap, so run the backtrace (bt) command to show the functions executed before the jump into the trap table. Once you have identified the function, step through it instruction by instruction to identify the one that causes the error.
If you are using a target with gdb in remote mode, make sure you give gdb a complete symbol table so it is aware of all the code symbols.
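For example (the function name here is hypothetical):

    (gdb) bt                       # which functions ran before the jump into the trap table?
    (gdb) break do_work            # the suspect function identified from the backtrace
    (gdb) run
    (gdb) stepi                    # repeat: single-step instructions until the fault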
Related
I'm debugging a program which reports:
Thread 1 "test.out" received signal SIGSEGV, Segmentation fault.
I then ran the program under GDB and found that it jumps into a SignalHandler function right after a call instruction, call 0x401950.
I examined the call destination and checked rax, rdi and rsi (the inputs to the call), but found nothing strange.
I haven't met this situation before. My guess is that some exception raised a soft interrupt, which was then delivered after the instruction, so the actual problem may occur earlier.
Now I need to find out where the exception is raised so that I can fix it, but I don't have any clue how.
So I'm asking if anyone could help me with that.
Sorry for not showing the code, since it is a company asset.
Thanks to anyone who helps!
I then ran the program under GDB and found that it jumps into a SignalHandler function right after a call instruction, call 0x401950.
You didn't say which processor and OS you are using. Guessing Linux and x86_64, you have a stack overflow.
The CALL instruction pushes the return address onto the stack, and that push will generate SIGSEGV if your stack is exhausted.
You can confirm this guess with (gdb) where (which is likely to show very deep recursion, though other causes of stack exhaustion are also possible) and by looking at the value of RSP (which should be just below a page boundary).
Depending on how the stack size is set up, using ulimit -s unlimited before invoking the program may work around this crash (though you really ought to fix the root cause by some other mechanism).
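A sketch of what confirming it might look like (the frames and addresses here are made up):

    (gdb) where
    #0  0x0000000000400544 in recurse ()
    #1  0x0000000000400560 in recurse ()
    ...                                      # thousands of identical frames
    (gdb) print $rsp
    $1 = (void *) 0x7ffffefff8f8             # just under a page boundary: stack exhausted
    (gdb) shell ulimit -s
    8192                                     # the default 8 MiB stack on many Linux systems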
Is there a SPARC equivalent to x86's single step mode? What I want is to stop execution after every instruction and move control flow to a trap handler or something similar.
I thought of using the ta instruction in the branch delay slot, but this would not work when the previous instruction is a branch with the annul bit set.
SPARC lacks a single-step bit in the PSR, so it's harder to single-step, but I've used a trick that gets close. Set TPC to the address of the instruction you want to single-step, and set TNPC to an address somewhere else where you've placed a trap instruction. When you execute the retry instruction to get back to the process context, it will single-step the one instruction you want, then next execute the trap instruction, which brings you right back into the kernel, where you can do whatever you want. (N.B. this is for sparc64; I'm not sure about sparc32.) This is a nice trick because you don't have to modify existing instructions in the user's address space, which was important to me since I was single-stepping instructions in the kernel.
Another idea I had, but never tried, was to simply set TNPC to an illegal address. Then after the instruction at TPC was executed, you'd get an automatic trap back into the kernel. And since the trap handling code knows that the process is being single stepped, there would be no confusion over a "real" illegal address trap.
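A rough sketch of the first trick in kernel-style C (the field names follow Linux's sparc64 pt_regs, but treat the details as illustrative rather than a working patch):

    /* Inside the trap handler: arrange to execute exactly one user
     * instruction, then trap straight back into the kernel. */
    regs->tpc  = insn_addr;         /* the single instruction to step      */
    regs->tnpc = trampoline_addr;   /* a page we control, holding a "ta"   */
    /* "retry" then resumes the context: it executes the instruction at
     * tpc, falls through to tnpc, hits the trap, and control returns.    */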
How does a debugger set breakpoints if the image is in read-only memory? I know there are hardware breakpoints, but in the debugger I use (OllyDbg) those have to be set specially using a different dialog than normal breakpoints.
Explanation:
Here is a routine, viewed in a debugger, that compares the running image to a known-good copy of itself. EDX points to the running image, EBX points to the known-good copy. The breakpoint on 4010CE is only reached if there is a mismatch. The character being compared is in the AL register. As you can see, the debugger shows EB F6 at 10CE, but this is false: 10CE actually contains CC, as you can tell by looking at the AL register. The debugger has secretly inserted the CC to implement the breakpoint.
The debugger first has to change the memory protection of the page it wants to write to. This can be done with VirtualProtectEx. After that it is able to write with WriteProcessMemory and then set the protection back to the original value.
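A minimal sketch of that sequence with the Win32 API (error handling mostly omitted; hProcess and addr are assumed to come from the debugger's attach logic):

    #include <windows.h>

    /* Plant an INT3 (0xCC) over the first byte of an instruction in a
     * debuggee whose code pages are read-only. */
    BOOL set_sw_breakpoint(HANDLE hProcess, LPVOID addr, BYTE *savedByte)
    {
        DWORD oldProtect;
        BYTE int3 = 0xCC;
        SIZE_T n;

        if (!VirtualProtectEx(hProcess, addr, 1, PAGE_EXECUTE_READWRITE, &oldProtect))
            return FALSE;
        ReadProcessMemory(hProcess, addr, savedByte, 1, &n);   /* remember the original byte */
        WriteProcessMemory(hProcess, addr, &int3, 1, &n);
        FlushInstructionCache(hProcess, addr, 1);              /* keep the I-cache coherent  */
        VirtualProtectEx(hProcess, addr, 1, oldProtect, &oldProtect);
        return TRUE;
    }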
Let me preface this with a disclaimer that I'm not familiar with your particular toolset.
If you haven't enabled hardware breakpoints, the only remaining breakpoint type is a software breakpoint. On x86 (because that's what I'm most familiar with), a software breakpoint works by replacing the first byte of an instruction with a trap instruction (INT3, opcode 0xCC). The resulting trap is only routed through your OS's breakpoint mechanism to your debugger if the correct trap instruction for that OS is used and the debugger has already registered itself with the OS as the debugger for this process. For the software breakpoint to fire at the correct moment, the trap instruction must be written into your code segment over the first byte of the original instruction.
The two answers that got here first explain the two scenarios which could get you here (at least, the only two I can think of):
The kernel always has write access everywhere, except for hardware-protected pages (i.e., some sort of ROM), which your process' memory is almost certainly not. It can write the breakpoint instruction regardless of the permissions exposed to the user process being debugged.
The debugger must use some syscall to change the access rights on the memory of the target process before inserting the breakpoint.
Personally, I'm guessing the first thing is happening. The segment permissions are only in place to protect your target process from itself, not from a debugger process or from the kernel. Debugging mechanisms in operating systems pretty regularly violate "normal" permissions to allow the debugger to do whatever it wants to the target process. This, of course, is why some operating systems require you to enter a password before you're allowed to use the debugger in certain scenarios.
However, you can test whether it's the second one by attempting to write to the code segment from inside the target process after a breakpoint has been set. If the write succeeds, you know the permissions have been lowered by the OS (to allow the process to be debugged). It would be pretty awkward for the OS to require a debugger to jump through this hoop, since the debugger can already insert arbitrary code into the writeable parts of memory and then force a jump to it, for example by overwriting a return address on the stack.
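Since a blind write would crash the process if the page is still read-only, a gentler probe from inside the target is to query the page protection; some_function here is a placeholder for code that has a breakpoint set on it:

    #include <windows.h>
    #include <stdio.h>

    /* Sketch: report the current protection of a code page. PAGE_EXECUTE_READ
     * suggests the OS did not lower the permissions for the debugger;
     * PAGE_EXECUTE_READWRITE would suggest that it did. */
    void probe(void *code_addr)   /* e.g. &some_function */
    {
        MEMORY_BASIC_INFORMATION mbi;
        if (VirtualQuery(code_addr, &mbi, sizeof mbi))
            printf("Protect = 0x%lx\n", (unsigned long)mbi.Protect);
    }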
The debugger takes advantage of the WriteProcessMemory() function to alter the instruction in place, keeping a copy of the original byte. When the breakpoint is hit, it restores the old byte value and sets EIP back to the breakpoint address so the real instruction can execute.
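The hit-handling side might look roughly like this, assuming a 32-bit x86 target as in OllyDbg (hThread comes from the debug event loop; re-arming the breakpoint afterwards is omitted):

    #include <windows.h>

    /* On EXCEPTION_BREAKPOINT: restore the saved byte and rewind EIP one
     * byte so the real instruction executes when the thread resumes. */
    void handle_breakpoint(HANDLE hProcess, HANDLE hThread,
                           LPVOID addr, BYTE savedByte)
    {
        SIZE_T n;
        CONTEXT ctx;

        WriteProcessMemory(hProcess, addr, &savedByte, 1, &n);
        FlushInstructionCache(hProcess, addr, 1);

        ctx.ContextFlags = CONTEXT_CONTROL;   /* we only need the control registers */
        GetThreadContext(hThread, &ctx);
        ctx.Eip -= 1;                         /* back onto the breakpoint address */
        SetThreadContext(hThread, &ctx);
    }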
I was immediately suspicious of the crash. A Floating Point Exception in a method whose only arithmetic was a "divide by sizeof(short)".
I looked at the stack crawl & saw that the offset into the method was "+91". Then I examined a disassembly of that method & confirmed that the Program Counter was in fact foobar at the time of the crash. The disassembly showed instructions at +90 and +93 but not +91.
This is a method, compiled to 32-bit x86 instructions, that gets called very frequently over the life of the application. This crash has been reported three times.
How does this happen? How do I set a debugging trap for the situation?
Generally, when you fault in the middle of an instruction it's due to bad flow control (i.e. a broken jump, call, or retn), an overflow, a bad dereference, or debug symbols that are out of sync, making the stack trace show incorrect info. Your first step is to reproduce the error reliably every time; otherwise you'll have trouble trapping it. From there I'd just run it in a debugger, force the conditions that make it explode, then examine the (call) stack and registers to see if they hold valid values, etc.
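When you do catch it live, a few GDB commands help sanity-check the crash site (32-bit x86 here, matching the question):

    (gdb) x/3i $eip - 3      # do the bytes around the crash decode as sane instructions?
    (gdb) info frame         # does the saved return address look plausible?
    (gdb) x/8wx $esp         # raw stack words: a clobbered return address shows up here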
I just spent some time chasing down a bug that boiled down to the following. Code was erroneously overwriting the stack, and I think it wrote over the return address of the function call. Following the return, the program would crash and stack would be corrupted. Running the program in valgrind would return an error such as:
vex x86->IR: unhandled instruction bytes: 0xEA 0x3 0x0 0x0
==9222== valgrind: Unrecognised instruction at address 0x4e925a8.
I figure this is because the return jumped to a random location containing bytes that were not valid x86 opcodes. (Though I am somewhat suspicious that this address 0x4e925a8 happened to be in an executable page; I imagine valgrind would throw a different error if it weren't.)
I am certain that the problem was of the stack-overwriting type, and I've since fixed it. Now I am trying to think how I could catch errors like this more effectively. Obviously, valgrind can't warn me if I rewrite data on the stack, but maybe it can catch when someone writes over a return address on the stack. In principle, it can detect when something like 'push EIP' happens (so it can flag where the return addresses are on the stack).
I was wondering if anyone knows whether Valgrind, or anything else, can do that. If not, can you suggest other ways to debug errors of this type efficiently?
If the problem happens deterministically enough that you can point out a particular function that has its stack smashed (in one repeatable test case), you could, in gdb:
Break at entry to that function
Find where the return address is stored. On x86 with a standard prologue it is at %ebp + 4: the CALL instruction pushes the return address, and the prologue then pushes the caller's %ebp just below it and copies %esp into %ebp.
Add a watchpoint on that address. You have to issue the watch command with the calculated number, not an expression, because with an expression gdb would try to re-evaluate it after each instruction instead of setting up a trap, and that would be extremely slow. (A concrete session is sketched after this list.)
Let the function run to completion.
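That session might look like this (the function name and addresses are made up):

    (gdb) break vulnerable_func          # the function whose frame gets smashed
    (gdb) run
    Breakpoint 1, vulnerable_func () at test.c:12
    (gdb) print $ebp + 4                 # saved return address sits at %ebp+4 after the prologue
    $1 = (void *) 0xbffff63c
    (gdb) watch *(int *) 0xbffff63c      # literal address, so gdb sets a hardware watchpoint
    Hardware watchpoint 2: *(int *) 0xbffff63c
    (gdb) continue                       # stops on the write that clobbers the return address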
I have not yet worked with the python support available in gdb7, but it should allow automating this.
In general, Valgrind's detection of overflows in stack and global variables is weak to non-existent. Arguably, Valgrind is the wrong tool for that job.
If you are on one of the supported platforms, building with -fmudflap and linking with -lmudflap will give you much better results for these kinds of errors.
Update:
Much has changed in the 6 years since this answer. On Linux, the tool to find stack (and heap) overflows is AddressSanitizer, supported by recent versions of GCC and Clang.
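For example, assuming GCC 4.8+ or a recent Clang:

    $ gcc -g -fsanitize=address test.c -o test.out
    $ ./test.out        # a stack-buffer-overflow now aborts with a report that
                        # pinpoints the bad write and shows a full stack trace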