My Go program uses several hundred goroutines, and a runtime error occurs from time to time. When one does, the runtime simply prints the stack traces of all goroutines, which makes it very hard to debug.
How can I locate where the program breaks?
Sorry I didn't post the stack traces earlier; I didn't know how to capture stderr, and the output is too long for me to view all of it.
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x141edce pc=0x141edce]
runtime stack:
runtime: unexpected return pc for runtime.sigpanic called from 0x141edce
stack: frame={sp:0x7ffbffffa9f0, fp:0x7ffbffffaa40} stack=[0x7ffbff7fbb80,0x7ffbffffabb0)
00007ffbffffa8f0: 00007ffbffffa960 000000000042b58c <runtime.dopanic_m+540>
00007ffbffffa900: 000000000042b031 <runtime.throw+129> 00007ffbffffa9d0
00007ffbffffa910: 0000000000000000 000000000097f880
00007ffbffffa920: 010000000042bae8 0000000000000004
00007ffbffffa930: 000000000000001f 000000000141edce
00007ffbffffa940: 000000000141edce 0000000000000001
00007ffbffffa950: 00000000007996e6 000000c420302180
00007ffbffffa960: 00007ffbffffa988 00000000004530ac <runtime.dopanic.func1+60>
00007ffbffffa970: 000000000097f880 000000000042b031 <runtime.throw+129>
00007ffbffffa980: 00007ffbffffa9d0 00007ffbffffa9c0
00007ffbffffa990: 000000000042af5a <runtime.dopanic+74> 00007ffbffffa9a0
00007ffbffffa9a0: 0000000000453070 <runtime.dopanic.func1+0> 000000000097f880
00007ffbffffa9b0: 000000000042b031 <runtime.throw+129> 00007ffbffffa9d0
00007ffbffffa9c0: 00007ffbffffa9e0 000000000042b031 <runtime.throw+129>
00007ffbffffa9d0: 0000000000000000 000000000000002a
00007ffbffffa9e0: 00007ffbffffaa30 000000000043fb1e <runtime.sigpanic+654>
00007ffbffffa9f0: <000000000079dce7 000000000000002a
00007ffbffffaa00: 00007ffbffffaa30 000000000041f08e <runtime.greyobject+302>
00007ffbffffaa10: 000000c420029c70 000000000097f880
00007ffbffffaa20: 000000000045247d <runtime.markroot.func1+109> 000000c420a69b00
00007ffbffffaa30: 00007ffbffffaad8 !000000000141edce
00007ffbffffaa40: >000000c42160ca40 000000c4206d8000
00007ffbffffaa50: 0000000000000c00 000000c41ff4f9ad
00007ffbffffaa60: 000000c400000000 00007efbff5188f8
00007ffbffffaa70: 000000c420029c70 0000000000000052
00007ffbffffaa80: 0000000021e84000 00007ffbffffaab0
00007ffbffffaa90: 0000000000002000 0000000000000c00
00007ffbffffaaa0: 000000c422b00000 000000c420000000
00007ffbffffaab0: 00007ffbffffaad8 0000000000421564 <runtime.(*gcWork).tryGet+164>
00007ffbffffaac0: 000000c41ffc939f 000000c4226eb000
00007ffbffffaad0: 000000c4226e9000 00007ffbffffab30
00007ffbffffaae0: 000000000041e527 <runtime.gcDrain+567> 000000c4206d8000
00007ffbffffaaf0: 000000c420029c70 0000000000000000
00007ffbffffab00: 7ffffffffff8df47 00007ffc0001fc30
00007ffbffffab10: 00007ffbffffab70 0000000000000000
00007ffbffffab20: 000000c420302180 0000000000000000
00007ffbffffab30: 00007ffbffffab70 00000000004522c0 <runtime.gcBgMarkWorker.func2+128>
runtime.throw(0x79dce7, 0x2a)
/usr/lib/go-1.10/src/runtime/panic.go:616 +0x81
runtime: unexpected return pc for runtime.sigpanic called from 0x141edce
stack: frame={sp:0x7ffbffffa9f0, fp:0x7ffbffffaa40} stack=[0x7ffbff7fbb80,0x7ffbffffabb0)
00007ffbffffa8f0: 00007ffbffffa960 000000000042b58c <runtime.dopanic_m+540>
00007ffbffffa900: 000000000042b031 <runtime.throw+129> 00007ffbffffa9d0
00007ffbffffa910: 0000000000000000 000000000097f880
00007ffbffffa920: 010000000042bae8 0000000000000004
00007ffbffffa930: 000000000000001f 000000000141edce
00007ffbffffa940: 000000000141edce 0000000000000001
00007ffbffffa950: 00000000007996e6 000000c420302180
00007ffbffffa960: 00007ffbffffa988 00000000004530ac <runtime.dopanic.func1+60>
00007ffbffffa970: 000000000097f880 000000000042b031 <runtime.throw+129>
00007ffbffffa980: 00007ffbffffa9d0 00007ffbffffa9c0
00007ffbffffa990: 000000000042af5a <runtime.dopanic+74> 00007ffbffffa9a0
00007ffbffffa9a0: 0000000000453070 <runtime.dopanic.func1+0> 000000000097f880
00007ffbffffa9b0: 000000000042b031 <runtime.throw+129> 00007ffbffffa9d0
00007ffbffffa9c0: 00007ffbffffa9e0 000000000042b031 <runtime.throw+129>
00007ffbffffa9d0: 0000000000000000 000000000000002a
00007ffbffffa9e0: 00007ffbffffaa30 000000000043fb1e <runtime.sigpanic+654>
00007ffbffffa9f0: <000000000079dce7 000000000000002a
00007ffbffffaa00: 00007ffbffffaa30 000000000041f08e <runtime.greyobject+302>
00007ffbffffaa10: 000000c420029c70 000000000097f880
00007ffbffffaa20: 000000000045247d <runtime.markroot.func1+109> 000000c420a69b00
00007ffbffffaa30: 00007ffbffffaad8 !000000000141edce
00007ffbffffaa40: >000000c42160ca40 000000c4206d8000
00007ffbffffaa50: 0000000000000c00 000000c41ff4f9ad
00007ffbffffaa60: 000000c400000000 00007efbff5188f8
00007ffbffffaa70: 000000c420029c70 0000000000000052
00007ffbffffaa80: 0000000021e84000 00007ffbffffaab0
00007ffbffffaa90: 0000000000002000 0000000000000c00
00007ffbffffaaa0: 000000c422b00000 000000c420000000
00007ffbffffaab0: 00007ffbffffaad8 0000000000421564 <runtime.(*gcWork).tryGet+164>
00007ffbffffaac0: 000000c41ffc939f 000000c4226eb000
00007ffbffffaad0: 000000c4226e9000 00007ffbffffab30
00007ffbffffaae0: 000000000041e527 <runtime.gcDrain+567> 000000c4206d8000
00007ffbffffaaf0: 000000c420029c70 0000000000000000
00007ffbffffab00: 7ffffffffff8df47 00007ffc0001fc30
00007ffbffffab10: 00007ffbffffab70 0000000000000000
00007ffbffffab20: 000000c420302180 0000000000000000
00007ffbffffab30: 00007ffbffffab70 00000000004522c0 <runtime.gcBgMarkWorker.func2+128>
runtime.sigpanic()
/usr/lib/go-1.10/src/runtime/signal_unix.go:372 +0x28e
Dumping those stacks actually makes it easier to debug.
You might not be familiar with this approach to post-mortem analysis, but that can be fixed ;-)
The first thing to note is that in normal Go code the panic/recover mechanism is not used for control flow, so when some goroutine panics, it usually has a pretty grave reason to do so. In turn, this means the reason is usually restricted to a fairly narrow set of possibilities, and in 100% of such cases it signals a logical error in the program: an attempt to dereference an uninitialized (nil) pointer, an attempt to send on a closed channel, and so on.
(Of course, the problem may be with the 3rd-party code, or with the way you're using it.)
OK, so to start analysing what has happened, the first thing is to stop thinking of it as "something awry happened": instead, some particular error has happened, and the Go runtime showed you the state of all of your goroutines at that instant in time.
So the first thing to do is to actually read and understand the error that was displayed. It contains the immediate reason that caused the Go runtime to crash your program: it may be a nil pointer dereference, memory exhaustion, an attempt to close an already-closed channel, and so on.
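For example, here is what the most common case, a nil pointer dereference, looks like when provoked on purpose (a minimal hypothetical program, not taken from the question):

package main

// A deliberately broken program: reading a field through a nil
// pointer makes the runtime panic.
type config struct {
    name string
}

func main() {
    var c *config // nil: never initialized
    _ = c.name    // panic: runtime error: invalid memory address or nil pointer dereference
}

Running it prints that exact error header followed by the stack trace of the offending goroutine, which points straight at the file and line of the dereference.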
The second thing to do, once the essence of the error is clearly understood, is to decide whether the stack trace dump will be of use. It's simple: all runtime errors can be classified into two broad groups, "low-level" and "high-level". The former are those happening deep in the Go runtime itself; a failure to allocate memory is the best example. Such errors might even indicate bugs in the runtime (though this is very unlikely in practice unless you're using a bleeding-edge build of the Go toolchain to build your program). The chief property of such errors is that they may have little to do with the exact place the error occurred at. Say, a failure to allocate memory may be triggered by some innocent allocation right after a real memory hog, leaking memory, has successfully grabbed a big chunk of it.
But such errors are rare, and the high-level errors occur far, far more often. And with them, inspecting stack traces helps a lot.
In these cases, you roll like this.
A stack trace dump consists of the descriptions of the stack frames of the call chain leading to the error: the stack frame of the function in which the error occurred is at the top, its caller is just below, the caller of the caller is next down the line, and so on, right down to the entry point of the executing goroutine.
Each stack frame's description includes the name of the function, the name of the file that function is defined in, and the line number of the statement the error occurred in.
That is super useful in itself: you find that statement in the source code of your program, squint at it while keeping in mind that the indicated error happened there, and then start analysing "backwards" how it could have come to occur there. If the code of the function preceding that statement doesn't make it clear, it may help to analyse the caller's stack frame (which also includes a file name and a line number), and so on.
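To make that "backwards" reading concrete, consider a made-up example (the names are invented purely for illustration):

package main

// main -> process -> pick; the panic happens in pick, so pick's frame
// is printed first, then process, then main.
func pick(xs []int, i int) int {
    return xs[i] // panics with "index out of range" when i == len(xs)
}

func process(xs []int) int {
    return pick(xs, len(xs)) // the actual bug: an off-by-one index
}

func main() {
    process([]int{1, 2, 3})
}

The top frame points at the xs[i] line in pick; that line is innocent by itself, so you move one frame down to process and spot the off-by-one there.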
In most cases the above suffices.
In rare cases when it does not, analysing the arguments to the functions — also captured by their dumped stack frames — may help.
The values of the arguments are listed in their source-code order, from left to right; the only problem with interpreting them is "decoding" those arguments which are of "compound" types, such as strings, slices, user-defined struct types, and so on.
Say, a string is internally a struct of two fields, and in the list of arguments those fields come one after the other, "unwrapped".
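A hypothetical illustration (the function and the hex values are made up; this is how Go 1.10, the release used here, prints argument words):

package main

// A traceback line such as
//     main.f(0x4c8a2e, 0x5, 0x2a)
// corresponds to the call f("hello", 42): the string occupies two
// words (its data pointer and its length, 0x5), and n follows as 0x2a.
func f(s string, n int) {
    panic("boom")
}

func main() {
    f("hello", 42)
}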
But let's not dig too deep for now. There are other things to explore here (say, I've touched on memory-exhaustion errors but did not explain how to approach them), but you're better off actually learning your way here by practice.
If you have any concrete questions while dealing with such problems, ask away, but be sure to include the stack trace of the crashed goroutine and describe what your own attempt at analysis yielded and what exactly you have a problem with.
There exists another approach you can use.
The GOTRACEBACK environment variable may be assigned a special value to tell the Go runtime to crash your program in a way friendly to a "regular" interactive debugger capable of working with core dumps — such as gdb.
For instance, you may enable dumping of core files and then let the Go runtime crash your process in such a way that the OS dumps its core:
$ ulimit -c unlimited
$ export GOTRACEBACK=crash
$ ./your_program
...
... your_program crashes
...
$ ls *core*
core
$ gdb -e ./your_program -c core
(gdb) thread apply all bt
* tracebacks follow *
(Actual debugging of the state captured by a core file is what your IDE or whatever should take care of, I suppose; I demonstrated how to run the gdb debugger.
Run help ulimit in bash to see what that ulimit incantation above was about.)
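(As an aside, if you would rather stay with Go-aware tooling, the Delve debugger can open the same core file, assuming you have Delve installed:

$ dlv core ./your_program core
(dlv) goroutines
(dlv) bt

Here goroutines lists every goroutine captured in the dump and bt prints the backtrace of the currently selected one.)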
Related
I've got a rather complicated program that does a lot of memory allocation, and today by surprise it started segfaulting in a weird way that gdb couldn't pin-point the location of. Suspecting memory corruption somewhere, I linked it against Electric Fence, but I am baffled as to what it is telling me:
ElectricFence Exiting: mprotect() failed:
Program received signal SIGSEGV, Segmentation fault.
__strlen_sse2 () at ../sysdeps/i386/i686/multiarch/strlen.S:99
99 ../sysdeps/i386/i686/multiarch/strlen.S: No such file or directory.
in ../sysdeps/i386/i686/multiarch/strlen.S
#0 __strlen_sse2 () at ../sysdeps/i386/i686/multiarch/strlen.S:99
#1 0xb7fd6f2d in ?? () from /usr/lib/libefence.so.0
#2 0xb7fd6fc2 in EF_Exit () from /usr/lib/libefence.so.0
#3 0xb7fd6b48 in ?? () from /usr/lib/libefence.so.0
#4 0xb7fd66c9 in memalign () from /usr/lib/libefence.so.0
#5 0xb7fd68ed in malloc () from /usr/lib/libefence.so.0
#6 <and above are frames in my program>
I'm calling malloc with a value of 36, so I'm pretty sure that shouldn't be a problem.
What I don't understand is how it is even possible that I could be trashing the heap in malloc. In reading the manual page a bit more, it appears that maybe I am writing to a free page, or maybe I'm underwriting a buffer. So, I have tried the following environment variables, together and by themselves:
EF_PROTECT_FREE=1
EF_PROTECT_BELOW=1
EF_ALIGNMENT=64
EF_ALIGNMENT=4096
The last two had absolutely no effect.
The first one changed the portions of the backtrace that are in my program (that is, where my program was executing when the fatal malloc call was made), but the frames were identical once malloc was entered.
The second one changed a bit more; in addition to the crash occurring at a different place in my program, it also occurred in a call to realloc instead of malloc, although realloc is directly calling malloc and otherwise the back trace is identical to above.
I'm not explicitly linking against any other libraries besides fence.
Update: I found several places where it suggests that the message: " mprotect() failed: Cannot allocate memory" means that there is not enough memory on the machine. But I am not seeing the "Cannot allocate memory" part, and ps says I am only using 15% of memory. With such a small allocation (4k+32) could this really be the problem?
I just wasted several hours on the same problem.
It turns out that it is to do with the setting in
/proc/sys/vm/max_map_count
From the kernel documentation:
"This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling malloc, directly by mmap and mprotect, and also when loading shared libraries.
While most applications need less than a thousand maps, certain programs, particularly malloc debuggers, may consume lots of them, e.g., up to one or two maps per allocation."
So you can 'cat' that file to see what it is set to, and then you can 'echo' a bigger number into it. Like this: echo 165535 > /proc/sys/vm/max_map_count
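Equivalently, through sysctl, if you prefer (run as root; the value is just an example):

$ sysctl vm.max_map_count
$ sysctl -w vm.max_map_count=165535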
For me, this allowed electric fence to get past where it was before, and start to find real bugs.
We have a slow memory leak in our application, and I've already gone through the following steps in trying to analyze the cause of the leak:
Enabling user mode stack trace database in GFlags
In Windbg, typing the following command: !heap -stat -h 1250000 (where 1250000 is the address of the heap that has the leak)
After comparing multiple dumps, I see that memory blocks of size 0xC are increasing over time and are probably the memory that is leaked.
typing the following command: !heap -flt s c
gives the UserPtr of those allocations and finally:
typing !heap -p -a address on some of those addresses always shows the following allocation call stack:
0:000> !heap -p -a 10576ef8
address 10576ef8 found in
_HEAP # 1250000
HEAP_ENTRY Size Prev Flags UserPtr UserSize - state
10576ed0 000a 0000 [03] 10576ef8 0000c - (busy)
mscoreei!CLRRuntimeInfoImpl::`vftable'
7c94b244 ntdll!RtlAllocateHeapSlowly+0x00000044
7c919c0c ntdll!RtlAllocateHeap+0x00000e64
603b14a4 mscoreei!UtilExecutionEngine::ClrHeapAlloc+0x00000014
603b14cb mscoreei!ClrHeapAlloc+0x00000023
603b14f7 mscoreei!ClrAllocInProcessHeapBootstrap+0x0000002e
603b1614 mscoreei!operator new[]+0x0000002b
603d402b +0x0000005f
603d5142 mscoreei!GetThunkUseState+0x00000025
603d6fe8 mscoreei!_CorDllMain+0x00000056
79015012 mscoree!ShellShim__CorDllMain+0x000000ad
7c90118a ntdll!LdrpCallInitRoutine+0x00000014
7c919a6d ntdll!LdrpInitializeThread+0x000000c0
7c9198e6 ntdll!_LdrpInitialize+0x00000219
7c90e457 ntdll!KiUserApcDispatcher+0x00000007
This looks like a thread-initialization call stack, but I need to know more than this.
What is the next step you would recommend in order to pinpoint the exact cause of the leak?
The stack recorded when using GFlags is captured without utilizing .pdb files and is often not correct.
Since you have traced the leak down to a specific size on a given heap, you can try
to set a live break in RtlAllocateHeap and inspect the stack in windbg with proper symbols. I have used the following with some success. You must edit it to suit your heap and size.
$$ Display stack if heap handle eq 0x00310000 and size is 0x1303
$$ ====================================================================
bp ntdll!RtlAllocateHeap "j ((poi(@esp+4) = 0x00310000) & (poi(@esp+c) = 0x1303) )'k';'gc'"
Maybe you then get another stack and other ideas for the offender.
The first thing to check: the new operator here is the new[] operator, so is there a corresponding delete[] call rather than a plain old delete call?
If you suspect this code, I would put a test harness around it: for instance, put it in a loop and execute it 100 or 1,000 times. Does it still leak, and proportionally?
You can also measure the memory increase using Process Explorer, or programmatically using GetProcessInformation.
The other obvious thing is to see what happens when you comment out this function call: does the memory leak go away? You may need to do a binary chop of the code, if possible, reducing the suspect code by (roughly) half each time by commenting it out. However, changing the behaviour of the code may cause more problems or dependent-code-path issues, which can themselves cause memory leaks or strange behaviour.
EDIT
Ignore the following, seeing as you are working in a managed environment.
You may also consider using the STL or, better yet, Boost reference-counted pointers such as shared_ptr, or scoped_array for array structures, to manage the lifetime of the objects.
I was immediately suspicious of the crash. A Floating Point Exception in a method whose only arithmetic was a "divide by sizeof(short)".
I looked at the stack crawl & saw that the offset into the method was "+91". Then I examined a disassembly of that method & confirmed that the Program Counter was in fact foobar at the time of the crash. The disassembly showed instructions at +90 and +93 but not +91.
This is a method, 32-bit x86 instructions, that gets called very frequently in the life of the application. This crash has been reported 3 times.
How does this happen? How do I set a debugging trap for the situation?
Generally, when you fault in the middle of an instruction, it's due to bad flow control (i.e. a broken jump, call, or retn), an overflow, a bad dereference, or your debug symbols being out of sync and making the stack trace show incorrect info. Your first step is to reliably reproduce the error every time; otherwise you'll have trouble trapping it. From there I'd just run it in a debugger, force the conditions that make it explode, then examine the (call) stack and registers to see whether they hold valid values, etc.
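For example, under gdb that inspection could look roughly like this (a generic sketch for 32-bit x86, not specific to your binary):

(gdb) run
(gdb) bt
(gdb) info registers
(gdb) x/8i $pc - 16
(gdb) x/16wx $esp

The disassembly around $pc shows whether execution really landed mid-instruction, and the raw stack words often reveal a clobbered return address.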
I just spent some time chasing down a bug that boiled down to the following: code was erroneously overwriting the stack, and I think it wrote over the return address of the function call. Following the return, the program would crash and the stack would be corrupted. Running the program in valgrind would return an error such as:
vex x86->IR: unhandled instruction bytes: 0xEA 0x3 0x0 0x0
==9222== valgrind: Unrecognised instruction at address 0x4e925a8.
I figure this is because the return jumped to a random location containing stuff that was not valid x86 opcodes. (Though I am somewhat suspicious that this address 0x4e925a8 happened to be in an executable page; I imagine valgrind would throw a different error if this weren't the case.)
I am certain that the problem was of the stack-overwriting type, and I've since fixed it. Now I am trying to think how I could catch errors like this more effectively. Obviously, valgrind can't warn me if I rewrite data on the stack, but maybe it can catch when someone writes over a return address on the stack. In principle, it can detect when something like 'push EIP' happens (so it can flag where the return addresses are on the stack).
I was wondering whether anyone knows if Valgrind, or anything else, can do that? If not, can you suggest other ways of debugging errors of this type efficiently?
If the problem happens deterministically enough that you can point out a particular function that has its stack smashed (in one repeatable test case), you could, in gdb:
Break at the entry to that function.
Find where the return address is stored (it's relative to %ebp on x86, which holds the value %esp had at function entry; I am not sure whether there is any offset).
Add a watchpoint on that address. You have to issue the watch command with a calculated number, not an expression, because with an expression gdb would try to re-evaluate it after each instruction instead of setting up a hardware trap, and that would be extremely slow. (See the sketch after these steps.)
Let the function run to completion.
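Put together, a session might look roughly like this (suspect_function and the printed address are hypothetical; use whatever print reports on your machine):

(gdb) break suspect_function
(gdb) run
(gdb) print/x $ebp + 4
$1 = 0xbffff43c
(gdb) watch *(void **) 0xbffff43c
(gdb) continue

gdb then stops at the exact instruction that overwrites the saved return address.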
I have not yet worked with the Python support available in gdb 7, but it should allow automating this.
In general, Valgrind's detection of overflows in stack and global variables is weak to non-existent. Arguably, Valgrind is the wrong tool for that job.
If you are on one of the supported platforms, building with -fmudflap and linking with -lmudflap will give you much better results for these kinds of errors. Additional docs here.
Update:
Much has changed in the 6 years since this answer was written. On Linux, the tool of choice for finding stack (and heap) overflows is AddressSanitizer, supported by recent versions of GCC and Clang.
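For example, with a reasonably recent GCC or Clang (these are the standard AddressSanitizer flags):

$ gcc -g -fsanitize=address -fno-omit-frame-pointer -o prog prog.c
$ ./prog

AddressSanitizer aborts with a detailed report, including the offending write and a stack trace, as soon as a stack buffer is overflowed.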
I am trying to compile Ruby 1.9.1-p0 on HP-UX. After a small change to ext/pty.c it compiles successfully, albeit with a lot of warning messages (about 5K). When I run the self-tests using "make test" it crashes and core-dumps with the following error:
sendsig: useracc failed. 0x9fffffffbf7dae00 0x00000000005000
Pid 3044 was killed due to failure in writing the signal context - possible stack overflow.
Illegal instruction
From googling this problem, the Illegal instruction is just the signal the system uses to kill the process and is not related to the underlying problem. It would seem that there is a problem with re-establishing the context when calling the signal handler. Bringing the core up in gdb doesn't show a particularly deep stack, so I don't think the "possible stack overflow" is right either.
The gdb stack backtrace output looks like this:
#0 0xc00000000033a990:0 in __ksleep+0x30 () from /usr/lib/hpux64/libc.so.1
#1 0xc0000000001280a0:0 in __mxn_sleep+0xae0 ()
from /usr/lib/hpux64/libpthread.so.1
#2 0xc0000000000c0f90:0 in <unknown_procedure> + 0xc50 ()
from /usr/lib/hpux64/libpthread.so.1
#3 0xc0000000000c1e30:0 in pthread_cond_timedwait+0x1d0 ()
from /usr/lib/hpux64/libpthread.so.1
Answering my own question:
The problem was that the stack being allocated was too small. So it really was a stack overflow. The sendsig() function was preparing a context structure to be copied from kernel space to user space. The useracc() function checks that there's enough space at the address specified to do so.
The Ruby 1.9.1-p0 code was using PTHREAD_STACK_MIN to allocate the stack for any threads created. According to HP-UX documentation, on Itanium this is 256KB, but when I checked the header files, it was only 4KB. The error message from useracc() indicated that it was trying to copy 20KB.
So if a thread received a signal, it wouldn't have enough space to receive the signal context on its stack.