My Go code uses several hundred goroutines. A runtime error occurs from time to time, and when it does, the runtime simply prints the stack traces of all goroutines, which makes it very hard to debug.
How can I locate where the program breaks?
Sorry I didn't post the stack traces earlier; I didn't know how to capture the stderr output, and it is so long that I can't view all of it.
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x141edce pc=0x141edce]
runtime stack:
runtime: unexpected return pc for runtime.sigpanic called from 0x141edce
stack: frame={sp:0x7ffbffffa9f0, fp:0x7ffbffffaa40} stack=[0x7ffbff7fbb80,0x7ffbffffabb0)
00007ffbffffa8f0: 00007ffbffffa960 000000000042b58c <runtime.dopanic_m+540>
00007ffbffffa900: 000000000042b031 <runtime.throw+129> 00007ffbffffa9d0
00007ffbffffa910: 0000000000000000 000000000097f880
00007ffbffffa920: 010000000042bae8 0000000000000004
00007ffbffffa930: 000000000000001f 000000000141edce
00007ffbffffa940: 000000000141edce 0000000000000001
00007ffbffffa950: 00000000007996e6 000000c420302180
00007ffbffffa960: 00007ffbffffa988 00000000004530ac <runtime.dopanic.func1+60>
00007ffbffffa970: 000000000097f880 000000000042b031 <runtime.throw+129>
00007ffbffffa980: 00007ffbffffa9d0 00007ffbffffa9c0
00007ffbffffa990: 000000000042af5a <runtime.dopanic+74> 00007ffbffffa9a0
00007ffbffffa9a0: 0000000000453070 <runtime.dopanic.func1+0> 000000000097f880
00007ffbffffa9b0: 000000000042b031 <runtime.throw+129> 00007ffbffffa9d0
00007ffbffffa9c0: 00007ffbffffa9e0 000000000042b031 <runtime.throw+129>
00007ffbffffa9d0: 0000000000000000 000000000000002a
00007ffbffffa9e0: 00007ffbffffaa30 000000000043fb1e <runtime.sigpanic+654>
00007ffbffffa9f0: <000000000079dce7 000000000000002a
00007ffbffffaa00: 00007ffbffffaa30 000000000041f08e <runtime.greyobject+302>
00007ffbffffaa10: 000000c420029c70 000000000097f880
00007ffbffffaa20: 000000000045247d <runtime.markroot.func1+109> 000000c420a69b00
00007ffbffffaa30: 00007ffbffffaad8 !000000000141edce
00007ffbffffaa40: >000000c42160ca40 000000c4206d8000
00007ffbffffaa50: 0000000000000c00 000000c41ff4f9ad
00007ffbffffaa60: 000000c400000000 00007efbff5188f8
00007ffbffffaa70: 000000c420029c70 0000000000000052
00007ffbffffaa80: 0000000021e84000 00007ffbffffaab0
00007ffbffffaa90: 0000000000002000 0000000000000c00
00007ffbffffaaa0: 000000c422b00000 000000c420000000
00007ffbffffaab0: 00007ffbffffaad8 0000000000421564 <runtime.(*gcWork).tryGet+164>
00007ffbffffaac0: 000000c41ffc939f 000000c4226eb000
00007ffbffffaad0: 000000c4226e9000 00007ffbffffab30
00007ffbffffaae0: 000000000041e527 <runtime.gcDrain+567> 000000c4206d8000
00007ffbffffaaf0: 000000c420029c70 0000000000000000
00007ffbffffab00: 7ffffffffff8df47 00007ffc0001fc30
00007ffbffffab10: 00007ffbffffab70 0000000000000000
00007ffbffffab20: 000000c420302180 0000000000000000
00007ffbffffab30: 00007ffbffffab70 00000000004522c0 <runtime.gcBgMarkWorker.func2+128>
runtime.throw(0x79dce7, 0x2a)
/usr/lib/go-1.10/src/runtime/panic.go:616 +0x81
runtime: unexpected return pc for runtime.sigpanic called from 0x141edce
stack: frame={sp:0x7ffbffffa9f0, fp:0x7ffbffffaa40} stack=[0x7ffbff7fbb80,0x7ffbffffabb0)
00007ffbffffa8f0: 00007ffbffffa960 000000000042b58c <runtime.dopanic_m+540>
00007ffbffffa900: 000000000042b031 <runtime.throw+129> 00007ffbffffa9d0
00007ffbffffa910: 0000000000000000 000000000097f880
00007ffbffffa920: 010000000042bae8 0000000000000004
00007ffbffffa930: 000000000000001f 000000000141edce
00007ffbffffa940: 000000000141edce 0000000000000001
00007ffbffffa950: 00000000007996e6 000000c420302180
00007ffbffffa960: 00007ffbffffa988 00000000004530ac <runtime.dopanic.func1+60>
00007ffbffffa970: 000000000097f880 000000000042b031 <runtime.throw+129>
00007ffbffffa980: 00007ffbffffa9d0 00007ffbffffa9c0
00007ffbffffa990: 000000000042af5a <runtime.dopanic+74> 00007ffbffffa9a0
00007ffbffffa9a0: 0000000000453070 <runtime.dopanic.func1+0> 000000000097f880
00007ffbffffa9b0: 000000000042b031 <runtime.throw+129> 00007ffbffffa9d0
00007ffbffffa9c0: 00007ffbffffa9e0 000000000042b031 <runtime.throw+129>
00007ffbffffa9d0: 0000000000000000 000000000000002a
00007ffbffffa9e0: 00007ffbffffaa30 000000000043fb1e <runtime.sigpanic+654>
00007ffbffffa9f0: <000000000079dce7 000000000000002a
00007ffbffffaa00: 00007ffbffffaa30 000000000041f08e <runtime.greyobject+302>
00007ffbffffaa10: 000000c420029c70 000000000097f880
00007ffbffffaa20: 000000000045247d <runtime.markroot.func1+109> 000000c420a69b00
00007ffbffffaa30: 00007ffbffffaad8 !000000000141edce
00007ffbffffaa40: >000000c42160ca40 000000c4206d8000
00007ffbffffaa50: 0000000000000c00 000000c41ff4f9ad
00007ffbffffaa60: 000000c400000000 00007efbff5188f8
00007ffbffffaa70: 000000c420029c70 0000000000000052
00007ffbffffaa80: 0000000021e84000 00007ffbffffaab0
00007ffbffffaa90: 0000000000002000 0000000000000c00
00007ffbffffaaa0: 000000c422b00000 000000c420000000
00007ffbffffaab0: 00007ffbffffaad8 0000000000421564 <runtime.(*gcWork).tryGet+164>
00007ffbffffaac0: 000000c41ffc939f 000000c4226eb000
00007ffbffffaad0: 000000c4226e9000 00007ffbffffab30
00007ffbffffaae0: 000000000041e527 <runtime.gcDrain+567> 000000c4206d8000
00007ffbffffaaf0: 000000c420029c70 0000000000000000
00007ffbffffab00: 7ffffffffff8df47 00007ffc0001fc30
00007ffbffffab10: 00007ffbffffab70 0000000000000000
00007ffbffffab20: 000000c420302180 0000000000000000
00007ffbffffab30: 00007ffbffffab70 00000000004522c0 <runtime.gcBgMarkWorker.func2+128>
runtime.sigpanic()
/usr/lib/go-1.10/src/runtime/signal_unix.go:372 +0x28e
Dumping those stacks actually makes it easier to debug.
You might not be familiar with this approach to post-mortem analysis, but that can be fixed ;-)
The first thing to note is that in normal Go code the panic/recover mechanism is not used for control flow, so when some goroutine panics, it usually has a pretty grave reason to do so. This means the possible reasons are confined to a fairly narrow set, and in practically all such cases the panic signals a logical error in the program: an attempt to dereference an uninitialized (nil) pointer, an attempt to send to a closed channel, and so on.
(Of course, the problem may be with the 3rd-party code, or with the way you're using it.)
OK, so to start analysing what has happened, the first thing is to stop thinking of this as "something awry happened": instead, some particular error has happened, and the Go runtime has shown you the state of all of your goroutines at that instant in time.
So, the first thing to do is to actually read and understand the error displayed. It states the immediate reason that made the Go runtime crash your program: a nil pointer dereference, memory exhaustion, an attempt to close an already-closed channel, and so on.
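For instance, here is a minimal, purely hypothetical program (the names, paths and addresses below are made up) that crashes with a nil pointer dereference, together with the header the runtime prints for it:

package main

type conn struct{ addr string }

// newConn is a stub that forgets to initialize the connection.
func newConn() *conn { return nil }

// describe reads c.addr, so it panics when c is nil.
func describe(c *conn) string {
    return "connected to " + c.addr
}

func main() {
    c := newConn()
    println(describe(c))
}

Running it prints something like

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x457ad2]

and that header alone already names the kind of error: a read through a nil pointer (note addr=0x0).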
The second thing to do, once the essence of the error is clearly understood, is to decide whether the stack trace dump will be of use. It's simple: all runtime errors can be classified into two broad groups, "low-level" and "high-level". The former are those happening deep in the Go runtime itself; a failure to allocate memory is the best example. Such errors might even indicate bugs in the runtime (though that is very unlikely in practice unless you're using a bleeding-edge build of the Go toolset to build your program). The chief property of such errors is that they may have little to do with the exact place the error occurred at. Say, a failure to allocate memory may be triggered by some innocent allocation just because a real, leaking memory hog managed to grab a big chunk of memory right before it.
But such errors are rare, and the high-level errors occur far, far more often. And with them, inspecting stack traces helps a lot.
In these cases, you roll like this.
A stack trace dump consists of the descriptions of the stack frames of the call chain leading to the error: the frame of the function in which the error occurred is at the top, its caller is just below, the caller of the caller is next down the line, and so on, all the way down to the entry point of the executing goroutine.
Each stack frame's description includes the name of the function, the name of the file that function is defined in, and the line number of the statement the error occurred in.
That is super useful in itself: you find that statement in the source code of your program, squint at it while keeping in mind that the indicated error happened right there, and then analyse "backwards" how it could have come to happen there. If the code of the function leading up to that statement doesn't make it clear, it may help to analyse the caller's stack frame (which also comes with a file name and a line number), and so on.
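To make this concrete, the goroutine trace of the hypothetical program from above would look roughly like this (the argument values are raw stack words, and the offsets are made up):

goroutine 1 [running]:
main.describe(0x0, 0x0, 0x0)
    /home/user/example/main.go:10 +0x26
main.main()
    /home/user/example/main.go:15 +0x2d

main.describe is on top because the fault happened inside it, at line 10 of main.go; main.main, its caller, is right below it, and its line number (15) points at the call to describe.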
In most cases the above suffices.
In rare cases when it does not, analysing the arguments to the functions, which are also captured in their dumped stack frames, may help.
The values of the arguments are listed in their source-code order, from left to right; the only problem with interpreting them is "decoding" those arguments which are of "compound" types, such as strings, slices, user-defined struct types and so on.
Say, a string is a struct of two fields (a pointer to its bytes and a length), and in the list of arguments these fields come one after the other, "unwrapped".
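For example, with a hypothetical function func greet(name string), a call greet("hello") could show up in a traceback of that era (pre-register-ABI, i.e. before Go 1.17) roughly as

main.greet(0x4c7e4e, 0x5)
    /home/user/example/main.go:20 +0x3f

where the first word is the pointer to the string's backing bytes and the second (0x5) is its length, i.e. the two fields of the string header printed one after the other. (From Go 1.17 on, arguments may be passed in registers, so the printed values can be less reliable.)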
But let's not dig too deep for now. There are other things to explore here (say, I've touched on the memory exhaustion errors but did not explain how to approach them), but you're better off learning your way around by practice.
If you have any concrete questions while dealing with such problems, ask away, but be sure to include the stack trace of the crashed goroutine and describe what your own attempt at analysis yielded and what exactly you have a problem with.
There exists another approach you can use.
The GOTRACEBACK environment variable may be assigned a special value to tell the Go runtime to crash your program in a way friendly to a "regular" interactive debugger capable of working with core dumps — such as gdb.
For instance, you may enable dumping of core files and then let the Go runtime crash your process in such a way that the OS dumps its core:
$ ulimit -c unlimited
$ export GOTRACEBACK=crash
$ ./your_program
...
... your_program crashes
...
$ ls *core*
core
$ gdb -e ./your_program core
(gdb) thread apply all bt
* tracebacks follow *
(Actual debugging of the state captured by a core file is something your IDE or whatever should take care of, I suppose; I merely demonstrated how to run the gdb debugger.
Run help ulimit in bash to see what that ulimit incantation above was about.)
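As a small aside, if it is more convenient, the same traceback level can be requested from inside the program via the standard runtime/debug package; a minimal sketch (the surrounding program is hypothetical):

package main

import "runtime/debug"

func main() {
    // Equivalent to running with GOTRACEBACK=crash; a level lower than
    // the one in the GOTRACEBACK environment variable is ignored.
    debug.SetTraceback("crash")

    // ... the rest of your program ...
}

You still need the ulimit -c unlimited part in the shell so that the OS actually writes the core file.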
I am at my wit's end trying to debug a hard fault on an EFR32BG12 processor. I've been following the instructions in the Silicon Labs knowledge base here:
https://www.silabs.com/community/mcu/32-bit/knowledge-base.entry.html/2014/05/26/debug_a_hardfault-78gc
I've also been using the Keil app note here to fill in some details:
http://www.keil.com/appnotes/files/apnt209.pdf
I've managed to get the hard fault to occur quite consistently in one place. When the hard fault occurs, the code from the knowledge base article gives me the following values (pushed onto the stack by the processor before calling the hard fault handler):
Name   Type      Value             Location
~~~~   ~~~~      ~~~~~             ~~~~~~~~
cfsr   uint32_t  0x20000 (Hex)     0x2000078c
hfsr   uint32_t  0x40000000 (Hex)  0x20000788
mmfar  uint32_t  0xe000ed34 (Hex)  0x20000784
bfar   uint32_t  0xe000ed38 (Hex)  0x20000780
r0     uint32_t  0x0 (Hex)         0x2000077c
r1     uint32_t  0x8 (Hex)         0x20000778
r2     uint32_t  0x0 (Hex)         0x20000774
r3     uint32_t  0x0 (Hex)         0x20000770
r12    uint32_t  0x1 (Hex)         0x2000076c
lr     uint32_t  0xab61 (Hex)      0x20000768
pc     uint32_t  0x38dc8 (Hex)     0x20000764
psr    uint32_t  0x0 (Hex)         0x20000760
Looking at the Keil app note, I believe a CFSR value of 0x20000 indicates a Usage Fault with the INVSTATE bit set, i.e.:
INVSTATE: Invalid state: 0 = no invalid state 1 = the processor has
attempted to execute an instruction that makes illegal use of the
Execution Program Status Register (EPSR). When this bit is set, the PC
value stacked for the exception return points to the instruction that
attempted the illegal use of the EPSR. Potential reasons: a) Loading a
branch target address to PC with LSB=0. b) Stacked PSR corrupted
during exception or interrupt handling. c) Vector table contains a
vector address with LSB=0.
The PC value pushed onto the stack by the exception (provided by the code from the knowledge base article) seems to be 0x38dc8. If I go to this address in the Simplicity Studio "Disassembly" window, I see the following:
00038db8: str r5,[r5,#0x14]
00038dba: str r0,[r7,r1]
00038dbc: str r4,[r5,#0x14]
00038dbe: ldr r4,[pc,#0x1e4] ; 0x38fa0
00038dc0: strb r1,[r4,#0x11]
00038dc2: ldr r5,[r4,#0x64]
00038dc4: ldrb r3,[r4,#0x5]
00038dc6: movs r3,r6
00038dc8: strb r1,[r4,#0x15]
00038dca: ldr r4,[r4,#0x14]
00038dcc: cmp r7,#0x6f
00038dce: cmp r6,#0x30
00038dd0: str r7,[r6,#0x14]
00038dd2: lsls r6,r6,#1
00038dd4: movs r5,r0
00038dd6: movs r0,r0
The address appears to be well past the end of my code. If I look at the same address in the "Memory" window, this is what I see:
0x00038DC8 69647561 2E302F6F 00766177 00000005 audio/0.wav.....
0x00038DD8 00000000 000F4240 00000105 00000000 ....#B..........
0x00038DE8 00000000 00000000 00000005 00000000 ................
0x00038DF8 0001C200 00000500 00001000 00000000 .Â..............
0x00038E08 00000000 F00000F0 02F00001 0003F000 ....ð..ð..ð..ð..
0x00038E18 F00004F0 06010005 01020101 01011201 ð..ð............
0x00038E28 35010121 01010D01 6C363025 2E6E6775 !..5....%06lugn.
0x00038E38 00746164 00000001 000008D0 00038400 dat.....Ð.......
Curiously, "audio/0.wav" is a static string which is part of the firmware. If I understand correctly, what I've learned here is that PC somehow gets set to this point in memory, which of course is not a valid instruction and causes the hard fault.
To debug the issue, I need to know how PC came to be set to this incorrect value. I believe the LR register should give me an idea. The LR register pushed onto the stack by the exception seems to be 0xab61. If I look at this location, I see the following in the Disassembly window:
1270 dp->sect = clst2sect(fs, clst);
0000ab58: ldr r0,[r7,#0x10]
0000ab5a: ldr r1,[r7,#0x14]
0000ab5c: bl 0x00009904
0000ab60: mov r2,r0
0000ab62: ldr r3,[r7,#0x4]
0000ab64: str r2,[r3,#0x18]
It looks to me like the problem occurs during this call specifically:
0000ab5c: bl 0x00009904
This makes me think that the problem occurs as a result of a corrupt stack, which causes clst2sect to return to an invalid part of memory rather than to 0xab60. The code for clst2sect is pretty innocuous:
/*-----------------------------------------------------------------------*/
/* Get physical sector number from cluster number */
/*-----------------------------------------------------------------------*/
DWORD clst2sect (   /* !=0:Sector number, 0:Failed (invalid cluster#) */
    FATFS* fs,      /* Filesystem object */
    DWORD clst      /* Cluster# to be converted */
)
{
    clst -= 2;      /* Cluster number is origin from 2 */
    if (clst >= fs->n_fatent - 2) return 0;     /* Is it invalid cluster number? */
    return fs->database + fs->csize * clst;     /* Start sector number of the cluster */
}
Does this analysis sound about right?
I suppose the problem I've run into is that I have no idea what might cause this kind of behaviour... I've tried putting breakpoints in all of my interrupt handlers, to see if one of them might be corrupting the stack, but there doesn't seem to be any pattern--sometimes, no interrupt handler is called but the problem still occurs.
In that case, though, it's hard for me to see how a program might try to execute code at a location well past the actual end of the code... I feel like a function pointer might be a likely candidate, but in that case I would expect to see the problem show up, e.g., where a function pointer is used. However, I don't see any function pointers used near where the error is occurring.
Perhaps there is more information I can extract from the debug information I've given above? The problem is quite reproducible, so if there's something I have not tried, but which you think might give some insight, I would love to hear it.
Thanks for any help you can offer!
After about a month of chasing this one, I managed to identify the cause of the problem. I hope I can give enough information here that this will be useful to someone else.
In the end, the problem was caused by passing a pointer to a non-static local variable to a state machine which changed the value at that memory location later on. Because the local variable was no longer in scope, that memory location was a random point in the stack, and changing the value there corrupted the stack.
The problem was difficult to track down for two reasons:
Depending on how the code was compiled, the changed memory location could be something non-critical, like another local variable, which would cause a much subtler error. Only when I got lucky would the change affect the PC register and cause a hard fault.
Even when I found a version of the code that consistently generated a hard fault, the actual hard fault typically occurred somewhere up the call stack, when a function returned and popped the stack value into PC. This made it difficult to identify the cause of the problem--all I knew was that something was corrupting the stack before that function return.
A few tools were really helpful in identifying the cause of the problem:
Early on, using GPIO pins, I had identified a block of code where the hard fault usually occurred. I would toggle a pin high before entering the block and low when exiting it. Then I performed many tests, checking whether the pin was high or low when the hard fault occurred, and used a sort of binary search to determine the smallest block of code which consistently contained all the hard faults.
The hard fault pushes a number of important registers onto the stack. These helped me confirm where the PC register was becoming corrupt, and also helped me understand that it was becoming corrupt as a result of a stack corruption.
Starting somewhere before that block of code and stepping forward while keeping an eye on local variables, I was able to identify a function call that was corrupting the stack. I could confirm this using Simplicity Studio's memory view.
Finally, stepping through the offending function in detail, I realized that the problem was occurring when I dereferenced a stored pointer and wrote to that memory location. Looking back at where that pointer value was set, I realized it had been set to point to a non-static local variable that was now out of scope.
Thanks to @SeanHoulihane and @cooperised, who helped me eliminate a few possible causes and gave me a little more confidence with the debugging tools.
I have set a breakpoint on nt!NtWriteFile from WinDbg. I'm using kernel debugging, and I want to get the user stack + kernel stack trace when a certain program (for example, notepad.exe) ends up calling this API. When the breakpoint kicks in I do the following:
.reload /user
K
but the result is similar to this (in this case notepad.exe is the current process):
# ChildEBP RetAddr
00 8f5a8c34 76e96c73 nt!NtWriteFile
01 8f5a8c38 badb0d00 ntdll!KiFastSystemCall+0x3
02 8f5a8c3c 0320ef04 0xbadb0d00
03 8f5a8c40 00000000 0x320ef04
My questions are:
What is 0xbadb0d00? I always see this address.
Is the address 0x320ef04 the user-land function (inside notepad.exe in this case) from which the call begins? If so, would that be the full stack trace (user stack + kernel stack)?
Is there another, easier way to get this?
Thank you.
Updated:
As I read in this link (thanks to Thomas Weller), 0xbadb0d00 is used to initialize uninitialized memory in some circumstances. Now I have even more doubts. Why does the stack trace show uninitialized memory? Why does notepad.exe's stack trace not appear in the output if I'm in its context?
The Windows host I'm debugging is 32-bit Windows 7.
I know I am dealing with a managed thread but I have never managed to get !clrstack to work. I always get:
0:000> !clrstack
OS Thread Id: 0xaabb (0)
Child SP IP Call Site
GetFrameContext failed: 1
00000000 00000000
Admittedly I could use !dumpstack, but I can't figure out how to make it show the arguments. It only shows ChildEBP, Return Address and the function name. Besides, it mixes managed and unmanaged calls, and I'd like to focus only on the managed portions.
UPDATE
As requested by Thomas, !clrstack -i returns:
0:000> !clrstack -i
Loaded c:\cache\mscordbi.dll\53489464110000\mscordbi.dll
Loaded c:\cache\mscordacwks_x86_x86_4.0.30319.34209.dll\5348961E69d000\mscordacwks_x86_x86_4.0.30319.34209.dll
Dumping managed stack and managed variables using ICorDebug.
=================================================================
Child SP IP Call Site
003ad0bc 77d1f8e1 [NativeStackFrame]
Stack walk complete.
It's progress :-)
Please post the output from !dumpstack or k to double-check the call stack. Remember that !clrstack only displays the managed part of the call stack. Sometimes, when a managed thread (for example a thread-pool thread) has finished its work, it ends up waiting inside CLR code on a semaphore, and the remaining call stack is entirely unmanaged, so !clrstack displays nothing for it.
I am debugging a program using WinDbg.
At the crash site, the last two frames of the call stack are:
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
0251bfe8 6031f8da npdf!ProvideCoreHFT2+0x24db0
0251c000 011eb7a5 npdf!ProvideCoreHFT2+0x5ac1a
...
I want to find out how frame 1 calls frame 0. Since the return address of frame 0 is 6031f8da, I opened the disassembly window and jumped to that location; the code is:
...
6031f8d5 e8a6d0ffff call npdf!ProvideCoreHFT2+0x57cc0 (6031c980)
6031f8da 5f pop edi
...
My question is: the call instruction right before the return address calls npdf!ProvideCoreHFT2+0x57cc0, while the function in frame 0 is reported as npdf!ProvideCoreHFT2+0x24db0. Why does this inconsistency exist? How should I proceed?
Thank you very much!