How to prevent Go program from crashing after accidental panic? - go

Think of a large project which deals with tons of concurrent requests handled by its own goroutine. It happens that there is a bug in the code and one of these requests will cause panic due to a nil reference.
In Java, C# and many other languages, this would end up in a exception which would stop the request without any harm to other healthy requests. In go, that would crash the entire program.
AFAIK, I'd have to have recover() for every single new go routine creation. Is that the only way to prevent entire program from crashing?
UPDATE: adding recover() call for every gorouting creation seems OK. What about third-party libraries? If third party creates goroutines without recover() safe net, it seems there is NOTHING to be done.

If you go the defer-recover-all-the-things, I suggest investing some time to make sure that a clear error message is collected with enough information to promptly act on it.
Writing the panic message to stderr/stdout is not great as it will be very hard to find where the problem is. In my experience the best approach is to invest a bit of time to get your Go programs to handle errors in a reasonable way. errors.Wrap from "github.com/pkg/errors" for instance allows you to wrap all errors and get a stack-trace.
Recovering panic is often a necessary evil. Like you say, it's not ideal to crash the entire program just because one requested caused a panic. In most cases recovering panics will not back-fire, but it is possible for a program to end up in a undefined not-recoverable state that only a manual restart can fix. That being said, my suggestion in this case is to make sure your Go program exposes a way to create a core dump.
Here's how to write a core dump to stderr when SIGQUIT is sent to the Go program (eg. kill pid -QUIT)
go func() {
// Based on answers to this stackoverflow question:
// https://stackoverflow.com/questions/19094099/how-to-dump-goroutine-stacktraces
sigs := make(chan os.Signal, 1)
signal.Notify(sigs, syscall.SIGQUIT)
for {
<-sigs
fmt.Fprintln(os.Stderr, "=== received SIGQUIT ===")
fmt.Fprintln(os.Stderr, "*** goroutine dump...")
var buf []byte
var bufsize int
var stacklen int
// Create a stack buffer of 1MB and grow it to at most 100MB if
// necessary
for bufsize = 1e6; bufsize < 100e6; bufsize *= 2 {
buf = make([]byte, bufsize)
stacklen = runtime.Stack(buf, true)
if stacklen < bufsize {
break
}
}
fmt.Fprintln(os.Stderr, string(buf[:stacklen]))
fmt.Fprintln(os.Stderr, "*** end of dump")
}
}()

there is no way you can handle panic without recover function, a good practice would be using a middleware like function for your safe function, checkout this snippet
https://play.golang.org/p/d_fQWzXnlAm

Related

Unlocking mutex without defer [duplicate]

Going through the standard library, I see a lot functions similar to the following:
// src/database/sql/sql.go
func (dc *driverConn) removeOpenStmt(ds *driverStmt) {
dc.Lock()
defer dc.Unlock()
delete(dc.openStmt, ds)
}
...
func (db *DB) addDep(x finalCloser, dep interface{}) {
//println(fmt.Sprintf("addDep(%T %p, %T %p)", x, x, dep, dep))
db.mu.Lock()
defer db.mu.Unlock()
db.addDepLocked(x, dep)
}
// src/expvar/expvar.go
func (v *Map) addKey(key string) {
v.keysMu.Lock()
defer v.keysMu.Unlock()
v.keys = append(v.keys, key)
sort.Strings(v.keys)
}
// etc...
I.e.: simple functions with no returns and presumably no way to panic that are still deferring the unlock of their mutex. As I understand it, the overhead of a defer has been improved (and perhaps is still in the process of being improved), but that being said: Is there any reason to include a defer in functions like these? Couldn't these types of defers end up slowing down a high traffic function?
Always deferring things like Mutex.Unlock() and WaitGroup.Done() at the top of the function makes debugging and maintenance easier, since you see immediately that those pieces are handled correctly so you know that those important pieces are taken care of, and can quickly move on to other issues.
It's not a big deal in 3 line functions, but consistent-looking code is also just easier to read in general. Then as the code grows, you don't have to worry about adding an expression that may panic, or complicated early return logic, because the pre-existing defers will always work correctly.
Panic is a sudden (so it could be unpredicted or unprepared) violation of normal control flow. Potentially it can emerge from anything - quite often from external things - for example memory failure. defer mechanism gives an easy and quite cheap tool to perform exit operation. Thus not leave a system in broken state. This is important for locks in high load applications because it help not to lose locked resources and freeze the whole system in lock once.
And if for some moment code has no places for panic (hard to guess such system ;) but things evolve. Later this function would be more complex and able to throw panic.
Conclusion: Defer helps you to ensure your function will exit correctly if something “goes wring”. Also quite important it is future-proof - same reply for different failures.
So it’s a food style to put them even in easy functions. As a programmer you can see nothing is lost. And be more sure in a code.

Are there any advantages to having a defer in a simple, no return, non-panicking function?

Going through the standard library, I see a lot functions similar to the following:
// src/database/sql/sql.go
func (dc *driverConn) removeOpenStmt(ds *driverStmt) {
dc.Lock()
defer dc.Unlock()
delete(dc.openStmt, ds)
}
...
func (db *DB) addDep(x finalCloser, dep interface{}) {
//println(fmt.Sprintf("addDep(%T %p, %T %p)", x, x, dep, dep))
db.mu.Lock()
defer db.mu.Unlock()
db.addDepLocked(x, dep)
}
// src/expvar/expvar.go
func (v *Map) addKey(key string) {
v.keysMu.Lock()
defer v.keysMu.Unlock()
v.keys = append(v.keys, key)
sort.Strings(v.keys)
}
// etc...
I.e.: simple functions with no returns and presumably no way to panic that are still deferring the unlock of their mutex. As I understand it, the overhead of a defer has been improved (and perhaps is still in the process of being improved), but that being said: Is there any reason to include a defer in functions like these? Couldn't these types of defers end up slowing down a high traffic function?
Always deferring things like Mutex.Unlock() and WaitGroup.Done() at the top of the function makes debugging and maintenance easier, since you see immediately that those pieces are handled correctly so you know that those important pieces are taken care of, and can quickly move on to other issues.
It's not a big deal in 3 line functions, but consistent-looking code is also just easier to read in general. Then as the code grows, you don't have to worry about adding an expression that may panic, or complicated early return logic, because the pre-existing defers will always work correctly.
Panic is a sudden (so it could be unpredicted or unprepared) violation of normal control flow. Potentially it can emerge from anything - quite often from external things - for example memory failure. defer mechanism gives an easy and quite cheap tool to perform exit operation. Thus not leave a system in broken state. This is important for locks in high load applications because it help not to lose locked resources and freeze the whole system in lock once.
And if for some moment code has no places for panic (hard to guess such system ;) but things evolve. Later this function would be more complex and able to throw panic.
Conclusion: Defer helps you to ensure your function will exit correctly if something “goes wring”. Also quite important it is future-proof - same reply for different failures.
So it’s a food style to put them even in easy functions. As a programmer you can see nothing is lost. And be more sure in a code.

Handling SIGSEGV with recover?

The signal package states:
Synchronous signals are signals triggered by errors in program
execution: SIGBUS, SIGFPE, and SIGSEGV. These are only considered
synchronous when caused by program execution, not when sent using
os.Process.Kill or the kill program or some similar mechanism. In
general, except as discussed below, Go programs will convert a
synchronous signal into a run-time panic.
However, it seems recover() is not catching this.
Program:
package main
import (
"fmt"
"unsafe"
"log"
)
func seeAnotherDay() {
defer func() {
if p := recover(); p != nil {
err := fmt.Errorf("recover panic: panic call")
log.Println(err)
return
}
}()
panic("oops")
}
func notSoMuch() {
defer func() {
if p := recover(); p != nil {
err := fmt.Errorf("recover panic: sigseg")
log.Println(err)
return
}
}()
b := make([]byte, 1)
log.Println("access some memory")
foo := (*int)(unsafe.Pointer(uintptr(unsafe.Pointer(&b[0])) + uintptr(9999999999999999)))
fmt.Print(*foo + 1)
}
func main() {
seeAnotherDay()
notSoMuch()
}
Output:
2017/04/04 12:13:16 recover panic: panic call
2017/04/04 12:13:16 access some memory
unexpected fault address 0xb01dfacedebac1e
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0xb01dfacedebac1e pc=0x108aa8a]
goroutine 1 [running]:
runtime.throw(0x10b5807, 0x5)
/usr/local/go/src/runtime/panic.go:596 +0x95 fp=0xc420043ea8 sp=0xc420043e88
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:297 +0x28c fp=0xc420043ef8 sp=0xc420043ea8
main.notSoMuch()
/Users/kbrandt/src/sigseg/main.go:32 +0xca fp=0xc420043f78 sp=0xc420043ef8
main.main()
/Users/kbrandt/src/sigseg/main.go:37 +0x25 fp=0xc420043f88 sp=0xc420043f78
runtime.main()
/usr/local/go/src/runtime/proc.go:185 +0x20a fp=0xc420043fe0 sp=0xc420043f88
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc420043fe8 sp=0xc420043fe0
exit status 2
Is there any way I could handle SIGSEGV in a way localized to certain parts of the code?
Yes, you will want to use debug.SetPanicOnFault to convert faults at an unexpected (non-nil) address into panics from which you can recover. From the docs:
SetPanicOnFault controls the runtime's behavior when a program faults at an unexpected (non-nil) address. Such faults are typically caused by bugs such as runtime memory corruption, so the default response is to crash the program. Programs working with memory-mapped files or unsafe manipulation of memory may cause faults at non-nil addresses in less dramatic situations; SetPanicOnFault allows such programs to request that the runtime trigger only a panic, not a crash. SetPanicOnFault applies only to the current goroutine. It returns the previous setting.
For the localization of the impact, note that SetPanicOnFault is set at the goroutine level, so a single goroutine can deal with known unsafe access.
When you encounter a sigsegv, you're really in an all-bets-are-off situation with regards to the program state. The only generally safe thing to do is to stop everything, and possibly have the system dump your memory to file for debugging, which is what Go does. There isn't really any way to "protect the main runtime" in this situation.
If you have a runtime that is running code that is untrusted or unsafe, you really should isolate it into a separate process instead. And if you are the one running the code received from the users (rather than the users themselves), this process should most definitely be sandboxed.
So my advice is, do either of the following:
Let it crash and let the user handle it from there. The user writing code causing a sigsegv in Go normally requires more or less active attempts of shooting in the direction of one's foot, so it should be rare and arguably filed under things they are doing at their own risk anyway.
Separate it into a supervisor process and an "untrusted/unsafe" child process, where the supervisor picks up improper exit conditions from the child process and reports them appropriately.

Go "panic: sync: unlock of unlocked mutex" without a known reason

I have a cli application in Go (still in development) and no changes were made in source code neither on dependencies but all of a sudden it started to panic panic: sync: unlock of unlocked mutex.
The only place I'm running concurrent code is to handle when program is requested to close:
func handleProcTermination() {
c := make(chan os.Signal, 1)
signal.Notify(c, os.Interrupt)
go func() {
<-c
curses.Endwin()
os.Exit(0)
}()
defer curses.Endwin()
}
Only thing I did was to rename my $GOPATH and work space folder. Can this operation cause such error?
Have some you experienced any related problem without having any explanation? Is there a rational check list that would help to find the cause of the problem?
Ok, after some unfruitful debugging sessions, as a last resort, I simply wiped all third party code (dependencies) from the workspace:
cd $GOPATH
rm -rf pkg/ bin/ src/github.com src/golang.org # the idea is to remove all except your own source
Used go get to get all used dependencies again:
go get github.com/yadayada/yada
go get # etc
And the problem is gone! Application is starting normally and tests are passing. No startup panics anymore. It looks like this problem happens when you mv your work space folder but I'm not 100% sure yet. Hope it helps someone else.
From now on, re install dependencies will be my first step when weird panic conditions like that suddenly appear.
You're not giving much information to go on, so my answer is generic.
In theory, bugs in concurrent code can remain unnoticed for a long time and then suddenly show up. In practice, if the bug is easily repeatable (happens nearly every run) this usually indicates that something did change in the code or environment.
The solution: debug.
Knowing what has changed can be helpful to identify the bug. In this case, it appears that lock/unlock pairs or not matching up. If you are not passing locks between threads, you should be able to find a code path within the thread that has not acquired the lock, or has released it early. It may be helpful to put assertions at certain points to validate that you are holding the lock when you think you are.
Make sure you don't copy the lock somewhere.
What can happen with seemingly bulletproof code in concurrent environments is that the struct including the code gets copied elsewhere, which results in the underlying lock being different.
Consider this code snippet:
type someStruct struct {
lock sync.Mutex
}
func (s *someStruct) DoSomethingUnderLock() {
s.lock.Lock()
defer s.lock.Unlock() // This will panic
time.Sleep(200 * time.Millisecond)
}
func main() {
s1 := &someStruct{}
go func() {
time.Sleep(100 * time.Millisecond) // Wait until DoSomethingUnderLock takes the lock
s2 := &someStruct{}
*s1 = *s2
}()
s1.DoSomethingUnderLock()
}
*s1 = *s2 is the key here - it results in a different lock being used by the same receiver function and if the struct is replaced while the lock is taken, we'll get sync: unlock of unlocked mutex.
What makes it harder to debug is that someStruct might be nested in another struct (and another, and another), and if the outer struct gets replaced (as long as someStruct is not a reference there) in a similar manner, the result will be the same.
If this is the case, you can use a reference to the lock (or the whole struct) instead. Now you need to initialize it, but it's a small price that might save you some obscure bugs. See the modified code that doesn't panic here.
For those who come here and didn't solve your problem. Check If the application is compiled in one version of Linux but running in another version. At least in my case it happened.

Go bug in ioutil.ReadFile()

I am running a program in Go which sends data continuously after reading a file /proc/stat.
Using ioutil.ReadFile("/proc/stat")
After running for about 14 hrs i got err: too many files open /proc/stat
Click here for snippet of code.
I doubt that defer f.Close is ignored by Go sometimes or it is skipping it.
The snippet of code (in case play.golang.org dies sooner than stackoverflow.com):
package main
import ("fmt";"io/ioutil")
func main() {
for {
fmt.Println("Hello, playground")
fData,err := ioutil.ReadFile("/proc/stat")
if err != nil {
fmt.Println("Err is ",err)
}
fmt.Println("FileData",string(fData))
}
}
The reason probably is that somewhere in your program:
you are forgetting to close files, or
you are leaning on the garbage collector to automatically close files on object finalization, but Go's conservative garbage collector fails to do so. In this case you should check your program's memory consumption (whether it is steadily increasing while the program is running).
In either case, try to check the contents of /proc/PID/fd to see whether the number of open files is increasing while the program is running.
If you are sure you Do the f.Close(),it still has the proble,Maybe it is because your other connection,for example the connection to MYSQL,also will be cause the problem,especially,in a loop,and you forget to close the connection.
Always do :
db.connection....
**defer db.Close()**
If it is in loop
loop
db.connection....
**defer db.Close()**
end
Do not put the db.connection before the loop

Resources