Ensuring goroutine cleanup, bestpractice - go

I have a fundamental understanding problem about how to make sure that spawned goroutines are "closed" properly in the context of long-running processes. I watched talks regarding that topic and read about best practices. In order to understand my question please refer to the video "Advanced Go Concurrency Patterns" here
For the following, if you run code on your machine please export the environment variable GOTRACEBACK=all so you are able to see routine states after panic.
I put the code for the original example here: naive (it does not execute on go playground, I guess bacause a time statement is used. Please copy the code and execute it locally)
The result of the panic of the naive implementation after execution is
panic: show me the stacks
goroutine 1 [running]:
panic(0x48a680, 0xc4201d8480)
/usr/lib/go/src/runtime/panic.go:500 +0x1a1
main.main()
/home/flx/workspace/go/go-rps/playground/ball-naive.go:18 +0x16b
goroutine 5 [chan receive]:
main.player(0x4a4ec4, 0x2, 0xc42006a060)
/home/flx/workspace/go/go-rps/playground/ball-naive.go:23 +0x61
created by main.main
/home/flx/workspace/go/go-rps/playground/ball-naive.go:13 +0x76
goroutine 6 [chan receive]:
main.player(0x4a4ec6, 0x2, 0xc42006a060)
/home/flx/workspace/go/go-rps/playground/ball-naive.go:23 +0x61
created by main.main
/home/flx/workspace/go/go-rps/playground/ball-naive.go:14 +0xad
exit status 2
That demonstrates the underlying problem of leaving dangling goroutines on the system, which is especially bad for long running processes.
So for my personal understanding I tried two slightly more sophisticated variants to be found here:
for-select with default
generator pattern with quit channel
(again, not executable on the playground, cause "process takes too long")
The first solution is not fitting for various reasons, even leading to non-determinism in executed steps, depending on goroutine execution speed.
Now I thought -- and here finally comes the question! -- that the second solution with the quit channel would be appropriate to eliminate all executional traces from the system before exiting. Anyhow, "sometimes" the program exits too fast and the panic reports an additional goroutine runnable still residing on the system. The panic output:
panic: show me the stacks
goroutine 1 [running]:
panic(0x48d8e0, 0xc4201e27c0)
/usr/lib/go/src/runtime/panic.go:500 +0x1a1
main.main()
/home/flx/workspace/go/go-rps/playground/ball-perfect.go:20 +0x1a9
goroutine 20 [runnable]:
main.player.func1(0xc420070060, 0x4a8986, 0x2, 0xc420070120)
/home/flx/workspace/go/go-rps/playground/ball-perfect.go:27 +0x211
created by main.player
/home/flx/workspace/go/go-rps/playground/ball-perfect.go:36 +0x7f
exit status 2
My question is: that should not happen, right? I do use a quit channel to cleanup state before stepping forward to panicking.
I did a final try of implementing safe cleanup behavior here:
artificial wait time for runnables to close
Anyhow, that solution does not feel right and may as well not be applicable to large amounts of runnables?
What would be the recommended and most idiomatic pattern to ensure correct cleanup?
Thanks for your time

Your are fooled by the output: Your "generator pattern with quit channel" works perfectly fine, the two goroutines actually are terminated properly.
You see them in the trace because you panic too early. Remember: You have to goroutines running concurrently with main. main "stops" these goroutines by signaling on the quit channel. After these two sends on line 18 and 19 the two receives on line 32 have happened. And nothing more! You still have three goroutines running: Main is between lines 19 and 20 and the player goroutines are between lines 32 and 33. If now the panic in main happens before the return in player then the player goroutines are still there and are show in the panic stacktrace. These goroutines would have ended several milliseconds later if only the scheduler would have had time to execute the return on line 33 (which it hadn't as you killed it by panicking).
This is an instance of the "main ends to early to see concurrent goroutines do work" problem asked once a month here. You do see the concorrent goroutines doing work, but not all work. You might try sleeping 2 milliseconds before the panic and your player goroutines will have time to execute the return and everything is fine.

Related

Why doesn't panic show all running goroutines?

Page 253 of The Go Programming Language states:
... if instead of returning from main in the event of cancellation, we execute a call to panic, then the runtime will dump the stack of every goroutine in the program.
This code deliberately leaks a goroutine by waiting on a channel that never has anything to receive:
package main
import (
"fmt"
"time"
)
func main() {
never := make(chan struct{})
go func() {
defer fmt.Println("End of child")
<-never
}()
time.Sleep(10 * time.Second)
panic("End of main")
}
However, the runtime only lists the main goroutine when panic is called:
panic: End of main
goroutine 1 [running]:
main.main()
/home/simon/panic/main.go:15 +0x7f
exit status 2
If I press Ctrl-\ to send SIGQUIT during the ten seconds before main panics, I do see the child goroutine listed in the output:
goroutine 1 [sleep]:
time.Sleep(0x2540be400)
/usr/lib/go-1.17/src/runtime/time.go:193 +0x12e
main.main()
/home/simon/panic/main.go:14 +0x6c
goroutine 18 [chan receive]:
main.main.func1()
/home/simon/panic/main.go:12 +0x76
created by main.main
/home/simon/panic/main.go:10 +0x5d
I thought maybe the channel was getting closed as panic runs (which still wouldn't guarantee the deferred fmt.Println had time to execute), but I get the same behaviour if the child goroutine does a time.Sleep instead of waiting on a channel.
I know there are ways to dump goroutine stacktraces myself, but my question is why doesn't panic behave as described in the book? The language spec only says that a panic will terminate the program, so is the book simply describing implementation-dependent behaviour?
Thanks to kostix for pointing me to the GOTRACEBACK runtime environment variable. Setting this to all instead of leaving it at the default of single restores the behaviour described in TGPL. Note that this variable is significant to the runtime, but you can't manipulate it with go env.
The default to only list the panicking goroutine is a change in go 1.6 - my edition of the book is copyrighted 2016 and gives go 1.5 as the prequisite for its example code, so it must predate the change. It's interesting reading the change discussion that there was concern about hiding useful information (as the recipient of many an incomplete error report, I can sympathise with this), but nobody called out the issue of scaling to large production systems that kostix mentioned.

Explaining deadlocks with a single lock from The Little Go Book

I'm reading The Little Go Book.
Page 76 demonstrates how you can deadlock with a single lock:
var (
lock sync.Mutex
)
func main() {
go func() { lock.Lock() }()
time.Sleep(time.Millisecond * 10)
lock.Lock()
}
Running this results in a deadlock as explained by the author. However, what I don't understand is why.
I changed the program to this:
var (
lock sync.Mutex
)
func main() {
go func() { lock.Lock() }()
lock.Lock()
}
My expectation was that a deadlock would still be thrown. But it wasn't.
Could someone explain to me what's happening here please?
The only scenario I can think of that explains this is the following (but this is guesswork):
First example
The lock is acquired in the first goroutine
The call to time.Sleep() ensures the lock is acquired
The main function attempts to acquire the lock resulting in a deadlock
Program exits
Second example
The lock is acquired in the first goroutine but this takes some time to happen (??)
Since there is no delay the main function acquires the lock before the goroutine can
Program exits
In the first example, main sleeps long enough to give the child goroutine the opportunity to start and acquire the lock. That goroutine then will merrily exit without releasing the lock.
By the time main resumes its control flow, the shared mutex is locked so the next attempt to acquire it will block forever. Since at this point main is the only routine left alive, blocking forever results in a deadlock.
In the second example, without the call to time.Sleep, main proceeds straight away to acquire the lock. This succeeds, so main goes ahead and exits. The child goroutine would then block forever, but since main has exited, the program just terminates, without deadlock.
By the way, even if main didn't exit, as long as there is at least one goroutine which is not blocking, there's no deadlock. For this purpose, time.Sleep is not a blocking operation, it simply pauses execution for the specified time duration.
go shows the deadlock error when all goroutines (including the main) are asleep.
in your first example, the inside goroutine is executed and terminated after he call mutex.Lock(). then the main goroutine tries to lock again but it goes asleep waiting for the opportunity to occupy the lock. so now we have all the goroutines(the main one) in the program in asleep mode which will cause a deadlock error !
it is important to understand this because a deadlock may happen but it will not always show an error if there still a running goroutine. which mostly what will happen in production. error will be reported only when the whole program get in a deadlock.

When my simple Go program run ,Why the result is deadlock?

This is my entire Go code! What confused me is that case balances <- balance: did't occurs.I dont know why?
package main
import (
"fmt"
)
func main() {
done := make(chan int)
var balance int
balances := make(chan int)
balance = 1
go func() {
fmt.Println(<-balances)
done <- 1
}()
select {
case balances <- balance:
fmt.Println("done case")
default:
fmt.Println("default case")
}
<-done
}
default case
fatal error: all goroutines are asleep - deadlock!
goroutine 1 [chan receive]:
main.main()
/tmp/sandbox575832950/prog.go:29 +0x13d
goroutine 18 [chan receive]:
main.main.func1()
/tmp/sandbox575832950/prog.go:17 +0x38
created by main.main
/tmp/sandbox575832950/prog.go:16 +0x97
The main goroutine executes the select before the anonymous goroutine function executes the receive from balances. The main goroutine executes the default clause in the select because there is no ready receiver on balances. The main goroutine continues on to receive on done.
The goroutine blocks on receive from balances because there is no sender. Main continued past the send by taking the default clause.
The main goroutine blocks on receive from done because there is no sender. The goroutine is blocked on receive from balances.
Fix by replacing the select statement with balances <- balance. The default clause causes the problem. When the the default class is removed, all that remains in the select is send to balances.
Because of concurrency, there's no guarantee that the goroutine will execute before the select. We can see this by adding a print to the goroutine.
go func() {
fmt.Println("Here")
fmt.Println(<-balances)
done <- 1
}()
$ go run test.go
default case
Here
fatal error: all goroutines are asleep - deadlock!
...
If the select runs first, balances <- balance would block; balances has no buffer and nothing is trying to read from it. case balances <- balance would block so select skips it and executes its default.
Then the goroutine runs and blocks reading balances. Meanwhile the main code blocks reading done. Deadlock.
You can solve this by either removing the default case from the select and allowing it to block until balances is ready to be written to.
select {
case balances <- balance:
fmt.Println("done case")
}
Or you can add a buffer to balances so it can be written to before it is read from. Then case balances <- balance does not block.
balances := make(chan int, 1)
What confused me is that case balances <- balance: did't occurs
To be specific: it's because of select with a default case.
Whenever you create a new goroutine with go ...(), there is no guarantee about whether the invoking goroutine, or the invoked goroutine, will run next.
In practice it's likely that the next statements in the invoking goroutine will execute next (there being no particularly good reason to stop it). Of course, we should write programs that function correctly all the time, not just some, most, or even almost all the time! Concurrent programming with go ...() is all about synchronizing the goroutines so that the intended behavior must occur. Channels can do that, if used properly.
I think the balances channel can receive data
It's an unbuffered channel, so it can receive data if someone is reading from it. Otherwise, that write to the channel will block. Which brings us back to select.
Since you provided a default case, it's quite likely that the goroutine that invoked go ...() will continue to execute, and select that can't immediately choose a different case, will choose default. So it would be very unlikely for the invoked goroutine to be ready to read from balances before the main goroutine had already proceeded to try to write to it, failed, and gone on to the default case.
You can solve this by either removing the default case from the select and allowing it to block until balances is ready to be written to.
You sure can, as #Schwern points out. But it's important that you understand you don't necessarily need to use select to use channels. Instead of a select with just one case, you could instead just write
balances <- balance
fmt.Println("done")
select is not required in this case, default is working against you, and there's just one case otherwise, so there's no need for select. You want the main function to block on that channel.
you can add a buffer to balances so it can be written to before it is read from.
Sure. But again, important to understand that the fact that a channel might block both sender and receiver until they're both ready to communicate , is a valid, useful, and common use of channels. Unbuffered channels are not the cause of your problem - providing a default case for your select, and thus a path for unintended behavior, is the cause.

Cannot understand go test -race : RACE: DATA WARNING stack trace

I ran into a DATA RACE warning while testing my project, and was wondering if anyone would be kind enough to help me decipher the problem. I have never attempted testing go routines in the past and am finding it hard to wrap my head around data races.
I have provided a link in the description to the open issue, with the trace in the issue description.
I would really appreciate some help, just from the aspect of learning to debug similar issues and writing better tests for go routines in the future.
https://github.com/nitishm/vegeta-server/issues/52
A snippet of the trace is provided below as well
=== RUN Test_dispatcher_Cancel_Error_completed
INFO[0000] creating new dispatcher component=dispatcher
INFO[0000] starting dispatcher component=dispatcher
INFO[0000] dispatching new attack ID=d63a79ac-6f51-486e-845d-077c8c76168a Status=scheduled component=dispatcher
==================
WARNING: DATA RACE
Read at 0x00c0000f8d68 by goroutine 8:
vegeta-server/internal/dispatcher.(*task).Complete()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:116 +0x61
vegeta-server/internal/dispatcher.run()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:213 +0x17a
Previous write at 0x00c0000f8d68 by goroutine 7:
vegeta-server/internal/dispatcher.(*task).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:107 +0x12a
vegeta-server/internal/dispatcher.(*dispatcher).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/dispatcher.go:109 +0xb5f
Goroutine 8 (running) created at:
vegeta-server/internal/dispatcher.(*task).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:105 +0x11c
vegeta-server/internal/dispatcher.(*dispatcher).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/dispatcher.go:109 +0xb5f
Goroutine 7 (running) created at:
vegeta-server/internal/dispatcher.Test_dispatcher_Cancel_Error_completed()
/Users/nitishm/vegeta-server/internal/dispatcher/dispatcher_test.go:249 +0x545
testing.tRunner()
/usr/local/go/src/testing/testing.go:827 +0x162
==================
==================
WARNING: DATA RACE
Write at 0x00c0000f8d98 by goroutine 8:
vegeta-server/internal/dispatcher.(*task).SendUpdate()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:164 +0x70
vegeta-server/internal/dispatcher.(*task).Complete()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:128 +0x20e
vegeta-server/internal/dispatcher.run()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:213 +0x17a
Previous write at 0x00c0000f8d98 by goroutine 7:
vegeta-server/internal/dispatcher.(*task).SendUpdate()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:164 +0x70
vegeta-server/internal/dispatcher.(*task).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:109 +0x15d
vegeta-server/internal/dispatcher.(*dispatcher).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/dispatcher.go:109 +0xb5f
Goroutine 8 (running) created at:
vegeta-server/internal/dispatcher.(*task).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:105 +0x11c
vegeta-server/internal/dispatcher.(*dispatcher).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/dispatcher.go:109 +0xb5f
Goroutine 7 (running) created at:
vegeta-server/internal/dispatcher.Test_dispatcher_Cancel_Error_completed()
/Users/nitishm/vegeta-server/internal/dispatcher/dispatcher_test.go:249 +0x545
testing.tRunner()
/usr/local/go/src/testing/testing.go:827 +0x162
==================
INFO[0002] canceling attack ID=d63a79ac-6f51-486e-845d-077c8c76168a ToCancel=true component=dispatcher
ERRO[0002] failed to cancel task ID=d63a79ac-6f51-486e-845d-077c8c76168a ToCancel=true component=dispatcher error="cannot cancel task d63a79ac-6f51-486e-845d-077c8c76168a with status completed"
WARN[0002] gracefully shutting down the dispatcher component=dispatcher
--- FAIL: Test_dispatcher_Cancel_Error_completed (2.01s)
testing.go:771: race detected during execution of test
As far as I can understand it:
Read at 0x00c0000f8d68 by goroutine 8: and Previous write at 0x00c0000f8d68 by goroutine 7
means that both goroutines 8 and 7 are reading from and writing to the same location. If you look at the lines pointed to by the error:
goroutine 8 on 116:
if t.status != models.AttackResponseStatusRunning {
goroutine 7 on 107:
t.status = models.AttackResponseStatusRunning
You can see that the goroutines are accessing the task's state without any synchronization and that, as you already know, can cause a race condition.
So if your program allows access to a single task by multiple goroutines you need to ensure that no data race occurs by using a mutex lock for example.

Discrepancies between Go Playground and Go on my machine?

To settle some misunderstandings I have about goroutines, I went to the Go playground and ran this code:
package main
import (
"fmt"
)
func other(done chan bool) {
done <- true
go func() {
for {
fmt.Println("Here")
}
}()
}
func main() {
fmt.Println("Hello, playground")
done := make(chan bool)
go other(done)
<-done
fmt.Println("Finished.")
}
As I expected, Go playground came back with an error: Process took too long.
This seems to imply that the goroutine created within other runs forever.
But when I run the same code on my own machine, I get this output almost instantaneously:
Hello, playground.
Finished.
This seems to imply that the goroutine within other exits when the main goroutine finishes. Is this true? Or does the main goroutine finish, while the other goroutine continues to run in the background?
Edit: Default GOMAXPROCS has changed on the Go Playground, it now defaults to 8. In the "old" days it defaulted to 1. To get the behavior described in the question, set it to 1 explicitly with runtime.GOMAXPROCS(1).
Explanation of what you see:
On the Go Playground, GOMAXPROCS is 1 (proof).
This means one goroutine is executed at a time, and if that goroutine does not block, the scheduler is not forced to switch to other goroutines.
Your code (like every Go app) starts with a goroutine executing the main() function (the main goroutine). It starts another goroutine that executes the other() function, then it receives from the done channel - which blocks. So the scheduler must switch to the other goroutine (executing other() function).
In your other() function when you send a value on the done channel, that makes both the current (other()) and the main goroutine runnable. The scheduler chooses to continue to run other(), and since GOMAXPROCS=1, main() is not continued. Now other() launches another goroutine executing an endless loop. The scheduler chooses to execute this goroutine which takes forever to get to a blocked state, so main() is not continued.
And then the timeout of the Go Playground's sandbox comes as an absolution:
process took too long
Note that the Go Memory Model only guarantees that certain events happen before other events, you have no guarantee how 2 concurrent goroutines are executed. Which makes the output non-deterministic.
You are not to question any execution order that does not violate the Go Memory Model. If you want the execution to reach certain points in your code (to execute certain statements), you need explicit synchronization (you need to synchronize your goroutines).
Also note that the output on the Go Playground is cached, so if you run the app again, it won't be run again, but instead the cached output will be presented immediately. If you change anything in the code (e.g. insert a space or a comment) and then you run it again, it then will be compiled and run again. You will notice it by the increased response time. Using the current version (Go 1.6) you will see the same output every time though.
Running locally (on your machine):
When you run it locally, most likely GOMAXPROCS will be greater than 1 as it defaults to the number of CPU cores available (since Go 1.5). So it doesn't matter if you have a goroutine executing an endless loop, another goroutine will be executed simultaneously, which will be the main(), and when main() returns, your program terminates; it does not wait for other non-main goroutines to complete (see Spec: Program execution).
Also note that even if you set GOMAXPROCS to 1, your app will most likely exit in a "short" time as the scheduler imlementation will switch to other goroutines and not just execute the endless loop forever (however, as stated above, this is non-deterministic). And when it does, it will be the main() goroutine, and so when main() finishes and returns, your app terminates.
Playing with your app on the Go Playground:
As mentioned, by default GOMAXPROCS is 1 on the Go Playground. However it is allowed to set it to a higher value, e.g.:
runtime.GOMAXPROCS(2)
Without explicit synchronization, execution still remains non-deterministic, however you will observe a different execution order and a termination without running into a timeout:
Hello, playground
Here
Here
Here
...
<Here is printed 996 times, then:>
Finished.
Try this variant on the Go Playground.
What you will see on screen is nondeterministic. Or more precisely if by any chance the true value you pass to channel is delayed you would see some "Here".
But usually the Stdout is buffered, it means it's not printed instantaneously but the data gets accumulated and after it gets to maximum buffer size it's printed. In your case before the "here" is printed the main function is already finished thus the process finishes.
The rule of thumb is: main function must be alive otherwise all other goroutines gets killed.

Resources