I ran into a DATA RACE warning while testing my project, and was wondering if anyone would be kind enough to help me decipher the problem. I have never attempted testing goroutines in the past and am finding it hard to wrap my head around data races.
I have provided a link below to the open issue, with the trace in the issue description.
I would really appreciate some help, just from the aspect of learning to debug similar issues and writing better tests for goroutines in the future.
https://github.com/nitishm/vegeta-server/issues/52
A snippet of the trace is provided below as well.
=== RUN Test_dispatcher_Cancel_Error_completed
INFO[0000] creating new dispatcher component=dispatcher
INFO[0000] starting dispatcher component=dispatcher
INFO[0000] dispatching new attack ID=d63a79ac-6f51-486e-845d-077c8c76168a Status=scheduled component=dispatcher
==================
WARNING: DATA RACE
Read at 0x00c0000f8d68 by goroutine 8:
vegeta-server/internal/dispatcher.(*task).Complete()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:116 +0x61
vegeta-server/internal/dispatcher.run()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:213 +0x17a
Previous write at 0x00c0000f8d68 by goroutine 7:
vegeta-server/internal/dispatcher.(*task).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:107 +0x12a
vegeta-server/internal/dispatcher.(*dispatcher).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/dispatcher.go:109 +0xb5f
Goroutine 8 (running) created at:
vegeta-server/internal/dispatcher.(*task).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:105 +0x11c
vegeta-server/internal/dispatcher.(*dispatcher).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/dispatcher.go:109 +0xb5f
Goroutine 7 (running) created at:
vegeta-server/internal/dispatcher.Test_dispatcher_Cancel_Error_completed()
/Users/nitishm/vegeta-server/internal/dispatcher/dispatcher_test.go:249 +0x545
testing.tRunner()
/usr/local/go/src/testing/testing.go:827 +0x162
==================
==================
WARNING: DATA RACE
Write at 0x00c0000f8d98 by goroutine 8:
vegeta-server/internal/dispatcher.(*task).SendUpdate()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:164 +0x70
vegeta-server/internal/dispatcher.(*task).Complete()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:128 +0x20e
vegeta-server/internal/dispatcher.run()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:213 +0x17a
Previous write at 0x00c0000f8d98 by goroutine 7:
vegeta-server/internal/dispatcher.(*task).SendUpdate()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:164 +0x70
vegeta-server/internal/dispatcher.(*task).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:109 +0x15d
vegeta-server/internal/dispatcher.(*dispatcher).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/dispatcher.go:109 +0xb5f
Goroutine 8 (running) created at:
vegeta-server/internal/dispatcher.(*task).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/task.go:105 +0x11c
vegeta-server/internal/dispatcher.(*dispatcher).Run()
/Users/nitishm/vegeta-server/internal/dispatcher/dispatcher.go:109 +0xb5f
Goroutine 7 (running) created at:
vegeta-server/internal/dispatcher.Test_dispatcher_Cancel_Error_completed()
/Users/nitishm/vegeta-server/internal/dispatcher/dispatcher_test.go:249 +0x545
testing.tRunner()
/usr/local/go/src/testing/testing.go:827 +0x162
==================
INFO[0002] canceling attack ID=d63a79ac-6f51-486e-845d-077c8c76168a ToCancel=true component=dispatcher
ERRO[0002] failed to cancel task ID=d63a79ac-6f51-486e-845d-077c8c76168a ToCancel=true component=dispatcher error="cannot cancel task d63a79ac-6f51-486e-845d-077c8c76168a with status completed"
WARN[0002] gracefully shutting down the dispatcher component=dispatcher
--- FAIL: Test_dispatcher_Cancel_Error_completed (2.01s)
testing.go:771: race detected during execution of test
As far as I can understand it:
Read at 0x00c0000f8d68 by goroutine 8: and Previous write at 0x00c0000f8d68 by goroutine 7
means that both goroutines 8 and 7 are reading from and writing to the same location. If you look at the lines pointed to by the error:
goroutine 8 on line 116:
if t.status != models.AttackResponseStatusRunning {
goroutine 7 on line 107:
t.status = models.AttackResponseStatusRunning
You can see that the goroutines are accessing the task's state without any synchronization and that, as you already know, can cause a race condition.
So if your program allows access to a single task by multiple goroutines, you need to ensure that no data race occurs, for example by guarding the shared state with a mutex, as sketched below.
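For illustration, here is a minimal sketch of what that could look like. The task type, status values, and method bodies below are hypothetical stand-ins, not the actual vegeta-server code; only the locking pattern is the point:

package main

import (
	"fmt"
	"sync"
)

// task is a hypothetical stand-in for the dispatcher's task type;
// mu guards every access to status.
type task struct {
	mu     sync.Mutex
	status string // e.g. "scheduled", "running", "completed"
}

// Run writes the status under the lock.
func (t *task) Run() {
	t.mu.Lock()
	t.status = "running"
	t.mu.Unlock()
}

// Complete reads and writes the status under the same lock, so it can
// no longer race with Run executing in another goroutine.
func (t *task) Complete() error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.status != "running" {
		return fmt.Errorf("cannot complete task with status %s", t.status)
	}
	t.status = "completed"
	return nil
}

func main() {
	t := &task{status: "scheduled"}
	go t.Run()
	_ = t.Complete() // may still fail logically, but the data race is gone
}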
Related
Page 253 of The Go Programming Language states:
... if instead of returning from main in the event of cancellation, we execute a call to panic, then the runtime will dump the stack of every goroutine in the program.
This code deliberately leaks a goroutine by waiting on a channel that never has anything to receive:
package main

import (
	"fmt"
	"time"
)

func main() {
	never := make(chan struct{})
	go func() {
		defer fmt.Println("End of child")
		<-never
	}()
	time.Sleep(10 * time.Second)
	panic("End of main")
}
However, the runtime only lists the main goroutine when panic is called:
panic: End of main
goroutine 1 [running]:
main.main()
/home/simon/panic/main.go:15 +0x7f
exit status 2
If I press Ctrl-\ to send SIGQUIT during the ten seconds before main panics, I do see the child goroutine listed in the output:
goroutine 1 [sleep]:
time.Sleep(0x2540be400)
/usr/lib/go-1.17/src/runtime/time.go:193 +0x12e
main.main()
/home/simon/panic/main.go:14 +0x6c
goroutine 18 [chan receive]:
main.main.func1()
/home/simon/panic/main.go:12 +0x76
created by main.main
/home/simon/panic/main.go:10 +0x5d
I thought maybe the channel was getting closed as panic runs (which still wouldn't guarantee the deferred fmt.Println had time to execute), but I get the same behaviour if the child goroutine does a time.Sleep instead of waiting on a channel.
I know there are ways to dump goroutine stacktraces myself, but my question is why doesn't panic behave as described in the book? The language spec only says that a panic will terminate the program, so is the book simply describing implementation-dependent behaviour?
Thanks to kostix for pointing me to the GOTRACEBACK runtime environment variable. Setting this to all instead of leaving it at the default of single restores the behaviour described in TGPL. Note that this variable is significant to the runtime, but you can't manipulate it with go env.
The default of only listing the panicking goroutine is a change in Go 1.6; my edition of the book is copyrighted 2016 and gives Go 1.5 as the prerequisite for its example code, so it must predate the change. It's interesting reading the change discussion that there was concern about hiding useful information (as the recipient of many an incomplete error report, I can sympathise with this), but nobody called out the issue of scaling to large production systems that kostix mentioned.
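For completeness, the same behaviour can also be requested from inside the program via runtime/debug.SetTraceback (available since Go 1.6), which takes the same values as the GOTRACEBACK environment variable. A minimal sketch, reusing the leaking-goroutine example from above:

package main

import (
	"fmt"
	"runtime/debug"
	"time"
)

func main() {
	// Same effect as running with GOTRACEBACK=all: an unrecovered
	// panic dumps the stacks of all goroutines, not just this one.
	debug.SetTraceback("all")

	never := make(chan struct{})
	go func() {
		defer fmt.Println("End of child")
		<-never
	}()

	time.Sleep(10 * time.Second)
	panic("End of main")
}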
I'm facing a problem in golang:
package main

import (
	"fmt"
	"time"
)

var a = 0

func main() {
	go func() {
		for {
			a = a + 1
		}
	}()
	time.Sleep(time.Second)
	fmt.Printf("result=%d\n", a)
}
expected: result=(a big int number)
result: result=0
You have a race condition. Run your program with the -race flag:
go run -race main.go
==================
WARNING: DATA RACE
Read at 0x0000005e9600 by main goroutine:
main.main()
/home/jack/Project/GoProject/src/gitlab.com/hooshyar/GoNetworkLab/StackOVerflow/race/main.go:17 +0x6c
Previous write at 0x0000005e9600 by goroutine 6:
main.main.func1()
/home/jack/Project/GoProject/src/gitlab.com/hooshyar/GoNetworkLab/StackOVerflow/race/main.go:13 +0x56
Goroutine 6 (running) created at:
main.main()
/home/jack/Project/GoProject/src/gitlab.com/hooshyar/GoNetworkLab/StackOVerflow/race/main.go:11 +0x46
==================
result=119657339
Found 1 data race(s)
exit status 66
What is the solution?
There are several solutions; one is to use a mutex:
package main

import (
	"fmt"
	"sync"
	"time"
)

var a = 0

func main() {
	var mu sync.Mutex
	go func() {
		for {
			mu.Lock()
			a = a + 1
			mu.Unlock()
		}
	}()
	time.Sleep(3 * time.Second)
	mu.Lock()
	fmt.Printf("result=%d\n", a)
	mu.Unlock()
}
Lock the mutex before any read or write and unlock it afterwards; now you don't have any race, and the result will be a big int at the end.
For more information, read this topic:
Data races in Go(Golang) and how to fix them
and this:
Golang concurrency - data races
As other writers have mentioned, you have a data race, but if you are comparing this behavior to, say, a program written in C using pthreads, you are missing some important data. Your problem is not just about timing, it's about the very language definition. Because concurrency primitives are baked into the language itself, the Go language memory model (https://golang.org/ref/mem) describes exactly when and how changes in one goroutine -- think of goroutines as "super-lightweight user-space threads" and you won't be too far off -- are guaranteed to be visible to code running in another goroutine.
Without any synchronizing actions, like channel sends/receives or sync.Mutex locks/unlocks, the Go memory model says that any changes you make to 'a' inside that goroutine don't ever have to be visible to the main goroutine. And, since the compiler knows that, it is free to optimize away pretty much everything in your for loop. Or not.
It's a similar situation to having, say, a local int variable in C set to 1, with a while loop that reads it while waiting for an ISR to set it to 0; the compiler gets too clever, decides the variable can't ever change within the loop, assumes you really just wanted an infinite loop, and optimizes away the test for zero, so you have to declare the variable as volatile to fix the 'bug'.
If you are going to be working in Go, (my current favorite language, FWIW,) take time to read and thoroughly grok the Go memory model linked above, and it will really pay off in the future.
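To make the point concrete, here is a minimal sketch of the example from the question rewritten with sync/atomic; the atomic operations are synchronizing actions in the memory-model sense, so the main goroutine is guaranteed to observe the updates:

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var a int64

	go func() {
		for {
			// A synchronizing write: unlike a plain "a = a + 1",
			// this cannot be optimized away, and its effects are
			// visible to the atomic load below.
			atomic.AddInt64(&a, 1)
		}
	}()

	time.Sleep(time.Second)
	fmt.Printf("result=%d\n", atomic.LoadInt64(&a))
}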
Your program is running into a race condition, and Go can detect such scenarios.
Try running your program using go run -race main.go, assuming your file name is main.go. It will show how the race occurred:
an attempted write inside the goroutine,
a simultaneous read by the main goroutine.
It will also print a random int number, as you expected.
I have a fundamental understanding problem about how to make sure that spawned goroutines are "closed" properly in the context of long-running processes. I watched talks regarding that topic and read about best practices. In order to understand my question please refer to the video "Advanced Go Concurrency Patterns" here
For the following, if you run code on your machine please export the environment variable GOTRACEBACK=all so you are able to see routine states after panic.
I put the code for the original example here: naive (it does not execute on the Go playground, I guess because a time statement is used; please copy the code and execute it locally).
The result of the panic of the naive implementation after execution is
panic: show me the stacks
goroutine 1 [running]:
panic(0x48a680, 0xc4201d8480)
/usr/lib/go/src/runtime/panic.go:500 +0x1a1
main.main()
/home/flx/workspace/go/go-rps/playground/ball-naive.go:18 +0x16b
goroutine 5 [chan receive]:
main.player(0x4a4ec4, 0x2, 0xc42006a060)
/home/flx/workspace/go/go-rps/playground/ball-naive.go:23 +0x61
created by main.main
/home/flx/workspace/go/go-rps/playground/ball-naive.go:13 +0x76
goroutine 6 [chan receive]:
main.player(0x4a4ec6, 0x2, 0xc42006a060)
/home/flx/workspace/go/go-rps/playground/ball-naive.go:23 +0x61
created by main.main
/home/flx/workspace/go/go-rps/playground/ball-naive.go:14 +0xad
exit status 2
That demonstrates the underlying problem of leaving dangling goroutines on the system, which is especially bad for long-running processes.
So for my personal understanding I tried two slightly more sophisticated variants to be found here:
for-select with default
generator pattern with quit channel
(again, not executable on the playground, because "process takes too long")
The first solution is not a good fit for various reasons; it even leads to non-determinism in the executed steps, depending on goroutine execution speed.
Now I thought -- and here finally comes the question! -- that the second solution with the quit channel would be appropriate to eliminate all executional traces from the system before exiting. Anyhow, "sometimes" the program exits too fast and the panic reports an additional goroutine runnable still residing on the system. The panic output:
panic: show me the stacks
goroutine 1 [running]:
panic(0x48d8e0, 0xc4201e27c0)
/usr/lib/go/src/runtime/panic.go:500 +0x1a1
main.main()
/home/flx/workspace/go/go-rps/playground/ball-perfect.go:20 +0x1a9
goroutine 20 [runnable]:
main.player.func1(0xc420070060, 0x4a8986, 0x2, 0xc420070120)
/home/flx/workspace/go/go-rps/playground/ball-perfect.go:27 +0x211
created by main.player
/home/flx/workspace/go/go-rps/playground/ball-perfect.go:36 +0x7f
exit status 2
My question is: that should not happen, right? I do use a quit channel to clean up state before going on to panic.
I did a final try of implementing safe cleanup behavior here:
artificial wait time for runnables to close
Anyhow, that solution does not feel right, and it may not be applicable to large numbers of runnables.
What would be the recommended and most idiomatic pattern to ensure correct cleanup?
Thanks for your time
You are fooled by the output: your "generator pattern with quit channel" works perfectly fine; the two goroutines actually are terminated properly.
You see them in the trace because you panic too early. Remember: you have two goroutines running concurrently with main. main "stops" these goroutines by signaling on the quit channel. After the two sends on lines 18 and 19, the two receives on line 32 have happened. And nothing more! You still have three goroutines running: main is between lines 19 and 20, and the player goroutines are between lines 32 and 33. If the panic in main now happens before the return in player, then the player goroutines are still there and are shown in the panic stack trace. These goroutines would have ended several milliseconds later, if only the scheduler had had time to execute the return on line 33 (which it hadn't, as you killed it by panicking).
This is an instance of the "main ends too early to see concurrent goroutines do work" problem asked once a month here. You do see the concurrent goroutines doing work, just not all of it. You might try sleeping 2 milliseconds before the panic; your player goroutines will then have time to execute the return, and everything is fine.
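If you want a guarantee rather than a sleep, one common approach (a sketch of one option, not the only idiomatic pattern; the player names and loop body below are illustrative) is to have main block on a sync.WaitGroup until every goroutine has actually returned before panicking:

package main

import (
	"fmt"
	"sync"
)

// player is a simplified stand-in for the players in the talk: it works
// until it receives on quit, then returns, signalling via the WaitGroup.
func player(name string, quit <-chan struct{}, wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		select {
		case <-quit:
			fmt.Println(name, "done")
			return
		default:
			// simulate one unit of work
		}
	}
}

func main() {
	quit := make(chan struct{})
	var wg sync.WaitGroup

	wg.Add(2)
	go player("ping", quit, &wg)
	go player("pong", quit, &wg)

	close(quit) // broadcast the stop signal to both players
	wg.Wait()   // block until both goroutines have actually returned

	panic("show me the stacks") // now only main should be listed
}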
I get different output for println and fmt.Println under the race detector, which I couldn't explain. I expected both to race, or at least both not to race.
package main

var a int

func f() {
	a = 1
}

func main() {
	go f()
	println(a)
}
And it finds the race condition, as expected.
0
==================
WARNING: DATA RACE
Write by goroutine 5:
main.f()
/home/felmas/test.go:6 +0x30
Previous read by main goroutine:
main.main()
/home/felmas/test.go:11 +0x4d
Goroutine 5 (running) created at:
main.main()
/home/felmas/test.go:10 +0x38
==================
Found 1 data race(s)
However, this one runs without any detected race.
package main

import "fmt"

var a int

func f() {
	a = 1
}

func main() {
	go f()
	fmt.Println(a)
}
To my knowledge, the fact that no race is detected doesn't mean there is no race. So is this one of those deficiencies, or is there a deeper explanation, since println is a builtin and quite special?
The race detector is a dynamic testing tool, not a static analysis. In order to get reliable results from it, you should strive for high test coverage of your program, preferably by writing lots of benchmarks that exercise multiple processors (by setting GOMAXPROCS > 1; GOMAXPROCS=NumCPU is the default since Go 1.5) and by using a continuous integration tool that executes those tests regularly.
The race detector does not report any false positives, so you should take every output seriously. On the other hand, it might not detect every race on every run, since the scheduling of goroutines is non-deterministic.
In your example, wrapping everything in a tight loop and re-executing the tests reports the race correctly in both cases.
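For instance, a quick reproduction sketch along those lines (the names match the question's example; the loop bound is arbitrary) makes the detector flag the fmt.Println variant as well when run with go run -race main.go:

package main

import "fmt"

var a int

func f() {
	a = 1
}

func main() {
	// Repeating the racy pair gives the detector many chances to
	// observe the conflicting accesses.
	for i := 0; i < 1000; i++ {
		go f()
		fmt.Println(a)
	}
}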
I am getting a data race when I try to marshal a struct to XML in two or more goroutines.
Sample main program: http://play.golang.org/p/YhkWXWL8C0
I believe xml:"members>member" is causing this. If I change it to the normal form then all works fine. Any thoughts on why the Go 1.4.x version is doing that?
type Family struct {
	XMLName xml.Name `xml:"family"`
	Name    string   `xml:"famil_name"`
	Members []Person `xml:"members>member"`
	//Members []Person `xml:"members"`
}
go run -race data_race.go gives me
2015/02/06 13:53:43 Total GoRoutine Channels Created 2
2015/02/06 13:53:43 <family><famil_name></famil_name><members><person><name>ABCD</name><age>0</age></person><person><name>dummy</name><age>0</age></person></members></family>
==================
WARNING: DATA RACE
Write by goroutine 6:
runtime.slicecopy()
/usr/local/go/src/runtime/slice.go:94 +0x0
encoding/xml.(*parentStack).push()
/usr/local/go/src/encoding/xml/marshal.go:908 +0x2fb
encoding/xml.(*printer).marshalStruct()
/usr/local/go/src/encoding/xml/marshal.go:826 +0x628
encoding/xml.(*printer).marshalValue()
/usr/local/go/src/encoding/xml/marshal.go:531 +0x1499
encoding/xml.(*Encoder).Encode()
/usr/local/go/src/encoding/xml/marshal.go:153 +0xb8
encoding/xml.Marshal()
/usr/local/go/src/encoding/xml/marshal.go:72 +0xfb
main.ToXml()
/Users/kadalamittai/selfie/go/src/github.com/ivam/goal/command/data_race.go:51 +0x227
main.funcĀ·001()
/Users/kadalamittai/selfie/go/src/github.com/ivam/goal/command/data_race.go:61 +0x74
Previous read by goroutine 5:
encoding/xml.(*parentStack).trim()
/usr/local/go/src/encoding/xml/marshal.go:893 +0x2ae
encoding/xml.(*printer).marshalStruct()
/usr/local/go/src/encoding/xml/marshal.go:836 +0x203
encoding/xml.(*printer).marshalValue()
/usr/local/go/src/encoding/xml/marshal.go:531 +0x1499
encoding/xml.(*Encoder).Encode()
/usr/local/go/src/encoding/xml/marshal.go:153 +0xb8
encoding/xml.Marshal()
/usr/local/go/src/encoding/xml/marshal.go:72 +0xfb
main.ToXml()
/Users/kadalamittai/selfie/go/src/github.com/ivam/goal/command/data_race.go:51 +0x227
main.funcĀ·001()
/Users/kadalamittai/selfie/go/src/github.com/ivam/goal/command/data_race.go:61 +0x74
Goroutine 6 (running) created at:
main.AsyncExecute()
/Users/kadalamittai/selfie/go/src/github.com/ivam/goal/command/data_race.go:67 +0x15d
main.main()
/Users/kadalamittai/selfie/go/src/github.com/ivam/goal/command/data_race.go:80 +0x2bf
Goroutine 5 (finished) created at:
main.AsyncExecute()
/Users/kadalamittai/selfie/go/src/github.com/ivam/goal/command/data_race.go:67 +0x15d
main.main()
/Users/kadalamittai/selfie/go/src/github.com/ivam/goal/command/data_race.go:80 +0x2bf
==================
This looks like a bug in the Go 1.4.x library. I've reported it as a bug; hopefully it will get fixed. I'll leave the analysis below for reference.
What's happening is that there's an implicit shared value due to the use of getTypeInfo() which returns a type description of the struct. For efficiency, it appears to be globally cached state. Other parts of the XML encoder take components of this state and pass it around. It appears that there's an inadvertent mutation happening due to a slice append on a component of the shared value.
The p.stack attribute that's being reported as the source of the data race originates from a part of the typeInfo shared value, where a slice of tinfo.parents gets injected on line 821. That's ultimately where the sharing happens, with the potential for reading and writing, because later on there are appends happening on the slice, and those can mutate the underlying array.
What should probably happen instead is that the slice should be capacity-restricted so that any potential append won't do a write on the shared array value.
That is, line 897 of the encoder library could probably be changed from:
897 s.stack = parents[:split]
to:
897 s.stack = parents[:split:split]
to correct the issue.
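The effect of that full slice expression can be seen in isolation with a small sketch (unrelated to the encoder itself): capping the capacity forces append to allocate a fresh backing array instead of writing into the shared one.

package main

import "fmt"

func main() {
	parents := []string{"a", "b", "c", "d"}

	// Two-index slice: capacity extends to the end of the backing
	// array, so the append overwrites parents[2] in place.
	s1 := parents[:2]
	s1 = append(s1, "X")
	fmt.Println(parents) // [a b X d]

	parents = []string{"a", "b", "c", "d"}

	// Three-index slice: cap == len, so the append allocates a new
	// backing array and leaves the shared data untouched.
	s2 := parents[:2:2]
	s2 = append(s2, "Y")
	fmt.Println(parents) // [a b c d]
}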