Benchmark with Goroutines

I'm pretty new to Go and bumped into a problem when benchmarking with goroutines.
The code I have is here:
package main

import (
	"sync"
	"testing"
)

type store struct{}

func (n *store) WriteSpan(span interface{}) error {
	return nil
}

func smallTest(times int, b *testing.B) {
	writer := store{}
	var wg sync.WaitGroup
	numGoroutines := times
	wg.Add(numGoroutines)
	b.ResetTimer()
	b.ReportAllocs()
	for n := 0; n < numGoroutines; n++ {
		go func() {
			writer.WriteSpan(nil)
			wg.Done()
		}()
	}
	wg.Wait()
}

func BenchmarkTest1(b *testing.B) {
	smallTest(1000000, b)
}

func BenchmarkTest2(b *testing.B) {
	smallTest(10000000, b)
}
func BenchmarkTest1(b *testing.B) {
smallTest(1000000, b)
}
func BenchmarkTest2(b *testing.B) {
smallTest(10000000, b)
}
It looks to me like the runtime and allocations for both scenarios should be similar, but running them gives the following results, which are vastly different. Why does this happen? Where do those extra allocations come from?
BenchmarkTest1-12 1000000000 0.26 ns/op 0 B/op 0 allocs/op
BenchmarkTest2-12 1 2868129398 ns/op 31872 B/op 83 allocs/op
PASS
I also notice that if I add an inner loop that calls WriteSpan multiple times, the runtime and allocations scale roughly with numGoroutines * the number of inner iterations. If this is not how people benchmark with goroutines, are there other standard ways to test? Thanks in advance.

Meaningless microbenchmarks produce meaningless results.
If this is not how people benchmark with goroutines, are there other standard ways to test?
It's not the way to benchmark anything. Benchmark real problems.
You run a very large number of goroutines, which do nothing, until you saturate the scheduler, the machine, and other resources. That merely proves that if you run anything enough times you can bring a machine to its knees.
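Note, too, that smallTest never uses b.N, so the framework's reported ns/op and allocs/op are computed against an iteration count that has nothing to do with the work actually done. If you want meaningful per-operation numbers from goroutines, a minimal sketch of the conventional pattern (not from the original answer) is to let the framework drive the iteration count and use b.RunParallel:

func BenchmarkWriteSpan(b *testing.B) {
	writer := &store{}
	b.ReportAllocs()
	b.RunParallel(func(pb *testing.PB) {
		// Each goroutine performs a share of the b.N iterations.
		for pb.Next() {
			writer.WriteSpan(nil)
		}
	})
}

b.RunParallel creates GOMAXPROCS goroutines by default and distributes the b.N iterations among them, so the per-op numbers stay meaningful.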

Confusing results from golang benchmarking of function and go routine call overhead

Out of curiosity, I am trying to understand the function and goroutine call overhead in Go. I therefore wrote the benchmarks below, with the results below them. The result for BenchmarkNestedFunctions confuses me, as it seems far too high, so I naturally assume I have done something wrong. I was expecting BenchmarkNestedFunctions to be slightly higher than BenchmarkNopFunc and very close to BenchmarkSplitNestedFunctions. Can anyone suggest what I may be misunderstanding or doing wrong?
package main

import (
	"testing"
)

// Intended to allow me to see the iteration overhead being used in the benchmarking
func BenchmarkTestLoop(b *testing.B) {
	for i := 0; i < b.N; i++ {
	}
}

//go:noinline
func nop() {
}

// Intended to allow me to see the overhead from making a do-nothing function call which I hope is not being optimised out
func BenchmarkNopFunc(b *testing.B) {
	for i := 0; i < b.N; i++ {
		nop()
	}
}

// Intended to allow me to see the added cost from creating a channel, closing it and then reading from it
func BenchmarkChannelMakeCloseRead(b *testing.B) {
	for i := 0; i < b.N; i++ {
		done := make(chan struct{})
		close(done)
		_, _ = <-done
	}
}

//go:noinline
func nestedfunction(n int, done chan<- struct{}) {
	n--
	if n > 0 {
		nestedfunction(n, done)
	} else {
		close(done)
	}
}

// Intended to allow me to see the added cost of making 1 function call doing a set of channel operations for each call
func BenchmarkUnnestedFunctions(b *testing.B) {
	for i := 0; i < b.N; i++ {
		done := make(chan struct{})
		nestedfunction(1, done)
		_, _ = <-done
	}
}

// Intended to allow me to see the added cost of repeated nested calls and stack growth, with an upper limit on the call depth to allow examination of a particular stack size
func BenchmarkNestedFunctions(b *testing.B) {
	// Max number of nested function calls to prevent excessive stack growth
	const max int = 200000
	if b.N > max {
		b.N = max
	}
	done := make(chan struct{})
	nestedfunction(b.N, done)
	_, _ = <-done
}

// Intended to allow me to see the added cost of repeated nested calls with any stack reuse the runtime supports (presuming it doesn't free and then realloc the stack as it grows)
func BenchmarkSplitNestedFunctions(b *testing.B) {
	// Max number of nested function calls to prevent excessive stack growth
	const max int = 200000
	for i := 0; i < b.N; i += max {
		done := make(chan struct{})
		if (b.N - i) > max {
			nestedfunction(max, done)
		} else {
			nestedfunction(b.N-i, done)
		}
		_, _ = <-done
	}
}
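// NOTE: nestedgoroutines is called by BenchmarkNestedGoRoutines below, but its
// definition was not included in the original post. The following is a plausible
// reconstruction from its name and usage: each level of "nesting" is a freshly
// spawned goroutine rather than a nested function call.
func nestedgoroutines(n int, done chan<- struct{}) {
	n--
	if n > 0 {
		go nestedgoroutines(n, done)
	} else {
		close(done)
	}
}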
// Intended to allow me to see the added cost of spinning up a goroutine to perform comparable useful work as the nested function calls
func BenchmarkNestedGoRoutines(b *testing.B) {
	done := make(chan struct{})
	go nestedgoroutines(b.N, done)
	_, _ = <-done
}
The benchmarks are invoked as follows:
$ go test -bench=. -benchmem -benchtime=200ms
goos: windows
goarch: amd64
pkg: golangbenchmarks
cpu: AMD Ryzen 9 3900X 12-Core Processor
BenchmarkTestLoop-24 1000000000 0.2247 ns/op 0 B/op 0 allocs/op
BenchmarkNopFunc-24 170787386 1.402 ns/op 0 B/op 0 allocs/op
BenchmarkChannelMakeCloseRead-24 3990243 52.72 ns/op 96 B/op 1 allocs/op
BenchmarkUnnestedFunctions-24 4791862 58.63 ns/op 96 B/op 1 allocs/op
BenchmarkNestedFunctions-24 200000 50.11 ns/op 0 B/op 0 allocs/op
BenchmarkSplitNestedFunctions-24 155160835 1.528 ns/op 0 B/op 0 allocs/op
BenchmarkNestedGoRoutines-24 636734 412.2 ns/op 24 B/op 1 allocs/op
PASS
ok golangbenchmarks 1.700s
The BenchmarkTestLoop, BenchmarkNopFunc and BenchmarkSplitNestedFunctions results seem reasonably consistent with each other and make sense: BenchmarkSplitNestedFunctions does more work than BenchmarkNopFunc on average per benchmark operation, but not by much, because the expensive channel make/close/read sequence is only done about once every 200,000 benchmarking operations.
Similarly, the BenchmarkChannelMakeCloseRead and BenchmarkUnnestedFunctions results seem consistent with each other, since each BenchmarkUnnestedFunctions iteration does only slightly more than each BenchmarkChannelMakeCloseRead iteration: a decrement and an if test, which could potentially cause a branch prediction failure (although I would have hoped the branch predictor could use the last branch result; I don't know how complex the close function implementation is, which may be overwhelming the branch history).
However, BenchmarkNestedFunctions and BenchmarkSplitNestedFunctions are radically different and I don't understand why. They should be similar, with the only intentional difference being any reuse of the grown stack, and I did not expect the stack growth cost to be nearly so high. (Or is that the explanation, and is it just coincidence that the result is so similar to the BenchmarkChannelMakeCloseRead result, making me think it is not actually doing what I thought it was?)
It should also be noted that the BenchmarkSplitNestedFunctions result can occasionally take significantly different values; I have seen values in the range of 10 to 200 ns/op when running it repeatedly. It can also fail to report any ns/op time while still passing; I have no idea what is going on there:
BenchmarkChannelMakeCloseRead-24 5724488 54.26 ns/op 96 B/op 1 allocs/op
BenchmarkUnnestedFunctions-24 3992061 57.49 ns/op 96 B/op 1 allocs/op
BenchmarkNestedFunctions-24 200000 0 B/op 0 allocs/op
BenchmarkNestedFunctions2-24 154956972 1.590 ns/op 0 B/op 0 allocs/op
BenchmarkNestedGoRoutines-24 1000000 342.1 ns/op 24 B/op 1 allocs/op
If anyone can point out my mistake in the benchmark or in my interpretation of the results, and explain what is really happening, that would be greatly appreciated.
Background info:
Stack growth and function inlining: https://dave.cheney.net/2020/04/25/inlining-optimisations-in-go
Stack growth limitations: https://dave.cheney.net/2013/06/02/why-is-a-goroutines-stack-infinite
Golang stack structure: https://blog.cloudflare.com/how-stacks-are-handled-in-go/
Branch prediction: https://en.wikipedia.org/wiki/Branch_predictor
Top level 3900X architecture overview: https://www.techpowerup.com/review/amd-ryzen-9-3900x/3.html
3900X branch prediction history/buffer size 16/512/7k: https://www.techpowerup.com/review/amd-ryzen-9-3900x/images/arch3.jpg

Measure heap growth accurately

I am trying to measure the evolution of the number of heap-allocated objects before and after I call a function. I am forcing runtime.GC() and using runtime.ReadMemStats to measure the number of heap objects before and after.
The problem is that I sometimes see unexpected heap growth, and it is different on each run.
A simple example is below, where I would always expect to see zero heap-object growth.
https://go.dev/play/p/FBWfXQHClaG
package main

import (
	"log"
	"runtime"
)

var mem1_before, mem2_before, mem1_after, mem2_after runtime.MemStats

func measure_nothing(before, after *runtime.MemStats) {
	runtime.GC()
	runtime.ReadMemStats(before)
	runtime.GC()
	runtime.ReadMemStats(after)
}

func main() {
	measure_nothing(&mem1_before, &mem1_after)
	measure_nothing(&mem2_before, &mem2_after)
	log.Printf("HeapObjects diff = %d", int64(mem1_after.HeapObjects-mem1_before.HeapObjects))
	log.Printf("HeapAlloc diff %d", int64(mem1_after.HeapAlloc-mem1_before.HeapAlloc))
	log.Printf("HeapObjects diff = %d", int64(mem2_after.HeapObjects-mem2_before.HeapObjects))
	log.Printf("HeapAlloc diff %d", int64(mem2_after.HeapAlloc-mem2_before.HeapAlloc))
}
Sample output:
2009/11/10 23:00:00 HeapObjects diff = 0
2009/11/10 23:00:00 HeapAlloc diff 0
2009/11/10 23:00:00 HeapObjects diff = 4
2009/11/10 23:00:00 HeapAlloc diff 1864
Is what I'm trying to do impractical? I assume the runtime is doing things that allocate/free heap memory. Can I tell it to stop, to make my measurements reliable? (This is for a test checking for memory leaks, not production code.)
You can't predict what garbage collection and reading all the memory stats require in the background, so calling them to calculate memory allocations and usage is not reliable.
Luckily for us, Go's testing framework can monitor and calculate memory usage.
So what you should do is write a benchmark function and let the testing framework do its job of reporting memory allocations and usage.
Let's assume we want to measure this foo() function:
var x []int64

func foo(allocs, size int) {
	for i := 0; i < allocs; i++ {
		x = make([]int64, size)
	}
}
All it does is allocate a slice of the given size, and it does this the given number of times (allocs).
Let's write benchmarking functions for different scenarios:
func BenchmarkFoo_0_0(b *testing.B) {
	for i := 0; i < b.N; i++ {
		foo(0, 0)
	}
}

func BenchmarkFoo_1_1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		foo(1, 1)
	}
}

func BenchmarkFoo_2_2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		foo(2, 2)
	}
}
Running the benchmark with go test -bench . -benchmem, the output is:
BenchmarkFoo_0_0-8 1000000000 0.3204 ns/op 0 B/op 0 allocs/op
BenchmarkFoo_1_1-8 67101626 16.58 ns/op 8 B/op 1 allocs/op
BenchmarkFoo_2_2-8 27375050 42.42 ns/op 32 B/op 2 allocs/op
As you can see, the allocations per function call match the allocs argument we pass in. The allocated memory is the expected allocs * size * 8 bytes.
Note that the reported allocations per op is an integer value (it's the result of an integer division), so if the benchmarked function only occasionally allocates, it might not be reported in the integer result. For details, see Output from benchmem.
Like in this example:
var x []int64

func bar() {
	if rand.Float64() < 0.3 {
		x = make([]int64, 10)
	}
}
This bar() function does 1 allocation with 30% probability (and none with 70% probability), which means on average it does 0.3 allocations. Benchmarking it:
func BenchmarkBar(b *testing.B) {
	for i := 0; i < b.N; i++ {
		bar()
	}
}
Output is:
BenchmarkBar-8 38514928 29.60 ns/op 24 B/op 0 allocs/op
We can see an average of 24 bytes allocated per op (0.3 * 10 * 8 bytes), which is correct, but the reported allocations per op is 0.
Luckily for us, we can also benchmark a function from our main app using the testing.Benchmark() function. It returns a testing.BenchmarkResult including all details about memory usage. We have access to the total number of allocations and to the number of iterations, so we can calculate allocations per op using floating point numbers:
func main() {
	rand.Seed(time.Now().UnixNano())
	tr := testing.Benchmark(BenchmarkBar)
	fmt.Println("Allocs/op", tr.AllocsPerOp())
	fmt.Println("B/op", tr.AllocedBytesPerOp())
	fmt.Println("Precise allocs/op:", float64(tr.MemAllocs)/float64(tr.N))
}
This will output:
Allocs/op 0
B/op 24
Precise allocs/op: 0.3000516369276302
We can see the expected ~0.3 allocations per op.
Now if we go ahead and benchmark your measure_nothing() function:
func BenchmarkNothing(b *testing.B) {
	for i := 0; i < b.N; i++ {
		measure_nothing(&mem1_before, &mem1_after)
	}
}
We get this output:
Allocs/op 0
B/op 11
Precise allocs/op: 0.12182030338389732
As you can see, running the garbage collector twice and reading memory stats twice occasionally requires allocation (~1 out of 8 calls: 0.12 times on average).
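If the end goal is an automated leak check, you can build on the same idea: let testing.Benchmark() average away that background noise and assert on the per-op numbers. A sketch (workUnderTest is a hypothetical stand-in for the code being checked):

func TestNoLeak(t *testing.T) {
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			workUnderTest() // hypothetical: the function being checked
		}
	})
	// Averaged over res.N iterations, background allocations from the runtime
	// shrink toward zero, while a real per-call leak stays near a constant.
	if perOp := float64(res.MemAllocs) / float64(res.N); perOp > 0.5 { // example threshold
		t.Errorf("unexpected allocations per op: %g", perOp)
	}
}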

How to use channels efficiently [closed]

Closed. This question is opinion-based and is not currently accepting answers. Closed 2 years ago.
I read in Uber's Go style guide that one should use a channel capacity of at most 1.
Although it's clear to me that using a channel size of 100 or 1000 is very bad practice, I was wondering why a channel size of 10 isn't considered a valid option. I'm missing some step to reach the right conclusion.
Below, you can follow my arguments (and counter arguments) backed by some benchmark test.
I understand that if both goroutines responsible for writing to and reading from this channel are interrupted between sequential writes or reads by some other IO action, no gain is to be expected from a larger channel buffer, and I agree that 1 is the best option.
But let's say that no significant goroutine switching is needed apart from the implicit locking and unlocking caused by writing to and reading from the channel. Then I would conclude the following:
Consider the number of context switches when processing 100 values through a channel with a buffer of size 1 versus size 10 (GR = goroutine):
Buffer=1: (GR1 inserts 1 value, GR2 reads 1 value) x 100 ~ 200 goroutine switches
Buffer=10: (GR1 inserts 10 values, GR2 reads 10 values) x 10 ~ 20 goroutine switches
I did some benchmarking to prove that this actually goes faster:
package main

import (
	"testing"
)

type a struct {
	b [100]int64
}

func BenchmarkBuffer1(b *testing.B) {
	count := 0
	c := make(chan a, 1)
	go func() {
		for i := 0; i < b.N; i++ {
			c <- a{}
		}
		close(c)
	}()
	for v := range c {
		for i := range v.b {
			count += i
		}
	}
}

func BenchmarkBuffer10(b *testing.B) {
	count := 0
	c := make(chan a, 10)
	go func() {
		for i := 0; i < b.N; i++ {
			c <- a{}
		}
		close(c)
	}()
	for v := range c {
		for i := range v.b {
			count += i
		}
	}
}
Results when comparing simple reading & writing + non-blocking processing:
BenchmarkBuffer1-12 5072902 266 ns/op
BenchmarkBuffer10-12 6029602 179 ns/op
PASS
BenchmarkBuffer1-12 5228782 256 ns/op
BenchmarkBuffer10-12 5392410 216 ns/op
PASS
BenchmarkBuffer1-12 4806208 287 ns/op
BenchmarkBuffer10-12 4637842 233 ns/op
PASS
However, if I add a sleep every 10 reads, it doesn't yield any better results.
import (
	"testing"
	"time"
)

func BenchmarkBuffer1WithSleep(b *testing.B) {
	count := 0
	c := make(chan int, 1)
	go func() {
		for i := 0; i < b.N; i++ {
			c <- i
		}
		close(c)
	}()
	for a := range c {
		count++
		if count%10 == 0 {
			time.Sleep(time.Duration(a) * time.Nanosecond)
		}
	}
}

func BenchmarkBuffer10WithSleep(b *testing.B) {
	count := 0
	c := make(chan int, 10)
	go func() {
		for i := 0; i < b.N; i++ {
			c <- i
		}
		close(c)
	}()
	for a := range c {
		count++
		if count%10 == 0 {
			time.Sleep(time.Duration(a) * time.Nanosecond)
		}
	}
}
Results when adding a sleep every 10 reads:
BenchmarkBuffer1WithSleep-12 856886 53219 ns/op
BenchmarkBuffer10WithSleep-12 929113 56939 ns/op
FYI: I also did the test again with only one CPU and got the following results:
BenchmarkBuffer1 5831193 207 ns/op
BenchmarkBuffer10 6226983 180 ns/op
BenchmarkBuffer1WithSleep 556635 35510 ns/op
BenchmarkBuffer10WithSleep 984472 61434 ns/op
Absolutely nothing is wrong with a channel of cap 500, e.g. if this channel is used as a semaphore.
The style guide you read recommends not using buffered channels of, say, cap 64 "because this looks like a nice number". But this recommendation is not about performance! (By the way: your microbenchmarks are useless; they do not measure anything relevant.)
An unbuffered channel is a kind of synchronisation primitive and as such is very useful.
A buffered channel may buffer between sender and receiver, and this buffering can be problematic for observing, tuning and debugging the code (because creation and consumption are further decoupled). That's why the style guide recommends unbuffered channels (or at most a cap of 1, as this is sometimes needed for correctness!).
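To illustrate the synchronisation role (a minimal sketch): with cap 0, a send does not complete until a receiver is ready, so the channel acts as a rendezvous point rather than a queue:

package main

import "fmt"

func main() {
	results := make(chan int) // unbuffered: sender and receiver must rendezvous
	go func() {
		results <- 42 // this send blocks until main is ready to receive
	}()
	fmt.Println(<-results) // the handoff itself is the synchronisation
}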
The style guide also doesn't prohibit larger buffer caps:
Any other [than 0 or 1] size must be subject to a high level of scrutiny. Consider how the size is determined, what prevents the channel from filling up under load and blocking writers, and what happens when this occurs. [emph. mine]
You may use a cap of 27 if you can explain why 27 (and not 22 or 31) and how this will influence program behaviour (not only performance!) if the buffer is filled.
Most people overrate performance. Correctness, operational stability and maintainability come first, and that is what the style guide addresses here.
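As an illustration of the semaphore case mentioned at the top of this answer, here the capacity encodes a correctness property: the maximum number of concurrent workers. A sketch, with Task and process() as hypothetical stand-ins for the caller's own type and work:

// processAll runs process() on every task with at most 500 in flight at once.
func processAll(tasks []Task) {
	sem := make(chan struct{}, 500) // the capacity IS the concurrency limit
	var wg sync.WaitGroup
	for _, t := range tasks {
		sem <- struct{}{} // acquire a slot; blocks once 500 workers are busy
		wg.Add(1)
		go func(t Task) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			process(t)
		}(t)
	}
	wg.Wait()
}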

Channels and Parallelism confusion

I'm teaching myself Go, and I'm a bit confused about parallelism and how it is implemented in Go.
Given the following example:
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

const (
	workers    = 1
	rand_count = 5000000
)

func start_rand(ch chan int) {
	defer close(ch)
	var wg sync.WaitGroup
	wg.Add(workers)
	rand_routine := func(counter int) {
		defer wg.Done()
		for i := 0; i < counter; i++ {
			seed := time.Now().UnixNano()
			rand.Seed(seed)
			ch <- rand.Intn(5000)
		}
	}
	for i := 0; i < workers; i++ {
		go rand_routine(rand_count / workers)
	}
	wg.Wait()
}

func main() {
	start_time := time.Now()
	mychan := make(chan int, workers)
	go start_rand(mychan)
	var wg sync.WaitGroup
	wg.Add(workers)
	work_handler := func() {
		defer wg.Done()
		for {
			v, isOpen := <-mychan
			if !isOpen {
				break
			}
			fmt.Println(v)
		}
	}
	for i := 0; i < workers; i++ {
		go work_handler()
	}
	wg.Wait()
	elapsed_time := time.Since(start_time)
	fmt.Println("Done", elapsed_time)
}
This piece of code takes about one minute to run on my MacBook. I assumed that increasing the workers constant would launch additional goroutines and, since my laptop has multiple cores, shorten the execution time.
This is not the case, however. Increasing the workers does not reduce the execution time.
I was thinking that setting workers to 1 would create 1 goroutine to generate the random numbers, and setting it to 4 would create 4 goroutines. Given the multicore nature of my laptop, I was expecting 4 workers to run on different cores and therefore increase performance.
However, I see increased load on all my cores, even when workers is set to 1. What am I missing here?
Your code has some issues which make it inherently slow:
You are seeding inside the loop. This only needs to be done once.
You are using the same source for random numbers. This source is thread safe, but it takes away any performance gains for concurrent workers. You could create a source for each worker with rand.New.
You are printing a lot. Printing is thread safe too, so that takes away any speed gains for concurrent workers.
As Zak already pointed out: the concurrent work inside the goroutines is very cheap and the communication is expensive.
You could rewrite your program as below. Then you will see some speed gains when you change the number of workers:
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	workers   = 1
	randCount = 5000000
)

var results = [randCount]int{}

func randRoutine(start, counter int, c chan bool) {
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	for i := 0; i < counter; i++ {
		results[start+i] = r.Intn(5000)
	}
	c <- true
}

func main() {
	startTime := time.Now()
	c := make(chan bool)
	start := 0
	for w := 0; w < workers; w++ {
		go randRoutine(start, randCount/workers, c)
		start += randCount / workers
	}
	for i := 0; i < workers; i++ {
		<-c
	}
	elapsedTime := time.Since(startTime)
	for _, i := range results {
		fmt.Println(i)
	}
	fmt.Println("Time calculating", elapsedTime)
	elapsedTime = time.Since(startTime)
	fmt.Println("Total time", elapsedTime)
}
This program does a lot of work in each goroutine and communicates minimally. Also, a different random source is used for each goroutine.
Your code does not have just a single goroutine, even though you set workers to 1:
There is 1 goroutine from the call go start_rand(...).
That goroutine creates N (workers) goroutines with go rand_routine(...) and waits for them to finish.
Then you also start N (workers) goroutines with go work_handler().
And there is the goroutine that runs the main() function itself.
So: 1 + 2N + 1 goroutines are running for any given N, where N == workers.
Plus, on top of that, the work that you are doing in the goroutines is pretty cheap (fast to execute). You are just generating random numbers.
If you look at the blocking and scheduler latency profiles of the program:
[blocking profile and scheduler latency profile images not reproduced here]
You can see from both of those profiles that most of the time is spent in the concurrency constructs. This suggests there is a lot of contention in your program. While goroutines are cheap, there is still some blocking and synchronisation that needs to happen when sending a value over a channel. This can take a large proportion of the program's time when the work being done by the producer is very fast/cheap.
To answer your original question, you see load on many cores because you have more than a single goroutine running.
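For reference, one way to capture profiles like those described above (a sketch; the original answer does not show how its images were produced) is to enable block profiling and execution tracing around the workload, where doWork() stands in for the program being investigated:

package main

import (
	"os"
	"runtime"
	"runtime/pprof"
	"runtime/trace"
)

func main() {
	runtime.SetBlockProfileRate(1) // record every blocking event (channels, mutexes, ...)

	tf, _ := os.Create("trace.out") // scheduler latency: inspect with `go tool trace trace.out`
	trace.Start(tf)

	doWork() // hypothetical stand-in for the code under investigation

	trace.Stop()
	bf, _ := os.Create("block.out") // blocking profile: inspect with go tool pprof
	pprof.Lookup("block").WriteTo(bf, 0)
}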

How to use time value of benchmark

I have written a benchmark for my chess engine in Go:
func BenchmarkStartpos(b *testing.B) {
	board := ParseFen(startpos)
	for i := 0; i < b.N; i++ {
		Perft(&board, 5)
	}
}
I see this output when it runs:
goos: darwin
goarch: amd64
BenchmarkStartpos-4 10 108737398 ns/op
PASS
ok _/Users/dylhunn/Documents/go-chess 1.215s
I want to use the time per execution (in this case, 108737398 ns/op) to compute another value, and also print it as a result of the benchmark. Specifically, I want to output nodes per second, which is given as the result of the Perft call divided by the time per call.
How can I access the time the benchmark took to execute, so I can print my own derived results?
You may use the testing.Benchmark() function to manually measure / benchmark "benchmark" functions (those with the signature func(*testing.B)); you get the result as a testing.BenchmarkResult value, which is a struct with all the details you need:
type BenchmarkResult struct {
	N         int           // The number of iterations.
	T         time.Duration // The total time taken.
	Bytes     int64         // Bytes processed in one iteration.
	MemAllocs uint64        // The total number of memory allocations.
	MemBytes  uint64        // The total number of bytes allocated.
}
The time per execution is returned by the BenchmarkResult.NsPerOp() method; you can do whatever you want with it.
See this simple example:
func main() {
	res := testing.Benchmark(BenchmarkSleep)
	fmt.Println(res)
	fmt.Println("Ns per op:", res.NsPerOp())
	fmt.Println("Time per op:", time.Duration(res.NsPerOp()))
}

func BenchmarkSleep(b *testing.B) {
	for i := 0; i < b.N; i++ {
		time.Sleep(time.Millisecond * 12)
	}
}
Output is (try it on the Go Playground):
100 12000000 ns/op
Ns per op: 12000000
Time per op: 12ms
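Applied to the chess benchmark in the question, the same result lets you derive nodes per second. A sketch, assuming Perft returns the number of nodes it visited (the original snippet discards its result):

func main() {
	res := testing.Benchmark(BenchmarkStartpos)
	board := ParseFen(startpos)
	nodes := Perft(&board, 5) // assumed to return the node count for one call
	nps := float64(nodes) / (float64(res.NsPerOp()) / 1e9)
	fmt.Printf("%.0f nodes/second\n", nps)
}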
