Golang benchmark: why does allocs/op show 0 B/op?

Here is a code snippet for a benchmark:
// bench_test.go
package main

import (
	"testing"
)

func BenchmarkHello(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := 1
		a++
	}
}
The benchmark reports 0 B/op and 0 allocs/op. Variable a is an int and doesn't take much memory, but it shouldn't take zero bytes.
> go test -bench=. -benchmem
goos: darwin
goarch: amd64
pkg: a
BenchmarkHello-4 2000000000 0.26 ns/op 0 B/op 0 allocs/op
PASS
ok a 0.553s

Why is the allocs/op metric zero?
package main

import (
	"testing"
)

func BenchmarkHello(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := 1
		a++
	}
}
The allocs/op average only counts heap allocations, not stack allocations.
The allocs/op average is rounded down to the nearest integer value.
The Go gc compiler is an optimizing compiler. Since
{
	a := 1
	a++
}
doesn't accomplish anything, it is elided.

The benchmark tool only reports heap allocations. Stack allocations (as determined by escape analysis) are less costly, possibly free, and so are not reported.
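To see the difference, here is a small sketch (the sink variable and the benchmark names are mine, not from the question) in which keeping a pointer to the local forces it to escape to the heap, so the allocation does show up under -benchmem:

package main

import "testing"

// sink is a package-level variable; storing into it keeps the compiler from
// discarding the loop body and forces `a` to escape in the heap variant.
var sink *int

func BenchmarkStack(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := 1
		a++
		_ = a // stays on the stack (or is elided entirely): expect 0 B/op, 0 allocs/op
	}
}

func BenchmarkHeap(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := 1
		a++
		sink = &a // a escapes to the heap: expect roughly 8 B/op, 1 allocs/op on amd64
	}
}

Run with go test -bench=. -benchmem; the heap variant should report one allocation per iteration while the stack variant still reports zero.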
Reference
Why is this simple benchmark showing zero allocations?

Related

Confusing results from golang benchmarking of function and go routine call overhead

Out of curiosity, I am trying to understand what the function and goroutine call overhead is for Go. I therefore wrote the benchmarks below, with the results shown after them. The result for BenchmarkNestedFunctions confuses me, as it seems far too high, so I naturally assume I have done something wrong. I was expecting BenchmarkNestedFunctions to be slightly higher than BenchmarkNopFunc and very close to BenchmarkSplitNestedFunctions. Can anyone suggest what I may be misunderstanding or doing wrong?
package main

import (
	"testing"
)

// Intended to allow me to see the iteration overhead being used in the benchmarking
func BenchmarkTestLoop(b *testing.B) {
	for i := 0; i < b.N; i++ {
	}
}

//go:noinline
func nop() {
}

// Intended to allow me to see the overhead from making a do nothing function call which I hope is not being optimised out
func BenchmarkNopFunc(b *testing.B) {
	for i := 0; i < b.N; i++ {
		nop()
	}
}

// Intended to allow me to see the added cost from creating a channel, closing it and then reading from it
func BenchmarkChannelMakeCloseRead(b *testing.B) {
	for i := 0; i < b.N; i++ {
		done := make(chan struct{})
		close(done)
		_, _ = <-done
	}
}

//go:noinline
func nestedfunction(n int, done chan<- struct{}) {
	n--
	if n > 0 {
		nestedfunction(n, done)
	} else {
		close(done)
	}
}

// Intended to allow me to see the added cost of making 1 function call doing a set of channel operations for each call
func BenchmarkUnnestedFunctions(b *testing.B) {
	for i := 0; i < b.N; i++ {
		done := make(chan struct{})
		nestedfunction(1, done)
		_, _ = <-done
	}
}

// Intended to allow me to see the added cost of repeated nested calls and stack growth with an upper limit on the call depth to allow examination of a particular stack size
func BenchmarkNestedFunctions(b *testing.B) {
	// Max number of nested function calls to prevent excessive stack growth
	const max int = 200000
	if b.N > max {
		b.N = max
	}
	done := make(chan struct{})
	nestedfunction(b.N, done)
	_, _ = <-done
}

// Intended to allow me to see the added cost of repeated nested call with any stack reuse the runtime supports (presuming it doesn't free and the realloc the stack as it grows)
func BenchmarkSplitNestedFunctions(b *testing.B) {
	// Max number of nested function calls to prevent excessive stack growth
	const max int = 200000
	for i := 0; i < b.N; i += max {
		done := make(chan struct{})
		if (b.N - i) > max {
			nestedfunction(max, done)
		} else {
			nestedfunction(b.N-i, done)
		}
		_, _ = <-done
	}
}

// Intended to allow me to see the added cost of spinning up a go routine to perform comparable useful work as the nested function calls
func BenchmarkNestedGoRoutines(b *testing.B) {
	done := make(chan struct{})
	go nestedgoroutines(b.N, done)
	_, _ = <-done
}
The benchmarks are invoked as follows:
$ go test -bench=. -benchmem -benchtime=200ms
goos: windows
goarch: amd64
pkg: golangbenchmarks
cpu: AMD Ryzen 9 3900X 12-Core Processor
BenchmarkTestLoop-24 1000000000 0.2247 ns/op 0 B/op 0 allocs/op
BenchmarkNopFunc-24 170787386 1.402 ns/op 0 B/op 0 allocs/op
BenchmarkChannelMakeCloseRead-24 3990243 52.72 ns/op 96 B/op 1 allocs/op
BenchmarkUnnestedFunctions-24 4791862 58.63 ns/op 96 B/op 1 allocs/op
BenchmarkNestedFunctions-24 200000 50.11 ns/op 0 B/op 0 allocs/op
BenchmarkSplitNestedFunctions-24 155160835 1.528 ns/op 0 B/op 0 allocs/op
BenchmarkNestedGoRoutines-24 636734 412.2 ns/op 24 B/op 1 allocs/op
PASS
ok golangbenchmarks 1.700s
The BenchmarkTestLoop, BenchmarkNopFunc and BenchmarkSplitNestedFunctions results seem reasonably consistent with each other and make sense: BenchmarkSplitNestedFunctions does more work per benchmark operation than BenchmarkNopFunc on average, but not by much, because the expensive channel make/close/read sequence is only performed about once every 200,000 benchmark operations.
Similarly, the BenchmarkChannelMakeCloseRead and BenchmarkUnnestedFunctions results seem consistent with each other, since each BenchmarkUnnestedFunctions iteration does slightly more than a BenchmarkChannelMakeCloseRead iteration, if only by a decrement and an if test, which could potentially cause a branch prediction failure (although I would have hoped the branch predictor could use the last branch result; I don't know how complex the close function implementation is, which may be overwhelming the branch history).
However, BenchmarkNestedFunctions and BenchmarkSplitNestedFunctions are radically different and I don't understand why. They should be similar, with the only intentional difference being any reuse of the grown stack, and I did not expect the stack growth cost to be nearly so high (or is that the explanation, and is it just coincidence that the result is so similar to the BenchmarkChannelMakeCloseRead result, making me think it is not actually doing what I thought it was?).
It should also be noted that the BenchmarkSplitNestedFunctions result can occasionally take significantly different values; I have seen values ranging from 10 to 200 ns/op when running it repeatedly. It can also fail to report any ns/op time at all while still passing; I have no idea what is going on there:
BenchmarkChannelMakeCloseRead-24 5724488 54.26 ns/op 96 B/op 1 allocs/op
BenchmarkUnnestedFunctions-24 3992061 57.49 ns/op 96 B/op 1 allocs/op
BenchmarkNestedFunctions-24 200000 0 B/op 0 allocs/op
BenchmarkNestedFunctions2-24 154956972 1.590 ns/op 0 B/op 0 allocs/op
BenchmarkNestedGoRoutines-24 1000000 342.1 ns/op 24 B/op 1 allocs/op
If anyone can point out my mistake in the benchmark or my interpretation of the results, and explain what is really happening, that would be greatly appreciated.
Background info:
Stack growth and function inlining: https://dave.cheney.net/2020/04/25/inlining-optimisations-in-go
Stack growth limitations: https://dave.cheney.net/2013/06/02/why-is-a-goroutines-stack-infinite
Golang stack structure: https://blog.cloudflare.com/how-stacks-are-handled-in-go/
Branch prediction: https://en.wikipedia.org/wiki/Branch_predictor
Top level 3900X architecture overview: https://www.techpowerup.com/review/amd-ryzen-9-3900x/3.html
3900X branch prediction history/buffer size 16/512/7k: https://www.techpowerup.com/review/amd-ryzen-9-3900x/images/arch3.jpg

Why does my benchmark show same fast performance for ranging over a slice by value vs. index?

type Item struct {
	A int
	B [1024]byte
}

func BenchmarkRange1(b *testing.B) {
	s := make([]Item, 1024)
	for i := 0; i < b.N; i++ {
		for _, v := range s {
			_ = v.A
		}
	}
}

func BenchmarkRange2(b *testing.B) {
	s := make([]Item, 1024)
	for i := 0; i < b.N; i++ {
		for i := range s {
			_ = s[i].A
		}
	}
}
Now, take a look at the result of the benchmark.
go test -bench=BenchmarkRange -benchmem main/copy
goos: darwin
goarch: amd64
pkg: main/copy
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkRange1-12 4577601 260.9 ns/op 0 B/op 0 allocs/op
BenchmarkRange2-12 4697178 254.9 ns/op 0 B/op 0 allocs/op
PASS
ok main/copy 3.391s
Aren't the elements copied when ranging over the slice by value? Why is the performance the same? What optimization does the compiler do when we range over the slice by value?
When I disable compiler optimizations with the option "-gcflags=-N", I get the expected result:
go test -bench=BenchmarkRange -benchmem -gcflags=-N main/copy
goos: darwin
goarch: amd64
pkg: main/copy
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkRange1-12 39004 29481 ns/op 27 B/op 0 allocs/op
BenchmarkRange2-12 777356 1572 ns/op 1 B/op 0 allocs/op
PASS
ok main/copy 3.169s
Can anyone explain how the compiler optimizes this?
With the default optimizations, your inner loop in both BenchmarkRange1 and BenchmarkRange2 is being compiled down to an empty loop with 1024 iterations, as if you had written your inner loop like:
for i := 0; i < 1024; i++ {
}
In both of your examples, the compiler is smart enough to recognize that you aren't doing anything inside the inner loop (that is, not making use of v, v.A, s[i], or s[i].A).
go.godbolt.org is a great resource for looking at the assembly the Go compiler produces. For example, the inner loop in BenchmarkRange1 gets compiled down to the following (which zeros out AX, then loops 1024 times):
XORL AX, AX
Range1_pc39:
INCQ AX
CMPQ AX, $1024
JLT Range1_pc39
You can look at the complete output here, along with handy tooltips that (usually) explain the different assembly instructions:
https://go.godbolt.org/z/raTPjTrYG
(To make your example shorter, I dropped the testing package; the //go:nosplit comments aren't really needed, but slightly simplify the resulting assembly).
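As a follow-up, if you want the copy cost to actually be measured, a common trick is to use the element inside the loop and keep the result alive in a package-level sink so the compiler cannot discard the work. This is a minimal sketch under that assumption (the sink variable and the summation are my additions, reusing the Item type from the question):

var sink int

func BenchmarkRange1Sum(b *testing.B) {
	s := make([]Item, 1024)
	for i := 0; i < b.N; i++ {
		total := 0
		for _, v := range s { // v is a copy of each 1032-byte element
			total += v.A
		}
		sink = total // keep the result live so the loops are not elided
	}
}

func BenchmarkRange2Sum(b *testing.B) {
	s := make([]Item, 1024)
	for i := 0; i < b.N; i++ {
		total := 0
		for i := range s { // index only, no element copy
			total += s[i].A
		}
		sink = total
	}
}

With the result kept alive, the by-value variant typically slows down relative to the index variant, though the compiler may still narrow the copy in some versions, so treat the numbers as indicative rather than definitive.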

Measure heap growth accurately

I am trying to measure the evolution of the number of heap-allocated objects before and after I call a function. I am forcing runtime.GC() and using runtime.ReadMemStats to measure the number of heap objects I have before and after.
The problem I have is that I sometimes see unexpected heap growth. And it is different after each run.
A simple example below, where I would always expect to see a zero heap-objects growth.
https://go.dev/play/p/FBWfXQHClaG
var mem1_before, mem2_before, mem1_after, mem2_after runtime.MemStats

func measure_nothing(before, after *runtime.MemStats) {
	runtime.GC()
	runtime.ReadMemStats(before)
	runtime.GC()
	runtime.ReadMemStats(after)
}

func main() {
	measure_nothing(&mem1_before, &mem1_after)
	measure_nothing(&mem2_before, &mem2_after)
	log.Printf("HeapObjects diff = %d", int64(mem1_after.HeapObjects-mem1_before.HeapObjects))
	log.Printf("HeapAlloc diff %d", int64(mem1_after.HeapAlloc-mem1_before.HeapAlloc))
	log.Printf("HeapObjects diff = %d", int64(mem2_after.HeapObjects-mem2_before.HeapObjects))
	log.Printf("HeapAlloc diff %d", int64(mem2_after.HeapAlloc-mem2_before.HeapAlloc))
}
Sample output:
2009/11/10 23:00:00 HeapObjects diff = 0
2009/11/10 23:00:00 HeapAlloc diff 0
2009/11/10 23:00:00 HeapObjects diff = 4
2009/11/10 23:00:00 HeapAlloc diff 1864
Is what I'm trying to do impractical? I assume the runtime is doing things that allocate/free heap memory. Can I tell it to stop to make my measurements? (This is for a test checking for memory leaks, not production code.)
You can't predict what garbage collection and reading all the memory stats require in the background, so calling those to calculate memory allocations and usage is not a reliable way to measure.
Luckily for us, Go's testing framework can monitor and calculate memory usage.
So what you should do is write a benchmark function and let the testing framework do its job to report memory allocations and usage.
Let's assume we want to measure this foo() function:
var x []int64

func foo(allocs, size int) {
	for i := 0; i < allocs; i++ {
		x = make([]int64, size)
	}
}
All it does is allocate a slice of the given size, and it does this the given number of times (allocs).
Let's write benchmarking functions for different scenarios:
func BenchmarkFoo_0_0(b *testing.B) {
	for i := 0; i < b.N; i++ {
		foo(0, 0)
	}
}

func BenchmarkFoo_1_1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		foo(1, 1)
	}
}

func BenchmarkFoo_2_2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		foo(2, 2)
	}
}
Running the benchmark with go test -bench . -benchmem, the output is:
BenchmarkFoo_0_0-8 1000000000 0.3204 ns/op 0 B/op 0 allocs/op
BenchmarkFoo_1_1-8 67101626 16.58 ns/op 8 B/op 1 allocs/op
BenchmarkFoo_2_2-8 27375050 42.42 ns/op 32 B/op 2 allocs/op
As you can see, the number of allocations per function call is the same as what we pass in the allocs argument. The allocated memory is the expected allocs * size * 8 bytes.
Note that the reported allocations per op is an integer value (it's the result of an integer division), so if the benchmarked function only occasionally allocates, it might not be reported in the integer result. For details, see Output from benchmem.
Like in this example:
var x []int64

func bar() {
	if rand.Float64() < 0.3 {
		x = make([]int64, 10)
	}
}
This bar() function does 1 allocation with 30% probability (and none with 70% probability), which means on average it does 0.3 allocations. Benchmarking it:
func BenchmarkBar(b *testing.B) {
	for i := 0; i < b.N; i++ {
		bar()
	}
}
Output is:
BenchmarkBar-8 38514928 29.60 ns/op 24 B/op 0 allocs/op
We can see there is 24 bytes allocation (0.3 * 10 * 8 bytes), which is correct, but the reported allocations per op is 0.
Luckily for us, we can also benchmark a function from our main app using the testing.Benchmark() function. It returns a testing.BenchmarkResult including all details about memory usage. We have access to the total number of allocations and to the number of iterations, so we can calculate allocations per op using floating point numbers:
func main() {
	rand.Seed(time.Now().UnixNano())
	tr := testing.Benchmark(BenchmarkBar)
	fmt.Println("Allocs/op", tr.AllocsPerOp())
	fmt.Println("B/op", tr.AllocedBytesPerOp())
	fmt.Println("Precise allocs/op:", float64(tr.MemAllocs)/float64(tr.N))
}
This will output:
Allocs/op 0
B/op 24
Precise allocs/op: 0.3000516369276302
We can see the expected ~0.3 allocations per op.
Now if we go ahead and benchmark your measure_nothing() function:
func BenchmarkNothing(b *testing.B) {
	for i := 0; i < b.N; i++ {
		measure_nothing(&mem1_before, &mem1_after)
	}
}
We get this output:
Allocs/op 0
B/op 11
Precise allocs/op: 0.12182030338389732
As you can see, running the garbage collector twice and reading memory stats twice occasionally needs allocation (~1 out of 10 calls: 0.12 times on average).
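As an aside, since the goal is a leak-checking test, the standard library also has testing.AllocsPerRun, which handles the warm-up and GOMAXPROCS details for you. Note that its result, although typed float64, is always an integral value (it uses integer division), so it suits asserting a fixed per-call allocation count rather than fractional averages like bar()'s ~0.3. A minimal sketch reusing the foo() function from above:

package main

import (
	"fmt"
	"testing"
)

var x []int64

func foo(allocs, size int) {
	for i := 0; i < allocs; i++ {
		x = make([]int64, size)
	}
}

func main() {
	// AllocsPerRun warms the function up once, then calls it `runs` times
	// and returns the average number of allocations per call.
	avg := testing.AllocsPerRun(1000, func() { foo(1, 1) })
	fmt.Println("allocs per call:", avg) // expect 1
}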

Why are fixed-size slices not cheaper to allocate than a variable-sized bytes.Buffer?

Here is my test code:
package app

import (
	"bytes"
	"testing"
)

const ALLOC_SIZE = 64 * 1024

func BenchmarkFunc1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := make([]byte, ALLOC_SIZE)
		fill(v, '1', 0, ALLOC_SIZE)
	}
}

func BenchmarkFunc2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		b := new(bytes.Buffer)
		b.Grow(ALLOC_SIZE)
		fill(b.Bytes(), '2', 0, ALLOC_SIZE)
	}
}

func fill(slice []byte, val byte, start, end int) {
	for i := start; i < end; i++ {
		slice = append(slice, val)
	}
}
Result:
at 19:05:47 ❯ go test -bench . -benchmem -gcflags=-m
# app [app.test]
./main_test.go:25:6: can inline fill
./main_test.go:10:6: can inline BenchmarkFunc1
./main_test.go:13:7: inlining call to fill
./main_test.go:20:9: inlining call to bytes.(*Buffer).Grow
./main_test.go:21:15: inlining call to bytes.(*Buffer).Bytes
./main_test.go:21:7: inlining call to fill
./main_test.go:10:21: b does not escape
./main_test.go:12:12: make([]byte, ALLOC_SIZE) escapes to heap
./main_test.go:20:9: BenchmarkFunc2 ignoring self-assignment in bytes.b.buf = bytes.b.buf[:bytes.m·3]
./main_test.go:17:21: b does not escape
./main_test.go:19:11: new(bytes.Buffer) does not escape
./main_test.go:25:11: slice does not escape
# app.test
/var/folders/45/vh6dxx396d590hxtz7_9_smmhqf0sq/T/go-build1328509211/b001/_testmain.go:35:6: can inline init.0
/var/folders/45/vh6dxx396d590hxtz7_9_smmhqf0sq/T/go-build1328509211/b001/_testmain.go:43:24: inlining call to testing.MainStart
/var/folders/45/vh6dxx396d590hxtz7_9_smmhqf0sq/T/go-build1328509211/b001/_testmain.go:43:42: testdeps.TestDeps{} escapes to heap
/var/folders/45/vh6dxx396d590hxtz7_9_smmhqf0sq/T/go-build1328509211/b001/_testmain.go:43:24: &testing.M{...} escapes to heap
goos: darwin
goarch: amd64
pkg: app
cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
BenchmarkFunc1-8 8565 118348 ns/op 393217 B/op 4 allocs/op
BenchmarkFunc2-8 23332 53043 ns/op 65536 B/op 1 allocs/op
PASS
ok app 2.902s
My assumption was that a fixed-size slice created by make would be much cheaper than bytes.Buffer, because the compiler can know the size of the memory to be allocated at compile time, whereas bytes.Buffer looks like more of a runtime thing. However, the result is not what I expected.
Can anyone explain this?
You are confusing capacity and length of slices.
v := make([]byte, ALLOC_SIZE)
v is now a slice with length 64k and capacity 64k. Appending anything to this slice forces Go to copy the backing array into a new, larger one.
b := new(bytes.Buffer)
b.Grow(ALLOC_SIZE)
v := b.Bytes()
Here, v is a slice with length zero and capacity 64k. You can append 64k bytes to this slice without any reallocation, because it is initially empty but the 64k backing array is ready to be used.
In summary, you are comparing a slice that is already filled to capacity to an empty slice with the same capacity.
To make a fair comparison change your first benchmark to allocate an empty slice as well:
func BenchmarkFunc1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v := make([]byte, 0, ALLOC_SIZE) // note the three argument form
		fill(v, '1', 0, ALLOC_SIZE)
	}
}
goos: linux
goarch: amd64
pkg: foo
cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
BenchmarkFunc1-8 23540 51990 ns/op 65536 B/op 1 allocs/op
BenchmarkFunc2-8 24939 45096 ns/op 65536 B/op 1 allocs/op
The relationship between slices, arrays, length, and capacity is explained in great detail in https://blog.golang.org/slices-intro
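To see the length/capacity distinction directly, here is a small illustration of my own (not from the answer) printing len and cap for each of the three setups:

package main

import (
	"bytes"
	"fmt"
)

const ALLOC_SIZE = 64 * 1024

func main() {
	v1 := make([]byte, ALLOC_SIZE)    // length 64k, capacity 64k: append must reallocate
	v2 := make([]byte, 0, ALLOC_SIZE) // length 0, capacity 64k: append reuses the backing array

	var b bytes.Buffer
	b.Grow(ALLOC_SIZE)
	v3 := b.Bytes() // length 0, capacity at least 64k: same situation as v2

	fmt.Println(len(v1), cap(v1)) // 65536 65536
	fmt.Println(len(v2), cap(v2)) // 0 65536
	fmt.Println(len(v3), cap(v3)) // 0 and at least 65536
}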

in golang, is there any performance difference between maps initialized using make vs {}

As we know, there are two ways to initialize a map (as listed below). I'm wondering if there is any performance difference between the two approaches.
var myMap map[string]int
then
myMap = map[string]int{}
vs
myMap = make(map[string]int)
On my machine they appear to be about equivalent.
You can easily make a benchmark test to compare. For example:
package bench

import "testing"

var result map[string]int

func BenchmarkMakeLiteral(b *testing.B) {
	var m map[string]int
	for n := 0; n < b.N; n++ {
		m = InitMapLiteral()
	}
	result = m
}

func BenchmarkMakeMake(b *testing.B) {
	var m map[string]int
	for n := 0; n < b.N; n++ {
		m = InitMapMake()
	}
	result = m
}

func InitMapLiteral() map[string]int {
	return map[string]int{}
}

func InitMapMake() map[string]int {
	return make(map[string]int)
}
Which on 3 different runs yielded results that are close enough to be insignificant:
First Run
$ go test -bench=.
testing: warning: no tests to run
PASS
BenchmarkMakeLiteral-8 10000000 160 ns/op
BenchmarkMakeMake-8 10000000 171 ns/op
ok github.com/johnweldon/bench 3.664s
Second Run
$ go test -bench=.
testing: warning: no tests to run
PASS
BenchmarkMakeLiteral-8 10000000 182 ns/op
BenchmarkMakeMake-8 10000000 173 ns/op
ok github.com/johnweldon/bench 3.945s
Third Run
$ go test -bench=.
testing: warning: no tests to run
PASS
BenchmarkMakeLiteral-8 10000000 170 ns/op
BenchmarkMakeMake-8 10000000 170 ns/op
ok github.com/johnweldon/bench 3.751s
When allocating empty maps there is no difference, but with make you can pass a second parameter to pre-allocate space in the map. This saves a lot of reallocations when maps are being populated.
Benchmarks
package maps

import "testing"

const SIZE = 10000

func fill(m map[int]bool, size int) {
	for i := 0; i < size; i++ {
		m[i] = true
	}
}

func BenchmarkEmpty(b *testing.B) {
	for n := 0; n < b.N; n++ {
		m := make(map[int]bool)
		fill(m, SIZE)
	}
}

func BenchmarkAllocated(b *testing.B) {
	for n := 0; n < b.N; n++ {
		m := make(map[int]bool, 2*SIZE)
		fill(m, SIZE)
	}
}
Results
go test -benchmem -bench .
BenchmarkEmpty-8 500 2988680 ns/op 431848 B/op 625 allocs/op
BenchmarkAllocated-8 1000 1618251 ns/op 360949 B/op 11 allocs/op
A year ago I actually stumbled on the fact that using make with explicitly allocated space is better than using a map literal if your values are not static.
So doing
return map[string]float64{
	"key1": SOME_COMPUTED_ABOVE_VALUE,
	"key2": SOME_COMPUTED_ABOVE_VALUE,
	// more keys here
	"keyN": SOME_COMPUTED_ABOVE_VALUE,
}
is slower than
// some code above
result := make(map[string]float64, SIZE) // SIZE >= N
result["key1"] = SOME_COMPUTED_ABOVE_VALUE
result["key2"] = SOME_COMPUTED_ABOVE_VALUE
// more keys here
result["keyN"] = SOME_COMPUTED_ABOVE_VALUE
return result
for N which are quite big (N=300 in my use case).
The reason is the compiler fails to understand that one needs to allocate at least N slots in the first case.
I wrote a blog post about it
https://trams.github.io/golang-map-literal-performance/
and I reported a bug to the community
https://github.com/golang/go/issues/43020
As of Go 1.17 it is still an issue.
