Here is a simple Go benchmark test; it runs x++ in three different ways:
in a simple for loop with x declared inside the function
in a nested loop with x declared inside the function
in a nested loop with x declared as a global variable
package main
import (
"testing"
)
var x = 0
func BenchmarkLoop(b *testing.B) {
x := 0
for n := 0; n < b.N; n++ {
x++
}
}
func BenchmarkDoubleLoop(b *testing.B) {
x := 0
for n := 0; n < b.N/1000; n++ {
for m := 0; m < 1000; m++ {
x++
}
}
}
func BenchmarkDoubleLoopGlobalVariable(b *testing.B) {
for n := 0; n < b.N/1000; n++ {
for m := 0; m < 1000; m++ {
x++
}
}
}
And the result is as follows:
$ go test -bench=.
BenchmarkLoop-8 2000000000 0.32 ns/op
BenchmarkDoubleLoop-8 2000000000 0.34 ns/op
BenchmarkDoubleLoopGlobalVariable-8 2000000000 2.00 ns/op
PASS
ok github.com/cizixs/playground/loop-perf 5.597s
Obviously, the first and second methods have similar performance, while the third function is much slower (about 6x).
And I wonder why this is happening. Is there a way to improve the performance of global variable access?
I wonder why this is happening.
The compiler optimizes away your whole code. Roughly 0.3 ns (300 ps) per op means only a no-op was "executed".
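One common way to keep the work from being optimized away (a sketch of my own, not part of the original answer) is to store the loop's result in a package-level sink variable, so the compiler must treat it as used; the names sink and BenchmarkLoopKept are just illustrative:
var sink int

func BenchmarkLoopKept(b *testing.B) {
    x := 0
    for n := 0; n < b.N; n++ {
        x++
    }
    // Assigning the result to a package-level variable prevents the
    // compiler from discarding the increment as a dead store.
    sink = x
}
This keeps the increment from being eliminated, which is the usual first step before trusting numbers this small.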
Related
My code:
// repeat fib(n) 10000 times
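// (n, f0, and f1 are assumed to be declared earlier; f0.Add suggests f0 and f1 are math/big values)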
i := 10000
var total_time time.Duration
for i > 0 {
// do fib(n) -> f0
start := time.Now()
for n > 0 {
f0, f1, n = f1, f0.Add(f0, f1), n-1
}
total_time = total_time + time.Since(start)
i--
}
// and divide total execution time by 10000
var normalized_time = total_time / 10000
fmt.Println(normalized_time)
The execution times I'm seeing are so extremely short that I am suspicious that what I've done isn't useful. If it's wrong, what am I doing wrong and how can I make it right?
what am I doing wrong and how can I make it right?
Use the Go testing package for benchmarks. For example:
Write the Fibonacci number computation as a function in your code.
fibonacci.go:
package main
import "fmt"
// fibonacci returns the Fibonacci number for 0 <= n <= 92.
// OEIS: A000045: Fibonacci numbers:
// F(n) = F(n-1) + F(n-2) with F(0) = 0 and F(1) = 1.
func fibonacci(n int) int64 {
if n < 0 {
panic("n < 0")
}
f := int64(0)
a, b := int64(0), int64(1)
for i := 0; i <= n; i++ {
if a < 0 {
panic("overflow")
}
f, a, b = a, b, a+b
}
return f
}
func main() {
for _, n := range []int{0, 1, 2, 3, 90, 91, 92} {
fmt.Printf("%-2d %d\n", n, fibonacci(n))
}
}
Playground: https://play.golang.org/p/FFdG4RlNpUZ
Output:
$ go run fibonacci.go
0 0
1 1
2 1
3 2
90 2880067194370816120
91 4660046610375530309
92 7540113804746346429
$
Write and run some benchmarks using the Go testing package.
fibonacci_test.go:
package main
import "testing"
func BenchmarkFibonacciN0(b *testing.B) {
for i := 0; i < b.N; i++ {
fibonacci(0)
}
}
func BenchmarkFibonacciN92(b *testing.B) {
for i := 0; i < b.N; i++ {
fibonacci(92)
}
}
Output:
$ go test fibonacci.go fibonacci_test.go -bench=. -benchmem
goos: linux
goarch: amd64
BenchmarkFibonacciN0-4 367003574 3.25 ns/op 0 B/op 0 allocs/op
BenchmarkFibonacciN92-4 17369262 63.0 ns/op 0 B/op 0 allocs/op
$
func Benchmark_foreach1(b *testing.B) {
var test map[int]int
test = make(map[int]int)
for i := 0; i < 100000; i++ {
test[i] = 1
}
for i := 0; i < b.N; i++ {
for i, _ := range test {
if test[i] != 1 {
panic("ds")
}
}
}
}
func Benchmark_foreach2(b *testing.B) {
var test map[int]int
test = make(map[int]int)
for i := 0; i < 100000; i++ {
test[i] = 1
}
for i := 0; i < b.N; i++ {
for _, v := range test {
if v != 1 {
panic("heh")
}
}
}
}
Run with results as below:
goos: linux
goarch: amd64
Benchmark_foreach1-2 500 3172323 ns/op
Benchmark_foreach2-2 1000 1707214 ns/op
why is foreach-2 slow?
I think Benchmark_foreach2-2 is about 2 times faster: it requires 1707214 nanoseconds per operation, while the first one takes 3172323. So the second one is 3172323 / 1707214 = 1.85 times faster.
Reason: the second benchmark doesn't need to read the value from the map again; it already has the value in the v variable.
The test[k] expression in BenchmarkForeachK takes time to look the value up in the map again, so BenchmarkForeachK takes more time than BenchmarkForeachV: 9362945 ns/op versus 4213940 ns/op.
For example,
package main
import "testing"
func testMap() map[int]int {
test := make(map[int]int)
for i := 0; i < 100000; i++ {
test[i] = 1
}
return test
}
func BenchmarkForeachK(b *testing.B) {
test := testMap()
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
for k := range test {
if test[k] != 1 {
panic("eh")
}
}
}
}
func BenchmarkForeachV(b *testing.B) {
test := testMap()
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
for _, v := range test {
if v != 1 {
panic("heh")
}
}
}
}
Output:
$ go test foreach_test.go -bench=.
BenchmarkForeachK-4 200 9362945 ns/op 0 B/op 0 allocs/op
BenchmarkForeachV-4 300 4213940 ns/op 0 B/op 0 allocs/op
I've been trying to understand slice preallocation with make and why it's a good idea. I noticed a large performance difference between preallocating a slice and appending to it vs just initializing it with 0 length/capacity and then appending to it. I wrote a set of very simple benchmarks:
import "testing"
func BenchmarkNoPreallocate(b *testing.B) {
for i := 0; i < b.N; i++ {
// Don't preallocate our initial slice
init := []int64{}
init = append(init, 5)
}
}
func BenchmarkPreallocate(b *testing.B) {
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, 0, 1)
init = append(init, 5)
}
}
and was a little puzzled by the results:
$ go test -bench=. -benchmem
goos: linux
goarch: amd64
BenchmarkNoPreallocate-4 30000000 41.8 ns/op 8 B/op 1 allocs/op
BenchmarkPreallocate-4 2000000000 0.29 ns/op 0 B/op 0 allocs/op
I have a couple of questions:
Why are there no allocations (it shows 0 allocs/op) in the preallocation benchmark case? Certainly we're preallocating, but the allocation had to have happened at some point.
I imagine this may become clearer after the first question is answered, but how is the preallocation case so much quicker? Am I misinterpreting this benchmark?
Please let me know if anything is unclear. Thank you!
Go has an optimizing compiler. Constants are evaluated at compile time; variables are evaluated at runtime. Constant values can be used to optimize the code the compiler generates: with a constant length and capacity of 1, escape analysis can prove that the tiny backing array never leaves the function, so it is kept on the stack (or optimized away entirely) instead of being heap allocated, which is why the preallocating benchmark reports 0 allocs/op. For example,
package main
import "testing"
func BenchmarkNoPreallocate(b *testing.B) {
for i := 0; i < b.N; i++ {
// Don't preallocate our initial slice
init := []int64{}
init = append(init, 5)
}
}
func BenchmarkPreallocateConst(b *testing.B) {
const (
l = 0
c = 1
)
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, l, c)
init = append(init, 5)
}
}
func BenchmarkPreallocateVar(b *testing.B) {
var (
l = 0
c = 1
)
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, l, c)
init = append(init, 5)
}
}
Output:
$ go test alloc_test.go -bench=. -benchmem
BenchmarkNoPreallocate-4 50000000 39.3 ns/op 8 B/op 1 allocs/op
BenchmarkPreallocateConst-4 2000000000 0.36 ns/op 0 B/op 0 allocs/op
BenchmarkPreallocateVar-4 50000000 28.2 ns/op 8 B/op 1 allocs/op
Another interesting set of benchmarks:
package main
import "testing"
func BenchmarkNoPreallocate(b *testing.B) {
const (
l = 0
c = 8 * 1024
)
for i := 0; i < b.N; i++ {
// Don't preallocate our initial slice
init := []int64{}
for j := 0; j < c; j++ {
init = append(init, 42)
}
}
}
func BenchmarkPreallocateConst(b *testing.B) {
const (
l = 0
c = 8 * 1024
)
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, l, c)
for j := 0; j < cap(init); j++ {
init = append(init, 42)
}
}
}
func BenchmarkPreallocateVar(b *testing.B) {
var (
l = 0
c = 8 * 1024
)
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, l, c)
for j := 0; j < cap(init); j++ {
init = append(init, 42)
}
}
}
Output:
$ go test peter_test.go -bench=. -benchmem
BenchmarkNoPreallocate-4 20000 75656 ns/op 287992 B/op 19 allocs/op
BenchmarkPreallocateConst-4 100000 22386 ns/op 65536 B/op 1 allocs/op
BenchmarkPreallocateVar-4 100000 22112 ns/op 65536 B/op 1 allocs/op
As we know, there are two ways to initialize a map (as listed below). I'm wondering if there is any performance difference between the two approaches.
var myMap map[string]int
then
myMap = map[string]int{}
vs
myMap = make(map[string]int)
On my machine they appear to be about equivalent.
You can easily make a benchmark test to compare. For example:
package bench
import "testing"
var result map[string]int
func BenchmarkMakeLiteral(b *testing.B) {
var m map[string]int
for n := 0; n < b.N; n++ {
m = InitMapLiteral()
}
result = m
}
func BenchmarkMakeMake(b *testing.B) {
var m map[string]int
for n := 0; n < b.N; n++ {
m = InitMapMake()
}
result = m
}
func InitMapLiteral() map[string]int {
return map[string]int{}
}
func InitMapMake() map[string]int {
return make(map[string]int)
}
On three different runs this yielded results that are close enough to be insignificant:
First Run
$ go test -bench=.
testing: warning: no tests to run
PASS
BenchmarkMakeLiteral-8 10000000 160 ns/op
BenchmarkMakeMake-8 10000000 171 ns/op
ok github.com/johnweldon/bench 3.664s
Second Run
$ go test -bench=.
testing: warning: no tests to run
PASS
BenchmarkMakeLiteral-8 10000000 182 ns/op
BenchmarkMakeMake-8 10000000 173 ns/op
ok github.com/johnweldon/bench 3.945s
Third Run
$ go test -bench=.
testing: warning: no tests to run
PASS
BenchmarkMakeLiteral-8 10000000 170 ns/op
BenchmarkMakeMake-8 10000000 170 ns/op
ok github.com/johnweldon/bench 3.751s
When allocating empty maps there is no difference, but with make you can pass a second parameter to pre-allocate space in the map. This saves a lot of reallocations when the map is being populated.
Benchmarks
package maps
import "testing"
const SIZE = 10000
func fill(m map[int]bool, size int) {
for i := 0; i < size; i++ {
m[i] = true
}
}
func BenchmarkEmpty(b *testing.B) {
for n := 0; n < b.N; n++ {
m := make(map[int]bool)
fill(m, SIZE)
}
}
func BenchmarkAllocated(b *testing.B) {
for n := 0; n < b.N; n++ {
m := make(map[int]bool, 2*SIZE)
fill(m, SIZE)
}
}
Results
go test -benchmem -bench .
BenchmarkEmpty-8 500 2988680 ns/op 431848 B/op 625 allocs/op
BenchmarkAllocated-8 1000 1618251 ns/op 360949 B/op 11 allocs/op
A year ago I actually stumbled on the fact that using make with explicitly pre-allocated space is better than using a map literal if your values are not static.
So doing
return map[string]float64{
"key1": SOME_COMPUTED_ABOVE_VALUE,
"key2": SOME_COMPUTED_ABOVE_VALUE,
// more keys here
"keyN": SOME_COMPUTED_ABOVE_VALUE,
}
is slower than
// some code above
result := make(map[string]float64, SIZE) // SIZE >= N
result["key1"] = SOME_COMPUTED_ABOVE_VALUE
result["key2"] = SOME_COMPUTED_ABOVE_VALUE
// more keys here
result["keyN"] = SOME_COMPUTED_ABOVE_VALUE
return result
for N that is quite big (N=300 in my use case).
The reason is that the compiler fails to understand that it needs to allocate at least N slots in the first case.
I wrote a blog post about it
https://trams.github.io/golang-map-literal-performance/
and I reported a bug to the community
https://github.com/golang/go/issues/43020
As of Go 1.17 it is still an issue.
Can someone please point me to a more efficient version of the following?
b := make([]byte, 0, sizeTotal)
b = append(b, size...)
b = append(b, contentType...)
b = append(b, lenCallbackid...)
b = append(b, lenTarget...)
b = append(b, lenAction...)
b = append(b, lenContent...)
b = append(b, callbackid...)
b = append(b, target...)
b = append(b, action...)
b = append(b, content...)
Every variable is a byte slice, apart from sizeTotal, which is an int.
Update:
Code:
type Message struct {
size uint32
contentType uint8
callbackId string
target string
action string
content string
}
var res []byte
var b []byte = make([]byte,0,4096)
func (m *Message)ToByte()[]byte{
callbackIdIntLen:=len(m.callbackId)
targetIntLen := len(m.target)
actionIntLen := len(m.action)
contentIntLen := len(m.content)
lenCallbackid:=make([]byte,4)
binary.LittleEndian.PutUint32(lenCallbackid, uint32(callbackIdIntLen))
callbackid := []byte(m.callbackId)
lenTarget := make([]byte,4)
binary.LittleEndian.PutUint32(lenTarget, uint32(targetIntLen))
target:=[]byte(m.target)
lenAction := make([]byte,4)
binary.LittleEndian.PutUint32(lenAction, uint32(actionIntLen))
action := []byte(m.action)
lenContent:= make([]byte,4)
binary.LittleEndian.PutUint32(lenContent, uint32(contentIntLen))
content := []byte(m.content)
sizeTotal:= 21+callbackIdIntLen+targetIntLen+actionIntLen+contentIntLen
size := make([]byte,4)
binary.LittleEndian.PutUint32(size, uint32(sizeTotal))
b=b[:0]
b=append(b,size...)
b=append(b,byte(m.contentType))
b=append(b,lenCallbackid...)
b=append(b,lenTarget...)
b=append(b,lenAction...)
b=append(b,lenContent...)
b=append(b,callbackid...)
b=append(b,target...)
b=append(b,action...)
b=append(b,content...)
res = b
return b
}
func FromByte(bytes []byte)(*Message){
size :=binary.LittleEndian.Uint32(bytes[0:4])
contentType :=bytes[4:5][0]
lenCallbackid:=binary.LittleEndian.Uint32(bytes[5:9])
lenTarget :=binary.LittleEndian.Uint32(bytes[9:13])
lenAction :=binary.LittleEndian.Uint32(bytes[13:17])
lenContent :=binary.LittleEndian.Uint32(bytes[17:21])
callbackid := string(bytes[21:21+lenCallbackid])
target:= string(bytes[21+lenCallbackid:21+lenCallbackid+lenTarget])
action:= string(bytes[21+lenCallbackid+lenTarget:21+lenCallbackid+lenTarget+lenAction])
content:=string(bytes[size-lenContent:size])
return &Message{size,contentType,callbackid,target,action,content}
}
Benchmarks:
func BenchmarkMessageToByte(b *testing.B) {
m:=NewMessage(uint8(3),"agsdggsdasagdsdgsgddggds","sometarSFAFFget","somFSAFSAFFSeaction","somfasfsasfafsejsonzhit")
for n := 0; n < b.N; n++ {
m.ToByte()
}
}
func BenchmarkMessageFromByte(b *testing.B) {
m:=NewMessage(uint8(1),"sagdsgaasdg","soSASFASFASAFSFASFAGmetarget","adsgdgsagdssgdsgd","agsdsdgsagdsdgasdg").ToByte()
for n := 0; n < b.N; n++ {
FromByte(m)
}
}
func BenchmarkStringToByte(b *testing.B) {
for n := 0; n < b.N; n++ {
_ = []byte("abcdefghijklmnoqrstuvwxyz")
}
}
func BenchmarkStringFromByte(b *testing.B) {
s:=[]byte("abcdefghijklmnoqrstuvwxyz")
for n := 0; n < b.N; n++ {
_ = string(s)
}
}
func BenchmarkUintToByte(b *testing.B) {
for n := 0; n < b.N; n++ {
i:=make([]byte,4)
binary.LittleEndian.PutUint32(i, uint32(99))
}
}
func BenchmarkUintFromByte(b *testing.B) {
i:=make([]byte,4)
binary.LittleEndian.PutUint32(i, uint32(99))
for n := 0; n < b.N; n++ {
binary.LittleEndian.Uint32(i)
}
}
Bench results:
BenchmarkMessageToByte 10000000 280 ns/op
BenchmarkMessageFromByte 10000000 293 ns/op
BenchmarkStringToByte 50000000 55.1 ns/op
BenchmarkStringFromByte 50000000 49.7 ns/op
BenchmarkUintToByte 1000000000 2.14 ns/op
BenchmarkUintFromByte 2000000000 1.71 ns/op
Provided the memory is already allocated, a sequence of x = append(x, a...) calls is rather efficient in Go.
In your example, the initial allocation (make) probably costs more than the sequence of appends. It depends on the size of the fields. Consider the following benchmark:
package main
import (
"testing"
)
const sizeTotal = 25
var res []byte // To enforce heap allocation
func BenchmarkWithAlloc(b *testing.B) {
a := []byte("abcde")
for i := 0; i < b.N; i++ {
x := make([]byte, 0, sizeTotal)
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
res = x // Make sure x escapes, and is therefore heap allocated
}
}
func BenchmarkWithoutAlloc(b *testing.B) {
a := []byte("abcde")
x := make([]byte, 0, sizeTotal)
for i := 0; i < b.N; i++ {
x = x[:0]
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
res = x
}
}
On my box, the result is:
testing: warning: no tests to run
PASS
BenchmarkWithAlloc 10000000 116 ns/op 32 B/op 1 allocs/op
BenchmarkWithoutAlloc 50000000 24.0 ns/op 0 B/op 0 allocs/op
Systematically reallocating the buffer (even a small one) makes this benchmark at least 5 times slower.
So your best hope to optimize this code is to make sure you do not reallocate a buffer for each packet you build. Instead, you should keep your buffer around and reuse it for each marshalling operation.
You can reset a slice while keeping its underlying buffer allocated with the following statement:
x = x[:0]
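To illustrate that advice, here is a rough sketch (my own, not from the original answer) of a variant of ToByte that appends directly into a caller-supplied, reusable buffer and avoids the temporary 4-byte slices; it assumes the Message type and the encoding/binary import from the question, and the names appendUint32 and AppendTo are hypothetical:
// appendUint32 appends v to b in little-endian order using a small
// stack scratch array instead of a separate make([]byte, 4).
func appendUint32(b []byte, v uint32) []byte {
    var tmp [4]byte
    binary.LittleEndian.PutUint32(tmp[:], v)
    return append(b, tmp[:]...)
}

// AppendTo marshals m onto buf and returns the extended slice; the
// caller can keep passing the same buffer to avoid reallocations.
func (m *Message) AppendTo(buf []byte) []byte {
    sizeTotal := 21 + len(m.callbackId) + len(m.target) + len(m.action) + len(m.content)
    buf = appendUint32(buf, uint32(sizeTotal))
    buf = append(buf, m.contentType)
    buf = appendUint32(buf, uint32(len(m.callbackId)))
    buf = appendUint32(buf, uint32(len(m.target)))
    buf = appendUint32(buf, uint32(len(m.action)))
    buf = appendUint32(buf, uint32(len(m.content)))
    buf = append(buf, m.callbackId...)
    buf = append(buf, m.target...)
    buf = append(buf, m.action...)
    buf = append(buf, m.content...)
    return buf
}
A caller would then write buf = m.AppendTo(buf[:0]) for each message, keeping a single backing array alive across calls, just like the b = b[:0] line in the question's ToByte.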
I looked carefully at that and made the following benchmarks.
package append
import "testing"
func BenchmarkAppend(b *testing.B) {
as := 1000
a := make([]byte, as)
s := make([]byte, 0, b.N*as)
for i := 0; i < b.N; i++ {
s = append(s, a...)
}
}
func BenchmarkCopy(b *testing.B) {
as := 1000
a := make([]byte, as)
s := make([]byte, 0, b.N*as)
for i := 0; i < b.N; i++ {
copy(s[i*as:(i+1)*as], a)
}
}
The results are
grzesiek#klapacjusz ~/g/s/t/append> go test -bench . -benchmem
testing: warning: no tests to run
PASS
BenchmarkAppend 10000000 202 ns/op 1000 B/op 0 allocs/op
BenchmarkCopy 10000000 201 ns/op 1000 B/op 0 allocs/op
ok test/append 4.564s
If sizeTotal is big enough, then your code makes no memory allocations. It copies only the bytes it needs to copy. It is perfectly fine.