Am I doing execution timing measurement in Go in a useful way? - go

My code:
// repeat fib(n) 10000 times
i := 10000
var total_time time.Duration
for i > 0 {
// do fib(n) -> f0
start := time.Now()
for n > 0 {
f0, f1, n = f1, f0.Add(f0, f1), n-1
}
total_time = total_time + time.Since(start)
i--
}
// and divide total execution time by 10000
var normalized_time = total_time / 10000
fmt.Println(normalized_time)
The execution times I'm seeing are so extremely short that I am suspicious that what I've done isn't useful. If it's wrong, what am I doing wrong and how can I make it right?

what am I doing wrong and how can I make it right?
Use the Go testing package for benchmarks. For example:
Write the Fibonacci number computation as a function in your code.
fibonacci.go:
package main
import "fmt"
// fibonacci returns the Fibonacci number for 0 <= n <= 92.
// OEIS: A000045: Fibonacci numbers:
// F(n) = F(n-1) + F(n-2) with F(0) = 0 and F(1) = 1.
func fibonacci(n int) int64 {
if n < 0 {
panic("n < 0")
}
f := int64(0)
a, b := int64(0), int64(1)
for i := 0; i <= n; i++ {
if a < 0 {
panic("overflow")
}
f, a, b = a, b, a+b
}
return f
}
func main() {
for _, n := range []int{0, 1, 2, 3, 90, 91, 92} {
fmt.Printf("%-2d %d\n", n, fibonacci(n))
}
}
Playground: https://play.golang.org/p/FFdG4RlNpUZ
Output:
$ go run fibonacci.go
0 0
1 1
2 1
3 2
90 2880067194370816120
91 4660046610375530309
92 7540113804746346429
$
Write and run some benchmarks using the Go testing package.
fibonacci_test.go:
package main
import "testing"
func BenchmarkFibonacciN0(b *testing.B) {
for i := 0; i < b.N; i++ {
fibonacci(0)
}
}
func BenchmarkFibonacciN92(b *testing.B) {
for i := 0; i < b.N; i++ {
fibonacci(92)
}
}
Output:
$ go test fibonacci.go fibonacci_test.go -bench=. -benchmem
goos: linux
goarch: amd64
BenchmarkFibonacciN0-4 367003574 3.25 ns/op 0 B/op 0 allocs/op
BenchmarkFibonacciN92-4 17369262 63.0 ns/op 0 B/op 0 allocs/op
$

Related

Different Benchmark results for functions with same early return statement

I've implemented two functions for rotating an input array n times to the right.
Both implementations exit early, without performing anything, if the initial array would be equal to the resulting array after all rotations. This happens anytime n is 0 or a multiple of the length of the array.
func rotateRight1(nums []int, n int) {
n = n % len(nums)
if n == 0 {
return
}
lastNDigits := make([]int, n)
copy(lastNDigits, nums[len(nums)-n:])
copy(nums[n:], nums[:len(nums)-n])
copy(nums[:n], lastNDigits)
}
and
func rotateRight2(nums []int, n int) {
n = n % len(nums)
if n == 0 {
return
}
i := 0
current := nums[i]
iAlreadySeen := i
for j := 0; j < len(nums); j++ {
nextI := (i + n) % len(nums)
nums[nextI], current = current, nums[nextI]
i = nextI
// handle even length arrays where i+k might equal an already seen index
if nextI == iAlreadySeen {
i = (i + 1) % len(nums)
iAlreadySeen = i
current = nums[i]
}
}
}
When benchmarking, I was surprised to see a +20x difference in speed when n equals 0 for both functions.
func BenchmarkRotateRight1(b *testing.B) {
nums := make([]int, 5_000)
b.ResetTimer()
b.ReportAllocs()
for i := 0; i < b.N; i++ {
rotateRight1(nums, 0)
}
}
func BenchmarkRotateRight2(b *testing.B) {
nums := make([]int, 5_000)
b.ResetTimer()
b.ReportAllocs()
for i := 0; i < b.N; i++ {
rotateRight2(nums, 0)
}
}
go test -bench=. yields a result like this consistently:
cpu: Intel(R) Core(TM) i7-7500U CPU # 2.70GHz
BenchmarkRotateRight1-4 1000000000 0.4603 ns/op 0 B/op 0 allocs/op
BenchmarkRotateRight2-4 97236492 12.11 ns/op 0 B/op 0 allocs/op
PASS
I don't understand this performance difference as both functions are basically doing the same thing and exiting early in the if k == 0 condition.
Could someone help me understand this?
go version go1.18 linux/amd64
What you see is the result of compiler optimization. The compiler is clever enough to make the applications faster by inlining function calls. Sometimes, the compiler optimizes to remove function calls and artificially lowers the run time of benchmarks (that can be tricky).
After profiling, we can notice the function rotateRight1 is not even being called during the benchmark execution:
(pprof) list BenchmarkRotateRight1
Total: 260ms
ROUTINE ======================== main_test.BenchmarkRotateRight1 in (edited)
260ms 260ms (flat, cum) 100% of Total
. . 43:func BenchmarkRotateRight1(b *testing.B) {
. . 44: nums := make([]int, 5_000)
. . 45:
. . 46: b.ResetTimer()
. . 47: b.ReportAllocs()
260ms 260ms 48: for i := 0; i < b.N; i++ {
. . 49: rotateRight1(nums, 0)
. . 50: }
. . 51:}
On the other hand, rotateRight2 is being called, and that's why you see that difference in the run time of the benchmarks:
(pprof) list BenchmarkRotateRight2
Total: 1.89s
ROUTINE ======================== main_test.BenchmarkRotateRight2 in (edited)
180ms 1.89s (flat, cum) 100% of Total
. . 53:func BenchmarkRotateRight2(b *testing.B) {
. . 54: nums := make([]int, 5_000)
. . 55:
. . 56: b.ResetTimer()
. . 57: b.ReportAllocs()
130ms 130ms 58: for i := 0; i < b.N; i++ {
50ms 1.76s 59: rotateRight2(nums, 0)
. . 60: }
. . 61:}
If you run your benchmark with -gcflags '-m -m' (double -m or -m=2), you will see some of the compiler decisions to optimize the code (reference):
$ go test -gcflags '-m -m' -benchmem -bench . main_test.go
# command-line-arguments_test [command-line-arguments.test]
./main_test.go:7:6: can inline rotateRight1 with cost 40 as: func([]int, int) { n = n % len(nums); if n == 0 { return }; lastNDigits := make([]int, n); copy(lastNDigits, nums[len(nums) - n:]); copy(nums[n:], nums[:len(nums) - n]); copy(nums[:n], lastNDigits) }
./main_test.go:20:6: cannot inline rotateRight2: function too complex: cost 83 exceeds budget 80
...
So, based on a complexity threshold, the compiler decides whether to optimize or not.
You can use the//go:noinline directive right before the function you want to disable inlining optimization though. That will override the compiler's usual optimization rules:
//go:noinline
func rotateRight1(nums []int, n int) {
...
}
Now you will notice the benchmarks are very similar:
cpu: Intel(R) Core(TM) i9-9880H CPU # 2.30GHz
BenchmarkRotateRight1
BenchmarkRotateRight1-16 135554571 8.886 ns/op 0 B/op 0 allocs/op
BenchmarkRotateRight2
BenchmarkRotateRight2-16 143716638 8.775 ns/op 0 B/op 0 allocs/op
PASS

golang global variable access is slow in benchmark test

Here is a simple golang benchmark test, it runs x++ in three different ways:
in a simple for loop with x declared inside function
in a nested loop with x declared inside function
in a nested loop with x declared as global variable
package main
import (
"testing"
)
var x = 0
func BenchmarkLoop(b *testing.B) {
x := 0
for n := 0; n < b.N; n++ {
x++
}
}
func BenchmarkDoubleLoop(b *testing.B) {
x := 0
for n := 0; n < b.N/1000; n++ {
for m := 0; m < 1000; m++ {
x++
}
}
}
func BenchmarkDoubleLoopGlobalVariable(b *testing.B) {
for n := 0; n < b.N/1000; n++ {
for m := 0; m < 1000; m++ {
x++
}
}
}
And the result is as following:
$ go test -bench=.
BenchmarkLoop-8 2000000000 0.32 ns/op
BenchmarkDoubleLoop-8 2000000000 0.34 ns/op
BenchmarkDoubleLoopGlobalVariable-8 2000000000 2.00 ns/op
PASS
ok github.com/cizixs/playground/loop-perf 5.597s
Obviously, the first and second methods have similar performance, while the third function is much slower(about 6x times slow).
And I wonder why this is happening, is there a way to improve performance of global variable access?
I wonder why this is happening.
The compiler optimizes away your whole code. 300ps per op means a only a noop was "executed".

Interpretting benchmarks of preallocating a slice

I've been trying to understand slice preallocation with make and why it's a good idea. I noticed a large performance difference between preallocating a slice and appending to it vs just initializing it with 0 length/capacity and then appending to it. I wrote a set of very simple benchmarks:
import "testing"
func BenchmarkNoPreallocate(b *testing.B) {
for i := 0; i < b.N; i++ {
// Don't preallocate our initial slice
init := []int64{}
init = append(init, 5)
}
}
func BenchmarkPreallocate(b *testing.B) {
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, 0, 1)
init = append(init, 5)
}
}
and was a little puzzled with the results:
$ go test -bench=. -benchmem
goos: linux
goarch: amd64
BenchmarkNoPreallocate-4 30000000 41.8 ns/op 8 B/op 1 allocs/op
BenchmarkPreallocate-4 2000000000 0.29 ns/op 0 B/op 0 allocs/op
I have a couple of questions:
Why are there no allocations (it shows 0 allocs/op) in the preallocation benchmark case? Certainly we're preallocating, but the allocation had to have happened at some point.
I imagine this may become clearer after the first question is answered, but how is the preallocation case so much quicker? Am I misinterpetting this benchmark?
Please let me know if anything is unclear. Thank you!
Go has an optimizing compiler. Constants are evaluated at compile time. Variables are evaluated at runtime. Constant values can be used to optimize compiler generated code. For example,
package main
import "testing"
func BenchmarkNoPreallocate(b *testing.B) {
for i := 0; i < b.N; i++ {
// Don't preallocate our initial slice
init := []int64{}
init = append(init, 5)
}
}
func BenchmarkPreallocateConst(b *testing.B) {
const (
l = 0
c = 1
)
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, l, c)
init = append(init, 5)
}
}
func BenchmarkPreallocateVar(b *testing.B) {
var (
l = 0
c = 1
)
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, l, c)
init = append(init, 5)
}
}
Output:
$ go test alloc_test.go -bench=. -benchmem
BenchmarkNoPreallocate-4 50000000 39.3 ns/op 8 B/op 1 allocs/op
BenchmarkPreallocateConst-4 2000000000 0.36 ns/op 0 B/op 0 allocs/op
BenchmarkPreallocateVar-4 50000000 28.2 ns/op 8 B/op 1 allocs/op
Another interesting set of benchmarks:
package main
import "testing"
func BenchmarkNoPreallocate(b *testing.B) {
const (
l = 0
c = 8 * 1024
)
for i := 0; i < b.N; i++ {
// Don't preallocate our initial slice
init := []int64{}
for j := 0; j < c; j++ {
init = append(init, 42)
}
}
}
func BenchmarkPreallocateConst(b *testing.B) {
const (
l = 0
c = 8 * 1024
)
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, l, c)
for j := 0; j < cap(init); j++ {
init = append(init, 42)
}
}
}
func BenchmarkPreallocateVar(b *testing.B) {
var (
l = 0
c = 8 * 1024
)
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, l, c)
for j := 0; j < cap(init); j++ {
init = append(init, 42)
}
}
}
Output:
$ go test peter_test.go -bench=. -benchmem
BenchmarkNoPreallocate-4 20000 75656 ns/op 287992 B/op 19 allocs/op
BenchmarkPreallocateConst-4 100000 22386 ns/op 65536 B/op 1 allocs/op
BenchmarkPreallocateVar-4 100000 22112 ns/op 65536 B/op 1 allocs/op

Benchmarking bad results

so I have the Quicksort algorithm implemented with concurrency (the one without as well). Now I wanted to compare the times. I wrote this:
func benchmarkConcurrentQuickSort(size int, b *testing.B) {
A := RandomArray(size)
var wg sync.WaitGroup
b.ResetTimer()
ConcurrentQuicksort(A, 0, len(A)-1, &wg)
wg.Wait()
}
func BenchmarkConcurrentQuickSort500(b *testing.B) {
benchmarkConcurrentQuickSort(500, b)
}
func BenchmarkConcurrentQuickSort1000(b *testing.B) {
benchmarkConcurrentQuickSort(1000, b)
}
func BenchmarkConcurrentQuickSort5000(b *testing.B) {
benchmarkConcurrentQuickSort(5000, b)
}
func BenchmarkConcurrentQuickSort10000(b *testing.B) {
benchmarkConcurrentQuickSort(10000, b)
}
func BenchmarkConcurrentQuickSort20000(b *testing.B) {
benchmarkConcurrentQuickSort(20000, b)
}
func BenchmarkConcurrentQuickSort1000000(b *testing.B) {
benchmarkConcurrentQuickSort(1000000, b)
}
The results are like this:
C:\projects\go\src\github.com\frynio\mysort>go test -bench=.
BenchmarkConcurrentQuickSort500-4 2000000000 0.00 ns/op
BenchmarkConcurrentQuickSort1000-4 2000000000 0.00 ns/op
BenchmarkConcurrentQuickSort5000-4 2000000000 0.00 ns/op
BenchmarkConcurrentQuickSort10000-4 2000000000 0.00 ns/op
BenchmarkConcurrentQuickSort20000-4 2000000000 0.00 ns/op
BenchmarkConcurrentQuickSort1000000-4 30 49635266 ns/op
PASS
ok github.com/frynio/mysort 8.342s
I can believe the last one, but I definitely think that sorting 500-element array takes longer than 1ns. What am i doing wrong? I am pretty sure that RandomArray returns array of wanted size, as we can see in the last benchmark. Why does it print out the 0.00 ns?
func RandomArray(n int) []int {
a := []int{}
for i := 0; i < n; i++ {
a = append(a, rand.Intn(500))
}
return a
}
// ConcurrentPartition - ConcurrentQuicksort function for partitioning the array (randomized choice of a pivot)
func ConcurrentPartition(A []int, p int, r int) int {
index := rand.Intn(r-p) + p
pivot := A[index]
A[index] = A[r]
A[r] = pivot
x := A[r]
j := p - 1
i := p
for i < r {
if A[i] <= x {
j++
tmp := A[j]
A[j] = A[i]
A[i] = tmp
}
i++
}
temp := A[j+1]
A[j+1] = A[r]
A[r] = temp
return j + 1
}
// ConcurrentQuicksort - a concurrent version of a quicksort algorithm
func ConcurrentQuicksort(A []int, p int, r int, wg *sync.WaitGroup) {
if p < r {
q := ConcurrentPartition(A, p, r)
wg.Add(2)
go func() {
ConcurrentQuicksort(A, p, q-1, wg)
wg.Done()
}()
go func() {
ConcurrentQuicksort(A, q+1, r, wg)
wg.Done()
}()
}
}
Package testing
A sample benchmark function looks like this:
func BenchmarkHello(b *testing.B) {
for i := 0; i < b.N; i++ {
fmt.Sprintf("hello")
}
}
The benchmark function must run the target code b.N times. During
benchmark execution, b.N is adjusted until the benchmark function
lasts long enough to be timed reliably.
I don't see a benchmark loop in your code. Try
func benchmarkConcurrentQuickSort(size int, b *testing.B) {
A := RandomArray(size)
var wg sync.WaitGroup
b.ResetTimer()
for i := 0; i < b.N; i++ {
ConcurrentQuicksort(A, 0, len(A)-1, &wg)
wg.Wait()
}
}
Output:
BenchmarkConcurrentQuickSort500-4 10000 122291 ns/op
BenchmarkConcurrentQuickSort1000-4 5000 221154 ns/op
BenchmarkConcurrentQuickSort5000-4 1000 1225230 ns/op
BenchmarkConcurrentQuickSort10000-4 500 2568024 ns/op
BenchmarkConcurrentQuickSort20000-4 300 5808130 ns/op
BenchmarkConcurrentQuickSort1000000-4 1 1371751710 ns/op

Concat byte arrays

Can someone please point at a more efficient version of the following
b:=make([]byte,0,sizeTotal)
b=append(b,size...)
b=append(b,contentType...)
b=append(b,lenCallbackid...)
b=append(b,lenTarget...)
b=append(b,lenAction...)
b=append(b,lenContent...)
b=append(b,callbackid...)
b=append(b,target...)
b=append(b,action...)
b=append(b,content...)
every variable is a byte slice apart from size sizeTotal
Update:
Code:
type Message struct {
size uint32
contentType uint8
callbackId string
target string
action string
content string
}
var res []byte
var b []byte = make([]byte,0,4096)
func (m *Message)ToByte()[]byte{
callbackIdIntLen:=len(m.callbackId)
targetIntLen := len(m.target)
actionIntLen := len(m.action)
contentIntLen := len(m.content)
lenCallbackid:=make([]byte,4)
binary.LittleEndian.PutUint32(lenCallbackid, uint32(callbackIdIntLen))
callbackid := []byte(m.callbackId)
lenTarget := make([]byte,4)
binary.LittleEndian.PutUint32(lenTarget, uint32(targetIntLen))
target:=[]byte(m.target)
lenAction := make([]byte,4)
binary.LittleEndian.PutUint32(lenAction, uint32(actionIntLen))
action := []byte(m.action)
lenContent:= make([]byte,4)
binary.LittleEndian.PutUint32(lenContent, uint32(contentIntLen))
content := []byte(m.content)
sizeTotal:= 21+callbackIdIntLen+targetIntLen+actionIntLen+contentIntLen
size := make([]byte,4)
binary.LittleEndian.PutUint32(size, uint32(sizeTotal))
b=b[:0]
b=append(b,size...)
b=append(b,byte(m.contentType))
b=append(b,lenCallbackid...)
b=append(b,lenTarget...)
b=append(b,lenAction...)
b=append(b,lenContent...)
b=append(b,callbackid...)
b=append(b,target...)
b=append(b,action...)
b=append(b,content...)
res = b
return b
}
func FromByte(bytes []byte)(*Message){
size :=binary.LittleEndian.Uint32(bytes[0:4])
contentType :=bytes[4:5][0]
lenCallbackid:=binary.LittleEndian.Uint32(bytes[5:9])
lenTarget :=binary.LittleEndian.Uint32(bytes[9:13])
lenAction :=binary.LittleEndian.Uint32(bytes[13:17])
lenContent :=binary.LittleEndian.Uint32(bytes[17:21])
callbackid := string(bytes[21:21+lenCallbackid])
target:= string(bytes[21+lenCallbackid:21+lenCallbackid+lenTarget])
action:= string(bytes[21+lenCallbackid+lenTarget:21+lenCallbackid+lenTarget+lenAction])
content:=string(bytes[size-lenContent:size])
return &Message{size,contentType,callbackid,target,action,content}
}
Benchs:
func BenchmarkMessageToByte(b *testing.B) {
m:=NewMessage(uint8(3),"agsdggsdasagdsdgsgddggds","sometarSFAFFget","somFSAFSAFFSeaction","somfasfsasfafsejsonzhit")
for n := 0; n < b.N; n++ {
m.ToByte()
}
}
func BenchmarkMessageFromByte(b *testing.B) {
m:=NewMessage(uint8(1),"sagdsgaasdg","soSASFASFASAFSFASFAGmetarget","adsgdgsagdssgdsgd","agsdsdgsagdsdgasdg").ToByte()
for n := 0; n < b.N; n++ {
FromByte(m)
}
}
func BenchmarkStringToByte(b *testing.B) {
for n := 0; n < b.N; n++ {
_ = []byte("abcdefghijklmnoqrstuvwxyz")
}
}
func BenchmarkStringFromByte(b *testing.B) {
s:=[]byte("abcdefghijklmnoqrstuvwxyz")
for n := 0; n < b.N; n++ {
_ = string(s)
}
}
func BenchmarkUintToByte(b *testing.B) {
for n := 0; n < b.N; n++ {
i:=make([]byte,4)
binary.LittleEndian.PutUint32(i, uint32(99))
}
}
func BenchmarkUintFromByte(b *testing.B) {
i:=make([]byte,4)
binary.LittleEndian.PutUint32(i, uint32(99))
for n := 0; n < b.N; n++ {
binary.LittleEndian.Uint32(i)
}
}
Bench results:
BenchmarkMessageToByte 10000000 280 ns/op
BenchmarkMessageFromByte 10000000 293 ns/op
BenchmarkStringToByte 50000000 55.1 ns/op
BenchmarkStringFromByte 50000000 49.7 ns/op
BenchmarkUintToByte 1000000000 2.14 ns/op
BenchmarkUintFromByte 2000000000 1.71 ns/op
Provided memory is already allocated, a sequence of x=append(x,a...) is rather efficient in Go.
In your example, the initial allocation (make) probably costs more than the sequence of appends. It depends on the size of the fields. Consider the following benchmark:
package main
import (
"testing"
)
const sizeTotal = 25
var res []byte // To enforce heap allocation
func BenchmarkWithAlloc(b *testing.B) {
a := []byte("abcde")
for i := 0; i < b.N; i++ {
x := make([]byte, 0, sizeTotal)
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
res = x // Make sure x escapes, and is therefore heap allocated
}
}
func BenchmarkWithoutAlloc(b *testing.B) {
a := []byte("abcde")
x := make([]byte, 0, sizeTotal)
for i := 0; i < b.N; i++ {
x = x[:0]
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
res = x
}
}
On my box, the result is:
testing: warning: no tests to run
PASS
BenchmarkWithAlloc 10000000 116 ns/op 32 B/op 1 allocs/op
BenchmarkWithoutAlloc 50000000 24.0 ns/op 0 B/op 0 allocs/op
Systematically reallocating the buffer (even a small one) makes this benchmark at least 5 times slower.
So your best hope to optimize this code it to make sure you do not reallocate a buffer for each packet you build. On the contrary, you should keep your buffer, and reuse it for each marshalling operation.
You can reset a slice while keeping its underlying buffer allocated with the following statement:
x = x[:0]
I looked carefully at that and made the following benchmarks.
package append
import "testing"
func BenchmarkAppend(b *testing.B) {
as := 1000
a := make([]byte, as)
s := make([]byte, 0, b.N*as)
for i := 0; i < b.N; i++ {
s = append(s, a...)
}
}
func BenchmarkCopy(b *testing.B) {
as := 1000
a := make([]byte, as)
s := make([]byte, 0, b.N*as)
for i := 0; i < b.N; i++ {
copy(s[i*as:(i+1)*as], a)
}
}
The results are
grzesiek#klapacjusz ~/g/s/t/append> go test -bench . -benchmem
testing: warning: no tests to run
PASS
BenchmarkAppend 10000000 202 ns/op 1000 B/op 0 allocs/op
BenchmarkCopy 10000000 201 ns/op 1000 B/op 0 allocs/op
ok test/append 4.564s
If the totalSize is big enough then your code makes no memory allocations. It copies only the amount of bytes it needs to copy. It is perfectly fine.

Resources