Interpreting benchmarks of preallocating a slice - go

I've been trying to understand slice preallocation with make and why it's a good idea. I noticed a large performance difference between preallocating a slice and appending to it vs just initializing it with 0 length/capacity and then appending to it. I wrote a set of very simple benchmarks:
package main
import "testing"
func BenchmarkNoPreallocate(b *testing.B) {
for i := 0; i < b.N; i++ {
// Don't preallocate our initial slice
init := []int64{}
init = append(init, 5)
}
}
func BenchmarkPreallocate(b *testing.B) {
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, 0, 1)
init = append(init, 5)
}
}
and was a little puzzled with the results:
$ go test -bench=. -benchmem
goos: linux
goarch: amd64
BenchmarkNoPreallocate-4 30000000 41.8 ns/op 8 B/op 1 allocs/op
BenchmarkPreallocate-4 2000000000 0.29 ns/op 0 B/op 0 allocs/op
I have a couple of questions:
Why are there no allocations (it shows 0 allocs/op) in the preallocation benchmark case? Certainly we're preallocating, but the allocation had to have happened at some point.
I imagine this may become clearer after the first question is answered, but how is the preallocation case so much quicker? Am I misinterpreting this benchmark?
Please let me know if anything is unclear. Thank you!

Go has an optimizing compiler. Constants are evaluated at compile time; variables are evaluated at runtime. Constant values let the compiler optimize the code it generates: when the length and capacity passed to make are constants and the compiler can prove the slice never escapes the loop body, it can place the backing array on the stack or eliminate the allocation entirely, which is why the preallocated benchmark reports 0 allocs/op and a sub-nanosecond time. For example,
package main
import "testing"
func BenchmarkNoPreallocate(b *testing.B) {
for i := 0; i < b.N; i++ {
// Don't preallocate our initial slice
init := []int64{}
init = append(init, 5)
}
}
func BenchmarkPreallocateConst(b *testing.B) {
const (
l = 0
c = 1
)
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, l, c)
init = append(init, 5)
}
}
func BenchmarkPreallocateVar(b *testing.B) {
var (
l = 0
c = 1
)
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, l, c)
init = append(init, 5)
}
}
Output:
$ go test alloc_test.go -bench=. -benchmem
BenchmarkNoPreallocate-4 50000000 39.3 ns/op 8 B/op 1 allocs/op
BenchmarkPreallocateConst-4 2000000000 0.36 ns/op 0 B/op 0 allocs/op
BenchmarkPreallocateVar-4 50000000 28.2 ns/op 8 B/op 1 allocs/op
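Part of the answer to the first question is that, with constants, the compiler can also prove the slice never leaves the loop body, so the allocation is elided altogether. Here is a minimal sketch of my own (the sink variable is not in the original benchmarks): forcing the constant-capacity slice to escape brings the allocation back.
package main

import "testing"

var sink []int64 // package-level sink; assigning to it forces the slice to escape

func BenchmarkPreallocateEscaping(b *testing.B) {
	for i := 0; i < b.N; i++ {
		init := make([]int64, 0, 1) // constant length and capacity
		init = append(init, 5)
		sink = init // escapes to the heap, so expect roughly 1 alloc/op and 8 B/op again
	}
}
You can also ask the compiler directly what it decided with go test -gcflags='-m' and look for the "does not escape" / "escapes to heap" lines for the make call.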
Another interesting set of benchmarks:
package main
import "testing"
func BenchmarkNoPreallocate(b *testing.B) {
const (
l = 0
c = 8 * 1024
)
for i := 0; i < b.N; i++ {
// Don't preallocate our initial slice
init := []int64{}
for j := 0; j < c; j++ {
init = append(init, 42)
}
}
}
func BenchmarkPreallocateConst(b *testing.B) {
const (
l = 0
c = 8 * 1024
)
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, l, c)
for j := 0; j < cap(init); j++ {
init = append(init, 42)
}
}
}
func BenchmarkPreallocateVar(b *testing.B) {
var (
l = 0
c = 8 * 1024
)
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, l, c)
for j := 0; j < cap(init); j++ {
init = append(init, 42)
}
}
}
Output:
$ go test peter_test.go -bench=. -benchmem
BenchmarkNoPreallocate-4 20000 75656 ns/op 287992 B/op 19 allocs/op
BenchmarkPreallocateConst-4 100000 22386 ns/op 65536 B/op 1 allocs/op
BenchmarkPreallocateVar-4 100000 22112 ns/op 65536 B/op 1 allocs/op
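The 19 allocs/op and ~288 KB/op in the no-preallocate case come from append repeatedly growing the backing array and copying the old contents each time. A small illustrative program of my own (not from the answer; the exact growth steps depend on the Go version) makes that growth visible:
package main

import "fmt"

func main() {
	var s []int64
	prevCap := cap(s)
	for i := 0; i < 8*1024; i++ {
		s = append(s, 42)
		if cap(s) != prevCap {
			// Each capacity change is a fresh allocation plus a copy of the
			// existing elements; reserving 8*1024 up front avoids all of them.
			fmt.Printf("len=%d cap=%d\n", len(s), cap(s))
			prevCap = cap(s)
		}
	}
}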

Related

go maps non-performant for large number of keys

I discovered very strange behaviour with go maps recently. The use case is to create a group of integers and have O(1) check for IsMember(id int).
The current implementation is:
func convertToMap(v []int64) map[int64]void {
out := make(map[int64]void, len(v))
for _, i := range v {
out[i] = void{}
}
return out
}
type Group struct {
members map[int64]void
}
type void struct{}
func (g *Group) IsMember(input string) (ok bool) {
memberID, _ := strconv.ParseInt(input, 10, 64)
_, ok = g.members[memberID]
return
}
When I benchmark the IsMember method, everything looks fine up to 6 million members. But above that, each map lookup is taking 1 second!
The benchmark test:
func BenchmarkIsMember(b *testing.B) {
b.ReportAllocs()
b.ResetTimer()
g := &Group{}
g.members = convertToMap(benchmarkV)
for N := 0; N < b.N && N < sizeOfGroup; N++ {
g.IsMember(benchmarkKVString[N])
}
}
var benchmarkV, benchmarkKVString = func(size int) ([]int64, []string) {
v := make([]int64, size)
s := make([]string, size)
for i := range v {
val := rand.Int63()
v[i] = val
s[i] = strconv.FormatInt(val, 10)
}
return v, s
}(sizeOfGroup)
Benchmark numbers:
const sizeOfGroup = 6000000
BenchmarkIsMember-8 2000000 568 ns/op 50 B/op 0 allocs/op
const sizeOfGroup = 6830000
BenchmarkIsMember-8 1 1051725455 ns/op 178767208 B/op 25 allocs/op
Anything above group size of 6.8 million gives the same result.
Can someone help me understand why this is happening, and can anything be done to make this performant while still using maps?
Also, I don't understand why so much memory is being allocated. Even if the time taken is due to collisions and then linked-list traversal, there shouldn't be any memory allocation; is my thought process wrong?
There is no need to measure the extra allocation of converting the slice to a map, because we only want to measure the lookup operation. Note that the original benchmark calls b.ResetTimer() before convertToMap, so the map construction runs inside the timed region; once the group is large enough that building the map alone takes about a second, b.N stays at 1 and the reported time and allocations are dominated by the construction rather than by the lookups.
I've slightly modified the benchmark:
func BenchmarkIsMember(b *testing.B) {
fn := func(size int) ([]int64, []string) {
v := make([]int64, size)
s := make([]string, size)
for i := range v {
val := rand.Int63()
v[i] = val
s[i] = strconv.FormatInt(val, 10)
}
return v, s
}
for _, size := range []int{
6000000,
6800000,
6830000,
60000000,
} {
b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
var benchmarkV, benchmarkKVString = fn(size)
g := &Group{}
g.members = convertToMap(benchmarkV)
b.ReportAllocs()
b.ResetTimer()
for N := 0; N < b.N && N < size; N++ {
g.IsMember(benchmarkKVString[N])
}
})
}
}
And got the following results:
go test ./... -bench=. -benchtime=10s -cpu=1
goos: linux
goarch: amd64
pkg: trash
BenchmarkIsMember/size=6000000 2000000000 0.55 ns/op 0 B/op 0 allocs/op
BenchmarkIsMember/size=6800000 1000000000 1.27 ns/op 0 B/op 0 allocs/op
BenchmarkIsMember/size=6830000 1000000000 1.23 ns/op 0 B/op 0 allocs/op
BenchmarkIsMember/size=60000000 100000000 136 ns/op 0 B/op 0 allocs/op
PASS
ok trash 167.578s
The degradation isn't as significant as in your example.

Using interface{} in slices seems to incur a 40x slowdown. Is there a way of bypassing this when implementing `interface{}` based data structures?

I am currently trying to implement a tree-based data structure in Go and I am seeing disappointing results in my benchmarking. Because I am trying to be generic as to what values I accept, I am limited to using interface{}.
The code in question is an immutable vector trie. Essentially, any time a value in the vector is modified I need to make a copy of several nodes in the trie. Each of these nodes is implemented as a slice of const (known at compile time) length. For example, writing a value into a large trie will require the copying of 5 separate 32-long slices. They must be copies to preserve the immutability of the previous contents.
I believe the disappointing benchmark results are because I am storing my data as interface{} in slices, which get created, copied and appended to often. To measure this I set up the following benchmark
package main
import (
"math/rand"
"testing"
)
func BenchmarkMake10M(b *testing.B) {
for ii := 0; ii < b.N; ii++ {
_ = make([]int, 10e6, 10e6)
}
}
func BenchmarkMakePtr10M(b *testing.B) {
for ii := 0; ii < b.N; ii++ {
_ = make([]*int, 10e6, 10e6)
}
}
func BenchmarkMakeInterface10M(b *testing.B) {
for ii := 0; ii < b.N; ii++ {
_ = make([]interface{}, 10e6, 10e6)
}
}
func BenchmarkMakeInterfacePtr10M(b *testing.B) {
for ii := 0; ii < b.N; ii++ {
_ = make([]interface{}, 10e6, 10e6)
}
}
func BenchmarkAppend10M(b *testing.B) {
for ii := 0; ii < b.N; ii++ {
slc := make([]int, 0, 0)
for jj := 0; jj < 10e6; jj++ {
slc = append(slc, jj)
}
}
}
func BenchmarkAppendPtr10M(b *testing.B) {
for ii := 0; ii < b.N; ii++ {
slc := make([]*int, 0, 0)
for jj := 0; jj < 10e6; jj++ {
slc = append(slc, &jj)
}
}
}
func BenchmarkAppendInterface10M(b *testing.B) {
for ii := 0; ii < b.N; ii++ {
slc := make([]interface{}, 0, 0)
for jj := 0; jj < 10e6; jj++ {
slc = append(slc, jj)
}
}
}
func BenchmarkAppendInterfacePtr10M(b *testing.B) {
for ii := 0; ii < b.N; ii++ {
slc := make([]interface{}, 0, 0)
for jj := 0; jj < 10e6; jj++ {
slc = append(slc, &jj)
}
}
}
func BenchmarkSet(b *testing.B) {
slc := make([]int, 10e6, 10e6)
b.ResetTimer()
for ii := 0; ii < b.N; ii++ {
slc[rand.Intn(10e6-1)] = 1
}
}
func BenchmarkSetPtr(b *testing.B) {
slc := make([]*int, 10e6, 10e6)
b.ResetTimer()
for ii := 0; ii < b.N; ii++ {
theInt := 1
slc[rand.Intn(10e6-1)] = &theInt
}
}
func BenchmarkSetInterface(b *testing.B) {
slc := make([]interface{}, 10e6, 10e6)
b.ResetTimer()
for ii := 0; ii < b.N; ii++ {
slc[rand.Intn(10e6-1)] = 1
}
}
func BenchmarkSetInterfacePtr(b *testing.B) {
slc := make([]interface{}, 10e6, 10e6)
b.ResetTimer()
for ii := 0; ii < b.N; ii++ {
theInt := 1
slc[rand.Intn(10e6-1)] = &theInt
}
}
which gives the following result
BenchmarkMake10M-4 300 4962381 ns/op
BenchmarkMakePtr10M-4 100 10255522 ns/op
BenchmarkMakeInterface10M-4 100 19788588 ns/op
BenchmarkMakeInterfacePtr10M-4 100 19850682 ns/op
BenchmarkAppend10M-4 20 67090711 ns/op
BenchmarkAppendPtr10M-4 1 2784300818 ns/op
BenchmarkAppendInterface10M-4 1 3457503833 ns/op
BenchmarkAppendInterfacePtr10M-4 1 3532502711 ns/op
BenchmarkSet-4 30000000 43.5 ns/op
BenchmarkSetPtr-4 20000000 91.2 ns/op
BenchmarkSetInterface-4 30000000 43.5 ns/op
BenchmarkSetInterfacePtr-4 20000000 70.9 ns/op
Here the difference on Set and Make seems to be about 2-4x, but the difference on Append is about 40x.
From what I understand, the performance hit is because interfaces are implemented behind the scenes as pointers, and the values they point to must be allocated on the heap. That still doesn't explain why the gap for Append is so much larger than for Set or Make.
Is there a way in the current version of Go, without using a code generation tool (e.g., a generics tool that lets the consumer of the library generate a version of it that stores FooType), to work around this 40x performance hit? Alternatively, have I made some error in my benchmarking?
Let's profile the test with memory benchmarks.
go test -bench . -cpuprofile cpu.prof -benchmem
goos: linux
goarch: amd64
BenchmarkMake10M-8 100 10254248 ns/op 80003282 B/op 1 allocs/op
BenchmarkMakePtr10M-8 100 18696295 ns/op 80003134 B/op 1 allocs/op
BenchmarkMakeInterface10M-8 50 34501361 ns/op 160006147 B/op 1 allocs/op
BenchmarkMakeInterfacePtr10M-8 50 35129085 ns/op 160006652 B/op 1 allocs/op
BenchmarkAppend10M-8 20 69971722 ns/op 423503264 B/op 50 allocs/op
BenchmarkAppendPtr10M-8 1 2135090501 ns/op 423531096 B/op 62 allocs/op
BenchmarkAppendInterface10M-8 1 1833396620 ns/op 907567984 B/op 10000060 allocs/op
BenchmarkAppendInterfacePtr10M-8 1 2270970241 ns/op 827546240 B/op 53 allocs/op
BenchmarkSet-8 30000000 54.0 ns/op 0 B/op 0 allocs/op
BenchmarkSetPtr-8 20000000 91.6 ns/op 8 B/op 1 allocs/op
BenchmarkSetInterface-8 30000000 58.0 ns/op 0 B/op 0 allocs/op
BenchmarkSetInterfacePtr-8 20000000 88.0 ns/op 8 B/op 1 allocs/op
PASS
ok _/home/grzesiek/test 22.427s
We can see that the slowest benchmarks are the ones that make allocations.
PPROF_BINARY_PATH=. go tool pprof -disasm BenchmarkAppend cpu.prof
Total: 29.75s
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppend10M
210m 1.51s (flat, cum) 5.08% of Total
. 1.30s 4e827a: CALL runtime.growslice(SB) ;_/home/grzesiek/test.BenchmarkAppend10M test_test.go:35
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendInterface10M
20m 930ms (flat, cum) 3.13% of Total
. 630ms 4e8519: CALL runtime.growslice(SB) ;_/home/grzesiek/test.BenchmarkAppendInterface10M test_test.go:53
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendInterfacePtr10M
0 800ms (flat, cum) 2.69% of Total
. 770ms 4e8625: CALL runtime.growslice(SB) ;_/home/grzesiek/test.BenchmarkAppendInterfacePtr10M test_test.go:62
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendPtr10M
0 950ms (flat, cum) 3.19% of Total
. 870ms 4e8374: CALL runtime.growslice(SB) ;_/home/grzesiek/test.BenchmarkAppendPtr10M test_test.go:44
By analyzing the number of bytes allocated, we can see that using interface{} doubles the allocation size.
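The doubling is consistent with the size of an interface value itself: it is two words (a type pointer and a data pointer), whereas an int is one word. A quick check of my own:
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	var i int
	var e interface{}
	// On amd64 this prints "8 16": a []interface{} backing array is therefore
	// twice the size of a []int one for the same number of elements.
	fmt.Println(unsafe.Sizeof(i), unsafe.Sizeof(e))
}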
Why is BenchmarkAppend10M so much faster than the other BenchmarkAppend* benchmarks?
To figure this out, we need to see the escape analysis.
go test -gcflags '-m -l' original_test.go
./original_test.go:31:28: BenchmarkAppend10M b does not escape
./original_test.go:33:14: BenchmarkAppend10M make([]int, 0, 0) does not escape
./original_test.go:40:31: BenchmarkAppendPtr10M b does not escape
./original_test.go:42:14: BenchmarkAppendPtr10M make([]*int, 0, 0) does not escape
./original_test.go:43:7: moved to heap: jj
./original_test.go:44:22: &jj escapes to heap
./original_test.go:49:37: BenchmarkAppendInterface10M b does not escape
./original_test.go:51:14: BenchmarkAppendInterface10M make([]interface {}, 0, 0) does not escape
./original_test.go:53:16: jj escapes to heap
./original_test.go:58:40: BenchmarkAppendInterfacePtr10M b does not escape
./original_test.go:60:14: BenchmarkAppendInterfacePtr10M make([]interface {}, 0, 0) does not escape
./original_test.go:61:7: moved to heap: jj
./original_test.go:62:16: &jj escapes to heap
./original_test.go:62:22: &jj escapes to heap
We can see that it is the only benchmark in which jj does not escape to the heap. We can deduce that the heap allocation of jj on every iteration is what causes the slowdown.
Why does BenchmarkAppendInterface10M make so many allocations?
In the assembly, we can see that it is the only one that calls the runtime.convT2E64 function.
PPROF_BINARY_PATH=. go tool pprof -disasm BenchmarkAppend cpu.prof
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendInterface10M
30ms 1.10s (flat, cum) 3.35% of Total
. 260ms 4e8490: CALL runtime.convT2E64(SB)
The source code from runtime/iface.go looks like this:
func convT2E64(t *_type, elem unsafe.Pointer) (e eface) {
if raceenabled {
raceReadObjectPC(t, elem, getcallerpc(), funcPC(convT2E64))
}
if msanenabled {
msanread(elem, t.size)
}
var x unsafe.Pointer
if *(*uint64)(elem) == 0 {
x = unsafe.Pointer(&zeroVal[0])
} else {
x = mallocgc(8, t, false)
*(*uint64)(x) = *(*uint64)(elem)
}
e._type = t
e.data = x
return
}
As we can see, it makes the allocation by calling the mallocgc function.
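To see that boxing cost in isolation, here is a minimal benchmark of my own (not from the original answer). Each store of a non-constant int64 into an interface{} goes through convT2E64 (convT64 in newer runtimes) and costs a small heap allocation; exact behaviour varies by Go version, since newer runtimes intern small integer values.
package main

import "testing"

var sinkIface interface{} // package-level sink so the conversion cannot be optimized away

func BenchmarkBoxInt64(b *testing.B) {
	b.ReportAllocs()
	n := int64(1000000) // a runtime value, so the compiler cannot pre-box it statically
	for i := 0; i < b.N; i++ {
		n++
		sinkIface = n // boxed into an interface: typically 1 alloc/op of 8 B
	}
}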
I know it does not directly help fix your code, but I hope it gives you the tools and techniques to analyze and optimize it.

Golang foreach difference

func Benchmark_foreach1(b *testing.B) {
var test map[int]int
test = make(map[int]int)
for i := 0; i < 100000; i++ {
test[i] = 1
}
for i := 0; i < b.N; i++ {
for i, _ := range test {
if test[i] != 1 {
panic("ds")
}
}
}
}
func Benchmark_foreach2(b *testing.B) {
var test map[int]int
test = make(map[int]int)
for i := 0; i < 100000; i++ {
test[i] = 1
}
for i := 0; i < b.N; i++ {
for _, v := range test {
if v != 1 {
panic("heh")
}
}
}
}
run with result as below
goos: linux
goarch: amd64
Benchmark_foreach1-2 500 3172323 ns/op
Benchmark_foreach2-2 1000 1707214 ns/op
why is foreach-2 slow?
I think Benchmark_foreach2-2 is about 2 times faster: it requires 1707214 nanoseconds per operation, while the first one takes 3172323. So the second one is 3172323 / 1707214 = 1.85 times faster.
Reason: the second one doesn't need to fetch the value from the map again; it already has the value in the v variable.
The test[k] expression in BenchmarkForeachK has to look the value up in the map again, so BenchmarkForeachK takes more time than BenchmarkForeachV: 9362945 ns/op versus 4213940 ns/op.
For example,
package main
import "testing"
func testMap() map[int]int {
test := make(map[int]int)
for i := 0; i < 100000; i++ {
test[i] = 1
}
return test
}
func BenchmarkForeachK(b *testing.B) {
test := testMap()
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
for k := range test {
if test[k] != 1 {
panic("eh")
}
}
}
}
func BenchmarkForeachV(b *testing.B) {
test := testMap()
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
for _, v := range test {
if v != 1 {
panic("heh")
}
}
}
}
Output:
$ go test foreach_test.go -bench=.
BenchmarkForeachK-4 200 9362945 ns/op 0 B/op 0 allocs/op
BenchmarkForeachV-4 300 4213940 ns/op 0 B/op 0 allocs/op
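For completeness, when both the key and the value are needed, ranging over both avoids the second lookup entirely. A small sketch of mine, reusing the testMap helper from above:
package main

import "testing"

// BenchmarkForeachKV reads each entry once during iteration,
// so no extra test[k] lookup is required.
func BenchmarkForeachKV(b *testing.B) {
	test := testMap()
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for k, v := range test {
			if v != 1 {
				b.Fatalf("unexpected value %d for key %d", v, k)
			}
		}
	}
}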

Performance of function slice parameter vs global variable?

I've got the following function:
func checkFiles(path string, excludedPatterns []string) {
// ...
}
I'm wondering, since excludedPatterns never changes, should I optimize it by making the var global (and not passing it to the function every time), or does Golang already handle this by passing them as copy-on-write?
Edit: I guess I could pass the slice as a pointer, but I'm still wondering about the copy-on-write behavior (if it exists) and whether, in general, I should worry about passing by value or by pointer.
Judging from the name of your function, performance can't be that critical to even consider moving parameters to global variables just to save the time/space required to pass them as parameters (I/O operations like checking files are much, much slower than calling functions and passing values to them).
Slices in Go are just small descriptors, something like a struct with a pointer to a backing array and two ints: a length and a capacity. No matter how big the backing array is, passing a slice is always efficient, and you shouldn't even consider passing a pointer to it unless you want to modify the slice header, of course.
Parameters in Go are always passed by value, and a copy of the value being passed is made. If you pass a pointer, then the pointer value will be copied and passed. When a slice is passed, the slice value (which is a small descriptor) will be copied and passed - which will point to the same backing array (which will not be copied).
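To put a number on how small that copied descriptor is, here is a quick check of my own:
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	s := make([]string, 1000)
	// Only this header (pointer, length, capacity) is copied when the slice is
	// passed to a function; on amd64 it prints 24, regardless of the 1000 elements.
	fmt.Println(unsafe.Sizeof(s))
}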
Also, if you need to access the slice multiple times in the function, a parameter is usually an extra gain, as the compiler can do further optimization and caching; if it is a global variable, more care has to be taken.
More about slices and their internals: Go Slices: usage and internals
And if you want exact numbers on performance, benchmark!
Here is a little benchmark which shows no difference between the two solutions (passing the slice as an argument or accessing a global slice). Save it into a file like slices_test.go and run it with go test -bench .
package main
import (
"testing"
)
var gslice = make([]string, 1000)
func global(s string) {
for i := 0; i < 100; i++ { // Cycle to access slice many times
_ = s
_ = gslice // Access global-slice
}
}
func param(s string, ss []string) {
for i := 0; i < 100; i++ { // Cycle to access slice many times
_ = s
_ = ss // Access parameter-slice
}
}
func BenchmarkParameter(b *testing.B) {
for i := 0; i < b.N; i++ {
param("hi", gslice)
}
}
func BenchmarkGlobal(b *testing.B) {
for i := 0; i < b.N; i++ {
global("hi")
}
}
Example output:
testing: warning: no tests to run
PASS
BenchmarkParameter-2 30000000 55.4 ns/op
BenchmarkGlobal-2 30000000 55.1 ns/op
ok _/V_/workspace/IczaGo/src/play 3.569s
Piggybacking on @icza's excellent answer, there is another way to pass a slice as parameter: a pointer to a slice.
When you need to modify the slice itself (its length or capacity, for example via append), the global variable works, but passing the slice as a parameter does not: you are working with a copy of the slice header, so the caller never sees the change (the elements of the shared backing array can still be modified, though). To mitigate that, one can pass a pointer to the slice.
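A minimal sketch of that difference (my own illustration, with hypothetical helper names):
package main

import "fmt"

// appendByValue receives a copy of the slice header, so the append below
// only changes the copy; the caller never sees the new element.
func appendByValue(s []string, v string) {
	s = append(s, v)
}

// appendByPointer receives a pointer to the caller's slice header,
// so the append is visible to the caller.
func appendByPointer(s *[]string, v string) {
	*s = append(*s, v)
}

func main() {
	s := make([]string, 0, 4)
	appendByValue(s, "a")
	fmt.Println(len(s)) // 0: the caller's header was not updated
	appendByPointer(&s, "b")
	fmt.Println(len(s)) // 1
}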
Interestingly enough, it's actually faster than accessing a global variable:
package main
import (
"testing"
)
var gslice = make([]string, 1000000)
func global(s string) {
for i := 0; i < 100; i++ { // Cycle to access slice many times
_ = s
_ = gslice // Access global-slice
}
}
func param(s string, ss []string) {
for i := 0; i < 100; i++ { // Cycle to access slice many times
_ = s
_ = ss // Access parameter-slice
}
}
func paramPointer(s string, ss *[]string) {
for i := 0; i < 100; i++ { // Cycle to access slice many times
_ = s
_ = ss // Access parameter-slice
}
}
func BenchmarkParameter(b *testing.B) {
for i := 0; i < b.N; i++ {
param("hi", gslice)
}
}
func BenchmarkParameterPointer(b *testing.B) {
for i := 0; i < b.N; i++ {
paramPointer("hi", &gslice)
}
}
func BenchmarkGlobal(b *testing.B) {
for i := 0; i < b.N; i++ {
global("hi")
}
}
Results:
goos: darwin
goarch: amd64
pkg: untitled
BenchmarkParameter-8 24437403 48.2 ns/op
BenchmarkParameterPointer-8 27483115 40.3 ns/op
BenchmarkGlobal-8 25631470 46.0 ns/op
I rewrote the benchmark so you can compare the results.
As you can see, the ParameterPointer bench starts to get ahead after 10 records, which is very interesting.
package slices_bench
import (
"testing"
)
var gslice = make([]string, 1000)
func global(s string) {
for i := 0; i < 100; i++ { // Cycle to access slice many times
_ = s
_ = gslice // Access global-slice
}
}
func param(s string, ss []string) {
for i := 0; i < 100; i++ { // Cycle to access slice many times
_ = s
_ = ss // Access parameter-slice
}
}
func paramPointer(s string, ss *[]string) {
for i := 0; i < 100; i++ { // Cycle to access slice many times
_ = s
_ = ss // Access parameter-slice
}
}
func BenchmarkPerformance(b *testing.B){
fixture := []struct {
desc string
records int
}{
{
desc: "1 record",
records: 1,
},
{
desc: "10 records",
records: 10,
},
{
desc: "100 records",
records: 100,
},
{
desc: "1000 records",
records: 1000,
},
{
desc: "10000 records",
records: 10000,
},
{
desc: "100000 records",
records: 100000,
},
}
tests := []struct {
desc string
fn func(b *testing.B, n int)
}{
{
desc: "ParameterPointer",
fn: func(b *testing.B, n int) {
for j := 0; j < n; j++ {
paramPointer("hi", &gslice)
}
},
},
{
desc: "Parameter",
fn: func(b *testing.B, n int) {
for j := 0; j < n; j++ {
param("hi", gslice)
}
},
},
{
desc: "Global",
fn: func(b *testing.B, n int) {
for j := 0; j < n; j++ {
global("hi")
}
},
},
}
for _, t := range tests {
b.Run(t.desc, func(b *testing.B) {
for _, f := range fixture {
b.Run(f.desc, func(b *testing.B) {
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
t.fn(b, f.records)
}
})
}
})
}
}
Results:
goos: windows
goarch: amd64
pkg: benchs/slices-bench
cpu: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
BenchmarkPerformance
BenchmarkPerformance/ParameterPointer
BenchmarkPerformance/ParameterPointer/1_record
BenchmarkPerformance/ParameterPointer/1_record-16 38661910 31.18 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/ParameterPointer/10_records
BenchmarkPerformance/ParameterPointer/10_records-16 4160023 288.4 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/ParameterPointer/100_records
BenchmarkPerformance/ParameterPointer/100_records-16 445131 2748 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/ParameterPointer/1000_records
BenchmarkPerformance/ParameterPointer/1000_records-16 43876 27380 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/ParameterPointer/10000_records
BenchmarkPerformance/ParameterPointer/10000_records-16 4441 273922 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/ParameterPointer/100000_records
BenchmarkPerformance/ParameterPointer/100000_records-16 439 2739282 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/Parameter
BenchmarkPerformance/Parameter/1_record
BenchmarkPerformance/Parameter/1_record-16 39860619 30.79 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/Parameter/10_records
BenchmarkPerformance/Parameter/10_records-16 4152728 288.6 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/Parameter/100_records
BenchmarkPerformance/Parameter/100_records-16 445634 2757 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/Parameter/1000_records
BenchmarkPerformance/Parameter/1000_records-16 43618 27496 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/Parameter/10000_records
BenchmarkPerformance/Parameter/10000_records-16 4450 273960 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/Parameter/100000_records
BenchmarkPerformance/Parameter/100000_records-16 435 2739053 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/Global
BenchmarkPerformance/Global/1_record
BenchmarkPerformance/Global/1_record-16 38813095 30.97 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/Global/10_records
BenchmarkPerformance/Global/10_records-16 4148433 288.4 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/Global/100_records
BenchmarkPerformance/Global/100_records-16 429274 2758 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/Global/1000_records
BenchmarkPerformance/Global/1000_records-16 43591 27412 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/Global/10000_records
BenchmarkPerformance/Global/10000_records-16 4521 274420 ns/op 0 B/op 0 allocs/op
BenchmarkPerformance/Global/100000_records
BenchmarkPerformance/Global/100000_records-16 436 2751042 ns/op 0 B/op 0 allocs/op

Concat byte arrays

Can someone please point me to a more efficient version of the following?
b:=make([]byte,0,sizeTotal)
b=append(b,size...)
b=append(b,contentType...)
b=append(b,lenCallbackid...)
b=append(b,lenTarget...)
b=append(b,lenAction...)
b=append(b,lenContent...)
b=append(b,callbackid...)
b=append(b,target...)
b=append(b,action...)
b=append(b,content...)
every variable is a byte slice apart from sizeTotal
Update:
Code:
type Message struct {
size uint32
contentType uint8
callbackId string
target string
action string
content string
}
var res []byte
var b []byte = make([]byte,0,4096)
func (m *Message)ToByte()[]byte{
callbackIdIntLen:=len(m.callbackId)
targetIntLen := len(m.target)
actionIntLen := len(m.action)
contentIntLen := len(m.content)
lenCallbackid:=make([]byte,4)
binary.LittleEndian.PutUint32(lenCallbackid, uint32(callbackIdIntLen))
callbackid := []byte(m.callbackId)
lenTarget := make([]byte,4)
binary.LittleEndian.PutUint32(lenTarget, uint32(targetIntLen))
target:=[]byte(m.target)
lenAction := make([]byte,4)
binary.LittleEndian.PutUint32(lenAction, uint32(actionIntLen))
action := []byte(m.action)
lenContent:= make([]byte,4)
binary.LittleEndian.PutUint32(lenContent, uint32(contentIntLen))
content := []byte(m.content)
sizeTotal:= 21+callbackIdIntLen+targetIntLen+actionIntLen+contentIntLen
size := make([]byte,4)
binary.LittleEndian.PutUint32(size, uint32(sizeTotal))
b=b[:0]
b=append(b,size...)
b=append(b,byte(m.contentType))
b=append(b,lenCallbackid...)
b=append(b,lenTarget...)
b=append(b,lenAction...)
b=append(b,lenContent...)
b=append(b,callbackid...)
b=append(b,target...)
b=append(b,action...)
b=append(b,content...)
res = b
return b
}
func FromByte(bytes []byte)(*Message){
size :=binary.LittleEndian.Uint32(bytes[0:4])
contentType :=bytes[4:5][0]
lenCallbackid:=binary.LittleEndian.Uint32(bytes[5:9])
lenTarget :=binary.LittleEndian.Uint32(bytes[9:13])
lenAction :=binary.LittleEndian.Uint32(bytes[13:17])
lenContent :=binary.LittleEndian.Uint32(bytes[17:21])
callbackid := string(bytes[21:21+lenCallbackid])
target:= string(bytes[21+lenCallbackid:21+lenCallbackid+lenTarget])
action:= string(bytes[21+lenCallbackid+lenTarget:21+lenCallbackid+lenTarget+lenAction])
content:=string(bytes[size-lenContent:size])
return &Message{size,contentType,callbackid,target,action,content}
}
Benchs:
func BenchmarkMessageToByte(b *testing.B) {
m:=NewMessage(uint8(3),"agsdggsdasagdsdgsgddggds","sometarSFAFFget","somFSAFSAFFSeaction","somfasfsasfafsejsonzhit")
for n := 0; n < b.N; n++ {
m.ToByte()
}
}
func BenchmarkMessageFromByte(b *testing.B) {
m:=NewMessage(uint8(1),"sagdsgaasdg","soSASFASFASAFSFASFAGmetarget","adsgdgsagdssgdsgd","agsdsdgsagdsdgasdg").ToByte()
for n := 0; n < b.N; n++ {
FromByte(m)
}
}
func BenchmarkStringToByte(b *testing.B) {
for n := 0; n < b.N; n++ {
_ = []byte("abcdefghijklmnoqrstuvwxyz")
}
}
func BenchmarkStringFromByte(b *testing.B) {
s:=[]byte("abcdefghijklmnoqrstuvwxyz")
for n := 0; n < b.N; n++ {
_ = string(s)
}
}
func BenchmarkUintToByte(b *testing.B) {
for n := 0; n < b.N; n++ {
i:=make([]byte,4)
binary.LittleEndian.PutUint32(i, uint32(99))
}
}
func BenchmarkUintFromByte(b *testing.B) {
i:=make([]byte,4)
binary.LittleEndian.PutUint32(i, uint32(99))
for n := 0; n < b.N; n++ {
binary.LittleEndian.Uint32(i)
}
}
Bench results:
BenchmarkMessageToByte 10000000 280 ns/op
BenchmarkMessageFromByte 10000000 293 ns/op
BenchmarkStringToByte 50000000 55.1 ns/op
BenchmarkStringFromByte 50000000 49.7 ns/op
BenchmarkUintToByte 1000000000 2.14 ns/op
BenchmarkUintFromByte 2000000000 1.71 ns/op
Provided the memory is already allocated, a sequence of x = append(x, a...) calls is rather efficient in Go.
In your example, the initial allocation (make) probably costs more than the sequence of appends. It depends on the size of the fields. Consider the following benchmark:
package main
import (
"testing"
)
const sizeTotal = 25
var res []byte // To enforce heap allocation
func BenchmarkWithAlloc(b *testing.B) {
a := []byte("abcde")
for i := 0; i < b.N; i++ {
x := make([]byte, 0, sizeTotal)
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
res = x // Make sure x escapes, and is therefore heap allocated
}
}
func BenchmarkWithoutAlloc(b *testing.B) {
a := []byte("abcde")
x := make([]byte, 0, sizeTotal)
for i := 0; i < b.N; i++ {
x = x[:0]
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
x = append(x, a...)
res = x
}
}
On my box, the result is:
testing: warning: no tests to run
PASS
BenchmarkWithAlloc 10000000 116 ns/op 32 B/op 1 allocs/op
BenchmarkWithoutAlloc 50000000 24.0 ns/op 0 B/op 0 allocs/op
Systematically reallocating the buffer (even a small one) makes this benchmark at least 5 times slower.
So your best hope to optimize this code is to make sure you do not reallocate a buffer for each packet you build. Instead, you should keep your buffer and reuse it for each marshalling operation.
You can reset a slice while keeping its underlying buffer allocated with the following statement:
x = x[:0]
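Applying that advice to the ToByte method above, you could also skip the four temporary length slices and write the length prefixes straight into the reused buffer. A rough sketch of mine (ToByteReuse and appendUint32 are hypothetical names; it assumes the Message type and the encoding/binary import from the question):
// appendUint32 appends v in little-endian order without allocating a temporary slice.
func appendUint32(b []byte, v uint32) []byte {
	var tmp [4]byte
	binary.LittleEndian.PutUint32(tmp[:], v)
	return append(b, tmp[:]...)
}

// ToByteReuse marshals the message into buf, reusing its capacity, and returns the result.
func (m *Message) ToByteReuse(buf []byte) []byte {
	sizeTotal := 21 + len(m.callbackId) + len(m.target) + len(m.action) + len(m.content)
	b := buf[:0]
	b = appendUint32(b, uint32(sizeTotal))
	b = append(b, byte(m.contentType))
	b = appendUint32(b, uint32(len(m.callbackId)))
	b = appendUint32(b, uint32(len(m.target)))
	b = appendUint32(b, uint32(len(m.action)))
	b = appendUint32(b, uint32(len(m.content)))
	b = append(b, m.callbackId...)
	b = append(b, m.target...)
	b = append(b, m.action...)
	b = append(b, m.content...)
	return b
}
The caller keeps the returned slice and passes it back in on the next call, so after the first packet no further allocations are needed as long as the messages fit in the buffer's capacity.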
I looked carefully at that and made the following benchmarks.
package append
import "testing"
func BenchmarkAppend(b *testing.B) {
as := 1000
a := make([]byte, as)
s := make([]byte, 0, b.N*as)
for i := 0; i < b.N; i++ {
s = append(s, a...)
}
}
func BenchmarkCopy(b *testing.B) {
as := 1000
a := make([]byte, as)
s := make([]byte, 0, b.N*as)
for i := 0; i < b.N; i++ {
copy(s[i*as:(i+1)*as], a)
}
}
The results are
grzesiek@klapacjusz ~/g/s/t/append> go test -bench . -benchmem
testing: warning: no tests to run
PASS
BenchmarkAppend 10000000 202 ns/op 1000 B/op 0 allocs/op
BenchmarkCopy 10000000 201 ns/op 1000 B/op 0 allocs/op
ok test/append 4.564s
If sizeTotal is big enough, then your appends cause no additional memory allocations; the code copies only the bytes it needs to copy. It is perfectly fine.
