I'm working on a program that allocates lots of []int with lengths 4, 3, and 2, and found that using a := []int{1, 1, 1} is a little bit faster than a := make([]int, 3); a[0] = 1; a[1] = 1; a[2] = 1.
My question: why is a := []int{1, 1, 1} faster than a := make([]int, 3) followed by the three element assignments?
func BenchmarkMake(b *testing.B) {
    var array []int
    for i := 0; i < b.N; i++ {
        array = make([]int, 4)
        array[0] = 1
        array[1] = 1
        array[2] = 1
        array[3] = 1
    }
}

func BenchmarkDirect(b *testing.B) {
    var array []int
    for i := 0; i < b.N; i++ {
        array = []int{1, 1, 1, 1}
    }
    array[0] = 1
}
BenchmarkMake-4 50000000 34.3 ns/op
BenchmarkDirect-4 50000000 33.8 ns/op
Let's look at the benchmark output of the following code:
package main

import "testing"

func BenchmarkMake(b *testing.B) {
    var array []int
    for i := 0; i < b.N; i++ {
        array = make([]int, 4)
        array[0] = 1
        array[1] = 1
        array[2] = 1
        array[3] = 1
    }
}

func BenchmarkDirect(b *testing.B) {
    var array []int
    for i := 0; i < b.N; i++ {
        array = []int{1, 1, 1, 1}
    }
    array[0] = 1
}

func BenchmarkArray(b *testing.B) {
    var array [4]int
    for i := 0; i < b.N; i++ {
        array = [4]int{1, 1, 1, 1}
    }
    array[0] = 1
}
Usually the output looks like this:
$ go test -bench . -benchmem -o alloc_test -cpuprofile cpu.prof
goos: linux
goarch: amd64
pkg: test
BenchmarkMake-8 30000000 61.3 ns/op 32 B/op 1 allocs/op
BenchmarkDirect-8 20000000 60.2 ns/op 32 B/op 1 allocs/op
BenchmarkArray-8 1000000000 2.56 ns/op 0 B/op 0 allocs/op
PASS
ok test 6.003s
The difference is so small that it can be the opposite in some circumstances.
Let's look at the profiling data
$ go tool pprof -list 'Benchmark.*' cpu.prof
ROUTINE ======================== test.BenchmarkMake in /home/grzesiek/go/src/test/alloc_test.go
260ms 1.59s (flat, cum) 24.84% of Total
. . 5:func BenchmarkMake(b *testing.B) {
. . 6: var array []int
40ms 40ms 7: for i := 0; i < b.N; i++ {
50ms 1.38s 8: array = make([]int, 4)
. . 9: array[0] = 1
130ms 130ms 10: array[1] = 1
20ms 20ms 11: array[2] = 1
20ms 20ms 12: array[3] = 1
. . 13: }
. . 14:}
ROUTINE ======================== test.BenchmarkDirect in /home/grzesiek/go/src/test/alloc_test.go
90ms 1.66s (flat, cum) 25.94% of Total
. . 16:func BenchmarkDirect(b *testing.B) {
. . 17: var array []int
10ms 10ms 18: for i := 0; i < b.N; i++ {
80ms 1.65s 19: array = []int{1, 1, 1, 1}
. . 20: }
. . 21: array[0] = 1
. . 22:}
ROUTINE ======================== test.BenchmarkArray in /home/grzesiek/go/src/test/alloc_test.go
2.86s 2.86s (flat, cum) 44.69% of Total
. . 24:func BenchmarkArray(b *testing.B) {
. . 25: var array [4]int
500ms 500ms 26: for i := 0; i < b.N; i++ {
2.36s 2.36s 27: array = [4]int{1, 1, 1, 1}
. . 28: }
. . 29: array[0] = 1
. . 30:}
We can see that the assignments take some time.
To learn why, we need to look at the assembler code.
$ go tool pprof -disasm 'BenchmarkMake' cpu.prof
. . 4eda93: MOVQ AX, 0(SP) ;alloc_test.go:8
30ms 30ms 4eda97: MOVQ $0x4, 0x8(SP) ;test.BenchmarkMake alloc_test.go:8
. . 4edaa0: MOVQ $0x4, 0x10(SP) ;alloc_test.go:8
10ms 1.34s 4edaa9: CALL runtime.makeslice(SB) ;test.BenchmarkMake alloc_test.go:8
. . 4edaae: MOVQ 0x18(SP), AX ;alloc_test.go:8
10ms 10ms 4edab3: MOVQ 0x20(SP), CX ;test.BenchmarkMake alloc_test.go:8
. . 4edab8: TESTQ CX, CX ;alloc_test.go:9
. . 4edabb: JBE 0x4edb0b
. . 4edabd: MOVQ $0x1, 0(AX)
130ms 130ms 4edac4: CMPQ $0x1, CX ;test.BenchmarkMake alloc_test.go:10
. . 4edac8: JBE 0x4edb04 ;alloc_test.go:10
. . 4edaca: MOVQ $0x1, 0x8(AX)
20ms 20ms 4edad2: CMPQ $0x2, CX ;test.BenchmarkMake alloc_test.go:11
. . 4edad6: JBE 0x4edafd ;alloc_test.go:11
. . 4edad8: MOVQ $0x1, 0x10(AX)
. . 4edae0: CMPQ $0x3, CX ;alloc_test.go:12
. . 4edae4: JA 0x4eda65
We can see that the time is taken by the CMPQ instruction, which compares a constant with the CX register. The CX register holds the value copied from the stack after the call to make; we can deduce that it must be the size of the slice, while AX holds the reference to the underlying array. You can also see that the first bounds check was optimized away.
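As an aside (a sketch, not part of the original answer): if the repeated size checks are the concern, a single explicit bounds hint before the stores can let the compiler prove that all later indices are in range. Whether the checks are actually removed depends on the Go version.

array = make([]int, 4)
_ = array[3] // bounds hint: proves indices 0..3 are valid, so later checks can be dropped
array[0] = 1
array[1] = 1
array[2] = 1
array[3] = 1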
Conclusions
Allocations take the same time, but the assignments cost extra due to the slice size checks (as noticed by Terry Pang).
Using an array instead of a slice is much cheaper, as it saves the allocation.
Why is using an array so much cheaper?
In Go an array is basically a fixed-size chunk of memory; [1]int is essentially the same thing as an int. You can find more in the Go Slices: usage and internals article.
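Applied to the original question (a sketch with hypothetical function names, not the author's code): when the lengths 2, 3 and 4 are known at compile time, fixed-size array types avoid the slice allocation altogether.

// counts4 is a hypothetical helper: it returns a fixed-size array value,
// which can live on the caller's stack, so no heap allocation is needed.
func counts4() [4]int {
    return [4]int{1, 1, 1, 1}
}

func sumCounts() int {
    c := counts4()
    s := c[:] // a slice backed by the local array; it may escape depending on how it is used
    return s[0] + s[1] + s[2] + s[3]
}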
Related
I've implemented two functions for rotating an input array n times to the right.
Both implementations exit early, without doing any work, when the resulting array would be equal to the initial array after all rotations. This happens whenever n is 0 or a multiple of the length of the array.
func rotateRight1(nums []int, n int) {
    n = n % len(nums)
    if n == 0 {
        return
    }
    lastNDigits := make([]int, n)
    copy(lastNDigits, nums[len(nums)-n:])
    copy(nums[n:], nums[:len(nums)-n])
    copy(nums[:n], lastNDigits)
}
and
func rotateRight2(nums []int, n int) {
    n = n % len(nums)
    if n == 0 {
        return
    }
    i := 0
    current := nums[i]
    iAlreadySeen := i
    for j := 0; j < len(nums); j++ {
        nextI := (i + n) % len(nums)
        nums[nextI], current = current, nums[nextI]
        i = nextI
        // handle even length arrays where i+k might equal an already seen index
        if nextI == iAlreadySeen {
            i = (i + 1) % len(nums)
            iAlreadySeen = i
            current = nums[i]
        }
    }
}
When benchmarking, I was surprised to see more than a 20x difference in speed between the two functions when n equals 0.
func BenchmarkRotateRight1(b *testing.B) {
    nums := make([]int, 5_000)
    b.ResetTimer()
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        rotateRight1(nums, 0)
    }
}

func BenchmarkRotateRight2(b *testing.B) {
    nums := make([]int, 5_000)
    b.ResetTimer()
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        rotateRight2(nums, 0)
    }
}
go test -bench=. yields a result like this consistently:
cpu: Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
BenchmarkRotateRight1-4 1000000000 0.4603 ns/op 0 B/op 0 allocs/op
BenchmarkRotateRight2-4 97236492 12.11 ns/op 0 B/op 0 allocs/op
PASS
I don't understand this performance difference, as both functions are basically doing the same thing and exiting early at the if n == 0 condition.
Could someone help me understand this?
go version go1.18 linux/amd64
What you see is the result of compiler optimization. The compiler is clever enough to make applications faster by inlining function calls. Sometimes this optimization removes the call (and its work) entirely, artificially lowering the run time of a benchmark, which can make benchmarking tricky.
After profiling, we can notice the function rotateRight1 is not even being called during the benchmark execution:
(pprof) list BenchmarkRotateRight1
Total: 260ms
ROUTINE ======================== main_test.BenchmarkRotateRight1 in (edited)
260ms 260ms (flat, cum) 100% of Total
. . 43:func BenchmarkRotateRight1(b *testing.B) {
. . 44: nums := make([]int, 5_000)
. . 45:
. . 46: b.ResetTimer()
. . 47: b.ReportAllocs()
260ms 260ms 48: for i := 0; i < b.N; i++ {
. . 49: rotateRight1(nums, 0)
. . 50: }
. . 51:}
On the other hand, rotateRight2 is being called, and that's why you see that difference in the run time of the benchmarks:
(pprof) list BenchmarkRotateRight2
Total: 1.89s
ROUTINE ======================== main_test.BenchmarkRotateRight2 in (edited)
180ms 1.89s (flat, cum) 100% of Total
. . 53:func BenchmarkRotateRight2(b *testing.B) {
. . 54: nums := make([]int, 5_000)
. . 55:
. . 56: b.ResetTimer()
. . 57: b.ReportAllocs()
130ms 130ms 58: for i := 0; i < b.N; i++ {
50ms 1.76s 59: rotateRight2(nums, 0)
. . 60: }
. . 61:}
If you run the benchmark with -gcflags '-m -m' (double -m, or -m=2), you will see some of the compiler's optimization decisions (reference):
$ go test -gcflags '-m -m' -benchmem -bench . main_test.go
# command-line-arguments_test [command-line-arguments.test]
./main_test.go:7:6: can inline rotateRight1 with cost 40 as: func([]int, int) { n = n % len(nums); if n == 0 { return }; lastNDigits := make([]int, n); copy(lastNDigits, nums[len(nums) - n:]); copy(nums[n:], nums[:len(nums) - n]); copy(nums[:n], lastNDigits) }
./main_test.go:20:6: cannot inline rotateRight2: function too complex: cost 83 exceeds budget 80
...
So the compiler decides whether to inline a function based on a complexity (cost) threshold.
You can place the //go:noinline directive right before a function to disable inlining for it, overriding the compiler's usual optimization decision:
//go:noinline
func rotateRight1(nums []int, n int) {
...
}
Now you will notice the benchmarks are very similar:
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkRotateRight1
BenchmarkRotateRight1-16 135554571 8.886 ns/op 0 B/op 0 allocs/op
BenchmarkRotateRight2
BenchmarkRotateRight2-16 143716638 8.775 ns/op 0 B/op 0 allocs/op
PASS
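An alternative to //go:noinline (a sketch with hypothetical benchmark names): keep inlining enabled but benchmark a rotation count that forces both functions to do real work, so the early-return path is no longer the only thing being measured. Note that this also measures the make/copy allocation inside rotateRight1.

func BenchmarkRotateRight1NonZero(b *testing.B) {
    nums := make([]int, 5_000)
    b.ResetTimer()
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        rotateRight1(nums, 3) // n != 0, so the copies are actually performed
    }
}

func BenchmarkRotateRight2NonZero(b *testing.B) {
    nums := make([]int, 5_000)
    b.ResetTimer()
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        rotateRight2(nums, 3) // n != 0, so the element-by-element rotation runs
    }
}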
I'm an experienced C++ programmer used to low-level optimization, and I'm trying to get performance out of Go.
So far, I'm interested in GFlop/s.
I wrote the following Go code:
package main

import (
    "fmt"
    "time"
    "runtime"
    "sync"
)

func expm1(x float64) float64 {
    return ((((((((((((((15.0 + x) * x + 210.0) * x + 2730.0) * x + 32760.0) * x + 360360.0) * x + 3603600.0) * x + 32432400.0) * x + 259459200.0) * x + 1816214400.0) * x + 10897286400.0) * x + 54486432000.0) * x + 217945728000.0) *
        x + 653837184000.0) * x + 1307674368000.0) * x * 7.6471637318198164759011319857881e-13;
}

func twelve(x float64) float64 {
    return expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(x))))))))))));
}

func populate(data []float64, N int) {
    CPUCOUNT := runtime.NumCPU();
    var wg sync.WaitGroup
    var slice = N / CPUCOUNT;
    wg.Add(CPUCOUNT)
    defer wg.Wait()
    for i := 0; i < CPUCOUNT; i++ {
        go func(ii int) {
            for j := ii * slice; j < ii * slice + slice; j += 1 {
                data[j] = 0.1;
            }
            defer wg.Done();
        }(i);
    }
}

func apply(data []float64, N int) {
    CPUCOUNT := runtime.NumCPU();
    var wg sync.WaitGroup
    var slice = N / CPUCOUNT;
    wg.Add(CPUCOUNT)
    defer wg.Wait()
    for i := 0; i < CPUCOUNT; i++ {
        go func(ii int) {
            for j := ii * slice; j < ii * slice + slice; j += 8 {
                data[j] = twelve(data[j]);
                data[j+1] = twelve(data[j+1]);
                data[j+2] = twelve(data[j+2]);
                data[j+3] = twelve(data[j+3]);
                data[j+4] = twelve(data[j+4]);
                data[j+5] = twelve(data[j+5]);
                data[j+6] = twelve(data[j+6]);
                data[j+7] = twelve(data[j+7]);
            }
            defer wg.Done();
        }(i);
    }
}

func Run(data []float64, N int) {
    populate(data, N);
    start := time.Now();
    apply(data, N);
    stop := time.Now();
    elapsed := stop.Sub(start);
    seconds := float64(elapsed.Milliseconds()) / 1000.0;
    Gflop := float64(N) * 12.0 * 15.0E-9;
    fmt.Printf("%f\n", Gflop / seconds);
}

func main() {
    CPUCOUNT := runtime.NumCPU();
    fmt.Printf("num procs : %d\n", CPUCOUNT);
    N := 1024 * 1024 * 32 * CPUCOUNT;
    data := make([]float64, N);
    for i := 0; i < 100; i++ {
        Run(data, N);
    }
}
which is an attempt at translating my C++ benchmark, which yields 80% of peak FLOPS.
The C++ version yields 95 GFlop/s, whereas the Go version yields 6 GFlop/s (an FMA counted as 1 operation).
Here is a piece of the Go assembly (gccgo -O3 -mfma -mavx2):
vfmadd132sd %xmm1, %xmm15, %xmm0
.loc 1 12 50
vfmadd132sd %xmm1, %xmm14, %xmm0
.loc 1 12 64
vfmadd132sd %xmm1, %xmm13, %xmm0
.loc 1 12 79
vfmadd132sd %xmm1, %xmm12, %xmm0
.loc 1 12 95
vfmadd132sd %xmm1, %xmm11, %xmm0
.loc 1 12 112
vfmadd132sd %xmm1, %xmm10, %xmm0
And here is what I get from my C++ code (g++ -fopenmp -mfma -mavx2 -O3):
vfmadd213pd .LC3(%rip), %ymm12, %ymm5
vfmadd213pd .LC3(%rip), %ymm11, %ymm4
vfmadd213pd .LC3(%rip), %ymm10, %ymm3
vfmadd213pd .LC3(%rip), %ymm9, %ymm2
vfmadd213pd .LC3(%rip), %ymm8, %ymm1
vfmadd213pd .LC3(%rip), %ymm15, %ymm0
vfmadd213pd .LC4(%rip), %ymm15, %ymm0
vfmadd213pd .LC4(%rip), %ymm14, %ymm7
vfmadd213pd .LC4(%rip), %ymm13, %ymm6
vfmadd213pd .LC4(%rip), %ymm12, %ymm5
vfmadd213pd .LC4(%rip), %ymm11, %ymm4
I therefore have a few questions, the most important of which is:
Do I express parallelism the right way?
And if not, how should I do that?
For additional performance improvements, I'd need to know what's wrong with the following items:
Why do I see only vfmadd132sd instructions in the assembly, instead of vfmadd132pd?
How can I properly align memory allocations?
How can I remove debug info from the generated executable?
Do I pass the right options to gccgo?
Do I use the right compiler?
Do I express parallelism the right way?
No. You might be thrashing the CPU cache. (But this is hard to tell without knowing details about your system; I guess it's not NUMA?) Anyway, technically your code is concurrent, not parallel.
Why do I see only vfmadd132sd instructions in the assembly, instead of vfmadd132pd?
Because the compiler put it there. Is this a compiler question or a programming question?
How can I properly align memory allocations?
That depends on your definition of "properly". Struct field and slice alignments are not ad hoc controllable, but you can reorder struct fields (which you did not use at all, so I do not know what you are asking here).
How can I remove debug info from the generated executable?
Consult the documentation of gcc.
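For what it's worth (a sketch, assuming either toolchain is an option): with the standard gc toolchain the linker can drop the symbol table and DWARF data, and with gccgo the usual GCC/binutils approach applies.

# standard gc toolchain: -s omits the symbol table, -w omits DWARF debug info
go build -ldflags "-s -w" .

# gccgo: build without debug info, or strip the resulting binary
gccgo -O3 -mfma -mavx2 -g0 -o prog main.go
strip prog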
Do I pass the right options to gccgo?
I do not know.
Do I use the right compiler?
What makes a compiler "right"?
I am currently trying to implement a tree-based data structure in Go, and I am seeing disappointing results in my benchmarking. Because I am trying to be generic as to what values I accept, I am limited to using interface{}.
The code in question is an immutable vector trie. Essentially, any time a value in the vector is modified, I need to make a copy of several nodes in the trie. Each of these nodes is implemented as a slice of constant (known at compile time) length. For example, writing a value into a large trie will require copying 5 separate 32-element slices. They must be copies to preserve the immutability of the previous contents.
I believe the disappointing benchmark results are because I am storing my data as interface{} in slices, which get created, copied and appended to often. To measure this I set up the following benchmark:
package main

import (
    "math/rand"
    "testing"
)

func BenchmarkMake10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        _ = make([]int, 10e6, 10e6)
    }
}

func BenchmarkMakePtr10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        _ = make([]*int, 10e6, 10e6)
    }
}

func BenchmarkMakeInterface10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        _ = make([]interface{}, 10e6, 10e6)
    }
}

func BenchmarkMakeInterfacePtr10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        _ = make([]interface{}, 10e6, 10e6)
    }
}

func BenchmarkAppend10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        slc := make([]int, 0, 0)
        for jj := 0; jj < 10e6; jj++ {
            slc = append(slc, jj)
        }
    }
}

func BenchmarkAppendPtr10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        slc := make([]*int, 0, 0)
        for jj := 0; jj < 10e6; jj++ {
            slc = append(slc, &jj)
        }
    }
}

func BenchmarkAppendInterface10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        slc := make([]interface{}, 0, 0)
        for jj := 0; jj < 10e6; jj++ {
            slc = append(slc, jj)
        }
    }
}

func BenchmarkAppendInterfacePtr10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        slc := make([]interface{}, 0, 0)
        for jj := 0; jj < 10e6; jj++ {
            slc = append(slc, &jj)
        }
    }
}

func BenchmarkSet(b *testing.B) {
    slc := make([]int, 10e6, 10e6)
    b.ResetTimer()
    for ii := 0; ii < b.N; ii++ {
        slc[rand.Intn(10e6-1)] = 1
    }
}

func BenchmarkSetPtr(b *testing.B) {
    slc := make([]*int, 10e6, 10e6)
    b.ResetTimer()
    for ii := 0; ii < b.N; ii++ {
        theInt := 1
        slc[rand.Intn(10e6-1)] = &theInt
    }
}

func BenchmarkSetInterface(b *testing.B) {
    slc := make([]interface{}, 10e6, 10e6)
    b.ResetTimer()
    for ii := 0; ii < b.N; ii++ {
        slc[rand.Intn(10e6-1)] = 1
    }
}

func BenchmarkSetInterfacePtr(b *testing.B) {
    slc := make([]interface{}, 10e6, 10e6)
    b.ResetTimer()
    for ii := 0; ii < b.N; ii++ {
        theInt := 1
        slc[rand.Intn(10e6-1)] = &theInt
    }
}
which gives the following results:
BenchmarkMake10M-4 300 4962381 ns/op
BenchmarkMakePtr10M-4 100 10255522 ns/op
BenchmarkMakeInterface10M-4 100 19788588 ns/op
BenchmarkMakeInterfacePtr10M-4 100 19850682 ns/op
BenchmarkAppend10M-4 20 67090711 ns/op
BenchmarkAppendPtr10M-4 1 2784300818 ns/op
BenchmarkAppendInterface10M-4 1 3457503833 ns/op
BenchmarkAppendInterfacePtr10M-4 1 3532502711 ns/op
BenchmarkSet-4 30000000 43.5 ns/op
BenchmarkSetPtr-4 20000000 91.2 ns/op
BenchmarkSetInterface-4 30000000 43.5 ns/op
BenchmarkSetInterfacePtr-4 20000000 70.9 ns/op
The difference on Set and Make seems to be about 2-4x, but the difference on Append is about 40x.
From what I understand, the performance hit occurs because, behind the scenes, interfaces are implemented as pointers, and those pointers must be allocated on the heap. That still doesn't explain why Append is hit so much harder than Set or Make.
Is there a way in the current Go language, without using a code generation tool (e.g., a generics tool that lets the consumer of the library generate a version of the library to store FooType), to work around this 40x performance hit? Alternatively, have I made some error in my benchmarking?
Let's profile the test with memory benchmarks.
go test -bench . -cpuprofile cpu.prof -benchmem
goos: linux
goarch: amd64
BenchmarkMake10M-8 100 10254248 ns/op 80003282 B/op 1 allocs/op
BenchmarkMakePtr10M-8 100 18696295 ns/op 80003134 B/op 1 allocs/op
BenchmarkMakeInterface10M-8 50 34501361 ns/op 160006147 B/op 1 allocs/op
BenchmarkMakeInterfacePtr10M-8 50 35129085 ns/op 160006652 B/op 1 allocs/op
BenchmarkAppend10M-8 20 69971722 ns/op 423503264 B/op 50 allocs/op
BenchmarkAppendPtr10M-8 1 2135090501 ns/op 423531096 B/op 62 allocs/op
BenchmarkAppendInterface10M-8 1 1833396620 ns/op 907567984 B/op 10000060 allocs/op
BenchmarkAppendInterfacePtr10M-8 1 2270970241 ns/op 827546240 B/op 53 allocs/op
BenchmarkSet-8 30000000 54.0 ns/op 0 B/op 0 allocs/op
BenchmarkSetPtr-8 20000000 91.6 ns/op 8 B/op 1 allocs/op
BenchmarkSetInterface-8 30000000 58.0 ns/op 0 B/op 0 allocs/op
BenchmarkSetInterfacePtr-8 20000000 88.0 ns/op 8 B/op 1 allocs/op
PASS
ok _/home/grzesiek/test 22.427s
We can see that the slowest benchmarks are the ones that make allocations.
PPROF_BINARY_PATH=. go tool pprof -disasm BenchmarkAppend cpu.prof
Total: 29.75s
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppend10M
210m 1.51s (flat, cum) 5.08% of Total
. 1.30s 4e827a: CALL runtime.growslice(SB) ;_/home/grzesiek/test.BenchmarkAppend10M test_test.go:35
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendInterface10M
20m 930ms (flat, cum) 3.13% of Total
. 630ms 4e8519: CALL runtime.growslice(SB) ;_/home/grzesiek/test.BenchmarkAppendInterface10M test_test.go:53
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendInterfacePtr10M
0 800ms (flat, cum) 2.69% of Total
. 770ms 4e8625: CALL runtime.growslice(SB) ;_/home/grzesiek/test.BenchmarkAppendInterfacePtr10M test_test.go:62
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendPtr10M
0 950ms (flat, cum) 3.19% of Total
. 870ms 4e8374: CALL runtime.growslice(SB) ;_/home/grzesiek/test.BenchmarkAppendPtr10M test_test.go:44
By analyzing the number of bytes allocated, we can see that the use of interface{} doubles the allocation size.
Why is BenchmarkAppend10M so much faster than the other BenchmarkAppend* benchmarks?
To figure this out, we need to see the escape analysis.
go test -gcflags '-m -l' original_test.go
./original_test.go:31:28: BenchmarkAppend10M b does not escape
./original_test.go:33:14: BenchmarkAppend10M make([]int, 0, 0) does not escape
./original_test.go:40:31: BenchmarkAppendPtr10M b does not escape
./original_test.go:42:14: BenchmarkAppendPtr10M make([]*int, 0, 0) does not escape
./original_test.go:43:7: moved to heap: jj
./original_test.go:44:22: &jj escapes to heap
./original_test.go:49:37: BenchmarkAppendInterface10M b does not escape
./original_test.go:51:14: BenchmarkAppendInterface10M make([]interface {}, 0, 0) does not escape
./original_test.go:53:16: jj escapes to heap
./original_test.go:58:40: BenchmarkAppendInterfacePtr10M b does not escape
./original_test.go:60:14: BenchmarkAppendInterfacePtr10M make([]interface {}, 0, 0) does not escape
./original_test.go:61:7: moved to heap: jj
./original_test.go:62:16: &jj escapes to heap
./original_test.go:62:22: &jj escapes to heap
We can see that it is the only benchmark in which jj does not escape to the heap. We can deduce that accessing a heap-allocated variable causes the slowdown.
Why does BenchmarkAppendInterface10M make so many allocations?
In the assembler, we can see that it is the only one that calls the runtime.convT2E64 function.
PPROF_BINARY_PATH=. go tool pprof -disasm BenchmarkAppend cpu.prof
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendInterface10M
30ms 1.10s (flat, cum) 3.35% of Total
. 260ms 4e8490: CALL runtime.convT2E64(SB)
The source code from runtime/iface.go looks like this:
func convT2E64(t *_type, elem unsafe.Pointer) (e eface) {
    if raceenabled {
        raceReadObjectPC(t, elem, getcallerpc(), funcPC(convT2E64))
    }
    if msanenabled {
        msanread(elem, t.size)
    }
    var x unsafe.Pointer
    if *(*uint64)(elem) == 0 {
        x = unsafe.Pointer(&zeroVal[0])
    } else {
        x = mallocgc(8, t, false)
        *(*uint64)(x) = *(*uint64)(elem)
    }
    e._type = t
    e.data = x
    return
}
As we can see, it makes the allocation by calling the mallocgc function.
I know this does not directly help fix your code, but I hope it gives you the tools and techniques to analyze and optimize it.
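To see the boxing cost in isolation (a sketch, not from the original answer; exact numbers depend on the Go version, and as the zeroVal branch above shows, some values such as zero avoid the allocation entirely):

package main

import (
    "fmt"
    "testing"
)

var sink interface{} // package-level sink so the stores cannot be optimized away

func main() {
    n := 123456

    // Storing a non-pointer value in an interface{} boxes it: the runtime
    // copies the value to a heap cell (the mallocgc call in convT2E64 above).
    valueAllocs := testing.AllocsPerRun(1000, func() { sink = n })

    // Storing a pointer puts it directly in the interface's data word,
    // so the conversion itself does not allocate.
    p := &n
    ptrAllocs := testing.AllocsPerRun(1000, func() { sink = p })

    fmt.Println(valueAllocs, ptrAllocs) // typically prints: 1 0
}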
I've been trying to understand slice preallocation with make and why it's a good idea. I noticed a large performance difference between preallocating a slice and appending to it vs just initializing it with 0 length/capacity and then appending to it. I wrote a set of very simple benchmarks:
import "testing"
func BenchmarkNoPreallocate(b *testing.B) {
for i := 0; i < b.N; i++ {
// Don't preallocate our initial slice
init := []int64{}
init = append(init, 5)
}
}
func BenchmarkPreallocate(b *testing.B) {
for i := 0; i < b.N; i++ {
// Preallocate our initial slice
init := make([]int64, 0, 1)
init = append(init, 5)
}
}
and was a little puzzled by the results:
$ go test -bench=. -benchmem
goos: linux
goarch: amd64
BenchmarkNoPreallocate-4 30000000 41.8 ns/op 8 B/op 1 allocs/op
BenchmarkPreallocate-4 2000000000 0.29 ns/op 0 B/op 0 allocs/op
I have a couple of questions:
Why are there no allocations (it shows 0 allocs/op) in the preallocation benchmark case? Certainly we're preallocating, but the allocation had to have happened at some point.
I imagine this may become clearer after the first question is answered, but how is the preallocation case so much quicker? Am I misinterpreting this benchmark?
Please let me know if anything is unclear. Thank you!
Go has an optimizing compiler. Constants are evaluated at compile time; variables are evaluated at runtime. Constant values can be used to optimize compiler-generated code. For example,
package main

import "testing"

func BenchmarkNoPreallocate(b *testing.B) {
    for i := 0; i < b.N; i++ {
        // Don't preallocate our initial slice
        init := []int64{}
        init = append(init, 5)
    }
}

func BenchmarkPreallocateConst(b *testing.B) {
    const (
        l = 0
        c = 1
    )
    for i := 0; i < b.N; i++ {
        // Preallocate our initial slice
        init := make([]int64, l, c)
        init = append(init, 5)
    }
}

func BenchmarkPreallocateVar(b *testing.B) {
    var (
        l = 0
        c = 1
    )
    for i := 0; i < b.N; i++ {
        // Preallocate our initial slice
        init := make([]int64, l, c)
        init = append(init, 5)
    }
}
Output:
$ go test alloc_test.go -bench=. -benchmem
BenchmarkNoPreallocate-4 50000000 39.3 ns/op 8 B/op 1 allocs/op
BenchmarkPreallocateConst-4 2000000000 0.36 ns/op 0 B/op 0 allocs/op
BenchmarkPreallocateVar-4 50000000 28.2 ns/op 8 B/op 1 allocs/op
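The zero allocations in the const case (an explanation inferred from the numbers above, which you can check yourself): when the capacity is a small compile-time constant, the compiler can prove that the slice created by make([]int64, l, c) never escapes the loop body, so its backing array is placed on the stack (and since init is never used afterwards, the work may be eliminated almost entirely). With a variable capacity, the backing array has to be heap-allocated on every iteration. Escape analysis shows the decision; look for "does not escape" versus "escapes to heap" in the output of something like:

$ go test alloc_test.go -gcflags '-m -l' -bench=. -benchmem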
Another interesting set of benchmarks:
package main

import "testing"

func BenchmarkNoPreallocate(b *testing.B) {
    const (
        l = 0
        c = 8 * 1024
    )
    for i := 0; i < b.N; i++ {
        // Don't preallocate our initial slice
        init := []int64{}
        for j := 0; j < c; j++ {
            init = append(init, 42)
        }
    }
}

func BenchmarkPreallocateConst(b *testing.B) {
    const (
        l = 0
        c = 8 * 1024
    )
    for i := 0; i < b.N; i++ {
        // Preallocate our initial slice
        init := make([]int64, l, c)
        for j := 0; j < cap(init); j++ {
            init = append(init, 42)
        }
    }
}

func BenchmarkPreallocateVar(b *testing.B) {
    var (
        l = 0
        c = 8 * 1024
    )
    for i := 0; i < b.N; i++ {
        // Preallocate our initial slice
        init := make([]int64, l, c)
        for j := 0; j < cap(init); j++ {
            init = append(init, 42)
        }
    }
}
Output:
$ go test peter_test.go -bench=. -benchmem
BenchmarkNoPreallocate-4 20000 75656 ns/op 287992 B/op 19 allocs/op
BenchmarkPreallocateConst-4 100000 22386 ns/op 65536 B/op 1 allocs/op
BenchmarkPreallocateVar-4 100000 22112 ns/op 65536 B/op 1 allocs/op
At the moment I am implementing some sorting algorithms. As is in the nature of such algorithms, there are a lot of calls that read the length of arrays/slices using the len() function.
Now, given the following code for (part of) the Mergesort algorithm:
for len(left) > 0 || len(right) > 0 {
    if len(left) > 0 && len(right) > 0 {
        if left[0] <= right[0] {
            result = append(result, left[0])
            left = left[1:len(left)]
        } else {
            result = append(result, right[0])
            right = right[1:len(right)]
        }
    } else if len(left) > 0 {
        result = append(result, left[0])
        left = left[1:len(left)]
    } else if len(right) > 0 {
        result = append(result, right[0])
        right = right[1:len(right)]
    }
}
My question is: do these multiple len() calls negatively affect the algorithm's performance? Is it better to store the lengths of the left and right slices in temporary variables, or does the compiler do this itself?
There are two cases:
Local slice: length will be cached and there is no overhead
Global slice or passed (by reference): length cannot be cached and there is overhead
No overhead for local slices
For locally defined slices the length is cached, so there is no runtime overhead. You can see this in the assembly of the following program:
func generateSlice(x int) []int {
    return make([]int, x)
}

func main() {
    x := generateSlice(10)
    println(len(x))
}
Compiled with go tool 6g -S test.go this yields, amongst other things, the following lines:
MOVQ "".x+40(SP),BX
MOVQ BX,(SP)
// ...
CALL ,runtime.printint(SB)
What happens here is that the first line retrieves the length of x by getting the value located 40 bytes from the beginning of x and most importantly caches this value in BX, which is then used for every occurrence of len(x). The reason for the offset is that an array has the following structure (source):
typedef struct
{                   // must not move anything
    uchar array[8]; // pointer to data
    uchar nel[4];   // number of elements
    uchar cap[4];   // allocated number of elements
} Array;
nel is what is accessed by len(). You can see this in the code generation as well.
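For reference (current Go, rather than the old C runtime quoted above), the same three-word layout is what the reflect package documents as SliceHeader: len() reads the Len word and cap() reads the Cap word.

// Mirrors reflect.SliceHeader (shown here for illustration only):
type SliceHeader struct {
    Data uintptr // pointer to the backing array
    Len  int     // what len() returns
    Cap  int     // what cap() returns
}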
Global and referenced slices have overhead
For shared values caching of the length is not possible since the compiler has to assume that the slice changes between calls. Therefore the compiler has to write code that accesses the length attribute directly every time. Example:
func accessLocal() int {
    a := make([]int, 1000) // local
    count := 0
    for i := 0; i < len(a); i++ {
        count += len(a)
    }
    return count
}

var ag = make([]int, 1000) // pseudo-code

func accessGlobal() int {
    count := 0
    for i := 0; i < len(ag); i++ {
        count += len(ag)
    }
    return count
}
Comparing the assembly of both functions yields the crucial difference that as soon as the variable is global the access to the nel attribute is not cached anymore and there will be a runtime overhead:
// accessLocal
MOVQ "".a+8048(SP),SI // cache length in SI
// ...
CMPQ SI,AX // i < len(a)
// ...
MOVQ SI,BX
ADDQ CX,BX
MOVQ BX,CX // count += len(a)
// accessGlobal
MOVQ "".ag+8(SB),BX
CMPQ BX,AX // i < len(ag)
// ...
MOVQ "".ag+8(SB),BX
ADDQ CX,BX
MOVQ BX,CX // count += len(ag)
Despite the good answers you are getting, I get poorer performance when calling len(a) repeatedly, for example in this test: http://play.golang.org/p/fiP1Sy2Hfk
package main

import "testing"

func BenchmarkTest1(b *testing.B) {
    a := make([]int, 1000)
    for i := 0; i < b.N; i++ {
        count := 0
        for i := 0; i < len(a); i++ {
            count += len(a)
        }
    }
}

func BenchmarkTest2(b *testing.B) {
    a := make([]int, 1000)
    for i := 0; i < b.N; i++ {
        count := 0
        lena := len(a)
        for i := 0; i < lena; i++ {
            count += lena
        }
    }
}
When run as go test -bench=. I get:
BenchmarkTest1 5000000 668 ns/op
BenchmarkTest2 5000000 402 ns/op
So there is clearly a penalty here, possibly because the compiler is missing this optimization at compile time.
Things seem to have improved in later versions of Go:
go version go1.16.7 linux/amd64
goos: linux
goarch: amd64
pkg: 001_test
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
BenchmarkTest1-8 4903609 228.8 ns/op
BenchmarkTest2-8 5280086 229.9 ns/op