What is the time complexity of Go's copy function?
Intuitively I would assume linear time in the worst case, but I was wondering whether some magic, such as bulk allocation, lets it perform better.
https://golang.org/ref/spec#Appending_and_copying_slices
I figured the assembly would explain something, but I'm not sure what I'm reading :p
$ GOOS=linux GOARCH=amd64 go tool compile -S main.go
func main() {
src := []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
dst := make([]int, len(src))
numCopied := copy(dst, src)
if numCopied != 10 {
panic(fmt.Sprintf("expected 10 copied, received: %d", numCopied))
}
}
With the following output from the copy line:
0x007a 00122 (main.go:23) CMPQ AX, $10
0x007e 00126 (main.go:23) JLE 133
0x0080 00128 (main.go:23) MOVL $10, AX
0x0085 00133 (main.go:23) MOVQ AX, "".numCopied+56(SP)
0x008a 00138 (main.go:23) MOVQ CX, (SP)
0x008e 00142 (main.go:23) LEAQ ""..autotmp_8+72(SP), CX
0x0093 00147 (main.go:23) MOVQ CX, 8(SP)
0x0098 00152 (main.go:23) SHLQ $3, AX
0x009c 00156 (main.go:23) MOVQ AX, 16(SP)
0x00a1 00161 (main.go:23) PCDATA $0, $0
0x00a1 00161 (main.go:23) CALL runtime.memmove(SB)
0x00a6 00166 (main.go:23) MOVQ "".numCopied+56(SP), AX
I then tried with 5 elements as well:
func main() {
src := []int{1, 2, 3, 4, 5}
dst := make([]int, len(src))
numCopied := copy(dst, src)
if numCopied != 5 {
panic(fmt.Sprintf("expected 5 copied received: %d", numCopied))
}
}
With the following output from the copy line:
0x0086 00134 (main.go:9) CMPQ AX, $5
0x008a 00138 (main.go:9) JLE 145
0x008c 00140 (main.go:9) MOVL $5, AX
0x0091 00145 (main.go:9) MOVQ AX, "".numCopied+56(SP)
0x0096 00150 (main.go:9) MOVQ CX, (SP)
0x009a 00154 (main.go:9) LEAQ ""..autotmp_8+72(SP), CX
0x009f 00159 (main.go:9) MOVQ CX, 8(SP)
0x00a4 00164 (main.go:9) SHLQ $3, AX
0x00a8 00168 (main.go:9) MOVQ AX, 16(SP)
0x00ad 00173 (main.go:9) PCDATA $0, $0
0x00ad 00173 (main.go:9) CALL runtime.memmove(SB)
0x00b2 00178 (main.go:9) MOVQ "".numCopied+56(SP), AX
I suggest benchmarking the time it takes to copy arrays and slices of different sizes. Here is something to get the ball rolling:
package main
import (
"fmt"
"math"
"testing"
)
func main() {
for i := 0; i < 16; i++ {
size := powerOfTwo(i)
runBench(size)
}
}
func runBench(size int) {
bench := func(b *testing.B) {
src := make([]int, size, size)
dst := make([]int, size, size)
// we don't want to measure the time
// it takes to make the arrays, so reset timer
b.ResetTimer()
for i := 0; i < b.N; i++ {
copy(dst, src)
}
}
// BenchmarkResult's String form has no trailing newline, so add one
fmt.Printf("size = %d, %s\n", size, testing.Benchmark(bench))
}
func powerOfTwo(i int) int {
return int(math.Pow(float64(2), float64(i)))
}
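As for reading the assembly in the question: the CMPQ/JLE pair appears to clamp the element count to the shorter of the two slices, SHLQ $3 multiplies that count by 8 (the size of an int on amd64), and the result is handed to runtime.memmove as a byte count. That suggests the work is proportional to the number of bytes moved, i.e. linear. The following is a minimal sketch of the clamping behaviour only; copyCount is a hypothetical helper, not something from the question.

```go
package main

import "fmt"

// copyCount (hypothetical helper) returns how many elements copy moves
// from a source slice of length srcLen into a destination of length dstLen.
func copyCount(dstLen, srcLen int) int {
	src := make([]int, srcLen)
	dst := make([]int, dstLen)
	return copy(dst, src)
}

func main() {
	fmt.Println(copyCount(10, 10)) // equal lengths: every element is copied
	fmt.Println(copyCount(3, 10))  // shorter destination: only len(dst) elements
	fmt.Println(copyCount(10, 3))  // shorter source: only len(src) elements
}
```

The return value is always the smaller of the two lengths, which is the quantity the compiled code passes on to memmove.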
I have the following code:
func AddToSliceByValue(mySlice []int) {
for idx := range mySlice {
mySlice[idx]++
}
}
func AddToSliceByPointer(mySlice *[]int) {
for idx := range *mySlice {
(*mySlice)[idx]++
}
}
My first thought was that the performance should be nearly the same, because pass by value copies the slice header and pass by pointer forces me to dereference a pointer, but my benchmark shows something else:
func BenchmarkAddByValue(b *testing.B) {
mySlice := rand.Perm(1000)
for n := 0; n < b.N; n++ {
AddToSliceByValue(mySlice)
}
}
func BenchmarkAddByPointer(b *testing.B) {
mySlice := rand.Perm(1000)
for n := 0; n < b.N; n++ {
AddToSliceByPointer(&mySlice)
}
}
BenchmarkAddByValue-12 1151256 1035 ns/op
BenchmarkAddByPointer-12 2145110 525 ns/op
Can anyone explain to me why the difference in performance is so great?
I also added the assembly code for the two functions.
Assembly code for pass by value:
TEXT main.AddToSliceByValue(SB) /go_test/pointer/pointer_value.go
pointer_value.go:4 0x1056f60 488b442410 MOVQ 0x10(SP), AX
pointer_value.go:4 0x1056f65 488b4c2408 MOVQ 0x8(SP), CX
pointer_value.go:4 0x1056f6a 31d2 XORL DX, DX
pointer_value.go:4 0x1056f6c eb0e JMP 0x1056f7c
pointer_value.go:5 0x1056f6e 488b1cd1 MOVQ 0(CX)(DX*8), BX
pointer_value.go:5 0x1056f72 48ffc3 INCQ BX
pointer_value.go:5 0x1056f75 48891cd1 MOVQ BX, 0(CX)(DX*8)
pointer_value.go:4 0x1056f79 48ffc2 INCQ DX
pointer_value.go:4 0x1056f7c 4839c2 CMPQ AX, DX
pointer_value.go:4 0x1056f7f 7ced JL 0x1056f6e
pointer_value.go:4 0x1056f81 c3 RET
:-1 0x1056f82 cc INT $0x3
:-1 0x1056f83 cc INT $0x3
:-1 0x1056f84 cc INT $0x3
:-1 0x1056f85 cc INT $0x3
:-1 0x1056f86 cc INT $0x3
:-1 0x1056f87 cc INT $0x3
:-1 0x1056f88 cc INT $0x3
:-1 0x1056f89 cc INT $0x3
:-1 0x1056f8a cc INT $0x3
:-1 0x1056f8b cc INT $0x3
:-1 0x1056f8c cc INT $0x3
:-1 0x1056f8d cc INT $0x3
:-1 0x1056f8e cc INT $0x3
:-1 0x1056f8f cc INT $0x3
TEXT main.main(SB) /go_test/pointer/pointer_value.go
pointer_value.go:9 0x1056f90 65488b0c2530000000 MOVQ GS:0x30, CX
pointer_value.go:9 0x1056f99 483b6110 CMPQ 0x10(CX), SP
pointer_value.go:9 0x1056f9d 0f86a8000000 JBE 0x105704b
pointer_value.go:9 0x1056fa3 4883ec70 SUBQ $0x70, SP
pointer_value.go:9 0x1056fa7 48896c2468 MOVQ BP, 0x68(SP)
pointer_value.go:9 0x1056fac 488d6c2468 LEAQ 0x68(SP), BP
pointer_value.go:11 0x1056fb1 488d7c2418 LEAQ 0x18(SP), DI
pointer_value.go:11 0x1056fb6 0f57c0 XORPS X0, X0
pointer_value.go:11 0x1056fb9 488d7fd0 LEAQ -0x30(DI), DI
pointer_value.go:11 0x1056fbd 48896c24f0 MOVQ BP, -0x10(SP)
pointer_value.go:11 0x1056fc2 488d6c24f0 LEAQ -0x10(SP), BP
pointer_value.go:11 0x1056fc7 e849c6ffff CALL 0x1053615
pointer_value.go:11 0x1056fcc 488b6d00 MOVQ 0(BP), BP
pointer_value.go:11 0x1056fd0 48c744242001000000 MOVQ $0x1, 0x20(SP)
pointer_value.go:11 0x1056fd9 48c744242802000000 MOVQ $0x2, 0x28(SP)
pointer_value.go:11 0x1056fe2 48c744243003000000 MOVQ $0x3, 0x30(SP)
pointer_value.go:11 0x1056feb 48c744243804000000 MOVQ $0x4, 0x38(SP)
pointer_value.go:11 0x1056ff4 48c744244005000000 MOVQ $0x5, 0x40(SP)
pointer_value.go:11 0x1056ffd 48c744244806000000 MOVQ $0x6, 0x48(SP)
pointer_value.go:11 0x1057006 48c744245007000000 MOVQ $0x7, 0x50(SP)
pointer_value.go:11 0x105700f 48c744245808000000 MOVQ $0x8, 0x58(SP)
pointer_value.go:11 0x1057018 48c744246009000000 MOVQ $0x9, 0x60(SP)
pointer_value.go:12 0x1057021 488d442418 LEAQ 0x18(SP), AX
pointer_value.go:12 0x1057026 48890424 MOVQ AX, 0(SP)
pointer_value.go:12 0x105702a 48c74424080a000000 MOVQ $0xa, 0x8(SP)
pointer_value.go:12 0x1057033 48c74424100a000000 MOVQ $0xa, 0x10(SP)
pointer_value.go:12 0x105703c e81fffffff CALL main.AddToSliceByValue(SB)
pointer_value.go:13 0x1057041 488b6c2468 MOVQ 0x68(SP), BP
pointer_value.go:13 0x1057046 4883c470 ADDQ $0x70, SP
pointer_value.go:13 0x105704a c3 RET
pointer_value.go:9 0x105704b e8909cffff CALL runtime.morestack_noctxt(SB)
pointer_value.go:9 0x1057050 e93bffffff JMP main.main(SB)
Assembly code for pass by pointer:
TEXT main.AddToSliceByPointer(SB) /go_test/pointer/pointer_ref.go
pointer_ref.go:3 0x1056f60 4883ec18 SUBQ $0x18, SP
pointer_ref.go:3 0x1056f64 48896c2410 MOVQ BP, 0x10(SP)
pointer_ref.go:3 0x1056f69 488d6c2410 LEAQ 0x10(SP), BP
pointer_ref.go:4 0x1056f6e 488b542420 MOVQ 0x20(SP), DX
pointer_ref.go:4 0x1056f73 488b5a08 MOVQ 0x8(DX), BX
pointer_ref.go:4 0x1056f77 31c0 XORL AX, AX
pointer_ref.go:4 0x1056f79 eb0e JMP 0x1056f89
pointer_ref.go:5 0x1056f7b 488b3cc6 MOVQ 0(SI)(AX*8), DI
pointer_ref.go:5 0x1056f7f 48ffc7 INCQ DI
pointer_ref.go:5 0x1056f82 48893cc6 MOVQ DI, 0(SI)(AX*8)
pointer_ref.go:4 0x1056f86 48ffc0 INCQ AX
pointer_ref.go:4 0x1056f89 4839d8 CMPQ BX, AX
pointer_ref.go:4 0x1056f8c 7d0e JGE 0x1056f9c
pointer_ref.go:5 0x1056f8e 488b4a08 MOVQ 0x8(DX), CX
pointer_ref.go:5 0x1056f92 488b32 MOVQ 0(DX), SI
pointer_ref.go:5 0x1056f95 4839c8 CMPQ CX, AX
pointer_ref.go:5 0x1056f98 72e1 JB 0x1056f7b
pointer_ref.go:5 0x1056f9a eb0a JMP 0x1056fa6
pointer_ref.go:4 0x1056f9c 488b6c2410 MOVQ 0x10(SP), BP
pointer_ref.go:4 0x1056fa1 4883c418 ADDQ $0x18, SP
pointer_ref.go:4 0x1056fa5 c3 RET
pointer_ref.go:5 0x1056fa6 e8b5c4ffff CALL runtime.panicIndex(SB)
pointer_ref.go:5 0x1056fab 90 NOPL
:-1 0x1056fac cc INT $0x3
:-1 0x1056fad cc INT $0x3
:-1 0x1056fae cc INT $0x3
:-1 0x1056faf cc INT $0x3
TEXT main.main(SB) /go_test/pointer/pointer_ref.go
pointer_ref.go:9 0x1056fb0 65488b0c2530000000 MOVQ GS:0x30, CX
pointer_ref.go:9 0x1056fb9 483b6110 CMPQ 0x10(CX), SP
pointer_ref.go:9 0x1056fbd 0f86b2000000 JBE 0x1057075
pointer_ref.go:9 0x1056fc3 4883ec78 SUBQ $0x78, SP
pointer_ref.go:9 0x1056fc7 48896c2470 MOVQ BP, 0x70(SP)
pointer_ref.go:9 0x1056fcc 488d6c2470 LEAQ 0x70(SP), BP
pointer_ref.go:11 0x1056fd1 488d7c2408 LEAQ 0x8(SP), DI
pointer_ref.go:11 0x1056fd6 0f57c0 XORPS X0, X0
pointer_ref.go:11 0x1056fd9 488d7fd0 LEAQ -0x30(DI), DI
pointer_ref.go:11 0x1056fdd 48896c24f0 MOVQ BP, -0x10(SP)
pointer_ref.go:11 0x1056fe2 488d6c24f0 LEAQ -0x10(SP), BP
pointer_ref.go:11 0x1056fe7 e829c6ffff CALL 0x1053615
pointer_ref.go:11 0x1056fec 488b6d00 MOVQ 0(BP), BP
pointer_ref.go:11 0x1056ff0 48c744241001000000 MOVQ $0x1, 0x10(SP)
pointer_ref.go:11 0x1056ff9 48c744241802000000 MOVQ $0x2, 0x18(SP)
pointer_ref.go:11 0x1057002 48c744242003000000 MOVQ $0x3, 0x20(SP)
pointer_ref.go:11 0x105700b 48c744242804000000 MOVQ $0x4, 0x28(SP)
pointer_ref.go:11 0x1057014 48c744243005000000 MOVQ $0x5, 0x30(SP)
pointer_ref.go:11 0x105701d 48c744243806000000 MOVQ $0x6, 0x38(SP)
pointer_ref.go:11 0x1057026 48c744244007000000 MOVQ $0x7, 0x40(SP)
pointer_ref.go:11 0x105702f 48c744244808000000 MOVQ $0x8, 0x48(SP)
pointer_ref.go:11 0x1057038 48c744245009000000 MOVQ $0x9, 0x50(SP)
pointer_ref.go:11 0x1057041 488d442408 LEAQ 0x8(SP), AX
pointer_ref.go:11 0x1057046 4889442458 MOVQ AX, 0x58(SP)
pointer_ref.go:11 0x105704b 48c74424600a000000 MOVQ $0xa, 0x60(SP)
pointer_ref.go:11 0x1057054 48c74424680a000000 MOVQ $0xa, 0x68(SP)
pointer_ref.go:12 0x105705d 488d442458 LEAQ 0x58(SP), AX
pointer_ref.go:12 0x1057062 48890424 MOVQ AX, 0(SP)
pointer_ref.go:12 0x1057066 e8f5feffff CALL main.AddToSliceByPointer(SB)
pointer_ref.go:13 0x105706b 488b6c2470 MOVQ 0x70(SP), BP
pointer_ref.go:13 0x1057070 4883c478 ADDQ $0x78, SP
pointer_ref.go:13 0x1057074 c3 RET
pointer_ref.go:9 0x1057075 e8669cffff CALL runtime.morestack_noctxt(SB)
pointer_ref.go:9 0x105707a e931ffffff JMP main.main(SB)
I could not reproduce your benchmark...
package main_test
import (
"math/rand"
"testing"
)
func AddToSliceByValue(mySlice []int) {
for idx := range mySlice {
mySlice[idx]++
}
}
func AddToSliceByPointer(mySlice *[]int) {
for idx := range *mySlice {
(*mySlice)[idx]++
}
}
func BenchmarkAddByValue(b *testing.B) {
mySlice := rand.Perm(1000)
for n := 0; n < b.N; n++ {
AddToSliceByValue(mySlice)
}
}
func BenchmarkAddByPointer(b *testing.B) {
mySlice := rand.Perm(1000)
for n := 0; n < b.N; n++ {
AddToSliceByPointer(&mySlice)
}
}
$ go test -bench=. -benchmem -count=4
goos: linux
goarch: amd64
pkg: test/bencslice
BenchmarkAddByValue-4 3010280 385 ns/op 0 B/op 0 allocs/op
BenchmarkAddByValue-4 3118990 385 ns/op 0 B/op 0 allocs/op
BenchmarkAddByValue-4 3117450 384 ns/op 0 B/op 0 allocs/op
BenchmarkAddByValue-4 3109251 386 ns/op 0 B/op 0 allocs/op
BenchmarkAddByPointer-4 2012487 610 ns/op 0 B/op 0 allocs/op
BenchmarkAddByPointer-4 2009690 594 ns/op 0 B/op 0 allocs/op
BenchmarkAddByPointer-4 2009222 594 ns/op 0 B/op 0 allocs/op
BenchmarkAddByPointer-4 1850820 596 ns/op 0 B/op 0 allocs/op
PASS
ok test/bencslice 13.476s
$ go version
go version go1.15.2 linux/amd64
Anyway, the behavior might depend on many factors, first of all the version of the runtime. Understanding the internals is of little interest as long as you can test, reproduce, and monitor.
I found out that my variance was too high:
AddByValue-12 5.41µs ±15%
AddByPointer-12 5.30µs ± 4%
With go test -benchmem -count 5 -benchtime=1000000x -bench=. ./... I was able to reduce the variance in the test results and could confirm my first assumption that the results should be approximately equal:
AddByValue-12 5.04µs ± 1%
AddByPointer-12 5.17µs ± 1%
According to the comments the main reason for the high variance was that I did not reset the timer after the benchmark setup.
With the following code and a lower benchtime I also reduced the variance:
func BenchmarkAddByValue(b *testing.B) {
mySlice := rand.Perm(10000)
b.ResetTimer()
for n := 0; n < b.N; n++ {
AddToSliceByValue(mySlice)
}
}
func BenchmarkAddByPointer(b *testing.B) {
mySlice := rand.Perm(10000)
b.ResetTimer()
for n := 0; n < b.N; n++ {
AddToSliceByPointer(&mySlice)
}
}
Results:
AddByValue-12 5.03µs ± 0%
AddByPointer-12 5.17µs ± 1%
Thanks a lot for your help!
This is a general issue with lower-level languages: passing by value means the data itself is copied. Note that a Go slice value is only a small header (pointer, length, capacity), so passing a slice by value copies that header, not the elements; the steps below apply to large values such as arrays and structs. Here is how it works.
When you pass by reference:
A copy of the reference is created and passed to the function (a reference is most likely 8 bytes, so this is fast)
You read the data behind the reference (this is also fast, since the reference is most likely in the CPU cache)
When you pass by value:
A block of memory is allocated to store the data you passed in (slow)
The data is copied into the newly allocated block of memory (maybe fast, maybe slow)
Your data is then accessed via references (maybe slow, maybe fast, depending on whether the data landed in the cache)
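For the slice case in the question specifically, a small sketch can show what "by value" actually copies. It assumes nothing beyond the standard library; incrementAll is a hypothetical stand-in for AddToSliceByValue. The function receives a copy of the slice header, yet its writes are visible to the caller, because both headers point at the same backing array.

```go
package main

import (
	"fmt"
	"unsafe"
)

// incrementAll (hypothetical stand-in for AddToSliceByValue) receives the
// slice by value. Only the header is copied, so the element writes below
// go to the caller's backing array.
func incrementAll(s []int) {
	for i := range s {
		s[i]++
	}
}

func main() {
	data := []int{1, 2, 3}
	incrementAll(data)
	fmt.Println(data) // the caller observes the increments

	// Size of the slice header itself, independent of len(data).
	// On typical 64-bit platforms this is three machine words.
	fmt.Println(unsafe.Sizeof(data))
}
```

Since the header has the same size no matter how many elements the slice holds, the cost of passing it by value does not grow with the slice.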
I have recently started the Go track on exercism.io and had fun optimizing the "nth-prime" calculation. Actually I came across a funny fact I can't explain. Imagine the following code:
// Package prime provides ...
package prime
// Nth function checks for the prime number on position n
func Nth(n int) (int, bool) {
if n <= 0 {
return 0, false
}
if n == 1 {
return 2, true
}
currentNumber := 1
primeCounter := 1
for n > primeCounter {
currentNumber += 2
if isPrime(currentNumber) {
primeCounter++
}
}
return currentNumber, primeCounter == n
}
// isPrime function checks if a number
// is a prime number
func isPrime(n int) bool {
//useless because never triggered but makes it faster??
if n < 2 {
println("n < 2")
return false
}
//useless because never triggered but makes it faster??
if n%2 == 0 {
println("n%2")
return n == 2
}
for i := 3; i*i <= n; i+=2 {
if n%i == 0 {
return false
}
}
return true
}
In the private function isPrime I have two initial if-statements that are never triggered, because I only pass in odd numbers greater than 2. The benchmark returns the following:
Running tool: /usr/bin/go test -benchmem -run=^$ -bench ^(BenchmarkNth)$
BenchmarkNth-8 100 18114825 ns/op 0 B/op 0
If I remove the never-triggered if-statements, the benchmark runs slower:
Running tool: /usr/bin/go test -benchmem -run=^$ -bench ^(BenchmarkNth)$
BenchmarkNth-8 50 21880749 ns/op 0 B/op 0
I have run the benchmark multiple times changing the code back and forth always getting more or less the same numbers and I can't think of a reason why these two if-statements should make the execution faster. Yes it is micro-optimization, but I want to know: Why?
Here is the whole exercise from exercism with test-cases: nth-prime
The Go version I am using is 1.12.1 linux/amd64 on a Manjaro i3 Linux
When those ifs are present, the compiler has extra guarantees about the input.
Without them, the compiler has to add a check itself, and it performs that check on every iteration. We can look at the assembly code to prove it (by passing -gcflags=-S to the go test command).
With the if's:
0x004b 00075 (func.go:16) JMP 81
0x004d 00077 (func.go:16) LEAQ 2(BX), AX
0x0051 00081 (func.go:16) MOVQ AX, DX
0x0054 00084 (func.go:16) IMULQ AX, AX
0x0058 00088 (func.go:16) CMPQ AX, CX
0x005b 00091 (func.go:16) JGT 133
0x005d 00093 (func.go:17) TESTQ DX, DX
0x0060 00096 (func.go:17) JEQ 257
0x0066 00102 (func.go:17) MOVQ CX, AX
0x0069 00105 (func.go:17) MOVQ DX, BX
0x006c 00108 (func.go:17) CQO
0x006e 00110 (func.go:17) IDIVQ BX
0x0071 00113 (func.go:17) TESTQ DX, DX
0x0074 00116 (func.go:17) JNE 77
Without the if's:
0x0016 00022 (func.go:16) JMP 28
0x0018 00024 (func.go:16) LEAQ 2(BX), AX
0x001c 00028 (func.go:16) MOVQ AX, DX
0x001f 00031 (func.go:16) IMULQ AX, AX
0x0023 00035 (func.go:16) CMPQ AX, CX
0x0026 00038 (func.go:16) JGT 88
0x0028 00040 (func.go:17) TESTQ DX, DX
0x002b 00043 (func.go:17) JEQ 102
0x002d 00045 (func.go:17) MOVQ CX, AX
0x0030 00048 (func.go:17) MOVQ DX, BX
0x0033 00051 (func.go:17) CMPQ BX, $-1
0x0037 00055 (func.go:17) JEQ 64
0x0039 00057 (func.go:17) CQO
0x003b 00059 (func.go:17) IDIVQ BX
0x003e 00062 (func.go:17) JMP 69
0x0040 00064 (func.go:17) NEGQ AX
0x0043 00067 (func.go:17) XORL DX, DX
0x0045 00069 (func.go:17) TESTQ DX, DX
0x0048 00072 (func.go:17) JNE 24
The instruction at offset 51 in the assembly code, 0x0033 00051 (func.go:17) CMPQ BX, $-1, is the culprit: it tests whether the divisor is -1 before dividing.
Line 16, for i := 3; i*i <= n; i+=2, in the original Go code, is translated the same way in both cases. But line 17, if n%i == 0, which runs on every iteration, compiles to more instructions and therefore more work for the CPU in total.
Something similar was done in the encoding/base64 package, by ensuring the loop won't receive a nil value. You can take a look here:
https://go-review.googlesource.com/c/go/+/151158/3/src/encoding/base64/base64.go
This check was added intentionally. In your case, you optimized it accidentally :)
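If you would rather state the guarantee on purpose than rely on dead branches, one option to try is unsigned operands: unsigned division has no -1 divisor, so that particular special case does not exist. The sketch below is an assumption about what might help, not a measured result; isPrimeUint is a hypothetical variant of isPrime, and whether it is faster depends on the compiler version, so benchmark it.

```go
package main

import "fmt"

// isPrimeUint is a hypothetical unsigned variant of isPrime. With uint
// operands the divisor can never be -1, so the signed-division special
// case seen in the listing has no counterpart here.
func isPrimeUint(n uint) bool {
	if n < 2 {
		return false
	}
	if n%2 == 0 {
		return n == 2
	}
	for i := uint(3); i*i <= n; i += 2 {
		if n%i == 0 {
			return false
		}
	}
	return true
}

func main() {
	for _, n := range []uint{1, 2, 9, 97} {
		fmt.Println(n, isPrimeUint(n))
	}
}
```

Here the two initial checks are real rather than dead code, since the function no longer assumes its caller only passes odd numbers.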
I noticed a 3x speed factor for the two following increment methods for map[int]int variables:
fast: myMap[key]++
slow: myMap[key]=myMap[key]+1
This probably isn't surprising because, at least naively, in the second case I'm directing Go to access myMap twice. I'm just curious: Can anyone familiar with the Go compiler help me understand the difference between these operations on maps? And with knowledge of how the compiler works, is there a faster trick to increment maps?
Edit: running locally, the difference is less pronounced but still present:
package main
import (
"fmt"
"math"
"time"
)
func main() {
x, y := make(map[int]int), make(map[int]int)
x[0], y[0] = 0, 0
steps := int(math.Pow(10, 9))
start1 := time.Now()
for i := 0; i < steps; i++ {
x[0]++
}
elapsed1 := time.Since(start1)
fmt.Println("++ took", elapsed1)
start2 := time.Now()
for i := 0; i < steps; i++ {
y[0] = y[0] + 1
}
elapsed2 := time.Since(start2)
fmt.Println("y=y+1 took", elapsed2)
}
Output:
++ took 8.1739809s
y=y+1 took 17.9079386s
Edit2: As suggested I dumped the machine code. Here are the relevant snippets
For x[0]++
0x4981e3 488d05b6830100 LEAQ runtime.types+95648(SB), AX
0x4981ea 48890424 MOVQ AX, 0(SP)
0x4981ee 488d8c2400020000 LEAQ 0x200(SP), CX
0x4981f6 48894c2408 MOVQ CX, 0x8(SP)
0x4981fb 48c744241000000000 MOVQ $0x0, 0x10(SP)
0x498204 e8976df7ff CALL runtime.mapassign_fast64(SB)
0x498209 488b442418 MOVQ 0x18(SP), AX
0x49820e 48ff00 INCQ 0(AX)
For y[0] = y[0] + 1
0x498302 488d0597820100 LEAQ runtime.types+95648(SB), AX
0x498309 48890424 MOVQ AX, 0(SP)
0x49830d 488d8c24d0010000 LEAQ 0x1d0(SP), CX
0x498315 48894c2408 MOVQ CX, 0x8(SP)
0x49831a 48c744241000000000 MOVQ $0x0, 0x10(SP)
0x498323 e80869f7ff CALL runtime.mapaccess1_fast64(SB)
0x498328 488b442418 MOVQ 0x18(SP), AX
0x49832d 488b00 MOVQ 0(AX), AX
0x498330 4889442448 MOVQ AX, 0x48(SP)
0x498335 488d0d64820100 LEAQ runtime.types+95648(SB), CX
0x49833c 48890c24 MOVQ CX, 0(SP)
0x498340 488d9424d0010000 LEAQ 0x1d0(SP), DX
0x498348 4889542408 MOVQ DX, 0x8(SP)
0x49834d 48c744241000000000 MOVQ $0x0, 0x10(SP)
0x498356 e8456cf7ff CALL runtime.mapassign_fast64(SB)
0x49835b 488b442418 MOVQ 0x18(SP), AX
0x498360 488b4c2448 MOVQ 0x48(SP), CX
0x498365 48ffc1 INCQ CX
0x498368 488908 MOVQ CX, 0(AX)
Oddly enough, ++ doesn't even call map access! ++ is clearly a simpler operation, by a factor of 2 or 3. My ability to parse machine code ends there, so if anyone has insight into what's going on, I'd love to hear it.
The Go gc compiler is an optimizing compiler. It is continuously being improved. For example, for Go 1.11,
Go Issue: cmd/compile: We can avoid extra mapaccess in "m[k] op= r" #23661
Go commit: 7395083136539331537d46875ab9d196797a2173
cmd/compile: avoid extra mapaccess in "m[k] op= r"
Currently, order desugars map assignment operations like
m[k] op= r
into
m[k] = m[k] op r
which in turn is transformed during walk into:
tmp := *mapaccess(m, k)
tmp = tmp op r
*mapassign(m, k) = tmp
However, this is suboptimal, as we could instead produce just:
*mapassign(m, k) op= r
One complication though is if "r == 0", then "m[k] /= r" and "m[k] %=
r" will panic, and they need to do so *before* calling mapassign,
otherwise we may insert a new zero-value element into the map.
It would be spec compliant to just emit the "r != 0" check before
calling mapassign (see #23735), but currently these checks aren't
generated until SSA construction. For now, it's simpler to continue
desugaring /= and %= into two map indexing operations.
Fixes #23661.
Results for your code:
go1.10:
++ took 10.258130907s
y=y+1 took 10.233823639s
go1.11:
++ took 7.995184419s
y=y+1 took 10.259916484s
The general answer to your question is to be simple, explicit, and obvious in your code. The compiler then has an easier task to recognize a common optimizable pattern.
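To check how your own toolchain treats the two spellings, a small comparison with testing.Benchmark is enough. This is a sketch under the assumption that a single hot key is representative, as in the question; incrementCompound and incrementExplicit are hypothetical helper names.

```go
package main

import (
	"fmt"
	"testing"
)

// incrementCompound uses the compound form m[k]++.
func incrementCompound(m map[int]int, key, steps int) {
	for i := 0; i < steps; i++ {
		m[key]++
	}
}

// incrementExplicit spells out the read and the write separately.
func incrementExplicit(m map[int]int, key, steps int) {
	for i := 0; i < steps; i++ {
		m[key] = m[key] + 1
	}
}

func main() {
	compound := testing.Benchmark(func(b *testing.B) {
		incrementCompound(make(map[int]int), 0, b.N)
	})
	explicit := testing.Benchmark(func(b *testing.B) {
		incrementExplicit(make(map[int]int), 0, b.N)
	})
	fmt.Println("m[k]++          ", compound)
	fmt.Println("m[k] = m[k] + 1 ", explicit)
}
```

Both functions leave the map in the same state; only the generated code may differ, and the gap can shrink or vanish between Go releases.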
I'm very interested in Go and am trying to read the implementations of its functions. I found that some of these functions don't have implementations there.
Such as append or call:
// The append built-in function appends elements to the end of a slice. If
// it has sufficient capacity, the destination is resliced to accommodate the
// new elements. If it does not, a new underlying array will be allocated.
// Append returns the updated slice. It is therefore necessary to store the
// result of append, often in the variable holding the slice itself:
// slice = append(slice, elem1, elem2)
// slice = append(slice, anotherSlice...)
// As a special case, it is legal to append a string to a byte slice, like this:
// slice = append([]byte("hello "), "world"...)
func append(slice []Type, elems ...Type) []Type
// call calls fn with a copy of the n argument bytes pointed at by arg.
// After fn returns, reflectcall copies n-retoffset result bytes
// back into arg+retoffset before returning. If copying result bytes back,
// the caller must pass the argument frame type as argtype, so that
// call can execute appropriate write barriers during the copy.
func call(argtype *rtype, fn, arg unsafe.Pointer, n uint32, retoffset uint32)
They don't seem to call C code, because using cgo needs some special comments.
Where are these functions' implementations?
The code you are reading and citing is just dummy code to have consistent documentation. The built-in functions are, well, built into the language and, as such, are included in the code processing step (the compiler).
Simplified, what happens is this: the lexer detects 'append(...)' as an APPEND token; the parser translates APPEND, depending on the circumstances, parameters, and environment, to code; the code is written as assembly and assembled. The middle step, the implementation of append, can be found in the compiler here.
What happens to an append call is best seen when looking at the assembly of an example program. Consider this:
b := []byte{'a'}
b = append(b, 'b')
println(string(b), cap(b))
Running it will yield the following output:
ab 2
The append call is translated to assembly like this:
// create new slice object
MOVQ BX, "".b+120(SP) // BX contains data addr., write to b.addr
MOVQ BX, CX // store addr. in CX
MOVQ AX, "".b+128(SP) // AX contains len(b) == 1, write to b.len
MOVQ DI, "".b+136(SP) // DI contains cap(b) == 1, write to b.cap
MOVQ AX, BX // BX now contains len(b)
INCQ BX // BX++
CMPQ BX, DI // compare new length (2) with cap (1)
JHI $1, 225 // jump to grow code if len > cap
...
LEAQ (CX)(AX*1), BX // load address of newly allocated slice entry
MOVB $98, (BX) // write 'b' to loaded address
// grow code, call runtime.growslice(t *slicetype, old slice, cap int)
LEAQ type.[]uint8(SB), BP
MOVQ BP, (SP) // load parameters onto stack
MOVQ CX, 8(SP)
MOVQ AX, 16(SP)
MOVQ SI, 24(SP)
MOVQ BX, 32(SP)
PCDATA $0, $0
CALL runtime.growslice(SB) // call
MOVQ 40(SP), DI
MOVQ 48(SP), R8
MOVQ 56(SP), SI
MOVQ R8, AX
INCQ R8
MOVQ DI, CX
JMP 108 // jump back, growing done
As you can see, there is no CALL to a function named append. This is the full implementation of the append call in the example code. Another call with different parameters will look different (other registers, different parameters depending on the slice type, etc.).
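The two paths in that listing (write in place when the new length fits, otherwise call runtime.growslice) can be observed from Go code. A minimal sketch, where sameBackingArray is a hypothetical helper that compares the address of the first element:

```go
package main

import "fmt"

// sameBackingArray (hypothetical helper) reports whether two non-empty
// slices start at the same element of the same backing array.
func sameBackingArray(a, b []int) bool {
	return &a[0] == &b[0]
}

func main() {
	// Spare capacity: append takes the in-place path.
	roomy := make([]int, 1, 4)
	grown := append(roomy, 2)
	fmt.Println(sameBackingArray(roomy, grown)) // true

	// len == cap: append has to grow, so the data moves to a new array.
	full := make([]int, 2, 2)
	moved := append(full, 3)
	fmt.Println(sameBackingArray(full, moved)) // false
}
```

This is also why the result of append must be stored: after a grow, the old slice still points at the old array.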
The Go append builtin function code is generated by the Go gc and gccgo compilers and uses Go package runtime functions (for example, runtime.growslice()) in go/src/runtime/slice.go.
For example,
package main
func main() {
b := []int{0, 1}
b = append(b, 2)
}
Go pseudo-assembler:
$ go tool compile -S a.go
"".main t=1 size=192 value=0 args=0x0 locals=0x68
0x0000 00000 (a.go:3) TEXT "".main(SB), $104-0
0x0000 00000 (a.go:3) MOVQ (TLS), CX
0x0009 00009 (a.go:3) CMPQ SP, 16(CX)
0x000d 00013 (a.go:3) JLS 167
0x0013 00019 (a.go:3) SUBQ $104, SP
0x0017 00023 (a.go:3) FUNCDATA $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0017 00023 (a.go:3) FUNCDATA $1, gclocals·790e5cc5051fc0affc980ade09e929ec(SB)
0x0017 00023 (a.go:4) LEAQ "".autotmp_0002+64(SP), BX
0x001c 00028 (a.go:4) MOVQ BX, CX
0x001f 00031 (a.go:4) NOP
0x001f 00031 (a.go:4) MOVQ "".statictmp_0000(SB), BP
0x0026 00038 (a.go:4) MOVQ BP, (BX)
0x0029 00041 (a.go:4) MOVQ "".statictmp_0000+8(SB), BP
0x0030 00048 (a.go:4) MOVQ BP, 8(BX)
0x0034 00052 (a.go:4) NOP
0x0034 00052 (a.go:4) MOVQ $2, AX
0x003b 00059 (a.go:4) MOVQ $2, DX
0x0042 00066 (a.go:5) MOVQ CX, "".b+80(SP)
0x0047 00071 (a.go:5) MOVQ AX, "".b+88(SP)
0x004c 00076 (a.go:5) MOVQ DX, "".b+96(SP)
0x0051 00081 (a.go:5) MOVQ AX, BX
0x0054 00084 (a.go:5) INCQ BX
0x0057 00087 (a.go:5) CMPQ BX, DX
0x005a 00090 (a.go:5) JHI $1, 108
0x005c 00092 (a.go:5) LEAQ (CX)(AX*8), BX
0x0060 00096 (a.go:5) MOVQ $2, (BX)
0x0067 00103 (a.go:6) ADDQ $104, SP
0x006b 00107 (a.go:6) RET
0x006c 00108 (a.go:5) LEAQ type.[]int(SB), BP
0x0073 00115 (a.go:5) MOVQ BP, (SP)
0x0077 00119 (a.go:5) MOVQ CX, 8(SP)
0x007c 00124 (a.go:5) MOVQ AX, 16(SP)
0x0081 00129 (a.go:5) MOVQ DX, 24(SP)
0x0086 00134 (a.go:5) MOVQ BX, 32(SP)
0x008b 00139 (a.go:5) PCDATA $0, $0
0x008b 00139 (a.go:5) CALL runtime.growslice(SB)
0x0090 00144 (a.go:5) MOVQ 40(SP), CX
0x0095 00149 (a.go:5) MOVQ 48(SP), AX
0x009a 00154 (a.go:5) MOVQ 56(SP), DX
0x009f 00159 (a.go:5) MOVQ AX, BX
0x00a2 00162 (a.go:5) INCQ BX
0x00a5 00165 (a.go:5) JMP 92
0x00a7 00167 (a.go:3) CALL runtime.morestack_noctxt(SB)
0x00ac 00172 (a.go:3) JMP 0
To add to the assembly code given by the others, you can find the Go (1.5.1) code for gc here: https://github.com/golang/go/blob/f2e4c8b5fb3660d793b2c545ef207153db0a34b1/src/cmd/compile/internal/gc/walk.go#L2895
// expand append(l1, l2...) to
// init {
// s := l1
// if n := len(l1) + len(l2) - cap(s); n > 0 {
// s = growslice_n(s, n)
// }
// s = s[:len(l1)+len(l2)]
// memmove(&s[len(l1)], &l2[0], len(l2)*sizeof(T))
// }
// s
//
// l2 is allowed to be a string.
with growslice_n being defined here: https://github.com/golang/go/blob/f2e4c8b5fb3660d793b2c545ef207153db0a34b1/src/runtime/slice.go#L36
// growslice_n is a variant of growslice that takes the number of new elements
// instead of the new minimum capacity.
// TODO(rsc): This is used by append(slice, slice...).
// The compiler should change that code to use growslice directly (issue #11419).
func growslice_n(t *slicetype, old slice, n int) slice {
if n < 1 {
panic(errorString("growslice: invalid n"))
}
return growslice(t, old, old.cap+n)
}
// growslice handles slice growth during append.
// It is passed the slice type, the old slice, and the desired new minimum capacity,
// and it returns a new slice with at least that capacity, with the old data
// copied into it.
func growslice(t *slicetype, old slice, cap int) slice {
if cap < old.cap || t.elem.size > 0 && uintptr(cap) > _MaxMem/uintptr(t.elem.size) {
panic(errorString("growslice: cap out of range"))
}
if raceenabled {
callerpc := getcallerpc(unsafe.Pointer(&t))
racereadrangepc(old.array, uintptr(old.len*int(t.elem.size)), callerpc, funcPC(growslice))
}
et := t.elem
if et.size == 0 {
// append should not create a slice with nil pointer but non-zero len.
// We assume that append doesn't need to preserve old.array in this case.
return slice{unsafe.Pointer(&zerobase), old.len, cap}
}
newcap := old.cap
if newcap+newcap < cap {
newcap = cap
} else {
for {
if old.len < 1024 {
newcap += newcap
} else {
newcap += newcap / 4
}
if newcap >= cap {
break
}
}
}
if uintptr(newcap) >= _MaxMem/uintptr(et.size) {
panic(errorString("growslice: cap out of range"))
}
lenmem := uintptr(old.len) * uintptr(et.size)
capmem := roundupsize(uintptr(newcap) * uintptr(et.size))
newcap = int(capmem / uintptr(et.size))
var p unsafe.Pointer
if et.kind&kindNoPointers != 0 {
p = rawmem(capmem)
memmove(p, old.array, lenmem)
memclr(add(p, lenmem), capmem-lenmem)
} else {
// Note: can't use rawmem (which avoids zeroing of memory), because then GC can scan uninitialized memory.
p = newarray(et, uintptr(newcap))
if !writeBarrierEnabled {
memmove(p, old.array, lenmem)
} else {
for i := uintptr(0); i < lenmem; i += et.size {
typedmemmove(et, add(p, i), add(old.array, i))
}
}
}
return slice{p, old.len, newcap}
}
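The capacity choice in that listing can be pulled out into a small pure function to make the policy easier to see. This is a sketch that mirrors only the Go 1.5.1 code quoted above, before the size-class rounding done by roundupsize; nextCap is a hypothetical name, and later Go releases use a different policy.

```go
package main

import "fmt"

// nextCap (hypothetical) mirrors the capacity choice of the quoted Go 1.5.1
// growslice, ignoring the later rounding to an allocator size class.
// want is the desired minimum capacity.
func nextCap(oldLen, oldCap, want int) int {
	if want <= oldCap {
		return oldCap // nothing to grow; growslice is not reached in this case
	}
	newcap := oldCap
	if newcap+newcap < want {
		return want // doubling is not enough: use the request as is
	}
	for {
		if oldLen < 1024 {
			newcap += newcap // small slices double
		} else {
			newcap += newcap / 4 // large slices grow by a quarter
		}
		if newcap >= want {
			return newcap
		}
	}
}

func main() {
	fmt.Println(nextCap(2, 2, 3))          // 4
	fmt.Println(nextCap(4, 4, 100))        // 100
	fmt.Println(nextCap(2048, 2048, 2049)) // 2560
}
```

The geometric growth is what keeps repeated single-element appends amortized constant time, at the price of some unused capacity.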
I am currently playing around with Go, its assembly, the performance of floating-point operations (float32), and optimizations at the nanosecond scale. I was a bit confused by the overhead of a simple function call:
func BenchmarkEmpty(b *testing.B) {
for i := 0; i < b.N; i++ {
}
}
func BenchmarkNop(b *testing.B) {
for i := 0; i < b.N; i++ {
doNop()
}
}
The implementation of doNop:
TEXT ·doNop(SB),0,$0-0
RET
The result (go test -bench .):
BenchmarkEmpty 2000000000 0.30 ns/op
BenchmarkNop 2000000000 1.73 ns/op
I'm not used to assembly or the internals of Go. Is it possible for the Go compiler/linker to inline a function defined in assembly? Can I give the linker a hint somehow? For some simple functions like 'add two R3-vectors' this eats up all possible performance gain.
(go 1.4.2, amd64)
Assembly functions are not inlined. Here are 3 things you could try:
1. Move your loop into assembly. For example, with this function:
func Sum(xs []int64) int64
You can do this:
#include "textflag.h"
// argument size is 32: 24 bytes of slice header plus the 8-byte result
TEXT ·Sum(SB),NOSPLIT,$0-32
MOVQ xs_base+0(FP),DI // DI = &xs[0]
MOVQ xs_len+8(FP),SI // SI = len(xs)
MOVQ $0,CX // CX = running sum
MOVQ $0,AX // AX = i
L1: CMPQ AX,SI // i < len(xs)
JGE Z1
LEAQ (DI)(AX*8),BX // BX = &xs[i]
MOVQ (BX),BX // BX = *BX
ADDQ BX,CX // CX += BX
INCQ AX // i++
JMP L1
Z1: MOVQ CX,ret+24(FP)
RET
If you look in the standard library you will see examples of this.
2. Write some of your code in C, leverage the support it has for intrinsics or inline assembly, and use cgo to call it from Go.
3. Use gccgo to do the same thing as #2, except you can do it directly:
//extern open
func c_open(name *byte, mode int, perm int) int
https://golang.org/doc/install/gccgo#Function_names
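Whichever route you take, it helps to keep a pure-Go version of the routine around as a reference to test the assembly against and to benchmark it against. A minimal sketch for the Sum example above; unlike an assembly function, a small Go function like this may be eligible for inlining, depending on the compiler version.

```go
package main

import "fmt"

// Sum is a pure-Go counterpart of the assembly routine: it adds up the
// elements of xs and returns the total.
func Sum(xs []int64) int64 {
	var total int64
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	fmt.Println(Sum([]int64{1, 2, 3, 4})) // 10
	fmt.Println(Sum(nil))                 // 0
}
```

If the Go version already benchmarks close to the assembly one, the call overhead measured in the question is probably not worth working around.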