I noticed a 3x speed factor for the two following increment methods for map[int]int variables:
fast: myMap[key]++
slow: myMap[key]=myMap[key]+1
This probably isn't surprising because, at least naively, in the second case I'm directing Go to access myMap twice. I'm just curious: Can anyone familiar with the Go compiler help me understand the difference between these operations on maps? And with knowledge of how the compiler works, is there a faster trick to increment maps?
edit: running locally the difference is less pronounced, but still present:
package main
import (
"fmt"
"math"
"time"
)
func main() {
x, y := make(map[int]int), make(map[int]int)
x[0], y[0] = 0, 0
steps := int(math.Pow(10, 9))
start1 := time.Now()
for i := 0; i < steps; i++ {
x[0]++
}
elapsed1 := time.Since(start1)
fmt.Println("++ took", elapsed1)
start2 := time.Now()
for i := 0; i < steps; i++ {
y[0] = y[0] + 1
}
elapsed2 := time.Since(start2)
fmt.Println("y=y+1 took", elapsed2)
}
Output:
++ took 8.1739809s
y=y+1 took 17.9079386s
Edit2: As suggested I dumped the machine code. Here are the relevant snippets
For x[0]++
0x4981e3 488d05b6830100 LEAQ runtime.types+95648(SB), AX
0x4981ea 48890424 MOVQ AX, 0(SP)
0x4981ee 488d8c2400020000 LEAQ 0x200(SP), CX
0x4981f6 48894c2408 MOVQ CX, 0x8(SP)
0x4981fb 48c744241000000000 MOVQ $0x0, 0x10(SP)
0x498204 e8976df7ff CALL runtime.mapassign_fast64(SB)
0x498209 488b442418 MOVQ 0x18(SP), AX
0x49820e 48ff00 INCQ 0(AX)
For y[0] = y[0] + 1
0x498302 488d0597820100 LEAQ runtime.types+95648(SB), AX
0x498309 48890424 MOVQ AX, 0(SP)
0x49830d 488d8c24d0010000 LEAQ 0x1d0(SP), CX
0x498315 48894c2408 MOVQ CX, 0x8(SP)
0x49831a 48c744241000000000 MOVQ $0x0, 0x10(SP)
0x498323 e80869f7ff CALL runtime.mapaccess1_fast64(SB)
0x498328 488b442418 MOVQ 0x18(SP), AX
0x49832d 488b00 MOVQ 0(AX), AX
0x498330 4889442448 MOVQ AX, 0x48(SP)
0x498335 488d0d64820100 LEAQ runtime.types+95648(SB), CX
0x49833c 48890c24 MOVQ CX, 0(SP)
0x498340 488d9424d0010000 LEAQ 0x1d0(SP), DX
0x498348 4889542408 MOVQ DX, 0x8(SP)
0x49834d 48c744241000000000 MOVQ $0x0, 0x10(SP)
0x498356 e8456cf7ff CALL runtime.mapassign_fast64(SB)
0x49835b 488b442418 MOVQ 0x18(SP), AX
0x498360 488b4c2448 MOVQ 0x48(SP), CX
0x498365 48ffc1 INCQ CX
0x498368 488908 MOVQ CX, 0(AX)
Oddly enough, ++ doesn't even call map access! ++ is clearly a simpler operation by an order of 2 or 3. My ability to parse machine ends there, so if anyone has insight into what's going on, I'd love to hear it
The Go gc compiler is an optimizing compiler. It is continuosly being improved. For example, for Go1.11,
Go Issue: cmd/compile: We can avoid extra mapaccess in "m[k] op= r" #23661
Go commit: 7395083136539331537d46875ab9d196797a2173
cmd/compile: avoid extra mapaccess in "m[k] op= r"
Currently, order desugars map assignment operations like
m[k] op= r
into
m[k] = m[k] op r
which in turn is transformed during walk into:
tmp := *mapaccess(m, k)
tmp = tmp op r
*mapassign(m, k) = tmp
However, this is suboptimal, as we could instead produce just:
*mapassign(m, k) op= r
One complication though is if "r == 0", then "m[k] /= r" and "m[k] %=
r" will panic, and they need to do so *before* calling mapassign,
otherwise we may insert a new zero-value element into the map.
It would be spec compliant to just emit the "r != 0" check before
calling mapassign (see #23735), but currently these checks aren't
generated until SSA construction. For now, it's simpler to continue
desugaring /= and %= into two map indexing operations.
Fixes #23661.
Results for your code:
go1.10:
++ took 10.258130907s
y=y+1 took 10.233823639s
go1.11:
++ took 7.995184419s
y=y+1 took 10.259916484s
The general answer to your question is to be simple, explicit, and obvious in your code. The compiler then has an easier task to recognize a common optimizable pattern.
Related
I read that Golang language manages memory in a smart way. Using escape analysis, go may not allocate memory when calling new, and vice versa. Can golang allocate memory with such a notation var bob * Person = & Person {2, 3}. Or always the pointer will point to the stack
The pointer may escape to the heap, or it may not, depends on your use case. The compiler is pretty smart. E.g. given:
type Person struct {
b, c int
}
func foo(b, c int) int {
bob := &Person{b, c}
return bob.b
}
The function foo will be compiled into:
TEXT "".foo(SB)
MOVQ "".b+8(SP), AX
MOVQ AX, "".~r2+24(SP)
RET
It's all on the stack here, because even though bob is a pointer, it doesn't escape this function's scope.
However, if we consider a slight (albeit artificial) modification:
var globalBob *Person
func foo(b, c int) int {
bob := &Person{b, c}
globalBob = bob
return bob.b
}
Then bob escapes, and foo will be compiled to:
TEXT "".foo(SB), ABIInternal, $24-24
MOVQ (TLS), CX
CMPQ SP, 16(CX)
PCDATA $0, $-2
JLS foo_pc115
PCDATA $0, $-1
SUBQ $24, SP
MOVQ BP, 16(SP)
LEAQ 16(SP), BP
LEAQ type."".Person(SB), AX
MOVQ AX, (SP)
PCDATA $1, $0
CALL runtime.newobject(SB)
MOVQ 8(SP), AX
MOVQ "".b+32(SP), CX
MOVQ CX, (AX)
MOVQ "".c+40(SP), CX
MOVQ CX, 8(AX)
PCDATA $0, $-2
CMPL runtime.writeBarrier(SB), $0
JNE foo_pc101
MOVQ AX, "".globalBob(SB)
foo_pc83:
PCDATA $0, $-1
MOVQ (AX), AX
MOVQ AX, "".~r2+48(SP)
MOVQ 16(SP), BP
ADDQ $24, SP
RET
Which, as you can see, invokes newobject.
These disassembly listings were generated by https://godbolt.org/, and are for go 1.16 on amd64
Whether memory is allocated on the stack or "escapes" to the heap is entirely dependent on how you use the memory, not on how you declare the variable.
If you return a pointer to a stack-allocated variable in, say, C, the value your pointer will be invalid by the time you attempt to use it. This isn't possible in Go, because you cannot explicitly tell Go where to place a variable. It does a very good job of choosing the correct place, and if it sees that references to a blob of memory may live beyond the stack frame, it will ensure that allocation happens on the heap instead.
Can golang allocate memory with such a notation
var bob * Person = & Person {2, 3}
Or always the pointer will point to the stack
That line of code cannot be said "always" point to the stack, but it might sometimes, so yes, it may allocate memory (on the heap).
Again, it's not about that line of code, it's about what comes after it. If value of bob is returned (the address of the Person object) then it cannot be allocated on the stack because the returned address would point to reclaimed memory.
Put simply, if the compiler can prove that the value can safely be created on the stack, it will (probably) be created on the stack. Otherwise, it will be allocated on the heap.
The tools that the compiler has to do these proofs are pretty good, but it doesn't get it right all the time. Most of the time, though, cost vs benefit of worrying about it is not really benefitial.
I'm very interested in go, and trying to read go function's implementations. I found some of these function doesn't have implementations there.
Such as append or call:
// The append built-in function appends elements to the end of a slice. If
// it has sufficient capacity, the destination is resliced to accommodate the
// new elements. If it does not, a new underlying array will be allocated.
// Append returns the updated slice. It is therefore necessary to store the
// result of append, often in the variable holding the slice itself:
// slice = append(slice, elem1, elem2)
// slice = append(slice, anotherSlice...)
// As a special case, it is legal to append a string to a byte slice, like this:
// slice = append([]byte("hello "), "world"...)
func append(slice []Type, elems ...Type) []Type
// call calls fn with a copy of the n argument bytes pointed at by arg.
// After fn returns, reflectcall copies n-retoffset result bytes
// back into arg+retoffset before returning. If copying result bytes back,
// the caller must pass the argument frame type as argtype, so that
// call can execute appropriate write barriers during the copy.
func call(argtype *rtype, fn, arg unsafe.Pointer, n uint32, retoffset uint32)
It seems not calling a C code, because using cgo needs some special comments.
Where is these function's implementations?
The code you are reading and citing is just dummy code to have consistent documentation. The built-in functions are, well, built into the language and, as such, are included in the code processing step (the compiler).
Simplified what happens is: lexer will detect 'append(...)' as APPEND token, parser will translate APPEND, depending on the circumstances/parameters/environment to code, code is written as assembly and assembled. The middle step - the implementation of append - can be found in the compiler here.
What happens to an append call is best seen when looking at the assembly of an example program. Consider this:
b := []byte{'a'}
b = append(b, 'b')
println(string(b), cap(b))
Running it will yield the following output:
ab 2
The append call is translated to assembly like this:
// create new slice object
MOVQ BX, "".b+120(SP) // BX contains data addr., write to b.addr
MOVQ BX, CX // store addr. in CX
MOVQ AX, "".b+128(SP) // AX contains len(b) == 1, write to b.len
MOVQ DI, "".b+136(SP) // DI contains cap(b) == 1, write to b.cap
MOVQ AX, BX // BX now contains len(b)
INCQ BX // BX++
CMPQ BX, DI // compare new length (2) with cap (1)
JHI $1, 225 // jump to grow code if len > cap
...
LEAQ (CX)(AX*1), BX // load address of newly allocated slice entry
MOVB $98, (BX) // write 'b' to loaded address
// grow code, call runtime.growslice(t *slicetype, old slice, cap int)
LEAQ type.[]uint8(SB), BP
MOVQ BP, (SP) // load parameters onto stack
MOVQ CX, 8(SP)
MOVQ AX, 16(SP)
MOVQ SI, 24(SP)
MOVQ BX, 32(SP)
PCDATA $0, $0
CALL runtime.growslice(SB) // call
MOVQ 40(SP), DI
MOVQ 48(SP), R8
MOVQ 56(SP), SI
MOVQ R8, AX
INCQ R8
MOVQ DI, CX
JMP 108 // jump back, growing done
As you can see, no CALL statement to a function called append can be seen. This is the full implementation of the append call in the example code. Another call with different parameters will look differently (other registers, different parameters depending on the slice type, etc.).
The Go append builtin function code is generated by the Go gc and gccgo compilers and uses Go package runtime functions (for example, runtime.growslice()) in go/src/runtime/slice.go.
For example,
package main
func main() {
b := []int{0, 1}
b = append(b, 2)
}
Go pseudo-assembler:
$ go tool compile -S a.go
"".main t=1 size=192 value=0 args=0x0 locals=0x68
0x0000 00000 (a.go:3) TEXT "".main(SB), $104-0
0x0000 00000 (a.go:3) MOVQ (TLS), CX
0x0009 00009 (a.go:3) CMPQ SP, 16(CX)
0x000d 00013 (a.go:3) JLS 167
0x0013 00019 (a.go:3) SUBQ $104, SP
0x0017 00023 (a.go:3) FUNCDATA $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0017 00023 (a.go:3) FUNCDATA $1, gclocals·790e5cc5051fc0affc980ade09e929ec(SB)
0x0017 00023 (a.go:4) LEAQ "".autotmp_0002+64(SP), BX
0x001c 00028 (a.go:4) MOVQ BX, CX
0x001f 00031 (a.go:4) NOP
0x001f 00031 (a.go:4) MOVQ "".statictmp_0000(SB), BP
0x0026 00038 (a.go:4) MOVQ BP, (BX)
0x0029 00041 (a.go:4) MOVQ "".statictmp_0000+8(SB), BP
0x0030 00048 (a.go:4) MOVQ BP, 8(BX)
0x0034 00052 (a.go:4) NOP
0x0034 00052 (a.go:4) MOVQ $2, AX
0x003b 00059 (a.go:4) MOVQ $2, DX
0x0042 00066 (a.go:5) MOVQ CX, "".b+80(SP)
0x0047 00071 (a.go:5) MOVQ AX, "".b+88(SP)
0x004c 00076 (a.go:5) MOVQ DX, "".b+96(SP)
0x0051 00081 (a.go:5) MOVQ AX, BX
0x0054 00084 (a.go:5) INCQ BX
0x0057 00087 (a.go:5) CMPQ BX, DX
0x005a 00090 (a.go:5) JHI $1, 108
0x005c 00092 (a.go:5) LEAQ (CX)(AX*8), BX
0x0060 00096 (a.go:5) MOVQ $2, (BX)
0x0067 00103 (a.go:6) ADDQ $104, SP
0x006b 00107 (a.go:6) RET
0x006c 00108 (a.go:5) LEAQ type.[]int(SB), BP
0x0073 00115 (a.go:5) MOVQ BP, (SP)
0x0077 00119 (a.go:5) MOVQ CX, 8(SP)
0x007c 00124 (a.go:5) MOVQ AX, 16(SP)
0x0081 00129 (a.go:5) MOVQ DX, 24(SP)
0x0086 00134 (a.go:5) MOVQ BX, 32(SP)
0x008b 00139 (a.go:5) PCDATA $0, $0
0x008b 00139 (a.go:5) CALL runtime.growslice(SB)
0x0090 00144 (a.go:5) MOVQ 40(SP), CX
0x0095 00149 (a.go:5) MOVQ 48(SP), AX
0x009a 00154 (a.go:5) MOVQ 56(SP), DX
0x009f 00159 (a.go:5) MOVQ AX, BX
0x00a2 00162 (a.go:5) INCQ BX
0x00a5 00165 (a.go:5) JMP 92
0x00a7 00167 (a.go:3) CALL runtime.morestack_noctxt(SB)
0x00ac 00172 (a.go:3) JMP 0
To add to the assembly code given by the others, you can find the Go (1.5.1) code for gc there : https://github.com/golang/go/blob/f2e4c8b5fb3660d793b2c545ef207153db0a34b1/src/cmd/compile/internal/gc/walk.go#L2895
// expand append(l1, l2...) to
// init {
// s := l1
// if n := len(l1) + len(l2) - cap(s); n > 0 {
// s = growslice_n(s, n)
// }
// s = s[:len(l1)+len(l2)]
// memmove(&s[len(l1)], &l2[0], len(l2)*sizeof(T))
// }
// s
//
// l2 is allowed to be a string.
with growslice_n being defined there : https://github.com/golang/go/blob/f2e4c8b5fb3660d793b2c545ef207153db0a34b1/src/runtime/slice.go#L36
// growslice_n is a variant of growslice that takes the number of new elements
// instead of the new minimum capacity.
// TODO(rsc): This is used by append(slice, slice...).
// The compiler should change that code to use growslice directly (issue #11419).
func growslice_n(t *slicetype, old slice, n int) slice {
if n < 1 {
panic(errorString("growslice: invalid n"))
}
return growslice(t, old, old.cap+n)
}
// growslice handles slice growth during append.
// It is passed the slice type, the old slice, and the desired new minimum capacity,
// and it returns a new slice with at least that capacity, with the old data
// copied into it.
func growslice(t *slicetype, old slice, cap int) slice {
if cap < old.cap || t.elem.size > 0 && uintptr(cap) > _MaxMem/uintptr(t.elem.size) {
panic(errorString("growslice: cap out of range"))
}
if raceenabled {
callerpc := getcallerpc(unsafe.Pointer(&t))
racereadrangepc(old.array, uintptr(old.len*int(t.elem.size)), callerpc, funcPC(growslice))
}
et := t.elem
if et.size == 0 {
// append should not create a slice with nil pointer but non-zero len.
// We assume that append doesn't need to preserve old.array in this case.
return slice{unsafe.Pointer(&zerobase), old.len, cap}
}
newcap := old.cap
if newcap+newcap < cap {
newcap = cap
} else {
for {
if old.len < 1024 {
newcap += newcap
} else {
newcap += newcap / 4
}
if newcap >= cap {
break
}
}
}
if uintptr(newcap) >= _MaxMem/uintptr(et.size) {
panic(errorString("growslice: cap out of range"))
}
lenmem := uintptr(old.len) * uintptr(et.size)
capmem := roundupsize(uintptr(newcap) * uintptr(et.size))
newcap = int(capmem / uintptr(et.size))
var p unsafe.Pointer
if et.kind&kindNoPointers != 0 {
p = rawmem(capmem)
memmove(p, old.array, lenmem)
memclr(add(p, lenmem), capmem-lenmem)
} else {
// Note: can't use rawmem (which avoids zeroing of memory), because then GC can scan uninitialized memory.
p = newarray(et, uintptr(newcap))
if !writeBarrierEnabled {
memmove(p, old.array, lenmem)
} else {
for i := uintptr(0); i < lenmem; i += et.size {
typedmemmove(et, add(p, i), add(old.array, i))
}
}
}
return slice{p, old.len, newcap}
}
I currently play around with go, it's assembly, performance of floating point operations (float32) and optimizations in the nano-seconds-scale. I was a bit confused by the overhead of a simple function call:
func BenchmarkEmpty(b *testing.B) {
for i := 0; i < b.N; i++ {
}
}
func BenchmarkNop(b *testing.B) {
for i := 0; i < b.N; i++ {
doNop()
}
}
The implementation of doNop:
TEXT ·doNop(SB),0,$0-0
RET
The result (go test -bench .):
BenchmarkEmpty 2000000000 0.30 ns/op
BenchmarkNop 2000000000 1.73 ns/op
Im not used to assembly and/ or the internals of go. It is possible fo the go compiler/ linker to inline a function defined in assembly? Can I give the linker a hint somehow? For some simple functions like 'add two R3-vectors' this eats up all possible performance gain.
(go 1.4.2, amd64)
Assembly functions are not inlined. Here are 3 things you could try:
Move your loop into assembly. For example with this function:
func Sum(xs []int64) int64
You can do this:
#include "textflag.h"
TEXT ·Sum(SB),NOSPLIT,$0-24
MOVQ xs+0(FP),DI
MOVQ xs+8(FP),SI
MOVQ $0,CX
MOVQ $0,AX
L1: CMPQ AX,SI // i < len(xs)
JGE Z1
LEAQ (DI)(AX*8),BX // BX = &xs[i]
MOVQ (BX),BX // BX = *BX
ADDQ BX,CX // CX += BX
INCQ AX // i++
JMP L1
Z1: MOVQ CX,ret+24(FP)
RET
If you look in the standard libraries you will see examples of this.
Write some of your code in c, leverage the support it has for intrinsics or inline assembly, and use cgo to call it from go.
Use gccgo to do the same thing as #2, except you can do it directly:
//extern open
func c_open(name *byte, mode int, perm int) int
https://golang.org/doc/install/gccgo#Function_names
I write a lot of vectorized loops, so 1 common idiom is
volatile int dummy[1<<10];
for (int64_t i = 0; i + 16 <= argc; i+= 16) // process all elements with whole vector
{
int x = dummy[i];
}
// handle remainder (hopefully with SIMD too)
But the resulting machine code has 1 more instruction than I would like (using gcc 4.9)
.L3:
leaq -16(%rax), %rdx
addq $16, %rax
cmpq %rcx, %rax
movl -120(%rsp,%rdx,4), %edx
jbe .L3
If I change the code to for (int64_t i = 0; i <= argc - 16; i+= 16), then the "extra"
instruction is gone:
.L2:
movl -120(%rsp,%rax,4), %ecx
addq $16, %rax
cmpq %rdx, %rax
jbe .L2
But why the difference? I was thinking maybe it was due to loop invariants, but too vaguely. Then I noticed in the 5 instruction case, the increment is done before the load, which would require an extra mov due to x86's destructive 2 operand instructions.
So another explanation could be that it's trading instruction parallelism for 1 extra instruction.
Although it seems there would hardly be any performance difference, can someone explain this mystery (preferably who knows about compiler transformations)?
Ideally I would like to keep the i + 16 <= size form since that has a more intuitive meaning (the last element of the vector doesn't go out of bounds)
If argc were below -2147483632, and i was below 2147483632, the expressions i+16 <= argc would be required to yield an arithmetically-correct result, while the expression and i<argc-16 would not. The need to give an arithmetically-correct result in that corner case prevents the compiler from optimizing the former expression to match the latter.
I have the following program for converting 6bit ASCII to binary format.
ascii2bin :: Char -> B.ByteString
ascii2bin = B.reverse . fst . B.unfoldrN 6 decomp . to6BitASCII -- replace to6BitASCII with ord if you want to compile this
where decomp n = case quotRem n 2 of (q,r) -> Just (chr r,q)
bs2bin :: B.ByteString -> B.ByteString
bs2bin = B.concatMap ascii2bin
this produces the following core segment:
Rec {
$wa
$wa =
\ ww ww1 ww2 w ->
case ww2 of wild {
__DEFAULT ->
let {
wild2
wild2 = remInt# ww1 2 } in
case leWord# (int2Word# wild2) (__word 1114111) of _ {
False -> (lvl2 wild2) `cast` ...;
True ->
case writeWord8OffAddr#
ww 0 (narrow8Word# (int2Word# (ord# (chr# wild2)))) w
of s2 { __DEFAULT ->
$wa (plusAddr# ww 1) (quotInt# ww1 2) (+# wild 1) s2
}
};
6 -> (# w, (lvl, lvl1, Just (I# ww1)) #)
}
end Rec }
notice that ord . chr == id, and so there is a redundant operation here: narrow8Word# (int2Word# (ord# (chr# wild2)))
Is there a reason GHC is needlessly converting from Int -> Char -> Int, or is this an example of poor code generation? Can this be optimized out?
EDIT: This is using GHC 7.4.2, I have not tried compiling with any other version. I have since found the problem remains in GHC 7.6.2, but the redundant operations are removed in the current HEAD branch on github.
Is there a reason GHC is needlessly converting from Int -> Char -> Int, or is this an example of poor code generation? Can this be optimized out?
Not really (to both). The core you get from -ddump-simpl is not the end. There are a few optimisations and transformations still done after that on the way to the assembly code. But removing the redundant conversions here isn't actually an optimisation.
They can be, and are, removed between the core and the assembly. The point is that these primops - except for the narrowing - are no-ops, they are only present in the core because that's typed. Since they are no-ops, it doesn't matter whether there is a redundant chain of them in the core.
The assembly that 7.6.1 produces from the code [it's more readable than what 7.4.2 produces, so I take that] - with ord instead of to6BitASCII - is
ASCII.$wa_info:
_cXT:
addq $64,%r12
cmpq 144(%r13),%r12
ja _cXX
movq %rdi,%rcx
cmpq $6,%rdi
jne _cXZ
movq $GHC.Types.I#_con_info,-56(%r12)
movq %rsi,-48(%r12)
movq $Data.Maybe.Just_con_info,-40(%r12)
leaq -55(%r12),%rax
movq %rax,-32(%r12)
movq $(,,)_con_info,-24(%r12)
movq $lvl1_rVq_closure+1,-16(%r12)
movq $lvl_rVp_closure+1,-8(%r12)
leaq -38(%r12),%rax
movq %rax,0(%r12)
leaq -23(%r12),%rbx
jmp *0(%rbp)
_cXX:
movq $64,192(%r13)
_cXV:
movl $ASCII.$wa_closure,%ebx
jmp *-8(%r13)
_cXZ:
movl $2,%ebx
movq %rsi,%rax
cqto
idivq %rbx
movq %rax,%rsi
cmpq $1114111,%rdx
jbe _cY2
movq %rdx,%r14
addq $-64,%r12
jmp GHC.Char.chr2_info
_cY2:
movb %dl,(%r14)
incq %r14
leaq 1(%rcx),%rdi
addq $-64,%r12
jmp ASCII.$wa_info
.size ASCII.$wa_info, .-ASCII.$wa_info
The part where the narrow8Word# (int2Word# (ord# (chr# wild2))) appears in the core is after the cmpq $1114111, %rdx. If the quotient is not out-of-range, the code jumps to _cY2 which contains no such conversions anymore. A byte is written to the array, some pointers/counters are incremented, and that's it, jump back to the top.
I think it would be possible to generate better code from this than GHC currently does, but the redundant no-op conversions already vanish.