Overhead of ASM-function-call in go - performance

I am currently playing around with Go, its assembly, the performance of floating-point operations (float32), and optimizations at the nanosecond scale. I was a bit confused by the overhead of a simple function call:
func BenchmarkEmpty(b *testing.B) {
	for i := 0; i < b.N; i++ {
	}
}
func BenchmarkNop(b *testing.B) {
	for i := 0; i < b.N; i++ {
		doNop()
	}
}
The implementation of doNop:
TEXT ·doNop(SB),0,$0-0
RET
The result (go test -bench .):
BenchmarkEmpty 2000000000 0.30 ns/op
BenchmarkNop 2000000000 1.73 ns/op
I'm not used to assembly or the internals of Go. Is it possible for the Go compiler/linker to inline a function defined in assembly? Can I give the linker a hint somehow? For some simple functions like 'add two R3-vectors' this eats up all possible performance gain.
(go 1.4.2, amd64)

Assembly functions are not inlined. Here are 3 things you could try:
Move your loop into assembly. For example with this function:
func Sum(xs []int64) int64
You can do this:
#include "textflag.h"
TEXT ·Sum(SB),NOSPLIT,$0-32 // 24-byte slice header + 8-byte result
MOVQ xs_base+0(FP),DI
MOVQ xs_len+8(FP),SI
MOVQ $0,CX
MOVQ $0,AX
L1: CMPQ AX,SI // i < len(xs)
JGE Z1
LEAQ (DI)(AX*8),BX // BX = &xs[i]
MOVQ (BX),BX // BX = *BX
ADDQ BX,CX // CX += BX
INCQ AX // i++
JMP L1
Z1: MOVQ CX,ret+24(FP)
RET
If you look in the standard libraries you will see examples of this.
Write some of your code in C, leverage the support it has for intrinsics or inline assembly, and use cgo to call it from Go.
Use gccgo to do the same thing as the previous option, except you can do it directly:
//extern open
func c_open(name *byte, mode int, perm int) int
https://golang.org/doc/install/gccgo#Function_names
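For comparison, the loop that the assembly Sum implements can be written in plain Go as sketched below. The name sumGo is hypothetical and not part of the answer above. The point it illustrates is the same as option one: the call overhead is paid once per slice rather than once per element, and a small pure-Go function is something the compiler may consider for inlining, which an assembly routine is not.

```go
package main

import "fmt"

// sumGo is a hypothetical pure-Go counterpart of the assembly Sum.
// It walks the slice once, so any call overhead is paid per slice,
// not per element.
func sumGo(xs []int64) int64 {
	var total int64
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	fmt.Println(sumGo([]int64{1, 2, 3, 4}))
}
```

Benchmarking sumGo against the assembly version on realistic slice sizes is the only reliable way to tell whether the hand-written loop is worth keeping.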

Related

Never triggered if statements make code execution in benchmark faster? Why?

I have recently started the Go track on exercism.io and had fun optimizing the "nth-prime" calculation. Actually I came across a funny fact I can't explain. Imagine the following code:
// Package prime provides ...
package prime
// Nth function checks for the prime number on position n
func Nth(n int) (int, bool) {
	if n <= 0 {
		return 0, false
	}
	if n == 1 {
		return 2, true
	}
	currentNumber := 1
	primeCounter := 1
	for n > primeCounter {
		currentNumber += 2
		if isPrime(currentNumber) {
			primeCounter++
		}
	}
	return currentNumber, primeCounter == n
}
// isPrime function checks if a number
// is a prime number
func isPrime(n int) bool {
	// useless because never triggered but makes it faster??
	if n < 2 {
		println("n < 2")
		return false
	}
	// useless because never triggered but makes it faster??
	if n%2 == 0 {
		println("n%2")
		return n == 2
	}
	for i := 3; i*i <= n; i += 2 {
		if n%i == 0 {
			return false
		}
	}
	return true
}
In the private function isPrime I have two initial if-statements that are never triggered, because I only pass in odd numbers greater than 2. The benchmark returns the following:
Running tool: /usr/bin/go test -benchmem -run=^$ -bench ^(BenchmarkNth)$
BenchmarkNth-8 100 18114825 ns/op 0 B/op 0
If I remove the never-triggered if-statements the benchmark gets slower:
Running tool: /usr/bin/go test -benchmem -run=^$ -bench ^(BenchmarkNth)$
BenchmarkNth-8 50 21880749 ns/op 0 B/op 0
I have run the benchmark multiple times, changing the code back and forth, and always get more or less the same numbers. I can't think of a reason why these two if-statements should make the execution faster. Yes, it is a micro-optimization, but I want to know: why?
Here is the whole exercise from exercism with test-cases: nth-prime
The Go version I am using is 1.12.1 linux/amd64 on Manjaro i3 Linux.
With those ifs in place, the compiler knows certain facts about the input.
Without them, the compiler cannot assume those facts and has to emit a check itself, and that check runs on every iteration. We can look at the assembly code to see this (pass -gcflags=-S to the go test command).
With the ifs:
0x004b 00075 (func.go:16) JMP 81
0x004d 00077 (func.go:16) LEAQ 2(BX), AX
0x0051 00081 (func.go:16) MOVQ AX, DX
0x0054 00084 (func.go:16) IMULQ AX, AX
0x0058 00088 (func.go:16) CMPQ AX, CX
0x005b 00091 (func.go:16) JGT 133
0x005d 00093 (func.go:17) TESTQ DX, DX
0x0060 00096 (func.go:17) JEQ 257
0x0066 00102 (func.go:17) MOVQ CX, AX
0x0069 00105 (func.go:17) MOVQ DX, BX
0x006c 00108 (func.go:17) CQO
0x006e 00110 (func.go:17) IDIVQ BX
0x0071 00113 (func.go:17) TESTQ DX, DX
0x0074 00116 (func.go:17) JNE 77
Without the ifs:
0x0016 00022 (func.go:16) JMP 28
0x0018 00024 (func.go:16) LEAQ 2(BX), AX
0x001c 00028 (func.go:16) MOVQ AX, DX
0x001f 00031 (func.go:16) IMULQ AX, AX
0x0023 00035 (func.go:16) CMPQ AX, CX
0x0026 00038 (func.go:16) JGT 88
0x0028 00040 (func.go:17) TESTQ DX, DX
0x002b 00043 (func.go:17) JEQ 102
0x002d 00045 (func.go:17) MOVQ CX, AX
0x0030 00048 (func.go:17) MOVQ DX, BX
0x0033 00051 (func.go:17) CMPQ BX, $-1
0x0037 00055 (func.go:17) JEQ 64
0x0039 00057 (func.go:17) CQO
0x003b 00059 (func.go:17) IDIVQ BX
0x003e 00062 (func.go:17) JMP 69
0x0040 00064 (func.go:17) NEGQ AX
0x0043 00067 (func.go:17) XORL DX, DX
0x0045 00069 (func.go:17) TESTQ DX, DX
0x0048 00072 (func.go:17) JNE 24
The instruction at offset 51 in the assembly, 0x0033 00051 (func.go:17) CMPQ BX, $-1, is the culprit. It compares the divisor with -1 and, in that case, skips IDIVQ and produces the result with NEGQ/XORL instead.
Line 16, for i := 3; i*i <= n; i+=2, in the original Go code compiles to the same instructions in both cases. But line 17, if n%i == 0, which runs on every iteration, compiles to more instructions and therefore more work for the CPU in total.
Something similar was done in the encoding/base64 package, by ensuring the loop won't receive a nil value. You can take a look here:
https://go-review.googlesource.com/c/go/+/151158/3/src/encoding/base64/base64.go
This check was added intentionally. In your case, you optimized it accidentally :)
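To reproduce the comparison in isolation, the two variants can be placed side by side as in the sketch below. The names isPrimeGuarded and isPrimeBare are hypothetical. Whether a given compiler version drops the extra divisor check for the guarded variant is an assumption to verify with -gcflags=-S, not something this code guarantees. Both variants return the same result for the odd inputs greater than 2 that Nth passes in.

```go
package main

import "fmt"

// isPrimeGuarded keeps the early returns, which tell the compiler
// that n >= 2 and n is odd by the time the loop is reached.
func isPrimeGuarded(n int) bool {
	if n < 2 {
		return false
	}
	if n%2 == 0 {
		return n == 2
	}
	for i := 3; i*i <= n; i += 2 {
		if n%i == 0 {
			return false
		}
	}
	return true
}

// isPrimeBare omits the guards. It is only correct for odd n > 2.
func isPrimeBare(n int) bool {
	for i := 3; i*i <= n; i += 2 {
		if n%i == 0 {
			return false
		}
	}
	return true
}

func main() {
	for _, n := range []int{3, 9, 91, 97} {
		fmt.Println(n, isPrimeGuarded(n), isPrimeBare(n))
	}
}
```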

What is the fastest way to increment a map?

I noticed a 3x speed difference between the following two increment methods for map[int]int variables:
fast: myMap[key]++
slow: myMap[key]=myMap[key]+1
This probably isn't surprising because, at least naively, in the second case I'm directing Go to access myMap twice. I'm just curious: Can anyone familiar with the Go compiler help me understand the difference between these operations on maps? And with knowledge of how the compiler works, is there a faster trick to increment maps?
edit: running locally the difference is less pronounced, but still present:
package main
import (
	"fmt"
	"math"
	"time"
)
func main() {
	x, y := make(map[int]int), make(map[int]int)
	x[0], y[0] = 0, 0
	steps := int(math.Pow(10, 9))
	start1 := time.Now()
	for i := 0; i < steps; i++ {
		x[0]++
	}
	elapsed1 := time.Since(start1)
	fmt.Println("++ took", elapsed1)
	start2 := time.Now()
	for i := 0; i < steps; i++ {
		y[0] = y[0] + 1
	}
	elapsed2 := time.Since(start2)
	fmt.Println("y=y+1 took", elapsed2)
}
Output:
++ took 8.1739809s
y=y+1 took 17.9079386s
Edit2: As suggested I dumped the machine code. Here are the relevant snippets
For x[0]++
0x4981e3 488d05b6830100 LEAQ runtime.types+95648(SB), AX
0x4981ea 48890424 MOVQ AX, 0(SP)
0x4981ee 488d8c2400020000 LEAQ 0x200(SP), CX
0x4981f6 48894c2408 MOVQ CX, 0x8(SP)
0x4981fb 48c744241000000000 MOVQ $0x0, 0x10(SP)
0x498204 e8976df7ff CALL runtime.mapassign_fast64(SB)
0x498209 488b442418 MOVQ 0x18(SP), AX
0x49820e 48ff00 INCQ 0(AX)
For y[0] = y[0] + 1
0x498302 488d0597820100 LEAQ runtime.types+95648(SB), AX
0x498309 48890424 MOVQ AX, 0(SP)
0x49830d 488d8c24d0010000 LEAQ 0x1d0(SP), CX
0x498315 48894c2408 MOVQ CX, 0x8(SP)
0x49831a 48c744241000000000 MOVQ $0x0, 0x10(SP)
0x498323 e80869f7ff CALL runtime.mapaccess1_fast64(SB)
0x498328 488b442418 MOVQ 0x18(SP), AX
0x49832d 488b00 MOVQ 0(AX), AX
0x498330 4889442448 MOVQ AX, 0x48(SP)
0x498335 488d0d64820100 LEAQ runtime.types+95648(SB), CX
0x49833c 48890c24 MOVQ CX, 0(SP)
0x498340 488d9424d0010000 LEAQ 0x1d0(SP), DX
0x498348 4889542408 MOVQ DX, 0x8(SP)
0x49834d 48c744241000000000 MOVQ $0x0, 0x10(SP)
0x498356 e8456cf7ff CALL runtime.mapassign_fast64(SB)
0x49835b 488b442418 MOVQ 0x18(SP), AX
0x498360 488b4c2448 MOVQ 0x48(SP), CX
0x498365 48ffc1 INCQ CX
0x498368 488908 MOVQ CX, 0(AX)
Oddly enough, ++ doesn't even call map access! ++ is clearly simpler by a factor of 2 or 3. My ability to parse machine code ends there, so if anyone has insight into what's going on, I'd love to hear it.
The Go gc compiler is an optimizing compiler. It is continuously being improved. For example, for Go 1.11:
Go Issue: cmd/compile: We can avoid extra mapaccess in "m[k] op= r" #23661
Go commit: 7395083136539331537d46875ab9d196797a2173
cmd/compile: avoid extra mapaccess in "m[k] op= r"
Currently, order desugars map assignment operations like
m[k] op= r
into
m[k] = m[k] op r
which in turn is transformed during walk into:
tmp := *mapaccess(m, k)
tmp = tmp op r
*mapassign(m, k) = tmp
However, this is suboptimal, as we could instead produce just:
*mapassign(m, k) op= r
One complication though is if "r == 0", then "m[k] /= r" and "m[k] %=
r" will panic, and they need to do so *before* calling mapassign,
otherwise we may insert a new zero-value element into the map.
It would be spec compliant to just emit the "r != 0" check before
calling mapassign (see #23735), but currently these checks aren't
generated until SSA construction. For now, it's simpler to continue
desugaring /= and %= into two map indexing operations.
Fixes #23661.
Results for your code:
go1.10:
++ took 10.258130907s
y=y+1 took 10.233823639s
go1.11:
++ took 7.995184419s
y=y+1 took 10.259916484s
The general answer to your question is to be simple, explicit, and obvious in your code. The compiler then has an easier task to recognize a common optimizable pattern.
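The three spellings discussed in the commit message can be compared directly. The sketch below uses hypothetical helper names. It only shows that the forms are semantically equivalent, so choosing m[k]++ or m[k] += 1 costs nothing in correctness. Which of them compiles to a single mapassign depends on the compiler version, as the results above show.

```go
package main

import "fmt"

// incPlusPlus uses the m[k]++ form.
func incPlusPlus(m map[int]int, k, n int) {
	for i := 0; i < n; i++ {
		m[k]++
	}
}

// incOpAssign uses the m[k] op= r form named in the commit message.
func incOpAssign(m map[int]int, k, n int) {
	for i := 0; i < n; i++ {
		m[k] += 1
	}
}

// incExplicit spells out the read followed by the write.
func incExplicit(m map[int]int, k, n int) {
	for i := 0; i < n; i++ {
		m[k] = m[k] + 1
	}
}

func main() {
	a, b, c := map[int]int{}, map[int]int{}, map[int]int{}
	incPlusPlus(a, 0, 1000)
	incOpAssign(b, 0, 1000)
	incExplicit(c, 0, 1000)
	fmt.Println(a[0], b[0], c[0]) // all three maps hold the same count
}
```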

Golang `copy` time complexity

I was wondering about the time complexity of Go's copy function.
Intuitively I would assume the worst case is linear time. But I was wondering if there was any magic that was able to bulk allocate, or something, which would allow it to perform better?
https://golang.org/ref/spec#Appending_and_copying_slices
I figured the assembly would explain something, but I'm not sure what I'm reading :p
$ GOOS=linux GOARCH=amd64 go tool compile -S main.go
func main() {
	src := []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	dst := make([]int, len(src))
	numCopied := copy(dst, src)
	if numCopied != 10 {
		panic(fmt.Sprintf("expected 10 copied received: %d", numCopied))
	}
}
With the following output from the copy line:
0x007a 00122 (main.go:23) CMPQ AX, $10
0x007e 00126 (main.go:23) JLE 133
0x0080 00128 (main.go:23) MOVL $10, AX
0x0085 00133 (main.go:23) MOVQ AX, "".numCopied+56(SP)
0x008a 00138 (main.go:23) MOVQ CX, (SP)
0x008e 00142 (main.go:23) LEAQ ""..autotmp_8+72(SP), CX
0x0093 00147 (main.go:23) MOVQ CX, 8(SP)
0x0098 00152 (main.go:23) SHLQ $3, AX
0x009c 00156 (main.go:23) MOVQ AX, 16(SP)
0x00a1 00161 (main.go:23) PCDATA $0, $0
0x00a1 00161 (main.go:23) CALL runtime.memmove(SB)
0x00a6 00166 (main.go:23) MOVQ "".numCopied+56(SP), AX
I then tried with 5 elements as well:
func main() {
	src := []int{1, 2, 3, 4, 5}
	dst := make([]int, len(src))
	numCopied := copy(dst, src)
	if numCopied != 5 {
		panic(fmt.Sprintf("expected 5 copied received: %d", numCopied))
	}
}
With the following output from the copy line:
0x0086 00134 (main.go:9) CMPQ AX, $5
0x008a 00138 (main.go:9) JLE 145
0x008c 00140 (main.go:9) MOVL $5, AX
0x0091 00145 (main.go:9) MOVQ AX, "".numCopied+56(SP)
0x0096 00150 (main.go:9) MOVQ CX, (SP)
0x009a 00154 (main.go:9) LEAQ ""..autotmp_8+72(SP), CX
0x009f 00159 (main.go:9) MOVQ CX, 8(SP)
0x00a4 00164 (main.go:9) SHLQ $3, AX
0x00a8 00168 (main.go:9) MOVQ AX, 16(SP)
0x00ad 00173 (main.go:9) PCDATA $0, $0
0x00ad 00173 (main.go:9) CALL runtime.memmove(SB)
0x00b2 00178 (main.go:9) MOVQ "".numCopied+56(SP), AX
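In both listings, copy compiles to a single CALL runtime.memmove whose size argument is the element count shifted left by 3 (that is, multiplied by the 8-byte size of an int), so the work grows linearly with the number of bytes copied. To see this in practice, I suggest benchmarking the time it takes to copy slices of different sizes. Here is something to get the ball rolling: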
package main
import (
	"fmt"
	"testing"
)
func main() {
	for i := 0; i < 16; i++ {
		size := powerOfTwo(i)
		runBench(size)
	}
}
func runBench(size int) {
	bench := func(b *testing.B) {
		src := make([]int, size)
		dst := make([]int, size)
		// we don't want to measure the time
		// it takes to make the slices, so reset the timer
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			copy(dst, src)
		}
	}
	// one result per line
	fmt.Printf("size = %d, %s\n", size, testing.Benchmark(bench))
}
// powerOfTwo returns 2^i using an integer shift instead of math.Pow.
func powerOfTwo(i int) int {
	return 1 << uint(i)
}
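The CMPQ/JLE/MOVL sequence at the top of both assembly listings clamps the element count before memmove is called. At the language level this is the rule that copy transfers min(len(dst), len(src)) elements, and memmove is what makes overlapping source and destination safe. A small sketch of both properties, with hypothetical helper names:

```go
package main

import "fmt"

// copyPrefix copies src into a fresh slice of length dstLen and
// returns how many elements copy reported, plus the destination.
func copyPrefix(dstLen int, src []int) (int, []int) {
	dst := make([]int, dstLen)
	n := copy(dst, src) // n == min(len(dst), len(src))
	return n, dst
}

// shiftRight moves every element one position to the right in place.
// Source and destination overlap, which copy handles correctly.
func shiftRight(xs []int) int {
	return copy(xs[1:], xs)
}

func main() {
	n, dst := copyPrefix(3, []int{1, 2, 3, 4, 5})
	fmt.Println(n, dst) // 3 [1 2 3]

	xs := []int{1, 2, 3, 4, 5}
	m := shiftRight(xs)
	fmt.Println(m, xs) // 4 [1 1 2 3 4]
}
```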

Can't move 64-bit immediate values in assembler

I am new to 64-bit assembly coding, so I tried some simple programs:
C program:
#include <stdio.h>
extern double bla();
double x=0;
int main() {
x=bla();
printf(" %f",x);
return 0;
}
Assembly:
section .data
section .text
global bla
bla:
mov rax,10
movq xmm0,rax
ret
The result was always 0.0 instead of 10.0.
But when I do it without an immediate, it works fine:
#include <stdio.h>
extern double bla(double y);
double x=0;
double a=10;
int main() {
x=bla(a);
printf("add returned %f",x);
return 0;
}
section .data
section .text
global bla
bla:
movq rax,xmm0
movq xmm0,rbx ;xmm0=0 now
movq xmm0,rax ;xmm0=10 now
ret
Do I need a different instruction to load an immediate into a 64-bit register?
The problem here was that the OP was trying to move 10 into a floating-point register with the following code:
mov rax,10
movq xmm0,rax
That cannot work, since movq into xmm0 copies the bits unchanged, so the source must already hold the bit pattern of a floating-point value - and of course it doesn't: it holds the integer 10.
@Michael Petch's suggestion was to use the NASM assembler's floating-point conversion operator as follows:
mov rax,__float64__(10.0)
movq xmm0,rax
That then produces the expected output.
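The difference between the two bit patterns can be checked without an assembler. The sketch below, written in Go for illustration with hypothetical helper names, reinterprets the integer 10 as a double and shows the IEEE 754 pattern that __float64__(10.0) stands for.

```go
package main

import (
	"fmt"
	"math"
)

// reinterpret treats a 64-bit integer pattern as a float64 without
// any conversion, which is what movq xmm0,rax does.
func reinterpret(bits uint64) float64 {
	return math.Float64frombits(bits)
}

// bitsOf returns the IEEE 754 bit pattern of a float64.
func bitsOf(f float64) uint64 {
	return math.Float64bits(f)
}

func main() {
	// The integer 10 read as a double is a tiny subnormal number.
	fmt.Printf("%f\n", reinterpret(10)) // 0.000000
	// The pattern that actually encodes 10.0.
	fmt.Printf("%#x\n", bitsOf(10.0)) // 0x4024000000000000
	fmt.Printf("%f\n", reinterpret(bitsOf(10.0))) // 10.000000
}
```

This matches the output seen in the question: printf with %f shows the subnormal value as 0.000000.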

Where is the implementation of func append in Go?

I'm very interested in Go and am trying to read the implementations of its functions. I found that some of these functions don't have implementations there,
such as append or call:
// The append built-in function appends elements to the end of a slice. If
// it has sufficient capacity, the destination is resliced to accommodate the
// new elements. If it does not, a new underlying array will be allocated.
// Append returns the updated slice. It is therefore necessary to store the
// result of append, often in the variable holding the slice itself:
// slice = append(slice, elem1, elem2)
// slice = append(slice, anotherSlice...)
// As a special case, it is legal to append a string to a byte slice, like this:
// slice = append([]byte("hello "), "world"...)
func append(slice []Type, elems ...Type) []Type
// call calls fn with a copy of the n argument bytes pointed at by arg.
// After fn returns, reflectcall copies n-retoffset result bytes
// back into arg+retoffset before returning. If copying result bytes back,
// the caller must pass the argument frame type as argtype, so that
// call can execute appropriate write barriers during the copy.
func call(argtype *rtype, fn, arg unsafe.Pointer, n uint32, retoffset uint32)
It doesn't seem to be calling C code, because using cgo needs some special comments.
Where are these functions' implementations?
The code you are reading and citing is just dummy code to have consistent documentation. The built-in functions are, well, built into the language and, as such, are included in the code processing step (the compiler).
Simplified, what happens is: the lexer detects 'append(...)' as an APPEND token; the parser translates APPEND, depending on the circumstances/parameters/environment, to code; the code is written as assembly and assembled. The middle step - the implementation of append - can be found in the compiler here.
What happens to an append call is best seen when looking at the assembly of an example program. Consider this:
b := []byte{'a'}
b = append(b, 'b')
println(string(b), cap(b))
Running it will yield the following output:
ab 2
The append call is translated to assembly like this:
// create new slice object
MOVQ BX, "".b+120(SP) // BX contains data addr., write to b.addr
MOVQ BX, CX // store addr. in CX
MOVQ AX, "".b+128(SP) // AX contains len(b) == 1, write to b.len
MOVQ DI, "".b+136(SP) // DI contains cap(b) == 1, write to b.cap
MOVQ AX, BX // BX now contains len(b)
INCQ BX // BX++
CMPQ BX, DI // compare new length (2) with cap (1)
JHI $1, 225 // jump to grow code if len > cap
...
LEAQ (CX)(AX*1), BX // load address of newly allocated slice entry
MOVB $98, (BX) // write 'b' to loaded address
// grow code, call runtime.growslice(t *slicetype, old slice, cap int)
LEAQ type.[]uint8(SB), BP
MOVQ BP, (SP) // load parameters onto stack
MOVQ CX, 8(SP)
MOVQ AX, 16(SP)
MOVQ SI, 24(SP)
MOVQ BX, 32(SP)
PCDATA $0, $0
CALL runtime.growslice(SB) // call
MOVQ 40(SP), DI
MOVQ 48(SP), R8
MOVQ 56(SP), SI
MOVQ R8, AX
INCQ R8
MOVQ DI, CX
JMP 108 // jump back, growing done
As you can see, there is no CALL to a function named append. This is the full implementation of the append call in the example code. Another call with different parameters will look different (other registers, different parameters depending on the slice type, etc.).
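The two paths in the listing, writing in place versus jumping to the grow code, can also be observed from Go itself. The following is a minimal sketch with a hypothetical helper, appendByte, that reports whether append had to move the data to a new backing array. The exact capacity after growth varies between Go versions, so it is deliberately not printed.

```go
package main

import "fmt"

// appendByte appends c to b and reports whether the result lives in
// a different backing array, i.e. whether the grow path was taken.
func appendByte(b []byte, c byte) (out []byte, moved bool) {
	out = append(b, c)
	moved = len(b) > 0 && &out[0] != &b[0]
	return out, moved
}

func main() {
	full := []byte{'a'} // len 1, cap 1: no room left
	full, moved := appendByte(full, 'b')
	fmt.Println(string(full), moved) // ab true

	roomy := make([]byte, 1, 4) // len 1, cap 4: spare capacity
	roomy[0] = 'a'
	roomy, moved = appendByte(roomy, 'b')
	fmt.Println(string(roomy), moved) // ab false
}
```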
The Go append builtin function code is generated by the Go gc and gccgo compilers and uses Go package runtime functions (for example, runtime.growslice()) in go/src/runtime/slice.go.
For example,
package main
func main() {
b := []int{0, 1}
b = append(b, 2)
}
Go pseudo-assembler:
$ go tool compile -S a.go
"".main t=1 size=192 value=0 args=0x0 locals=0x68
0x0000 00000 (a.go:3) TEXT "".main(SB), $104-0
0x0000 00000 (a.go:3) MOVQ (TLS), CX
0x0009 00009 (a.go:3) CMPQ SP, 16(CX)
0x000d 00013 (a.go:3) JLS 167
0x0013 00019 (a.go:3) SUBQ $104, SP
0x0017 00023 (a.go:3) FUNCDATA $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0017 00023 (a.go:3) FUNCDATA $1, gclocals·790e5cc5051fc0affc980ade09e929ec(SB)
0x0017 00023 (a.go:4) LEAQ "".autotmp_0002+64(SP), BX
0x001c 00028 (a.go:4) MOVQ BX, CX
0x001f 00031 (a.go:4) NOP
0x001f 00031 (a.go:4) MOVQ "".statictmp_0000(SB), BP
0x0026 00038 (a.go:4) MOVQ BP, (BX)
0x0029 00041 (a.go:4) MOVQ "".statictmp_0000+8(SB), BP
0x0030 00048 (a.go:4) MOVQ BP, 8(BX)
0x0034 00052 (a.go:4) NOP
0x0034 00052 (a.go:4) MOVQ $2, AX
0x003b 00059 (a.go:4) MOVQ $2, DX
0x0042 00066 (a.go:5) MOVQ CX, "".b+80(SP)
0x0047 00071 (a.go:5) MOVQ AX, "".b+88(SP)
0x004c 00076 (a.go:5) MOVQ DX, "".b+96(SP)
0x0051 00081 (a.go:5) MOVQ AX, BX
0x0054 00084 (a.go:5) INCQ BX
0x0057 00087 (a.go:5) CMPQ BX, DX
0x005a 00090 (a.go:5) JHI $1, 108
0x005c 00092 (a.go:5) LEAQ (CX)(AX*8), BX
0x0060 00096 (a.go:5) MOVQ $2, (BX)
0x0067 00103 (a.go:6) ADDQ $104, SP
0x006b 00107 (a.go:6) RET
0x006c 00108 (a.go:5) LEAQ type.[]int(SB), BP
0x0073 00115 (a.go:5) MOVQ BP, (SP)
0x0077 00119 (a.go:5) MOVQ CX, 8(SP)
0x007c 00124 (a.go:5) MOVQ AX, 16(SP)
0x0081 00129 (a.go:5) MOVQ DX, 24(SP)
0x0086 00134 (a.go:5) MOVQ BX, 32(SP)
0x008b 00139 (a.go:5) PCDATA $0, $0
0x008b 00139 (a.go:5) CALL runtime.growslice(SB)
0x0090 00144 (a.go:5) MOVQ 40(SP), CX
0x0095 00149 (a.go:5) MOVQ 48(SP), AX
0x009a 00154 (a.go:5) MOVQ 56(SP), DX
0x009f 00159 (a.go:5) MOVQ AX, BX
0x00a2 00162 (a.go:5) INCQ BX
0x00a5 00165 (a.go:5) JMP 92
0x00a7 00167 (a.go:3) CALL runtime.morestack_noctxt(SB)
0x00ac 00172 (a.go:3) JMP 0
To add to the assembly code given by the others, you can find the Go (1.5.1) code for gc here: https://github.com/golang/go/blob/f2e4c8b5fb3660d793b2c545ef207153db0a34b1/src/cmd/compile/internal/gc/walk.go#L2895
// expand append(l1, l2...) to
// init {
// s := l1
// if n := len(l1) + len(l2) - cap(s); n > 0 {
// s = growslice_n(s, n)
// }
// s = s[:len(l1)+len(l2)]
// memmove(&s[len(l1)], &l2[0], len(l2)*sizeof(T))
// }
// s
//
// l2 is allowed to be a string.
with growslice_n being defined here: https://github.com/golang/go/blob/f2e4c8b5fb3660d793b2c545ef207153db0a34b1/src/runtime/slice.go#L36
// growslice_n is a variant of growslice that takes the number of new elements
// instead of the new minimum capacity.
// TODO(rsc): This is used by append(slice, slice...).
// The compiler should change that code to use growslice directly (issue #11419).
func growslice_n(t *slicetype, old slice, n int) slice {
if n < 1 {
panic(errorString("growslice: invalid n"))
}
return growslice(t, old, old.cap+n)
}
// growslice handles slice growth during append.
// It is passed the slice type, the old slice, and the desired new minimum capacity,
// and it returns a new slice with at least that capacity, with the old data
// copied into it.
func growslice(t *slicetype, old slice, cap int) slice {
if cap < old.cap || t.elem.size > 0 && uintptr(cap) > _MaxMem/uintptr(t.elem.size) {
panic(errorString("growslice: cap out of range"))
}
if raceenabled {
callerpc := getcallerpc(unsafe.Pointer(&t))
racereadrangepc(old.array, uintptr(old.len*int(t.elem.size)), callerpc, funcPC(growslice))
}
et := t.elem
if et.size == 0 {
// append should not create a slice with nil pointer but non-zero len.
// We assume that append doesn't need to preserve old.array in this case.
return slice{unsafe.Pointer(&zerobase), old.len, cap}
}
newcap := old.cap
if newcap+newcap < cap {
newcap = cap
} else {
for {
if old.len < 1024 {
newcap += newcap
} else {
newcap += newcap / 4
}
if newcap >= cap {
break
}
}
}
if uintptr(newcap) >= _MaxMem/uintptr(et.size) {
panic(errorString("growslice: cap out of range"))
}
lenmem := uintptr(old.len) * uintptr(et.size)
capmem := roundupsize(uintptr(newcap) * uintptr(et.size))
newcap = int(capmem / uintptr(et.size))
var p unsafe.Pointer
if et.kind&kindNoPointers != 0 {
p = rawmem(capmem)
memmove(p, old.array, lenmem)
memclr(add(p, lenmem), capmem-lenmem)
} else {
// Note: can't use rawmem (which avoids zeroing of memory), because then GC can scan uninitialized memory.
p = newarray(et, uintptr(newcap))
if !writeBarrierEnabled {
memmove(p, old.array, lenmem)
} else {
for i := uintptr(0); i < lenmem; i += et.size {
typedmemmove(et, add(p, i), add(old.array, i))
}
}
}
return slice{p, old.len, newcap}
}
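The capacity computation in the middle of growslice can be isolated to see the growth policy on its own. The function below is a simplified sketch, not the runtime code. The name growCap is hypothetical, it takes plain ints instead of the runtime's slice struct, and it leaves out the overflow checks and the roundupsize step, so real capacities may come out larger than what it returns.

```go
package main

import "fmt"

// growCap mirrors only the doubling / 1.25x step of the Go 1.5.1
// growslice quoted above. want is the desired minimum capacity.
func growCap(oldLen, oldCap, want int) int {
	newcap := oldCap
	if newcap+newcap < want {
		// Doubling would not be enough: take the request as is.
		return want
	}
	for {
		if oldLen < 1024 {
			newcap += newcap // double small slices
		} else {
			newcap += newcap / 4 // grow large slices by 25%
		}
		if newcap >= want {
			return newcap
		}
	}
}

func main() {
	fmt.Println(growCap(1, 1, 2))          // 2
	fmt.Println(growCap(2, 2, 3))          // 4
	fmt.Println(growCap(1024, 1024, 1025)) // 1280
	fmt.Println(growCap(2, 2, 10))         // 10
}
```

The first call corresponds to the []byte{'a'} example above, where the old capacity of 1 doubles to the reported capacity of 2.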

Resources