Never-triggered if statements make code execution in benchmark faster? Why? - go

I have recently started the Go track on exercism.io and had fun optimizing the "nth-prime" calculation. Actually I came across a funny fact I can't explain. Imagine the following code:
// Package prime provides ...
package prime

// Nth function checks for the prime number on position n
func Nth(n int) (int, bool) {
    if n <= 0 {
        return 0, false
    }
    if n == 1 {
        return 2, true
    }
    currentNumber := 1
    primeCounter := 1
    for n > primeCounter {
        currentNumber += 2
        if isPrime(currentNumber) {
            primeCounter++
        }
    }
    return currentNumber, primeCounter == n
}

// isPrime function checks if a number
// is a prime number
func isPrime(n int) bool {
    // useless because never triggered but makes it faster??
    if n < 2 {
        println("n < 2")
        return false
    }
    // useless because never triggered but makes it faster??
    if n%2 == 0 {
        println("n%2")
        return n == 2
    }
    for i := 3; i*i <= n; i += 2 {
        if n%i == 0 {
            return false
        }
    }
    return true
}
In the unexported (private) function isPrime I have two initial if statements that are never triggered, because I only pass in odd numbers greater than 2. The benchmark returns the following:
Running tool: /usr/bin/go test -benchmem -run=^$ -bench ^(BenchmarkNth)$
BenchmarkNth-8    100    18114825 ns/op    0 B/op    0 allocs/op
If I remove the never-triggered if statements, the benchmark gets slower:
Running tool: /usr/bin/go test -benchmem -run=^$ -bench ^(BenchmarkNth)$
BenchmarkNth-8    50    21880749 ns/op    0 B/op    0 allocs/op
I have run the benchmark multiple times, changing the code back and forth, and I always get more or less the same numbers. I can't think of a reason why these two if statements should make the execution faster. Yes, it is micro-optimization, but I want to know: why?
Here is the whole exercise from exercism with test cases: nth-prime
The Go version I am using is 1.12.1 linux/amd64, on Manjaro i3 Linux.

What happens is that those if statements give the compiler guarantees (assertions) about the input.
If those assertions are lifted, the compiler has to add the checks itself, and the way it does that is by validating on each iteration. We can take a look at the assembly code to prove it (by passing -gcflags=-S to the go test command).
With the if's:
0x004b 00075 (func.go:16) JMP 81
0x004d 00077 (func.go:16) LEAQ 2(BX), AX
0x0051 00081 (func.go:16) MOVQ AX, DX
0x0054 00084 (func.go:16) IMULQ AX, AX
0x0058 00088 (func.go:16) CMPQ AX, CX
0x005b 00091 (func.go:16) JGT 133
0x005d 00093 (func.go:17) TESTQ DX, DX
0x0060 00096 (func.go:17) JEQ 257
0x0066 00102 (func.go:17) MOVQ CX, AX
0x0069 00105 (func.go:17) MOVQ DX, BX
0x006c 00108 (func.go:17) CQO
0x006e 00110 (func.go:17) IDIVQ BX
0x0071 00113 (func.go:17) TESTQ DX, DX
0x0074 00116 (func.go:17) JNE 77
Without the if's:
0x0016 00022 (func.go:16) JMP 28
0x0018 00024 (func.go:16) LEAQ 2(BX), AX
0x001c 00028 (func.go:16) MOVQ AX, DX
0x001f 00031 (func.go:16) IMULQ AX, AX
0x0023 00035 (func.go:16) CMPQ AX, CX
0x0026 00038 (func.go:16) JGT 88
0x0028 00040 (func.go:17) TESTQ DX, DX
0x002b 00043 (func.go:17) JEQ 102
0x002d 00045 (func.go:17) MOVQ CX, AX
0x0030 00048 (func.go:17) MOVQ DX, BX
0x0033 00051 (func.go:17) CMPQ BX, $-1
0x0037 00055 (func.go:17) JEQ 64
0x0039 00057 (func.go:17) CQO
0x003b 00059 (func.go:17) IDIVQ BX
0x003e 00062 (func.go:17) JMP 69
0x0040 00064 (func.go:17) NEGQ AX
0x0043 00067 (func.go:17) XORL DX, DX
0x0045 00069 (func.go:17) TESTQ DX, DX
0x0048 00072 (func.go:17) JNE 24
Line 51 in the assembly code, 0x0033 00051 (func.go:17) CMPQ BX, $-1, is the culprit.
Line 16 in the original Go code, for i := 3; i*i <= n; i += 2, is translated the same in both cases. But line 17, if n%i == 0, which runs on every iteration, compiles to more instructions and therefore more work for the CPU in total. The extra CMPQ BX, $-1 (and the NEGQ path behind it) guards the IDIVQ against a divisor of -1: the Go spec defines the quotient of the most negative value divided by -1 to wrap (with a remainder of 0), whereas the hardware instruction would fault, so without any knowledge about the divisor the compiler must emit this check on every iteration.
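You can see the language rule that forces this check with a small standalone snippet (not from the original question):
package main

import (
    "fmt"
    "math"
)

func main() {
    x := int64(math.MinInt64)
    // The spec defines this division to wrap: the quotient equals x and the
    // remainder is 0, so the compiler must not let IDIVQ fault here.
    fmt.Println(x/-1, x%-1) // -9223372036854775808 0
}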
Something similar was done in the encoding/base64 package, ensuring a loop won't receive a nil value. You can take a look here:
https://go-review.googlesource.com/c/go/+/151158/3/src/encoding/base64/base64.go
This check was added intentionally. In your case, you optimized it accidentally :)

Related

Performance difference between Rust and C++

I am currently learning Rust, and as a first exercise I wanted to implement a function that computes the nth Fibonacci number:
fn main() {
    for i in 0..48 {
        println!("{}: {}", i, fibonacci(i));
    }
}

fn fibonacci(n: u32) -> u32 {
    match n {
        0 => 0,
        1 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}
I run it as:
$ time cargo run --release
real 0m15.380s
user 0m15.362s
sys 0m0.014s
As an exercise, I also implemented the same algorithm in C++. I was expecting a similar performance, but the C++ code runs in 80% of the time:
#include <iostream>

unsigned int fibonacci(unsigned int n);

int main(int argc, char* argv[]) {
    for (unsigned int i = 0; i < 48; ++i) {
        std::cout << i << ": " << fibonacci(i) << '\n';
    }
    return 0;
}

unsigned int fibonacci(unsigned int n) {
    if (n == 0) {
        return 0;
    } else if (n == 1) {
        return 1;
    } else {
        return fibonacci(n - 1) + fibonacci(n - 2);
    }
}
Compiled as:
$ g++ test.cpp -o test.exe -O2
And running:
$ time ./test.exe
real 0m12.127s
user 0m12.124s
sys 0m0.000s
Why do I see such a difference in performance? I am not interested in calculating the Fibonacci numbers faster in Rust (with a different algorithm); I am only interested in where the difference comes from. This is just an exercise in my progress as I learn Rust.
TL;DR: It's not Rust vs C++, it's LLVM (Clang) vs GCC.
Different optimizers optimize the code differently, and in this case GCC produces larger but faster code.
This can be verified using godbolt.
Here is the Rust version, compiled first with GCC (via rustgcc-master):
example::fibonacci:
push r15
push r14
push r13
push r12
push rbp
xor ebp, ebp
push rbx
mov ebx, edi
sub rsp, 24
.L2:
test ebx, ebx
je .L1
cmp ebx, 1
je .L4
lea r12d, -1[rbx]
xor r13d, r13d
.L19:
cmp r12d, 1
je .L6
lea r14d, -1[r12]
xor r15d, r15d
.L16:
cmp r14d, 1
je .L8
lea edx, -1[r14]
xor ecx, ecx
.L13:
cmp edx, 1
je .L10
lea edi, -1[rdx]
mov DWORD PTR 12[rsp], ecx
mov DWORD PTR 8[rsp], edx
call example::fibonacci.localalias
mov ecx, DWORD PTR 12[rsp]
mov edx, DWORD PTR 8[rsp]
add ecx, eax
sub edx, 2
jne .L13
.L14:
add r15d, ecx
sub r14d, 2
je .L17
jmp .L16
.L4:
add ebp, 1
.L1:
add rsp, 24
mov eax, ebp
pop rbx
pop rbp
pop r12
pop r13
pop r14
pop r15
ret
.L6:
add r13d, 1
.L20:
sub ebx, 2
add ebp, r13d
jmp .L2
.L8:
add r15d, 1
.L17:
add r13d, r15d
sub r12d, 2
je .L20
jmp .L19
.L10:
add ecx, 1
jmp .L14
And with LLVM (via rustc):
example::fibonacci:
push rbp
push r14
push rbx
mov ebx, edi
xor ebp, ebp
mov r14, qword ptr [rip + example::fibonacci#GOTPCREL]
cmp ebx, 2
jb .LBB0_3
.LBB0_2:
lea edi, [rbx - 1]
call r14
add ebp, eax
add ebx, -2
cmp ebx, 2
jae .LBB0_2
.LBB0_3:
add ebx, ebp
mov eax, ebx
pop rbx
pop r14
pop rbp
ret
We can see that LLVM produces a naive version -- calling the function in each iteration of the loop -- while GCC partially unrolls the recursion by inlining some calls. This results in a smaller number of calls in the case of GCC, and at about 5ns of overhead per function call, it's significant enough.
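To build intuition for what "partially unrolling the recursion" means at the source level, here is a sketch in Go (used for consistency with the rest of this page; GCC performs the transformation on its IR, so its actual output is shaped differently):
package main

import "fmt"

// fib computes the nth Fibonacci number with one level of the recursion
// inlined by hand:
//   fib(n) = fib(n-1) + fib(n-2) = (fib(n-2) + fib(n-3)) + fib(n-2)
// Computing the duplicated fib(n-2) only once roughly halves the number of
// calls per frame, which is the kind of saving the GCC output above gets.
func fib(n uint32) uint32 {
    switch n {
    case 0:
        return 0
    case 1, 2:
        return 1
    }
    a := fib(n - 2) // shared by both halves of the sum
    return 2*a + fib(n-3)
}

func main() {
    for i := uint32(0); i < 10; i++ {
        fmt.Println(i, fib(i))
    }
}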
We can do the same exercise with the C++ version using LLVM via Clang and GCC and note that the results are much the same.
So, as announced, it's an LLVM vs GCC difference, not a language one.
Incidentally, the fact that optimizers may produce such widely different results is a reason why I am quite excited about the progress of the rustc_codegen_gcc initiative (dubbed rustgcc-master on godbolt), which aims at plugging a GCC backend into the rustc frontend: once complete, anyone will be able to switch to the better optimizer for their own workload.

Faster bitwise AND operation on byte slices

I'd like to perform a bitwise AND on every column of a byte matrix, which is stored as a [][]byte in Go. I created a repo with runnable test code.
The problem can be simplified to a bitwise AND operation on two byte slices of equal length. The simplest way is a for loop that handles every pair of bytes.
func and(x, y []byte) []byte {
    z := make([]byte, len(x))
    for i := 0; i < len(x); i++ {
        z[i] = x[i] & y[i]
    }
    return z
}
However, it's very slow for long slices. A faster way is to unroll the for loop (check the benchmark result; a sketch of the unrolled version follows the numbers below):
BenchmarkLoop-16 14467 84265 ns/op
BenchmarkUnrollLoop-16 17668 67550 ns/op
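The unrolled version itself lives in the linked repo; a plausible shape (an assumption, not the repo's exact code) processes eight byte pairs per iteration and mops up the remainder with a scalar tail:
func andUnrolled(x, y []byte) []byte {
    z := make([]byte, len(x))
    n := len(x) / 8 * 8 // largest multiple of 8 that fits
    for i := 0; i < n; i += 8 {
        z[i] = x[i] & y[i]
        z[i+1] = x[i+1] & y[i+1]
        z[i+2] = x[i+2] & y[i+2]
        z[i+3] = x[i+3] & y[i+3]
        z[i+4] = x[i+4] & y[i+4]
        z[i+5] = x[i+5] & y[i+5]
        z[i+6] = x[i+6] & y[i+6]
        z[i+7] = x[i+7] & y[i+7]
    }
    for i := n; i < len(x); i++ { // scalar tail for the last <8 bytes
        z[i] = x[i] & y[i]
    }
    return z
}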
Any faster way? Go assembly?
Thank you in advance.
I wrote a Go assembly implementation using AVX2 instructions after two days of learning (Go) assembly.
The performance is good, about 10x the simple loop version, though optimizations for compatibility and performance are still needed. Suggestions and PRs are welcome.
Note: the code and benchmark results below have been updated.
I thank @PeterCordes for many valuable suggestions.
#include "textflag.h"
// func AND(x []byte, y []byte)
// Requires: AVX
TEXT ·AND(SB), NOSPLIT|NOPTR, $0-48
// pointer of x
MOVQ x_base+0(FP), AX
// length of x
MOVQ x_len+8(FP), CX
// pointer of y
MOVQ y_base+24(FP), DX
// --------------------------------------------
// end address of x, will not change: p + n
MOVQ AX, BX
ADDQ CX, BX
// end address for loop
// n <= 8, jump to tail
CMPQ CX, $0x00000008
JLE tail
// n < 16, jump to loop8
CMPQ CX, $0x00000010
JL loop8_start
// n < 32, jump to loop16
CMPQ CX, $0x00000020
JL loop16_start
// --------------------------------------------
// end address for loop32
MOVQ BX, CX
SUBQ $0x0000001f, CX
loop32:
// compute x & y, and save value to x
VMOVDQU (AX), Y0
VANDPS (DX), Y0, Y0
VMOVDQU Y0, (AX)
// move pointer
ADDQ $0x00000020, AX
ADDQ $0x00000020, DX
CMPQ AX, CX
JL loop32
// n <= 8, jump to tail
MOVQ BX, CX
SUBQ AX, CX
CMPQ CX, $0x00000008
JLE tail
// n < 16, jump to loop8
CMPQ CX, $0x00000010
JL loop8_start
// --------------------------------------------
loop16_start:
// end address for loop16
MOVQ BX, CX
SUBQ $0x0000000f, CX
loop16:
// compute x & y, and save value to x
VMOVDQU (AX), X0
VANDPS (DX), X0, X0
VMOVDQU X0, (AX)
// move pointer
ADDQ $0x00000010, AX
ADDQ $0x00000010, DX
CMPQ AX, CX
JL loop16
// n <= 8, jump to tail
MOVQ BX, CX
SUBQ AX, CX
CMPQ CX, $0x00000008
JLE tail
// --------------------------------------------
loop8_start:
// end address for loop8
MOVQ BX, CX
SUBQ $0x00000007, CX
loop8:
// compute x & y, and save value to x
MOVQ (AX), BX
ANDQ (DX), BX
MOVQ BX, (AX)
// move pointer
ADDQ $0x00000008, AX
ADDQ $0x00000008, DX
CMPQ AX, CX
JL loop8
// --------------------------------------------
tail:
// left elements (<=8)
MOVQ (AX), BX
ANDQ (DX), BX
MOVQ BX, (AX)
RET
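On the Go side the routine needs a matching stub declaration in the same package as the .s file; based on the signature comment above, it would look something like this (the //go:noescape directive is my assumption, not from the original post):
// AND computes x & y byte-wise and writes the result back into x, in place.
//go:noescape
func AND(x []byte, y []byte)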
Benchmark result:
test                 data-size   time
-------------------  ----------  ------------
BenchmarkGrailbio    8.00_B      4.654 ns/op
BenchmarkGoAsm       8.00_B      4.824 ns/op
BenchmarkUnrollLoop  8.00_B      6.851 ns/op
BenchmarkLoop        8.00_B      8.683 ns/op
BenchmarkGrailbio    16.00_B     5.363 ns/op
BenchmarkGoAsm       16.00_B     6.369 ns/op
BenchmarkUnrollLoop  16.00_B     10.47 ns/op
BenchmarkLoop        16.00_B     13.48 ns/op
BenchmarkGoAsm       32.00_B     6.079 ns/op
BenchmarkGrailbio    32.00_B     6.497 ns/op
BenchmarkUnrollLoop  32.00_B     17.46 ns/op
BenchmarkLoop        32.00_B     21.09 ns/op
BenchmarkGoAsm       128.00_B    10.52 ns/op
BenchmarkGrailbio    128.00_B    14.40 ns/op
BenchmarkUnrollLoop  128.00_B    56.97 ns/op
BenchmarkLoop        128.00_B    80.12 ns/op
BenchmarkGoAsm       256.00_B    15.48 ns/op
BenchmarkGrailbio    256.00_B    23.76 ns/op
BenchmarkUnrollLoop  256.00_B    110.8 ns/op
BenchmarkLoop        256.00_B    147.5 ns/op
BenchmarkGoAsm       1.00_KB     47.16 ns/op
BenchmarkGrailbio    1.00_KB     87.75 ns/op
BenchmarkUnrollLoop  1.00_KB     443.1 ns/op
BenchmarkLoop        1.00_KB     540.5 ns/op
BenchmarkGoAsm       16.00_KB    751.6 ns/op
BenchmarkGrailbio    16.00_KB    1342 ns/op
BenchmarkUnrollLoop  16.00_KB    7007 ns/op
BenchmarkLoop        16.00_KB    8623 ns/op

Why is there a performance difference when I pass a slice argument as value or a pointer?

I have the following code:
func AddToSliceByValue(mySlice []int) {
    for idx := range mySlice {
        mySlice[idx]++
    }
}

func AddToSliceByPointer(mySlice *[]int) {
    for idx := range *mySlice {
        (*mySlice)[idx]++
    }
}
My first thought was that the performance should be nearly the same, because pass by value copies the slice header while pass by pointer forces me to dereference a pointer, but my benchmark shows something else:
func BenchmarkAddByValue(b *testing.B) {
    mySlice := rand.Perm(1000)
    for n := 0; n < b.N; n++ {
        AddToSliceByValue(mySlice)
    }
}

func BenchmarkAddByPointer(b *testing.B) {
    mySlice := rand.Perm(1000)
    for n := 0; n < b.N; n++ {
        AddToSliceByPointer(&mySlice)
    }
}
BenchmarkAddByValue-12 1151256 1035 ns/op
BenchmarkAddByPointer-12 2145110 525 ns/op
Can anyone explain to me why the difference in performance is so great?
I also added the assembly code for the two functions.
Assembly code for pass by value:
TEXT main.AddToSliceByValue(SB) /go_test/pointer/pointer_value.go
pointer_value.go:4 0x1056f60 488b442410 MOVQ 0x10(SP), AX
pointer_value.go:4 0x1056f65 488b4c2408 MOVQ 0x8(SP), CX
pointer_value.go:4 0x1056f6a 31d2 XORL DX, DX
pointer_value.go:4 0x1056f6c eb0e JMP 0x1056f7c
pointer_value.go:5 0x1056f6e 488b1cd1 MOVQ 0(CX)(DX*8), BX
pointer_value.go:5 0x1056f72 48ffc3 INCQ BX
pointer_value.go:5 0x1056f75 48891cd1 MOVQ BX, 0(CX)(DX*8)
pointer_value.go:4 0x1056f79 48ffc2 INCQ DX
pointer_value.go:4 0x1056f7c 4839c2 CMPQ AX, DX
pointer_value.go:4 0x1056f7f 7ced JL 0x1056f6e
pointer_value.go:4 0x1056f81 c3 RET
:-1 0x1056f82 cc INT $0x3
:-1 0x1056f83 cc INT $0x3
:-1 0x1056f84 cc INT $0x3
:-1 0x1056f85 cc INT $0x3
:-1 0x1056f86 cc INT $0x3
:-1 0x1056f87 cc INT $0x3
:-1 0x1056f88 cc INT $0x3
:-1 0x1056f89 cc INT $0x3
:-1 0x1056f8a cc INT $0x3
:-1 0x1056f8b cc INT $0x3
:-1 0x1056f8c cc INT $0x3
:-1 0x1056f8d cc INT $0x3
:-1 0x1056f8e cc INT $0x3
:-1 0x1056f8f cc INT $0x3
TEXT main.main(SB) /go_test/pointer/pointer_value.go
pointer_value.go:9 0x1056f90 65488b0c2530000000 MOVQ GS:0x30, CX
pointer_value.go:9 0x1056f99 483b6110 CMPQ 0x10(CX), SP
pointer_value.go:9 0x1056f9d 0f86a8000000 JBE 0x105704b
pointer_value.go:9 0x1056fa3 4883ec70 SUBQ $0x70, SP
pointer_value.go:9 0x1056fa7 48896c2468 MOVQ BP, 0x68(SP)
pointer_value.go:9 0x1056fac 488d6c2468 LEAQ 0x68(SP), BP
pointer_value.go:11 0x1056fb1 488d7c2418 LEAQ 0x18(SP), DI
pointer_value.go:11 0x1056fb6 0f57c0 XORPS X0, X0
pointer_value.go:11 0x1056fb9 488d7fd0 LEAQ -0x30(DI), DI
pointer_value.go:11 0x1056fbd 48896c24f0 MOVQ BP, -0x10(SP)
pointer_value.go:11 0x1056fc2 488d6c24f0 LEAQ -0x10(SP), BP
pointer_value.go:11 0x1056fc7 e849c6ffff CALL 0x1053615
pointer_value.go:11 0x1056fcc 488b6d00 MOVQ 0(BP), BP
pointer_value.go:11 0x1056fd0 48c744242001000000 MOVQ $0x1, 0x20(SP)
pointer_value.go:11 0x1056fd9 48c744242802000000 MOVQ $0x2, 0x28(SP)
pointer_value.go:11 0x1056fe2 48c744243003000000 MOVQ $0x3, 0x30(SP)
pointer_value.go:11 0x1056feb 48c744243804000000 MOVQ $0x4, 0x38(SP)
pointer_value.go:11 0x1056ff4 48c744244005000000 MOVQ $0x5, 0x40(SP)
pointer_value.go:11 0x1056ffd 48c744244806000000 MOVQ $0x6, 0x48(SP)
pointer_value.go:11 0x1057006 48c744245007000000 MOVQ $0x7, 0x50(SP)
pointer_value.go:11 0x105700f 48c744245808000000 MOVQ $0x8, 0x58(SP)
pointer_value.go:11 0x1057018 48c744246009000000 MOVQ $0x9, 0x60(SP)
pointer_value.go:12 0x1057021 488d442418 LEAQ 0x18(SP), AX
pointer_value.go:12 0x1057026 48890424 MOVQ AX, 0(SP)
pointer_value.go:12 0x105702a 48c74424080a000000 MOVQ $0xa, 0x8(SP)
pointer_value.go:12 0x1057033 48c74424100a000000 MOVQ $0xa, 0x10(SP)
pointer_value.go:12 0x105703c e81fffffff CALL main.AddToSliceByValue(SB)
pointer_value.go:13 0x1057041 488b6c2468 MOVQ 0x68(SP), BP
pointer_value.go:13 0x1057046 4883c470 ADDQ $0x70, SP
pointer_value.go:13 0x105704a c3 RET
pointer_value.go:9 0x105704b e8909cffff CALL runtime.morestack_noctxt(SB)
pointer_value.go:9 0x1057050 e93bffffff JMP main.main(SB)
assembly code for pass by pointer:
TEXT main.AddToSliceByPointer(SB) /go_test/pointer/pointer_ref.go
pointer_ref.go:3 0x1056f60 4883ec18 SUBQ $0x18, SP
pointer_ref.go:3 0x1056f64 48896c2410 MOVQ BP, 0x10(SP)
pointer_ref.go:3 0x1056f69 488d6c2410 LEAQ 0x10(SP), BP
pointer_ref.go:4 0x1056f6e 488b542420 MOVQ 0x20(SP), DX
pointer_ref.go:4 0x1056f73 488b5a08 MOVQ 0x8(DX), BX
pointer_ref.go:4 0x1056f77 31c0 XORL AX, AX
pointer_ref.go:4 0x1056f79 eb0e JMP 0x1056f89
pointer_ref.go:5 0x1056f7b 488b3cc6 MOVQ 0(SI)(AX*8), DI
pointer_ref.go:5 0x1056f7f 48ffc7 INCQ DI
pointer_ref.go:5 0x1056f82 48893cc6 MOVQ DI, 0(SI)(AX*8)
pointer_ref.go:4 0x1056f86 48ffc0 INCQ AX
pointer_ref.go:4 0x1056f89 4839d8 CMPQ BX, AX
pointer_ref.go:4 0x1056f8c 7d0e JGE 0x1056f9c
pointer_ref.go:5 0x1056f8e 488b4a08 MOVQ 0x8(DX), CX
pointer_ref.go:5 0x1056f92 488b32 MOVQ 0(DX), SI
pointer_ref.go:5 0x1056f95 4839c8 CMPQ CX, AX
pointer_ref.go:5 0x1056f98 72e1 JB 0x1056f7b
pointer_ref.go:5 0x1056f9a eb0a JMP 0x1056fa6
pointer_ref.go:4 0x1056f9c 488b6c2410 MOVQ 0x10(SP), BP
pointer_ref.go:4 0x1056fa1 4883c418 ADDQ $0x18, SP
pointer_ref.go:4 0x1056fa5 c3 RET
pointer_ref.go:5 0x1056fa6 e8b5c4ffff CALL runtime.panicIndex(SB)
pointer_ref.go:5 0x1056fab 90 NOPL
:-1 0x1056fac cc INT $0x3
:-1 0x1056fad cc INT $0x3
:-1 0x1056fae cc INT $0x3
:-1 0x1056faf cc INT $0x3
TEXT main.main(SB) /go_test/pointer/pointer_ref.go
pointer_ref.go:9 0x1056fb0 65488b0c2530000000 MOVQ GS:0x30, CX
pointer_ref.go:9 0x1056fb9 483b6110 CMPQ 0x10(CX), SP
pointer_ref.go:9 0x1056fbd 0f86b2000000 JBE 0x1057075
pointer_ref.go:9 0x1056fc3 4883ec78 SUBQ $0x78, SP
pointer_ref.go:9 0x1056fc7 48896c2470 MOVQ BP, 0x70(SP)
pointer_ref.go:9 0x1056fcc 488d6c2470 LEAQ 0x70(SP), BP
pointer_ref.go:11 0x1056fd1 488d7c2408 LEAQ 0x8(SP), DI
pointer_ref.go:11 0x1056fd6 0f57c0 XORPS X0, X0
pointer_ref.go:11 0x1056fd9 488d7fd0 LEAQ -0x30(DI), DI
pointer_ref.go:11 0x1056fdd 48896c24f0 MOVQ BP, -0x10(SP)
pointer_ref.go:11 0x1056fe2 488d6c24f0 LEAQ -0x10(SP), BP
pointer_ref.go:11 0x1056fe7 e829c6ffff CALL 0x1053615
pointer_ref.go:11 0x1056fec 488b6d00 MOVQ 0(BP), BP
pointer_ref.go:11 0x1056ff0 48c744241001000000 MOVQ $0x1, 0x10(SP)
pointer_ref.go:11 0x1056ff9 48c744241802000000 MOVQ $0x2, 0x18(SP)
pointer_ref.go:11 0x1057002 48c744242003000000 MOVQ $0x3, 0x20(SP)
pointer_ref.go:11 0x105700b 48c744242804000000 MOVQ $0x4, 0x28(SP)
pointer_ref.go:11 0x1057014 48c744243005000000 MOVQ $0x5, 0x30(SP)
pointer_ref.go:11 0x105701d 48c744243806000000 MOVQ $0x6, 0x38(SP)
pointer_ref.go:11 0x1057026 48c744244007000000 MOVQ $0x7, 0x40(SP)
pointer_ref.go:11 0x105702f 48c744244808000000 MOVQ $0x8, 0x48(SP)
pointer_ref.go:11 0x1057038 48c744245009000000 MOVQ $0x9, 0x50(SP)
pointer_ref.go:11 0x1057041 488d442408 LEAQ 0x8(SP), AX
pointer_ref.go:11 0x1057046 4889442458 MOVQ AX, 0x58(SP)
pointer_ref.go:11 0x105704b 48c74424600a000000 MOVQ $0xa, 0x60(SP)
pointer_ref.go:11 0x1057054 48c74424680a000000 MOVQ $0xa, 0x68(SP)
pointer_ref.go:12 0x105705d 488d442458 LEAQ 0x58(SP), AX
pointer_ref.go:12 0x1057062 48890424 MOVQ AX, 0(SP)
pointer_ref.go:12 0x1057066 e8f5feffff CALL main.AddToSliceByPointer(SB)
pointer_ref.go:13 0x105706b 488b6c2470 MOVQ 0x70(SP), BP
pointer_ref.go:13 0x1057070 4883c478 ADDQ $0x78, SP
pointer_ref.go:13 0x1057074 c3 RET
pointer_ref.go:9 0x1057075 e8669cffff CALL runtime.morestack_noctxt(SB)
pointer_ref.go:9 0x105707a e931ffffff JMP main.main(SB)
I could not reproduce your benchmark...
package main_test

import (
    "math/rand"
    "testing"
)

func AddToSliceByValue(mySlice []int) {
    for idx := range mySlice {
        mySlice[idx]++
    }
}

func AddToSliceByPointer(mySlice *[]int) {
    for idx := range *mySlice {
        (*mySlice)[idx]++
    }
}

func BenchmarkAddByValue(b *testing.B) {
    mySlice := rand.Perm(1000)
    for n := 0; n < b.N; n++ {
        AddToSliceByValue(mySlice)
    }
}

func BenchmarkAddByPointer(b *testing.B) {
    mySlice := rand.Perm(1000)
    for n := 0; n < b.N; n++ {
        AddToSliceByPointer(&mySlice)
    }
}
$ go test -bench=. -benchmem -count=4
goos: linux
goarch: amd64
pkg: test/bencslice
BenchmarkAddByValue-4 3010280 385 ns/op 0 B/op 0 allocs/op
BenchmarkAddByValue-4 3118990 385 ns/op 0 B/op 0 allocs/op
BenchmarkAddByValue-4 3117450 384 ns/op 0 B/op 0 allocs/op
BenchmarkAddByValue-4 3109251 386 ns/op 0 B/op 0 allocs/op
BenchmarkAddByPointer-4 2012487 610 ns/op 0 B/op 0 allocs/op
BenchmarkAddByPointer-4 2009690 594 ns/op 0 B/op 0 allocs/op
BenchmarkAddByPointer-4 2009222 594 ns/op 0 B/op 0 allocs/op
BenchmarkAddByPointer-4 1850820 596 ns/op 0 B/op 0 allocs/op
PASS
ok test/bencslice 13.476s
$ go version
go version go1.15.2 linux/amd64
Anyway, the behavior may depend on many factors, first of all the version of the runtime. Understanding the internals is of little interest as long as you can test, reproduce, and monitor.
I found out that my variance was too high:
AddByValue-12 5.41µs ±15%
AddByPointer-12 5.30µs ± 4%
With go test -benchmem -count 5 -benchtime=1000000x -bench=. ./... I was able to reduce the variance in the test results and could confirm my first assumption that the results should be approximately equal:
AddByValue-12 5.04µs ± 1%
AddByPointer-12 5.17µs ± 1%
According to the comments, the main reason for the high variance was that I did not reset the timer after the benchmark setup.
With the following code and a lower benchtime I also reduced the variance:
func BenchmarkAddByValue(b *testing.B) {
    mySlice := rand.Perm(10000)
    b.ResetTimer()
    for n := 0; n < b.N; n++ {
        AddToSliceByValue(mySlice)
    }
}

func BenchmarkAddByPointer(b *testing.B) {
    mySlice := rand.Perm(10000)
    b.ResetTimer()
    for n := 0; n < b.N; n++ {
        AddToSliceByPointer(&mySlice)
    }
}
Results:
AddByValue-12 5.03µs ± 0%
AddByPointer-12 5.17µs ± 1%
Thanks a lot for your help!
This is a general issue with "lower level" languages: when you pass by value, you are actually copying the data. Here is how that works.
When you pass by reference:
A copy of the reference is created and passed to the method (a reference is typically 8 bytes, so this is fast).
You read the data through the reference (also fast, as long as the referenced data is in the CPU cache).
In the case of pass by value:
Space is reserved in the callee's frame for the copy; for a Go slice this is only the 24-byte header (pointer, length, capacity), not the backing array, so no heap allocation happens (fast).
The value is copied into that space (fast for a slice header; potentially slow for a large struct).
The elements are then accessed through the copied header (fast or slow, depending on whether the data is in cache).
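A quick way to see how little a by-value slice copy actually moves (a standalone demo, not from the thread):
package main

import (
    "fmt"
    "unsafe"
)

func main() {
    s := make([]int, 1000)
    fmt.Println(unsafe.Sizeof(s))  // 24: the header (pointer, len, cap) is all that gets copied
    fmt.Println(unsafe.Sizeof(&s)) // 8: a pointer copies even less, but adds an indirection
}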

Where is the implementation of func append in Go?

I'm very interested in Go and am trying to read the implementations of Go's functions, but I found that some of these functions don't have implementations in the source.
Such as append or call:
// The append built-in function appends elements to the end of a slice. If
// it has sufficient capacity, the destination is resliced to accommodate the
// new elements. If it does not, a new underlying array will be allocated.
// Append returns the updated slice. It is therefore necessary to store the
// result of append, often in the variable holding the slice itself:
// slice = append(slice, elem1, elem2)
// slice = append(slice, anotherSlice...)
// As a special case, it is legal to append a string to a byte slice, like this:
// slice = append([]byte("hello "), "world"...)
func append(slice []Type, elems ...Type) []Type
// call calls fn with a copy of the n argument bytes pointed at by arg.
// After fn returns, reflectcall copies n-retoffset result bytes
// back into arg+retoffset before returning. If copying result bytes back,
// the caller must pass the argument frame type as argtype, so that
// call can execute appropriate write barriers during the copy.
func call(argtype *rtype, fn, arg unsafe.Pointer, n uint32, retoffset uint32)
It doesn't seem to be calling C code, because using cgo requires special comments.
Where are these functions implemented?
The code you are reading and citing is just dummy code to have consistent documentation. The built-in functions are, well, built into the language and, as such, are included in the code processing step (the compiler).
Simplified, what happens is: the lexer detects append(...) as an APPEND token; the parser translates APPEND, depending on the circumstances/parameters/environment, into code; and that code is emitted as assembly and assembled. The middle step - the implementation of append - can be found in the compiler here.
What happens to an append call is best seen when looking at the assembly of an example program. Consider this:
b := []byte{'a'}
b = append(b, 'b')
println(string(b), cap(b))
Running it will yield the following output:
ab 2
The append call is translated to assembly like this:
// create new slice object
MOVQ BX, "".b+120(SP) // BX contains data addr., write to b.addr
MOVQ BX, CX // store addr. in CX
MOVQ AX, "".b+128(SP) // AX contains len(b) == 1, write to b.len
MOVQ DI, "".b+136(SP) // DI contains cap(b) == 1, write to b.cap
MOVQ AX, BX // BX now contains len(b)
INCQ BX // BX++
CMPQ BX, DI // compare new length (2) with cap (1)
JHI $1, 225 // jump to grow code if len > cap
...
LEAQ (CX)(AX*1), BX // load address of newly allocated slice entry
MOVB $98, (BX) // write 'b' to loaded address
// grow code, call runtime.growslice(t *slicetype, old slice, cap int)
LEAQ type.[]uint8(SB), BP
MOVQ BP, (SP) // load parameters onto stack
MOVQ CX, 8(SP)
MOVQ AX, 16(SP)
MOVQ SI, 24(SP)
MOVQ BX, 32(SP)
PCDATA $0, $0
CALL runtime.growslice(SB) // call
MOVQ 40(SP), DI
MOVQ 48(SP), R8
MOVQ 56(SP), SI
MOVQ R8, AX
INCQ R8
MOVQ DI, CX
JMP 108 // jump back, growing done
As you can see, no CALL statement to a function called append can be seen. This is the full implementation of the append call in the example code. Another call with different parameters will look differently (other registers, different parameters depending on the slice type, etc.).
The Go append builtin function code is generated by the Go gc and gccgo compilers and uses Go package runtime functions (for example, runtime.growslice()) in go/src/runtime/slice.go.
For example,
package main

func main() {
    b := []int{0, 1}
    b = append(b, 2)
}
Go pseudo-assembler:
$ go tool compile -S a.go
"".main t=1 size=192 value=0 args=0x0 locals=0x68
0x0000 00000 (a.go:3) TEXT "".main(SB), $104-0
0x0000 00000 (a.go:3) MOVQ (TLS), CX
0x0009 00009 (a.go:3) CMPQ SP, 16(CX)
0x000d 00013 (a.go:3) JLS 167
0x0013 00019 (a.go:3) SUBQ $104, SP
0x0017 00023 (a.go:3) FUNCDATA $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0017 00023 (a.go:3) FUNCDATA $1, gclocals·790e5cc5051fc0affc980ade09e929ec(SB)
0x0017 00023 (a.go:4) LEAQ "".autotmp_0002+64(SP), BX
0x001c 00028 (a.go:4) MOVQ BX, CX
0x001f 00031 (a.go:4) NOP
0x001f 00031 (a.go:4) MOVQ "".statictmp_0000(SB), BP
0x0026 00038 (a.go:4) MOVQ BP, (BX)
0x0029 00041 (a.go:4) MOVQ "".statictmp_0000+8(SB), BP
0x0030 00048 (a.go:4) MOVQ BP, 8(BX)
0x0034 00052 (a.go:4) NOP
0x0034 00052 (a.go:4) MOVQ $2, AX
0x003b 00059 (a.go:4) MOVQ $2, DX
0x0042 00066 (a.go:5) MOVQ CX, "".b+80(SP)
0x0047 00071 (a.go:5) MOVQ AX, "".b+88(SP)
0x004c 00076 (a.go:5) MOVQ DX, "".b+96(SP)
0x0051 00081 (a.go:5) MOVQ AX, BX
0x0054 00084 (a.go:5) INCQ BX
0x0057 00087 (a.go:5) CMPQ BX, DX
0x005a 00090 (a.go:5) JHI $1, 108
0x005c 00092 (a.go:5) LEAQ (CX)(AX*8), BX
0x0060 00096 (a.go:5) MOVQ $2, (BX)
0x0067 00103 (a.go:6) ADDQ $104, SP
0x006b 00107 (a.go:6) RET
0x006c 00108 (a.go:5) LEAQ type.[]int(SB), BP
0x0073 00115 (a.go:5) MOVQ BP, (SP)
0x0077 00119 (a.go:5) MOVQ CX, 8(SP)
0x007c 00124 (a.go:5) MOVQ AX, 16(SP)
0x0081 00129 (a.go:5) MOVQ DX, 24(SP)
0x0086 00134 (a.go:5) MOVQ BX, 32(SP)
0x008b 00139 (a.go:5) PCDATA $0, $0
0x008b 00139 (a.go:5) CALL runtime.growslice(SB)
0x0090 00144 (a.go:5) MOVQ 40(SP), CX
0x0095 00149 (a.go:5) MOVQ 48(SP), AX
0x009a 00154 (a.go:5) MOVQ 56(SP), DX
0x009f 00159 (a.go:5) MOVQ AX, BX
0x00a2 00162 (a.go:5) INCQ BX
0x00a5 00165 (a.go:5) JMP 92
0x00a7 00167 (a.go:3) CALL runtime.morestack_noctxt(SB)
0x00ac 00172 (a.go:3) JMP 0
To add to the assembly code given by the others, you can find the Go (1.5.1) code for gc here: https://github.com/golang/go/blob/f2e4c8b5fb3660d793b2c545ef207153db0a34b1/src/cmd/compile/internal/gc/walk.go#L2895
// expand append(l1, l2...) to
// init {
// s := l1
// if n := len(l1) + len(l2) - cap(s); n > 0 {
// s = growslice_n(s, n)
// }
// s = s[:len(l1)+len(l2)]
// memmove(&s[len(l1)], &l2[0], len(l2)*sizeof(T))
// }
// s
//
// l2 is allowed to be a string.
with growslice_n being defined here: https://github.com/golang/go/blob/f2e4c8b5fb3660d793b2c545ef207153db0a34b1/src/runtime/slice.go#L36
// growslice_n is a variant of growslice that takes the number of new elements
// instead of the new minimum capacity.
// TODO(rsc): This is used by append(slice, slice...).
// The compiler should change that code to use growslice directly (issue #11419).
func growslice_n(t *slicetype, old slice, n int) slice {
    if n < 1 {
        panic(errorString("growslice: invalid n"))
    }
    return growslice(t, old, old.cap+n)
}

// growslice handles slice growth during append.
// It is passed the slice type, the old slice, and the desired new minimum capacity,
// and it returns a new slice with at least that capacity, with the old data
// copied into it.
func growslice(t *slicetype, old slice, cap int) slice {
    if cap < old.cap || t.elem.size > 0 && uintptr(cap) > _MaxMem/uintptr(t.elem.size) {
        panic(errorString("growslice: cap out of range"))
    }
    if raceenabled {
        callerpc := getcallerpc(unsafe.Pointer(&t))
        racereadrangepc(old.array, uintptr(old.len*int(t.elem.size)), callerpc, funcPC(growslice))
    }
    et := t.elem
    if et.size == 0 {
        // append should not create a slice with nil pointer but non-zero len.
        // We assume that append doesn't need to preserve old.array in this case.
        return slice{unsafe.Pointer(&zerobase), old.len, cap}
    }
    newcap := old.cap
    if newcap+newcap < cap {
        newcap = cap
    } else {
        for {
            if old.len < 1024 {
                newcap += newcap
            } else {
                newcap += newcap / 4
            }
            if newcap >= cap {
                break
            }
        }
    }
    if uintptr(newcap) >= _MaxMem/uintptr(et.size) {
        panic(errorString("growslice: cap out of range"))
    }
    lenmem := uintptr(old.len) * uintptr(et.size)
    capmem := roundupsize(uintptr(newcap) * uintptr(et.size))
    newcap = int(capmem / uintptr(et.size))
    var p unsafe.Pointer
    if et.kind&kindNoPointers != 0 {
        p = rawmem(capmem)
        memmove(p, old.array, lenmem)
        memclr(add(p, lenmem), capmem-lenmem)
    } else {
        // Note: can't use rawmem (which avoids zeroing of memory), because then GC can scan uninitialized memory.
        p = newarray(et, uintptr(newcap))
        if !writeBarrierEnabled {
            memmove(p, old.array, lenmem)
        } else {
            for i := uintptr(0); i < lenmem; i += et.size {
                typedmemmove(et, add(p, i), add(old.array, i))
            }
        }
    }
    return slice{p, old.len, newcap}
}
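The doubling policy in growslice is easy to observe from ordinary Go code (a small demo; exact capacities can vary between Go versions because of allocator size-class rounding):
package main

import "fmt"

func main() {
    var s []int
    for i := 0; i < 17; i++ {
        s = append(s, i)
        fmt.Println(len(s), cap(s)) // cap grows 1, 2, 4, 8, 16, 32, ...
    }
}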

Comparing AX register against zero

I have an assembly program to write. I need to check the AX register: if AX is greater than 0, move +1 into BX; if AX is less than 0, move -1 into BX; and if AX = 0, move 0 into BX. I have the following code that does it, but I am looking for an alternate solution. Please help out. Thanks
    CMP AX, 0
    JG GREATER
    JL LESS
    MOV BX, 0
GREATER:
    MOV BX, 1
LESS:
    MOV BX, -1
The code you gave always returns -1. Try this:
    CMP AX, 0
    JG GREATER
    JL LESS
    MOV BX, 0
    JMP END
GREATER:
    MOV BX, 1
    JMP END
LESS:
    MOV BX, -1
END:
Try this, which only requires a single conditional branch and no unconditional jumps:
    mov bx, ax   // copy ax to bx
    sarw bx, 15  // arithmetic shift - any -ve => -1, 0 or +ve => 0
    cmp ax, 0    // compare original number to zero
    jle end      // if it's <=, we're done
    mov bx, 1    // else bx = 1
end:
NB - my x86 code is very very rusty. Also, that version of sar wasn't in the 8086, but was in the 286 and later, and didn't get particularly speedy until the 80386.
EDIT I think I found a better version for 386+ without any branches. Note that setg clobbers BL, which would break the negative case (BH holds 0xFF, not 0, after the shift), so the -1 has to be OR-ed back in:
    mov bx, ax   // copy ax to bx
    sarw bx, 15  // arithmetic shift - any -ve => -1, 0 or +ve => 0
    cmp ax, 0    // compare original to zero
    setg bl      // if it was greater, bl = 1
    or bl, bh    // bh is 0xFF when ax was negative; this restores bx = -1
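For comparison, the same idea expressed in Go (an illustration only, not part of the original question):
// sign returns -1, 0 or +1 for a 16-bit value, mirroring the assembly above:
// an arithmetic shift maps negatives to -1 and everything else to 0, then a
// single comparison patches up the positive case.
func sign(x int16) int16 {
    s := x >> 15 // arithmetic shift: -1 if x < 0, else 0
    if x > 0 {
        s = 1
    }
    return s
}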
