I'd like to perform a bitwise AND on every column of a byte matrix, which is stored as a [][]byte in Go. I created a repo with runnable test code.
The problem can be simplified to a bitwise AND over two byte slices of equal length. The simplest way is a for loop that handles each pair of bytes:
func and(x, y []byte) []byte {
	z := make([]byte, len(x))
	for i := 0; i < len(x); i++ {
		z[i] = x[i] & y[i]
	}
	return z
}
However, it's very slow for long slices. A faster way is to unroll the for loop (see the benchmark results):
BenchmarkLoop-16        14467  84265 ns/op
BenchmarkUnrollLoop-16  17668  67550 ns/op
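For reference, here is a sketch of what the unrolled version might look like (an assumption on my part; the exact unroll factor in the repo may differ):

func andUnroll(x, y []byte) []byte {
	z := make([]byte, len(x))
	i := 0
	// process 8 bytes per iteration
	for ; i+8 <= len(x); i += 8 {
		z[i] = x[i] & y[i]
		z[i+1] = x[i+1] & y[i+1]
		z[i+2] = x[i+2] & y[i+2]
		z[i+3] = x[i+3] & y[i+3]
		z[i+4] = x[i+4] & y[i+4]
		z[i+5] = x[i+5] & y[i+5]
		z[i+6] = x[i+6] & y[i+6]
		z[i+7] = x[i+7] & y[i+7]
	}
	// handle the remaining tail bytes
	for ; i < len(x); i++ {
		z[i] = x[i] & y[i]
	}
	return z
}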
Any faster way? Go assembly?
Thank you in advance.
I wrote a Go assembly implementation using AVX2 instructions after two days of learning (Go) assembly.
The performance is good, about 10x the simple loop version, though optimizations for compatibility and performance are still needed. Suggestions and PRs are welcome.
Note: the code and benchmark results have been updated.
Thanks to @PeterCordes for many valuable suggestions.
#include "textflag.h"
// func AND(x []byte, y []byte)
// Requires: AVX
TEXT ·AND(SB), NOSPLIT|NOPTR, $0-48
// pointer of x
MOVQ x_base+0(FP), AX
// length of x
MOVQ x_len+8(FP), CX
// pointer of y
MOVQ y_base+24(FP), DX
// --------------------------------------------
// end address of x, will not change: p + n
MOVQ AX, BX
ADDQ CX, BX
// end address for loop
// n <= 8, jump to tail
CMPQ CX, $0x00000008
JLE tail
// n < 16, jump to loop8
CMPQ CX, $0x00000010
JL loop8_start
// n < 32, jump to loop16
CMPQ CX, $0x00000020
JL loop16_start
// --------------------------------------------
// end address for loop32
MOVQ BX, CX
SUBQ $0x0000001f, CX
loop32:
// compute x & y, and save value to x
VMOVDQU (AX), Y0
VANDPS (DX), Y0, Y0
VMOVDQU Y0, (AX)
// move pointer
ADDQ $0x00000020, AX
ADDQ $0x00000020, DX
CMPQ AX, CX
JL loop32
// n <= 8, jump to tail
MOVQ BX, CX
SUBQ AX, CX
CMPQ CX, $0x00000008
JLE tail
// n < 16, jump to loop8
CMPQ CX, $0x00000010
JL loop8_start
// --------------------------------------------
loop16_start:
// end address for loop16
MOVQ BX, CX
SUBQ $0x0000000f, CX
loop16:
// compute x & y, and save value to x
VMOVDQU (AX), X0
VANDPS (DX), X0, X0
VMOVDQU X0, (AX)
// move pointer
ADDQ $0x00000010, AX
ADDQ $0x00000010, DX
CMPQ AX, CX
JL loop16
// n <= 8, jump to tail
MOVQ BX, CX
SUBQ AX, CX
CMPQ CX, $0x00000008
JLE tail
// --------------------------------------------
loop8_start:
// end address for loop8
MOVQ BX, CX
SUBQ $0x00000007, CX
loop8:
// compute x & y, and save value to x
MOVQ (AX), BX
ANDQ (DX), BX
MOVQ BX, (AX)
// move pointer
ADDQ $0x00000008, AX
ADDQ $0x00000008, DX
CMPQ AX, CX
JL loop8
// --------------------------------------------
tail:
// left elements (<=8)
MOVQ (AX), BX
ANDQ (DX), BX
MOVQ BX, (AX)
RET
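To call the assembly from Go, the package also needs a stub declaration in a .go file (a minimal sketch; the actual declaration in the repo may differ):

// AND computes x & y element-wise, writing the result back into x.
// The implementation is in the assembly file above.
//go:noescape
func AND(x []byte, y []byte)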
Benchmark result:
test                 data-size  time
-------------------  ---------  -----------
BenchmarkGrailbio    8.00_B     4.654 ns/op
BenchmarkGoAsm       8.00_B     4.824 ns/op
BenchmarkUnrollLoop  8.00_B     6.851 ns/op
BenchmarkLoop        8.00_B     8.683 ns/op
BenchmarkGrailbio    16.00_B    5.363 ns/op
BenchmarkGoAsm       16.00_B    6.369 ns/op
BenchmarkUnrollLoop  16.00_B    10.47 ns/op
BenchmarkLoop        16.00_B    13.48 ns/op
BenchmarkGoAsm       32.00_B    6.079 ns/op
BenchmarkGrailbio    32.00_B    6.497 ns/op
BenchmarkUnrollLoop  32.00_B    17.46 ns/op
BenchmarkLoop        32.00_B    21.09 ns/op
BenchmarkGoAsm       128.00_B   10.52 ns/op
BenchmarkGrailbio    128.00_B   14.40 ns/op
BenchmarkUnrollLoop  128.00_B   56.97 ns/op
BenchmarkLoop        128.00_B   80.12 ns/op
BenchmarkGoAsm       256.00_B   15.48 ns/op
BenchmarkGrailbio    256.00_B   23.76 ns/op
BenchmarkUnrollLoop  256.00_B   110.8 ns/op
BenchmarkLoop        256.00_B   147.5 ns/op
BenchmarkGoAsm       1.00_KB    47.16 ns/op
BenchmarkGrailbio    1.00_KB    87.75 ns/op
BenchmarkUnrollLoop  1.00_KB    443.1 ns/op
BenchmarkLoop        1.00_KB    540.5 ns/op
BenchmarkGoAsm       16.00_KB   751.6 ns/op
BenchmarkGrailbio    16.00_KB   1342 ns/op
BenchmarkUnrollLoop  16.00_KB   7007 ns/op
BenchmarkLoop        16.00_KB   8623 ns/op
Related
I have the following code:
func AddToSliceByValue(mySlice []int) {
	for idx := range mySlice {
		mySlice[idx]++
	}
}

func AddToSliceByPointer(mySlice *[]int) {
	for idx := range *mySlice {
		(*mySlice)[idx]++
	}
}
My first thought was that the performance should be nearly the same, because pass by value copies just the slice header while pass by pointer forces me to dereference a pointer, but my benchmark shows something else:
func BenchmarkAddByValue(b *testing.B) {
	mySlice := rand.Perm(1000)
	for n := 0; n < b.N; n++ {
		AddToSliceByValue(mySlice)
	}
}

func BenchmarkAddByPointer(b *testing.B) {
	mySlice := rand.Perm(1000)
	for n := 0; n < b.N; n++ {
		AddToSliceByPointer(&mySlice)
	}
}
BenchmarkAddByValue-12 1151256 1035 ns/op
BenchmarkAddByPointer-12 2145110 525 ns/op
Can anyone explain to me why the difference in performance is so great?
I also added the assembly code for the two functions.
Assembly code for pass by value:
TEXT main.AddToSliceByValue(SB) /go_test/pointer/pointer_value.go
pointer_value.go:4 0x1056f60 488b442410 MOVQ 0x10(SP), AX
pointer_value.go:4 0x1056f65 488b4c2408 MOVQ 0x8(SP), CX
pointer_value.go:4 0x1056f6a 31d2 XORL DX, DX
pointer_value.go:4 0x1056f6c eb0e JMP 0x1056f7c
pointer_value.go:5 0x1056f6e 488b1cd1 MOVQ 0(CX)(DX*8), BX
pointer_value.go:5 0x1056f72 48ffc3 INCQ BX
pointer_value.go:5 0x1056f75 48891cd1 MOVQ BX, 0(CX)(DX*8)
pointer_value.go:4 0x1056f79 48ffc2 INCQ DX
pointer_value.go:4 0x1056f7c 4839c2 CMPQ AX, DX
pointer_value.go:4 0x1056f7f 7ced JL 0x1056f6e
pointer_value.go:4 0x1056f81 c3 RET
TEXT main.main(SB) /go_test/pointer/pointer_value.go
pointer_value.go:9 0x1056f90 65488b0c2530000000 MOVQ GS:0x30, CX
pointer_value.go:9 0x1056f99 483b6110 CMPQ 0x10(CX), SP
pointer_value.go:9 0x1056f9d 0f86a8000000 JBE 0x105704b
pointer_value.go:9 0x1056fa3 4883ec70 SUBQ $0x70, SP
pointer_value.go:9 0x1056fa7 48896c2468 MOVQ BP, 0x68(SP)
pointer_value.go:9 0x1056fac 488d6c2468 LEAQ 0x68(SP), BP
pointer_value.go:11 0x1056fb1 488d7c2418 LEAQ 0x18(SP), DI
pointer_value.go:11 0x1056fb6 0f57c0 XORPS X0, X0
pointer_value.go:11 0x1056fb9 488d7fd0 LEAQ -0x30(DI), DI
pointer_value.go:11 0x1056fbd 48896c24f0 MOVQ BP, -0x10(SP)
pointer_value.go:11 0x1056fc2 488d6c24f0 LEAQ -0x10(SP), BP
pointer_value.go:11 0x1056fc7 e849c6ffff CALL 0x1053615
pointer_value.go:11 0x1056fcc 488b6d00 MOVQ 0(BP), BP
pointer_value.go:11 0x1056fd0 48c744242001000000 MOVQ $0x1, 0x20(SP)
pointer_value.go:11 0x1056fd9 48c744242802000000 MOVQ $0x2, 0x28(SP)
pointer_value.go:11 0x1056fe2 48c744243003000000 MOVQ $0x3, 0x30(SP)
pointer_value.go:11 0x1056feb 48c744243804000000 MOVQ $0x4, 0x38(SP)
pointer_value.go:11 0x1056ff4 48c744244005000000 MOVQ $0x5, 0x40(SP)
pointer_value.go:11 0x1056ffd 48c744244806000000 MOVQ $0x6, 0x48(SP)
pointer_value.go:11 0x1057006 48c744245007000000 MOVQ $0x7, 0x50(SP)
pointer_value.go:11 0x105700f 48c744245808000000 MOVQ $0x8, 0x58(SP)
pointer_value.go:11 0x1057018 48c744246009000000 MOVQ $0x9, 0x60(SP)
pointer_value.go:12 0x1057021 488d442418 LEAQ 0x18(SP), AX
pointer_value.go:12 0x1057026 48890424 MOVQ AX, 0(SP)
pointer_value.go:12 0x105702a 48c74424080a000000 MOVQ $0xa, 0x8(SP)
pointer_value.go:12 0x1057033 48c74424100a000000 MOVQ $0xa, 0x10(SP)
pointer_value.go:12 0x105703c e81fffffff CALL main.AddToSliceByValue(SB)
pointer_value.go:13 0x1057041 488b6c2468 MOVQ 0x68(SP), BP
pointer_value.go:13 0x1057046 4883c470 ADDQ $0x70, SP
pointer_value.go:13 0x105704a c3 RET
pointer_value.go:9 0x105704b e8909cffff CALL runtime.morestack_noctxt(SB)
pointer_value.go:9 0x1057050 e93bffffff JMP main.main(SB)
Assembly code for pass by pointer:
TEXT main.AddToSliceByPointer(SB) /go_test/pointer/pointer_ref.go
pointer_ref.go:3 0x1056f60 4883ec18 SUBQ $0x18, SP
pointer_ref.go:3 0x1056f64 48896c2410 MOVQ BP, 0x10(SP)
pointer_ref.go:3 0x1056f69 488d6c2410 LEAQ 0x10(SP), BP
pointer_ref.go:4 0x1056f6e 488b542420 MOVQ 0x20(SP), DX
pointer_ref.go:4 0x1056f73 488b5a08 MOVQ 0x8(DX), BX
pointer_ref.go:4 0x1056f77 31c0 XORL AX, AX
pointer_ref.go:4 0x1056f79 eb0e JMP 0x1056f89
pointer_ref.go:5 0x1056f7b 488b3cc6 MOVQ 0(SI)(AX*8), DI
pointer_ref.go:5 0x1056f7f 48ffc7 INCQ DI
pointer_ref.go:5 0x1056f82 48893cc6 MOVQ DI, 0(SI)(AX*8)
pointer_ref.go:4 0x1056f86 48ffc0 INCQ AX
pointer_ref.go:4 0x1056f89 4839d8 CMPQ BX, AX
pointer_ref.go:4 0x1056f8c 7d0e JGE 0x1056f9c
pointer_ref.go:5 0x1056f8e 488b4a08 MOVQ 0x8(DX), CX
pointer_ref.go:5 0x1056f92 488b32 MOVQ 0(DX), SI
pointer_ref.go:5 0x1056f95 4839c8 CMPQ CX, AX
pointer_ref.go:5 0x1056f98 72e1 JB 0x1056f7b
pointer_ref.go:5 0x1056f9a eb0a JMP 0x1056fa6
pointer_ref.go:4 0x1056f9c 488b6c2410 MOVQ 0x10(SP), BP
pointer_ref.go:4 0x1056fa1 4883c418 ADDQ $0x18, SP
pointer_ref.go:4 0x1056fa5 c3 RET
pointer_ref.go:5 0x1056fa6 e8b5c4ffff CALL runtime.panicIndex(SB)
pointer_ref.go:5 0x1056fab 90 NOPL
TEXT main.main(SB) /go_test/pointer/pointer_ref.go
pointer_ref.go:9 0x1056fb0 65488b0c2530000000 MOVQ GS:0x30, CX
pointer_ref.go:9 0x1056fb9 483b6110 CMPQ 0x10(CX), SP
pointer_ref.go:9 0x1056fbd 0f86b2000000 JBE 0x1057075
pointer_ref.go:9 0x1056fc3 4883ec78 SUBQ $0x78, SP
pointer_ref.go:9 0x1056fc7 48896c2470 MOVQ BP, 0x70(SP)
pointer_ref.go:9 0x1056fcc 488d6c2470 LEAQ 0x70(SP), BP
pointer_ref.go:11 0x1056fd1 488d7c2408 LEAQ 0x8(SP), DI
pointer_ref.go:11 0x1056fd6 0f57c0 XORPS X0, X0
pointer_ref.go:11 0x1056fd9 488d7fd0 LEAQ -0x30(DI), DI
pointer_ref.go:11 0x1056fdd 48896c24f0 MOVQ BP, -0x10(SP)
pointer_ref.go:11 0x1056fe2 488d6c24f0 LEAQ -0x10(SP), BP
pointer_ref.go:11 0x1056fe7 e829c6ffff CALL 0x1053615
pointer_ref.go:11 0x1056fec 488b6d00 MOVQ 0(BP), BP
pointer_ref.go:11 0x1056ff0 48c744241001000000 MOVQ $0x1, 0x10(SP)
pointer_ref.go:11 0x1056ff9 48c744241802000000 MOVQ $0x2, 0x18(SP)
pointer_ref.go:11 0x1057002 48c744242003000000 MOVQ $0x3, 0x20(SP)
pointer_ref.go:11 0x105700b 48c744242804000000 MOVQ $0x4, 0x28(SP)
pointer_ref.go:11 0x1057014 48c744243005000000 MOVQ $0x5, 0x30(SP)
pointer_ref.go:11 0x105701d 48c744243806000000 MOVQ $0x6, 0x38(SP)
pointer_ref.go:11 0x1057026 48c744244007000000 MOVQ $0x7, 0x40(SP)
pointer_ref.go:11 0x105702f 48c744244808000000 MOVQ $0x8, 0x48(SP)
pointer_ref.go:11 0x1057038 48c744245009000000 MOVQ $0x9, 0x50(SP)
pointer_ref.go:11 0x1057041 488d442408 LEAQ 0x8(SP), AX
pointer_ref.go:11 0x1057046 4889442458 MOVQ AX, 0x58(SP)
pointer_ref.go:11 0x105704b 48c74424600a000000 MOVQ $0xa, 0x60(SP)
pointer_ref.go:11 0x1057054 48c74424680a000000 MOVQ $0xa, 0x68(SP)
pointer_ref.go:12 0x105705d 488d442458 LEAQ 0x58(SP), AX
pointer_ref.go:12 0x1057062 48890424 MOVQ AX, 0(SP)
pointer_ref.go:12 0x1057066 e8f5feffff CALL main.AddToSliceByPointer(SB)
pointer_ref.go:13 0x105706b 488b6c2470 MOVQ 0x70(SP), BP
pointer_ref.go:13 0x1057070 4883c478 ADDQ $0x78, SP
pointer_ref.go:13 0x1057074 c3 RET
pointer_ref.go:9 0x1057075 e8669cffff CALL runtime.morestack_noctxt(SB)
pointer_ref.go:9 0x105707a e931ffffff JMP main.main(SB)
I could not reproduce your benchmark...
package main_test

import (
	"math/rand"
	"testing"
)

func AddToSliceByValue(mySlice []int) {
	for idx := range mySlice {
		mySlice[idx]++
	}
}

func AddToSliceByPointer(mySlice *[]int) {
	for idx := range *mySlice {
		(*mySlice)[idx]++
	}
}

func BenchmarkAddByValue(b *testing.B) {
	mySlice := rand.Perm(1000)
	for n := 0; n < b.N; n++ {
		AddToSliceByValue(mySlice)
	}
}

func BenchmarkAddByPointer(b *testing.B) {
	mySlice := rand.Perm(1000)
	for n := 0; n < b.N; n++ {
		AddToSliceByPointer(&mySlice)
	}
}
$ go test -bench=. -benchmem -count=4
goos: linux
goarch: amd64
pkg: test/bencslice
BenchmarkAddByValue-4 3010280 385 ns/op 0 B/op 0 allocs/op
BenchmarkAddByValue-4 3118990 385 ns/op 0 B/op 0 allocs/op
BenchmarkAddByValue-4 3117450 384 ns/op 0 B/op 0 allocs/op
BenchmarkAddByValue-4 3109251 386 ns/op 0 B/op 0 allocs/op
BenchmarkAddByPointer-4 2012487 610 ns/op 0 B/op 0 allocs/op
BenchmarkAddByPointer-4 2009690 594 ns/op 0 B/op 0 allocs/op
BenchmarkAddByPointer-4 2009222 594 ns/op 0 B/op 0 allocs/op
BenchmarkAddByPointer-4 1850820 596 ns/op 0 B/op 0 allocs/op
PASS
ok test/bencslice 13.476s
$ go version
go version go1.15.2 linux/amd64
Anyway, the behavior might depend on many factors, first of all the version of the runtime. Understanding the intrinsics is of little interest as long as you can test, reproduce, and monitor.
I found out that my variance was too high:
AddByValue-12 5.41µs ±15%
AddByPointer-12 5.30µs ± 4%
With go test -benchmem -count 5 -benchtime=1000000x -bench=. ./... I was able to reduce the variance in the test results and could confirm my first assumption that the results should be approximately equal:
AddByValue-12 5.04µs ± 1%
AddByPointer-12 5.17µs ± 1%
According to the comments the main reason for the high variance was that I did not reset the timer after the benchmark setup.
With the following code and a lower benchtime I also reduced the variance:
func BenchmarkAddByValue(b *testing.B) {
	mySlice := rand.Perm(10000)
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		AddToSliceByValue(mySlice)
	}
}

func BenchmarkAddByPointer(b *testing.B) {
	mySlice := rand.Perm(10000)
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		AddToSliceByPointer(&mySlice)
	}
}
Results:
AddByValue-12 5.03µs ± 0%
AddByPointer-12 5.17µs ± 1%
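These ± summaries are in the format printed by the benchstat tool from golang.org/x/perf (an assumption about the workflow here, but the usual way to aggregate repeated runs):

$ go test -benchmem -count=5 -benchtime=1000000x -bench=. ./... > results.txt
$ benchstat results.txt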
Thanks a lot for your help!
This is a general issue with "lower level" languages. When you pass by value, you are actually copying the data. Here is how that works.
When you pass by reference:
A copy of the reference is created and passed to the method (a reference is most likely 8 bytes, so this is fast).
You read the data through the reference (this is also fast, since the referenced data is most likely in the CPU cache).
In the case of pass by value:
A block of memory is allocated to store the data you passed in (slow).
The data is copied into the newly allocated block of memory (maybe fast, maybe slow).
Then your data is accessed via references (maybe slow, maybe fast, depending on whether the data landed in the cache or not).
This post is related to Golang assembly implement of _mm_add_epi32, where paired elements in two [8]int32 lists are added and the updated first one is returned.
According to the pprof profile, I found that passing [8]int32 by value is expensive, so I thought passing a pointer to the list would be much cheaper, and the benchmark result verified this. Here's the Go version:
func __mm_add_epi32_inplace_purego(x, y *[8]int32) {
	(*x)[0] += (*y)[0]
	(*x)[1] += (*y)[1]
	(*x)[2] += (*y)[2]
	(*x)[3] += (*y)[3]
	(*x)[4] += (*y)[4]
	(*x)[5] += (*y)[5]
	(*x)[6] += (*y)[6]
	(*x)[7] += (*y)[7]
}
This function is called inside two levels of loops.
The algorithm computes a positional population count over an array of bytes.
Thanks to advice from @fuz, I know that writing the whole algorithm in assembly would be the best choice and makes sense, but it's beyond my ability since I have never learned to program in assembly.
However, it should be easy to optimize the inner loop with assembly:
counts := make([][8]int32, numRowBytes)
for i, b := range byteSlice {
	if b == 0 { // more than half of the elements in byteSlice are 0
		continue
	}
	expand := _expand_byte[b]
	__mm_add_epi32_inplace_purego(&counts[i], expand)
}
// expands a byte into its bits
var _expand_byte = [256]*[8]int32{
	&[8]int32{0, 0, 0, 0, 0, 0, 0, 0},
	&[8]int32{0, 0, 0, 0, 0, 0, 0, 1},
	&[8]int32{0, 0, 0, 0, 0, 0, 1, 0},
	&[8]int32{0, 0, 0, 0, 0, 0, 1, 1},
	&[8]int32{0, 0, 0, 0, 0, 1, 0, 0},
	...
}
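As an aside, such a table can also be generated at package init time instead of being written out by hand. A sketch (the name _expandByte is hypothetical, and it assumes the same bit order as the literal above, with the least significant bit in the last element):

var _expandByte [256]*[8]int32

func init() {
	for b := 0; b < 256; b++ {
		var row [8]int32
		for bit := 0; bit < 8; bit++ {
			// bit 0 (LSB) of b lands in the last element, matching the table above
			row[7-bit] = int32((b >> bit) & 1)
		}
		_expandByte[b] = &row
	}
}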
Can you help write an assembly version of __mm_add_epi32_inplace_purego (this would be enough for me), or even of the whole loop? Thank you in advance.
The operation you want to perform is called a positional population count on bytes. This is a well-known operation used in machine learning and some research has been done on fast algorithms to solve this problem.
Unfortunately, the implementations of these algorithms are fairly involved. For this reason, I have developed a custom algorithm that is much simpler to implement but only yields roughly half the performance of the other methods. However, at a measured 10 GB/s, it should still be a decent improvement over what you had previously.
The idea of this algorithm is to gather corresponding bits from groups of 32 bytes using vpmovmskb and then to take a scalar population count which is then added to the corresponding counter. This allows the dependency chains to be short and a consistent IPC of 3 to be reached.
Note that compared to your algorithm, my code flips the order of bits around. You can change this by editing which counts array elements the assembly code accesses if you want. However, in the interest of future readers, I'd like to leave this code with the more common convention where the least significant bit is considered bit 0.
Source code
The complete source code can be found on GitHub. The author has since developed this algorithm idea into a portable library that can be used like this:
import "github.com/clausecker/pospop"
var counts [8]int
pospop.Count8(&counts, buf) // add positional popcounts for buf to counts
The algorithm is provided in two variants and has been tested on a machine with a processor identified as “Intel(R) Xeon(R) W-2133 CPU @ 3.60GHz.”
Positional Population Count 32 Bytes at a Time.
The counters are kept in general purpose registers for best performance. Memory is prefetched well in advance for better streaming behaviour. The scalar tail is processed using a very simple SHRL/ADCL combination. A performance of up to 11 GB/s is achieved.
#include "textflag.h"
// func PospopcntReg(counts *[8]int32, buf []byte)
TEXT ·PospopcntReg(SB),NOSPLIT,$0-32
MOVQ counts+0(FP), DI
MOVQ buf_base+8(FP), SI // SI = &buf[0]
MOVQ buf_len+16(FP), CX // CX = len(buf)
// load counts into register R8--R15
MOVL 4*0(DI), R8
MOVL 4*1(DI), R9
MOVL 4*2(DI), R10
MOVL 4*3(DI), R11
MOVL 4*4(DI), R12
MOVL 4*5(DI), R13
MOVL 4*6(DI), R14
MOVL 4*7(DI), R15
SUBQ $32, CX // pre-subtract 32 bytes from CX
JL scalar
vector: VMOVDQU (SI), Y0 // load 32 bytes from buf
PREFETCHT0 384(SI) // prefetch some data
ADDQ $32, SI // advance SI past them
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R15 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R14 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R13 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R12 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R11 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R10 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R9 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R8 // add to counter
SUBQ $32, CX
JGE vector // repeat as long as bytes are left
scalar: ADDQ $32, CX // undo last subtraction
JE done // if CX=0, there's nothing left
loop: MOVBLZX (SI), AX // load a byte from buf
INCQ SI // advance past it
SHRL $1, AX // CF=LSB, shift byte to the right
ADCL $0, R8 // add CF to R8
SHRL $1, AX
ADCL $0, R9 // add CF to R9
SHRL $1, AX
ADCL $0, R10 // add CF to R10
SHRL $1, AX
ADCL $0, R11 // add CF to R11
SHRL $1, AX
ADCL $0, R12 // add CF to R12
SHRL $1, AX
ADCL $0, R13 // add CF to R13
SHRL $1, AX
ADCL $0, R14 // add CF to R14
SHRL $1, AX
ADCL $0, R15 // add CF to R15
DECQ CX // mark this byte as done
JNE loop // and proceed if any bytes are left
// write R8--R15 back to counts
done: MOVL R8, 4*0(DI)
MOVL R9, 4*1(DI)
MOVL R10, 4*2(DI)
MOVL R11, 4*3(DI)
MOVL R12, 4*4(DI)
MOVL R13, 4*5(DI)
MOVL R14, 4*6(DI)
MOVL R15, 4*7(DI)
VZEROUPPER // restore SSE-compatibility
RET
Positional Population Count 96 Bytes at a Time with CSA
This variant performs all of the optimisations above but reduces 96 bytes to 64 using a single CSA step beforehand. As expected, this improves the performance by roughly 30% and achieves up to 16 GB/s.
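For readers unfamiliar with the trick: a carry-save adder (CSA) compresses three inputs into a sum word and a carry word using only bitwise operations. A sketch of the same identity on 64-bit words (the VPXOR/VPAND/VPOR sequence below applies it to 256-bit vectors):

// csa is a bitwise full adder: at each bit position, sum holds the low
// bit and carry the high bit of a+b+c at that position.
func csa(a, b, c uint64) (sum, carry uint64) {
	s := a ^ b
	carry = (a & b) | (s & c)
	sum = s ^ c
	return sum, carry
}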
#include "textflag.h"
// func PospopcntRegCSA(counts *[8]int32, buf []byte)
TEXT ·PospopcntRegCSA(SB),NOSPLIT,$0-32
MOVQ counts+0(FP), DI
MOVQ buf_base+8(FP), SI // SI = &buf[0]
MOVQ buf_len+16(FP), CX // CX = len(buf)
// load counts into register R8--R15
MOVL 4*0(DI), R8
MOVL 4*1(DI), R9
MOVL 4*2(DI), R10
MOVL 4*3(DI), R11
MOVL 4*4(DI), R12
MOVL 4*5(DI), R13
MOVL 4*6(DI), R14
MOVL 4*7(DI), R15
SUBQ $96, CX // pre-subtract 96 bytes from CX
JL scalar
vector: VMOVDQU (SI), Y0 // load 96 bytes from buf into Y0--Y2
VMOVDQU 32(SI), Y1
VMOVDQU 64(SI), Y2
ADDQ $96, SI // advance SI past them
PREFETCHT0 320(SI)
PREFETCHT0 384(SI)
VPXOR Y0, Y1, Y3 // first adder: sum
VPAND Y0, Y1, Y0 // first adder: carry out
VPAND Y2, Y3, Y1 // second adder: carry out
VPXOR Y2, Y3, Y2 // second adder: sum (full sum)
VPOR Y0, Y1, Y0 // full adder: carry out
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R15
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R14
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R13
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R12
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R11
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R10
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R9
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R8
SUBQ $96, CX
JGE vector // repeat as long as bytes are left
scalar: ADDQ $96, CX // undo last subtraction
JE done // if CX=0, there's nothing left
loop: MOVBLZX (SI), AX // load a byte from buf
INCQ SI // advance past it
SHRL $1, AX // is bit 0 set?
ADCL $0, R8 // add it to R8
SHRL $1, AX // is bit 0 set?
ADCL $0, R9 // add it to R9
SHRL $1, AX // is bit 0 set?
ADCL $0, R10 // add it to R10
SHRL $1, AX // is bit 0 set?
ADCL $0, R11 // add it to R11
SHRL $1, AX // is bit 0 set?
ADCL $0, R12 // add it to R12
SHRL $1, AX // is bit 0 set?
ADCL $0, R13 // add it to R13
SHRL $1, AX // is bit 0 set?
ADCL $0, R14 // add it to R14
SHRL $1, AX // is bit 0 set?
ADCL $0, R15 // add it to R15
DECQ CX // mark this byte as done
JNE loop // and proceed if any bytes are left
// write R8--R15 back to counts
done: MOVL R8, 4*0(DI)
MOVL R9, 4*1(DI)
MOVL R10, 4*2(DI)
MOVL R11, 4*3(DI)
MOVL R12, 4*4(DI)
MOVL R13, 4*5(DI)
MOVL R14, 4*6(DI)
MOVL R15, 4*7(DI)
VZEROUPPER // restore SSE-compatibility
RET
Benchmarks
Here are benchmarks for the two algorithms and a naïve reference implementation in pure Go. Full benchmarks can be found in the github repository.
BenchmarkReference/10-12 12448764 80.9 ns/op 123.67 MB/s
BenchmarkReference/32-12 4357808 258 ns/op 124.25 MB/s
BenchmarkReference/1000-12 151173 7889 ns/op 126.76 MB/s
BenchmarkReference/2000-12 68959 15774 ns/op 126.79 MB/s
BenchmarkReference/4000-12 36481 31619 ns/op 126.51 MB/s
BenchmarkReference/10000-12 14804 78917 ns/op 126.72 MB/s
BenchmarkReference/100000-12 1540 789450 ns/op 126.67 MB/s
BenchmarkReference/10000000-12 14 77782267 ns/op 128.56 MB/s
BenchmarkReference/1000000000-12 1 7781360044 ns/op 128.51 MB/s
BenchmarkReg/10-12 49255107 24.5 ns/op 407.42 MB/s
BenchmarkReg/32-12 186935192 6.40 ns/op 4998.53 MB/s
BenchmarkReg/1000-12 8778610 115 ns/op 8677.33 MB/s
BenchmarkReg/2000-12 5358495 208 ns/op 9635.30 MB/s
BenchmarkReg/4000-12 3385945 357 ns/op 11200.23 MB/s
BenchmarkReg/10000-12 1298670 901 ns/op 11099.24 MB/s
BenchmarkReg/100000-12 115629 8662 ns/op 11544.98 MB/s
BenchmarkReg/10000000-12 1270 916817 ns/op 10907.30 MB/s
BenchmarkReg/1000000000-12 12 93609392 ns/op 10682.69 MB/s
BenchmarkRegCSA/10-12 48337226 23.9 ns/op 417.92 MB/s
BenchmarkRegCSA/32-12 12843939 80.2 ns/op 398.86 MB/s
BenchmarkRegCSA/1000-12 7175629 150 ns/op 6655.70 MB/s
BenchmarkRegCSA/2000-12 3988408 295 ns/op 6776.20 MB/s
BenchmarkRegCSA/4000-12 3016693 382 ns/op 10467.41 MB/s
BenchmarkRegCSA/10000-12 1810195 642 ns/op 15575.65 MB/s
BenchmarkRegCSA/100000-12 191974 6229 ns/op 16053.40 MB/s
BenchmarkRegCSA/10000000-12 1622 698856 ns/op 14309.10 MB/s
BenchmarkRegCSA/1000000000-12 16 68540642 ns/op 14589.88 MB/s
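For context, a naïve pure-Go implementation of the same operation (my sketch of what BenchmarkReference measures; the actual code is in the repository) could look like this:

// count8 adds the positional popcounts of buf to counts, with bit 0
// being the least significant bit, as in the assembly versions above.
func count8(counts *[8]int32, buf []byte) {
	for _, b := range buf {
		for i := 0; i < 8; i++ {
			counts[i] += int32((b >> i) & 1)
		}
	}
}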
I am just learning NASM, so sorry if I am making some obvious mistake, but I cannot understand what I am doing wrong.
Please look into the code below and let me know what is incorrect. It compiles and runs OK, but prints garbage as a result.
I know that the info coming from mach_absolute_time is hardware dependent, so it needs to be adjusted with the info from the struct filled in by mach_timebase_info.
I created the test program below, which artificially takes 1 sec to execute.
It prints the start, end, and elapsed absolute mach time info (which, curiously, on my machine already displays the correct number of nanoseconds).
But the calculated nanoseconds are garbage, probably related to some error I am making with the math / use of XMM registers and data sizes, but for the love of me I cannot figure it out.
Thanks for the help!
The test program:
; ----------------------------------------------------------------------------------------
; Testing mach_absolute_time
; nasm -fmacho64 mach.asm && gcc -o mach mach.o
; ----------------------------------------------------------------------------------------
global _main
extern _printf
extern _mach_absolute_time
extern _mach_timebase_info
extern _nanosleep
default rel
section .text
_main:
push rbx ; aligns the stack x C calls
; start measurement
call _mach_absolute_time ; get the absolute time hardware dependant
mov [start], rax ; save start in start
; print start
lea rdi, [time_absolute]
mov rsi, rax
call _printf
; do some time intensive stuff - This simulates 1 sec work
lea rdi, [timeval]
call _nanosleep
; end measurement
call _mach_absolute_time
mov [end], rax
; print end
lea rdi, [time_absolute]
mov rsi, rax
call _printf
; calc elapsed
mov r10d, [end]
mov r11d, [start]
sub r10d, r11d ; r10d = end - start
mov [diff], r10d ; copy to diff
mov rax, [diff] ; diff to rax to print as int
cvtsi2ss xmm2, r10d ; diff to xmm2 to calc nanoseconds
; print elapsed
lea rdi, [diff_absolute]
mov rsi, rax
call _printf
; get conversion factor to get nanoseconds and store numerator and denominator
; in xmm0 and xmm1
lea rdi, [timebase_info]
call _mach_timebase_info ; get conversion factor to nanoseconds
movss xmm0, [numer]
movss xmm1, [denom]
; print numerator & denominator as float to ensure I am getting the info into xmm regs
lea rdi, [time_base]
mov rax, 2
call _printf
; calc nanoseconds - xmm0 ends with nanoseconds
mulss xmm0, xmm2 ; multiply elapsed * numerator
divss xmm0, xmm1 ; divide by the denominator
; print nanoseconds as float
lea rdi, [nanosecs_calc]
mov rax, 1 ; 1 non-int argument
call _printf
pop rbx ; undoes the stack alignment push
ret
section .data
; _mach_timebase_info call struct
timebase_info:
numer db 8
denom db 8
; lazy way to set up 1 sec wait
timeval:
tv_sec dq 1
tv_usec dq 0
time_absolute: db "mach_absoute_time: %ld", 10, 0
diff_absolute: db "absoute_time diff: %ld", 10, 0
time_base: db "numerator: %g, denominator: %g", 10, 0
nanosecs_calc: db "calc nanoseconds: %ld", 10, 0
; using %g format also prints garbage
; nanosecs_calc: db "calc nanoseconds: %g", 10, 0
; should use registers but for clarity
start: dq 0
end: dq 0
diff: dq 0
EDIT: I figured out what was wrong: the XMM registers get clobbered across C calls, which is why the multiplication and the result failed. Anyway, the C workaround below to get the timebase ratio works OK; the full code of the test follows.
The workaround is to get the ratio from mach_timebase_info in a short C function, and to multiply the result from mach_absolute_time by it to get nanoseconds.
As suspected, on my actual hardware (late 2013 MBP, 2.3 GHz i7), mach_absolute_time already returns nanoseconds, so the factor as printed by C is 1.000
(timebase numerator = 1, timebase denominator = 1).
#include <stdio.h>
#include <mach/mach_time.h>

double timebase() {
	double ratio;
	mach_timebase_info_data_t tb;
	mach_timebase_info(&tb);
	/* cast to double, otherwise integer division truncates the ratio */
	ratio = (double)tb.numer / tb.denom;
	printf("num: %u, den: %u\n", tb.numer, tb.denom);
	printf("ratio from C: %.3f\n", ratio);
	return ratio;
}
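Assuming the C helper lives in timebase.c next to the assembly, the two pieces can be built together roughly like this (my guess, extrapolated from the build comment in the first listing):

nasm -fmacho64 mach.asm
gcc -o mach mach.o timebase.c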
NASM:
global _main
extern _printf
extern _mach_absolute_time
extern _timebase
extern _nanosleep
default rel
section .text
_main:
push rbx ; aligns the stack x C calls
; start measurement
call _mach_absolute_time ; get the absolute time hardware dependant
mov [start], rax ; save start in start
; print start
lea rdi, [time_absolute]
mov rsi, rax
call _printf
; do some time intensive stuff - This simulates 1 sec work
lea rdi, [timeval]
call _nanosleep
; end measurement
call _mach_absolute_time
mov [end], rax
; print end
lea rdi, [time_absolute]
mov rsi, rax
call _printf
; calc elapsed
mov r10d, [end]
mov r11d, [start]
sub r10d, r11d ; r10d = end - start
mov [diff], r10d ; copy to diff
mov rax, [diff] ; diff to rax to print as int
; print elapsed
lea rdi, [diff_absolute]
mov rsi, [diff]
call _printf
; get conversion ratio from C function
call _timebase ; get conversion ratio to nanoseconds into xmm0
cvtsi2sd xmm1, qword [diff] ; load diff (saved from mach_absolute_time) into xmm1,
                            ; after the C call so the register is not clobbered
; calc nanoseconds - xmm0 ends with nanoseconds
; in my hardware ratio is 1.0 so mach_absolute_time = nanoseconds
mulsd xmm0, xmm1
cvtsd2si rax, xmm0
mov [result], rax ; save to result
; print nanoseconds as int
lea rdi, [nanosecs_calc]
mov rsi, [result]
call _printf
pop rbx ; undoes the stack alignment push
ret
section .data
; lazy way to set up 1 sec wait
timeval:
tv_sec dq 1
tv_usec dq 0
time_absolute: db "mach_absoute_time: %ld", 10, 0
diff_absolute: db "absoute_time diff: %ld", 10, 0
nanosecs_calc: db "nanoseconds: %ld", 10, 0
; should use registers but for clarity
start: dq 0
end: dq 0
diff: dq 0
result: dq 0
I have recently started the Go track on exercism.io and had fun optimizing the "nth-prime" calculation. Actually I came across a funny fact I can't explain. Imagine the following code:
// Package prime provides ...
package prime

// Nth function checks for the prime number on position n
func Nth(n int) (int, bool) {
	if n <= 0 {
		return 0, false
	}
	if n == 1 {
		return 2, true
	}
	currentNumber := 1
	primeCounter := 1
	for n > primeCounter {
		currentNumber += 2
		if isPrime(currentNumber) {
			primeCounter++
		}
	}
	return currentNumber, primeCounter == n
}

// isPrime function checks if a number
// is a prime number
func isPrime(n int) bool {
	// useless because never triggered, but makes it faster??
	if n < 2 {
		println("n < 2")
		return false
	}
	// useless because never triggered, but makes it faster??
	if n%2 == 0 {
		println("n%2")
		return n == 2
	}
	for i := 3; i*i <= n; i += 2 {
		if n%i == 0 {
			return false
		}
	}
	return true
}
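The benchmark function itself is not shown in the post; presumably it looks roughly like this (a sketch, assuming the exercism suite benchmarks Nth with a fixed, large n):

func BenchmarkNth(b *testing.B) {
	for i := 0; i < b.N; i++ {
		Nth(10001)
	}
}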
In the private function isPrime I have two initial if-statements that are never triggered, because I only pass in odd numbers greater than 2. The benchmark returns the following:
Running tool: /usr/bin/go test -benchmem -run=^$ -bench ^(BenchmarkNth)$
BenchmarkNth-8 100 18114825 ns/op 0 B/op 0
If I remove the never-triggered if-statements, the benchmark gets slower:
Running tool: /usr/bin/go test -benchmem -run=^$ -bench ^(BenchmarkNth)$
BenchmarkNth-8 50 21880749 ns/op 0 B/op 0
I have run the benchmark multiple times, changing the code back and forth, and I always get more or less the same numbers. I can't think of a reason why these two if-statements should make the execution faster. Yes, it is a micro-optimization, but I want to know: why?
Here is the whole exercise from exercism with test cases: nth-prime
The Go version I am using is 1.12.1 linux/amd64, on Manjaro Linux with i3.
What happens is that the compiler gets some guarantees about the input when those ifs are present.
If those assertions are removed, the compiler has to add the checks itself, and it does so by validating on each iteration. We can take a look at the assembly code to prove it (by passing -gcflags=-S to the go test command).
With the if's:
0x004b 00075 (func.go:16) JMP 81
0x004d 00077 (func.go:16) LEAQ 2(BX), AX
0x0051 00081 (func.go:16) MOVQ AX, DX
0x0054 00084 (func.go:16) IMULQ AX, AX
0x0058 00088 (func.go:16) CMPQ AX, CX
0x005b 00091 (func.go:16) JGT 133
0x005d 00093 (func.go:17) TESTQ DX, DX
0x0060 00096 (func.go:17) JEQ 257
0x0066 00102 (func.go:17) MOVQ CX, AX
0x0069 00105 (func.go:17) MOVQ DX, BX
0x006c 00108 (func.go:17) CQO
0x006e 00110 (func.go:17) IDIVQ BX
0x0071 00113 (func.go:17) TESTQ DX, DX
0x0074 00116 (func.go:17) JNE 77
Without the if's:
0x0016 00022 (func.go:16) JMP 28
0x0018 00024 (func.go:16) LEAQ 2(BX), AX
0x001c 00028 (func.go:16) MOVQ AX, DX
0x001f 00031 (func.go:16) IMULQ AX, AX
0x0023 00035 (func.go:16) CMPQ AX, CX
0x0026 00038 (func.go:16) JGT 88
0x0028 00040 (func.go:17) TESTQ DX, DX
0x002b 00043 (func.go:17) JEQ 102
0x002d 00045 (func.go:17) MOVQ CX, AX
0x0030 00048 (func.go:17) MOVQ DX, BX
0x0033 00051 (func.go:17) CMPQ BX, $-1
0x0037 00055 (func.go:17) JEQ 64
0x0039 00057 (func.go:17) CQO
0x003b 00059 (func.go:17) IDIVQ BX
0x003e 00062 (func.go:17) JMP 69
0x0040 00064 func.go:17) NEGQ AX
0x0043 00067 (func.go:17) XORL DX, DX
0x0045 00069 (func.go:17) TESTQ DX, DX
0x0048 00072 (func.go:17) JNE 24
Line 51 in the assembly code, 0x0033 00051 (func.go:17) CMPQ BX, $-1, is the culprit: before the IDIVQ, the compiler inserts a check for a divisor of -1 (note the NEGQ fixup path below it), because IDIV faults when dividing the smallest negative integer by -1, and without the extra if-statements the compiler cannot prove that this case never happens.
Line 16, for i := 3; i*i <= n; i+=2, in the original Go code is translated the same in both cases. But line 17, if n%i == 0, which runs on every iteration, compiles to more instructions, and as a result more work for the CPU in total.
Something similar was done in the encoding/base64 package, where the loop was guaranteed not to receive a nil value. You can take a look here:
https://go-review.googlesource.com/c/go/+/151158/3/src/encoding/base64/base64.go
That check was added intentionally. In your case, you optimized accidentally :)
While benchmarking, I noticed a surprising heap memory allocation. After reducing the repro, I ended up with the following:
// --- Repro file ---
func memAllocRepro(values []int) *[]int {
for {
break
}
return &values
}
// --- Benchmark file ---
func BenchmarkMemAlloc(b *testing.B) {
values := []int{1, 2, 3, 4}
for i := 0; i < b.N; i++ {
memAllocRepro(values)
}
}
And here is the benchmark output:
BenchmarkMemAlloc-4 50000000 40.2 ns/op 32 B/op 1 allocs/op
PASS
ok memalloc_debugging 2.113s
Success: Benchmarks passed.
Now the funny thing is: if I remove the for loop, or if I return the slice directly instead of a pointer to it, there is no more heap allocation:
// --- Repro file ---
func noAlloc1(values []int) *[]int {
	return &values // No alloc!
}

func noAlloc2(values []int) []int {
	for {
		break
	}
	return values // No alloc!
}

// --- Benchmark file ---
func BenchmarkNoAlloc(b *testing.B) {
	values := []int{1, 2, 3, 4}
	for i := 0; i < b.N; i++ {
		noAlloc1(values)
		noAlloc2(values)
	}
}
Benchmark result:
BenchmarkNoAlloc-4 300000000 4.20 ns/op 0 B/op 0 allocs/op
PASS
ok memalloc_debugging 1.756s
Success: Benchmarks passed.
I found that very confusing, and I confirmed with Delve that the disassembly does have an allocation at the start of the memAllocRepro function:
(dlv) disassemble
TEXT main.memAllocRepro(SB) memalloc_debugging/main.go
main.go:10 0x44ce10 65488b0c2528000000 mov rcx, qword ptr gs:[0x28]
main.go:10 0x44ce19 488b8900000000 mov rcx, qword ptr [rcx]
main.go:10 0x44ce20 483b6110 cmp rsp, qword ptr [rcx+0x10]
main.go:10 0x44ce24 7662 jbe 0x44ce88
main.go:10 0x44ce26 4883ec18 sub rsp, 0x18
main.go:10 0x44ce2a 48896c2410 mov qword ptr [rsp+0x10], rbp
main.go:10 0x44ce2f 488d6c2410 lea rbp, ptr [rsp+0x10]
main.go:10 0x44ce34 488d0525880000 lea rax, ptr [rip+0x8825]
main.go:10 0x44ce3b 48890424 mov qword ptr [rsp], rax
=> main.go:10 0x44ce3f* e8bcebfbff call 0x40ba00 runtime.newobject
I must say though, once I hit that point, I couldn't easily dig further. I'm pretty sure it would be possible to find out at least which type is being allocated by looking at the structure pointed to by the RAX register, but I wasn't very successful at doing so. It's been a long time since I've read disassembly like this.
(dlv) regs
Rip = 0x000000000044ce3f
Rsp = 0x000000c042039f30
Rax = 0x0000000000455660
(...)
All that being said, I have two questions:
* Can anyone tell why there is a heap allocation there, and whether it's "expected"?
* How could I have gone further in my debugging session? Dumping the memory to hex gives a different address layout, and go tool objdump outputs disassembly, which mangles the content at the address location.
Full function dump with go tool objdump:
TEXT main.memAllocRepro(SB) memalloc_debugging/main.go
main.go:10 0x44ce10 65488b0c2528000000 MOVQ GS:0x28, CX
main.go:10 0x44ce19 488b8900000000 MOVQ 0(CX), CX
main.go:10 0x44ce20 483b6110 CMPQ 0x10(CX), SP
main.go:10 0x44ce24 7662 JBE 0x44ce88
main.go:10 0x44ce26 4883ec18 SUBQ $0x18, SP
main.go:10 0x44ce2a 48896c2410 MOVQ BP, 0x10(SP)
main.go:10 0x44ce2f 488d6c2410 LEAQ 0x10(SP), BP
main.go:10 0x44ce34 488d0525880000 LEAQ runtime.types+34656(SB), AX
main.go:10 0x44ce3b 48890424 MOVQ AX, 0(SP)
main.go:10 0x44ce3f e8bcebfbff CALL runtime.newobject(SB)
main.go:10 0x44ce44 488b7c2408 MOVQ 0x8(SP), DI
main.go:10 0x44ce49 488b442428 MOVQ 0x28(SP), AX
main.go:10 0x44ce4e 48894708 MOVQ AX, 0x8(DI)
main.go:10 0x44ce52 488b442430 MOVQ 0x30(SP), AX
main.go:10 0x44ce57 48894710 MOVQ AX, 0x10(DI)
main.go:10 0x44ce5b 8b052ff60600 MOVL runtime.writeBarrier(SB), AX
main.go:10 0x44ce61 85c0 TESTL AX, AX
main.go:10 0x44ce63 7517 JNE 0x44ce7c
main.go:10 0x44ce65 488b442420 MOVQ 0x20(SP), AX
main.go:10 0x44ce6a 488907 MOVQ AX, 0(DI)
main.go:16 0x44ce6d 48897c2438 MOVQ DI, 0x38(SP)
main.go:16 0x44ce72 488b6c2410 MOVQ 0x10(SP), BP
main.go:16 0x44ce77 4883c418 ADDQ $0x18, SP
main.go:16 0x44ce7b c3 RET
main.go:16 0x44ce7c 488b442420 MOVQ 0x20(SP), AX
main.go:10 0x44ce81 e86aaaffff CALL runtime.gcWriteBarrier(SB)
main.go:10 0x44ce86 ebe5 JMP 0x44ce6d
main.go:10 0x44ce88 e85385ffff CALL runtime.morestack_noctxt(SB)
main.go:10 0x44ce8d eb81 JMP main.memAllocRepro(SB)
Disassembly of the memory pointed to by the RAX register:
(dlv) disassemble -a 0x0000000000455660 0x0000000000455860
.:0 0x455660 1800 sbb byte ptr [rax], al
.:0 0x455662 0000 add byte ptr [rax], al
.:0 0x455664 0000 add byte ptr [rax], al
.:0 0x455666 0000 add byte ptr [rax], al
.:0 0x455668 0800 or byte ptr [rax], al
.:0 0x45566a 0000 add byte ptr [rax], al
.:0 0x45566c 0000 add byte ptr [rax], al
.:0 0x45566e 0000 add byte ptr [rax], al
.:0 0x455670 8e66f9 mov fs, word ptr [rsi-0x7]
.:0 0x455673 1b02 sbb eax, dword ptr [rdx]
.:0 0x455675 0808 or byte ptr [rax], cl
.:0 0x455677 17 ?
.:0 0x455678 60 ?
.:0 0x455679 0d4a000000 or eax, 0x4a
.:0 0x45567e 0000 add byte ptr [rax], al
.:0 0x455680 c01f47 rcr byte ptr [rdi], 0x47
.:0 0x455683 0000 add byte ptr [rax], al
.:0 0x455685 0000 add byte ptr [rax], al
.:0 0x455687 0000 add byte ptr [rax], al
.:0 0x455689 0c00 or al, 0x0
.:0 0x45568b 004062 add byte ptr [rax+0x62], al
.:0 0x45568e 0000 add byte ptr [rax], al
.:0 0x455690 c0684500 shr byte ptr [rax+0x45], 0x0
Escape analysis determines whether any references to a value escape the function in which the value is declared.
In Go, arguments are passed by value, typically on the stack; the stack is reclaimed at the end of the function. However, returning the reference &values from the memAllocRepro function gives the values parameter declared in memAllocRepro a lifetime beyond the end of the function. The values variable is moved to the heap.
memAllocRepro: &values: Alloc
./escape.go:3:6: cannot inline memAllocRepro: unhandled op FOR
./escape.go:7:9: &values escapes to heap
./escape.go:7:9: from ~r1 (return) at ./escape.go:7:2
./escape.go:3:37: moved to heap: values
The noAlloc1 function is inlined in the main function. The values argument, if necessary, is declared in and does not escape from the main function.
noAlloc1: &values: No Alloc
./escape.go:10:6: can inline noAlloc1 as: func([]int)*[]int{return &values}
./escape.go:23:10: inlining call to noAlloc1 func([]int)*[]int{return &values}
The noAlloc2 function returns its values argument by value, on the stack. No reference to values leaves the noAlloc2 function, and so there is no escape.
noAlloc2: values: No Alloc
package main

func memAllocRepro(values []int) *[]int {
	for {
		break
	}
	return &values
}

func noAlloc1(values []int) *[]int {
	return &values
}

func noAlloc2(values []int) []int {
	for {
		break
	}
	return values
}

func main() {
	memAllocRepro(nil)
	noAlloc1(nil)
	noAlloc2(nil)
}
Output:
$ go build -a -gcflags='-m -m' escape.go
# command-line-arguments
./escape.go:3:6: cannot inline memAllocRepro: unhandled op FOR
./escape.go:10:6: can inline noAlloc1 as: func([]int) *[]int { return &values }
./escape.go:14:6: cannot inline noAlloc2: unhandled op FOR
./escape.go:21:6: cannot inline main: non-leaf function
./escape.go:23:10: inlining call to noAlloc1 func([]int) *[]int { return &values }
./escape.go:7:9: &values escapes to heap
./escape.go:7:9: from ~r1 (return) at ./escape.go:7:2
./escape.go:3:37: moved to heap: values
./escape.go:11:9: &values escapes to heap
./escape.go:11:9: from ~r1 (return) at ./escape.go:11:2
./escape.go:10:32: moved to heap: values
./escape.go:14:31: leaking param: values to result ~r1 level=0
./escape.go:14:31: from ~r1 (return) at ./escape.go:18:2
./escape.go:23:10: main &values does not escape
$