How to reach peak FLOPS in Go

I'm an experienced C++ programmer, used to low-level optimization, and I'm trying to get performance out of Go.
So far, I'm interested in GFlop/s.
I wrote the following Go code:
package main
import (
"fmt"
"time"
"runtime"
"sync"
)
func expm1(x float64) float64 {
return ((((((((((((((15.0 + x) * x + 210.0) * x + 2730.0) * x + 32760.0) * x + 360360.0) * x + 3603600.0) * x + 32432400.0) * x + 259459200.0) * x + 1816214400.0) * x + 10897286400.0) * x + 54486432000.0) * x + 217945728000.0) *
x + 653837184000.0) * x + 1307674368000.0) * x * 7.6471637318198164759011319857881e-13;
}
func twelve(x float64) float64 {
return expm1( expm1( expm1( expm1( expm1( expm1( expm1( expm1( expm1( expm1( expm1( expm1(x))))))))))));
}
func populate(data []float64, N int) {
CPUCOUNT := runtime.NumCPU();
var wg sync.WaitGroup
var slice = N / CPUCOUNT;
wg.Add(CPUCOUNT)
defer wg.Wait()
for i := 0; i < CPUCOUNT; i++ {
go func(ii int) {
for j := ii * slice; j < ii * slice + slice; j += 1 {
data[j] = 0.1;
}
defer wg.Done();
}(i);
}
}
func apply(data []float64, N int) {
CPUCOUNT := runtime.NumCPU();
var wg sync.WaitGroup
var slice = N / CPUCOUNT;
wg.Add(CPUCOUNT)
defer wg.Wait()
for i := 0; i < CPUCOUNT; i++ {
go func(ii int) {
for j := ii * slice; j < ii * slice + slice; j += 8 {
data[j] = twelve(data[j]);
data[j+1] = twelve(data[j+1]);
data[j+2] = twelve(data[j+2]);
data[j+3] = twelve(data[j+3]);
data[j+4] = twelve(data[j+4]);
data[j+5] = twelve(data[j+5]);
data[j+6] = twelve(data[j+6]);
data[j+7] = twelve(data[j+7]);
}
defer wg.Done();
}(i);
}
}
func Run(data []float64, N int) {
populate(data, N);
start:= time.Now();
apply(data, N);
stop:= time.Now();
elapsed:=stop.Sub(start);
seconds := float64(elapsed.Milliseconds()) / 1000.0;
Gflop := float64(N) * 12.0 * 15.0E-9;
fmt.Printf("%f\n", Gflop / seconds);
}
func main() {
CPUCOUNT := runtime.NumCPU();
fmt.Printf("num procs : %d\n", CPUCOUNT);
N := 1024*1024*32 * CPUCOUNT;
data:= make([]float64, N);
for i := 0; i < 100; i++ {
Run(data, N);
}
}
which is an attempted translation of my C++ benchmark, which reaches 80% of peak flops.
The C++ version yields 95 GFlop/s, whereas the Go version yields 6 GFlop/s (counting each FMA as 1 flop).
Here is a piece of the Go assembly (gccgo -O3 -mfma -mavx2):
vfmadd132sd %xmm1, %xmm15, %xmm0
.loc 1 12 50
vfmadd132sd %xmm1, %xmm14, %xmm0
.loc 1 12 64
vfmadd132sd %xmm1, %xmm13, %xmm0
.loc 1 12 79
vfmadd132sd %xmm1, %xmm12, %xmm0
.loc 1 12 95
vfmadd132sd %xmm1, %xmm11, %xmm0
.loc 1 12 112
vfmadd132sd %xmm1, %xmm10, %xmm0
And here is what I get from my C++ code (g++ -fopenmp -mfma -mavx2 -O3):
vfmadd213pd .LC3(%rip), %ymm12, %ymm5
vfmadd213pd .LC3(%rip), %ymm11, %ymm4
vfmadd213pd .LC3(%rip), %ymm10, %ymm3
vfmadd213pd .LC3(%rip), %ymm9, %ymm2
vfmadd213pd .LC3(%rip), %ymm8, %ymm1
vfmadd213pd .LC3(%rip), %ymm15, %ymm0
vfmadd213pd .LC4(%rip), %ymm15, %ymm0
vfmadd213pd .LC4(%rip), %ymm14, %ymm7
vfmadd213pd .LC4(%rip), %ymm13, %ymm6
vfmadd213pd .LC4(%rip), %ymm12, %ymm5
vfmadd213pd .LC4(%rip), %ymm11, %ymm4
I therefore have a few questions, the most important of which is:
Do I express parallelism the right way?
And if not, how should I do that?
For additional performance improvements, I'd need to know what's wrong with the following items:
Why do I see only vfmadd132sd instructions in the assembly, instead of vfmadd132pd?
How can I properly align memory allocations?
How can I remove debug info from the generated executable?
Do I pass the right options to gccgo?
Do I use the right compiler?

Do I express parallelism the right way?
No. You might be thrashing the CPU cache. (But this is hard to tell without knowing details about your system. I guess it's not NUMA?) Anyway, technically your code is concurrent, not parallel.
Why do I see only vfmadd132sd instructions in the assembly, instead of vfmadd132pd?
Because the compiler put it there. Is this a compiler question or a programming question?
How can I properly align memory allocations?
That depends on your definition of "properly". Struct field and slice alignments cannot be controlled ad hoc, but you can reorder struct fields (which you did not use at all, so I do not know what you are asking here).
How can I remove debug info from the generated executable?
Consult the documentation of gcc.
Do I pass the right options to gccgo?
I do not know.
Do I use the right compiler ?
What makes a compiler "right"?
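For what it's worth, the goroutine fan-out from the question can be written a little more defensively. The following is only a sketch under my own assumptions: parallelApply and its parameters are hypothetical names, not anything from the question. It hands each goroutine its own contiguous sub-slice, tolerates lengths that are not a multiple of the worker count, and defers wg.Done() at the top of the goroutine. It addresses structure only and will not by itself make the compiler emit packed (pd) instructions.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parallelApply (hypothetical helper) splits data into contiguous chunks,
// at most one per worker, and applies f to every element in place.
// The last chunk may be shorter, so len(data) need not be a multiple
// of workers.
func parallelApply(data []float64, workers int, f func(float64) float64) {
	if workers < 1 {
		workers = 1
	}
	chunk := (len(data) + workers - 1) / workers // ceiling division
	var wg sync.WaitGroup
	for start := 0; start < len(data); start += chunk {
		end := start + chunk
		if end > len(data) {
			end = len(data)
		}
		wg.Add(1)
		go func(part []float64) {
			defer wg.Done() // deferred first, so it runs however the goroutine exits
			for i := range part {
				part[i] = f(part[i])
			}
		}(data[start:end])
	}
	wg.Wait()
}

func main() {
	data := make([]float64, 10)
	for i := range data {
		data[i] = float64(i)
	}
	parallelApply(data, runtime.NumCPU(), func(x float64) float64 { return 2 * x })
	fmt.Println(data)
}
```

Passing the sub-slice as an argument keeps the index arithmetic in one place and avoids each goroutine recomputing ii * slice bounds.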

Related

Why are bitwise operators slower than division and modulo in Go?

Usually I program in C and frequently use bitwise operators since they are faster. Now I encountered this timing difference by solving Project Euler Problem 14 while using bitwise operators or division and modulo. The program was compiled with go version go1.6.2.
Version with bitwise operators:
package main
import (
"fmt"
)
func main() {
var buf, longest, cnt, longest_start int
for i:=2; i<1e6; i++ {
buf = i
cnt = 0
for buf > 1 {
if (buf & 0x01) == 0 {
buf >>= 1
} else {
buf = buf * 3 + 1
}
cnt++
}
if cnt > longest {
longest = cnt
longest_start = i
}
}
fmt.Println(longest_start)
}
executing the program:
$ time ./prob14
837799
real 0m0.300s
user 0m0.301s
sys 0m0.000s
Version without bitwise operators (replacing & 0x01 with % 2 and >>= 1 with /=2):
for buf > 1 {
if (buf % 2) == 0 {
buf /= 2
} else {
buf = buf * 3 + 1
}
cnt++
}
executing the program:
$ time ./prob14
837799
real 0m0.273s
user 0m0.274s
sys 0m0.000s
Why is the version with the bitwise operators in Go slower?
(I also wrote a solution to the problem in C. There, the version with the bitwise operators was faster without an optimization flag; with -O3 the two are equal.)
EDIT
I did a benchmark as suggested in the comments.
package main
import (
"testing"
)
func Colatz(num int) {
cnt := 0
buf := num
for buf > 1 {
if (buf % 2) == 0 {
buf /= 2
} else {
buf = buf * 3 + 1
}
cnt++
}
}
func ColatzBitwise(num int) {
cnt := 0
buf := num
for buf > 1 {
if (buf & 0x01) == 0 {
buf >>= 1
} else {
buf = buf * 3 + 1
}
cnt++
}
}
func BenchmarkColatz(b *testing.B) {
for i:=0; i<b.N; i++ {
Colatz(837799)
}
}
func BenchmarkColatzBitwise(b *testing.B) {
for i:=0; i<b.N; i++ {
ColatzBitwise(837799)
}
}
Here are the benchmark results:
go test -bench=.
PASS
BenchmarkColatz-8 2000000 650 ns/op
BenchmarkColatzBitwise-8 2000000 609 ns/op
It turns out the bitwise version is faster in the benchmark.
EDIT 2
I changed the type of all variables in the functions to uint. Here is the benchmark:
go test -bench=.
PASS
BenchmarkColatz-8 3000000 516 ns/op
BenchmarkColatzBitwise-8 3000000 590 ns/op
The arithmetic version is now faster, as Marc has written in his answer. I will also test with a newer compiler version.
If they ever were, they aren't now.
There are a few problems with your approach:
you're using go1.6.2 which was released over 4 years ago
you're running a binary that does other things and running it just once
you're expecting bitshift and arithmetic operations on signed integers to be the same, they're not
Using go1.15 with micro benchmarks will show the bitwise operations to be faster. The main reason for this is that a bitwise shift and a division by two are absolutely not the same for signed integers: the bitwise shift doesn't care about the sign but the division has to preserve it.
If you want something closer to equivalent, use unsigned integers for your arithmetic operations; the compiler may then optimize the division to a single bitshift.
In go1.15 on my machine, I see the following being generated for each type of division by 2:
buf >>=1:
MOVQ AX, DX
SARQ $1, AX
buf /= 2 with var buf int:
MOVQ AX, DX
SHRQ $63, AX
ADDQ DX, AX
SARQ $1, AX
buf /= 2 with var buf uint:
MOVQ CX, BX
SHRQ $1, CX
Even then, all this must be taken with a large grain of salt: the generated code will depend massively on what else is happening and how the results are used.
But the basic rule applies: when performing arithmetic operations, the type matters a lot. Bitshift operators don't care about sign.
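To make the signed/unsigned point concrete, here is a small sketch (the helper names are mine, chosen for illustration). Go's integer division truncates toward zero, while a right shift on a signed integer is an arithmetic shift that rounds toward negative infinity, so the two disagree on negative odd values. That difference is why the compiler needs the extra instructions for buf /= 2 on an int.

```go
package main

import "fmt"

// halveDiv divides by two; Go integer division truncates toward zero.
func halveDiv(x int) int { return x / 2 }

// halveShift shifts right by one; for a signed int this is an arithmetic
// shift, which rounds toward negative infinity.
func halveShift(x int) int { return x >> 1 }

// halveDivU and halveShiftU are the unsigned counterparts, where division
// by two and a one-bit shift always agree.
func halveDivU(x uint) uint   { return x / 2 }
func halveShiftU(x uint) uint { return x >> 1 }

func main() {
	for _, x := range []int{7, -7, -1} {
		fmt.Printf("x=%d  x/2=%d  x>>1=%d\n", x, halveDiv(x), halveShift(x))
	}
	for _, x := range []uint{7, 8} {
		fmt.Printf("x=%d  x/2=%d  x>>1=%d\n", x, halveDivU(x), halveShiftU(x))
	}
}
```

For the Collatz loop this does not change the result, since buf is only halved when it is even and positive, but the compiler cannot generally know that.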

Am I doing execution timing measurement in Go in a useful way?

My code:
// repeat fib(n) 10000 times
i := 10000
var total_time time.Duration
for i > 0 {
// do fib(n) -> f0
start := time.Now()
for n > 0 {
f0, f1, n = f1, f0.Add(f0, f1), n-1
}
total_time = total_time + time.Since(start)
i--
}
// and divide total execution time by 10000
var normalized_time = total_time / 10000
fmt.Println(normalized_time)
The execution times I'm seeing are so extremely short that I am suspicious that what I've done isn't useful. If it's wrong, what am I doing wrong and how can I make it right?
what am I doing wrong and how can I make it right?
Use the Go testing package for benchmarks. For example:
Write the Fibonacci number computation as a function in your code.
fibonacci.go:
package main
import "fmt"
// fibonacci returns the Fibonacci number for 0 <= n <= 92.
// OEIS: A000045: Fibonacci numbers:
// F(n) = F(n-1) + F(n-2) with F(0) = 0 and F(1) = 1.
func fibonacci(n int) int64 {
if n < 0 {
panic("n < 0")
}
f := int64(0)
a, b := int64(0), int64(1)
for i := 0; i <= n; i++ {
if a < 0 {
panic("overflow")
}
f, a, b = a, b, a+b
}
return f
}
func main() {
for _, n := range []int{0, 1, 2, 3, 90, 91, 92} {
fmt.Printf("%-2d %d\n", n, fibonacci(n))
}
}
Playground: https://play.golang.org/p/FFdG4RlNpUZ
Output:
$ go run fibonacci.go
0 0
1 1
2 1
3 2
90 2880067194370816120
91 4660046610375530309
92 7540113804746346429
$
Write and run some benchmarks using the Go testing package.
fibonacci_test.go:
package main
import "testing"
func BenchmarkFibonacciN0(b *testing.B) {
for i := 0; i < b.N; i++ {
fibonacci(0)
}
}
func BenchmarkFibonacciN92(b *testing.B) {
for i := 0; i < b.N; i++ {
fibonacci(92)
}
}
Output:
$ go test fibonacci.go fibonacci_test.go -bench=. -benchmem
goos: linux
goarch: amd64
BenchmarkFibonacciN0-4 367003574 3.25 ns/op 0 B/op 0 allocs/op
BenchmarkFibonacciN92-4 17369262 63.0 ns/op 0 B/op 0 allocs/op
$
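One observation about the snippet in the question, as far as the posted code shows: n is counted down to 0 during the first pass and is never reset, so the remaining passes would time an empty loop, which may explain the tiny averages. Separately, if you want benchmark numbers from an ordinary program rather than from go test, testing.Benchmark can be called directly. The sketch below is an assumption-laden illustration: fib is a simplified stand-in for the function above, and sink is a package-level variable I added to discourage the compiler from discarding the result.

```go
package main

import (
	"fmt"
	"testing"
)

// fib is a simplified stand-in: iterative Fibonacci, valid for 0 <= n <= 92.
func fib(n int) int64 {
	a, b := int64(0), int64(1)
	for i := 0; i < n; i++ {
		a, b = b, a+b
	}
	return a
}

// sink is package-level so the result of each call is observably stored.
var sink int64

// benchFib lets the testing package pick b.N and reports the result.
func benchFib(n int) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = fib(n)
		}
	})
}

func main() {
	r := benchFib(92)
	fmt.Printf("fib(92) = %d, %d iterations, %d ns/op\n", sink, r.N, r.NsPerOp())
}
```

Note that the input is passed fresh on every iteration, so each pass does the same amount of work.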

Why does go benchmark show different result with the same code in different places?

I benchmarked the Go standard library package "math/bits". It is fast.
I then benchmarked the same code copied from "math/bits"; it is about 3 times slower.
I wonder what the difference is between user code and standard library code when compiling, linking, or benchmarking?
// x_test.go
package x_test
import (
"math/bits"
"testing"
)
// copied from "math/bits"
const DeBruijn64 = 0x03f79d71b4ca8b09
var Input uint64 = DeBruijn64
var Output int
const m0 = 0x5555555555555555 // 01010101 ...
const m1 = 0x3333333333333333 // 00110011 ...
const m2 = 0x0f0f0f0f0f0f0f0f // 00001111 ...
const m3 = 0x00ff00ff00ff00ff // etc.
const m4 = 0x0000ffff0000ffff
func OnesCount64(x uint64) int {
const m = 1<<64 - 1
x = x>>1&(m0&m) + x&(m0&m)
x = x>>2&(m1&m) + x&(m1&m)
x = (x>>4 + x) & (m2 & m)
x += x >> 8
x += x >> 16
x += x >> 32
return int(x) & (1<<7 - 1)
}
// copied from "math/bits" END
func BenchmarkMine(b *testing.B) {
var s int
for i := 0; i < b.N; i++ {
s += OnesCount64(uint64(i))
}
Output = s
}
func BenchmarkGo(b *testing.B) {
var s int
for i := 0; i < b.N; i++ {
s += bits.OnesCount64(uint64(i))
}
Output = s
}
Running it shows different results:
go test x_test.go -bench=.
goos: darwin
goarch: amd64
BenchmarkMine-4 500000000 3.32 ns/op
BenchmarkGo-4 2000000000 0.96 ns/op
The two benchmarks should give similar results, but they do not.
After digging into the Go source code I found that during compilation Go replaces math/bits.OnesCount64 with an instruction-based implementation:
go/src/cmd/compile/internal/gc/ssa.go:3428 :makeOnesCountAMD64.
So when calling math/bits.OnesCount64, it does not actually run the Go code in math/bits.
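A quick way to convince yourself that only the speed differs, not the result, is to compare the copied pure-Go algorithm against the library call on a few inputs. This is just a sketch; portableOnesCount64 and agree are names I made up for the copied code and the check.

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	m0 = 0x5555555555555555 // 01010101 ...
	m1 = 0x3333333333333333 // 00110011 ...
	m2 = 0x0f0f0f0f0f0f0f0f // 00001111 ...
)

// portableOnesCount64 is the pure-Go population count quoted in the question.
func portableOnesCount64(x uint64) int {
	const m = 1<<64 - 1
	x = x>>1&(m0&m) + x&(m0&m)
	x = x>>2&(m1&m) + x&(m1&m)
	x = (x>>4 + x) & (m2 & m)
	x += x >> 8
	x += x >> 16
	x += x >> 32
	return int(x) & (1<<7 - 1)
}

// agree reports whether the copied code and bits.OnesCount64 return the
// same count for every input.
func agree(inputs []uint64) bool {
	for _, x := range inputs {
		if portableOnesCount64(x) != bits.OnesCount64(x) {
			return false
		}
	}
	return true
}

func main() {
	inputs := []uint64{0, 1, 0xff, 0x03f79d71b4ca8b09, 1<<64 - 1}
	fmt.Println(agree(inputs))
}
```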

Go: make slice is a little bit slower than []int{1,1,1,1}

I'm working on a program that allocates lots of []int slices of length 4, 3, or 2,
and found that a := []int{1,1,1} is a little bit faster than a := make([]int,3); a[0] = 1; a[1] = 1; a[2] = 1.
My question: why is a := []int{1,1,1} faster than a := make([]int,3); a[0] = 1; a[1] = 1; a[2] = 1?
func BenchmarkMake(b *testing.B) {
var array []int
for i := 0; i < b.N; i++ {
array = make([]int, 4)
array[0] = 1
array[1] = 1
array[2] = 1
array[3] = 1
}
}
func BenchmarkDirect(b *testing.B) {
var array []int
for i := 0; i < b.N; i++ {
array = []int{1, 1, 1, 1}
}
array[0] = 1
}
BenchmarkMake-4 50000000 34.3 ns/op
BenchmarkDirect-4 50000000 33.8 ns/op
Let's look at the benchmark output of the following code:
package main
import "testing"
func BenchmarkMake(b *testing.B) {
var array []int
for i := 0; i < b.N; i++ {
array = make([]int, 4)
array[0] = 1
array[1] = 1
array[2] = 1
array[3] = 1
}
}
func BenchmarkDirect(b *testing.B) {
var array []int
for i := 0; i < b.N; i++ {
array = []int{1, 1, 1, 1}
}
array[0] = 1
}
func BenchmarkArray(b *testing.B) {
var array [4]int
for i := 0; i < b.N; i++ {
array = [4]int{1, 1, 1, 1}
}
array[0] = 1
}
Usually the output looks like this:
$ go test -bench . -benchmem -o alloc_test -cpuprofile cpu.prof
goos: linux
goarch: amd64
pkg: test
BenchmarkMake-8 30000000 61.3 ns/op 32 B/op 1 allocs/op
BenchmarkDirect-8 20000000 60.2 ns/op 32 B/op 1 allocs/op
BenchmarkArray-8 1000000000 2.56 ns/op 0 B/op 0 allocs/op
PASS
ok test 6.003s
The difference is so small that it can be the opposite in some circumstances.
Let's look at the profiling data:
$ go tool pprof -list 'Benchmark.*' cpu.prof
ROUTINE ======================== test.BenchmarkMake in /home/grzesiek/go/src/test/alloc_test.go
260ms 1.59s (flat, cum) 24.84% of Total
. . 5:func BenchmarkMake(b *testing.B) {
. . 6: var array []int
40ms 40ms 7: for i := 0; i < b.N; i++ {
50ms 1.38s 8: array = make([]int, 4)
. . 9: array[0] = 1
130ms 130ms 10: array[1] = 1
20ms 20ms 11: array[2] = 1
20ms 20ms 12: array[3] = 1
. . 13: }
. . 14:}
ROUTINE ======================== test.BenchmarkDirect in /home/grzesiek/go/src/test/alloc_test.go
90ms 1.66s (flat, cum) 25.94% of Total
. . 16:func BenchmarkDirect(b *testing.B) {
. . 17: var array []int
10ms 10ms 18: for i := 0; i < b.N; i++ {
80ms 1.65s 19: array = []int{1, 1, 1, 1}
. . 20: }
. . 21: array[0] = 1
. . 22:}
ROUTINE ======================== test.BenchmarkArray in /home/grzesiek/go/src/test/alloc_test.go
2.86s 2.86s (flat, cum) 44.69% of Total
. . 24:func BenchmarkArray(b *testing.B) {
. . 25: var array [4]int
500ms 500ms 26: for i := 0; i < b.N; i++ {
2.36s 2.36s 27: array = [4]int{1, 1, 1, 1}
. . 28: }
. . 29: array[0] = 1
. . 30:}
We can see that the assignments take some time.
To learn why, we need to look at the assembly code.
$ go tool pprof -disasm 'BenchmarkMake' cpu.prof
. . 4eda93: MOVQ AX, 0(SP) ;alloc_test.go:8
30ms 30ms 4eda97: MOVQ $0x4, 0x8(SP) ;test.BenchmarkMake alloc_test.go:8
. . 4edaa0: MOVQ $0x4, 0x10(SP) ;alloc_test.go:8
10ms 1.34s 4edaa9: CALL runtime.makeslice(SB) ;test.BenchmarkMake alloc_test.go:8
. . 4edaae: MOVQ 0x18(SP), AX ;alloc_test.go:8
10ms 10ms 4edab3: MOVQ 0x20(SP), CX ;test.BenchmarkMake alloc_test.go:8
. . 4edab8: TESTQ CX, CX ;alloc_test.go:9
. . 4edabb: JBE 0x4edb0b
. . 4edabd: MOVQ $0x1, 0(AX)
130ms 130ms 4edac4: CMPQ $0x1, CX ;test.BenchmarkMake alloc_test.go:10
. . 4edac8: JBE 0x4edb04 ;alloc_test.go:10
. . 4edaca: MOVQ $0x1, 0x8(AX)
20ms 20ms 4edad2: CMPQ $0x2, CX ;test.BenchmarkMake alloc_test.go:11
. . 4edad6: JBE 0x4edafd ;alloc_test.go:11
. . 4edad8: MOVQ $0x1, 0x10(AX)
. . 4edae0: CMPQ $0x3, CX ;alloc_test.go:12
. . 4edae4: JA 0x4eda65
We can see that the time is taken by the CMPQ instruction, which compares a constant with the CX register. The CX register holds the value copied from the stack after the call to make. We can deduce that it must be the length of the slice, while AX holds the pointer to the underlying array. You can also see that the first bounds check was optimized.
Conclusions
Allocations take the same time, but the assignments cost extra due to the slice bounds checks (as noticed by Terry Pang).
Using an array instead of a slice is much cheaper, as it saves the allocation.
Why is using an array so much cheaper?
In Go an array is basically a fixed-size chunk of memory. [1]int is basically the same thing as int. You can find more in the "Go Slices: usage and internals" article.
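The allocation difference can also be checked without a profiler. The sketch below uses testing.AllocsPerRun from an ordinary program; the sink variables and function names are my own additions, and storing into package-level variables is an assumption made so that the slice has to live on the heap. With the standard gc toolchain I would expect one allocation per run for the slice literal and none for the array, but treat the exact numbers as compiler-dependent.

```go
package main

import (
	"fmt"
	"testing"
)

// Package-level sinks: assigning to them forces the slice's backing array
// to escape, while the array is simply copied by value.
var (
	sinkSlice []int
	sinkArray [4]int
)

// sliceAllocs reports the average number of heap allocations per slice literal.
func sliceAllocs() float64 {
	return testing.AllocsPerRun(1000, func() {
		sinkSlice = []int{1, 1, 1, 1}
	})
}

// arrayAllocs reports the average number of heap allocations per array literal.
func arrayAllocs() float64 {
	return testing.AllocsPerRun(1000, func() {
		sinkArray = [4]int{1, 1, 1, 1}
	})
}

func main() {
	fmt.Println("slice allocs/run:", sliceAllocs())
	fmt.Println("array allocs/run:", arrayAllocs())
}
```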

Go: multiple len() calls vs performance?

At the moment I am implementing some sorting algorithms. As is in the nature of such algorithms, there are a lot of queries for the length of some arrays/slices using the built-in len() function.
Now, given the following code for (part of) the Mergesort algorithm:
for len(left) > 0 || len(right) > 0 {
if len(left) > 0 && len(right) > 0 {
if left[0] <= right[0] {
result = append(result, left[0])
left = left[1:len(left)]
} else {
result = append(result, right[0])
right = right[1:len(right)]
}
} else if len(left) > 0 {
result = append(result, left[0])
left = left[1:len(left)]
} else if len(right) > 0 {
result = append(result, right[0])
right = right[1:len(right)]
}
}
My question is: Do these multiple len() calls affect the performance of the algorithm negatively? Is it better to make a temporary variable for the length of the right and left slice? Or does the compiler do this itself?
There are two cases:
Local slice: length will be cached and there is no overhead
Global slice or passed (by reference): length cannot be cached and there is overhead
No overhead for local slices
For locally defined slices the length is cached, so there is no runtime overhead. You can see this in the assembly of the following program:
func generateSlice(x int) []int {
return make([]int, x)
}
func main() {
x := generateSlice(10)
println(len(x))
}
Compiled with go tool 6g -S test.go this yields, amongst other things, the following lines:
MOVQ "".x+40(SP),BX
MOVQ BX,(SP)
// ...
CALL ,runtime.printint(SB)
What happens here is that the first line loads the length of x from its stack slot at offset 40 from SP and, most importantly, caches this value in BX, which is then used for every occurrence of len(x). The reason for the offset is that a slice is represented by the following structure (source):
typedef struct
{ // must not move anything
uchar array[8]; // pointer to data
uchar nel[4]; // number of elements
uchar cap[4]; // allocated number of elements
} Array;
nel is what is accessed by len(). You can see this in the code generation as well.
Global and referenced slices have overhead
For shared values caching of the length is not possible since the compiler has to assume that the slice changes between calls. Therefore the compiler has to write code that accesses the length attribute directly every time. Example:
func accessLocal() int {
a := make([]int, 1000) // local
count := 0
for i := 0; i < len(a); i++ {
count += len(a)
}
return count
}
var ag = make([]int, 1000) // pseudo-code
func accessGlobal() int {
count := 0
for i := 0; i < len(ag); i++ {
count += len(ag)
}
return count
}
Comparing the assembly of both functions yields the crucial difference that as soon as the variable is global the access to the nel attribute is not cached anymore and there will be a runtime overhead:
// accessLocal
MOVQ "".a+8048(SP),SI // cache length in SI
// ...
CMPQ SI,AX // i < len(a)
// ...
MOVQ SI,BX
ADDQ CX,BX
MOVQ BX,CX // count += len(a)
// accessGlobal
MOVQ "".ag+8(SB),BX
CMPQ BX,AX // i < len(ag)
// ...
MOVQ "".ag+8(SB),BX
ADDQ CX,BX
MOVQ BX,CX // count += len(ag)
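If the repeated load does turn out to matter for a package-level slice, the usual workaround is to copy the length into a local variable once. A minimal sketch, with function names of my own choosing; both versions compute the same value, and whether the hoisted one is measurably faster depends on the compiler version.

```go
package main

import "fmt"

// ag is a package-level slice, as in the example above.
var ag = make([]int, 1000)

// sumLenGlobal reads len(ag) in the loop condition and in the body.
func sumLenGlobal() int {
	count := 0
	for i := 0; i < len(ag); i++ {
		count += len(ag)
	}
	return count
}

// sumLenHoisted reads the length once into a local variable.
func sumLenHoisted() int {
	n := len(ag) // local copy; the compiler is free to keep it in a register
	count := 0
	for i := 0; i < n; i++ {
		count += n
	}
	return count
}

func main() {
	fmt.Println(sumLenGlobal(), sumLenHoisted())
}
```

Hoisting is only valid when nothing else changes the slice during the loop, which is exactly the guarantee the compiler lacks for a global.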
Despite the good answers you are getting, I'm seeing poorer performance when calling len(a) constantly, for example in this test: http://play.golang.org/p/fiP1Sy2Hfk
package main
import "testing"
func BenchmarkTest1(b *testing.B) {
a := make([]int, 1000)
for i := 0; i < b.N; i++ {
count := 0
for i := 0; i < len(a); i++ {
count += len(a)
}
}
}
func BenchmarkTest2(b *testing.B) {
a := make([]int, 1000)
for i := 0; i < b.N; i++ {
count := 0
lena := len(a)
for i := 0; i < lena; i++ {
count += lena
}
}
}
When run as go test -bench=. I get:
BenchmarkTest1 5000000 668 ns/op
BenchmarkTest2 5000000 402 ns/op
So there is clearly a penalty here, possibly because the compiler optimizes this case less well at compile time.
Hopefully things have improved in the latest version of Go:
go version go1.16.7 linux/amd64
goos: linux
goarch: amd64
pkg: 001_test
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
BenchmarkTest1-8 4903609 228.8 ns/op
BenchmarkTest2-8 5280086 229.9 ns/op

Resources