Which is the most efficient nil value? - performance

Hi, while doing some exercises I've come across this question...
Let's say you have a map with a capacity of 100,000.
Which value is the most efficient to fill the whole map with in the least amount of time?
I've run some benchmarks on my own, trying out most of the types I could think of, and the resulting top list is:
Benchmark_Struct-8 200 6010422 ns/op (struct{}{})
Benchmark_Byte-8 200 6167230 ns/op (byte = 0)
Benchmark_Int-8 200 6112927 ns/op (int8 = 0)
Benchmark_Bool-8 200 6117155 ns/op (bool = false)
Example function:
func Struct() {
    m := make(map[int]struct{}, 100000)
    for i := 0; i < 100000; i++ {
        m[i] = struct{}{}
    }
}
As you can see, the fastest one (most of the time) is struct{}{}, the empty struct.
But why is this the case in Go?
Is there a faster/lighter nil or non-nil value?
- Thank you for your time :)

Theoretically, struct{}{} should be the most efficient because it requires no memory. In practice, a) results may vary between Go versions, operating systems, and system architectures; and b) I can't think of any case where maximizing the execution-time efficiency of empty values is relevant.
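For context, the reason struct{}{} comes up at all is the common idiom of using map[T]struct{} as a set: the empty struct occupies zero bytes, so the map effectively stores only its keys. A minimal sketch of that idiom (the variable names are just illustrative, not from the question):

package main

import "fmt"

func main() {
    // map[string]struct{} used as a set: the empty-struct values take no space.
    seen := make(map[string]struct{}, 3)
    for _, w := range []string{"a", "b", "a"} {
        seen[w] = struct{}{}
    }
    _, ok := seen["b"]          // membership test with the comma-ok form
    fmt.Println(len(seen), ok)  // 2 true
}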

Related

Measure heap growth accurately

I am trying to measure the evolution of the number of heap-allocated objects before and after I call a function. I am forcing runtime.GC() and using runtime.ReadMemStats to measure the number of heap objects I have before and after.
The problem I have is that I sometimes see unexpected heap growth. And it is different after each run.
A simple example is below, where I would always expect to see zero heap-object growth.
https://go.dev/play/p/FBWfXQHClaG
package main

import (
    "log"
    "runtime"
)

var mem1_before, mem2_before, mem1_after, mem2_after runtime.MemStats

func measure_nothing(before, after *runtime.MemStats) {
    runtime.GC()
    runtime.ReadMemStats(before)
    runtime.GC()
    runtime.ReadMemStats(after)
}

func main() {
    measure_nothing(&mem1_before, &mem1_after)
    measure_nothing(&mem2_before, &mem2_after)
    log.Printf("HeapObjects diff = %d", int64(mem1_after.HeapObjects-mem1_before.HeapObjects))
    log.Printf("HeapAlloc diff %d", int64(mem1_after.HeapAlloc-mem1_before.HeapAlloc))
    log.Printf("HeapObjects diff = %d", int64(mem2_after.HeapObjects-mem2_before.HeapObjects))
    log.Printf("HeapAlloc diff %d", int64(mem2_after.HeapAlloc-mem2_before.HeapAlloc))
}
Sample output:
2009/11/10 23:00:00 HeapObjects diff = 0
2009/11/10 23:00:00 HeapAlloc diff 0
2009/11/10 23:00:00 HeapObjects diff = 4
2009/11/10 23:00:00 HeapAlloc diff 1864
Is what I'm trying to do impractical? I assume the runtime is doing things that allocate/free heap memory. Can I tell it to stop to make my measurements? (This is for a test checking for memory leaks, not production code.)
You can't predict what garbage collection and reading all the memory stats require in the background. Calling those to calculate memory allocations and usage is not a reliable way to measure.
Luckily for us, Go's testing framework can monitor and calculate memory usage.
So what you should do is write a benchmark function and let the testing framework do its job to report memory allocations and usage.
Let's assume we want to measure this foo() function:
var x []int64

func foo(allocs, size int) {
    for i := 0; i < allocs; i++ {
        x = make([]int64, size)
    }
}
All it does is allocate a slice of the given size, and it does this the given number of times (allocs).
Let's write benchmarking functions for different scenarios:
func BenchmarkFoo_0_0(b *testing.B) {
    for i := 0; i < b.N; i++ {
        foo(0, 0)
    }
}

func BenchmarkFoo_1_1(b *testing.B) {
    for i := 0; i < b.N; i++ {
        foo(1, 1)
    }
}

func BenchmarkFoo_2_2(b *testing.B) {
    for i := 0; i < b.N; i++ {
        foo(2, 2)
    }
}
Running the benchmark with go test -bench . -benchmem, the output is:
BenchmarkFoo_0_0-8 1000000000 0.3204 ns/op 0 B/op 0 allocs/op
BenchmarkFoo_1_1-8 67101626 16.58 ns/op 8 B/op 1 allocs/op
BenchmarkFoo_2_2-8 27375050 42.42 ns/op 32 B/op 2 allocs/op
As you can see, the allocations per function call are the same as what we pass in the allocs argument. The allocated memory is the expected allocs * size * 8 bytes.
Note that the reported allocations per op is an integer value (it's the result of an integer division), so if the benchmarked function only occasionally allocates, it might not be reported in the integer result. For details, see Output from benchmem.
Like in this example:
var x []int64

func bar() {
    if rand.Float64() < 0.3 {
        x = make([]int64, 10)
    }
}
This bar() function does 1 allocation with 30% probability (and none with 70% probability), which means on average it does 0.3 allocations. Benchmarking it:
func BenchmarkBar(b *testing.B) {
    for i := 0; i < b.N; i++ {
        bar()
    }
}
Output is:
BenchmarkBar-8 38514928 29.60 ns/op 24 B/op 0 allocs/op
We can see that 24 bytes are allocated per op (0.3 * 10 * 8 bytes), which is correct, but the reported allocations per op is 0.
Luckily for us, we can also benchmark a function from our main app using the testing.Benchmark() function. It returns a testing.BenchmarkResult including all details about memory usage. We have access to the total number of allocations and to the number of iterations, so we can calculate allocations per op using floating point numbers:
func main() {
    rand.Seed(time.Now().UnixNano())
    tr := testing.Benchmark(BenchmarkBar)
    fmt.Println("Allocs/op", tr.AllocsPerOp())
    fmt.Println("B/op", tr.AllocedBytesPerOp())
    fmt.Println("Precise allocs/op:", float64(tr.MemAllocs)/float64(tr.N))
}
This will output:
Allocs/op 0
B/op 24
Precise allocs/op: 0.3000516369276302
We can see the expected ~0.3 allocations per op.
Now if we go ahead and benchmark your measure_nothing() function:
func BenchmarkNothing(b *testing.B) {
    for i := 0; i < b.N; i++ {
        measure_nothing(&mem1_before, &mem1_after)
    }
}
We get this output:
Allocs/op 0
B/op 11
Precise allocs/op: 0.12182030338389732
As you can see, running the garbage collector twice and reading memory stats twice occasionally needs allocation (~1 out of 10 calls: 0.12 times on average).
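As a side note, since the goal here is a leak check in tests: the standard testing package also provides testing.AllocsPerRun, which averages allocations over many calls to a function. A minimal sketch (the sink variable is just illustrative, to force the allocation to escape to the heap):

package main

import (
    "fmt"
    "testing"
)

var sink []int64

func main() {
    // AllocsPerRun calls the function 1000 times and returns the average
    // number of allocations per call. The result is always an integral
    // value, so fractional rates like 0.3 allocs/call still show up as 0.
    avg := testing.AllocsPerRun(1000, func() {
        sink = make([]int64, 10) // escapes to the heap via the package-level sink
    })
    fmt.Println("allocs per call:", avg) // 1
}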

How to use channels efficiently [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
I read in Uber's style guide that one should use a channel buffer size of at most 1.
Although it's clear to me that using a channel size of 100 or 1000 is very bad practice, I was however wondering why a channel size of 10 isn't considered a valid option. I'm missing some part to get to the right conclusion.
Below, you can follow my arguments (and counter-arguments), backed by some benchmark tests.
I understand that if both goroutines responsible for writing to and reading from this channel are interrupted between sequential writes/reads by some other IO action, no gain is to be expected from a larger channel buffer, and I agree that 1 is the best option in that case.
But let's say that there is no significant other goroutine switching needed apart from the implicit locking and unlocking caused by writing/reading to/from the channel. Then I would conclude the following:
Consider the number of context switches when processing 100 values on a channel with a buffer of either size 1 or size 10 (GR = goroutine):
Buffer=1: (GR1 inserts 1 value, GR2 reads 1 value) x 100 ~ 200 goroutine switches
Buffer=10: (GR1 inserts 10 values, GR2 reads 10 values) x 10 ~ 20 goroutine switches
I did some benchmarking to prove that this actually goes faster:
package main

import (
    "testing"
)

type a struct {
    b [100]int64
}

func BenchmarkBuffer1(b *testing.B) {
    count := 0
    c := make(chan a, 1)
    go func() {
        for i := 0; i < b.N; i++ {
            c <- a{}
        }
        close(c)
    }()
    for v := range c {
        for i := range v.b {
            count += i
        }
    }
}

func BenchmarkBuffer10(b *testing.B) {
    count := 0
    c := make(chan a, 10)
    go func() {
        for i := 0; i < b.N; i++ {
            c <- a{}
        }
        close(c)
    }()
    for v := range c {
        for i := range v.b {
            count += i
        }
    }
}
Results when comparing simple reading & writing + non-blocking processing:
BenchmarkBuffer1-12 5072902 266 ns/op
BenchmarkBuffer10-12 6029602 179 ns/op
PASS
BenchmarkBuffer1-12 5228782 256 ns/op
BenchmarkBuffer10-12 5392410 216 ns/op
PASS
BenchmarkBuffer1-12 4806208 287 ns/op
BenchmarkBuffer10-12 4637842 233 ns/op
PASS
However, if I add a sleep every 10 reads, it doesn't yield any better results.
import (
    "testing"
    "time"
)

func BenchmarkBuffer1WithSleep(b *testing.B) {
    count := 0
    c := make(chan int, 1)
    go func() {
        for i := 0; i < b.N; i++ {
            c <- i
        }
        close(c)
    }()
    for a := range c {
        count++
        if count%10 == 0 {
            time.Sleep(time.Duration(a) * time.Nanosecond)
        }
    }
}

func BenchmarkBuffer10WithSleep(b *testing.B) {
    count := 0
    c := make(chan int, 10)
    go func() {
        for i := 0; i < b.N; i++ {
            c <- i
        }
        close(c)
    }()
    for a := range c {
        count++
        if count%10 == 0 {
            time.Sleep(time.Duration(a) * time.Nanosecond)
        }
    }
}
Results when adding a sleep every 10 reads:
BenchmarkBuffer1WithSleep-12 856886 53219 ns/op
BenchmarkBuffer10WithSleep-12 929113 56939 ns/op
FYI: I also did the test again with only one CPU and got the following results:
BenchmarkBuffer1 5831193 207 ns/op
BenchmarkBuffer10 6226983 180 ns/op
BenchmarkBuffer1WithSleep 556635 35510 ns/op
BenchmarkBuffer10WithSleep 984472 61434 ns/op
Absolutely nothing is wrong with a channel of cap 500, e.g. if this channel is used as a semaphore.
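To illustrate that use, here is a minimal sketch of the classic counting-semaphore pattern, where the buffer capacity is exactly the point of the channel (the capacity of 3 is just illustrative; it could be 500):

package main

import (
    "fmt"
    "sync"
)

func main() {
    // A buffered channel used as a counting semaphore: its capacity
    // bounds how many goroutines run at once.
    sem := make(chan struct{}, 3)
    var wg sync.WaitGroup

    for i := 0; i < 10; i++ {
        wg.Add(1)
        sem <- struct{}{} // acquire a slot; blocks while 3 workers are busy
        go func(n int) {
            defer wg.Done()
            defer func() { <-sem }() // release the slot
            fmt.Println("working on", n)
        }(i)
    }
    wg.Wait()
}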
The style guide you read recommends not using buffered channels of, let's say, cap 64 "because this looks like a nice number". But this recommendation is not about performance! (Btw: your microbenchmarks are useless microbenchmarks; they do not measure anything relevant.)
An unbuffered channel is a kind of synchronisation primitive and as such very useful.
A buffered channel, well, may buffer between sender and receiver, and this buffering can be problematic for observing, tuning and debugging the code (because creation and consumption are further decoupled). That's why the style guide recommends unbuffered channels (or at most a cap of 1, as this is sometimes needed for correctness!).
It also doesn't prohibit larger buffer caps:
Any other [than 0 or 1] size must be subject to a high level of scrutiny. Consider how the size is determined, what prevents the channel from filling up under load and blocking writers, and what happens when this occurs. [emph. mine]
You may use a cap of 27 if you can explain why 27 (and not 22 or 31) and how this will influence program behaviour (not only performance!) if the buffer is filled.
Most people overrate performance. Correctness, operational stability and maintainability come first. And this is what this style guide is about here.

How to check if []byte is all zeros in go

Is there a way to check if a byte slice is empty or 0 without checking each element or using reflect?
theByteVar := make([]byte, 128)
if "theByteVar is empty or zeroes" {
    doSomething()
}
One solution which seems weird that I found was to keep an empty byte array for comparison.
theByteVar := make([]byte, 128)
emptyByteVar := make([]byte, 128)

// fill with anything
theByteVar[1] = 2

if reflect.DeepEqual(theByteVar, emptyByteVar) == false {
    doSomething(theByteVar)
}
For sure there must be a better/quicker solution.
Thanks
UPDATE: I did some comparisons over 1000 loops and the reflect way is the worst by far...
Equal Loops: 1000 in true in 19.197µs
Contains Loops: 1000 in true in 34.507µs
AllZero Loops: 1000 in true in 117.275µs
Reflect Loops: 1000 in true in 14.616277ms
Comparing it with another slice containing only zeros requires reading (and comparing) 2 slices.
Using a single for loop will be more efficient here:
for _, v := range theByteVar {
    if v != 0 {
        doSomething(theByteVar)
        break
    }
}
If you do need to use it in multiple places, wrap it in a utility function:
func allZero(s []byte) bool {
    for _, v := range s {
        if v != 0 {
            return false
        }
    }
    return true
}
And then using it:
if !allZero(theByteVar) {
    doSomething(theByteVar)
}
Another solution borrows an idea from C. It can be achieved by using the unsafe package in Go.
The idea is simple: instead of checking each byte of the []byte one at a time, we can check data[i:i+8] as a single uint64 value at each step. This way we check 8 bytes per iteration instead of one.
The code below is not best practice, but it shows the idea.
const (
    len8 int = 0xFFFFFFF8
)

func IsAllBytesZero(data []byte) bool {
    n := len(data)
    // Magic to get the largest length which can be divided by 8.
    nlen8 := n & len8
    i := 0

    // Check 8 bytes per iteration by reinterpreting them as a uint64.
    // i is a byte offset and advances by 8 each step.
    for ; i < nlen8; i += 8 {
        b := *(*uint64)(unsafe.Pointer(uintptr(unsafe.Pointer(&data[0])) + uintptr(i)))
        if b != 0 {
            return false
        }
    }

    // Check the remaining tail bytes one by one.
    for ; i < n; i++ {
        if data[i] != 0 {
            return false
        }
    }

    return true
}
Benchmark
Testcases:
Only test for worst cases (all elements are zero)
Methods:
IsAllBytesZero: the unsafe package solution
NaiveCheckAllBytesAreZero: a loop that iterates over the whole byte array and checks each byte
CompareAllBytesWithFixedEmptyArray: the bytes.Compare solution with a pre-allocated, fixed-size empty byte array
CompareAllBytesWithDynamicEmptyArray: the bytes.Compare solution without a pre-allocated, fixed-size empty byte array
Results
BenchmarkIsAllBytesZero10-8 254072224 4.68 ns/op
BenchmarkIsAllBytesZero100-8 132266841 9.09 ns/op
BenchmarkIsAllBytesZero1000-8 19989015 55.6 ns/op
BenchmarkIsAllBytesZero10000-8 2344436 507 ns/op
BenchmarkIsAllBytesZero100000-8 1727826 679 ns/op
BenchmarkNaiveCheckAllBytesAreZero10-8 234153582 5.15 ns/op
BenchmarkNaiveCheckAllBytesAreZero100-8 30038720 38.2 ns/op
BenchmarkNaiveCheckAllBytesAreZero1000-8 4300405 291 ns/op
BenchmarkNaiveCheckAllBytesAreZero10000-8 407547 2666 ns/op
BenchmarkNaiveCheckAllBytesAreZero100000-8 43382 27265 ns/op
BenchmarkCompareAllBytesWithFixedEmptyArray10-8 415171356 2.71 ns/op
BenchmarkCompareAllBytesWithFixedEmptyArray100-8 218871330 5.51 ns/op
BenchmarkCompareAllBytesWithFixedEmptyArray1000-8 56569351 21.0 ns/op
BenchmarkCompareAllBytesWithFixedEmptyArray10000-8 6592575 177 ns/op
BenchmarkCompareAllBytesWithFixedEmptyArray100000-8 567784 2104 ns/op
BenchmarkCompareAllBytesWithDynamicEmptyArray10-8 64215448 19.8 ns/op
BenchmarkCompareAllBytesWithDynamicEmptyArray100-8 32875428 35.4 ns/op
BenchmarkCompareAllBytesWithDynamicEmptyArray1000-8 8580890 140 ns/op
BenchmarkCompareAllBytesWithDynamicEmptyArray10000-8 1277070 938 ns/op
BenchmarkCompareAllBytesWithDynamicEmptyArray100000-8 121256 10355 ns/op
Summary
Assuming we're talking about mostly-zero byte arrays (the worst case, where the whole slice has to be scanned): according to the benchmark, if performance is an issue, the naive check solution would be a bad idea. And if you don't want to use the unsafe package in your project, consider using the bytes.Compare solution with a pre-allocated empty array as an alternative.
An interesting point worth noting is that the performance of the unsafe-package approach varies a lot, but it basically outperforms all the other solutions mentioned above. I think this is related to the CPU cache mechanism.
You can possibly use bytes.Equal or bytes.Contains to compare with a zero-initialized byte slice; see https://play.golang.org/p/mvUXaTwKjP. I haven't checked the performance, but hopefully it's been optimized. You might want to try out other solutions and compare the performance numbers, if needed.
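A minimal sketch of that suggestion (variable names are just illustrative):

package main

import (
    "bytes"
    "fmt"
)

func main() {
    theByteVar := make([]byte, 128)

    // Compare against a freshly zeroed slice of the same length.
    zero := make([]byte, len(theByteVar))
    fmt.Println(bytes.Equal(theByteVar, zero)) // true

    theByteVar[1] = 2
    fmt.Println(bytes.Equal(theByteVar, zero)) // false
}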
I think it is better (faster) if a binary OR is used instead of an if condition inside the loop:
func isZero(bytes []byte) bool {
    b := byte(0)
    for _, s := range bytes {
        b |= s
    }
    return b == 0
}
One can optimize this even more by using the uint64 idea mentioned in the previous answers.
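For instance, here is a sketch of that combination (not from the answer itself): ORing 8-byte chunks read with encoding/binary, which avoids the unsafe package. The function name is just illustrative.

package main

import (
    "encoding/binary"
    "fmt"
)

func isZero64(data []byte) bool {
    var acc uint64
    // OR together whole 8-byte chunks.
    for ; len(data) >= 8; data = data[8:] {
        acc |= binary.LittleEndian.Uint64(data)
    }
    // OR the remaining tail bytes.
    var tail byte
    for _, b := range data {
        tail |= b
    }
    return acc == 0 && tail == 0
}

func main() {
    buf := make([]byte, 128)
    fmt.Println(isZero64(buf)) // true
    buf[100] = 1
    fmt.Println(isZero64(buf)) // false
}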

How to use time value of benchmark

I have written a benchmark for my chess engine in Go:
func BenchmarkStartpos(b *testing.B) {
    board := ParseFen(startpos)
    for i := 0; i < b.N; i++ {
        Perft(&board, 5)
    }
}
I see this output when it runs:
goos: darwin
goarch: amd64
BenchmarkStartpos-4 10 108737398 ns/op
PASS
ok _/Users/dylhunn/Documents/go-chess 1.215s
I want to use the time per execution (in this case, 108737398 ns/op) to compute another value, and also print it as a result of the benchmark. Specifically, I want to output nodes per second, which is given as the result of the Perft call divided by the time per call.
How can I access the time the benchmark took to execute, so I can print my own derived results?
You may use the testing.Benchmark() function to manually measure / benchmark "benchmark" functions (functions with the signature func(*testing.B)), and you get the result as a value of testing.BenchmarkResult, which is a struct with all the details you need:
type BenchmarkResult struct {
    N         int           // The number of iterations.
    T         time.Duration // The total time taken.
    Bytes     int64         // Bytes processed in one iteration.
    MemAllocs uint64        // The total number of memory allocations.
    MemBytes  uint64        // The total number of bytes allocated.
}
The time per execution is returned by the BenchmarkResult.NsPerOp() method, you can do whatever you want to with that.
See this simple example:
func main() {
    res := testing.Benchmark(BenchmarkSleep)
    fmt.Println(res)
    fmt.Println("Ns per op:", res.NsPerOp())
    fmt.Println("Time per op:", time.Duration(res.NsPerOp()))
}

func BenchmarkSleep(b *testing.B) {
    for i := 0; i < b.N; i++ {
        time.Sleep(time.Millisecond * 12)
    }
}
Output is (try it on the Go Playground):
100 12000000 ns/op
Ns per op: 12000000
Time per op: 12ms
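Applied to the question's chess engine, a possible sketch for deriving nodes per second: it assumes Perft returns the number of nodes visited (the snippet in the question discards its result, so this is only illustrative), and it reuses ParseFen, startpos and BenchmarkStartpos from there.

// In the chess engine's package (imports: "fmt", "testing", "time").
func printNodesPerSecond() {
    res := testing.Benchmark(BenchmarkStartpos)

    // Hypothetical: assume Perft returns the node count for the search.
    board := ParseFen(startpos)
    nodes := Perft(&board, 5)

    secPerOp := float64(res.NsPerOp()) / float64(time.Second)
    fmt.Printf("nodes/sec: %.0f\n", float64(nodes)/secPerOp)
}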

Different performances in Go slices resize

I'm spending some time experimenting with Go's internals and I ended up writing my own implementation of a stack using slices.
As correctly pointed out by a reddit user in this post, and as outlined by another user in this SO answer, Go already tries to optimise slice resizing.
It turns out, though, that I see performance gains using my own implementation of slice growing rather than sticking with the default one.
This is the structure I use for holding the stack:
type Stack struct {
    slice     []interface{}
    blockSize int
}

const s_DefaultAllocBlockSize = 20
This is my own implementation of the Push method
func (s *Stack) Push(elem interface{}) {
    if len(s.slice)+1 == cap(s.slice) {
        slice := make([]interface{}, 0, len(s.slice)+s.blockSize)
        copy(slice, s.slice)
        s.slice = slice
    }
    s.slice = append(s.slice, elem)
}
This is a plain implementation
func (s *Stack) Push(elem interface{}) {
    s.slice = append(s.slice, elem)
}
Running the benchmarks I've implemented using Go's testing package, my own implementation performs this way:
Benchmark_PushDefaultStack 20000000 87.7 ns/op 24 B/op 1 allocs/op
While relying on the plain append, the results are the following:
Benchmark_PushDefaultStack 10000000 209 ns/op 90 B/op 1 allocs/op
The machine I run tests on is an early 2011 MacBook Pro, 2.3 GHz Intel Core i5 with 8 GB of 1333 MHz DDR3 RAM.
EDIT
The actual question is: is my implementation really faster than the default append behavior? Or am I not taking something into account?
Reading your code, tests, benchmarks, and results it's easy to see that they are flawed. A full code review is beyond the scope of StackOverflow.
One specific bug.
// Push pushes a new element to the stack
func (s *Stack) Push(elem interface{}) {
    if len(s.slice)+1 == cap(s.slice) {
        slice := make([]interface{}, 0, len(s.slice)+s.blockSize)
        copy(slice, s.slice)
        s.slice = slice
    }
    s.slice = append(s.slice, elem)
}
Should be
// Push pushes a new element to the stack
func (s *Stack) Push(elem interface{}) {
    if len(s.slice)+1 == cap(s.slice) {
        slice := make([]interface{}, len(s.slice), len(s.slice)+s.blockSize)
        copy(slice, s.slice)
        s.slice = slice
    }
    s.slice = append(s.slice, elem)
}
From the Go specification on copying slices:
The function copy copies slice elements from a source src to a destination dst and returns the number of elements copied. The number of elements copied is the minimum of len(src) and len(dst).
You copied 0, you should have copied len(s.slice).
As expected, your Push algorithm is inordinately slow:
append:
Benchmark_PushDefaultStack-4 2000000 941 ns/op 49 B/op 1 allocs/op
alediaferia:
Benchmark_PushDefaultStack-4 100000 1246315 ns/op 42355 B/op 1 allocs/op
This is how append works: append complexity.
There are other things wrong too. Your benchmark results are often not valid.
I believe your example is faster because you have a fairly small data set and are allocating with an initial capacity of 0. In your version of append you preempt a large number of allocations by growing the block size more dramatically early (by 20), circumventing the (in this case) expensive reallocs that take you through all those trivially small capacities 0, 1, 2, 4, 8, 16, 32, 64, etc.
If your data sets were a lot larger, this would likely be marginalized by the cost of large copies. I've seen a lot of misuse of slices in Go. The clear performance win comes from making your slice with a reasonable default capacity.
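A minimal sketch of that last point: pre-sizing the slice when a rough element count is known avoids the repeated grow-and-copy work entirely (the count of 100000 is just illustrative).

package main

import "fmt"

func main() {
    const n = 100000

    // Giving append a capacity hint up front means it never has to reallocate.
    s := make([]int, 0, n)
    for i := 0; i < n; i++ {
        s = append(s, i)
    }
    fmt.Println(len(s), cap(s)) // 100000 100000
}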

Resources