I'm spending some time experimenting with Go's internals and I ended up writing my own implementation of a stack using slices.
As correctly pointed out by a reddit user in this post, and as outlined by another user in this SO answer, Go already tries to optimize slice resizing.
It turns out, though, that I get better performance using my own slice-growing implementation rather than sticking with the default behavior.
This is the structure I use for holding the stack:
type Stack struct {
    slice     []interface{}
    blockSize int
}

const s_DefaultAllocBlockSize = 20
This is my own implementation of the Push method
func (s *Stack) Push(elem interface{}) {
    if len(s.slice)+1 == cap(s.slice) {
        slice := make([]interface{}, 0, len(s.slice)+s.blockSize)
        copy(slice, s.slice)
        s.slice = slice
    }
    s.slice = append(s.slice, elem)
}
This is a plain implementation
func (s *Stack) Push(elem interface{}) {
    s.slice = append(s.slice, elem)
}
Running the benchmarks I've written with Go's testing package, my own implementation performs this way:
Benchmark_PushDefaultStack 20000000 87.7 ns/op 24 B/op 1 allocs/op
While relying on the plain append, the results are the following:
Benchmark_PushDefaultStack 10000000 209 ns/op 90 B/op 1 allocs/op
The machine I run the tests on is an early 2011 MacBook Pro, 2.3 GHz Intel Core i5 with 8 GB of 1333 MHz DDR3 RAM.
EDIT
The actual question is: is my implementation really faster than the default append behavior? Or am I not taking something into account?
Reading your code, tests, benchmarks, and results, it's easy to see that they are flawed. A full code review is beyond the scope of Stack Overflow.
One specific bug:
// Push pushes a new element to the stack
func (s *Stack) Push(elem interface{}) {
    if len(s.slice)+1 == cap(s.slice) {
        slice := make([]interface{}, 0, len(s.slice)+s.blockSize)
        copy(slice, s.slice)
        s.slice = slice
    }
    s.slice = append(s.slice, elem)
}
Should be
// Push pushes a new element to the stack
func (s *Stack) Push(elem interface{}) {
    if len(s.slice)+1 == cap(s.slice) {
        slice := make([]interface{}, len(s.slice), len(s.slice)+s.blockSize)
        copy(slice, s.slice)
        s.slice = slice
    }
    s.slice = append(s.slice, elem)
}
From the Go specification, on copying slices:
The function copy copies slice elements from a source src to a destination dst and returns the number of elements copied. The number of elements copied is the minimum of len(src) and len(dst).
You copied 0, you should have copied len(s.slice).
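To illustrate that rule with a standalone example (my own, not part of the original answer): a destination of length 0 receives nothing from copy, regardless of its capacity.

package main

import "fmt"

func main() {
    dst := make([]interface{}, 0, 10) // length 0, capacity 10
    src := []interface{}{1, 2, 3}
    n := copy(dst, src)
    fmt.Println(n, len(dst)) // prints "0 0": nothing is copied because len(dst) is 0
}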
As expected, your Push algorithm is inordinately slow:
append:
Benchmark_PushDefaultStack-4 2000000 941 ns/op 49 B/op 1 allocs/op
alediaferia:
Benchmark_PushDefaultStack-4 100000 1246315 ns/op 42355 B/op 1 allocs/op
This is how append works: append complexity.
There are other things wrong too. Your benchmark results are often not valid.
I believe your example is faster because you have a fairly small data set and are allocating with an initial capacity of 0. In your version of append you preempt a large number of allocations by growing the block size more aggressively early on (by 20), circumventing the (in this case) expensive reallocations that would take you through all those trivially small capacities: 0, 1, 2, 4, 8, 16, 32, 64, etc.
If your data sets were a lot larger, this advantage would likely be marginalized by the cost of the large copies. I've seen a lot of misuse of slices in Go. The clear performance win comes from making your slice with a reasonable default capacity.
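A minimal sketch of that last point (my own example, not from the answer; the capacity of 1024 is just an assumed expected size):

// Pre-allocating capacity up front avoids all of the early reallocations
// without reimplementing append's growth strategy.
stack := make([]interface{}, 0, 1024) // assumed expected number of elements
for i := 0; i < 1000; i++ {
    stack = append(stack, i) // no reallocation until the capacity is exceeded
}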
Related
I am trying to measure the evolution of the number of heap-allocated objects before and after I call a function. I am forcing runtime.GC() and using runtime.ReadMemStats to measure the number of heap objects I have before and after.
The problem I have is that I sometimes see unexpected heap growth. And it is different after each run.
A simple example is below, where I would always expect to see zero growth in heap objects.
https://go.dev/play/p/FBWfXQHClaG
package main

import (
    "log"
    "runtime"
)

var mem1_before, mem2_before, mem1_after, mem2_after runtime.MemStats

func measure_nothing(before, after *runtime.MemStats) {
    runtime.GC()
    runtime.ReadMemStats(before)
    runtime.GC()
    runtime.ReadMemStats(after)
}

func main() {
    measure_nothing(&mem1_before, &mem1_after)
    measure_nothing(&mem2_before, &mem2_after)
    log.Printf("HeapObjects diff = %d", int64(mem1_after.HeapObjects-mem1_before.HeapObjects))
    log.Printf("HeapAlloc diff %d", int64(mem1_after.HeapAlloc-mem1_before.HeapAlloc))
    log.Printf("HeapObjects diff = %d", int64(mem2_after.HeapObjects-mem2_before.HeapObjects))
    log.Printf("HeapAlloc diff %d", int64(mem2_after.HeapAlloc-mem2_before.HeapAlloc))
}
Sample output:
2009/11/10 23:00:00 HeapObjects diff = 0
2009/11/10 23:00:00 HeapAlloc diff 0
2009/11/10 23:00:00 HeapObjects diff = 4
2009/11/10 23:00:00 HeapAlloc diff 1864
Is what I'm trying to do impractical? I assume the runtime is doing things that allocate/free heap memory. Can I tell it to stop so I can take my measurements? (This is for a test checking for memory leaks, not production code.)
You can't predict what garbage collection and reading all the memory stats require in the background, so calling them to calculate memory allocations and usage is not a reliable way to measure.
Luckily for us, Go's testing framework can monitor and calculate memory usage.
So what you should do is write a benchmark function and let the testing framework do its job to report memory allocations and usage.
Let's assume we want to measure this foo() function:
var x []int64

func foo(allocs, size int) {
    for i := 0; i < allocs; i++ {
        x = make([]int64, size)
    }
}
All it does is allocate a slice of the given size, and it does this the given number of times (allocs).
Let's write benchmarking functions for different scenarios:
func BenchmarkFoo_0_0(b *testing.B) {
    for i := 0; i < b.N; i++ {
        foo(0, 0)
    }
}

func BenchmarkFoo_1_1(b *testing.B) {
    for i := 0; i < b.N; i++ {
        foo(1, 1)
    }
}

func BenchmarkFoo_2_2(b *testing.B) {
    for i := 0; i < b.N; i++ {
        foo(2, 2)
    }
}
Running the benchmark with go test -bench . -benchmem, the output is:
BenchmarkFoo_0_0-8 1000000000 0.3204 ns/op 0 B/op 0 allocs/op
BenchmarkFoo_1_1-8 67101626 16.58 ns/op 8 B/op 1 allocs/op
BenchmarkFoo_2_2-8 27375050 42.42 ns/op 32 B/op 2 allocs/op
As you can see, the number of allocations per function call is the same as what we pass in the allocs argument. The allocated memory is the expected allocs * size * 8 bytes.
Note that the reported allocations per op is an integer value (it's the result of an integer division), so if the benchmarked function only occasionally allocates, it might not be reported in the integer result. For details, see Output from benchmem.
Like in this example:
var x []int64

func bar() {
    if rand.Float64() < 0.3 {
        x = make([]int64, 10)
    }
}
This bar() function does 1 allocation with 30% probability (and none with 70% probability), which means on average it does 0.3 allocations. Benchmarking it:
func BenchmarkBar(b *testing.B) {
    for i := 0; i < b.N; i++ {
        bar()
    }
}
Output is:
BenchmarkBar-8 38514928 29.60 ns/op 24 B/op 0 allocs/op
We can see there is 24 bytes allocation (0.3 * 10 * 8 bytes), which is correct, but the reported allocations per op is 0.
Luckily for us, we can also benchmark a function from our main app using the testing.Benchmark() function. It returns a testing.BenchmarkResult including all details about memory usage. We have access to the total number of allocations and to the number of iterations, so we can calculate allocations per op using floating point numbers:
func main() {
    rand.Seed(time.Now().UnixNano())
    tr := testing.Benchmark(BenchmarkBar)
    fmt.Println("Allocs/op", tr.AllocsPerOp())
    fmt.Println("B/op", tr.AllocedBytesPerOp())
    fmt.Println("Precise allocs/op:", float64(tr.MemAllocs)/float64(tr.N))
}
This will output:
Allocs/op 0
B/op 24
Precise allocs/op: 0.3000516369276302
We can see the expected ~0.3 allocations per op.
Now if we go ahead and benchmark your measure_nothing() function:
func BenchmarkNothing(b *testing.B) {
    for i := 0; i < b.N; i++ {
        measure_nothing(&mem1_before, &mem1_after)
    }
}
We get this output:
Allocs/op 0
B/op 11
Precise allocs/op: 0.12182030338389732
As you can see, running the garbage collector twice and reading memory stats twice occasionally needs allocation (~1 out of 10 calls: 0.12 times on average).
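As a side note of my own (not part of the original answer): if the goal is simply to assert in a test that a function does not allocate, testing.AllocsPerRun is a convenient shortcut. Its documented caveat is that the returned average is always an integral value, so, like the allocs/op column, it reports 0 for a function that only allocates occasionally.

func TestMeasureNothingDoesNotAllocate(t *testing.T) {
    // AllocsPerRun calls the function once to warm up, then 100 more times,
    // and returns the average number of allocations per call.
    avg := testing.AllocsPerRun(100, func() {
        measure_nothing(&mem1_before, &mem1_after)
    })
    if avg > 0 {
        t.Errorf("expected no allocations, got %v per call", avg)
    }
}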
Pretty new to Go here, and I bumped into a problem when benchmarking with goroutines.
The code I have is here:
type store struct{}

func (n *store) WriteSpan(span interface{}) error {
    return nil
}

func smallTest(times int, b *testing.B) {
    writer := store{}
    var wg sync.WaitGroup
    numGoroutines := times
    wg.Add(numGoroutines)
    b.ResetTimer()
    b.ReportAllocs()
    for n := 0; n < numGoroutines; n++ {
        go func() {
            writer.WriteSpan(nil)
            wg.Done()
        }()
    }
    wg.Wait()
}

func BenchmarkTest1(b *testing.B) {
    smallTest(1000000, b)
}

func BenchmarkTest2(b *testing.B) {
    smallTest(10000000, b)
}
It looks to me like the runtime and allocations for both scenarios should be similar, but running them gives me the following results, which are vastly different. I wonder why this happens. Where do those extra allocations come from?
BenchmarkTest1-12 1000000000 0.26 ns/op 0 B/op 0 allocs/op
BenchmarkTest2-12 1 2868129398 ns/op 31872 B/op 83 allocs/op
PASS
I also notice that if I add an inner loop that calls WriteSpan multiple times, the runtime and allocations roughly scale with numGoroutines times the number of inner iterations. If this is not the way people benchmark with goroutines, are there other standard ways to test? Thanks in advance.
Meaningless microbenchmarks produce meaningless results.
If this is not the way how people benchmark with goroutines, are there
any other standard ways to test?
It's not the way to benchmark anything. Benchmark real problems.
You run a very large number of goroutines, which do nothing, until you saturate the scheduler, the machine, and other resources. That merely proves that if you run anything enough times you can bring a machine to its knees.
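For completeness, here is a hedged sketch (mine, not from the answer) of the conventional shape of such a benchmark: the amount of work is driven by b.N via b.RunParallel, so the testing framework, not the caller, decides how many iterations to run.

func BenchmarkWriteSpan(b *testing.B) {
    writer := store{}
    b.ReportAllocs()
    b.ResetTimer()
    // RunParallel splits the b.N iterations across a set of goroutines
    // (GOMAXPROCS of them by default).
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            writer.WriteSpan(nil)
        }
    })
}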
Is there a way to check if a byte slice is empty or 0 without checking each element or using reflect?
theByteVar := make([]byte, 128)
if "theByteVar is empty or zeroes" {
doSomething()
}
One solution I found, which seems weird, is to keep an empty byte slice around for comparison.
theByteVar := make([]byte, 128)
emptyByteVar := make([]byte, 128)

// fill with anything
theByteVar[1] = 2

if reflect.DeepEqual(theByteVar, emptyByteVar) == false {
    doSomething(theByteVar)
}
For sure there must be a better/quicker solution.
Thanks
UPDATE: I did some comparisons over 1000 loops and the reflect way is by far the worst...
Equal Loops: 1000 in true in 19.197µs
Contains Loops: 1000 in true in 34.507µs
AllZero Loops: 1000 in true in 117.275µs
Reflect Loops: 1000 in true in 14.616277ms
Comparing it with another slice containing only zeros requires reading (and comparing) two slices.
Using a single for loop will be more efficient here:
for _, v := range theByteVar {
    if v != 0 {
        doSomething(theByteVar)
        break
    }
}
If you do need to use it in multiple places, wrap it in a utility function:
func allZero(s []byte) bool {
    for _, v := range s {
        if v != 0 {
            return false
        }
    }
    return true
}
And then using it:
if !allZero(theByteVar) {
    doSomething(theByteVar)
}
Another solution borrows an idea from C. It can be implemented using the unsafe package in Go.
The idea is simple: instead of checking each byte of the []byte, we check the uint64 value of data[i:i+8] at each step. That way we check 8 bytes per iteration instead of just one.
The code below is not best practice; it only shows the idea.
const (
    len8 int = 0xFFFFFFF8
)

func IsAllBytesZero(data []byte) bool {
    n := len(data)

    // Magic to get the largest length that can be divided by 8.
    nlen8 := n & len8
    i := 0

    // i advances in byte units (by 8), so it is used directly as the offset.
    for ; i < nlen8; i += 8 {
        b := *(*uint64)(unsafe.Pointer(uintptr(unsafe.Pointer(&data[0])) + uintptr(i)))
        if b != 0 {
            return false
        }
    }

    // Check the remaining tail bytes one by one.
    for ; i < n; i++ {
        if data[i] != 0 {
            return false
        }
    }

    return true
}
Benchmark
Test cases:
Only the worst case is tested (all elements are zero).
Methods:
IsAllBytesZero: unsafe package solution
NaiveCheckAllBytesAreZero: a loop to iterate the whole byte array and check it.
CompareAllBytesWithFixedEmptyArray: using bytes.Compare solution with pre-allocated fixed size empty byte array.
CompareAllBytesWithDynamicEmptyArray: using bytes.Compare solution without pre-allocated fixed size empty byte array.
Results
BenchmarkIsAllBytesZero10-8 254072224 4.68 ns/op
BenchmarkIsAllBytesZero100-8 132266841 9.09 ns/op
BenchmarkIsAllBytesZero1000-8 19989015 55.6 ns/op
BenchmarkIsAllBytesZero10000-8 2344436 507 ns/op
BenchmarkIsAllBytesZero100000-8 1727826 679 ns/op
BenchmarkNaiveCheckAllBytesAreZero10-8 234153582 5.15 ns/op
BenchmarkNaiveCheckAllBytesAreZero100-8 30038720 38.2 ns/op
BenchmarkNaiveCheckAllBytesAreZero1000-8 4300405 291 ns/op
BenchmarkNaiveCheckAllBytesAreZero10000-8 407547 2666 ns/op
BenchmarkNaiveCheckAllBytesAreZero100000-8 43382 27265 ns/op
BenchmarkCompareAllBytesWithFixedEmptyArray10-8 415171356 2.71 ns/op
BenchmarkCompareAllBytesWithFixedEmptyArray100-8 218871330 5.51 ns/op
BenchmarkCompareAllBytesWithFixedEmptyArray1000-8 56569351 21.0 ns/op
BenchmarkCompareAllBytesWithFixedEmptyArray10000-8 6592575 177 ns/op
BenchmarkCompareAllBytesWithFixedEmptyArray100000-8 567784 2104 ns/op
BenchmarkCompareAllBytesWithDynamicEmptyArray10-8 64215448 19.8 ns/op
BenchmarkCompareAllBytesWithDynamicEmptyArray100-8 32875428 35.4 ns/op
BenchmarkCompareAllBytesWithDynamicEmptyArray1000-8 8580890 140 ns/op
BenchmarkCompareAllBytesWithDynamicEmptyArray10000-8 1277070 938 ns/op
BenchmarkCompareAllBytesWithDynamicEmptyArray100000-8 121256 10355 ns/op
Summary
Assuming we're talking about checking byte arrays that are mostly (or entirely) zero: according to the benchmark, if performance is an issue, the naive check is a bad idea. And if you don't want to use the unsafe package in your project, consider the bytes.Compare solution with a pre-allocated empty array as an alternative.
An interesting point worth noting is that the performance of the unsafe-package approach varies a lot, but it basically outperforms all the other solutions mentioned above for large inputs. I think this is related to the CPU cache mechanism.
You can possibly use bytes.Equal or bytes.Contains to compare with a zero-initialized byte slice; see https://play.golang.org/p/mvUXaTwKjP. I haven't checked the performance, but hopefully it's been optimized. You might want to try out the other solutions and compare the performance numbers, if needed.
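A minimal sketch of that suggestion (my own example, not the linked playground code): the comparison slice must have the same length as the slice being checked.

package main

import (
    "bytes"
    "fmt"
)

func main() {
    theByteVar := make([]byte, 128)
    zeros := make([]byte, len(theByteVar)) // all-zero slice of the same length

    fmt.Println(bytes.Equal(theByteVar, zeros)) // true: every byte is zero

    theByteVar[1] = 2
    fmt.Println(bytes.Equal(theByteVar, zeros)) // false: a non-zero byte exists
}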
I think it is better (faster) if a bitwise OR is used instead of an if condition inside the loop:
func isZero(bytes []byte) bool {
    b := byte(0)
    for _, s := range bytes {
        b |= s
    }
    return b == 0
}
One can optimize this even more by combining it with the uint64 idea mentioned in the previous answers; a sketch follows.
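A hedged sketch of that combination (my own, not from the answer), reading 8 bytes at a time with encoding/binary instead of unsafe:

import "encoding/binary"

// isZero64 ORs the slice together in 8-byte chunks, then handles the tail.
// Unlike the early-exit versions above, it always reads the whole slice;
// adding an `if acc != 0 { return false }` inside the loop restores the
// short-circuit behaviour.
func isZero64(data []byte) bool {
    var acc uint64
    for len(data) >= 8 {
        acc |= binary.LittleEndian.Uint64(data[:8])
        data = data[8:]
    }
    var tail byte
    for _, v := range data {
        tail |= v
    }
    return acc == 0 && tail == 0
}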
Hi, while doing some exercises I came across this question...
Let's say you have a map with a capacity of 100,000.
Which value type is the most efficient for filling the whole map in the least amount of time?
I've run some benchmarks on my own, trying out most of the types I could think of, and the resulting top list is:
Benchmark_Struct-8 200 6010422 ns/op (struct{}{})
Benchmark_Byte-8 200 6167230 ns/op (byte = 0)
Benchmark_Int-8 200 6112927 ns/op (int8 = 0)
Benchmark_Bool-8 200 6117155 ns/op (bool = false)
Example function:
func Struct() {
    m := make(map[int]struct{}, 100000)
    for i := 0; i < 100000; i++ {
        m[i] = struct{}{}
    }
}
As you can see, the fastest one (most of the time) is struct{}{}, the empty struct.
But why is this the case in Go?
Is there a faster/lighter nil or non-nil value?
- Thank you for your time :)
Theoretically, struct{}{} should be the most efficient because it requires no memory. In practice, a) results may vary between Go versions, operating systems, and system architectures; and b) I can't think of any case where maximizing the execution-time efficiency of empty values is relevant.
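A small illustration of the "requires no memory" point (my own example, not part of the answer): the empty struct has size zero, which is also why map[int]struct{} is the idiomatic set type in Go.

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    fmt.Println(unsafe.Sizeof(struct{}{})) // 0
    fmt.Println(unsafe.Sizeof(byte(0)))    // 1
    fmt.Println(unsafe.Sizeof(false))      // 1
    fmt.Println(unsafe.Sizeof(int8(0)))    // 1
}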
What I found:
I printed the time cost of Go's copy, and it shows that the first memory copy is slow, while the second one is much faster, even when I run copy on a different memory address.
Here is my test code:
func TestCopyLoop1x32M(t *testing.T) {
    copyLoopSameDst(32*1024*1024, 1)
}

func TestCopyLoopOnex32M(t *testing.T) {
    copyLoopSameDst(32*1024*1024, 1)
}

func copyLoopSameDst(size, loops int) {
    in := make([]byte, size)
    out := make([]byte, size)
    rand.Seed(0)
    fillRandom(in) // insert random bytes into the slice
    now := time.Now()
    for i := 0; i < loops; i++ {
        copy(out, in)
    }
    cost := time.Since(now)
    fmt.Println(cost.Seconds() / float64(loops))
}
func TestCopyDiffLoop1x32M(t *testing.T) {
    copyLoopDiffDst(32*1024*1024, 1)
}

func copyLoopDiffDst(size, loops int) {
    ins := make([][]byte, loops)
    outs := make([][]byte, loops)
    for i := 0; i < loops; i++ {
        out := make([]byte, size)
        outs[i] = out
        in := make([]byte, size)
        rand.Seed(0)
        fillRandom(in)
        ins[i] = in
    }
    now := time.Now()
    for i := 0; i < loops; i++ {
        copy(outs[i], ins[i])
    }
    cost := time.Since(now)
    fmt.Println(cost.Seconds() / float64(loops))
}
The results (on an i5-4278U):
Running all three cases:
TestCopyLoop1x32M : 0.023s
TestCopyLoopOnex32M : 0.0038s
TestCopyDiffLoop1x32M : 0.0038s
Running the first and second cases:
TestCopyLoop1x32M : 0.023s
TestCopyLoopOnex32M : 0.0038s
Running the first and third cases:
TestCopyLoop1x32M : 0.023s
TestCopyDiffLoop1x32M : 0.023s
My questions:
1. They have different memory addresses and different data, so how could the next case benefit from the first one?
2. Why is Result 3 not the same as Result 2? Don't they do the same thing?
3. If I add more loops in copyLoopSameDst, I know the later iterations will be faster because of the cache, but my CPU's L3 cache is only 3 MB, so I can't explain the huge improvement.
4. Why does copyLoopDiffDst speed up when it runs after the other two cases?
My guesses:
- The instruction cache helps to improve performance, but that can't explain question 2.
- The CPU cache works beyond my imagination, but that can't explain question 2 either.
After more research and testing, I think I can answer part of my questions.
The reason the cache helps in the next test case is Go's memory allocation (maybe other languages do the same thing, because malloc'ing memory is a system call).
When the data is big, the kernel will reuse the block that was just freed.
I printed the addresses of the in and out []byte slices (in Go, the first 8 bytes of a slice header are its data pointer, so I wrote a bit of assembly to get the address):
addr: [0 192 8 32 196 0 0 0] [0 192 8 34 196 0 0 0]
cost: 0.019228028
addr: [0 192 8 36 196 0 0 0] [0 192 8 32 196 0 0 0]
cost: 0.003770281
addr: [0 192 8 34 196 0 0 0] [0 192 8 32 196 0 0 0]
cost: 0.003806502
You can see the program reusing some memory addresses, so write hits happen in the next copy action.
If I create in/out outside the function, the reuse does not happen, and it slows down.
But if you make the block very small (for example, under 32 KB), you will find the speed-up again even though the kernel gives you a new memory address. In my opinion, the main reason is that the data is not aligned to 64 bytes, so the next loop's data (which is located near the first one) gets pulled into the cache, while the first loop wastes much time filling the cache. Meanwhile, the next loop can reuse the instruction cache and the other cached data needed to run the function. When the data is small, these little caches have a big influence.
I'm still amazed that the cache can help this much when the data size is 10x my CPU's cache size. Anyway, that's another question.
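As a final side note of my own (not from the post): you don't need assembly to see a slice's data pointer; fmt's %p verb prints the address of a slice's first element.

in := make([]byte, 32*1024*1024)
out := make([]byte, 32*1024*1024)
fmt.Printf("in=%p out=%p\n", in, out) // %p on a slice prints the address of its backing array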