Why is the first memory copy slow? - performance

What I found:
I printed the time cost of Go's copy, and the first memory copy is slow, while later copies are much faster even when copy runs on different memory addresses.
Here is my test code:
package main

import (
    "fmt"
    "math/rand"
    "testing"
    "time"
)

// fillRandom fills the slice with random bytes.
func fillRandom(p []byte) { rand.Read(p) }

func TestCopyLoop1x32M(t *testing.T) {
    copyLoopSameDst(32*1024*1024, 1)
}

func TestCopyLoopOnex32M(t *testing.T) {
    copyLoopSameDst(32*1024*1024, 1)
}

func copyLoopSameDst(size, loops int) {
    in := make([]byte, size)
    out := make([]byte, size)
    rand.Seed(0)
    fillRandom(in) // fill the slice with random bytes
    now := time.Now()
    for i := 0; i < loops; i++ {
        copy(out, in)
    }
    cost := time.Since(now)
    fmt.Println(cost.Seconds() / float64(loops))
}
func TestCopyDiffLoop1x32M(t *testing.T) {
    copyLoopDiffDst(32*1024*1024, 1)
}

func copyLoopDiffDst(size, loops int) {
    ins := make([][]byte, loops)
    outs := make([][]byte, loops)
    for i := 0; i < loops; i++ {
        out := make([]byte, size)
        outs[i] = out
        in := make([]byte, size)
        rand.Seed(0)
        fillRandom(in)
        ins[i] = in
    }
    now := time.Now()
    for i := 0; i < loops; i++ {
        copy(outs[i], ins[i])
    }
    cost := time.Since(now)
    fmt.Println(cost.Seconds() / float64(loops))
}
The results (on an i5-4278U):
Running all three cases:
TestCopyLoop1x32M : 0.023s
TestCopyLoopOnex32M : 0.0038s
TestCopyDiffLoop1x32M : 0.0038s
Running the first and second cases:
TestCopyLoop1x32M : 0.023s
TestCopyLoopOnex32M : 0.0038s
Running the first and third cases:
TestCopyLoop1x32M : 0.023s
TestCopyDiffLoop1x32M : 0.023s
My questions:
1. The test cases use different memory addresses and different data; how can a later case benefit from the first one?
2. Why is result 3 not the same as result 2? Don't they do the same thing?
3. If I add more loops in copyLoopSameDst, I know subsequent iterations will be faster because of the cache, but my CPU's L3 cache is only 3 MB, so I can't explain the huge improvement.
4. Why does copyLoopDiffDst speed up after the first two cases?
My guesses:
The instruction cache helps improve performance, but it can't explain question 2.
The CPU cache works beyond my imagination, but it can't explain question 2 either.

After more research and testing, I think I can answer some of my questions.
The reason the cache helps in the next test case is Go's memory allocation (other languages probably behave the same, since requesting memory ultimately goes through the kernel).
When the data is large, the allocator reuses blocks that have just been freed.
I printed the addresses of the in and out []byte slices (in Go, the first 8 bytes of a slice header hold the backing array's address, so I wrote a bit of assembly to read it):
addr: [0 192 8 32 196 0 0 0] [0 192 8 34 196 0 0 0]
cost: 0.019228028
addr: [0 192 8 36 196 0 0 0] [0 192 8 32 196 0 0 0]
cost: 0.003770281
addr: [0 192 8 34 196 0 0 0] [0 192 8 32 196 0 0 0]
cost: 0.003806502
You can see the program reusing some addresses, so cache write hits happen on the next copy.
If I create in and out outside the function, the reuse does not happen and the copy slows down.
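As a side note, the backing array's address can also be read without assembly; a minimal sketch (not the code that produced the output above):

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    in := make([]byte, 32*1024*1024)
    out := make([]byte, 32*1024*1024)
    // %p on a slice prints the address of its first element (the backing array);
    // unsafe.Pointer(&s[0]) gives the same value.
    fmt.Printf("in : %p %v\n", in, unsafe.Pointer(&in[0]))
    fmt.Printf("out: %p %v\n", out, unsafe.Pointer(&out[0]))
}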
But if you make the blocks very small (for example, under 32 KB), you will see the speedup again even though the kernel gives you a new memory address. In my opinion the main reason is that the data is not aligned to 64 bytes, so the next loop's data (located near the first loop's) is already pulled into the cache, while the first loop spent much of its time filling the cache. The next loop also benefits from the warm instruction cache and the other cached data used to run the function. When the data is small, these small caches make a big difference.
I am still amazed that the cache helps so much when the data is 10x the size of my CPU cache, but that is another question.

Related

Workaround Go's map pointer issue

I was solving this Project Euler question. First I tried brute force, which took 0.5 seconds, and then I tried dynamic programming with memoization, expecting a huge improvement, but I was surprised that the result was 0.36 seconds.
After a bit of googling I found out that you cannot use a pointer in a function (find_collatz_len) to an outside map (memo). So each time the function below runs, it copies the entire dictionary, which sounds like a huge waste of processor power.
My question is: what is a workaround so that I can use a pointer to a map outside the function and avoid the copying?
Here is my ugly code:
package main

//project euler 014 - longest collatz sequence

import (
    "fmt"
    "time"
)

func find_collatz_len(n int, memo map[int]int) int {
    counter := 1
    initital_value := n
    for n != 1 {
        counter++
        if n < initital_value {
            counter = counter + memo[n]
            break
        }
        if n%2 == 0 {
            n = int(float64(n) / 2)
        } else {
            n = n*3 + 1
        }
    }
    memo[initital_value] = counter
    return counter
}

func main() {
    start := time.Now()
    max_length := 0
    number := 0
    current_length := 0
    memo := make(map[int]int)
    for i := 1; i < 1_000_000; i++ {
        current_length = find_collatz_len(i, memo)
        if current_length > max_length {
            max_length = current_length
            number = i
        }
    }
    fmt.Println(max_length, number)
    fmt.Println("Time:", time.Since(start).Seconds())
}
Maps are already pointers under the hood. Passing a map value will pass a single pointer. For details, see why slice values can sometimes go stale but never map values?
When you create a map without a capacity hint, it is allocated with enough internal structure to store a relatively small number of entries (around 7). As the map grows, the implementation sometimes needs to allocate more memory and restructure (rehash) the map to accommodate more elements. This can be avoided if you initialize the map with the expected final capacity, as suggested by @mkopriva:
memo := make(map[int]int, 1_000_000)
As a result, enough room is allocated up front to store all entries (1_000_000 in your example), so no rehashing happens during the lifetime of your app. This reduces the runtime from 0.3 sec to 0.2 sec.
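A minimal sketch (my own illustration, not the question's code) of the difference the capacity hint makes; exact timings will vary by machine:

package main

import (
    "fmt"
    "time"
)

// fill inserts n entries; a map value is effectively a pointer,
// so the caller's map is modified and nothing is copied.
func fill(m map[int]int, n int) {
    for i := 0; i < n; i++ {
        m[i] = i
    }
}

func main() {
    const n = 1_000_000

    start := time.Now()
    fill(make(map[int]int), n) // grows and rehashes several times on the way up
    fmt.Println("no hint:  ", time.Since(start))

    start = time.Now()
    fill(make(map[int]int, n), n) // allocated once with room for all n entries
    fmt.Println("with hint:", time.Since(start))
}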
You can also replace int(float64(n)/2) with n/2; in the integer range you use, they give the same result. This gives a further ~5% boost (0.19 sec on my machine).

How to use channels efficiently [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
I read in Uber's Go style guide that one should use a channel size of at most 1.
Although it's clear to me that a channel size of 100 or 1000 is very bad practice, I was wondering why a channel size of 10 isn't considered a valid option. I'm missing some piece needed to reach the right conclusion.
Below, you can follow my arguments (and counter-arguments), backed by some benchmark tests.
I understand that if both goroutines, the one writing to and the one reading from the channel, are interrupted between sequential writes or reads by some other IO action, no gain is to be expected from a larger channel buffer, and I agree that 1 is the best option in that case.
But let's say that no significant goroutine switching is needed apart from the implicit locking and unlocking caused by writing to and reading from the channel. Then I would conclude the following:
Consider the number of context switches when processing 100 values on a channel with a buffer of size 1 versus size 10 (GR = goroutine):
Buffer=1:  (GR1 inserts 1 value, GR2 reads 1 value) x 100 ~ 200 goroutine switches
Buffer=10: (GR1 inserts 10 values, GR2 reads 10 values) x 10 ~ 20 goroutine switches
I did some benchmarking to prove that this actually goes faster:
package main

import (
    "testing"
)

type a struct {
    b [100]int64
}

func BenchmarkBuffer1(b *testing.B) {
    count := 0
    c := make(chan a, 1)
    go func() {
        for i := 0; i < b.N; i++ {
            c <- a{}
        }
        close(c)
    }()
    for v := range c {
        for i := range v.b {
            count += i
        }
    }
}

func BenchmarkBuffer10(b *testing.B) {
    count := 0
    c := make(chan a, 10)
    go func() {
        for i := 0; i < b.N; i++ {
            c <- a{}
        }
        close(c)
    }()
    for v := range c {
        for i := range v.b {
            count += i
        }
    }
}
Results when comparing simple reading & writing + non-blocking processing:
BenchmarkBuffer1-12 5072902 266 ns/op
BenchmarkBuffer10-12 6029602 179 ns/op
PASS
BenchmarkBuffer1-12 5228782 256 ns/op
BenchmarkBuffer10-12 5392410 216 ns/op
PASS
BenchmarkBuffer1-12 4806208 287 ns/op
BenchmarkBuffer10-12 4637842 233 ns/op
PASS
However, if I add a sleep every 10 reads, it doesn't yield any better results.
import (
    "testing"
    "time"
)

func BenchmarkBuffer1WithSleep(b *testing.B) {
    count := 0
    c := make(chan int, 1)
    go func() {
        for i := 0; i < b.N; i++ {
            c <- i
        }
        close(c)
    }()
    for a := range c {
        count++
        if count%10 == 0 {
            time.Sleep(time.Duration(a) * time.Nanosecond)
        }
    }
}

func BenchmarkBuffer10WithSleep(b *testing.B) {
    count := 0
    c := make(chan int, 10)
    go func() {
        for i := 0; i < b.N; i++ {
            c <- i
        }
        close(c)
    }()
    for a := range c {
        count++
        if count%10 == 0 {
            time.Sleep(time.Duration(a) * time.Nanosecond)
        }
    }
}
Results when adding a sleep every 10 reads:
BenchmarkBuffer1WithSleep-12 856886 53219 ns/op
BenchmarkBuffer10WithSleep-12 929113 56939 ns/op
FYI: I also did the test again with only one CPU and got the following results:
BenchmarkBuffer1 5831193 207 ns/op
BenchmarkBuffer10 6226983 180 ns/op
BenchmarkBuffer1WithSleep 556635 35510 ns/op
BenchmarkBuffer10WithSleep 984472 61434 ns/op
Absolutely nothing is wrong with a channel of cap 500 if, for example, the channel is used as a semaphore.
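For illustration, a minimal sketch of that semaphore pattern (my own example; the cap of 500 is just the number from above and would be whatever concurrency limit you need):

package main

import (
    "fmt"
    "sync"
)

func main() {
    const limit = 500 // the channel capacity *is* the concurrency limit

    sem := make(chan struct{}, limit)
    var wg sync.WaitGroup

    for i := 0; i < 10000; i++ {
        wg.Add(1)
        sem <- struct{}{} // acquire a slot; blocks once `limit` goroutines are running
        go func(n int) {
            defer wg.Done()
            defer func() { <-sem }() // release the slot
            _ = n * n                // stand-in for real work
        }(i)
    }
    wg.Wait()
    fmt.Println("done")
}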
The style guide you read recommends not using buffered channels of, say, cap 64 just "because it looks like a nice number". But this recommendation is not about performance! (By the way, your microbenchmarks are useless; they do not measure anything relevant.)
An unbuffered channel is a synchronisation primitive and, as such, very useful.
A buffered channel, well, may buffer between sender and receiver, and this buffering can be problematic for observing, tuning and debugging the code (because creation and consumption are further decoupled). That's why the style guide recommends unbuffered channels (or at most a cap of 1, as this is sometimes needed for correctness!).
It also doesn't prohibit larger buffer caps:
Any other [than 0 or 1] size must be subject to a high level of scrutiny. Consider how the size is determined, what prevents the channel from filling up under load and blocking writers, and what happens when this occurs. [emph. mine]
You may use a cap of 27 if you can explain why 27 (and not 22 or 31) and how this will influence program behaviour (not only performance!) if the buffer is filled.
Most people overrate performance. Correctness, operational stability and maintainability come first. And this is what this style guide is about here.

Why does Go use more CPUs but does not reduce computing duration?

I wrote a simple Go program to add numbers in many goroutines.
When I increase the number of goroutines, the program uses more CPUs, and I expected the computing duration to be shorter. That holds for 1, 2 or 4 goroutines, but with 8 goroutines the duration is the same as with 4 (I ran the test on an i5-8265U, an 8-CPU processor).
Can you explain this to me?
The code:
package main

import (
    "fmt"
    "time"
)

// Sum returns n by calculating 1+1+1+...
func Sum(n int64) int64 {
    ret := int64(0)
    for i := int64(0); i < n; i++ {
        ret += 1
    }
    return ret
}

func main() {
    n := int64(30000000000) // 30e9
    sum := int64(0)
    beginTime := time.Now()
    nWorkers := 4
    sumChan := make(chan int64, nWorkers)
    for i := 0; i < nWorkers; i++ {
        go func() { sumChan <- Sum(n / int64(nWorkers)) }()
    }
    for i := 0; i < nWorkers; i++ {
        sum += <-sumChan
    }
    fmt.Println("dur:", time.Since(beginTime))
    fmt.Println("sum:", sum)
    // Results on Intel Core i5-8265U (nWorkers, dur):
    // (1, 8s), (2, 4s), (4, 2s), (8, 2s). Why do 8 workers still need 2s?
}
"I ran the test on an i5-8265U, an 8-CPU processor"
The i5-8265U is not an 8-core CPU; it is a 4-core, 8-thread CPU: it has 4 physical cores, and each core can run 2 threads concurrently via hyper-threading.
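As a side note, Go reports logical CPUs (hardware threads), not physical cores, so it sees 8 of them on this machine. A minimal check (my own addition):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // NumCPU returns the number of logical CPUs usable by the current process,
    // so an i5-8265U reports 8 even though it has only 4 physical cores.
    fmt.Println("logical CPUs:", runtime.NumCPU())
    fmt.Println("GOMAXPROCS:  ", runtime.GOMAXPROCS(0)) // 0 only queries the current value
}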
The "performance advantage" of HT depends on the workloads, and the ability to "mix in" operations from one thread with the computations of another. This means if your CPU is highly loaded, the hyper-threads may not be able to get more than a few % of the runtime, and thus not contribute much to the total performances.
Furthermore, the 8265U has a nominal frequency of 1.6GHz and a maximum turbo of 3.9 (3.7 on 4 cores). It's also possible that fully loading the CPU including hyperthreads would further lower the "turbo ceiling". You'd have to check the cpuinfo state during the run to see.

Which is the most efficient nil value?

Hi, while doing some exercises I came across this question...
Let's say you have a map with a capacity of 100,000.
Which value is the most efficient for filling the whole map in the least amount of time?
I ran some benchmarks of my own, trying out most of the types I could think of, and the resulting top list is:
Benchmark_Struct-8 200 6010422 ns/op (struct{}{})
Benchmark_Byte-8 200 6167230 ns/op (byte = 0)
Benchmark_Int-8 200 6112927 ns/op (int8 = 0)
Benchmark_Bool-8 200 6117155 ns/op (bool = false)
Example function:
func Struct() {
    m := make(map[int]struct{}, 100000)
    for i := 0; i < 100000; i++ {
        m[i] = struct{}{}
    }
}
As you can see, the fastest one (most of the time) is struct{}{}, the empty struct.
But why is this the case in go?
Is there a faster/lighter nil or non-nil value?
- Thank you for your time :)
Theoretically, struct{}{} should be the most efficient because it requires no memory. In practice, a) results may vary between Go versions, operating systems, and system architectures; and b) I can't think of any case where maximizing the execution-time efficiency of empty values is relevant.
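For what it's worth, the place struct{}{} usually pays off is the set idiom, where only key presence matters; a minimal sketch (my own illustration, not from the benchmark above):

package main

import "fmt"

func main() {
    // A set of strings: the empty struct carries no data and occupies no space,
    // so only the keys matter.
    seen := make(map[string]struct{}, 3)
    for _, s := range []string{"a", "b", "a"} {
        seen[s] = struct{}{}
    }
    if _, ok := seen["a"]; ok {
        fmt.Println("a is in the set")
    }
    fmt.Println("unique elements:", len(seen)) // prints 2
}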

Different performances in Go slices resize

I'm spending some time experimenting with Go's internals, and I ended up writing my own implementation of a stack using slices.
As correctly pointed out by a reddit user in this post, and as outlined by another user in this SO answer, Go already tries to optimise slice resizing.
It turns out, though, that I see performance gains using my own implementation of slice growth rather than sticking with the default one.
This is the structure I use for holding the stack:
type Stack struct {
    slice     []interface{}
    blockSize int
}

const s_DefaultAllocBlockSize = 20
This is my own implementation of the Push method
func (s *Stack) Push(elem interface{}) {
    if len(s.slice)+1 == cap(s.slice) {
        slice := make([]interface{}, 0, len(s.slice)+s.blockSize)
        copy(slice, s.slice)
        s.slice = slice
    }
    s.slice = append(s.slice, elem)
}
This is a plain implementation
func (s *Stack) Push(elem interface{}) {
    s.slice = append(s.slice, elem)
}
Running the benchmarks I've implemented using Go's testing package my own implementation performs this way:
Benchmark_PushDefaultStack 20000000 87.7 ns/op 24 B/op 1 allocs/op
While relying on the plain append the results are the following
Benchmark_PushDefaultStack 10000000 209 ns/op 90 B/op 1 allocs/op
The machine I ran the tests on is an early-2011 MacBook Pro, 2.3 GHz Intel Core i5, with 8 GB of 1333 MHz DDR3 RAM.
EDIT
The actual question is: is my implementation really faster than the default append behavior? Or am I not taking something into account?
Reading your code, tests, benchmarks, and results it's easy to see that they are flawed. A full code review is beyond the scope of StackOverflow.
One specific bug.
// Push pushes a new element to the stack
func (s *Stack) Push(elem interface{}) {
    if len(s.slice)+1 == cap(s.slice) {
        slice := make([]interface{}, 0, len(s.slice)+s.blockSize)
        copy(slice, s.slice)
        s.slice = slice
    }
    s.slice = append(s.slice, elem)
}
Should be
// Push pushes a new element to the stack
func (s *Stack) Push(elem interface{}) {
    if len(s.slice)+1 == cap(s.slice) {
        slice := make([]interface{}, len(s.slice), len(s.slice)+s.blockSize)
        copy(slice, s.slice)
        s.slice = slice
    }
    s.slice = append(s.slice, elem)
}
Appending to and copying slices (Go Programming Language Specification):
"The function copy copies slice elements from a source src to a destination dst and returns the number of elements copied. The number of elements copied is the minimum of len(src) and len(dst)."
You copied 0 elements; you should have copied len(s.slice).
As expected, your Push algorithm is inordinately slow:
append:
Benchmark_PushDefaultStack-4 2000000 941 ns/op 49 B/op 1 allocs/op
alediaferia:
Benchmark_PushDefaultStack-4 100000 1246315 ns/op 42355 B/op 1 allocs/op
This is how append works: append complexity.
There are other things wrong too. Your benchmark results are often not valid.
I believe your example is faster because you have a fairly small data set and are allocating with an initial capacity of 0. In your version of append you preempt a large number of allocations by growing the block size more aggressively early on (by 20), circumventing the (in this case) expensive reallocations that take you through all those trivially small capacities 0, 1, 2, 4, 8, 16, 32, 64, etc.
If your data sets were much larger, this would likely be marginalized by the cost of large copies. I've seen a lot of misuse of slices in Go. The clear performance win comes from creating your slice with a reasonable default capacity.
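A minimal sketch of that last point (my own illustration; the capacity of 1024 is an arbitrary example):

package main

import "fmt"

func main() {
    // Give the slice a sensible starting capacity and let append handle growth:
    // no reallocation happens until the 1024-element capacity is exceeded.
    s := make([]interface{}, 0, 1024)
    for i := 0; i < 1000; i++ {
        s = append(s, i)
    }
    fmt.Println(len(s), cap(s)) // 1000 1024
}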

Resources