Benchmarking bad results - Go

So I have the quicksort algorithm implemented with concurrency (and a version without it as well). Now I want to compare the times. I wrote this:
func benchmarkConcurrentQuickSort(size int, b *testing.B) {
    A := RandomArray(size)
    var wg sync.WaitGroup
    b.ResetTimer()
    ConcurrentQuicksort(A, 0, len(A)-1, &wg)
    wg.Wait()
}
func BenchmarkConcurrentQuickSort500(b *testing.B) {
    benchmarkConcurrentQuickSort(500, b)
}

func BenchmarkConcurrentQuickSort1000(b *testing.B) {
    benchmarkConcurrentQuickSort(1000, b)
}

func BenchmarkConcurrentQuickSort5000(b *testing.B) {
    benchmarkConcurrentQuickSort(5000, b)
}

func BenchmarkConcurrentQuickSort10000(b *testing.B) {
    benchmarkConcurrentQuickSort(10000, b)
}

func BenchmarkConcurrentQuickSort20000(b *testing.B) {
    benchmarkConcurrentQuickSort(20000, b)
}

func BenchmarkConcurrentQuickSort1000000(b *testing.B) {
    benchmarkConcurrentQuickSort(1000000, b)
}
The results are like this:
C:\projects\go\src\github.com\frynio\mysort>go test -bench=.
BenchmarkConcurrentQuickSort500-4 2000000000 0.00 ns/op
BenchmarkConcurrentQuickSort1000-4 2000000000 0.00 ns/op
BenchmarkConcurrentQuickSort5000-4 2000000000 0.00 ns/op
BenchmarkConcurrentQuickSort10000-4 2000000000 0.00 ns/op
BenchmarkConcurrentQuickSort20000-4 2000000000 0.00 ns/op
BenchmarkConcurrentQuickSort1000000-4 30 49635266 ns/op
PASS
ok github.com/frynio/mysort 8.342s
I can believe the last one, but I definitely think that sorting a 500-element array takes longer than 1 ns. What am I doing wrong? I am pretty sure that RandomArray returns an array of the wanted size, as we can see in the last benchmark. Why does it print 0.00 ns?
func RandomArray(n int) []int {
    a := []int{}
    for i := 0; i < n; i++ {
        a = append(a, rand.Intn(500))
    }
    return a
}
// ConcurrentPartition - ConcurrentQuicksort function for partitioning the array (randomized choice of a pivot)
func ConcurrentPartition(A []int, p int, r int) int {
    index := rand.Intn(r-p) + p
    pivot := A[index]
    A[index] = A[r]
    A[r] = pivot
    x := A[r]
    j := p - 1
    i := p
    for i < r {
        if A[i] <= x {
            j++
            tmp := A[j]
            A[j] = A[i]
            A[i] = tmp
        }
        i++
    }
    temp := A[j+1]
    A[j+1] = A[r]
    A[r] = temp
    return j + 1
}
// ConcurrentQuicksort - a concurrent version of a quicksort algorithm
func ConcurrentQuicksort(A []int, p int, r int, wg *sync.WaitGroup) {
    if p < r {
        q := ConcurrentPartition(A, p, r)
        wg.Add(2)
        go func() {
            ConcurrentQuicksort(A, p, q-1, wg)
            wg.Done()
        }()
        go func() {
            ConcurrentQuicksort(A, q+1, r, wg)
            wg.Done()
        }()
    }
}

Package testing
A sample benchmark function looks like this:
func BenchmarkHello(b *testing.B) {
    for i := 0; i < b.N; i++ {
        fmt.Sprintf("hello")
    }
}
The benchmark function must run the target code b.N times. During
benchmark execution, b.N is adjusted until the benchmark function
lasts long enough to be timed reliably.
I don't see a benchmark loop in your code. Try
func benchmarkConcurrentQuickSort(size int, b *testing.B) {
    A := RandomArray(size)
    var wg sync.WaitGroup
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        ConcurrentQuicksort(A, 0, len(A)-1, &wg)
        wg.Wait()
    }
}
Output:
BenchmarkConcurrentQuickSort500-4 10000 122291 ns/op
BenchmarkConcurrentQuickSort1000-4 5000 221154 ns/op
BenchmarkConcurrentQuickSort5000-4 1000 1225230 ns/op
BenchmarkConcurrentQuickSort10000-4 500 2568024 ns/op
BenchmarkConcurrentQuickSort20000-4 300 5808130 ns/op
BenchmarkConcurrentQuickSort1000000-4 1 1371751710 ns/op
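One caveat worth adding (my note, not part of the original answer): ConcurrentQuicksort sorts A in place, so after the first iteration the b.N loop keeps re-sorting an already-sorted array, and the work being timed is no longer sorting random data. If you want every iteration to sort fresh input, you can regenerate the array with the timer paused, along these lines (per-iteration timer toggling adds some overhead of its own, so treat the numbers as comparative):
func benchmarkConcurrentQuickSort(size int, b *testing.B) {
    var wg sync.WaitGroup
    for i := 0; i < b.N; i++ {
        b.StopTimer()
        A := RandomArray(size) // fresh, unsorted input every iteration
        b.StartTimer()
        ConcurrentQuicksort(A, 0, len(A)-1, &wg)
        wg.Wait()
    }
}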

Related

Concurrent code slower than sequential code on parallel problem? [duplicate]

This question already has answers here: Why does adding concurrency slow down this golang code?
I wrote some code to execute Monte Carlo simulations. The first thing I wrote was this sequential version:
func simulationSequential(experiment func() bool, numTrials int) float64 {
    ocurrencesEvent := 0
    for trial := 0; trial < numTrials; trial++ {
        eventHappend := experiment()
        if eventHappend {
            ocurrencesEvent++
        }
    }
    return float64(ocurrencesEvent) / float64(numTrials)
}
Then, I figured I could run some of the experiments concurrently and get a result faster using my laptop's multiple cores. So, I wrote the following version:
func simulationConcurrent(experiment func() bool, numTrials, nGoroutines int) float64 {
    ch := make(chan int)
    var wg sync.WaitGroup
    // Launch work in multiple goroutines
    for i := 0; i < nGoroutines; i++ {
        wg.Add(1)
        go func() {
            localOcurrences := 0
            for j := 0; j < numTrials/nGoroutines; j++ {
                eventHappend := experiment()
                if eventHappend {
                    localOcurrences++
                }
            }
            ch <- localOcurrences
            wg.Done()
        }()
    }
    // Close the channel when all the goroutines are done
    go func() {
        wg.Wait()
        close(ch)
    }()
    // Accumulate the results of each experiment
    ocurrencesEvent := 0
    for localOcurrences := range ch {
        ocurrencesEvent += localOcurrences
    }
    return float64(ocurrencesEvent) / float64(numTrials)
}
To my surprise, when I run benchmarks on the two versions, the sequential one is faster than the concurrent one, and the concurrent version gets better as I decrease the number of goroutines. Why does this happen? I thought the concurrent version would be faster, since this is a highly parallelizable problem.
Here is my benchmark's code:
func tossEqualToSix() bool {
    // Simulate the toss of a six-sided die
    roll := rand.Intn(6) + 1
    if roll != 6 {
        return false
    }
    return true
}

const (
    numsSimBenchmark       = 1_000_000
    numGoroutinesBenchmark = 10
)

func BenchmarkSimulationSequential(b *testing.B) {
    for i := 0; i < b.N; i++ {
        simulationSequential(tossEqualToSix, numsSimBenchmark)
    }
}

func BenchmarkSimulationConcurrent(b *testing.B) {
    for i := 0; i < b.N; i++ {
        simulationConcurrent(tossEqualToSix, numsSimBenchmark, numGoroutinesBenchmark)
    }
}
And the results:
goos: linux
goarch: amd64
pkg: github.com/jpuriol/montecarlo
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkSimulationSequential-8 36 30453588 ns/op
BenchmarkSimulationConcurrent-8 9 117462720 ns/op
PASS
ok github.com/jpuriol/montecarlo 2.478s
You can download my code from GitHub.
I thought I would elaborate on my comment and post it with code and benchmark results.
The experiment function uses the package-level functions of the rand package. Under the hood, these use the globalRand instance of rand.Rand; for example, func Intn(n int) int { return globalRand.Intn(n) }. Because the random number generator is not safe for concurrent use, globalRand is instantiated in the following way:
/*
 * Top-level convenience functions
 */
var globalRand = New(&lockedSource{src: NewSource(1).(*rngSource)})

type lockedSource struct {
    lk  sync.Mutex
    src *rngSource
}

func (r *lockedSource) Int63() (n int64) {
    r.lk.Lock()
    n = r.src.Int63()
    r.lk.Unlock()
    return
}

...
This means that every invocation of rand.Intn is guarded by a single global lock. As a consequence, the experiment function effectively runs sequentially: each call to rand.Intn cannot start generating a random number before the previous call completes.
Here is the redesigned code. Each experiment function has its own random generator. The assumption is that a single experiment function is used by one goroutine, so it does not need lock protection.
package main

import (
    "math/rand"
    "sync"
    "testing"
    "time"
)

func simulationSequential(experimentFuncFactory func() func() bool, numTrials int) float64 {
    experiment := experimentFuncFactory()
    ocurrencesEvent := 0
    for trial := 0; trial < numTrials; trial++ {
        eventHappend := experiment()
        if eventHappend {
            ocurrencesEvent++
        }
    }
    return float64(ocurrencesEvent) / float64(numTrials)
}

func simulationConcurrent(experimentFuncFactory func() func() bool, numTrials, nGoroutines int) float64 {
    ch := make(chan int)
    var wg sync.WaitGroup
    // Launch work in multiple goroutines
    for i := 0; i < nGoroutines; i++ {
        wg.Add(1)
        go func() {
            experiment := experimentFuncFactory()
            localOcurrences := 0
            for j := 0; j < numTrials/nGoroutines; j++ {
                eventHappend := experiment()
                if eventHappend {
                    localOcurrences++
                }
            }
            ch <- localOcurrences
            wg.Done()
        }()
    }
    // Close the channel when all the goroutines are done
    go func() {
        wg.Wait()
        close(ch)
    }()
    // Accumulate the results of each experiment
    ocurrencesEvent := 0
    for localOcurrences := range ch {
        ocurrencesEvent += localOcurrences
    }
    return float64(ocurrencesEvent) / float64(numTrials)
}

func tossEqualToSix() func() bool {
    prng := rand.New(rand.NewSource(time.Now().UnixNano()))
    return func() bool {
        // Simulate the toss of a six-sided die
        roll := prng.Intn(6) + 1
        if roll != 6 {
            return false
        }
        return true
    }
}

const (
    numsSimBenchmark       = 5_000_000
    numGoroutinesBenchmark = 10
)

func BenchmarkSimulationSequential(b *testing.B) {
    for i := 0; i < b.N; i++ {
        simulationSequential(tossEqualToSix, numsSimBenchmark)
    }
}

func BenchmarkSimulationConcurrent(b *testing.B) {
    for i := 0; i < b.N; i++ {
        simulationConcurrent(tossEqualToSix, numsSimBenchmark, numGoroutinesBenchmark)
    }
}
Benchmark results are as follows:
goos: darwin
goarch: arm64
pkg: scratchpad
BenchmarkSimulationSequential-8 20 55142896 ns/op
BenchmarkSimulationConcurrent-8 82 12944360 ns/op
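To see the lock contention in isolation, a micro-benchmark along these lines (my sketch, not part of the original answer; it assumes the same math/rand, testing, and time imports as above) compares the package-level rand.Intn, which serializes on globalRand's mutex, against a per-goroutine *rand.Rand:
// Hammers the locked global source from all goroutines at once.
func BenchmarkGlobalRand(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            rand.Intn(6)
        }
    })
}

// Each goroutine builds its own generator, so calls never contend.
func BenchmarkLocalRand(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        prng := rand.New(rand.NewSource(time.Now().UnixNano()))
        for pb.Next() {
            prng.Intn(6)
        }
    })
}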

Why does having a default clause in a goroutine's select make it slower?

Referring to the following benchmark test code:
func BenchmarkRuneCountNoDefault(b *testing.B) {
    b.StopTimer()
    var strings []string
    numStrings := 10
    for n := 0; n < numStrings; n++ {
        s := RandStringBytesMaskImprSrc(10)
        strings = append(strings, s)
    }
    jobs := make(chan string)
    results := make(chan int)
    for i := 0; i < runtime.NumCPU(); i++ {
        go RuneCountNoDefault(jobs, results)
    }
    b.StartTimer()
    for n := 0; n < b.N; n++ {
        go func() {
            for n := 0; n < numStrings; n++ {
                <-results
            }
            return
        }()
        for n := 0; n < numStrings; n++ {
            jobs <- strings[n]
        }
    }
    close(jobs)
}

func RuneCountNoDefault(jobs chan string, results chan int) {
    for {
        select {
        case j, ok := <-jobs:
            if ok {
                results <- utf8.RuneCountInString(j)
            } else {
                return
            }
        }
    }
}
func BenchmarkRuneCountWithDefault(b *testing.B) {
    b.StopTimer()
    var strings []string
    numStrings := 10
    for n := 0; n < numStrings; n++ {
        s := RandStringBytesMaskImprSrc(10)
        strings = append(strings, s)
    }
    jobs := make(chan string)
    results := make(chan int)
    for i := 0; i < runtime.NumCPU(); i++ {
        go RuneCountWithDefault(jobs, results)
    }
    b.StartTimer()
    for n := 0; n < b.N; n++ {
        go func() {
            for n := 0; n < numStrings; n++ {
                <-results
            }
            return
        }()
        for n := 0; n < numStrings; n++ {
            jobs <- strings[n]
        }
    }
    close(jobs)
}

func RuneCountWithDefault(jobs chan string, results chan int) {
    for {
        select {
        case j, ok := <-jobs:
            if ok {
                results <- utf8.RuneCountInString(j)
            } else {
                return
            }
        default: //DIFFERENCE
        }
    }
}
// https://stackoverflow.com/questions/22892120/how-to-generate-a-random-string-of-a-fixed-length-in-golang
const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
const (
    letterIdxBits = 6                    // 6 bits to represent a letter index
    letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
    letterIdxMax  = 63 / letterIdxBits   // # of letter indices fitting in 63 bits
)

var src = rand.NewSource(time.Now().UnixNano())

func RandStringBytesMaskImprSrc(n int) string {
    b := make([]byte, n)
    // A src.Int63() generates 63 random bits, enough for letterIdxMax characters!
    for i, cache, remain := n-1, src.Int63(), letterIdxMax; i >= 0; {
        if remain == 0 {
            cache, remain = src.Int63(), letterIdxMax
        }
        if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
            b[i] = letterBytes[idx]
            i--
        }
        cache >>= letterIdxBits
        remain--
    }
    return string(b)
}
When I benchmarked the two functions, where one, RuneCountNoDefault, has no default clause in the select and the other, RuneCountWithDefault, has one, I got the following results:
BenchmarkRuneCountNoDefault-4 200000 8910 ns/op
BenchmarkRuneCountWithDefault-4 5 277798660 ns/op
Checking the CPU profile generated by the tests, I noticed that the function with the default clause spends a lot of time in channel operations.
Why does having a default clause in the goroutine's select make it slower?
I'm using Go version 1.10 for windows/amd64
The Go Programming Language Specification: Select statements
If one or more of the communications can proceed, a single one that
can proceed is chosen via a uniform pseudo-random selection.
Otherwise, if there is a default case, that case is chosen. If there
is no default case, the "select" statement blocks until at least one
of the communications can proceed.
Modifying your benchmark to count the number of proceed and default cases taken:
$ go test default_test.go -bench=.
goos: linux
goarch: amd64
BenchmarkRuneCountNoDefault-4 300000 4108 ns/op
BenchmarkRuneCountWithDefault-4 10 209890782 ns/op
--- BENCH: BenchmarkRuneCountWithDefault-4
default_test.go:90: proceeds 114
default_test.go:91: defaults 128343308
$
While the other cases were unable to proceed, the default case was taken 128343308 times in 209422470 (209890782 - 114*4108) nanoseconds, or about 1.63 nanoseconds per default case. If you do something small a large number of times, it adds up.
default_test.go:
package main

import (
    "math/rand"
    "runtime"
    "sync/atomic"
    "testing"
    "time"
    "unicode/utf8"
)

func BenchmarkRuneCountNoDefault(b *testing.B) {
    b.StopTimer()
    var strings []string
    numStrings := 10
    for n := 0; n < numStrings; n++ {
        s := RandStringBytesMaskImprSrc(10)
        strings = append(strings, s)
    }
    jobs := make(chan string)
    results := make(chan int)
    for i := 0; i < runtime.NumCPU(); i++ {
        go RuneCountNoDefault(jobs, results)
    }
    b.StartTimer()
    for n := 0; n < b.N; n++ {
        go func() {
            for n := 0; n < numStrings; n++ {
                <-results
            }
            return
        }()
        for n := 0; n < numStrings; n++ {
            jobs <- strings[n]
        }
    }
    close(jobs)
}

func RuneCountNoDefault(jobs chan string, results chan int) {
    for {
        select {
        case j, ok := <-jobs:
            if ok {
                results <- utf8.RuneCountInString(j)
            } else {
                return
            }
        }
    }
}

var proceeds, defaults uint64

func BenchmarkRuneCountWithDefault(b *testing.B) {
    b.StopTimer()
    var strings []string
    numStrings := 10
    for n := 0; n < numStrings; n++ {
        s := RandStringBytesMaskImprSrc(10)
        strings = append(strings, s)
    }
    jobs := make(chan string)
    results := make(chan int)
    for i := 0; i < runtime.NumCPU(); i++ {
        go RuneCountWithDefault(jobs, results)
    }
    b.StartTimer()
    for n := 0; n < b.N; n++ {
        go func() {
            for n := 0; n < numStrings; n++ {
                <-results
            }
            return
        }()
        for n := 0; n < numStrings; n++ {
            jobs <- strings[n]
        }
    }
    close(jobs)
    b.Log("proceeds", atomic.LoadUint64(&proceeds))
    b.Log("defaults", atomic.LoadUint64(&defaults))
}

func RuneCountWithDefault(jobs chan string, results chan int) {
    for {
        select {
        case j, ok := <-jobs:
            atomic.AddUint64(&proceeds, 1)
            if ok {
                results <- utf8.RuneCountInString(j)
            } else {
                return
            }
        default: //DIFFERENCE
            atomic.AddUint64(&defaults, 1)
        }
    }
}

// https://stackoverflow.com/questions/22892120/how-to-generate-a-random-string-of-a-fixed-length-in-golang
const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
const (
    letterIdxBits = 6                    // 6 bits to represent a letter index
    letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
    letterIdxMax  = 63 / letterIdxBits   // # of letter indices fitting in 63 bits
)

var src = rand.NewSource(time.Now().UnixNano())

func RandStringBytesMaskImprSrc(n int) string {
    b := make([]byte, n)
    // A src.Int63() generates 63 random bits, enough for letterIdxMax characters!
    for i, cache, remain := n-1, src.Int63(), letterIdxMax; i >= 0; {
        if remain == 0 {
            cache, remain = src.Int63(), letterIdxMax
        }
        if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
            b[i] = letterBytes[idx]
            i--
        }
        cache >>= letterIdxBits
        remain--
    }
    return string(b)
}
Playground: https://play.golang.org/p/DLnAY0hovQG
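A general takeaway (my addition, not from the answer above): a default case turns the select into a busy poll. If a worker has nothing useful to do when no channel is ready, drop the default so the select blocks, or at least yield to the scheduler in the default case. A sketch reusing the imports of the test file above (runtime, unicode/utf8); RuneCountYielding is a hypothetical name:
func RuneCountYielding(jobs chan string, results chan int) {
    for {
        select {
        case j, ok := <-jobs:
            if !ok {
                return
            }
            results <- utf8.RuneCountInString(j)
        default:
            runtime.Gosched() // yield the processor instead of spinning at full speed
        }
    }
}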

Confusing concurrency and performance issue in Go

I've just started learning Go by watching this great course. To be clear, for years I have written only PHP, and concurrency/parallelism is new to me, so I'm a little confused by it.
In this course, there is a task to create a program which calculates factorials with 100 computations. I went a bit further, and to compare performance I changed it to 10000, and for some reason the sequential program works the same or even faster than the concurrent one.
Here I'm going to provide three solutions: mine, the teacher's, and a sequential one.
My solution:
package main

import (
    "fmt"
)

func gen(steps int) <-chan int {
    out := make(chan int)
    go func() {
        for j := 0; j < steps; j++ {
            out <- j
        }
        close(out)
    }()
    return out
}

func factorial(in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
        for n := range in {
            out <- fact(n)
        }
        close(out)
    }()
    return out
}

func fact(n int) int {
    total := 1
    for i := n; i > 0; i-- {
        total *= i
    }
    return total
}

func main() {
    steps := 10000
    for i := 0; i < steps; i++ {
        for n := range factorial(gen(10)) {
            fmt.Println(n)
        }
    }
}
execution time:
real 0m6,356s
user 0m3,885s
sys 0m0,870s
Teacher's solution:
package main

import (
    "fmt"
)

func gen(steps int) <-chan int {
    out := make(chan int)
    go func() {
        for i := 0; i < steps; i++ {
            for j := 0; j < 10; j++ {
                out <- j
            }
        }
        close(out)
    }()
    return out
}

func factorial(in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
        for n := range in {
            out <- fact(n)
        }
        close(out)
    }()
    return out
}

func fact(n int) int {
    total := 1
    for i := n; i > 0; i-- {
        total *= i
    }
    return total
}

func main() {
    steps := 10000
    for n := range factorial(gen(steps)) {
        fmt.Println(n)
    }
}
execution time:
real 0m2,836s
user 0m1,388s
sys 0m0,492s
Sequential:
package main

import (
    "fmt"
)

func fact(n int) int {
    total := 1
    for i := n; i > 0; i-- {
        total *= i
    }
    return total
}

func main() {
    steps := 10000
    for i := 0; i < steps; i++ {
        for j := 0; j < 10; j++ {
            fmt.Println(fact(j))
        }
    }
}
execution time:
real 0m2,513s
user 0m1,113s
sys 0m0,387s
So, as you can see, the sequential solution is the fastest, the teacher's solution is in second place, and my solution is third.
First question: why is the sequential solution the fastest?
And second: why is my solution so slow? If I understand correctly, in my solution I'm creating 10000 goroutines inside gen and 10000 inside factorial, while in the teacher's solution he creates only 1 goroutine in gen and 1 in factorial. Is my solution so slow because I'm creating too many unneeded goroutines?
It's the difference between concurrency and parallelism - yours, your teacher's, and the sequential version are progressively less concurrent in design, but how parallel they actually run depends on the number of CPU cores, and there is a setup and communication cost associated with concurrency. There are no asynchronous calls in the code, so only parallelism will improve speed.
This is worth a look: https://blog.golang.org/concurrency-is-not-parallelism
Also, even with parallel cores, the speedup depends on the nature of the workload - look up Amdahl's law for an explanation.
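For a concrete feel of Amdahl's law: if a fraction p of the work can run in parallel on n cores, the overall speedup is 1 / ((1 - p) + p/n). With p = 0.5, even infinitely many cores cap the speedup at 2x; in these programs, where channel communication and printing dominate, p is small, so extra cores buy very little.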
Let's start with some fundamental benchmarks for factorial computation.
$ go test -run=! -bench=. factorial_test.go
goos: linux
goarch: amd64
BenchmarkFact0-4 1000000000 2.07 ns/op
BenchmarkFact9-4 300000000 4.37 ns/op
BenchmarkFact0To9-4 50000000 36.0 ns/op
BenchmarkFact10K0To9-4 3000 384069 ns/op
$
The CPU time is very small, even for 10,000 iterations of factorials zero through nine.
factorial_test.go:
package main

import "testing"

func fact(n int) int {
    total := 1
    for i := n; i > 0; i-- {
        total *= i
    }
    return total
}

var sinkFact int

func BenchmarkFact0(b *testing.B) {
    for N := 0; N < b.N; N++ {
        j := 0
        sinkFact = fact(j)
    }
}

func BenchmarkFact9(b *testing.B) {
    for N := 0; N < b.N; N++ {
        j := 9
        sinkFact = fact(j)
    }
}

func BenchmarkFact0To9(b *testing.B) {
    for N := 0; N < b.N; N++ {
        for j := 0; j < 10; j++ {
            sinkFact = fact(j)
        }
    }
}

func BenchmarkFact10K0To9(b *testing.B) {
    for N := 0; N < b.N; N++ {
        steps := 10000
        for i := 0; i < steps; i++ {
            for j := 0; j < 10; j++ {
                sinkFact = fact(j)
            }
        }
    }
}
Let's look at the time for the sequential program.
$ go build -a sequential.go && time ./sequential
real 0m0.247s
user 0m0.054s
sys 0m0.149s
Writing to the terminal is obviously a major bottleneck. Let's write to a sink.
$ go build -a sequential.go && time ./sequential > /dev/null
real 0m0.070s
user 0m0.049s
sys 0m0.020s
It's still a lot more than the 0m0.000000384069s for the factorial computation.
sequential.go:
package main

import (
    "fmt"
)

func fact(n int) int {
    total := 1
    for i := n; i > 0; i-- {
        total *= i
    }
    return total
}

func main() {
    steps := 10000
    for i := 0; i < steps; i++ {
        for j := 0; j < 10; j++ {
            fmt.Println(fact(j))
        }
    }
}
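Since the terminal writes dominate, one way to shrink that cost (my sketch, not one of the measured programs) is to buffer stdout so thousands of Println calls collapse into a few large writes:
package main

import (
    "bufio"
    "fmt"
    "os"
)

func fact(n int) int {
    total := 1
    for i := n; i > 0; i-- {
        total *= i
    }
    return total
}

func main() {
    w := bufio.NewWriter(os.Stdout)
    defer w.Flush() // push any buffered output before exit
    for i := 0; i < 10000; i++ {
        for j := 0; j < 10; j++ {
            fmt.Fprintln(w, fact(j)) // buffered write instead of a syscall per line
        }
    }
}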
Attempts to use concurrency for such a trivial amount of parallel work are likely to fail. Go goroutines and channels are cheap, but they are not free. Also, a single channel and a single terminal are the bottleneck, the limiting factor, even when writing to a sink. See Amdahl's Law for parallel computing. See Concurrency is not parallelism.
$ go build -a teacher.go && time ./teacher > /dev/null
real 0m0.123s
user 0m0.123s
sys 0m0.022s
$ go build -a student.go && time ./student > /dev/null
real 0m0.135s
user 0m0.113s
sys 0m0.038s
teacher.go:
package main

import (
    "fmt"
)

func gen(steps int) <-chan int {
    out := make(chan int)
    go func() {
        for i := 0; i < steps; i++ {
            for j := 0; j < 10; j++ {
                out <- j
            }
        }
        close(out)
    }()
    return out
}

func factorial(in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
        for n := range in {
            out <- fact(n)
        }
        close(out)
    }()
    return out
}

func fact(n int) int {
    total := 1
    for i := n; i > 0; i-- {
        total *= i
    }
    return total
}

func main() {
    steps := 10000
    for n := range factorial(gen(steps)) {
        fmt.Println(n)
    }
}
student.go:
package main

import (
    "fmt"
)

func gen(steps int) <-chan int {
    out := make(chan int)
    go func() {
        for j := 0; j < steps; j++ {
            out <- j
        }
        close(out)
    }()
    return out
}

func factorial(in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
        for n := range in {
            out <- fact(n)
        }
        close(out)
    }()
    return out
}

func fact(n int) int {
    total := 1
    for i := n; i > 0; i-- {
        total *= i
    }
    return total
}

func main() {
    steps := 10000
    for i := 0; i < steps; i++ {
        for n := range factorial(gen(10)) {
            fmt.Println(n)
        }
    }
}
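If you want the pipeline shape to pay off, the per-item channel overhead has to shrink relative to the work. One option (my sketch, not from the answer; genBatched is a hypothetical variant of gen) is to batch values so each channel operation carries many inputs:
// genBatched sends slices of inputs instead of single ints, paying the
// channel-synchronization cost once per batch rather than once per value.
func genBatched(steps, batch int) <-chan []int {
    out := make(chan []int)
    go func() {
        buf := make([]int, 0, batch)
        for i := 0; i < steps; i++ {
            for j := 0; j < 10; j++ {
                buf = append(buf, j)
                if len(buf) == batch {
                    out <- buf
                    buf = make([]int, 0, batch)
                }
            }
        }
        if len(buf) > 0 {
            out <- buf // flush the final partial batch
        }
        close(out)
    }()
    return out
}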

Golang foreach difference

func Benchmark_foreach1(b *testing.B) {
    var test map[int]int
    test = make(map[int]int)
    for i := 0; i < 100000; i++ {
        test[i] = 1
    }
    for i := 0; i < b.N; i++ {
        for i, _ := range test {
            if test[i] != 1 {
                panic("ds")
            }
        }
    }
}

func Benchmark_foreach2(b *testing.B) {
    var test map[int]int
    test = make(map[int]int)
    for i := 0; i < 100000; i++ {
        test[i] = 1
    }
    for i := 0; i < b.N; i++ {
        for _, v := range test {
            if v != 1 {
                panic("heh")
            }
        }
    }
}
Running it gives the results below:
goos: linux
goarch: amd64
Benchmark_foreach1-2 500 3172323 ns/op
Benchmark_foreach2-2 1000 1707214 ns/op
Why is foreach-2 slow?
I think Benchmark_foreach2-2 is about 2 times faster - it requires 1707214 nanoseconds per operation, while the first one takes 3172323. So the second one is 3172323 / 1707214 = 1.85 times faster.
Reason: the second doesn't need to fetch the value from memory again - it already has the value in the v variable.
The test[k] lookup in BenchmarkForeachK takes time to randomly read the value, so BenchmarkForeachK takes more time than BenchmarkForeachV: 9362945 ns/op versus 4213940 ns/op.
For example,
package main

import "testing"

func testMap() map[int]int {
    test := make(map[int]int)
    for i := 0; i < 100000; i++ {
        test[i] = 1
    }
    return test
}

func BenchmarkForeachK(b *testing.B) {
    test := testMap()
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for k := range test {
            if test[k] != 1 {
                panic("eh")
            }
        }
    }
}

func BenchmarkForeachV(b *testing.B) {
    test := testMap()
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for _, v := range test {
            if v != 1 {
                panic("heh")
            }
        }
    }
}
Output:
$ go test foreach_test.go -bench=.
BenchmarkForeachK-4 200 9362945 ns/op 0 B/op 0 allocs/op
BenchmarkForeachV-4 300 4213940 ns/op 0 B/op 0 allocs/op

Concat byte arrays

Can someone please point at a more efficient version of the following?
b := make([]byte, 0, sizeTotal)
b = append(b, size...)
b = append(b, contentType...)
b = append(b, lenCallbackid...)
b = append(b, lenTarget...)
b = append(b, lenAction...)
b = append(b, lenContent...)
b = append(b, callbackid...)
b = append(b, target...)
b = append(b, action...)
b = append(b, content...)
Every variable is a byte slice, apart from sizeTotal.
Update:
Code:
type Message struct {
    size        uint32
    contentType uint8
    callbackId  string
    target      string
    action      string
    content     string
}

var res []byte
var b []byte = make([]byte, 0, 4096)

func (m *Message) ToByte() []byte {
    callbackIdIntLen := len(m.callbackId)
    targetIntLen := len(m.target)
    actionIntLen := len(m.action)
    contentIntLen := len(m.content)
    lenCallbackid := make([]byte, 4)
    binary.LittleEndian.PutUint32(lenCallbackid, uint32(callbackIdIntLen))
    callbackid := []byte(m.callbackId)
    lenTarget := make([]byte, 4)
    binary.LittleEndian.PutUint32(lenTarget, uint32(targetIntLen))
    target := []byte(m.target)
    lenAction := make([]byte, 4)
    binary.LittleEndian.PutUint32(lenAction, uint32(actionIntLen))
    action := []byte(m.action)
    lenContent := make([]byte, 4)
    binary.LittleEndian.PutUint32(lenContent, uint32(contentIntLen))
    content := []byte(m.content)
    sizeTotal := 21 + callbackIdIntLen + targetIntLen + actionIntLen + contentIntLen
    size := make([]byte, 4)
    binary.LittleEndian.PutUint32(size, uint32(sizeTotal))
    b = b[:0]
    b = append(b, size...)
    b = append(b, byte(m.contentType))
    b = append(b, lenCallbackid...)
    b = append(b, lenTarget...)
    b = append(b, lenAction...)
    b = append(b, lenContent...)
    b = append(b, callbackid...)
    b = append(b, target...)
    b = append(b, action...)
    b = append(b, content...)
    res = b
    return b
}

func FromByte(bytes []byte) *Message {
    size := binary.LittleEndian.Uint32(bytes[0:4])
    contentType := bytes[4:5][0]
    lenCallbackid := binary.LittleEndian.Uint32(bytes[5:9])
    lenTarget := binary.LittleEndian.Uint32(bytes[9:13])
    lenAction := binary.LittleEndian.Uint32(bytes[13:17])
    lenContent := binary.LittleEndian.Uint32(bytes[17:21])
    callbackid := string(bytes[21 : 21+lenCallbackid])
    target := string(bytes[21+lenCallbackid : 21+lenCallbackid+lenTarget])
    action := string(bytes[21+lenCallbackid+lenTarget : 21+lenCallbackid+lenTarget+lenAction])
    content := string(bytes[size-lenContent : size])
    return &Message{size, contentType, callbackid, target, action, content}
}
Benchmarks:
func BenchmarkMessageToByte(b *testing.B) {
    m := NewMessage(uint8(3), "agsdggsdasagdsdgsgddggds", "sometarSFAFFget", "somFSAFSAFFSeaction", "somfasfsasfafsejsonzhit")
    for n := 0; n < b.N; n++ {
        m.ToByte()
    }
}

func BenchmarkMessageFromByte(b *testing.B) {
    m := NewMessage(uint8(1), "sagdsgaasdg", "soSASFASFASAFSFASFAGmetarget", "adsgdgsagdssgdsgd", "agsdsdgsagdsdgasdg").ToByte()
    for n := 0; n < b.N; n++ {
        FromByte(m)
    }
}

func BenchmarkStringToByte(b *testing.B) {
    for n := 0; n < b.N; n++ {
        _ = []byte("abcdefghijklmnoqrstuvwxyz")
    }
}

func BenchmarkStringFromByte(b *testing.B) {
    s := []byte("abcdefghijklmnoqrstuvwxyz")
    for n := 0; n < b.N; n++ {
        _ = string(s)
    }
}

func BenchmarkUintToByte(b *testing.B) {
    for n := 0; n < b.N; n++ {
        i := make([]byte, 4)
        binary.LittleEndian.PutUint32(i, uint32(99))
    }
}

func BenchmarkUintFromByte(b *testing.B) {
    i := make([]byte, 4)
    binary.LittleEndian.PutUint32(i, uint32(99))
    for n := 0; n < b.N; n++ {
        binary.LittleEndian.Uint32(i)
    }
}
Bench results:
BenchmarkMessageToByte 10000000 280 ns/op
BenchmarkMessageFromByte 10000000 293 ns/op
BenchmarkStringToByte 50000000 55.1 ns/op
BenchmarkStringFromByte 50000000 49.7 ns/op
BenchmarkUintToByte 1000000000 2.14 ns/op
BenchmarkUintFromByte 2000000000 1.71 ns/op
Provided the memory is already allocated, a sequence of x = append(x, a...) calls is rather efficient in Go.
In your example, the initial allocation (make) probably costs more than the sequence of appends. It depends on the size of the fields. Consider the following benchmark:
package main

import (
    "testing"
)

const sizeTotal = 25

var res []byte // To enforce heap allocation

func BenchmarkWithAlloc(b *testing.B) {
    a := []byte("abcde")
    for i := 0; i < b.N; i++ {
        x := make([]byte, 0, sizeTotal)
        x = append(x, a...)
        x = append(x, a...)
        x = append(x, a...)
        x = append(x, a...)
        x = append(x, a...)
        res = x // Make sure x escapes, and is therefore heap allocated
    }
}

func BenchmarkWithoutAlloc(b *testing.B) {
    a := []byte("abcde")
    x := make([]byte, 0, sizeTotal)
    for i := 0; i < b.N; i++ {
        x = x[:0]
        x = append(x, a...)
        x = append(x, a...)
        x = append(x, a...)
        x = append(x, a...)
        x = append(x, a...)
        res = x
    }
}
On my box, the result is:
testing: warning: no tests to run
PASS
BenchmarkWithAlloc 10000000 116 ns/op 32 B/op 1 allocs/op
BenchmarkWithoutAlloc 50000000 24.0 ns/op 0 B/op 0 allocs/op
Systematically reallocating the buffer (even a small one) makes this benchmark at least 5 times slower.
So your best hope of optimizing this code is to make sure you do not reallocate a buffer for each packet you build. Instead, keep your buffer and reuse it for each marshalling operation.
You can reset a slice while keeping its underlying buffer allocated with the following statement:
x = x[:0]
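Putting that together, a reuse-friendly shape for the marshaller might look like this (a sketch; AppendTo and appendUint32 are hypothetical names, not the poster's API, and it assumes the same encoding/binary import as the question's code):
// appendUint32 appends v to b in little-endian order without allocating.
func appendUint32(b []byte, v uint32) []byte {
    var tmp [4]byte
    binary.LittleEndian.PutUint32(tmp[:], v)
    return append(b, tmp[:]...)
}

// AppendTo marshals m into b, reusing b's backing array across calls.
func (m *Message) AppendTo(b []byte) []byte {
    sizeTotal := 21 + len(m.callbackId) + len(m.target) + len(m.action) + len(m.content)
    b = appendUint32(b[:0], uint32(sizeTotal)) // b[:0] keeps the allocated capacity
    b = append(b, m.contentType)
    b = appendUint32(b, uint32(len(m.callbackId)))
    b = appendUint32(b, uint32(len(m.target)))
    b = appendUint32(b, uint32(len(m.action)))
    b = appendUint32(b, uint32(len(m.content)))
    b = append(b, m.callbackId...)
    b = append(b, m.target...)
    b = append(b, m.action...)
    b = append(b, m.content...)
    return b
}
Called in a loop as buf = m.AppendTo(buf), it allocates only when a message outgrows the buffer's capacity.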
I looked carefully at that and made the following benchmarks.
package append

import "testing"

func BenchmarkAppend(b *testing.B) {
    as := 1000
    a := make([]byte, as)
    s := make([]byte, 0, b.N*as)
    for i := 0; i < b.N; i++ {
        s = append(s, a...)
    }
}

func BenchmarkCopy(b *testing.B) {
    as := 1000
    a := make([]byte, as)
    s := make([]byte, 0, b.N*as)
    for i := 0; i < b.N; i++ {
        copy(s[i*as:(i+1)*as], a)
    }
}
The results are
grzesiek@klapacjusz ~/g/s/t/append> go test -bench . -benchmem
testing: warning: no tests to run
PASS
BenchmarkAppend 10000000 202 ns/op 1000 B/op 0 allocs/op
BenchmarkCopy 10000000 201 ns/op 1000 B/op 0 allocs/op
ok test/append 4.564s
If totalSize is big enough, then your code makes no memory allocations; it copies only the bytes it needs to copy. It is perfectly fine.
