How to use channels efficiently [closed] - performance

I read in Uber's style guide that one should use a channel buffer size of at most 1.
It's clear to me that using a channel size of 100 or 1000 is very bad practice, but I was wondering why a channel size of 10 isn't considered a valid option. I'm missing some step to reach the right conclusion.
Below, you can follow my arguments (and counter arguments) backed by some benchmark test.
I understand that if both goroutines (the one writing to this channel and the one reading from it) were interrupted between sequential writes or reads by some other IO action, no gain is to be expected from a larger channel buffer, and I agree that 1 is the best option.
But let's say that there is no other significant goroutine switching needed apart from the implicit locking and unlocking caused by writing to and reading from the channel. Then I would conclude the following:
Consider the number of context switches when processing 100 values on a channel with a buffer of size 1 versus size 10 (GR = goroutine):
Buffer=1: (GR1 inserts 1 value, GR2 reads 1 value) x 100 ~ 200 goroutine switches
Buffer=10: (GR1 inserts 10 values, GR2 reads 10 values) x 10 ~ 20 goroutine switches
I did some benchmarking to prove that this actually goes faster:
package main

import (
    "testing"
)

type a struct {
    b [100]int64
}

func BenchmarkBuffer1(b *testing.B) {
    count := 0
    c := make(chan a, 1)
    go func() {
        for i := 0; i < b.N; i++ {
            c <- a{}
        }
        close(c)
    }()
    for v := range c {
        for i := range v.b {
            count += i
        }
    }
}
func BenchmarkBuffer10(b *testing.B) {
    count := 0
    c := make(chan a, 10)
    go func() {
        for i := 0; i < b.N; i++ {
            c <- a{}
        }
        close(c)
    }()
    for v := range c {
        for i := range v.b {
            count += i
        }
    }
}
Results when comparing simple reading & writing + non-blocking processing:
BenchmarkBuffer1-12 5072902 266 ns/op
BenchmarkBuffer10-12 6029602 179 ns/op
PASS
BenchmarkBuffer1-12 5228782 256 ns/op
BenchmarkBuffer10-12 5392410 216 ns/op
PASS
BenchmarkBuffer1-12 4806208 287 ns/op
BenchmarkBuffer10-12 4637842 233 ns/op
PASS
However, if I add a sleep every 10 reads, it doesn't yield any better results.
package main

import (
    "testing"
    "time"
)

func BenchmarkBuffer1WithSleep(b *testing.B) {
    count := 0
    c := make(chan int, 1)
    go func() {
        for i := 0; i < b.N; i++ {
            c <- i
        }
        close(c)
    }()
    for a := range c {
        count++
        if count%10 == 0 {
            time.Sleep(time.Duration(a) * time.Nanosecond)
        }
    }
}

func BenchmarkBuffer10WithSleep(b *testing.B) {
    count := 0
    c := make(chan int, 10)
    go func() {
        for i := 0; i < b.N; i++ {
            c <- i
        }
        close(c)
    }()
    for a := range c {
        count++
        if count%10 == 0 {
            time.Sleep(time.Duration(a) * time.Nanosecond)
        }
    }
}
Results when adding a sleep every 10 reads:
BenchmarkBuffer1WithSleep-12 856886 53219 ns/op
BenchmarkBuffer10WithSleep-12 929113 56939 ns/op
FYI: I also did the test again with only one CPU and got the following results:
BenchmarkBuffer1 5831193 207 ns/op
BenchmarkBuffer10 6226983 180 ns/op
BenchmarkBuffer1WithSleep 556635 35510 ns/op
BenchmarkBuffer10WithSleep 984472 61434 ns/op

Absolutely nothing is wrong with a channel of cap 500, e.g. if this channel is used as a semaphore.
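For illustration, a minimal sketch of that counting-semaphore pattern (my example, not part of the original answer; the capacity of 3 here stands in for the 500 above):

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    sem := make(chan struct{}, 3) // the capacity bounds how many tasks run at once
    var wg sync.WaitGroup
    for i := 0; i < 10; i++ {
        wg.Add(1)
        sem <- struct{}{} // acquire a slot; blocks while the semaphore is full
        go func(id int) {
            defer wg.Done()
            defer func() { <-sem }()          // release the slot
            time.Sleep(10 * time.Millisecond) // stand-in for real work
            fmt.Println("task", id, "done")
        }(i)
    }
    wg.Wait()
}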
The style guide you read recommends not to use buffered channels of, let's say, cap 64 "because this looks like a nice number". But this recommendation is not about performance! (By the way: your microbenchmarks are useless microbenchmarks; they do not measure anything relevant.)
An unbuffered channel is a kind of synchronisation primitive and as such very useful.
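A minimal sketch of that synchronisation behaviour (my illustration, not from the original answer): with cap 0, a send only completes once a receiver is ready, so the channel acts as a rendezvous point:

package main

import "fmt"

func main() {
    ready := make(chan struct{}) // unbuffered: send and receive must meet
    go func() {
        fmt.Println("worker: setup finished")
        ready <- struct{}{} // blocks until main is receiving
    }()
    <-ready // blocks until the worker reaches its send
    fmt.Println("main: observed that the worker is ready")
}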
A buffered channel may buffer between sender and receiver, and this buffering can be problematic for observing, tuning and debugging the code (because creation and consumption are further decoupled). That's why the style guide recommends unbuffered channels (or at most a cap of 1, as this is sometimes needed for correctness!).
It also doesn't prohibit larger buffer caps:
Any other [than 0 or 1] size must be subject to a high level of scrutiny. Consider how the size is determined, what prevents the channel from filling up under load and blocking writers, and what happens when this occurs. [emph. mine]
You may use a cap of 27 if you can explain why 27 (and not 22 or 31) and how this will influence program behaviour (not only performance!) if the buffer is filled.
Most people overrate performance. Correctness, operational stability and maintainability come first, and that is what this style guide is about here.

Related

data-race, two goroutines plus same val [closed]

Consider the code below. In my opinion, val should end up between 100 and 200, but it's always 200.
package main

import (
    "fmt"
    "runtime"
    "time"
)

var val = 0

func main() {
    num := runtime.NumCPU()
    fmt.Println("number of CPUs:", num)
    go add("A")
    go add("B")
    time.Sleep(1 * time.Second)
    fmt.Println("final value of val:", val)
}

func add(proc string) {
    for i := 0; i < 100; i++ {
        val++
        fmt.Printf("execute process[%s] and val is %d\n", proc, val)
        time.Sleep(5 * time.Millisecond)
    }
}
Why is val always 200 at the end?
There are 2 problems with your code:
You have a data race: val is written and read concurrently without synchronization. Its presence makes reasoning about the program's outcome meaningless.
The 1-second sleep in main() is too short - the goroutines may not be done yet after 1 second. You expect fmt.Printf to take no time at all, but console output takes a significant amount of time (on some OSes longer than others). So each loop won't take 100 * 5 = 500 milliseconds, but much, much longer.
Here's a fixed version that atomically increments val and properly waits for both goroutines to finish instead of assuming that they will be done within 1 second.
package main

import (
    "fmt"
    "runtime"
    "sync"
    "sync/atomic"
    "time"
)

var val = int32(0)

func main() {
    num := runtime.NumCPU()
    fmt.Println("number of CPUs:", num)
    var wg sync.WaitGroup
    wg.Add(2)
    go add("A", &wg)
    go add("B", &wg)
    wg.Wait()
    fmt.Println("final value of val:", atomic.LoadInt32(&val))
}

func add(proc string, wg *sync.WaitGroup) {
    for i := 0; i < 100; i++ {
        tmp := atomic.AddInt32(&val, 1)
        fmt.Printf("execute process[%s] and val is %d\n", proc, tmp)
        time.Sleep(5 * time.Millisecond)
    }
    wg.Done()
}
Incrementing the integer takes only on the order of a few nanoseconds, while each goroutine waits 5 milliseconds between increments.
That means each goroutine spends only about a millionth of its time actually doing the operation; the rest of the time it is sleeping. Therefore, the likelihood of interference is quite low, since both goroutines would need to perform the operation simultaneously.
Even if both goroutines run on equal timers, the practical precision of the time package is nowhere near the nanosecond scale required to produce collisions consistently. Plus, the goroutines are doing some printing, which further diverges their timings.
As pointed out in the comments, your code still has a data race (as is your intention, seemingly), meaning that we cannot say anything for certain about the output of your program, despite any observations; running it with go run -race will report the race. It's possible that it would output any number from 100 to 200, or different numbers entirely.

Benchmark with Goroutines

Pretty new to Go here, and I bumped into a problem when benchmarking with goroutines.
The code I have is here:
type store struct{}

func (n *store) WriteSpan(span interface{}) error {
    return nil
}

func smallTest(times int, b *testing.B) {
    writer := store{}
    var wg sync.WaitGroup
    numGoroutines := times
    wg.Add(numGoroutines)
    b.ResetTimer()
    b.ReportAllocs()
    for n := 0; n < numGoroutines; n++ {
        go func() {
            writer.WriteSpan(nil)
            wg.Done()
        }()
    }
    wg.Wait()
}

func BenchmarkTest1(b *testing.B) {
    smallTest(1000000, b)
}

func BenchmarkTest2(b *testing.B) {
    smallTest(10000000, b)
}
It looks to me like the runtime and allocations for both scenarios should be similar, but running them gives the following, vastly different results. I wonder why this happens - where do those extra allocations come from?
BenchmarkTest1-12 1000000000 0.26 ns/op 0 B/op 0 allocs/op
BenchmarkTest2-12 1 2868129398 ns/op 31872 B/op 83 allocs/op
PASS
I also noticed that if I add an inner loop that calls WriteSpan multiple times, the runtime and allocations scale roughly with numGoroutines times the number of calls. If this is not the way people benchmark with goroutines, are there any other standard ways to test? Thanks in advance.
Meaningless microbenchmarks produce meaningless results.
If this is not the way how people benchmark with goroutines, are there
any other standard ways to test?
It's not the way to benchmark anything. Benchmark real problems.
You run a very large number of goroutines, which do nothing, until you saturate the scheduler, the machine, and other resources. That merely proves that if you run anything enough times you can bring a machine to its knees.
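For what it's worth, here is a minimal sketch (my illustration, not part of the original answer) of how a Go benchmark normally lets the framework size the workload: the iteration count is driven by b.N (here via testing.B.RunParallel) rather than by a fixed goroutine count:

func BenchmarkWriteSpan(b *testing.B) {
    writer := store{} // the store type from the question
    b.ReportAllocs()
    b.RunParallel(func(pb *testing.PB) {
        // Each worker goroutine keeps going until the combined
        // number of iterations across all workers reaches b.N.
        for pb.Next() {
            writer.WriteSpan(nil)
        }
    })
}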

Generating random numbers concurrently in Go

I'm new to Go and to concurrent/parallel programming in general. In order to try out (and hopefully see the performance benefits of) goroutines, I've put together a small test program that simply generates 100 million random ints - first in a single goroutine, and then in as many goroutines as reported by runtime.NumCPU().
However, I consistently get worse performance using more goroutines than using a single one. I assume I'm missing something vital in either my program's design or the way in which I use goroutines/channels/other Go features. Any feedback is much appreciated.
I attach the code below.
package main

import "fmt"
import "time"
import "math/rand"
import "runtime"

func main() {
    // Figure out how many CPUs are available and tell Go to use all of them
    numThreads := runtime.NumCPU()
    runtime.GOMAXPROCS(numThreads)
    // Number of random ints to generate
    var numIntsToGenerate = 100000000
    // Number of ints to be generated by each spawned goroutine thread
    var numIntsPerThread = numIntsToGenerate / numThreads
    // Channel for communicating from goroutines back to main function
    ch := make(chan int, numIntsToGenerate)
    // Slices to keep resulting ints
    singleThreadIntSlice := make([]int, numIntsToGenerate, numIntsToGenerate)
    multiThreadIntSlice := make([]int, numIntsToGenerate, numIntsToGenerate)
    fmt.Printf("Initiating single-threaded random number generation.\n")
    startSingleRun := time.Now()
    // Generate all of the ints from a single goroutine, retrieve the expected
    // number of ints from the channel and put in target slice
    go makeRandomNumbers(numIntsToGenerate, ch)
    for i := 0; i < numIntsToGenerate; i++ {
        singleThreadIntSlice = append(singleThreadIntSlice, <-ch)
    }
    elapsedSingleRun := time.Since(startSingleRun)
    fmt.Printf("Single-threaded run took %s\n", elapsedSingleRun)
    fmt.Printf("Initiating multi-threaded random number generation.\n")
    startMultiRun := time.Now()
    // Run the designated number of goroutines, each of which generates its
    // expected share of the total random ints, retrieve the expected number
    // of ints from the channel and put in target slice
    for i := 0; i < numThreads; i++ {
        go makeRandomNumbers(numIntsPerThread, ch)
    }
    for i := 0; i < numIntsToGenerate; i++ {
        multiThreadIntSlice = append(multiThreadIntSlice, <-ch)
    }
    elapsedMultiRun := time.Since(startMultiRun)
    fmt.Printf("Multi-threaded run took %s\n", elapsedMultiRun)
}

func makeRandomNumbers(numInts int, ch chan int) {
    source := rand.NewSource(time.Now().UnixNano())
    generator := rand.New(source)
    for i := 0; i < numInts; i++ {
        ch <- generator.Intn(numInts * 100)
    }
}
First let's correct and optimize some things in your code:
Since Go 1.5, GOMAXPROCS defaults to the number of CPU cores available, so no need to set that (although it does no harm).
Numbers to generate:
var numIntsToGenerate = 100000000
var numIntsPerThread = numIntsToGenerate / numThreads
If numThreads is, say, 3, then in the multi-goroutine case fewer numbers will be generated (due to integer division), so let's correct it:
numIntsToGenerate = numIntsPerThread * numThreads
There's no need for a buffer of 100 million values; reduce it to a sensible size (e.g. 1000):
ch := make(chan int, 1000)
If you want to use append(), the slices you create should have 0 length (and proper capacity):
singleThreadIntSlice := make([]int, 0, numIntsToGenerate)
multiThreadIntSlice := make([]int, 0, numIntsToGenerate)
But in your case that's unnecessary: since only 1 goroutine collects the results, you can simply use indexing and create the slices like this:
singleThreadIntSlice := make([]int, numIntsToGenerate)
multiThreadIntSlice := make([]int, numIntsToGenerate)
And when collecting results:
for i := 0; i < numIntsToGenerate; i++ {
    singleThreadIntSlice[i] = <-ch
}
// ...
for i := 0; i < numIntsToGenerate; i++ {
    multiThreadIntSlice[i] = <-ch
}
OK, the code is now better. Running it, you will still find that the multi-goroutine version is slower. Why is that?
It's because controlling, synchronizing and collecting results from multiple goroutines has its own overhead. If the task they perform is small, the communication overhead will dominate and overall you lose performance.
Your case is such a case. Generating a single random number, once you've set up your rand.Rand, is pretty fast.
Let's modify your "task" to be big enough so that we can see the benefit of multiple goroutines:
// 1 million is enough now:
var numIntsToGenerate = 1000 * 1000

func makeRandomNumbers(numInts int, ch chan int) {
    source := rand.NewSource(time.Now().UnixNano())
    generator := rand.New(source)
    for i := 0; i < numInts; i++ {
        // Kill time, do some processing:
        for j := 0; j < 1000; j++ {
            generator.Intn(numInts * 100)
        }
        // and now return a single random number
        ch <- generator.Intn(numInts * 100)
    }
}
In this case, to get one random number we generate 1000 random numbers and just throw them away (to do some calculation / kill time) before generating the one we return. We do this so that the computation time of the worker goroutines outweighs the communication overhead of multiple goroutines.
Running the app now, my results on a 4-core machine:
Initiating single-threaded random number generation.
Single-threaded run took 2.440604504s
Initiating multi-threaded random number generation.
Multi-threaded run took 987.946758ms
The multi-goroutine version runs 2.5 times faster. This means that if your goroutines delivered random numbers in blocks of 1000, you would see execution 2.5 times faster compared to single-goroutine generation.
One last note:
Your single-goroutine version also uses multiple goroutines: 1 to generate numbers and 1 to collect the results. Most likely the collector does not fully utilize a CPU core and mostly just waits for the results, but still: 2 CPU cores are used. Let's estimate that "1.5" CPU cores are utilized. While the multi-goroutine version utilizes 4 CPU cores. Just as a rough estimation: 4 / 1.5 = 2.66, very close to our performance gain.
If you really want to generate the random numbers in parallel, then each task should generate its numbers and return them all in one go, rather than generating one number at a time and feeding each to a channel, since the channel reads and writes slow things down in the multi-goroutine case. Below is modified code in which each task generates its required numbers in one go; this performs better in the multi-goroutine case. I have also used a slice of slices to collect the results from the goroutines.
package main

import "fmt"
import "time"
import "math/rand"
import "runtime"

func main() {
    // Figure out how many CPUs are available and tell Go to use all of them
    numThreads := runtime.NumCPU()
    runtime.GOMAXPROCS(numThreads)
    // Number of random ints to generate
    var numIntsToGenerate = 100000000
    // Number of ints to be generated by each spawned goroutine thread
    var numIntsPerThread = numIntsToGenerate / numThreads
    // Channel for communicating from goroutines back to main function
    ch := make(chan []int)
    fmt.Printf("Initiating single-threaded random number generation.\n")
    startSingleRun := time.Now()
    // Generate all of the ints from a single goroutine, retrieve the expected
    // number of ints from the channel and put in target slice
    go makeRandomNumbers(numIntsToGenerate, ch)
    singleThreadIntSlice := <-ch
    elapsedSingleRun := time.Since(startSingleRun)
    fmt.Printf("Single-threaded run took %s\n", elapsedSingleRun)
    fmt.Printf("Initiating multi-threaded random number generation.\n")
    multiThreadIntSlice := make([][]int, numThreads)
    startMultiRun := time.Now()
    // Run the designated number of goroutines, each of which generates its
    // expected share of the total random ints, retrieve the expected number
    // of ints from the channel and put in target slice
    for i := 0; i < numThreads; i++ {
        go makeRandomNumbers(numIntsPerThread, ch)
    }
    for i := 0; i < numThreads; i++ {
        multiThreadIntSlice[i] = <-ch
    }
    elapsedMultiRun := time.Since(startMultiRun)
    fmt.Printf("Multi-threaded run took %s\n", elapsedMultiRun)
    // To avoid an unused-variable error
    fmt.Print(len(singleThreadIntSlice))
}

func makeRandomNumbers(numInts int, ch chan []int) {
    source := rand.NewSource(time.Now().UnixNano())
    generator := rand.New(source)
    result := make([]int, numInts)
    for i := 0; i < numInts; i++ {
        result[i] = generator.Intn(numInts * 100)
    }
    ch <- result
}

In golang my go routines are using all cores, but only between 50 and 75% of each core

I am using go1.5.3 linux/amd64. I have a goroutine that performs a mathematical operation that takes some time. Each goroutine acts independently and does not have to block.
My system has 12 cores. If I spawn 12 goroutines, the average use of all cores only reaches 31%. With 24 goroutines, it reaches 49%; with 240, I get 77%; 2400 gives me 76%.
Apparently, the rand.Intn(j) operation is what is slowing it down. Without it, the cores will run at 100%.
func DoSomeMath() int {
    k := 0
    for i := 0; i < 1000; i++ {
        j := i*i + 2
        k += i * rand.Intn(j)
    }
    return k
}
How can I get the program to use all the cores at 100% while using an RNG?
The main reason is that the global rand.* functions use a mutex, so at any given point only one random number can be generated at a time.
The reason #peterSO's answer (below) works is that there is no mutex and there is one generator per goroutine. However, you can end up with duplicate generator state if 2 or more goroutines start in the exact same nanosecond, although that is unlikely.
Look here to see how the global rand works under the hood.
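As an aside (my sketch, not from the original answer): one simple way to avoid two goroutines seeding identical generators in the same nanosecond is to mix a per-goroutine value into the seed:

package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            // Mix the goroutine's index into the seed so two goroutines
            // starting in the same nanosecond still get distinct state.
            r := rand.New(rand.NewSource(time.Now().UnixNano() + int64(id)))
            fmt.Println(id, r.Intn(100))
        }(i)
    }
    wg.Wait()
}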
To paraphrase, there are lies, damn lies, and benchmarks.
Despite being asked, you still haven't posted the code necessary to reproduce your issue: How to create a Minimal, Complete, and Verifiable example.
Here's a reproducible benchmark, which uses a PRNG, that should drive your CPUs to close to 100%:
package main

import (
    "math/rand"
    "runtime"
    "time"
)

func DoSomeCPU(done <-chan bool) {
    r := rand.New(rand.NewSource(time.Now().UnixNano()))
    k := 0
    for i := 0; i < 1000000; i++ {
        j := i*i + 2
        k += i * r.Intn(j)
    }
    _ = k
    <-done
}

func main() {
    numCPU := runtime.NumCPU()
    runtime.GOMAXPROCS(numCPU)
    done := make(chan bool, 2*numCPU)
    for {
        done <- true
        go DoSomeCPU(done)
    }
}
What results do you get when you run this code?

Reasonable use of goroutines in Go programs

My program has a long-running task. I have a list jdIdList that is too big - up to 1,000,000 items - so the code below doesn't work. Is there a way to improve the code with better use of goroutines?
It seems I have too many goroutines running, which makes my code fail to run.
What is a reasonable number of goroutines to have running?
var wg sync.WaitGroup
wg.Add(len(jdIdList))
c := make(chan string)
// just think of jdIdList as [0...1000000]
for _, jdId := range jdIdList {
    go func(jdId string) {
        defer wg.Done()
        for _, itemId := range itemIdList {
            // the following code does some computation that consumes much time
            // (you can just replace it with time.Sleep(time.Second * 1))
            cvVec, ok := cvVecMap[itemId]
            if !ok {
                continue
            }
            jdVec, ok := jdVecMap[jdId]
            if !ok {
                continue
            }
            // long computation
            _ = 0.3*computeDist(jdVec.JdPosVec, cvVec.CvPosVec) + 0.7*computeDist(jdVec.JdDescVec, cvVec.CvDescVec)
        }
        c <- fmt.Sprintf("done %s", jdId)
    }(jdId)
}
go func() {
    for resp := range c {
        fmt.Println(resp)
    }
}()
It looks like you're running too many things concurrently, making your computer run out of memory.
Here's a version of your code that uses a limited number of worker goroutines instead of a million, as in your example. Since only a few goroutines run at once, each has much more memory available before the system starts to swap. Make sure that the memory each small computation requires, times the number of concurrent goroutines, is less than the memory in your system: if the code inside the for jdId := range work loop requires less than 1 GB of memory and you have 4 cores and at least 4 GB of RAM, setting clvl to 4 should work fine.
The workers read jobs from the work channel until it is closed - a for range loop over a channel reads until the channel is closed, and that is how we tell the worker goroutines we are done. A WaitGroup is then used to close the output channel once every worker has finished, which in turn ends the collecting loop in main.
https://play.golang.org/p/Sy3i77TJjA
package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

func main() {
    runtime.GOMAXPROCS(runtime.NumCPU()) // not needed on Go 1.5 or later
    c := make(chan string)
    work := make(chan int, 1) // increasing 1 to a higher number will probably increase performance
    clvl := 4                 // simulating having 4 cores; use runtime.NumCPU() otherwise
    var wg sync.WaitGroup
    wg.Add(clvl)
    for i := 0; i < clvl; i++ {
        go func(i int) {
            for jdId := range work {
                time.Sleep(time.Millisecond * 100)
                c <- fmt.Sprintf("done %d", jdId)
            }
            wg.Done()
        }(i)
    }
    // give workers something to do
    go func() {
        for i := 0; i < 10; i++ {
            work <- i
        }
        close(work)
    }()
    // close output channel when all workers are done
    go func() {
        wg.Wait()
        close(c)
    }()
    count := 0
    for resp := range c {
        fmt.Println(resp, count)
        count += 1
    }
}
which generated this output on the Go Playground, while simulating four CPU cores.
done 1 0
done 0 1
done 3 2
done 2 3
done 5 4
done 4 5
done 7 6
done 6 7
done 9 8
done 8 9
Notice how the ordering is not guaranteed. The jdId variable holds the value you want. You should always test your concurrent programs using the Go race detector (e.g. go test -race or go run -race).
Also note that if you are using Go 1.4 or earlier and haven't set the GOMAXPROCS environment variable to the number of cores, you should do that, or add runtime.GOMAXPROCS(runtime.NumCPU()) at the beginning of your program.
