Channels and Parallelism confusion - go

I'm learning myself Golang, and I'm a bit confused about parallelism and how it is implemented in Golang.
Given the following example:
package main
import (
"fmt"
"sync"
"math/rand"
"time"
)
const (
workers = 1
rand_count = 5000000
)
func start_rand(ch chan int) {
defer close(ch)
var wg sync.WaitGroup
wg.Add(workers)
rand_routine := func(counter int) {
defer wg.Done()
for i:=0;i<counter;i++ {
seed := time.Now().UnixNano()
rand.Seed(seed)
ch<-rand.Intn(5000)
}
}
for i:=0; i<workers; i++ {
go rand_routine(rand_count/workers)
}
wg.Wait()
}
func main() {
start_time := time.Now()
mychan := make(chan int, workers)
go start_rand(mychan)
var wg sync.WaitGroup
wg.Add(workers)
work_handler := func() {
defer wg.Done()
for {
v, isOpen := <-mychan
if !isOpen { break }
fmt.Println(v)
}
}
for i:=0;i<workers;i++ {
go work_handler()
}
wg.Wait()
elapsed_time := time.Since(start_time)
fmt.Println("Done",elapsed_time)
}
This piece of code takes about one minute to run on my Macbook. I assumed that increasing the "workers" constants, would launch additional go routines, and since my laptop has multiple cores, would shorten the execution time.
This is not the case however. Increasing the workers does not reduce the execution time.
I was thinking that setting workers to 1, would create 1 goroutine to generate the random numbers, and setting it to 4, would create 4 goroutines. Given the multicore nature of my laptop, I was expecting that 4 workers would run on different cores, and therefore, increae the performance.
However, I see increased load on all my cores, even when workers is set to 1. What am I missing here?

Your code has some issues which makes it inherently slow:
You are seeding inside the loop. This needs only to be done once
You are using the same source for random numbers. This source is thread safe, but takes away any performance gains for concurrent workers. You could create a source for each worker with rand.New
You are printing a lot. Printing is thread safe, too. So that takes away any speed gains for concurrent workers.
As Zak already pointed out: The concurrent work inside the go routines is very cheap and the communication is expensive.
You could rewrite your program like that. Then you will see some speed gains when you change the number of workers:
package main
import (
"fmt"
"math/rand"
"time"
)
const (
workers = 1
randCount = 5000000
)
var results = [randCount]int{}
func randRoutine(start, counter int, c chan bool) {
r := rand.New(rand.NewSource(time.Now().UnixNano()))
for i := 0; i < counter; i++ {
results[start+i] = r.Intn(5000)
}
c <- true
}
func main() {
startTime := time.Now()
c := make(chan bool)
start := 0
for w := 0; w < workers; w++ {
go randRoutine(start, randCount/workers, c)
start += randCount / workers
}
for i := 0; i < workers; i++ {
<-c
}
elapsedTime := time.Since(startTime)
for _, i := range results {
fmt.Println(i)
}
fmt.Println("Time calulating", elapsedTime)
elapsedTime = time.Since(startTime)
fmt.Println("Toal time", elapsedTime)
}
This program does a lot of work in a go routine and communicates minimal. Also a different random source is used for each go routine.

Your code does not have just a single routine, even though you set the workers to 1.
There is 1 goroutine from the call go start_rand(...)
That goroutine creates N (worker) routines with go rand_routine(...) and waits for them to finish.
Then you also start N (worker) go routines with go work_handler()
Then you also have 1 goroutine that was started by main() func call.
so: 1 + 2N + 1 routines running for any given N where N == workers.
Plus, on top of that, the work that you are doing in the goroutines is pretty cheap (fast to execute). You are just generating random numbers.
If you look at the blocking and scheduler latency profiles of the program:
You can see from both of the images above that most of the time is spent in the concurrency constructs. This suggests there is a lot of contention in your program. While goroutines are cheap, there is still some blocking and synchronisation that needs to be done when sending a value over a channel. This can take a large proportion of the time of the program when the work being done by the producer is very fast / cheap.
To answer your original question, you see load on many cores because you have more than a single goroutine running.

Related

Job queue where workers can add jobs, is there an elegant solution to stop the program when all workers are idle?

I find myself in a situation where I have a queue of jobs where workers can add new jobs when they are done processing one.
For illustration, in the code below, a job consists in counting up to JOB_COUNTING_TO and, randomly, 1/5 of the time a worker will add a new job to the queue.
Because my workers can add jobs to the queue, it is my understanding that I was not able to use a channel as my job queue. This is because sending to the channel is blocking and, even with a buffered channel, this code, due to its recursive nature (jobs adding jobs) could easily reach a situation where all the workers are sending to the channel and no worker is available to receive.
This is why I decided to use a shared queue protected by a mutex.
Now, I would like the program to halt when all the workers are idle. Unfortunately this cannot be spotted just by looking for when len(jobQueue) == 0 as the queue could be empty but some worker still doing their job and maybe adding a new job after that.
The solution I came up with is, I feel a bit clunky, it makes use of variables var idleWorkerCount int and var isIdle [NB_WORKERS]bool to keep track of idle workers and the code stops when idleWorkerCount == NB_WORKERS.
My question is if there is a concurrency pattern that I could use to make this logic more elegant?
Also, for some reason I don't understand the technique that I currently use (code below) becomes really inefficient when the number of Workers becomes quite big (such as 300000 workers): for the same number of jobs, the code will be > 10x slower for NB_WORKERS = 300000 vs NB_WORKERS = 3000.
Thank you very much in advance!
package main
import (
"math/rand"
"sync"
)
const NB_WORKERS = 3000
const NB_INITIAL_JOBS = 300
const JOB_COUNTING_TO = 10000000
var jobQueue []int
var mu sync.Mutex
var idleWorkerCount int
var isIdle [NB_WORKERS]bool
func doJob(workerId int) {
mu.Lock()
if len(jobQueue) == 0 {
if !isIdle[workerId] {
idleWorkerCount += 1
}
isIdle[workerId] = true
mu.Unlock()
return
}
if isIdle[workerId] {
idleWorkerCount -= 1
}
isIdle[workerId] = false
var job int
job, jobQueue = jobQueue[0], jobQueue[1:]
mu.Unlock()
for i := 0; i < job; i += 1 {
}
if rand.Intn(5) == 0 {
mu.Lock()
jobQueue = append(jobQueue, JOB_COUNTING_TO)
mu.Unlock()
}
}
func main() {
// Filling up the queue with initial jobs
for i := 0; i < NB_INITIAL_JOBS; i += 1 {
jobQueue = append(jobQueue, JOB_COUNTING_TO)
}
var wg sync.WaitGroup
for i := 0; i < NB_WORKERS; i += 1 {
wg.Add(1)
go func(workerId int) {
for idleWorkerCount != NB_WORKERS {
doJob(workerId)
}
wg.Done()
}(i)
}
wg.Wait()
}
Because my workers can add jobs to the queue
A re entrant channel always deadlock. This is easy to demonstrate using this code
package main
import (
"fmt"
)
func main() {
out := make(chan string)
c := make(chan string)
go func() {
for v := range c {
c <- v + " 2"
out <- v
}
}()
go func() {
c <- "hello world!" // pass OK
c <- "hello world!" // no pass, the routine is blocking at pushing to itself
}()
for v := range out {
fmt.Println(v)
}
}
While the program
tries to push at c <- v + " 2"
it can not
read at for v := range c {,
push at c <- "hello world!"
read at for v := range out {
thus, it deadlocks.
If you want to pass that situation you must overflow somewhere.
On the routines, or somewhere else.
package main
import (
"fmt"
"time"
)
func main() {
out := make(chan string)
c := make(chan string)
go func() {
for v := range c {
go func() { // use routines on the stack as a bank for the required overflow.
<-time.After(time.Second) // simulate slowliness.
c <- v + " 2"
}()
out <- v
}
}()
go func() {
for {
c <- "hello world!"
}
}()
exit := time.After(time.Second * 60)
for v := range out {
fmt.Println(v)
select {
case <-exit:
return
default:
}
}
}
But now you have a new problem.
You created a memory bomb by overflowing without limits on the stack. Technically, this is dependent on the time needed to finish a job, the memory available, the speed of your cpus and the shape of the data (they might or might not generate a new job). So there is a upper limit, but it is so hard to make sense of it, that in practice this ends up to be a bomb.
Consider not overflowing without limits on the stack.
If you dont have any arbitrary limit on hand, you can use a semaphore to cap the overflow.
https://play.golang.org/p/5JWPQiqOYKz
my bombs did not explode with a work timeout of 1s and 2s, but they took a large chunk of memory.
In another round with a modified code, it exploded
Of course, because you use if rand.Intn(5) == 0 { in your code, the problem is largely mitigated. Though, when you meet such pattern, think twice to the code.
Also, for some reason I don't understand the technique that I currently use (code below) becomes really inefficient when the number of Workers becomes quite big (such as 300000 workers): for the same number of jobs, the code will be > 10x slower for NB_WORKERS = 300000 vs NB_WORKERS = 3000.
In the big picture, you have a limited amount of cpu cycles. All those allocations and instructions, to spawn and synchronize, has to be executed too. Concurrency is not free.
Now, I would like the program to halt when all the workers are idle.
I came up with that but i find it very difficult to reason about and convince myself it wont end up in a write on closed channel panic.
The idea is to use a sync.WaitGroup to count in flight items and rely on it to properly close the input channel and finish the job.
package main
import (
"log"
"math/rand"
"sync"
"time"
)
func main() {
rand.Seed(time.Now().UnixNano())
var wg sync.WaitGroup
var wgr sync.WaitGroup
out := make(chan string)
c := make(chan string)
go func() {
for v := range c {
if rand.Intn(5) == 0 {
wgr.Add(1)
go func(v string) {
<-time.After(time.Microsecond)
c <- v + " 2"
}(v)
}
wgr.Done()
out <- v
}
close(out)
}()
var sent int
wg.Add(1)
go func() {
for i := 0; i < 300; i++ {
wgr.Add(1)
c <- "hello world!"
sent++
}
wg.Done()
}()
go func() {
wg.Wait()
wgr.Wait()
close(c)
}()
var rcv int
for v := range out {
// fmt.Println(v)
_ = v
rcv++
}
log.Println("sent", sent)
log.Println("rcv", rcv)
}
I ran it with while go run -race .; do :; done it worked fine for a reasonable amount of iterations.

Why is my code running slower after trying Goroutines?

I decided to try figuring out Goroutines and channels. I made a function that takes a list and adds 10 to every element. I then made another function that attempts to incorporate channels and goroutines. When I timed the code it ran much slower. I tried doing some research but was unable to figure anything out.
Here is my code with channels:
package main
import ("fmt"
"time")
func addTen(channel chan int) {
channel <- 10 + <-channel
}
func listPlusTen(list []int) []int {
channel := make(chan int)
for i:= 0; i < len(list); i++ {
go addTen(channel)
channel <- list[i]
list[i] = <-channel
}
return list
}
func main(){
var size int
list := make([]int, 0)
fmt.Print("Enter the list size: ")
fmt.Scanf("%d", &size)
for i:=0; i <= size; i++ {
list = append(list, i)
}
start := time.Now()
list = listPlusTen(list)
end := time.Now()
fmt.Println(end.Sub(start))
}
You are adding a lot of synchronization overhead to the baseline algorithm. You have len(list) goroutines, all waiting to read from a common channel. When you write to the channel, the scheduler picks one of those goroutines, and that goroutines adds 10, and writes to the channel, which enables the main goroutine again. It is hard to speculate without really measuring it, but if you move the goroutine creation outside the for-loop, then you will have only one goroutine, reducing the scheduler overhead. However, in any comparison to the baseline algorithm this will be slower, because each operation involves two synchronizations and two context switches, which take more that the algorithm itself.
I had a similar experience with go routines where I was making 20 http requests using the same function.
calling the function in a simple loop was much faster than using wait groups

Why single goroutine run slower than multiple goroutines when runtime.GOMAXPROCS(1)?

I just want to try how fast goroutine switch context, so I wrote the code below. To my surprise, multiple gorountines run faster than the edition that does not need to switch context (I set the program to run in only one CPU core).
package main
import (
"fmt"
"runtime"
"sync"
"time"
)
func main() {
runtime.GOMAXPROCS(1)
t_start := time.Now()
sum := 0
for j := 0; j < 10; j++ {
sum = 0
for i := 0; i < 100000000; i++ {
sum += i
}
}
fmt.Println("single goroutine takes ", time.Since(t_start))
var wg sync.WaitGroup
t_start = time.Now()
for j := 0; j < 10; j++ {
wg.Add(1)
go func() {
sum := 0
for i := 0; i < 100000000; i++ {
sum += i
}
defer wg.Done()
}()
}
wg.Wait()
fmt.Println("multiple goroutines take ", time.Since(t_start))
}
A single goroutine takes 251.690788ms, multiple goroutines take 254.067156ms
The single goroutine should run faster, because single goroutine does not need to change context. However, the answer is opposite, single mode always slower. What happened in this program?
Your concurrent version several things the non-concurrent version does, which will make it slower:
It's creating a new sum value, which must be allocated. Your non-concurrent version just resets the existing value. This probably has a minimal impact, but is a difference.
You're using a waitgroup. Obviously this adds overhead.
The defer in defer wg.Done() also adds overhead, roughly equivalent to an extra function call.
There may well be other subtle differences, too.
So in short: Your benchmarks are just invalid, because you're comparing apples with oranges.
More important: This isn't a useful benchmark in the first place, because it's a completely artificial workload.

How to prevent deadlocks without using sync.WaitGroup?

concurrent.go:
package main
import (
"fmt"
"sync"
)
// JOBS represents the number of jobs workers do
const JOBS = 2
// WORKERS represents the number of workers
const WORKERS = 5
func work(in <-chan int, out chan<- int, wg *sync.WaitGroup) {
for n := range in {
out <- n * n
}
wg.Done()
}
var wg sync.WaitGroup
func main() {
in := make(chan int, JOBS)
out := make(chan int, JOBS)
for w := 1; w <= WORKERS; w++ {
wg.Add(1)
go work(in, out, &wg)
}
for j := 1; j <= JOBS; j++ {
in <- j
}
close(in)
wg.Wait()
close(out)
for r := range out {
fmt.Println("result:", r)
}
// This is a solution but I want to do it with `range out`
// and also without WaitGroups
// for r := 1; r <= JOBS; r++ {
// fmt.Println("result:", <-out)
// }
}
Example is here on goplay.
Goroutines run concurrently and independently. Spec: Go statements:
A "go" statement starts the execution of a function call as an independent concurrent thread of control, or goroutine, within the same address space.
If you want to use for range to receive values from the out channel, that means the out channel can only be closed once all goroutines are done sending on it.
Since goroutines run concurrently and independently, without synchronization you can't have this.
Using WaitGroup is one mean, one way to do it (to ensure we wait all goroutines to do their job before closing out).
Your commented code is another way of that: the commented code receives exactly as many values from the channel as many the goroutines ought to send on it, which is only possible if all goroutines do send their values. The synchronization are the send statements and receive operations.
Notes:
Usually receiving results from the channel is done asynchronously, in a dedicated goroutine, or using even multiple goroutines. Doing so you are not required to use channels with buffers capable of buffering all the results. You will still need synchronization to wait for all workers to finish their job, you can't avoid this due to the concurrent and independent nature of gorutine scheduling and execution.

Generating random numbers concurrently in Go

I'm new to Go and to concurrent/parallel programming in general. In order to try out (and hopefully see the performance benefits of) goroutines, I've put together a small test program that simply generates 100 million random ints - first in a single goroutine, and then in as many goroutines as reported by runtime.NumCPU().
However, I consistently get worse performance using more goroutines than using a single one. I assume I'm missing something vital in either my programs design or the way in which I use goroutines/channels/other Go features. Any feedback is much appreciated.
I attach the code below.
package main
import "fmt"
import "time"
import "math/rand"
import "runtime"
func main() {
// Figure out how many CPUs are available and tell Go to use all of them
numThreads := runtime.NumCPU()
runtime.GOMAXPROCS(numThreads)
// Number of random ints to generate
var numIntsToGenerate = 100000000
// Number of ints to be generated by each spawned goroutine thread
var numIntsPerThread = numIntsToGenerate / numThreads
// Channel for communicating from goroutines back to main function
ch := make(chan int, numIntsToGenerate)
// Slices to keep resulting ints
singleThreadIntSlice := make([]int, numIntsToGenerate, numIntsToGenerate)
multiThreadIntSlice := make([]int, numIntsToGenerate, numIntsToGenerate)
fmt.Printf("Initiating single-threaded random number generation.\n")
startSingleRun := time.Now()
// Generate all of the ints from a single goroutine, retrieve the expected
// number of ints from the channel and put in target slice
go makeRandomNumbers(numIntsToGenerate, ch)
for i := 0; i < numIntsToGenerate; i++ {
singleThreadIntSlice = append(singleThreadIntSlice,(<-ch))
}
elapsedSingleRun := time.Since(startSingleRun)
fmt.Printf("Single-threaded run took %s\n", elapsedSingleRun)
fmt.Printf("Initiating multi-threaded random number generation.\n")
startMultiRun := time.Now()
// Run the designated number of goroutines, each of which generates its
// expected share of the total random ints, retrieve the expected number
// of ints from the channel and put in target slice
for i := 0; i < numThreads; i++ {
go makeRandomNumbers(numIntsPerThread, ch)
}
for i := 0; i < numIntsToGenerate; i++ {
multiThreadIntSlice = append(multiThreadIntSlice,(<-ch))
}
elapsedMultiRun := time.Since(startMultiRun)
fmt.Printf("Multi-threaded run took %s\n", elapsedMultiRun)
}
func makeRandomNumbers(numInts int, ch chan int) {
source := rand.NewSource(time.Now().UnixNano())
generator := rand.New(source)
for i := 0; i < numInts; i++ {
ch <- generator.Intn(numInts*100)
}
}
First let's correct and optimize some things in your code:
Since Go 1.5, GOMAXPROCS defaults to the number of CPU cores available, so no need to set that (although it does no harm).
Numbers to generate:
var numIntsToGenerate = 100000000
var numIntsPerThread = numIntsToGenerate / numThreads
If numThreads is like 3, in case of multi goroutines, you'll have less numbers generated (due to integer division), so let's correct it:
numIntsToGenerate = numIntsPerThread * numThreads
No need a buffer for 100 million values, reduce that to a sensible value (e.g. 1000):
ch := make(chan int, 1000)
If you want to use append(), the slices you create should have 0 length (and proper capacity):
singleThreadIntSlice := make([]int, 0, numIntsToGenerate)
multiThreadIntSlice := make([]int, 0, numIntsToGenerate)
But in your case that's unnecessary, as only 1 goroutine is collecting the results, you can simply use indexing, and create slices like this:
singleThreadIntSlice := make([]int, numIntsToGenerate)
multiThreadIntSlice := make([]int, numIntsToGenerate)
And when collecting results:
for i := 0; i < numIntsToGenerate; i++ {
singleThreadIntSlice[i] = <-ch
}
// ...
for i := 0; i < numIntsToGenerate; i++ {
multiThreadIntSlice[i] = <-ch
}
Ok. Code is now better. Attempting to run it, you will still experience that the multi-goroutine version runs slower. Why is that?
It's because controlling, synchronizing and collecting results from multiple goroutines does have overhead. If the task they perform is little, the communication overhead will be greater and overall you lose performance.
Your case is such a case. Generating a single random number once you set up your rand.Rand() is pretty fast.
Let's modify your "task" to be big enough so that we can see the benefit of multiple goroutines:
// 1 million is enough now:
var numIntsToGenerate = 1000 * 1000
func makeRandomNumbers(numInts int, ch chan int) {
source := rand.NewSource(time.Now().UnixNano())
generator := rand.New(source)
for i := 0; i < numInts; i++ {
// Kill time, do some processing:
for j := 0; j < 1000; j++ {
generator.Intn(numInts * 100)
}
// and now return a single random number
ch <- generator.Intn(numInts * 100)
}
}
In this case to get a random number, we generate 1000 random numbers and just throw them away (to make some calculation / kill time) before we generate the one we return. We do this so that the calculation time of the worker goroutines outweights the communication overhead of multiple goroutines.
Running the app now, my results on a 4-core machine:
Initiating single-threaded random number generation.
Single-threaded run took 2.440604504s
Initiating multi-threaded random number generation.
Multi-threaded run took 987.946758ms
The multi-goroutine version runs 2.5 times faster. This means if your goroutines would deliver random numbers in 1000-blocks, you would see 2.5 times faster execution (compared to the single goroutine generation).
One last note:
Your single-goroutine version also uses multiple goroutines: 1 to generate numbers and 1 to collect the results. Most likely the collector does not fully utilize a CPU core and mostly just waits for the results, but still: 2 CPU cores are used. Let's estimate that "1.5" CPU cores are utilized. While the multi-goroutine version utilizes 4 CPU cores. Just as a rough estimation: 4 / 1.5 = 2.66, very close to our performance gain.
If you really want to generate the random numbers in parallel then each task should be about generate the numbers and then return them in one go rather than the task being generate one number at a time and feed them to a channel as that reading and writing to channel will slow things down in multi go routine case. Below is the modified code where then task generate the required numbers in one go and this performs better in multi go routines case, also I have used slice of slices to collect the result from multi go routines.
package main
import "fmt"
import "time"
import "math/rand"
import "runtime"
func main() {
// Figure out how many CPUs are available and tell Go to use all of them
numThreads := runtime.NumCPU()
runtime.GOMAXPROCS(numThreads)
// Number of random ints to generate
var numIntsToGenerate = 100000000
// Number of ints to be generated by each spawned goroutine thread
var numIntsPerThread = numIntsToGenerate / numThreads
// Channel for communicating from goroutines back to main function
ch := make(chan []int)
fmt.Printf("Initiating single-threaded random number generation.\n")
startSingleRun := time.Now()
// Generate all of the ints from a single goroutine, retrieve the expected
// number of ints from the channel and put in target slice
go makeRandomNumbers(numIntsToGenerate, ch)
singleThreadIntSlice := <-ch
elapsedSingleRun := time.Since(startSingleRun)
fmt.Printf("Single-threaded run took %s\n", elapsedSingleRun)
fmt.Printf("Initiating multi-threaded random number generation.\n")
multiThreadIntSlice := make([][]int, numThreads)
startMultiRun := time.Now()
// Run the designated number of goroutines, each of which generates its
// expected share of the total random ints, retrieve the expected number
// of ints from the channel and put in target slice
for i := 0; i < numThreads; i++ {
go makeRandomNumbers(numIntsPerThread, ch)
}
for i := 0; i < numThreads; i++ {
multiThreadIntSlice[i] = <-ch
}
elapsedMultiRun := time.Since(startMultiRun)
fmt.Printf("Multi-threaded run took %s\n", elapsedMultiRun)
//To avoid not used warning
fmt.Print(len(singleThreadIntSlice))
}
func makeRandomNumbers(numInts int, ch chan []int) {
source := rand.NewSource(time.Now().UnixNano())
generator := rand.New(source)
result := make([]int, numInts)
for i := 0; i < numInts; i++ {
result[i] = generator.Intn(numInts * 100)
}
ch <- result
}

Resources