My program has a long-running task. The list jdIdList is very large (up to 1,000,000 items), so the code below doesn't work. Is there a way to improve it with better use of goroutines?
It seems I have too many goroutines running at once, which makes the program fail.
What is a reasonable number of goroutines to have running?
var wg sync.WaitGroup
wg.Add(len(jdIdList))
c := make(chan string)
// just think of jdIdList as [0...1000000]
for _, jdId := range jdIdList {
    go func(jdId string) {
        defer wg.Done()
        for _, itemId := range itemIdList {
            // the following code does a time-consuming computation
            // (you can just replace it with time.Sleep(time.Second * 1))
            cvVec, ok := cvVecMap[itemId]
            if !ok {
                continue
            }
            jdVec, ok := jdVecMap[jdId]
            if !ok {
                continue
            }
            // long time compute
            _ = 0.3*computeDist(jdVec.JdPosVec, cvVec.CvPosVec) + 0.7*computeDist(jdVec.JdDescVec, cvVec.CvDescVec)
        }
        c <- fmt.Sprintf("done %s", jdId)
    }(jdId)
}
go func() {
    for resp := range c {
        fmt.Println(resp)
    }
}()
wg.Wait()

It looks like you're running too many things concurrently, making your computer run out of memory.
Here's a version of your code that uses a limited number of worker goroutines instead of a million goroutines as in your example. Since only a few goroutines run at once, each has much more memory available before the system starts to swap. Make sure that the memory each computation requires, multiplied by the number of concurrent goroutines, is less than the memory in your system. For example, if the code inside the for jdId := range work loop requires less than 1 GB of memory, and you have 4 cores and at least 4 GB of RAM, setting clvl to 4 should work fine.
I also changed how the WaitGroup is used: it now counts the workers rather than the individual jobs, and everything else is synchronized with channels. A for range loop over a channel reads from that channel until it is closed. This is how we tell the worker goroutines that we are done.
runtime.GOMAXPROCS(runtime.NumCPU()) // not needed on go 1.5 or later
c := make(chan string)
work := make(chan int, 1) // increasing 1 to a higher number will probably increase performance
clvl := 4 // runtime.NumCPU() // simulating having 4 cores, use NumCPU otherwise
var wg sync.WaitGroup
wg.Add(clvl)
for i := 0; i < clvl; i++ {
    go func(i int) {
        defer wg.Done()
        for jdId := range work {
            time.Sleep(time.Millisecond * 100)
            c <- fmt.Sprintf("done %d", jdId)
        }
    }(i)
}
// give workers something to do
go func() {
    for i := 0; i < 10; i++ {
        work <- i
    }
    close(work)
}()
// close output channel when all workers are done
go func() {
    wg.Wait()
    close(c)
}()
count := 0
for resp := range c {
    fmt.Println(resp, count)
    count += 1
}
This generated the following output on the Go Playground, while simulating four CPU cores:
done 1 0
done 0 1
done 3 2
done 2 3
done 5 4
done 4 5
done 7 6
done 6 7
done 9 8
done 8 9
Notice that the ordering is not guaranteed. The jdId variable holds the value you want. You should always test your concurrent programs with the Go race detector (go run -race or go test -race).
Also note that if you are using Go 1.4 or earlier and haven't set the GOMAXPROCS environment variable to the number of cores, you should do that, or add runtime.GOMAXPROCS(runtime.NumCPU()) to the beginning of your program.
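The sizing rule from this answer (memory per computation times the number of concurrent goroutines must stay below the available memory) can be turned into a small helper. This is only a sketch: pickWorkers is a hypothetical name, and the memory budget and per-task memory are numbers you have to measure or estimate yourself.

```go
package main

import (
	"fmt"
	"runtime"
)

// pickWorkers returns how many workers fit into the memory budget,
// capped at the number of cores and never less than 1.
// memBudget and perTask are in bytes and are assumptions supplied by the caller.
func pickWorkers(cores int, memBudget, perTask uint64) int {
	if cores < 1 {
		cores = 1
	}
	n := cores
	if perTask > 0 {
		if fit := memBudget / perTask; fit < uint64(n) {
			n = int(fit)
		}
	}
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	const gib = 1 << 30
	fmt.Println(pickWorkers(4, 4*gib, 1*gib)) // 4 cores, 4 GiB, 1 GiB per task: 4
	fmt.Println(pickWorkers(8, 4*gib, 1*gib)) // memory is the limit here: 4
	fmt.Println(pickWorkers(runtime.NumCPU(), 4*gib, 1*gib))
}
```

The result would be used as clvl in the worker pool above.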


How to design goroutines program to handle api limit error

I just started learning about the power of goroutines.
I have ~100 accounts and ~10 regions, and I loop through them to create ~1000 goroutines to speed up reading. It ran so fast that it hit the API rate limit of 20 calls per second.
How do I ensure that all the goroutines together stay at the maximum call rate (20/s)? I'm unsure which Go concurrency mechanisms work best together to handle this error.
func readInstance(acc string, region string, wg *sync.WaitGroup) {
    defer wg.Done()
    response, err := client.DescribeInstances(acc, region)
    if err != nil {
        log.Println(err) // API limit exceeding 20
        return
    }
    _ = response // process the response here
}
func main() {
    accounts := []string{"g", "h", "i", ...}
    regions := []string{"g", "h", "i", ...}
    var wg sync.WaitGroup
    for _, region := range regions {
        for i := 0; i < len(accounts); i++ {
            wg.Add(1)
            go readInstance(accounts[i], region, &wg)
        }
    }
    wg.Wait()
}
If you have a fixed upper limit on how many requests you can make in a given amount of real time, you can use time.NewTicker() to space them out. With a limit of 20 requests per second, that means one tick every 50 ms:
c := time.NewTicker(50 * time.Millisecond)
defer c.Stop()
Now, when you want to make a server request, just insert
<- c.C
prior to the actual request.
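Here is a minimal sketch of how the ticker can be shared by many goroutines. The function name callAll and the job strings are hypothetical, and the call parameter only stands in for the real API request. Every goroutine receives from the same ticker channel before calling, so calls start at most once per interval regardless of how many goroutines are waiting.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// callAll runs call(job) for every job, each in its own goroutine, while
// starting at most perSecond calls per second. It returns the total time taken.
func callAll(jobs []string, perSecond int, call func(string)) time.Duration {
	if perSecond < 1 {
		perSecond = 1
	}
	start := time.Now()
	ticker := time.NewTicker(time.Second / time.Duration(perSecond))
	defer ticker.Stop()

	var wg sync.WaitGroup
	wg.Add(len(jobs))
	for _, job := range jobs {
		go func(job string) {
			defer wg.Done()
			<-ticker.C // each tick releases exactly one waiting goroutine
			call(job)
		}(job)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	// Hypothetical account/region pairs.
	jobs := []string{"acc1/us-east", "acc1/eu-west", "acc2/us-east", "acc2/eu-west"}
	elapsed := callAll(jobs, 20, func(job string) {
		fmt.Println("calling API for", job)
	})
	fmt.Println("took about", elapsed.Round(10*time.Millisecond))
}
```

Note that the ticker only spaces out when calls start. If the API also counts requests that are still in flight, you would additionally need to bound the number of concurrent calls.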
I think you can try a rate limiter package like the one below.
According to its documentation, it is concurrency safe.
import (
    "fmt"
    "time"
    "go.uber.org/ratelimit" // assumed import path: ratelimit.New and Take below match this package
)
func main() {
    rl := ratelimit.New(100) // per second
    prev := time.Now()
    for i := 0; i < 10; i++ {
        now := rl.Take()
        fmt.Println(i, now.Sub(prev))
        prev = now
    }
    // Output:
    // 0 0
    // 1 10ms
    // 2 10ms
    // 3 10ms
    // 4 10ms
    // 5 10ms
    // 6 10ms
    // 7 10ms
    // 8 10ms
    // 9 10ms
}

What's the difference between the two Go snippets below, and why do they use such different amounts of memory?

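code A:
MaxCnt := 1000000
wg := sync.WaitGroup{}
wg.Add(MaxCnt)
for i := 0; i < MaxCnt; i++ {
    go func() {
        time.Sleep(time.Millisecond) // waits 1 ms, as the answer below describes
        wg.Done()
    }()
}
wg.Wait()
code B:
MaxCnt := 1000000
wg := sync.WaitGroup{}
wg.Add(MaxCnt)
for i := 0; i < MaxCnt; i++ {
    go func() {
        wg.Done() // returns immediately
    }()
}
wg.Wait()
Code A uses about 460 MB of memory and code B uses a few KB, yet both run go func 10k times. I want to know why.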
They don't do it 10K times, they do it 1M times. In the first one, each goroutine waits for 1 ms, so a large number of goroutines pile up while new ones are still being created, each with a 2 KB stack. If that took 460 MB, then about 230K goroutines were alive concurrently at the peak. The second one creates the same number of goroutines, but they terminate quickly, which keeps the number of concurrently active goroutines much lower.
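The estimate in this answer is plain division. As a rough sketch (liveGoroutines is a hypothetical helper, and it assumes each goroutine holds only the 2 KB stack mentioned above and nothing else):

```go
package main

import "fmt"

// liveGoroutines estimates how many goroutines are alive at once from the
// memory in use, assuming each one holds only its initial stack.
func liveGoroutines(memBytes, stackBytes int) int {
	return memBytes / stackBytes
}

func main() {
	const mib, kib = 1 << 20, 1 << 10
	// 460 MiB / 2 KiB = 460 * 512 = 235520, i.e. roughly 230K goroutines.
	fmt.Println(liveGoroutines(460*mib, 2*kib))
}
```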

Channels and Parallelism confusion

I'm teaching myself Go, and I'm a bit confused about parallelism and how it is implemented in Go.
Given the following example:
package main
import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)
const (
    workers    = 1
    rand_count = 5000000
)
func start_rand(ch chan int) {
    defer close(ch)
    var wg sync.WaitGroup
    wg.Add(workers)
    rand_routine := func(counter int) {
        defer wg.Done()
        for i := 0; i < counter; i++ {
            seed := time.Now().UnixNano()
            rand.Seed(seed)
            ch <- rand.Intn(5000)
        }
    }
    for i := 0; i < workers; i++ {
        go rand_routine(rand_count / workers)
    }
    wg.Wait()
}
func main() {
    start_time := time.Now()
    mychan := make(chan int, workers)
    go start_rand(mychan)
    var wg sync.WaitGroup
    wg.Add(workers)
    work_handler := func() {
        defer wg.Done()
        for {
            v, isOpen := <-mychan
            if !isOpen {
                break
            }
            fmt.Println(v)
        }
    }
    for i := 0; i < workers; i++ {
        go work_handler()
    }
    wg.Wait()
    elapsed_time := time.Since(start_time)
    fmt.Println(elapsed_time)
}
This piece of code takes about one minute to run on my Macbook. I assumed that increasing the "workers" constants, would launch additional go routines, and since my laptop has multiple cores, would shorten the execution time.
This is not the case however. Increasing the workers does not reduce the execution time.
I was thinking that setting workers to 1 would create 1 goroutine to generate the random numbers, and setting it to 4 would create 4 goroutines. Given the multicore nature of my laptop, I was expecting that 4 workers would run on different cores and therefore increase the performance.
However, I see increased load on all my cores, even when workers is set to 1. What am I missing here?
Your code has some issues which make it inherently slow:
You are seeding inside the loop. This only needs to be done once.
You are using the same source for random numbers. This source is thread safe, but that takes away any performance gains for concurrent workers. You could create a source for each worker with rand.New.
You are printing a lot. Printing is thread safe, too, so that takes away any speed gains for concurrent workers.
As Zak already pointed out: the concurrent work inside the goroutines is very cheap and the communication is expensive.
You could rewrite your program like this. Then you will see some speed gains when you change the number of workers:
package main
import (
    "fmt"
    "math/rand"
    "time"
)
const (
    workers   = 1
    randCount = 5000000
)
var results = [randCount]int{}
func randRoutine(start, counter int, c chan bool) {
    r := rand.New(rand.NewSource(time.Now().UnixNano()))
    for i := 0; i < counter; i++ {
        results[start+i] = r.Intn(5000)
    }
    c <- true
}
func main() {
    startTime := time.Now()
    c := make(chan bool)
    start := 0
    for w := 0; w < workers; w++ {
        go randRoutine(start, randCount/workers, c)
        start += randCount / workers
    }
    // wait until every worker has reported that it is done
    for i := 0; i < workers; i++ {
        <-c
    }
    elapsedTime := time.Since(startTime)
    for _, i := range results {
        fmt.Println(i)
    }
    fmt.Println("Time calculating", elapsedTime)
    elapsedTime = time.Since(startTime)
    fmt.Println("Total time", elapsedTime)
}
This program does a lot of work in each goroutine and communicates minimally. Also, a different random source is used for each goroutine.
Your code does not have just a single routine, even though you set the workers to 1.
There is 1 goroutine from the call go start_rand(...)
That goroutine creates N (worker) routines with go rand_routine(...) and waits for them to finish.
Then you also start N (worker) go routines with go work_handler()
Then you also have 1 goroutine that was started by main() func call.
so: 1 + 2N + 1 routines running for any given N where N == workers.
Plus, on top of that, the work that you are doing in the goroutines is pretty cheap (fast to execute). You are just generating random numbers.
If you look at the blocking and scheduler latency profiles of the program:
You can see from both of the images above that most of the time is spent in the concurrency constructs. This suggests there is a lot of contention in your program. While goroutines are cheap, there is still some blocking and synchronisation that needs to be done when sending a value over a channel. This can take a large proportion of the time of the program when the work being done by the producer is very fast / cheap.
To answer your original question, you see load on many cores because you have more than a single goroutine running.

Recursive concurrency with golang

I'd like to distribute some load across some goroutines. If the number of tasks is known beforehand then it is easy to organize. For example, I could do fan out with a wait group.
nTasks := 100
nGoroutines := 10
// it is important that this channel is not buffered
ch := make(chan *Task)
done := make(chan bool)
var w sync.WaitGroup
// Feed the channel until done
go func() {
    for i := 0; i < nTasks; i++ {
        task := getTaskI(i)
        ch <- task
    }
    // as ch is not buffered once everything is read we know we have delivered all of them
    for i := 0; i < nGoroutines; i++ {
        done <- false
    }
}()
for i := 0; i < nGoroutines; i++ {
    w.Add(1)
    go func() {
        defer w.Done()
        for {
            select {
            case task := <-ch:
                doSomethingWithTask(task)
            case <-done:
                return
            }
        }
    }()
}
w.Wait()
// All tasks done, all goroutines closed
However, in my case each task returns more tasks to be done. Say for example a crawler where we receive all the links from the crawled web. My initial hunch was to have a main loop where I track the number of tasks done and tasks pending. When I'm done I send a finish signal to all goroutines:
nGoroutines := 10
ch := make(chan *Task, nGoroutines)
feedBackChannel := make(chan *Task, nGoroutines)
done := make(chan bool)
for i := 0; i < nGoroutines; i++ {
    go func() {
        for {
            select {
            case task := <-ch:
                task.NextTasks = doSomethingWithTask(task)
                feedBackChannel <- task
            case <-done:
                return
            }
        }
    }()
}
// seed first task
ch <- firstTask
nTasksRemaining := 1
for nTasksRemaining > 0 {
    task := <-feedBackChannel
    nTasksRemaining -= 1
    for _, t := range task.NextTasks {
        ch <- t
        nTasksRemaining += 1
    }
}
for i := 0; i < nGoroutines; i++ {
    done <- false
}
However, this produces a deadlock. For example if NextTasks is bigger than the number of goroutines then the main loop will stall when the first tasks finish. But the first tasks can't finish because the feedBack is blocked since the mainLoop is waiting to write.
One "easy" way out of this is to post to the channel asynchronously:
Instead of doing feedBackChannel <- task, do go func () {feedBackChannel <- task}(). Now, this feels like an awful hack, especially since there might be hundreds of thousands of tasks.
What would be a nice way to avoid this deadlock? I've searched for concurrency patterns, but mostly are simpler things like fanning out or pipelines where the later stage does not affect the earlier steps.
If I understand your problem correctly, your solution is pretty complex. Here are some points; I hope they help.
As people mentioned in the comments, launching a goroutine is cheap (both its memory use and switching between goroutines are much cheaper than with OS-level threads), and you could have a hundred thousand of them. Let's assume that for some reason you want to have worker goroutines.
Instead of a done channel you could just close the ch channel, and instead of select you could just range over the channel to get tasks.
I don't see the point of separating ch and feedBackChannel. Just push every task you have into ch and increase its capacity.
As mentioned, you may get a deadlock when you try to enqueue a new task. My solution is pretty naive: just increase the capacity until you are sure that it won't overflow (you could also log warnings if cap(ch) - len(ch) < threshold). If you create a channel (of pointers) with a capacity of 1 million, it will take about 8 * 1e6 bytes ~= 8 MB of RAM on a 64-bit system.
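The points above could be put together roughly like this. It is only a sketch under stated assumptions: Task, expand and run are hypothetical stand-ins (expand plays the role of doSomethingWithTask), and the queue capacity is assumed to be large enough to hold every pending task, which is exactly the naive part of the suggestion. A sync.WaitGroup counts tasks that are queued or in progress, so the channel can be closed once nothing is left.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Task is a hypothetical unit of work; Depth only drives the demo below.
type Task struct {
	Depth int
}

// expand stands in for doSomethingWithTask and returns follow-up tasks.
// Here every task shallower than depth 3 yields three children.
func expand(t *Task) []*Task {
	if t.Depth >= 3 {
		return nil
	}
	next := make([]*Task, 3)
	for i := range next {
		next[i] = &Task{Depth: t.Depth + 1}
	}
	return next
}

// run processes first and everything it spawns with nWorkers goroutines and
// returns the number of tasks handled. queueCap must be large enough to hold
// every pending task, otherwise workers can block on send and deadlock.
func run(first *Task, nWorkers, queueCap int) int64 {
	ch := make(chan *Task, queueCap)
	var pending sync.WaitGroup // tasks queued or in progress
	var workers sync.WaitGroup
	var handled int64

	workers.Add(nWorkers)
	for i := 0; i < nWorkers; i++ {
		go func() {
			defer workers.Done()
			for t := range ch { // ends when ch is closed
				next := expand(t)
				pending.Add(len(next)) // count children before finishing the parent
				for _, n := range next {
					ch <- n
				}
				atomic.AddInt64(&handled, 1)
				pending.Done()
			}
		}()
	}

	pending.Add(1)
	ch <- first
	pending.Wait() // nothing is queued or running any more
	close(ch)      // lets the workers' range loops exit
	workers.Wait()
	return handled
}

func main() {
	fmt.Println(run(&Task{}, 10, 1024)) // 1 + 3 + 9 + 27 = 40 tasks
}
```

Because a worker registers its children before marking itself done, the pending count cannot drop to zero while work is still outstanding, so the channel is never closed too early.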

Generating random numbers concurrently in Go

I'm new to Go and to concurrent/parallel programming in general. In order to try out (and hopefully see the performance benefits of) goroutines, I've put together a small test program that simply generates 100 million random ints - first in a single goroutine, and then in as many goroutines as reported by runtime.NumCPU().
However, I consistently get worse performance using more goroutines than using a single one. I assume I'm missing something vital in either my program's design or the way in which I use goroutines/channels/other Go features. Any feedback is much appreciated.
I attach the code below.
package main
import "fmt"
import "time"
import "math/rand"
import "runtime"
func main() {
    // Figure out how many CPUs are available and tell Go to use all of them
    numThreads := runtime.NumCPU()
    runtime.GOMAXPROCS(numThreads)
    // Number of random ints to generate
    var numIntsToGenerate = 100000000
    // Number of ints to be generated by each spawned goroutine thread
    var numIntsPerThread = numIntsToGenerate / numThreads
    // Channel for communicating from goroutines back to main function
    ch := make(chan int, numIntsToGenerate)
    // Slices to keep resulting ints
    singleThreadIntSlice := make([]int, numIntsToGenerate, numIntsToGenerate)
    multiThreadIntSlice := make([]int, numIntsToGenerate, numIntsToGenerate)
    fmt.Printf("Initiating single-threaded random number generation.\n")
    startSingleRun := time.Now()
    // Generate all of the ints from a single goroutine, retrieve the expected
    // number of ints from the channel and put in target slice
    go makeRandomNumbers(numIntsToGenerate, ch)
    for i := 0; i < numIntsToGenerate; i++ {
        singleThreadIntSlice = append(singleThreadIntSlice, (<-ch))
    }
    elapsedSingleRun := time.Since(startSingleRun)
    fmt.Printf("Single-threaded run took %s\n", elapsedSingleRun)
    fmt.Printf("Initiating multi-threaded random number generation.\n")
    startMultiRun := time.Now()
    // Run the designated number of goroutines, each of which generates its
    // expected share of the total random ints, retrieve the expected number
    // of ints from the channel and put in target slice
    for i := 0; i < numThreads; i++ {
        go makeRandomNumbers(numIntsPerThread, ch)
    }
    for i := 0; i < numIntsToGenerate; i++ {
        multiThreadIntSlice = append(multiThreadIntSlice, (<-ch))
    }
    elapsedMultiRun := time.Since(startMultiRun)
    fmt.Printf("Multi-threaded run took %s\n", elapsedMultiRun)
}
func makeRandomNumbers(numInts int, ch chan int) {
    source := rand.NewSource(time.Now().UnixNano())
    generator := rand.New(source)
    for i := 0; i < numInts; i++ {
        ch <- generator.Intn(numInts * 100)
    }
}
First let's correct and optimize some things in your code:
Since Go 1.5, GOMAXPROCS defaults to the number of CPU cores available, so no need to set that (although it does no harm).
Numbers to generate:
var numIntsToGenerate = 100000000
var numIntsPerThread = numIntsToGenerate / numThreads
If numThreads is something like 3, the multi-goroutine run will generate fewer numbers than requested (due to integer division), so let's correct it:
numIntsToGenerate = numIntsPerThread * numThreads
There is no need for a buffer of 100 million values; reduce it to a sensible value (e.g. 1000):
ch := make(chan int, 1000)
If you want to use append(), the slices you create should have 0 length (and proper capacity):
singleThreadIntSlice := make([]int, 0, numIntsToGenerate)
multiThreadIntSlice := make([]int, 0, numIntsToGenerate)
But in your case that's unnecessary, as only 1 goroutine is collecting the results, you can simply use indexing, and create slices like this:
singleThreadIntSlice := make([]int, numIntsToGenerate)
multiThreadIntSlice := make([]int, numIntsToGenerate)
And when collecting results:
for i := 0; i < numIntsToGenerate; i++ {
    singleThreadIntSlice[i] = <-ch
}
// ...
for i := 0; i < numIntsToGenerate; i++ {
    multiThreadIntSlice[i] = <-ch
}
Ok. Code is now better. Attempting to run it, you will still experience that the multi-goroutine version runs slower. Why is that?
It's because controlling, synchronizing and collecting results from multiple goroutines does have overhead. If the task they perform is little, the communication overhead will be greater and overall you lose performance.
Your case is such a case. Generating a single random number once you have set up your rand.Rand is pretty fast.
Let's modify your "task" to be big enough so that we can see the benefit of multiple goroutines:
// 1 million is enough now:
var numIntsToGenerate = 1000 * 1000
func makeRandomNumbers(numInts int, ch chan int) {
    source := rand.NewSource(time.Now().UnixNano())
    generator := rand.New(source)
    for i := 0; i < numInts; i++ {
        // Kill time, do some processing:
        for j := 0; j < 1000; j++ {
            generator.Intn(numInts * 100)
        }
        // and now return a single random number
        ch <- generator.Intn(numInts * 100)
    }
}
In this case, to get a random number, we generate 1000 random numbers and just throw them away (to do some calculation / kill time) before we generate the one we return. We do this so that the calculation time of the worker goroutines outweighs the communication overhead of multiple goroutines.
Running the app now, my results on a 4-core machine:
Initiating single-threaded random number generation.
Single-threaded run took 2.440604504s
Initiating multi-threaded random number generation.
Multi-threaded run took 987.946758ms
The multi-goroutine version runs 2.5 times faster. This means if your goroutines would deliver random numbers in 1000-blocks, you would see 2.5 times faster execution (compared to the single goroutine generation).
One last note:
Your single-goroutine version also uses multiple goroutines: 1 to generate numbers and 1 to collect the results. Most likely the collector does not fully utilize a CPU core and mostly just waits for the results, but still: 2 CPU cores are used. Let's estimate that "1.5" CPU cores are utilized, while the multi-goroutine version utilizes 4 CPU cores. Just as a rough estimation: 4 / 1.5 = 2.67, close to our performance gain.
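To illustrate what delivering the numbers in blocks of 1000 could look like, here is a minimal sketch. The names produce and generate, the fixed seeds and the value range are assumptions made for the example, not part of the original program. Each goroutine fills a slice of up to blockSize numbers and sends the whole slice, so the channel is used once per block instead of once per number.

```go
package main

import (
	"fmt"
	"math/rand"
)

// produce generates n random ints and sends them in slices of at most
// blockSize elements, so the channel is used once per block, not per number.
func produce(n, blockSize int, seed int64, ch chan<- []int) {
	gen := rand.New(rand.NewSource(seed)) // one source per goroutine
	for n > 0 {
		size := blockSize
		if n < size {
			size = n
		}
		block := make([]int, size)
		for i := range block {
			block[i] = gen.Intn(1000)
		}
		ch <- block
		n -= size
	}
}

// generate splits total across workers goroutines and collects all blocks.
// workers and blockSize must be at least 1.
func generate(total, workers, blockSize int) []int {
	ch := make(chan []int, workers)
	per := total / workers
	for w := 0; w < workers; w++ {
		n := per
		if w == workers-1 {
			n = total - per*(workers-1) // last worker takes the remainder
		}
		go produce(n, blockSize, int64(w+1), ch)
	}
	out := make([]int, 0, total)
	for len(out) < total {
		out = append(out, <-ch...)
	}
	return out
}

func main() {
	nums := generate(1000000, 4, 1000)
	fmt.Println(len(nums)) // 1000000
}
```

Giving the last worker the remainder also avoids the integer-division shortfall mentioned earlier in this answer.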
If you really want to generate the random numbers in parallel, each task should generate all of its numbers and then return them in one go, rather than generating one number at a time and feeding each to a channel, because all that reading from and writing to the channel slows things down in the multi-goroutine case. Below is the modified code, where each task generates its required numbers in one go. This performs better with multiple goroutines. I have also used a slice of slices to collect the results from the goroutines.
package main
import "fmt"
import "time"
import "math/rand"
import "runtime"
func main() {
    // Figure out how many CPUs are available and tell Go to use all of them
    numThreads := runtime.NumCPU()
    runtime.GOMAXPROCS(numThreads)
    // Number of random ints to generate
    var numIntsToGenerate = 100000000
    // Number of ints to be generated by each spawned goroutine thread
    var numIntsPerThread = numIntsToGenerate / numThreads
    // Channel for communicating from goroutines back to main function
    ch := make(chan []int)
    fmt.Printf("Initiating single-threaded random number generation.\n")
    startSingleRun := time.Now()
    // Generate all of the ints from a single goroutine, retrieve the expected
    // number of ints from the channel and put in target slice
    go makeRandomNumbers(numIntsToGenerate, ch)
    singleThreadIntSlice := <-ch
    elapsedSingleRun := time.Since(startSingleRun)
    fmt.Printf("Single-threaded run took %s\n", elapsedSingleRun)
    fmt.Printf("Initiating multi-threaded random number generation.\n")
    multiThreadIntSlice := make([][]int, numThreads)
    startMultiRun := time.Now()
    // Run the designated number of goroutines, each of which generates its
    // expected share of the total random ints, retrieve the expected number
    // of ints from the channel and put in target slice
    for i := 0; i < numThreads; i++ {
        go makeRandomNumbers(numIntsPerThread, ch)
    }
    for i := 0; i < numThreads; i++ {
        multiThreadIntSlice[i] = <-ch
    }
    elapsedMultiRun := time.Since(startMultiRun)
    fmt.Printf("Multi-threaded run took %s\n", elapsedMultiRun)
    //To avoid not used warning
    _ = singleThreadIntSlice
    _ = multiThreadIntSlice
}
func makeRandomNumbers(numInts int, ch chan []int) {
    source := rand.NewSource(time.Now().UnixNano())
    generator := rand.New(source)
    result := make([]int, numInts)
    for i := 0; i < numInts; i++ {
        result[i] = generator.Intn(numInts * 100)
    }
    ch <- result
}
