Need help with concurrent programming in Go

For this task I need to find the smallest digit sum in a list of numbers, then print the number that has that smallest sum. This must be done with a Mutex and WaitGroups. I can't find where the mistake is or why the output differs between runs.
Logic: read n with Scanf and make a vector of length n. Then create a function that computes the digit sum of a number, and call it from a second function that the goroutines run in a for loop.
I ran this code a few times, and it sometimes gives a different answer for the same input.
Sometimes 12
Sometimes 11
package main
import (
var wg sync.WaitGroup
var mutex sync.Mutex
var vector []int
var i int
var n int
var firstsum int
var p int //Temp sum
var index_result int
func sumanajmanjih(broj int) int {
var br int
var suma int
br = int(math.Abs(float64(broj)))
suma = 0
for {
suma += br % 10
br = br / 10
if br <= 0 {
return suma
func glavna(rg int) {
var index int
firstsum = sumanajmanjih(vector[0])
for {
if i == n {
} else {
index = i
i += 1
fmt.Printf("Procesor %d radni indeks %d\n", rg, index)
p = sumanajmanjih(vector[index])
if p < firstsum {
firstsum = p
index_result = index
func main() {
fmt.Scanf("%d", &n)
vector = make([]int, n)
for i := 0; i < n; i++ {
fmt.Scanf("%d", &vector[i])
brojGR := runtime.NumCPU()
for rg := 0; rg < brojGR; rg++ {
go glavna(rg)

Not a full answer to your question, but a few suggestions to make the code more readable and stable:
Use English names - glavna and brojGR are hard to understand.
Add comments to the code explaining intent.
Try to avoid shared/global variables, especially in concurrent code. glavna(rg) is executed concurrently, and you assign the globals i and p inside that function; that is a race condition. Pass all data into and out of the function explicitly, as arguments or return values.
A mutex can easily deadlock the code, and that is complicated to debug. Simplify its usage. Calling defer mutex.Unlock() on the line right after Lock() is often good enough.
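To illustrate these suggestions, here is a minimal sketch of the task with no global state. The names digitSum and minDigitSum, the striding of indexes across workers, and the tie-break on the lowest index are all assumptions made for this example, not the asker's code. It also assumes a non-empty input slice.

```go
package main

import (
	"fmt"
	"sync"
)

// digitSum returns the sum of the decimal digits of n, ignoring the sign.
func digitSum(n int) int {
	if n < 0 {
		n = -n
	}
	sum := 0
	for n > 0 {
		sum += n % 10
		n /= 10
	}
	return sum
}

// minDigitSum returns the number in nums with the smallest digit sum
// (the earliest one on ties). nums must not be empty.
func minDigitSum(nums []int, workers int) int {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		bestIdx = 0
		bestSum = digitSum(nums[0])
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(first int) {
			defer wg.Done()
			// Each worker handles indexes first, first+workers, ...
			for i := first; i < len(nums); i += workers {
				s := digitSum(nums[i]) // computed outside the lock
				mu.Lock()
				if s < bestSum || (s == bestSum && i < bestIdx) {
					bestSum, bestIdx = s, i
				}
				mu.Unlock()
			}
		}(w)
	}
	wg.Wait() // all workers are done before the result is read
	return nums[bestIdx]
}

func main() {
	fmt.Println(minDigitSum([]int{49, 12, 11, 100, 20}, 4)) // prints 100
}
```

The shared state (bestSum, bestIdx) is touched only while the mutex is held, and the result is read only after wg.Wait() returns, so repeated runs give the same answer.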


Parallel execution of prime finding algorithm slows runtime

So I implemented the following prime finding algorithm in go.
1. primes = []
2. Assume all numbers are primes (vacuously true)
3. check = 2
4. If check is still assumed to be prime, append it to primes
5. Multiply check by each prime less than or equal to its minimum factor and eliminate the results from the assumed primes.
6. Increment check by 1 and repeat 4 thru 6 until check > limit.
Here is my serial implementation:
package main
type numWithMinFactor struct {
number int
minfactor int
func pow(base int, power int) int{
result := 1
for i:=0;i<power;i++{
return result
func process(check numWithMinFactor,primes []int,top int,minFactors []numWithMinFactor){
var n int
for i:=0;primes[i]<=check.minfactor;i++{
n = check.number*primes[i]
if n>top{
minFactors[n] = numWithMinFactor{n,primes[i]}
if i+1 == len(primes){
func findPrimes(top int) []int{
primes := []int{}
minFactors := make([]numWithMinFactor,top+2)
check := 2
for power:=1;check <= top;power++{
if minFactors[check].number == 0{
primes = append(primes,check)
minFactors[check] = numWithMinFactor{check,check}
return primes
func main(){
fmt.Println("Welcome to prime finder!")
start := time.Now()
elapsed := time.Since(start)
fmt.Printf("Finding primes took %s\n", elapsed)
This runs great, producing all the primes < 1,000,000 in about 63ms (mostly printing) and primes < 10,000,000 in 600ms on my PC. Now I figure that no number check with 2^n < check <= 2^(n+1) has a factor > 2^n, so once I have the primes up to 2^n I can do all the multiplications and eliminations for every check in that range in parallel. My parallel implementation is as follows:
package main
type numWithMinFactor struct {
number int
minfactor int
func pow(base int, power int) int{
result := 1
for i:=0;i<power;i++{
return result
func process(check numWithMinFactor,primes []int,top int,minFactors []numWithMinFactor, wg *sync.WaitGroup){
defer wg.Done()
var n int
for i:=0;primes[i]<=check.minfactor;i++{
n = check.number*primes[i]
if n>top{
minFactors[n] = numWithMinFactor{n,primes[i]}
if i+1 == len(primes){
func findPrimes(top int) []int{
primes := []int{}
minFactors := make([]numWithMinFactor,top+2)
check := 2
var wg sync.WaitGroup
for power:=1;check <= top;power++{
for check <= pow(2,power){
if minFactors[check].number == 0{
primes = append(primes,check)
minFactors[check] = numWithMinFactor{check,check}
go process(minFactors[check],primes,top,minFactors,&wg)
if check>top{
return primes
func main(){
fmt.Println("Welcome to prime finder!")
start := time.Now()
elapsed := time.Since(start)
fmt.Printf("Finding primes took %s\n", elapsed)
Unfortunately this implementation is slower, running up to 1,000,000 in 600ms and up to 10 million in 6 seconds. My intuition tells me that there is potential for parallelism to improve performance; however, I clearly haven't been able to achieve that, and I would greatly appreciate any input on how to improve runtime here, or more specifically any insight as to why the parallel solution is slower.
Additionally, the parallel solution consumes more memory than the serial solution, but that is to be expected; the serial solution can grind up to 1,000,000,000 in about 22 seconds, whereas the parallel solution runs out of memory on my system (32GB RAM) going for the same target. But I'm asking about runtime here, not memory use. I could, for example, use the zero-value state of the minFactors array rather than a separate isPrime []bool true state, but I think it is more readable as is.
I've tried passing a pointer for primes []int, but that didn't seem to make a difference. Using a channel instead of passing the minFactors array to the process function resulted in much higher memory use and much (roughly 10x) slower performance. I've rewritten this algorithm a couple of times to see if I could iron anything out, but no luck. Any insights or suggestions would be much appreciated, because I think parallelism could make this faster, not 10x slower!
Per @Volker's suggestion, I limited the number of concurrent processes to something less than my PC's available logical processors with the following revision; however, I am still getting runtimes that are 10x slower than the serial implementation.
package main
type numWithMinFactor struct {
number int
minfactor int
func pow(base int, power int) int{
result := 1
for i:=0;i<power;i++{
return result
func process(check numWithMinFactor,primes []int,top int,minFactors []numWithMinFactor, wg *sync.WaitGroup){
defer wg.Done()
var n int
for i:=0;primes[i]<=check.minfactor;i++{
n = check.number*primes[i]
if n>top{
minFactors[n] = numWithMinFactor{n,primes[i]}
if i+1 == len(primes){
func findPrimes(top int) []int{
primes := []int{}
minFactors := make([]numWithMinFactor,top+2)
check := 2
nlogicalProcessors := 20
var wg sync.WaitGroup
var twoPow int
for power:=1;check <= top;power++{
twoPow = pow(2,power)
for check <= twoPow{
for nLogicalProcessorsInUse := 0 ; nLogicalProcessorsInUse < nlogicalProcessors; nLogicalProcessorsInUse++{
if minFactors[check].number == 0{
primes = append(primes,check)
minFactors[check] = numWithMinFactor{check,check}
go process(minFactors[check],primes,top,minFactors,&wg)
if check>top{
if check>twoPow{
if check>top{
return primes
func main(){
fmt.Println("Welcome to prime finder!")
start := time.Now()
elapsed := time.Since(start)
fmt.Printf("Finding primes took %s\n", elapsed)
TL;DR: Why is my parallel implementation slower than the serial implementation, and how do I make it faster?
Per @mh-cbon's suggestion, I made larger jobs for parallel processing, resulting in the following code.
package main
func pow(base int, power int) int{
result := 1
for i:=0;i<power;i++{
return result
func process(check int,primes []int,top int,minFactors []int){
var n int
for i:=0;primes[i]<=minFactors[check];i++{
n = check*primes[i]
if n>top{
minFactors[n] = primes[i]
if i+1 == len(primes){
func processRange(start int,end int,primes []int,top int,minFactors []int, wg *sync.WaitGroup){
defer wg.Done()
for start <= end{
func findPrimes(top int) []int{
primes := []int{}
minFactors := make([]int,top+2)
check := 2
nlogicalProcessors := 10
var wg sync.WaitGroup
var twoPow int
var start int
var end int
var stepSize int
var stepsTaken int
for power:=1;check <= top;power++{
twoPow = pow(2,power)
stepSize = (twoPow-start)/nlogicalProcessors
stepsTaken = 0
stepSize = (twoPow/2)/nlogicalProcessors
for check <= twoPow{
start = check
end = check+stepSize
if stepSize == 0{
end = twoPow
if stepsTaken == nlogicalProcessors-1{
end = twoPow
if end>top {
end = top
for check<=end {
if minFactors[check] == 0{
primes = append(primes,check)
minFactors[check] = check
go processRange(start,end,primes,top,minFactors,&wg)
if check>top{
if check>twoPow{
if check>top{
return primes
func main(){
fmt.Println("Welcome to prime finder!")
start := time.Now()
elapsed := time.Since(start)
fmt.Printf("Finding primes took %s\n", elapsed)
This runs at a similar speed to the serial implementation.
So I did eventually get a parallel version of the code to run slightly faster than the serial version, following suggestions from @mh-cbon (see above). However, this implementation did not result in vast improvements relative to the serial implementation (50ms to 10 million compared to 75ms serially). Considering that allocating and writing an []int 0:10000000 takes 25ms, I'm not disappointed by these results. As @Volker stated, "such stuff often is not limited by CPU but by memory bandwidth," which I believe is the case here.
I would still love to see any additional improvements; however, I am somewhat satisfied with what I've gained here.
Serial code running up to 2 billion 19.4 seconds
Parallel code running up to 2 billion 11.1 seconds
Initializing []int{0:2Billion} 4.5 seconds
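For reference, the serial algorithm listed at the top of this question can be sketched compactly as below. This is a minimal illustration under assumptions, not the asker's implementation: the function name linearSieve and the plain []int layout for minimum factors are choices made here.

```go
package main

import "fmt"

// linearSieve returns all primes <= top. minFactor[x] holds the smallest
// prime factor of x once x has been eliminated or found prime; 0 means
// "still assumed prime".
func linearSieve(top int) []int {
	if top < 2 {
		return nil
	}
	minFactor := make([]int, top+1)
	primes := []int{}
	for check := 2; check <= top; check++ {
		if minFactor[check] == 0 {
			// Never eliminated, so check is prime.
			minFactor[check] = check
			primes = append(primes, check)
		}
		// Multiply check by every prime <= its minimum factor and
		// eliminate the products.
		for _, p := range primes {
			if p > minFactor[check] || check*p > top {
				break
			}
			minFactor[check*p] = p
		}
	}
	return primes
}

func main() {
	fmt.Println(linearSieve(30)) // prints [2 3 5 7 11 13 17 19 23 29]
}
```

Each composite is written exactly once, so the inner loop does little work per memory write, which is consistent with the observation above that the run is limited by memory bandwidth more than by CPU.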

Mutex - global or local and idiomatic usage?

After reading the mutex examples on and stackoverflow, I'm still not sure about the declaration and idiomatic usage with anonymous functions. Therefore I've summarized a few examples.
Are examples A, B and C nearly equivalent or are there major differences that I don't notice?
I would prefer the global example "B". I guess if I'm careful with it, it's probably the simplest solution.
Or is there maybe a better approach to use mutex?
This example on go playground
package main

import (
	"fmt"
	"sync"
	"time"
)

// MuContainer pairs the data with the mutex that guards it.
type MuContainer struct {
	sync.Mutex
	data int
}

var mucglobal = &MuContainer{}

func main() {
	// A: Global declaration - working: adds 45
	for i := 0; i < 10; i++ {
		go func(j int, mucf *MuContainer) {
			mucf.Lock()
			mucf.data += j
			mucf.Unlock()
		}(i, mucglobal)
	}
	// B: Global only - working: adds 45
	for i := 0; i < 10; i++ {
		go func(j int) {
			mucglobal.Lock()
			mucglobal.data += j
			mucglobal.Unlock()
		}(i)
	}
	// C: Local declaration - working: adds 45
	muclocal := &MuContainer{}
	for i := 0; i < 10; i++ {
		go func(j int, mucf *MuContainer) {
			mucf.Lock()
			mucf.data += j
			mucf.Unlock()
		}(i, muclocal)
	}
	// // D: Pointer to struct - not working: adds 0
	// // I guess because it points directly to the struct.
	// for i := 0; i < 10; i++ {
	// 	go func(j int, mucf *MuContainer) {
	// 		mucf.Lock()
	// 		mucf.data += j
	// 		mucf.Unlock()
	// 	}(i, &MuContainer{})
	// }
	for {
		mucglobal.Lock()
		g := mucglobal.data
		mucglobal.Unlock()
		muclocal.Lock()
		l := muclocal.data
		muclocal.Unlock()
		fmt.Printf("global: %d / local: %d\n", g, l)
		if g == 90 && l == 45 {
			break
		}
		time.Sleep(time.Millisecond)
	}
}
D is not working because you are creating a new struct for each iteration. In the end, you'll have 10 independent instances of MuContainer.
Options A and B are semantically identical. The bottom line for those two is that every goroutine shares the same instance of the object, which happens to be a global var.
Option C is similar, the only difference being that the object locked and updated happens to be a local var. Again, the goroutines are working on the same instance of the object.
So these are not really different from each other, and all three have their uses.
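One common way to keep mutex usage simple, whichever of the three variants is chosen, is to hide the lock behind methods so callers never touch it directly. The sketch below is an illustration under assumptions: the Counter type, its Add and Value methods, and the sumTo helper are hypothetical names and do not come from the question.

```go
package main

import (
	"fmt"
	"sync"
)

// Counter keeps the mutex next to the data it guards.
type Counter struct {
	mu   sync.Mutex
	data int
}

// Add increments the counter under the lock.
func (c *Counter) Add(n int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data += n
}

// Value reads the counter under the lock.
func (c *Counter) Value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.data
}

// sumTo adds 0..n-1 to one shared Counter from n goroutines.
func sumTo(n int) int {
	var c Counter // a local works as well as a global: all goroutines share this one instance
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(j int) {
			defer wg.Done()
			c.Add(j)
		}(i)
	}
	wg.Wait() // replaces polling in a loop for the final value
	return c.Value()
}

func main() {
	fmt.Println(sumTo(10)) // prints 45
}
```

What matters is that every goroutine works on the same instance; whether that instance is global or local is secondary.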

Values sent over channel are not seen as received

The code below starts a few workers. Each worker receives a value via a channel which is added to a map where the key is the worker ID and value is the number received. Finally, when I add all the values received, I should get an expected result (in this case 55 because that is what you get when you add from 1..10). In most cases, I am not seeing the expected output. What am I doing wrong here? I do not want to solve it by adding a sleep. I would like to identify the issue programmatically and fix it.
type counter struct {
value int
count int
var data map[string]counter
var lock sync.Mutex
func adder(wid string, n int) {
defer lock.Unlock()
d := data[wid]
d.value += n
data[wid] = d
func main() {
data = make(map[string]counter)
c := make(chan int)
for w := 1; w <= 3; w++ { //starting 3 workers here
go func(wid string) {
data[wid] = counter{}
for {
v, k := <-c
if !k {
adder(wid, v)
}(strconv.Itoa(w)) // worker is given an ID
time.Sleep(1 * time.Second) // If this is not added, only one goroutine is recorded.
for i := 1; i <= 10; i++ {
c <- i
total := 0
for i, v := range data {
fmt.Println(i, v)
total += v.value
Your code has two significant races:
The initialization of data[wid] = counter{} is not synchronized with other goroutines that may be reading and rewriting data.
The worker goroutines do not signal when they are done modifying data, which means your main goroutine may read data before they finish writing.
You also have a strange construct:
for {
v, k := <-c
if !k {
adder(wid, v)
k will only be false when the channel c is closed, after which the goroutine spins as much as it can. This would be better written as for v := range c.
To fix the reading code in the main goroutine, we'll use the more normal for ... range c idiom and add a sync.WaitGroup, and have each worker invoke Done() on the wait-group. The main goroutine will then wait for them to finish. To fix the initialization, we'll lock the map (there are other ways to do this, e.g., to set up the map before starting any of the goroutines, or to rely on the fact that empty map slots read as zero, but this one is straightforward). I also took out the extra debug. The result is this code, also available on the Go Playground.
package main
import (
// "os"
// "time"
type counter struct {
value int
count int
var data map[string]counter
var lock sync.Mutex
var wg sync.WaitGroup
func adder(wid string, n int) {
defer lock.Unlock()
d := data[wid]
d.value += n
data[wid] = d
func main() {
// fmt.Println(os.Getpid())
data = make(map[string]counter)
c := make(chan int)
for w := 1; w <= 3; w++ { //starting 3 workers here
go func(wid string) {
data[wid] = counter{}
for v := range c {
adder(wid, v)
}(strconv.Itoa(w)) // worker is given an ID
for i := 1; i <= 10; i++ {
c <- i
total := 0
for i, v := range data {
fmt.Println(i, v)
total += v.value
(This can be improved easily, e.g., there's no reason for wg to be global.)
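A self-contained sketch of the fix described in this answer, with the WaitGroup and mutex kept local, might look like the following. The helper name sumWithWorkers is an assumption for illustration; it relies on the fact that a missing map key reads as the zero value, so no separate initialization of data[wid] is needed.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

type counter struct {
	value int // sum of the numbers received
	count int // how many numbers were received
}

// sumWithWorkers sends 1..n over a channel to the given number of workers
// and returns the total of all per-worker sums.
func sumWithWorkers(workers, n int) int {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		data = make(map[string]counter)
		c    = make(chan int)
	)
	for w := 1; w <= workers; w++ {
		wg.Add(1)
		go func(wid string) {
			defer wg.Done()
			for v := range c { // ends when c is closed
				mu.Lock()
				d := data[wid] // zero value if wid is not in the map yet
				d.value += v
				d.count++
				data[wid] = d
				mu.Unlock()
			}
		}(strconv.Itoa(w))
	}
	for i := 1; i <= n; i++ {
		c <- i
	}
	close(c)  // lets the workers' range loops finish
	wg.Wait() // no worker is still writing to data after this
	total := 0
	for _, d := range data {
		total += d.value
	}
	return total
}

func main() {
	fmt.Println(sumWithWorkers(3, 10)) // prints 55
}
```

Closing the channel and then waiting on the WaitGroup removes the need for any sleep: the map is read only after every worker has returned.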
Well, I like @torek's answer, but I wanted to post this answer as it contains a number of improvements:
Reduce the usage of locks (for such simple tasks, avoid locks where possible; if you benchmark it, you'll notice a good difference, because my code takes the lock only numworkers times).
Improve the naming of variables.
Remove usage of global vars (use of global vars should always be kept to a minimum).
The following code adds the numbers from minWork to maxWork using numworkers spawned goroutines.
package main
import (
const (
bufferSize = 1 // Buffer for numChan
numworkers = 3 // Number of workers doing addition
minWork = 1 // Sum from [minWork] (inclusive)
maxWork = 10000000 // Sum upto [maxWork] (inclusive)
// worker stats
type worker struct {
workCount int // Number of times, worker worked
workDone int // Amount of work done; numbers added
// workerMap holds a map for worker(s)
type workerMap struct {
mu sync.Mutex // Guards m for safe, concurrent r/w
m map[int]worker // Map to hold worker id to worker mapping
func main() {
var (
totalWorkDone int // Total Work Done
wm workerMap // WorkerMap
wg sync.WaitGroup // WaitGroup
numChan = make(chan int, bufferSize) // Channel for nums
wm.m = make(map[int]worker, numworkers)
for wid := 0; wid < numworkers; wid++ {
go func(id int) {
var wk worker
// Wait for numbers
for n := range numChan {
wk.workDone += n
// Fill worker stats
wm.m[id] = wk
// Send numbers for addition by multiple workers
for i := minWork; i <= maxWork; i++ {
numChan <- i
// Close the channel
// Wait for goroutines to finish
// Print stats
for k, v := range wm.m {
fmt.Printf("WorkerID: %d; Work: %+v\n", k, v)
totalWorkDone += v.workDone
// Print total work done by all workers
fmt.Printf("Work Done: %d\n", totalWorkDone)

"Matrix multiplication" using goroutines and channels

I have a university project to test the time difference for matrix multiplication when I use 1 goroutine, 2 goroutines, 3 and so on. I must use channels. My problem is that no matter how many goroutines I add, the running time is almost always the same. Maybe someone can tell me where the problem is. Maybe the sending takes very long and accounts for all the time. The code is given below.
package main
import (
const length = 1000
var start time.Time
var rez [length][length]int
func main() {
const threadlength = 1
toCalcRow := make(chan []int)
toCalcColumn := make(chan []int)
dummy1 := make(chan int)
dummy2 := make(chan int)
var row [length + 1]int
var column [length + 1]int
var a [length][length]int
var b [length][length]int
for i := 0; i < length; i++ {
for j := 0; j < length; j++ {
a[i][j] = rand.Intn(10)
b[i][j] = rand.Intn(10)
for i := 0; i < threadlength; i++ {
go Calc(toCalcRow, toCalcColumn, dummy1, dummy2)
start = time.Now()
for i := 0; i < length; i++ {
for j := 0; j < length; j++ {
row[0] = i
column[0] = j
for k := 0; k < length; k++ {
row[k+1] = a[i][j]
column[k+1] = b[i][k]
rowSlices := make([]int, len(row))
columnSlices := make([]int, len(column))
copy(rowSlices, row[:])
copy(columnSlices, column[:])
toCalcRow <- rowSlices
toCalcColumn <- columnSlices
dummy1 <- -1
for i := 0; i < length; i++ {
for j := 0; j < length; j++ {
fmt.Print(" ")
fmt.Println(" ")
func Calc(chin1 <-chan []int, chin2 <-chan []int, dummy <-chan int, dummy1 chan<- int) {
for {
select {
case row := <-chin1:
column := <-chin2
var sum [3]int
sum[0] = row[0]
sum[1] = column[0]
for i := 1; i < len(row); i++ {
sum[2] += row[i] * column[i]
rez[sum[0]][sum[1]] = sum[2]
case <-dummy:
elapsed := time.Since(start)
fmt.Println("Binomial took ", elapsed)
dummy1 <- 0
break loop
You don't see a difference because preparing the data to pass to the goroutines is your bottleneck. It's as slow as, or slower than, performing the calculation itself.
Passing a copy of the rows and columns is not a good strategy. This is killing the performance.
The goroutines can read data directly from the input matrices, which are read-only. There is no possible race condition here.
The same goes for the output. If a goroutine computes the product of a row and a column, it writes the result to a distinct cell. There is no possible race condition here either.
What to do is the following. Define a struct with two fields, one for the row and one for the column to multiply.
Fill a buffered channel with all possible combinations of rows and columns to multiply, from (0,0) to (n-1,m-1).
The goroutines consume the structs from the channel, perform the computation and write the result directly into the output matrix.
You then also have a done channel to signal to the main goroutine that the computation is done. When a goroutine has finished processing the struct (n-1,m-1), it closes the done channel.
The main goroutine waits on the done channel after it has written all structs. Once the done channel is closed, it prints the elapsed time.
We can use a WaitGroup to wait until all goroutines have finished their computation.
You can then start with one goroutine and increase the number of goroutines to see the impact on the processing time.
See the code:
package main
import (
type pair struct {
row, col int
const length = 1000
var start time.Time
var rez [length][length]int
func main() {
const threadlength = 1
pairs := make(chan pair, 1000)
var wg sync.WaitGroup
var a [length][length]int
var b [length][length]int
for i := 0; i < length; i++ {
for j := 0; j < length; j++ {
a[i][j] = rand.Intn(10)
b[i][j] = rand.Intn(10)
for i := 0; i < threadlength; i++ {
go Calc(pairs, &a, &b, &rez, &wg)
start = time.Now()
for i := 0; i < length; i++ {
for j := 0; j < length; j++ {
pairs <- pair{row: i, col: j}
elapsed := time.Since(start)
fmt.Println("Binomial took ", elapsed)
for i := 0; i < length; i++ {
for j := 0; j < length; j++ {
fmt.Print(" ")
fmt.Println(" ")
func Calc(pairs chan pair, a, b, rez *[length][length]int, wg *sync.WaitGroup) {
for {
pair, ok := <-pairs
if !ok {
rez[pair.row][pair.col] = 0
for i := 0; i < length; i++ {
rez[pair.row][pair.col] += a[pair.row][i] * b[i][pair.col]
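The approach described in this answer can be sketched in a small, self-contained form as follows. The multiply helper, the use of slices instead of fixed-size arrays, and the channel buffer size of 64 are assumptions made for this illustration; only index pairs travel over the channel, and the workers read the inputs and write the output directly.

```go
package main

import (
	"fmt"
	"sync"
)

// pair names one output cell: row of a times column of b.
type pair struct {
	row, col int
}

// multiply computes a*b for square n x n matrices using the given number
// of worker goroutines.
func multiply(a, b [][]int, workers int) [][]int {
	n := len(a)
	res := make([][]int, n)
	for i := range res {
		res[i] = make([]int, n)
	}
	pairs := make(chan pair, 64)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range pairs {
				sum := 0
				for k := 0; k < n; k++ {
					sum += a[p.row][k] * b[k][p.col]
				}
				// Each cell is written by exactly one goroutine, so no lock is needed.
				res[p.row][p.col] = sum
			}
		}()
	}
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			pairs <- pair{row: i, col: j}
		}
	}
	close(pairs) // no more work: the workers' range loops end
	wg.Wait()    // stop the clock only after this returns
	return res
}

func main() {
	a := [][]int{{1, 2}, {3, 4}}
	b := [][]int{{5, 6}, {7, 8}}
	fmt.Println(multiply(a, b, 2)) // prints [[19 22] [43 50]]
}
```

When timing this, the elapsed time should be taken after wg.Wait(), since sending the last pair does not mean the last product has been computed.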
Your code is quite difficult to follow (calling variables dummy1/dummy2 is confusing, particularly when they get different names in Calc), and adding some comments would make it more easily understood.
Firstly, a bug. After sending the data to be calculated you do dummy1 <- -1, and I believe you expect this to wait for all calculations to be complete. However, that is not necessarily the case when you have multiple goroutines. The channel will be drained by ONE of the goroutines and the timing info printed out; the other goroutines will still be running (and may not have finished their calculations).
In terms of timing, I suspect that the way you are sending data to the goroutines will slow things down. You send the row and then the column; because the channels are not buffered, the goroutine will block while waiting for the column (switching back to the main goroutine to send the column). This back and forth will slow the rate at which your goroutines get data and may well explain why adding extra goroutines has a limited impact (it also becomes dangerous if you use buffered channels).
I have refactored your code (note there may be bugs and it's far from perfect!) into something that does show a difference (on my computer 1 goroutine = 10s; 5 = 7s):
package main
import (
const length = 1000
var start time.Time
var rez [length][length]int
// toMultiply will hold details of what the goroutine will be multiplying (one row and one column)
type toMultiply struct {
rowNo int
columnNo int
row []int
column []int
func main() {
const noOfGoRoutines = 5
// Build up a matrix of dimensions (length) x (length)
var a [length][length]int
var b [length][length]int
for i := 0; i < length; i++ {
for j := 0; j < length; j++ {
a[i][j] = rand.Intn(10)
b[i][j] = rand.Intn(10)
// Setup completed so start the clock...
start = time.Now()
// Start off noOfGoRoutines goroutines to multiply each row/column
toCalc := make(chan toMultiply)
var wg sync.WaitGroup
for i := 0; i < noOfGoRoutines; i++ {
go func() {
// Begin the multiplication.
start = time.Now()
for i := 0; i < length; i++ {
for j := 0; j < length; j++ {
tm := toMultiply{
rowNo: i,
columnNo: j,
row: make([]int, length),
column: make([]int, length),
for k := 0; k < length; k++ {
tm.row[k] = a[i][k]    // k-th element of row i of a
tm.column[k] = b[k][j] // k-th element of column j of b
toCalc <- tm
// All of the data has been sent to the channel; now we need to wait for all of the
// goroutines to complete
fmt.Println("Binomial took ", time.Since(start))
// The full result should be in rez
for i := 0; i < length; i++ {
for j := 0; j < length; j++ {
//fmt.Print(" ")
//fmt.Println(" ")
// Calc - Multiply a row from one matrix with a column from another
func Calc(toCalc <-chan toMultiply) {
for tc := range toCalc {
var result int
for i := 0; i < len(tc.row); i++ {
result += tc.row[i] * tc.column[i]
// warning - the below should work in this case but be careful writing to global variables from goroutines
rez[tc.rowNo][tc.columnNo] = result

Saving results from a parallelized goroutine

I am trying to parallelize an operation in Go and save the results in a way that I can iterate over afterwards to sum them up.
I have managed to set up the parameters so that no deadlock occurs, and I have confirmed that the operations are working and being saved correctly within the function. When I iterate over the slice of my struct and try to sum up the results of the operation, they all remain 0. I have tried passing by reference, with pointers, and with channels (which causes deadlock).
I have only found this example for help: But this seems outdated now, as Vector has been deprecated? I also have not found any references to the way this function (in the example) was constructed (with the func (u Vector) before the name). I tried replacing this with a slice but got compile-time errors.
Any help would be much appreciated. Here are the key parts of my code:
type Job struct {
a int
b int
result *big.Int
func choose(jobs []Job, c chan int) {
temp := new(big.Int)
for _,job := range jobs {
job.result = //perform operation on job.a and job.b
c <- 1
func main() {
num := 100 //can be very large (why we need big.Int)
n := num
k := 0
const numCPU = 6 //runtime.NumCPU
count := new(big.Int)
// create a 2d slice of jobs, one for each core
jobs := make([][]Job, numCPU)
for (float64(k) <= math.Ceil(float64(num / 2))) {
// add one job to each core, alternating so that
// job set is similar in difficulty
for i := 0; i < numCPU; i++ {
if !(float64(k) <= math.Ceil(float64(num / 2))) {
jobs[i] = append(jobs[i], Job{n, k, new(big.Int)})
n -= 1
k += 1
c := make(chan int, numCPU)
for i := 0; i < numCPU; i++ {
go choose(jobs[i], c)
// drain the channel
for i := 0; i < numCPU; i++ {
// computations are done
for i := range jobs {
for _,job := range jobs[i] {
count.Add(count, job.result)
Here is the code running on the go playground
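As long as the []Job slice is only modified by one goroutine at a time, there's no reason you can't modify the job in place. In your loop, job is a copy of the slice element, so assigning to job.result never changes the slice; index into the slice instead.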
for i, job := range jobs {
jobs[i].result = temp.Binomial(int64(job.a), int64(job.b))
You should also use a WaitGroup, rather than rely on counting tokens in a channel yourself.
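Putting both points together, a minimal sketch could look like this. It is an illustration under assumptions rather than the asker's program: the sumBinomials helper and its workload (summing C(n, k) for k = 0..n, which equals 2^n) are chosen here only because the result is easy to check.

```go
package main

import (
	"fmt"
	"math/big"
	"sync"
)

type Job struct {
	a, b   int
	result *big.Int
}

// choose fills in the result of every job in its own slice, in place.
func choose(jobs []Job, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := range jobs {
		// Index into the slice: a range value variable would be a copy.
		jobs[i].result = new(big.Int).Binomial(int64(jobs[i].a), int64(jobs[i].b))
	}
}

// sumBinomials returns the sum of C(n, k) for k = 0..n, with the jobs
// dealt out round-robin to the given number of goroutines.
func sumBinomials(n, workers int) *big.Int {
	jobs := make([][]Job, workers)
	for k := 0; k <= n; k++ {
		jobs[k%workers] = append(jobs[k%workers], Job{a: n, b: k})
	}
	var wg sync.WaitGroup
	for i := range jobs {
		wg.Add(1)
		go choose(jobs[i], &wg) // each goroutine owns one inner slice
	}
	wg.Wait() // computations are done
	total := new(big.Int)
	for i := range jobs {
		for _, job := range jobs[i] {
			total.Add(total, job.result)
		}
	}
	return total
}

func main() {
	fmt.Println(sumBinomials(10, 4)) // prints 1024
}
```

Because each goroutine writes only to its own inner slice and the main goroutine reads only after wg.Wait(), no mutex or channel is needed here.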
