Parallel version much slower than the serial one in golang - performance

I'm trying to code a parallel version of a simple algorithm that takes a point and a list of points and finds which point in the list is closest to the first one, in order to compare execution times with the serial version.
The problem is that the parallel version needs more than 1 minute to run, while the serial version needs around 1 second.
To make sure the effect of parallelism is noticeable, I'm testing the code with a list of around 12 million points.
My CPU details:
Model name: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
CPU(s): 4
Here are the two versions:
Common part:
type Point struct {
    X float64
    Y float64
}

func dist(p, q Point) float64 {
    return math.Sqrt(math.Pow(p.X-q.X, 2) + math.Pow(p.Y-q.Y, 2))
}
Sequential function:
func s_argmin(p Point, points_list []Point, i, j int) int {
    best := 0
    d := dist(p, points_list[0])
    var new_d float64
    for k := i; k < j+1; k++ {
        new_d = dist(p, points_list[k])
        if new_d < d {
            d = new_d
            best = k
        }
    }
    return best
}
Parallel function:
func p_argmin(p Point, points_list []Point, i, j int) int {
    if i == j {
        return i
    } else {
        mid := (i + j) / 2
        var argmin1, argmin2 int
        c1 := make(chan int)
        c2 := make(chan int)
        go func() {
            c1 <- p_argmin(p, points_list, i, mid)
        }()
        go func() {
            c2 <- p_argmin(p, points_list, mid+1, j)
        }()
        argmin1 = <-c1
        argmin2 = <-c2
        close(c1)
        close(c2)
        if dist(p, points_list[argmin1]) < dist(p, points_list[argmin2]) {
            return argmin1
        } else {
            return argmin2
        }
    }
}
I also tried to limit parallelism with an optimized function that executes the parallel version only when the input size (j-i) is greater than a threshold, but the serial version is always the faster one.
How can I improve the result of the parallel version?

Meaningless microbenchmarks produce meaningless results.
I see no reason to believe that recursive p_argmin might be faster than s_argmin. Note the per-op call counts logged below: s_argmin runs once per benchmark iteration, while p_argmin makes 839 recursive calls per iteration (2·420−1 for a 420-point slice), most of them spawning two goroutines and two channels.
$ go test micro_test.go -bench=. -benchmem
goos: linux
goarch: amd64
BenchmarkS-4 946197 1263 ns/op 0 B/op 0 allocs/op
--- BENCH: BenchmarkS-4
micro_test.go:81: 1 946197 946197
BenchmarkP-4 3477 302076 ns/op 80958 B/op 843 allocs/op
--- BENCH: BenchmarkP-4
micro_test.go:98: 839 2917203 3477
$
micro_test.go:
package main

import (
    "math"
    "sync"
    "testing"
)

type Point struct {
    X float64
    Y float64
}

func dist(p, q Point) float64 {
    //return math.Sqrt(math.Pow(p.X-q.X, 2) + math.Pow(p.Y-q.Y, 2))
    return math.Sqrt((p.X-q.X)*(p.X-q.X) + (p.Y-q.Y)*(p.Y-q.Y))
}

func s_argmin(p Point, points_list []Point, i, j int) int {
    mbm.Lock()
    nbm++
    mbm.Unlock()
    best := 0
    d := dist(p, points_list[0])
    var new_d float64
    for k := i; k < j+1; k++ {
        new_d = dist(p, points_list[k])
        if new_d < d {
            d = new_d
            best = k
        }
    }
    return best
}

func p_argmin(p Point, points_list []Point, i, j int) int {
    mbm.Lock()
    nbm++
    mbm.Unlock()
    if i == j {
        return i
    }
    mid := (i + j) / 2
    var argmin1, argmin2 int
    c1 := make(chan int)
    c2 := make(chan int)
    go func() {
        c1 <- p_argmin(p, points_list, i, mid)
    }()
    go func() {
        c2 <- p_argmin(p, points_list, mid+1, j)
    }()
    argmin1 = <-c1
    argmin2 = <-c2
    if dist(p, points_list[argmin1]) < dist(p, points_list[argmin2]) {
        return argmin1
    }
    return argmin2
}

var (
    nbm int
    mbm sync.Mutex
)

func BenchmarkS(b *testing.B) {
    mbm.Lock()
    nbm = 0
    mbm.Unlock()
    points := make([]Point, 420)
    b.ResetTimer()
    for N := 0; N < b.N; N++ {
        s_argmin(points[0], points, 0, len(points)-1)
    }
    b.StopTimer()
    mbm.Lock()
    b.Log(float64(nbm)/float64(b.N), nbm, b.N)
    mbm.Unlock()
}

func BenchmarkP(b *testing.B) {
    mbm.Lock()
    nbm = 0
    mbm.Unlock()
    points := make([]Point, 420)
    b.ResetTimer()
    for N := 0; N < b.N; N++ {
        p_argmin(points[0], points, 0, len(points)-1)
    }
    b.StopTimer()
    mbm.Lock()
    b.Log(float64(nbm)/float64(b.N), nbm, b.N)
    mbm.Unlock()
}

The costs matter (a lot), as you can Try-it-Online.
A pure-[SERIAL] flow of code-execution shows the negligible cost of evaluating a distance per Point: only about 36 [ns] per Point.
// ... The [SERIAL] flow of code-execution took 77.095 µs for [10]
// --------^^^^^^^^^^------------------------------------|---------------------
// ... The [PARALLEL] flow of code-execution took 142.563 µs for [10] Points
// ... The [PARALLEL] flow of code-execution took 386.27 µs for [100] Points
// ... The [PARALLEL] flow of code-execution took 4260.941 µs for [1000] Points
// ... The [PARALLEL] flow of code-execution took 31455.29 µs for [10000] Points
// ... The [SERIAL] flow of code-execution took 591.604 µs for [10000] Points
// ... The [PARALLEL] flow of code-execution took 391694.389 µs for [100000] Points
// ... The [SERIAL] flow of code-execution took 6425.999 µs for [100000] Points
// ... The [PARALLEL] flow of code-execution took 2807615.771 µs for [1000000] Points
// ... The [SERIAL] flow of code-execution took 64596.044 µs for [1000000] Points
// | | | ... ns
// | | +____ µs
// | +_______ ms
// +__________ s
Given this, the costs of instantiating the go-parallel (split-and-conquer) flow of execution accumulate such huge add-on overheads that they will hardly ever be justified for any reasonably sized []Point here.
Even for larger []Point sizes, these very overheads yield ~2807 [ns] per Point, i.e. ~78x slower per-Point processing, precisely because of the wrong ratio of costs_of_computing to costs_of_overheads.
The revised, overhead-strict Amdahl's argument (not the original one) is valid here: the original formulation did not force people to also take the hidden add-on overhead costs into consideration, and amateurs often tend to skew their Speedup expectations.
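For concreteness, one common way to write that overhead-strict form (the notation here is generic, not from the original post): with a serial fraction s, the parallel fraction (1 - s) spread over N processors, and add-on overheads o paid for setting up and joining the parallel work,

    Speedup = 1 / ( s + (1 - s) / N + o )

Once o exceeds the work actually saved by parallelisation, (1 - s) * (1 - 1/N), the Speedup drops below 1 and the "parallel" version loses, which is exactly the regime the timings above demonstrate.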
func SERIAL(aPointToSEEK Point, aListOfPOINTs []Point) {
    defer TimeTRACK(time.Now(), "The [SERIAL] flow of code-execution", len(aListOfPOINTs))
    //
    // 2020/03/09 07:17:54 The [SERIAL] flow of code-execution took 120.529 µs for [1]
    // 2020/03/09 07:17:28 The [SERIAL] flow of code-execution took 194.565 µs for [10]
    // 2020/03/09 07:11:28 The [SERIAL] flow of code-execution took 77.095 µs for [100]
    // 2020/03/09 07:12:16 The [SERIAL] flow of code-execution took 260.771 µs for [1000]
    // 2020/03/09 07:13:19 The [SERIAL] flow of code-execution took 591.604 µs for [10000]
    // 2020/03/09 07:13:57 The [SERIAL] flow of code-execution took 4585.917 µs for [100000]
    // 2020/03/09 07:14:33 The [SERIAL] flow of code-execution took 44317.063 µs for [1000000]
    // 2020/03/09 07:10:30 The [SERIAL] flow of code-execution took 36141.75 µs for [1000000]
    // 2020/03/09 07:15:10 The [SERIAL] flow of code-execution took 554986.415 µs for [10000000]
    // 2020/03/09 07:24:10 The [SERIAL] flow of code-execution took 676098.025 µs for [10000000]
    //                                                     |  |  | ... ns
    //                                                     |  | +____ µs
    //                                                     |  +_______ ms
    //                                                     +__________ s
    log.Printf("%s got nearest aPointID# %d", "The [SERIAL] flow of code-execution", s_argmin(aPointToSEEK, aListOfPOINTs, 0, len(aListOfPOINTs)-1))
}
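The TimeTRACK helper is not shown in the answer; here is a minimal reconstruction so the snippet compiles, guessed from the logged lines above rather than taken from the original code (needs "log" and "time" imported):

// TimeTRACK logs the elapsed wall-clock time of the traced call, in µs.
// Use with defer, passing time.Now() at the point of entry.
func TimeTRACK(start time.Time, name string, size int) {
    elapsed := float64(time.Since(start).Nanoseconds()) / 1e3 // µs
    log.Printf("%s took %g µs for [%d]", name, elapsed, size)
}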

Related

Parallel execution of prime finding algorithm slows runtime

So I implemented the following prime finding algorithm in Go:
1. primes = []
2. Assume all numbers are primes (vacuously true).
3. check = 2
4. If check is still assumed to be prime, append it to primes.
5. Multiply check by each prime less than or equal to its minimum factor and eliminate the results from the assumed primes.
6. Increment check by 1 and repeat steps 4 through 6 until check > limit.
Here is my serial implementation:
package main

import (
    "fmt"
    "time"
)

type numWithMinFactor struct {
    number    int
    minfactor int
}

func pow(base int, power int) int {
    result := 1
    for i := 0; i < power; i++ {
        result *= base
    }
    return result
}

func process(check numWithMinFactor, primes []int, top int, minFactors []numWithMinFactor) {
    var n int
    for i := 0; primes[i] <= check.minfactor; i++ {
        n = check.number * primes[i]
        if n > top {
            break
        }
        minFactors[n] = numWithMinFactor{n, primes[i]}
        if i+1 == len(primes) {
            break
        }
    }
}

func findPrimes(top int) []int {
    primes := []int{}
    minFactors := make([]numWithMinFactor, top+2)
    check := 2
    for power := 1; check <= top; power++ {
        if minFactors[check].number == 0 {
            primes = append(primes, check)
            minFactors[check] = numWithMinFactor{check, check}
        }
        process(minFactors[check], primes, top, minFactors)
        check++
    }
    return primes
}

func main() {
    fmt.Println("Welcome to prime finder!")
    start := time.Now()
    fmt.Println(findPrimes(1000000))
    elapsed := time.Since(start)
    fmt.Printf("Finding primes took %s\n", elapsed) // Printf, not Println, for the %s verb
}
This runs great, producing all the primes < 1,000,000 in about 63ms (mostly printing) and the primes < 10,000,000 in 600ms on my PC. Now, I figure that none of the numbers check such that 2^n < check <= 2^(n+1) have factors > 2^n, so I can do all the multiplications and eliminations for each check in that range in parallel once I have the primes up to 2^n. My parallel implementation is as follows:
package main

import (
    "fmt"
    "sync"
    "time"
)

type numWithMinFactor struct {
    number    int
    minfactor int
}

func pow(base int, power int) int {
    result := 1
    for i := 0; i < power; i++ {
        result *= base
    }
    return result
}

func process(check numWithMinFactor, primes []int, top int, minFactors []numWithMinFactor, wg *sync.WaitGroup) {
    defer wg.Done()
    var n int
    for i := 0; primes[i] <= check.minfactor; i++ {
        n = check.number * primes[i]
        if n > top {
            break
        }
        minFactors[n] = numWithMinFactor{n, primes[i]}
        if i+1 == len(primes) {
            break
        }
    }
}

func findPrimes(top int) []int {
    primes := []int{}
    minFactors := make([]numWithMinFactor, top+2)
    check := 2
    var wg sync.WaitGroup
    for power := 1; check <= top; power++ {
        for check <= pow(2, power) {
            if minFactors[check].number == 0 {
                primes = append(primes, check)
                minFactors[check] = numWithMinFactor{check, check}
            }
            wg.Add(1)
            go process(minFactors[check], primes, top, minFactors, &wg)
            check++
            if check > top {
                break
            }
        }
        wg.Wait()
    }
    return primes
}

func main() {
    fmt.Println("Welcome to prime finder!")
    start := time.Now()
    fmt.Println(findPrimes(1000000))
    elapsed := time.Since(start)
    fmt.Printf("Finding primes took %s\n", elapsed) // Printf, not Println, for the %s verb
}
Unfortunately, not only is this implementation no faster, it is slower: running up to 1,000,000 takes 600ms and up to 10 million takes 6 seconds. My intuition tells me that there is potential for parallelism to improve performance, but I clearly haven't been able to achieve that, and I would greatly appreciate any input on how to improve the runtime here, or more specifically any insight as to why the parallel solution is slower.
Additionally, the parallel solution consumes more memory relative to the serial solution, but that is to be expected; the serial solution can grind up to 1,000,000,000 in about 22 seconds, where the parallel solution runs out of memory on my system (32GB RAM) going for the same target. But I'm asking about runtime here, not memory use. I could, for example, use the zero-value state of the minFactors array rather than a separate isPrime []bool true state, but I think it is more readable as is.
I've tried passing a pointer for primes []int, but that didn't seem to make a difference; using a channel instead of passing the minFactors array to the process function resulted in big-time memory use and much (10x-ish) slower performance. I've rewritten this algo a couple of times to see if I could iron anything out, but no luck. Any insights or suggestions would be much appreciated, because I think parallelism could make this faster, not 10x slower!
Per @Volker's suggestion I limited the number of goroutines to something less than my PC's available logical processors with the following revision; however, I am still getting runtimes that are 10x slower than the serial implementation.
package main

import (
    "fmt"
    "sync"
    "time"
)

type numWithMinFactor struct {
    number    int
    minfactor int
}

func pow(base int, power int) int {
    result := 1
    for i := 0; i < power; i++ {
        result *= base
    }
    return result
}

func process(check numWithMinFactor, primes []int, top int, minFactors []numWithMinFactor, wg *sync.WaitGroup) {
    defer wg.Done()
    var n int
    for i := 0; primes[i] <= check.minfactor; i++ {
        n = check.number * primes[i]
        if n > top {
            break
        }
        minFactors[n] = numWithMinFactor{n, primes[i]}
        if i+1 == len(primes) {
            break
        }
    }
}

func findPrimes(top int) []int {
    primes := []int{}
    minFactors := make([]numWithMinFactor, top+2)
    check := 2
    nlogicalProcessors := 20
    var wg sync.WaitGroup
    var twoPow int
    for power := 1; check <= top; power++ {
        twoPow = pow(2, power)
        for check <= twoPow {
            for nLogicalProcessorsInUse := 0; nLogicalProcessorsInUse < nlogicalProcessors; nLogicalProcessorsInUse++ {
                if minFactors[check].number == 0 {
                    primes = append(primes, check)
                    minFactors[check] = numWithMinFactor{check, check}
                }
                wg.Add(1)
                go process(minFactors[check], primes, top, minFactors, &wg)
                check++
                if check > top {
                    break
                }
                if check > twoPow {
                    break
                }
            }
            wg.Wait()
            if check > top {
                break
            }
        }
    }
    return primes
}

func main() {
    fmt.Println("Welcome to prime finder!")
    start := time.Now()
    fmt.Println(findPrimes(10000000))
    elapsed := time.Since(start)
    fmt.Printf("Finding primes took %s\n", elapsed) // Printf, not Println, for the %s verb
}
tl;dr: Why is my parallel implementation slower than the serial implementation, and how do I make it faster?
Per @mh-cbon's suggestion I made larger jobs for parallel processing, resulting in the following code.
package main

import (
    "fmt"
    "sync"
    "time"
)

func pow(base int, power int) int {
    result := 1
    for i := 0; i < power; i++ {
        result *= base
    }
    return result
}

func process(check int, primes []int, top int, minFactors []int) {
    var n int
    for i := 0; primes[i] <= minFactors[check]; i++ {
        n = check * primes[i]
        if n > top {
            break
        }
        minFactors[n] = primes[i]
        if i+1 == len(primes) {
            break
        }
    }
}

func processRange(start int, end int, primes []int, top int, minFactors []int, wg *sync.WaitGroup) {
    defer wg.Done()
    for start <= end {
        process(start, primes, top, minFactors)
        start++
    }
}

func findPrimes(top int) []int {
    primes := []int{}
    minFactors := make([]int, top+2)
    check := 2
    nlogicalProcessors := 10
    var wg sync.WaitGroup
    var twoPow int
    var start int
    var end int
    var stepSize int
    var stepsTaken int
    for power := 1; check <= top; power++ {
        twoPow = pow(2, power)
        stepSize = (twoPow - start) / nlogicalProcessors
        stepsTaken = 0
        stepSize = (twoPow / 2) / nlogicalProcessors
        for check <= twoPow {
            start = check
            end = check + stepSize
            if stepSize == 0 {
                end = twoPow
            }
            if stepsTaken == nlogicalProcessors-1 {
                end = twoPow
            }
            if end > top {
                end = top
            }
            for check <= end {
                if minFactors[check] == 0 {
                    primes = append(primes, check)
                    minFactors[check] = check
                }
                check++
            }
            wg.Add(1)
            go processRange(start, end, primes, top, minFactors, &wg)
            if check > top {
                break
            }
            if check > twoPow {
                break
            }
            stepsTaken++
        }
        wg.Wait()
        if check > top {
            break
        }
    }
    return primes
}

func main() {
    fmt.Println("Welcome to prime finder!")
    start := time.Now()
    fmt.Println(findPrimes(1000000))
    elapsed := time.Since(start)
    fmt.Printf("Finding primes took %s\n", elapsed) // Printf, not Println, for the %s verb
}
This runs at a similar speed to the serial implementation.
So I did eventually get a parallel version of the code to run slightly faster than the serial version, following suggestions from @mh-cbon (see above). However, this implementation did not result in vast improvements relative to the serial implementation (50ms to 10 million, compared to 75ms serially). Considering that allocating and writing an []int of 10,000,000 elements takes 25ms, I'm not disappointed by these results. As @Volker stated, "such stuff often is not limited by CPU but by memory bandwidth," which I believe is the case here.
I would still love to see any additional improvements, but I am somewhat satisfied with what I've gained here:
Serial code running up to 2 billion: 19.4 seconds
Parallel code running up to 2 billion: 11.1 seconds
Initializing []int{0:2Billion}: 4.5 seconds
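For reference, that allocation baseline can be measured directly; a minimal sketch (the size matches the 2-billion figure quoted above, but the code itself is mine, not from the original post):

package main

import (
    "fmt"
    "time"
)

func main() {
    // How long does it take just to allocate and write the backing array the
    // sieve needs? This bounds any speedup the goroutine workers can deliver.
    start := time.Now()
    minFactors := make([]int, 2000000000)
    for i := range minFactors {
        minFactors[i] = i // touch every element so every page is really written
    }
    fmt.Println("alloc+write:", time.Since(start), len(minFactors))
}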

Why does Go use more CPUs but does not reduce computing duration?

I wrote a simple Go program to add numbers in many goroutines.
When I increase the number of goroutines, the program uses more CPUs and I expect the computing duration to be shorter. That holds for 1, 2, or 4 goroutines, but when I try 8 goroutines, the duration is the same as with 4 (I ran the test on an i5-8265U, an 8-CPU processor).
Can you explain it to me?
The code:
package main

import (
    "fmt"
    "time"
)

// Sum returns n by calculating 1+1+1+..
func Sum(n int64) int64 {
    ret := int64(0)
    for i := int64(0); i < n; i++ {
        ret += 1
    }
    return ret
}

func main() {
    n := int64(30000000000) // 30e9
    sum := int64(0)
    beginTime := time.Now()
    nWorkers := 4
    sumChan := make(chan int64, nWorkers)
    for i := 0; i < nWorkers; i++ {
        go func() { sumChan <- Sum(n / int64(nWorkers)) }()
    }
    for i := 0; i < nWorkers; i++ {
        sum += <-sumChan
    }
    fmt.Println("dur:", time.Since(beginTime))
    fmt.Println("sum:", sum)
    // Results on Intel Core i5-8265U (nWorkers, dur):
    // (1, 8s), (2, 4s), (4, 2s), (8, 2s). Why do 8 CPUs still need 2s?
}
I ran the test on an i5-8265U, an 8-CPU processor
The i5-8265U is not an 8-core CPU; it's a 4-core, 8-thread CPU: it has 4 physical cores, and each core can run 2 threads concurrently via hyperthreading.
The "performance advantage" of HT depends on the workload and on the ability to "mix in" operations from one thread with the computations of another. This means that if your CPU is highly loaded, the hyper-threads may not be able to get more than a few % of the runtime, and thus not contribute much to the total performance.
Furthermore, the 8265U has a nominal frequency of 1.6GHz and a maximum turbo of 3.9GHz (3.7GHz on 4 cores). It's also possible that fully loading the CPU, including the hyper-threads, further lowers the "turbo ceiling". You'd have to check the cpuinfo state during the run to see.
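A quick way to see what Go itself reports; note that runtime.NumCPU counts logical CPUs (i.e. hyper-threads), not physical cores:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // On a 4-core/8-thread i5-8265U this prints 8, not 4.
    fmt.Println("logical CPUs:", runtime.NumCPU())
}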

Am I doing execution timing measurement in Go in a useful way?

My code:
// repeat fib(n) 10000 times
i := 10000
var total_time time.Duration
for i > 0 {
    // do fib(n) -> f0
    start := time.Now()
    for n > 0 {
        f0, f1, n = f1, f0.Add(f0, f1), n-1
    }
    total_time = total_time + time.Since(start)
    i--
}
// and divide total execution time by 10000
var normalized_time = total_time / 10000
fmt.Println(normalized_time)
The execution times I'm seeing are so extremely short that I am suspicious that what I've done isn't useful. If it's wrong, what am I doing wrong and how can I make it right?
what am I doing wrong and how can I make it right?
Use the Go testing package for benchmarks. For example:
Write the Fibonacci number computation as a function in your code.
fibonacci.go:
package main

import "fmt"

// fibonacci returns the Fibonacci number for 0 <= n <= 92.
// OEIS: A000045: Fibonacci numbers:
// F(n) = F(n-1) + F(n-2) with F(0) = 0 and F(1) = 1.
func fibonacci(n int) int64 {
    if n < 0 {
        panic("n < 0")
    }
    f := int64(0)
    a, b := int64(0), int64(1)
    for i := 0; i <= n; i++ {
        if a < 0 {
            panic("overflow")
        }
        f, a, b = a, b, a+b
    }
    return f
}

func main() {
    for _, n := range []int{0, 1, 2, 3, 90, 91, 92} {
        fmt.Printf("%-2d %d\n", n, fibonacci(n))
    }
}
Playground: https://play.golang.org/p/FFdG4RlNpUZ
Output:
$ go run fibonacci.go
0 0
1 1
2 1
3 2
90 2880067194370816120
91 4660046610375530309
92 7540113804746346429
$
Write and run some benchmarks using the Go testing package.
fibonacci_test.go:
package main

import "testing"

func BenchmarkFibonacciN0(b *testing.B) {
    for i := 0; i < b.N; i++ {
        fibonacci(0)
    }
}

func BenchmarkFibonacciN92(b *testing.B) {
    for i := 0; i < b.N; i++ {
        fibonacci(92)
    }
}
Output:
$ go test fibonacci.go fibonacci_test.go -bench=. -benchmem
goos: linux
goarch: amd64
BenchmarkFibonacciN0-4 367003574 3.25 ns/op 0 B/op 0 allocs/op
BenchmarkFibonacciN92-4 17369262 63.0 ns/op 0 B/op 0 allocs/op
$

Why this Go program is so slow?

I just read some short tutorials on Go and wrote a simple sieve program. It uses the sieve algorithm to print all the prime numbers smaller than 10000, and it creates a lot of goroutines. I got the correct results, but the program is very slow (5 seconds on my machine). I also wrote a Lua script and a Python script implementing the same algorithm, and both run a lot faster (about 1 second each on my machine).
Note that the purpose is to get an idea of goroutine performance compared with coroutines in other languages, for example Lua. The implementation is very inefficient; some comments pointed out that it's not the correct way to implement a Sieve of Eratosthenes. Yes, that's intentional. Some other replies pointed out that the slowness is caused by print I/O, so I commented out the print lines.
My question is: why is my sieve program implemented in Go so slow?
Here is the code:
package main

import (
    // "fmt" // only needed by the commented-out Printf lines below
    "sync"
)

type Sieve struct {
    id        int
    msg_queue chan int
    wg        *sync.WaitGroup
}

func NewSieve(id int) *Sieve {
    sieve := new(Sieve)
    sieve.id = id
    sieve.msg_queue = make(chan int)
    sieve.wg = new(sync.WaitGroup)
    sieve.wg.Add(1)
    return sieve
}

func (sieve *Sieve) run() {
    defer sieve.wg.Done()
    myprime := <-sieve.msg_queue
    if myprime == 0 {
        return
    }
    // fmt.Printf("Sieve (%d) is for prime number %d.\n", sieve.id, myprime)
    next_sieve := NewSieve(sieve.id + 1)
    go next_sieve.run()
    for {
        number := <-sieve.msg_queue
        if number == 0 {
            next_sieve.msg_queue <- number
            next_sieve.wg.Wait()
            return
        } else if number%myprime != 0 {
            // fmt.Printf("id: %d, number: %d, myprime: %d, number mod myprime: %d\n", sieve.id, number, myprime, number%myprime)
            next_sieve.msg_queue <- number
        }
    }
}

func driver() {
    first := NewSieve(2)
    go first.run()
    for n := 2; n <= 10000; n++ {
        first.msg_queue <- n
    }
    first.msg_queue <- 0
    first.wg.Wait()
}

func main() {
    driver()
}
As a comparison, here is the code of sieve.lua:
function sieve(id)
    local myprime = coroutine.yield()
    -- print(string.format("Sieve (%d) is for prime number %d", id, myprime))
    local next_sieve = coroutine.create(sieve)
    coroutine.resume(next_sieve, id + 1)
    while true do
        local number = coroutine.yield()
        if number % myprime ~= 0 then
            -- print(string.format("id: %d, number: %d, myprime: %d, number mod myprime: %d", id, number, myprime, number % myprime))
            coroutine.resume(next_sieve, number)
        end
    end
end

function driver()
    local first = coroutine.create(sieve)
    coroutine.resume(first, 2)
    for n = 2, 10000 do
        coroutine.resume(first, n)
    end
end

driver()
Meaningless microbenchmarks produce meaningless results.
You are timing print I/O.
You are incurring goroutine and channel overhead for a small amount of work.
Here is a prime number sieve program in Go.
Output:
$ go version
go version devel +46be01f4e0 Sun Oct 13 01:48:30 2019 +0000 linux/amd64
$ go build sumprimes.go && time ./sumprimes
5736396
29.96µs
real 0m0.001s
user 0m0.001s
sys 0m0.000s
sumprimes.go:
package main

import (
    "fmt"
    "time"
)

const (
    prime    = 0x00
    notprime = 0xFF
)

func oddPrimes(n uint64) (sieve []uint8) {
    sieve = make([]uint8, (n+1)/2)
    sieve[0] = notprime
    p := uint64(3)
    for i := p * p; i <= n; i = p * p {
        for j := i; j <= n; j += 2 * p {
            sieve[j/2] = notprime
        }
        for p += 2; sieve[p/2] == notprime; p += 2 {
        }
    }
    return sieve
}

func sumPrimes(n uint64) uint64 {
    sum := uint64(0)
    if n >= 2 {
        sum += 2
    }
    for i, p := range oddPrimes(n) {
        if p == prime {
            sum += 2*uint64(i) + 1
        }
    }
    return sum
}

func main() {
    start := time.Now()
    var n uint64 = 10000
    sum := sumPrimes(n)
    fmt.Println(sum)
    fmt.Println(time.Since(start))
}
Most of the time is spent in fmt.Printf.
Taking out the line:
fmt.Printf("id: %d, number: %d, myprime: %d, number mod myprime: %d\n", sieve.id, number, myprime, number%myprime)
reduces runtime from ~5.4 seconds to ~0.64 seconds on one test I ran.
Taking out the unnecessary sync.WaitGroups reduces the time a bit further, to ~0.48 seconds. See the version without sync.WaitGroup here. You're still doing a lot of channel operations, which languages with yield-value-from-coroutine operators do not need (though they have their own issues instead). This is not a good way to implement primality testing.
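The linked version isn't reproduced here; as a rough sketch of the same idea (all names are mine, not from the linked code), each stage can signal completion by closing a done channel instead of holding a sync.WaitGroup:

package main

// Hypothetical sketch: the sieve pipeline with the WaitGroups replaced by a
// done channel that each stage closes once it has drained its input.
func sieveStage(in <-chan int, done chan<- struct{}) {
    myprime, ok := <-in // the first number a stage sees is its prime
    if !ok {            // input closed before a prime arrived: end of chain
        close(done)
        return
    }
    out := make(chan int)
    childDone := make(chan struct{})
    go sieveStage(out, childDone)
    for n := range in {
        if n%myprime != 0 {
            out <- n
        }
    }
    close(out)  // propagate shutdown downstream
    <-childDone // wait for the rest of the chain to finish
    close(done)
}

func main() {
    first := make(chan int)
    done := make(chan struct{})
    go sieveStage(first, done)
    for n := 2; n <= 10000; n++ {
        first <- n
    }
    close(first)
    <-done
}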

Goroutine performance

I have started to learn Go; it's fun and easy. But working with goroutines I have seen little performance benefit.
If I sequentially add 1 million numbers two times in 2 functions:
package main

import (
    "fmt"
    "time"
)

var sumA int
var sumB int

func fSumA() {
    for i := 0; i < 1000000; i++ {
        sumA += i
    }
}

func fSumB() {
    for i := 0; i < 1000000; i++ {
        sumB += i
    }
}

func main() {
    start := time.Now()
    fSumA()
    fSumB()
    sum := sumA + sumB
    fmt.Println("Elapsed time", time.Since(start))
    fmt.Println("Sum", sum)
}
It takes 5 ms.
MacBook-Pro-de-Pedro:hello pedro$ ./bin/hello
Elapsed time 5.724406ms
Suma total 999999000000
MacBook-Pro-de-Pedro:hello pedro$ ./bin/hello
Elapsed time 5.358165ms
Suma total 999999000000
MacBook-Pro-de-Pedro:hello pedro$ ./bin/hello
Elapsed time 5.042528ms
Suma total 999999000000
MacBook-Pro-de-Pedro:hello pedro$ ./bin/hello
Elapsed time 5.469628ms
Suma total 999999000000
When I try to do the same thing with 2 goroutines:
package main

import (
    "fmt"
    "sync"
    "time"
)

var wg sync.WaitGroup
var sumA int
var sumB int

func fSumA() {
    for i := 0; i < 1000000; i++ {
        sumA += i
    }
    wg.Done()
}

func fSumB() {
    for i := 0; i < 1000000; i++ {
        sumB += i
    }
    wg.Done()
}

func main() {
    start := time.Now()
    wg.Add(2)
    go fSumA()
    go fSumB()
    wg.Wait()
    sum := sumA + sumB
    fmt.Println("Elapsed time", time.Since(start))
    fmt.Println("Sum", sum)
}
I get more or less the same result, 5 ms. My computer is a MacBook Pro (Core 2 Duo). I don't see any performance improvement. Maybe it's the processor?
MacBook-Pro-de-Pedro:hello pedro$ ./bin/hello
Elapsed time 5.258415ms
Suma total 999999000000
MacBook-Pro-de-Pedro:hello pedro$ ./bin/hello
Elapsed time 5.528498ms
Suma total 999999000000
MacBook-Pro-de-Pedro:hello pedro$ ./bin/hello
Elapsed time 5.273565ms
Suma total 999999000000
MacBook-Pro-de-Pedro:hello pedro$ ./bin/hello
Elapsed time 5.539224ms
Suma total 999999000000
Here is how you can test this with Go's own benchmark tool:
Create a test Go file (e.g. main_test.go).
Note: the file name has to end in _test.go!
Copy the following code there, or create your own benchmarks:
package main

import (
    "sync"
    "testing"
)

var GlobalInt int

func BenchmarkCount(b *testing.B) {
    var a, c int
    count(&a, b.N)
    count(&c, b.N)
    GlobalInt = a + c // make sure the result is actually used
}

func count(a *int, max int) {
    for i := 0; i < max; i++ {
        *a += i
    }
}

var wg sync.WaitGroup

func BenchmarkCountConcurrent(b *testing.B) {
    var a, c int
    wg.Add(2)
    go countCon(&a, b.N)
    go countCon(&c, b.N)
    wg.Wait()
    GlobalInt = a + c // make sure the result is actually used
}

func countCon(a *int, max int) {
    for i := 0; i < max; i++ {
        *a += i
    }
    wg.Done()
}
Run with:
go test -bench .
Result on my Mac:
$ go test -bench .
BenchmarkCount-8 500000000 3.50 ns/op
BenchmarkCountConcurrent-8 2000000000 1.98 ns/op
PASS
ok MyPath/MyPackage 6.309s
The most important value is the time/op; the smaller, the better. Here it is 3.50 ns/op for normal counting and 1.98 ns/op for concurrent counting.
EDIT:
Here you can read up on golang Testing and Benchmark.
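As an aside, the testing package also has a built-in helper, b.RunParallel, for spreading a benchmark body across goroutines. A minimal sketch (this variant keeps a per-goroutine counter, so it is a variation on, not a copy of, the code above; the names are mine):

import "sync/atomic" // in addition to the imports above

var parallelSink int64

func BenchmarkCountRunParallel(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        // Each goroutine owns its own counter, avoiding shared-memory traffic
        // between cores; pb.Next distributes the b.N iterations across them.
        local := int64(0)
        for pb.Next() {
            local++
        }
        atomic.AddInt64(&parallelSink, local) // keep the result observable
    })
}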
