Attempt to parallelize not fast enough - go

I read about Go's concurrency model and also saw about the difference between concurrency and parallelism. In order to test parallel execution, I wrote the following program.
package main
import (
"fmt"
"runtime"
"time"
)
const count = 1e8
var buffer [count]int
func main() {
fmt.Println("GOMAXPROCS: ", runtime.GOMAXPROCS(0))
// Initialise with dummy value
for i := 0; i < count; i++ {
buffer[i] = 3
}
// Sequential operation
now := time.Now()
worker(0, count-1)
fmt.Println("sequential operation: ", time.Since(now))
// Attempt to parallelize
ch := make(chan int, 1)
now = time.Now()
go func() {
worker(0, (count/2)-1)
ch <- 1
}()
worker(count/2, count-1)
<-ch
fmt.Println("parallel operation: ", time.Since(now))
}
func worker(start int, end int) {
for i := start; i <= end; i++ {
task(i)
}
}
func task(index int) {
buffer[index] = 2 * buffer[index]
}
But the problem is: the results are not very pleasing.
GOMAXPROCS: 8
sequential operation: 206.85ms
parallel operation: 169.028ms
Using a goroutine does speed things up but not enough. I expected it to be closer to being twice as fast. What is wrong with my code and/or understanding? And how can I get closer to being twice as fast?

Parallelization is powerful, but it's hard to see with such a small computational load. Here is some sample code with a larger difference in the result:
package main
import (
"fmt"
"math"
"runtime"
"time"
)
func calctest(nCPU int) {
fmt.Println("Routines:", nCPU)
ch := make(chan float64, nCPU)
startTime := time.Now()
a := 0.0
b := 1.0
n := 100000.0
deltax := (b - a) / n
stepPerCPU := n / float64(nCPU)
for start := 0.0; start < n; {
stop := start + stepPerCPU
go f(start, stop, a, deltax, ch)
start = stop
}
integral := 0.0
for i := 0; i < nCPU; i++ {
integral += <-ch
}
fmt.Println(time.Now().Sub(startTime))
fmt.Println(deltax * integral)
}
func f(start, stop, a, deltax float64, ch chan float64) {
result := 0.0
for i := start; i < stop; i++ {
result += math.Sqrt(a + deltax*(i+0.5))
}
ch <- result
}
func main() {
nCPU := runtime.NumCPU()
calctest(nCPU)
fmt.Println("")
calctest(1)
}
This is the result I get:
Routines: 8
853.181µs
Routines: 1
2.031358ms

Related

goroutine value return order [duplicate]

This question already has answers here:
Golang channel output order
(4 answers)
Closed 4 years ago.
Why following codes always return 2,1, not 1,2.
func test(x int, c chan int) {
c <- x
}
func main() {
c := make(chan int)
go test(1, c)
go test(2, c)
x, y := <-c, <-c // receive from c
fmt.Println(x, y)
}
If you want to know what the order is, then let your program include ordering information
This example uses a function closure to generate a sequence
The channel returns a struct of two numbers, one of which is a sequence order number
The sequence incrementer should be safe across go routines as there is a mutex lock on the sequence counter
package main
import (
"fmt"
"sync"
)
type value_with_order struct {
v int
order int
}
var (
mu sync.Mutex
)
func orgami(x int, c chan value_with_order, f func() int) {
v := new(value_with_order)
v.v = x
v.order = f()
c <- *v
}
func seq() func() int {
i := 0
return func() int {
mu.Lock()
defer mu.Unlock()
i++
return i
}
}
func main() {
c := make(chan value_with_order)
sequencer := seq()
for n := 0; n < 10; n++ {
go orgami(1, c, sequencer)
go orgami(2, c, sequencer)
go orgami(3, c, sequencer)
}
received := 0
for q := range c {
fmt.Printf("%v\n", q)
received++
if received == 30 {
close(c)
}
}
}
second version where the sequence is called from the main loop to make the sequence numbers come out in the order that the function is called
package main
import (
"fmt"
"sync"
)
type value_with_order struct {
v int
order int
}
var (
mu sync.Mutex
)
func orgami(x int, c chan value_with_order, seqno int) {
v := new(value_with_order)
v.v = x
v.order = seqno
c <- *v
}
func seq() func() int {
i := 0
return func() int {
mu.Lock()
defer mu.Unlock()
i++
return i
}
}
func main() {
c := make(chan value_with_order)
sequencer := seq()
for n := 0; n < 10; n++ {
go orgami(1, c, sequencer())
go orgami(2, c, sequencer())
go orgami(3, c, sequencer())
}
received := 0
for q := range c {
fmt.Printf("%v\n", q)
received++
if received == 30 {
close(c)
}
}
}

Why the result is not as expected with flag "-race"?

Why the result is not as expected with flag "-race" ?
It expected the same result: 1000000 - with flag "-race" and without this
https://gist.github.com/romanitalian/f403ceb6e492eaf6ba953cf67d5a22ff
package main
import (
"fmt"
"runtime"
"sync/atomic"
"time"
)
//$ go run -race main_atomic.go
//954203
//
//$ go run main_atomic.go
//1000000
type atomicCounter struct {
val int64
}
func (c *atomicCounter) Add(x int64) {
atomic.AddInt64(&c.val, x)
runtime.Gosched()
}
func (c *atomicCounter) Value() int64 {
return atomic.LoadInt64(&c.val)
}
func main() {
counter := atomicCounter{}
for i := 0; i < 100; i++ {
go func(no int) {
for i := 0; i < 10000; i++ {
counter.Add(1)
}
}(i)
}
time.Sleep(time.Second)
fmt.Println(counter.Value())
}
The reason why the result is not the same is because time.Sleep(time.Second) does not guarantee that all of your goroutines are going to be executed in the timespan of one second. Even if you execute go run main.go, it's not guaranteed that you will get the same result every time. You can test this out if you put time.Milisecond instead of time.Second, you will see much more inconsistent results.
Whatever value you put in the time.Sleep method, it does not guarantee that all of your goroutines will be executed, it just means that it's less likely that all of your goroutines won't finish in time.
For consistent results, you would want to synchronise your goroutines a bit. You can use WaitGroup or channels.
With WaitGroup:
//rest of the code above is the same
func main() {
counter := atomicCounter{}
var wg sync.WaitGroup
for i := 0; i < 100; i++ {
wg.Add(1)
go func(no int) {
for i := 0; i < 10000; i++ {
counter.Add(1)
}
wg.Done()
}(i)
}
wg.Wait()
fmt.Println(counter.Value())
}
With channels:
func main() {
valStream := make(chan int)
doneStream := make(chan int)
result := 0
for i := 0; i < 100; i++ {
go func() {
for i := 0; i < 10000; i++ {
valStream <- 1
}
doneStream <- 1
}()
}
go func() {
counter := 0
for count := range doneStream {
counter += count
if counter == 100 {
close(doneStream)
}
}
close(valStream)
}()
for val := range valStream {
result += val
}
fmt.Println(result)
}

confusing concurrency and performance issue in Go

now I start learning Go language by watching this great course. To be clear for years I write only on PHP and concurrency/parallelism is new for me, so I little confused by this.
In this course, there is a task to create a program which calculates factorial with 100 computations. I went a bit further and to comparing performance I changed it to 10000 and for some reason, the sequential program works same or even faster than concurrency.
Here I'm going to provide 3 solutions: mine, teachers and sequential
My solution:
package main
import (
"fmt"
)
func gen(steps int) <-chan int{
out := make(chan int)
go func() {
for j:= 0; j <steps; j++ {
out <- j
}
close(out)
}()
return out
}
func factorial(in <-chan int) <-chan int {
out := make(chan int)
go func() {
for n := range in {
out <- fact(n)
}
close(out)
}()
return out
}
func fact(n int) int {
total := 1
for i := n;i>0;i-- {
total *=i
}
return total
}
func main() {
steps := 10000
for i := 0; i < steps; i++ {
for n:= range factorial(gen(10)) {
fmt.Println(n)
}
}
}
execution time:
real 0m6,356s
user 0m3,885s
sys 0m0,870s
Teacher solution:
package main
import (
"fmt"
)
func gen(steps int) <-chan int{
out := make(chan int)
go func() {
for i := 0; i < steps; i++ {
for j:= 0; j <10; j++ {
out <- j
}
}
close(out)
}()
return out
}
func factorial(in <-chan int) <-chan int {
out := make(chan int)
go func() {
for n := range in {
out <- fact(n)
}
close(out)
}()
return out
}
func fact(n int) int {
total := 1
for i := n;i>0;i-- {
total *=i
}
return total
}
func main() {
steps := 10000
for n:= range factorial(gen(steps)) {
fmt.Println(n)
}
}
execution time:
real 0m2,836s
user 0m1,388s
sys 0m0,492s
Sequential:
package main
import (
"fmt"
)
func fact(n int) int {
total := 1
for i := n;i>0;i-- {
total *=i
}
return total
}
func main() {
steps := 10000
for i := 0; i < steps; i++ {
for j:= 0; j <10; j++ {
fmt.Println(fact(j))
}
}
}
execution time:
real 0m2,513s
user 0m1,113s
sys 0m0,387s
So, as you can see the sequential solution is fastest, teachers solution is in the second place and my solution is third.
First question: why the sequential solution is fastest?
And second why my solution is so slow? if I understanding correctly in my solution I'm creating 10000 goroutines inside gen and 10000 inside factorial and in teacher solution, he is creating only 1 goroutine in gen and 1 in factorial. My so slow because I'm creating too many unneeded goroutines?
It's the difference between concurrency and parellelism - your's, you teachers and the sequential are progressively less concurrent in design but how parallel they are depends on number of CPU cores and there is a set up and communication cost associated with concurrency. There are no asynchronous calls in the code so only parallelism will improve speed.
This is worth a look: https://blog.golang.org/concurrency-is-not-parallelism
Also, even with parallel cores speedup will be dependent on nature of the workload - google Amdahl's law for explanation.
Let's start with some fundamental benchmarks for factorial computation.
$ go test -run=! -bench=. factorial_test.go
goos: linux
goarch: amd64
BenchmarkFact0-4 1000000000 2.07 ns/op
BenchmarkFact9-4 300000000 4.37 ns/op
BenchmarkFact0To9-4 50000000 36.0 ns/op
BenchmarkFact10K0To9-4 3000 384069 ns/op
$
The CPU time is very small, even for 10,000 iterations of factorials zero through nine.
factorial_test.go:
package main
import "testing"
func fact(n int) int {
total := 1
for i := n; i > 0; i-- {
total *= i
}
return total
}
var sinkFact int
func BenchmarkFact0(b *testing.B) {
for N := 0; N < b.N; N++ {
j := 0
sinkFact = fact(j)
}
}
func BenchmarkFact9(b *testing.B) {
for N := 0; N < b.N; N++ {
j := 9
sinkFact = fact(j)
}
}
func BenchmarkFact0To9(b *testing.B) {
for N := 0; N < b.N; N++ {
for j := 0; j < 10; j++ {
sinkFact = fact(j)
}
}
}
func BenchmarkFact10K0To9(b *testing.B) {
for N := 0; N < b.N; N++ {
steps := 10000
for i := 0; i < steps; i++ {
for j := 0; j < 10; j++ {
sinkFact = fact(j)
}
}
}
}
Let's look at the time for the sequential program.
$ go build -a sequential.go && time ./sequential
real 0m0.247s
user 0m0.054s
sys 0m0.149s
Writing to the terminal is obviously a major bottleneck. Let's write to a sink.
$ go build -a sequential.go && time ./sequential > /dev/null
real 0m0.070s
user 0m0.049s
sys 0m0.020s
It's still a lot more than the 0m0.000000384069s for the factorial computation.
sequential.go:
package main
import (
"fmt"
)
func fact(n int) int {
total := 1
for i := n; i > 0; i-- {
total *= i
}
return total
}
func main() {
steps := 10000
for i := 0; i < steps; i++ {
for j := 0; j < 10; j++ {
fmt.Println(fact(j))
}
}
}
Attempts to use concurrency for such a trivial amount of parallel work are likely to fail. Go goroutines and channels are cheap, but they are not free. Also, a single channel and a single terminal are the bottleneck, the limiting factor, even when writing to a sink. See Amdahl's Law for parallel computing. See Concurrency is not parallelism.
$ go build -a teacher.go && time ./teacher > /dev/null
real 0m0.123s
user 0m0.123s
sys 0m0.022s
$ go build -a student.go && time ./student > /dev/null
real 0m0.135s
user 0m0.113s
sys 0m0.038s
teacher.go:
package main
import (
"fmt"
)
func gen(steps int) <-chan int {
out := make(chan int)
go func() {
for i := 0; i < steps; i++ {
for j := 0; j < 10; j++ {
out <- j
}
}
close(out)
}()
return out
}
func factorial(in <-chan int) <-chan int {
out := make(chan int)
go func() {
for n := range in {
out <- fact(n)
}
close(out)
}()
return out
}
func fact(n int) int {
total := 1
for i := n; i > 0; i-- {
total *= i
}
return total
}
func main() {
steps := 10000
for n := range factorial(gen(steps)) {
fmt.Println(n)
}
}
student.go:
package main
import (
"fmt"
)
func gen(steps int) <-chan int {
out := make(chan int)
go func() {
for j := 0; j < steps; j++ {
out <- j
}
close(out)
}()
return out
}
func factorial(in <-chan int) <-chan int {
out := make(chan int)
go func() {
for n := range in {
out <- fact(n)
}
close(out)
}()
return out
}
func fact(n int) int {
total := 1
for i := n; i > 0; i-- {
total *= i
}
return total
}
func main() {
steps := 10000
for i := 0; i < steps; i++ {
for n := range factorial(gen(10)) {
fmt.Println(n)
}
}
}

Concurrent integral calculation in Golang

I tried to compute the integral concurrently, but my program ended up being slower than computing the integral with a normal for loop. What am I doing wrong?
package main
import (
"fmt"
"math"
"sync"
"time"
)
type Result struct {
result float64
lock sync.RWMutex
}
var wg sync.WaitGroup
var result Result
func main() {
now := time.Now()
a := 0.0
b := 1.0
n := 100000.0
deltax := (b - a) / n
wg.Add(int(n))
for i := 0.0; i < n; i++ {
go f(a, deltax, i)
}
wg.Wait()
fmt.Println(deltax * result.result)
fmt.Println(time.Now().Sub(now))
}
func f(a float64, deltax float64, i float64) {
fx := math.Sqrt(a + deltax * (i + 0.5))
result.lock.Lock()
result.result += fx
result.lock.Unlock()
wg.Done()
}
3- For performance gain, you may divide tasks per CPU cores without using lock sync.RWMutex:
+30x Optimizations using channels and runtime.NumCPU(), this takes 2ms on 2 Cores and 993µs on 8 Cores, while Your sample code takes 61ms on 2 Cores and 40ms on 8 Cores:
See this working sample code and outputs:
package main
import (
"fmt"
"math"
"runtime"
"time"
)
func main() {
nCPU := runtime.NumCPU()
fmt.Println("nCPU =", nCPU)
ch := make(chan float64, nCPU)
startTime := time.Now()
a := 0.0
b := 1.0
n := 100000.0
deltax := (b - a) / n
stepPerCPU := n / float64(nCPU)
for start := 0.0; start < n; {
stop := start + stepPerCPU
go f(start, stop, a, deltax, ch)
start = stop
}
integral := 0.0
for i := 0; i < nCPU; i++ {
integral += <-ch
}
fmt.Println(time.Now().Sub(startTime))
fmt.Println(deltax * integral)
}
func f(start, stop, a, deltax float64, ch chan float64) {
result := 0.0
for i := start; i < stop; i++ {
result += math.Sqrt(a + deltax*(i+0.5))
}
ch <- result
}
Output on 2 Cores:
nCPU = 2
2.0001ms
0.6666666685900485
Output on 8 Cores:
nCPU = 8
993µs
0.6666666685900456
Your sample code, Output on 2 Cores:
0.6666666685900424
61.0035ms
Your sample code, Output on 8 Cores:
0.6666666685900415
40.9964ms
2- For good benchmark statistics, use large number of samples (big n):
As you See here using 2 Cores this takes 110ms on 2 Cores, but on this same CPU
using 1 Core this takes 215ms with n := 10000000.0:
With n := 10000000.0 and single goroutine, see this working sample code:
package main
import (
"fmt"
"math"
"time"
)
func main() {
now := time.Now()
a := 0.0
b := 1.0
n := 10000000.0
deltax := (b - a) / n
result := 0.0
for i := 0.0; i < n; i++ {
result += math.Sqrt(a + deltax*(i+0.5))
}
fmt.Println(time.Now().Sub(now))
fmt.Println(deltax * result)
}
Output:
215.0123ms
0.6666666666685884
With n := 10000000.0 and 2 goroutines, see this working sample code:
package main
import (
"fmt"
"math"
"runtime"
"time"
)
func main() {
nCPU := runtime.NumCPU()
fmt.Println("nCPU =", nCPU)
ch := make(chan float64, nCPU)
startTime := time.Now()
a := 0.0
b := 1.0
n := 10000000.0
deltax := (b - a) / n
stepPerCPU := n / float64(nCPU)
for start := 0.0; start < n; {
stop := start + stepPerCPU
go f(start, stop, a, deltax, ch)
start = stop
}
integral := 0.0
for i := 0; i < nCPU; i++ {
integral += <-ch
}
fmt.Println(time.Now().Sub(startTime))
fmt.Println(deltax * integral)
}
func f(start, stop, a, deltax float64, ch chan float64) {
result := 0.0
for i := start; i < stop; i++ {
result += math.Sqrt(a + deltax*(i+0.5))
}
ch <- result
}
Output:
nCPU = 2
110.0063ms
0.6666666666686073
1- There is an optimum point for number of Goroutines, And from this point forward increasing number of Goroutines doesn't reduce program execution time:
On 2 Cores CPU, with the following code, The result is:
nCPU: 1, 2, 4, 8, 16
Time: 2.1601236s, 1.1220642s, 1.1060633s, 1.1140637s, 1.1380651s
As you see from nCPU=1 to nCPU=2 time decrease is big enough but after this point it is not much, so nCPU=2 on 2 Cores CPU is Optimum point for this Sample code, so using nCPU := runtime.NumCPU() is enough here.
package main
import (
"fmt"
"math"
"time"
)
func main() {
nCPU := 2 //2.1601236s#1 1.1220642s#2 1.1060633s#4 1.1140637s#8 1.1380651s#16
fmt.Println("nCPU =", nCPU)
ch := make(chan float64, nCPU)
startTime := time.Now()
a := 0.0
b := 1.0
n := 100000000.0
deltax := (b - a) / n
stepPerCPU := n / float64(nCPU)
for start := 0.0; start < n; {
stop := start + stepPerCPU
go f(start, stop, a, deltax, ch)
start = stop
}
integral := 0.0
for i := 0; i < nCPU; i++ {
integral += <-ch
}
fmt.Println(time.Now().Sub(startTime))
fmt.Println(deltax * integral)
}
func f(start, stop, a, deltax float64, ch chan float64) {
result := 0.0
for i := start; i < stop; i++ {
result += math.Sqrt(a + deltax*(i+0.5))
}
ch <- result
}
Unless the time taken by the activity in the goroutine takes a lot more time than needed to switch contexts, carry out the task and use a mutex to update a value, it would be faster to do it serially.
Take a look at a slightly modified version. All I've done is add a delay of 1 microsecond in the f() function.
package main
import (
"fmt"
"math"
"sync"
"time"
)
type Result struct {
result float64
lock sync.RWMutex
}
var wg sync.WaitGroup
var result Result
func main() {
fmt.Println("concurrent")
concurrent()
result.result = 0
fmt.Println("serial")
serial()
}
func concurrent() {
now := time.Now()
a := 0.0
b := 1.0
n := 100000.0
deltax := (b - a) / n
wg.Add(int(n))
for i := 0.0; i < n; i++ {
go f(a, deltax, i, true)
}
wg.Wait()
fmt.Println(deltax * result.result)
fmt.Println(time.Now().Sub(now))
}
func serial() {
now := time.Now()
a := 0.0
b := 1.0
n := 100000.0
deltax := (b - a) / n
for i := 0.0; i < n; i++ {
f(a, deltax, i, false)
}
fmt.Println(deltax * result.result)
fmt.Println(time.Now().Sub(now))
}
func f(a, deltax, i float64, concurrent bool) {
time.Sleep(1 * time.Microsecond)
fx := math.Sqrt(a + deltax*(i+0.5))
if concurrent {
result.lock.Lock()
result.result += fx
result.lock.Unlock()
wg.Done()
} else {
result.result += fx
}
}
With the delay, the result was as follows (the concurrent version is much faster):
concurrent
0.6666666685900424
624.914165ms
serial
0.6666666685900422
5.609195767s
Without the delay:
concurrent
0.6666666685900428
50.771275ms
serial
0.6666666685900422
749.166µs
As you can see, the longer it takes to complete a task, the more sense it makes to do it concurrently, if possible.

Avoid panic, when trying to insert a value to a closed channel

package main
import (
"fmt"
"time"
)
func fib() chan int {
c := make(chan int)
go func() {
c <- 0
c <- 1
n, m := 0, 1
for {
temp := n + m
n = m
m = temp
c <- m // This results in panic, when the channel is closed
}
}()
return c
}
func main() {
start := time.Now()
var lastFib int
c := fib()
for i := 0; i != 1000000; i++ {
lastFib = <-c
}
close(c)
fmt.Println(lastFib)
fmt.Println(time.Now().Sub(start))
}
In the most idiomatic way, how would one avoid the panic in the goroutine, when the channel is closed? Or should i avoid using close at all?
I'm not looking into alternative methods (such as closures) to achieve the same thing, just trying to get a better understanding of channels.
Close is a good way for the goroutine sending into a channel to signal the receiving side that you are done with this channel. The other way around (your problem) is IMHO undoable, at least direct. You could add an other channel done which signal end of duty to your fibonacci generating goroutine.
Here is a modified version of your example that uses channels in an allowed (though not necessarily sensible) way:
package main
import (
"fmt"
"time"
)
func fib(c chan int) {
c <- 0
c <- 1
n, m := 0, 1
for {
temp := n + m
n = m
m = temp
c <- m
if m > 100000000 {
close(c)
break
}
}
}
func main() {
start := time.Now()
lastFib, newFib := 0, 0
ok := true
c := make(chan int)
go fib(c)
for {
newFib, ok = <-c
if !ok {
fmt.Println(lastFib)
break
}
lastFib = newFib
}
fmt.Println(time.Now().Sub(start))
}

Resources