Concurrent code slower than sequential code on parallel problem? [duplicate] - go

This question already has answers here:
Why does adding concurrency slow down this golang code?
(4 answers)
Closed 3 months ago.
I wrote some code to execute Monte Carlo simulations. The first thing I wrote was this sequential version:
func simulationSequential(experiment func() bool, numTrials int) float64 {
	occurrencesEvent := 0
	for trial := 0; trial < numTrials; trial++ {
		eventHappened := experiment()
		if eventHappened {
			occurrencesEvent++
		}
	}
	return float64(occurrencesEvent) / float64(numTrials)
}
Then, I figured I could run some of the experiments concurrently and get a result faster using my laptop's multiple cores. So, I wrote the following version:
func simulationConcurrent(experiment func() bool, numTrials, nGoroutines int) float64 {
	ch := make(chan int)
	var wg sync.WaitGroup
	// Launch work in multiple goroutines
	for i := 0; i < nGoroutines; i++ {
		wg.Add(1)
		go func() {
			localOccurrences := 0
			for j := 0; j < numTrials/nGoroutines; j++ {
				eventHappened := experiment()
				if eventHappened {
					localOccurrences++
				}
			}
			ch <- localOccurrences
			wg.Done()
		}()
	}
	// Close the channel when all the goroutines are done
	go func() {
		wg.Wait()
		close(ch)
	}()
	// Accumulate the results of each goroutine
	occurrencesEvent := 0
	for localOccurrences := range ch {
		occurrencesEvent += localOccurrences
	}
	return float64(occurrencesEvent) / float64(numTrials)
}
To my surprise, when I run benchmarks on the two versions, I find that the sequential version is faster than the concurrent one, and that the concurrent version gets better as I decrease the number of goroutines. Why does this happen? I thought the concurrent version would be faster, since this is a highly parallelizable problem.
Here is my benchmark's code:
func tossEqualToSix() bool {
	// Simulate the toss of a six-sided die
	roll := rand.Intn(6) + 1
	if roll != 6 {
		return false
	}
	return true
}

const (
	numsSimBenchmark       = 1_000_000
	numGoroutinesBenchmark = 10
)
func BenchmarkSimulationSequential(b *testing.B) {
	for i := 0; i < b.N; i++ {
		simulationSequential(tossEqualToSix, numsSimBenchmark)
	}
}

func BenchmarkSimulationConcurrent(b *testing.B) {
	for i := 0; i < b.N; i++ {
		simulationConcurrent(tossEqualToSix, numsSimBenchmark, numGoroutinesBenchmark)
	}
}
And the results:
goos: linux
goarch: amd64
pkg: github.com/jpuriol/montecarlo
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkSimulationSequential-8 36 30453588 ns/op
BenchmarkSimulationConcurrent-8 9 117462720 ns/op
PASS
ok github.com/jpuriol/montecarlo 2.478s
You can download my code from GitHub.

I thought I would elaborate on my comment and post it with code and benchmark results.
The experiment function uses the package-level functions of the rand package. Under the hood, these use the globalRand instance of rand.Rand; for example, func Intn(n int) int { return globalRand.Intn(n) }. Since the underlying generator is not safe for concurrent use, globalRand is instantiated in the following way:
/*
 * Top-level convenience functions
 */
var globalRand = New(&lockedSource{src: NewSource(1).(*rngSource)})

type lockedSource struct {
	lk  sync.Mutex
	src *rngSource
}

func (r *lockedSource) Int63() (n int64) {
	r.lk.Lock()
	n = r.src.Int63()
	r.lk.Unlock()
	return
}
...
This means that all invocations of rand.Intn are guarded by a single global lock. The consequence is that the experiment function effectively runs sequentially, because of the lock: no call to rand.Intn can start generating a random number before the previous call completes.
Here is the redesigned code. Each experiment function has its own random generator. The assumption is that a single experiment function is used by one goroutine, so it does not require lock protection.
package main

import (
	"math/rand"
	"sync"
	"testing"
	"time"
)

func simulationSequential(experimentFuncFactory func() func() bool, numTrials int) float64 {
	experiment := experimentFuncFactory()
	occurrencesEvent := 0
	for trial := 0; trial < numTrials; trial++ {
		eventHappened := experiment()
		if eventHappened {
			occurrencesEvent++
		}
	}
	return float64(occurrencesEvent) / float64(numTrials)
}

func simulationConcurrent(experimentFuncFactory func() func() bool, numTrials, nGoroutines int) float64 {
	ch := make(chan int)
	var wg sync.WaitGroup
	// Launch work in multiple goroutines
	for i := 0; i < nGoroutines; i++ {
		wg.Add(1)
		go func() {
			experiment := experimentFuncFactory()
			localOccurrences := 0
			for j := 0; j < numTrials/nGoroutines; j++ {
				eventHappened := experiment()
				if eventHappened {
					localOccurrences++
				}
			}
			ch <- localOccurrences
			wg.Done()
		}()
	}
	// Close the channel when all the goroutines are done
	go func() {
		wg.Wait()
		close(ch)
	}()
	// Accumulate the results of each goroutine
	occurrencesEvent := 0
	for localOccurrences := range ch {
		occurrencesEvent += localOccurrences
	}
	return float64(occurrencesEvent) / float64(numTrials)
}

func tossEqualToSix() func() bool {
	prng := rand.New(rand.NewSource(time.Now().UnixNano()))
	return func() bool {
		// Simulate the toss of a six-sided die
		roll := prng.Intn(6) + 1
		if roll != 6 {
			return false
		}
		return true
	}
}

const (
	numsSimBenchmark       = 5_000_000
	numGoroutinesBenchmark = 10
)

func BenchmarkSimulationSequential(b *testing.B) {
	for i := 0; i < b.N; i++ {
		simulationSequential(tossEqualToSix, numsSimBenchmark)
	}
}

func BenchmarkSimulationConcurrent(b *testing.B) {
	for i := 0; i < b.N; i++ {
		simulationConcurrent(tossEqualToSix, numsSimBenchmark, numGoroutinesBenchmark)
	}
}
Benchmark results are as follows:
goos: darwin
goarch: arm64
pkg: scratchpad
BenchmarkSimulationSequential-8 20 55142896 ns/op
BenchmarkSimulationConcurrent-8 82 12944360 ns/op
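As an aside (my addition, not part of the original answer): since Go 1.22 there is also math/rand/v2, whose top-level functions are safe for concurrent use and, as I understand it, do not funnel every goroutine through one global mutex, so per-goroutine generators may no longer be necessary. A minimal sketch, assuming Go 1.22+:

// import "math/rand/v2"

// tossEqualToSix using math/rand/v2; note Intn is renamed IntN there.
func tossEqualToSix() bool {
	roll := rand.IntN(6) + 1 // no single contended lock here
	return roll == 6
}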

Related

What is going on under the hood that makes this concurrent usage of a map racy?

In the example below, the race detector triggers an error. I am fine with that; but since the code does not change the keys (the map header, if I may call it that), I struggle to figure out the reason for the race. I simply don't understand what is going on under the hood that causes a race to be reported.
package main

import (
	"fmt"
	"sync"
)

// scores holds values incremented by multiple goroutines.
var scores = make(map[string]int)

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	scores["A"] = 0
	scores["B"] = 0
	go func() {
		for i := 0; i < 1000; i++ {
			// if _, ok := scores["A"]; !ok {
			//	scores["A"] = 1
			// } else {
			scores["A"]++
			// }
		}
		wg.Done()
	}()
	go func() {
		for i := 0; i < 1000; i++ {
			scores["B"]++
		}
		wg.Done()
	}()
	wg.Wait()
	fmt.Println("Final scores:", scores)
}
Map values are not addressable, so incrementing the integer values requires writing them back to the map itself.
The line
scores["A"]++
is equivalent to
tmp := scores["A"]
scores["A"] = tmp + 1
If you use a pointer to make the integer values addressable, and assign all the keys before the goroutines are dispatched, you can see there is no longer a race on the map itself:
var scores = make(map[string]*int)

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	scores["A"] = new(int)
	scores["B"] = new(int)
	go func() {
		for i := 0; i < 1000; i++ {
			(*scores["A"])++
		}
		wg.Done()
	}()
	go func() {
		for i := 0; i < 1000; i++ {
			(*scores["B"])++
		}
		wg.Done()
	}()
	wg.Wait()
	fmt.Println("Final scores:", scores)
}
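If you actually need concurrent writes to the same map, a minimal sketch guarding it with a sync.Mutex (my addition, not from the answer above; scores and the keys are the same as in the question):

var (
	mu     sync.Mutex
	scores = make(map[string]int)
)

// increment serializes the read-modify-write on the map,
// so the race detector stays quiet.
func increment(key string) {
	mu.Lock()
	scores[key]++
	mu.Unlock()
}

Each goroutine would then call increment("A") or increment("B") instead of touching the map directly.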

Why is the result not as expected with the "-race" flag?

Why is the result not as expected with the "-race" flag?
I expected the same result, 1000000, both with the "-race" flag and without it.
https://gist.github.com/romanitalian/f403ceb6e492eaf6ba953cf67d5a22ff
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

//$ go run -race main_atomic.go
//954203
//
//$ go run main_atomic.go
//1000000

type atomicCounter struct {
	val int64
}

func (c *atomicCounter) Add(x int64) {
	atomic.AddInt64(&c.val, x)
	runtime.Gosched()
}

func (c *atomicCounter) Value() int64 {
	return atomic.LoadInt64(&c.val)
}

func main() {
	counter := atomicCounter{}
	for i := 0; i < 100; i++ {
		go func(no int) {
			for i := 0; i < 10000; i++ {
				counter.Add(1)
			}
		}(i)
	}
	time.Sleep(time.Second)
	fmt.Println(counter.Value())
}
The reason the result is not the same is that time.Sleep(time.Second) does not guarantee that all of your goroutines finish within one second. On top of that, the race detector instruments every memory access and slows the program down considerably, which makes it even less likely that they do. Even plain go run main.go is not guaranteed to produce the same result every time. You can test this out: if you put time.Millisecond instead of time.Second, you will see much more inconsistent results.
Whatever value you pass to time.Sleep, it does not guarantee that all of your goroutines finish; it only makes it less likely that they won't finish in time.
For consistent results, you would want to synchronise your goroutines a bit. You can use WaitGroup or channels.
With WaitGroup:
// rest of the code above is the same
func main() {
	counter := atomicCounter{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(no int) {
			for i := 0; i < 10000; i++ {
				counter.Add(1)
			}
			wg.Done()
		}(i)
	}
	wg.Wait()
	fmt.Println(counter.Value())
}
With channels:
func main() {
	valStream := make(chan int)
	doneStream := make(chan int)
	result := 0
	for i := 0; i < 100; i++ {
		go func() {
			for i := 0; i < 10000; i++ {
				valStream <- 1
			}
			doneStream <- 1
		}()
	}
	go func() {
		counter := 0
		for count := range doneStream {
			counter += count
			if counter == 100 {
				close(doneStream)
			}
		}
		close(valStream)
	}()
	for val := range valStream {
		result += val
	}
	fmt.Println(result)
}
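A short note on how the channel version works, since it is less obvious: the collecting goroutine counts the 100 done signals, closes doneStream to end its own range loop, and only then closes valStream, which lets the final range in main terminate. The WaitGroup version expresses the same thing with far less machinery, which is why it is usually preferred for this pattern.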

confusing concurrency and performance issue in Go

I have just started learning Go by watching this great course. To be clear: for years I have written only PHP, and concurrency/parallelism is new to me, so I am a little confused by this.
In this course there is a task to create a program which calculates factorials with 100 computations. I went a bit further: to compare performance, I changed it to 10000, and for some reason the sequential program runs the same as, or even faster than, the concurrent one.
Here I provide three solutions: mine, the teacher's, and the sequential one.
My solution:
package main

import (
	"fmt"
)

func gen(steps int) <-chan int {
	out := make(chan int)
	go func() {
		for j := 0; j < steps; j++ {
			out <- j
		}
		close(out)
	}()
	return out
}

func factorial(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		for n := range in {
			out <- fact(n)
		}
		close(out)
	}()
	return out
}

func fact(n int) int {
	total := 1
	for i := n; i > 0; i-- {
		total *= i
	}
	return total
}

func main() {
	steps := 10000
	for i := 0; i < steps; i++ {
		for n := range factorial(gen(10)) {
			fmt.Println(n)
		}
	}
}
execution time:
real 0m6,356s
user 0m3,885s
sys 0m0,870s
Teacher's solution:
package main

import (
	"fmt"
)

func gen(steps int) <-chan int {
	out := make(chan int)
	go func() {
		for i := 0; i < steps; i++ {
			for j := 0; j < 10; j++ {
				out <- j
			}
		}
		close(out)
	}()
	return out
}

func factorial(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		for n := range in {
			out <- fact(n)
		}
		close(out)
	}()
	return out
}

func fact(n int) int {
	total := 1
	for i := n; i > 0; i-- {
		total *= i
	}
	return total
}

func main() {
	steps := 10000
	for n := range factorial(gen(steps)) {
		fmt.Println(n)
	}
}
execution time:
real 0m2,836s
user 0m1,388s
sys 0m0,492s
Sequential:
package main

import (
	"fmt"
)

func fact(n int) int {
	total := 1
	for i := n; i > 0; i-- {
		total *= i
	}
	return total
}

func main() {
	steps := 10000
	for i := 0; i < steps; i++ {
		for j := 0; j < 10; j++ {
			fmt.Println(fact(j))
		}
	}
}
execution time:
real 0m2,513s
user 0m1,113s
sys 0m0,387s
So, as you can see, the sequential solution is the fastest, the teacher's solution is in second place, and my solution is third.
First question: why is the sequential solution the fastest?
And second: why is my solution so slow? If I understand correctly, in my solution I create 10000 goroutines inside gen and 10000 inside factorial, while in the teacher's solution only 1 goroutine is created in gen and 1 in factorial. Is mine so slow because I create too many unneeded goroutines?
It's the difference between concurrency and parallelism: yours, your teacher's, and the sequential version are progressively less concurrent in design, but how parallel they actually run depends on the number of CPU cores, and there is a setup and communication cost associated with concurrency. There are no asynchronous calls in the code, so only parallelism will improve speed.
This is worth a look: https://blog.golang.org/concurrency-is-not-parallelism
Also, even with parallel cores, the speedup depends on the nature of the workload; see Amdahl's law for an explanation.
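For concreteness, Amdahl's law bounds the speedup of a program where only a fraction P of the work can be parallelized across N cores: speedup <= 1 / ((1 - P) + P/N). When the per-item work is as tiny as fact(j) here, the serial coordination overhead dominates no matter how many cores you add.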
Let's start with some fundamental benchmarks for factorial computation.
$ go test -run=! -bench=. factorial_test.go
goos: linux
goarch: amd64
BenchmarkFact0-4 1000000000 2.07 ns/op
BenchmarkFact9-4 300000000 4.37 ns/op
BenchmarkFact0To9-4 50000000 36.0 ns/op
BenchmarkFact10K0To9-4 3000 384069 ns/op
$
The CPU time is very small, even for 10,000 iterations of factorials zero through nine.
factorial_test.go:
package main

import "testing"

func fact(n int) int {
	total := 1
	for i := n; i > 0; i-- {
		total *= i
	}
	return total
}

var sinkFact int

func BenchmarkFact0(b *testing.B) {
	for N := 0; N < b.N; N++ {
		j := 0
		sinkFact = fact(j)
	}
}

func BenchmarkFact9(b *testing.B) {
	for N := 0; N < b.N; N++ {
		j := 9
		sinkFact = fact(j)
	}
}

func BenchmarkFact0To9(b *testing.B) {
	for N := 0; N < b.N; N++ {
		for j := 0; j < 10; j++ {
			sinkFact = fact(j)
		}
	}
}

func BenchmarkFact10K0To9(b *testing.B) {
	for N := 0; N < b.N; N++ {
		steps := 10000
		for i := 0; i < steps; i++ {
			for j := 0; j < 10; j++ {
				sinkFact = fact(j)
			}
		}
	}
}
Let's look at the time for the sequential program.
$ go build -a sequential.go && time ./sequential
real 0m0.247s
user 0m0.054s
sys 0m0.149s
Writing to the terminal is obviously a major bottleneck. Let's write to a sink.
$ go build -a sequential.go && time ./sequential > /dev/null
real 0m0.070s
user 0m0.049s
sys 0m0.020s
It's still a lot more than the 0m0.000000384069s for the factorial computation.
sequential.go:
package main

import (
	"fmt"
)

func fact(n int) int {
	total := 1
	for i := n; i > 0; i-- {
		total *= i
	}
	return total
}

func main() {
	steps := 10000
	for i := 0; i < steps; i++ {
		for j := 0; j < 10; j++ {
			fmt.Println(fact(j))
		}
	}
}
Attempts to use concurrency for such a trivial amount of parallel work are likely to fail. Go goroutines and channels are cheap, but they are not free. Also, a single channel and a single terminal are the bottleneck, the limiting factor, even when writing to a sink. See Amdahl's Law for parallel computing. See Concurrency is not parallelism.
$ go build -a teacher.go && time ./teacher > /dev/null
real 0m0.123s
user 0m0.123s
sys 0m0.022s
$ go build -a student.go && time ./student > /dev/null
real 0m0.135s
user 0m0.113s
sys 0m0.038s
teacher.go:
package main

import (
	"fmt"
)

func gen(steps int) <-chan int {
	out := make(chan int)
	go func() {
		for i := 0; i < steps; i++ {
			for j := 0; j < 10; j++ {
				out <- j
			}
		}
		close(out)
	}()
	return out
}

func factorial(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		for n := range in {
			out <- fact(n)
		}
		close(out)
	}()
	return out
}

func fact(n int) int {
	total := 1
	for i := n; i > 0; i-- {
		total *= i
	}
	return total
}

func main() {
	steps := 10000
	for n := range factorial(gen(steps)) {
		fmt.Println(n)
	}
}
student.go:
package main

import (
	"fmt"
)

func gen(steps int) <-chan int {
	out := make(chan int)
	go func() {
		for j := 0; j < steps; j++ {
			out <- j
		}
		close(out)
	}()
	return out
}

func factorial(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		for n := range in {
			out <- fact(n)
		}
		close(out)
	}()
	return out
}

func fact(n int) int {
	total := 1
	for i := n; i > 0; i-- {
		total *= i
	}
	return total
}

func main() {
	steps := 10000
	for i := 0; i < steps; i++ {
		for n := range factorial(gen(10)) {
			fmt.Println(n)
		}
	}
}
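One knob worth knowing about (my addition, not from the answers): unbuffered channels synchronize on every single value, so part of the cost measured above is pure handoff overhead. A hedged sketch of gen with a buffered channel, which lets the producer run ahead of the consumer:

func gen(steps int) <-chan int {
	// The buffer size is a guess; tune it for the workload.
	out := make(chan int, 1024)
	go func() {
		for i := 0; i < steps; i++ {
			for j := 0; j < 10; j++ {
				out <- j
			}
		}
		close(out)
	}()
	return out
}

This reduces, but does not eliminate, the per-value synchronization cost; the fundamental point about trivially small work units still stands.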

Go: channel many slow API queries into single SQL transaction

I wonder what the idiomatic way to do the following would be.
I have N slow API queries and one database connection. I want a buffered channel where the responses come in, and one database transaction that I use to write the data.
I could only come up with the following semaphore-based, made-up example:
func myFunc() {
	// 10 concurrent API calls
	sem := make(chan bool, 10)
	// A concurrency-safe map as buffer
	var myMap MyConcurrentMap
	for i := 0; i < N; i++ {
		sem <- true
		go func(i int) {
			defer func() { <-sem }()
			resp := slowAPICall(fmt.Sprintf("http://slow-api.me?%d", i))
			myMap.Put(resp)
		}(i)
	}
	for j := 0; j < cap(sem); j++ {
		sem <- true
	}
	tx, _ := db.Begin()
	for data := range myMap {
		tx.Exec("Insert data into database")
	}
	tx.Commit()
}
I am nearly sure there is a simpler, cleaner, and more proper solution, but it seems complicated for me to grasp.
EDIT:
Well, I came up with the following solution. This way I do not need the buffer map: once data arrives on the resp channel it is printed, or could be used to insert into a database. It works, and while I am still not sure everything is OK, at least there are no races.
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// Global WaitGroup
var wg sync.WaitGroup

func init() {
	// just for fun's sake, seed the generator
	rand.Seed(time.Now().UnixNano())
}

// Emulate a slow API call
func verySlowAPI(id int) int {
	n := rand.Intn(5)
	time.Sleep(time.Duration(n) * time.Second)
	return n
}

func main() {
	// Amount of tasks
	N := 100
	// Concurrency level
	concur := 10
	// Channel for tasks
	tasks := make(chan int, N)
	// Channel for responses
	resp := make(chan int, 10)
	// 10 concurrent goroutines
	wg.Add(concur)
	for i := 1; i <= concur; i++ {
		go worker(tasks, resp)
	}
	// Add tasks
	for i := 0; i < N; i++ {
		tasks <- i
	}
	// Collect data from the goroutines
	for i := 0; i < N; i++ {
		fmt.Printf("%d\n", <-resp)
	}
	// Close the tasks channel
	close(tasks)
	// Wait until all workers finish
	wg.Wait()
}

func worker(tasks chan int, resp chan<- int) {
	defer wg.Done()
	for t := range tasks {
		resp <- verySlowAPI(t)
	}
}
There's no need to use a channel as a semaphore; sync.WaitGroup was made for waiting for a set of goroutines to complete.
If you're using the channel to limit throughput, you're better off with a worker pool, using the channel to pass jobs to the workers:
type job struct {
	i int
}

func myFunc(N int) {
	// Adjust as needed for the total number of tasks
	work := make(chan job, 10)
	// res being whatever type slowAPICall returns
	results := make(chan res, 10)
	resBuff := make([]res, 0, N)
	wg := new(sync.WaitGroup)
	// 10 concurrent API calls
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			for j := range work {
				resp := slowAPICall(fmt.Sprintf("http://slow-api.me?%d", j.i))
				results <- resp
			}
			wg.Done()
		}()
	}
	go func() {
		for r := range results {
			resBuff = append(resBuff, r)
		}
	}()
	for i := 0; i < N; i++ {
		work <- job{i}
	}
	close(work)
	wg.Wait()
	close(results)
}
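One caveat with the sketch above: nothing waits for the collecting goroutine before myFunc returns, so reading resBuff afterwards would race. To feed the single transaction the question asks for, you could give the collector a done signal and then write everything in one go. A sketch, assuming db is an open *sql.DB and a placeholder INSERT statement (both hypothetical, matching the question's pseudocode):

// replace the collector goroutine above with:
collectorDone := make(chan struct{})
go func() {
	for r := range results {
		resBuff = append(resBuff, r)
	}
	close(collectorDone)
}()

// ...and after wg.Wait() and close(results):
<-collectorDone
tx, err := db.Begin()
if err != nil {
	return
}
for range resBuff {
	tx.Exec("Insert data into database") // placeholder SQL
}
tx.Commit()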
Maybe this will work for you. Now you can get rid of your concurrent map. Here is a code snippet:
func myFunc() {
	// 10 concurrent API calls
	sem := make(chan bool, 10)
	respCh := make(chan YOUR_RESP_TYPE, 10)
	var responses []YOUR_RESP_TYPE
	for i := 0; i < N; i++ {
		sem <- true
		go func(i int) {
			defer func() {
				<-sem
			}()
			resp := slowAPICall(fmt.Sprintf("http://slow-api.me?%d", i))
			respCh <- resp
		}(i)
	}
	respCollected := make(chan struct{})
	go func() {
		for i := 0; i < N; i++ {
			responses = append(responses, <-respCh)
		}
		close(respCollected)
	}()
	<-respCollected
	tx, _ := db.Begin()
	for _, data := range responses {
		tx.Exec("Insert data into database")
	}
	tx.Commit()
}
Then we use one more goroutine to collect all the responses from the response channel into a slice (or map).
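For what it's worth, on recent Go a bounded worker pool like this is often written with golang.org/x/sync/errgroup and its SetLimit method. A sketch under the same placeholder types as above (YOUR_RESP_TYPE and slowAPICall are stand-ins):

// import "golang.org/x/sync/errgroup"

func fetchAll(n int) []YOUR_RESP_TYPE {
	responses := make([]YOUR_RESP_TYPE, n)
	var g errgroup.Group
	g.SetLimit(10) // at most 10 API calls in flight
	for i := 0; i < n; i++ {
		i := i // capture the loop variable (not needed since Go 1.22)
		g.Go(func() error {
			// each goroutine writes a distinct index, so no lock is needed
			responses[i] = slowAPICall(fmt.Sprintf("http://slow-api.me?%d", i))
			return nil
		})
	}
	g.Wait()
	return responses
}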

Going mutex-less

Alright, Go "experts". How would you write this code in idiomatic Go, i.e. without a mutex in next?
package main

import (
	"fmt"
)

func main() {
	done := make(chan int)
	x := 0
	for i := 0; i < 10; i++ {
		go func() {
			y := next(&x)
			fmt.Println(y)
			done <- 0
		}()
	}
	for i := 0; i < 10; i++ {
		<-done
	}
	fmt.Println(x)
}

var mutex = make(chan int, 1)

func next(p *int) int {
	mutex <- 0
	// critical section BEGIN
	x := *p
	*p++
	// critical section END
	<-mutex
	return x
}
Assume you can't have two goroutines in the critical section at the same time, or else bad things will happen.
My first guess is to have a separate goroutine to handle the state, but I can't figure out a way to match up inputs / outputs.
You would use an actual sync.Mutex:
var mutex sync.Mutex
func next(p *int) int {
mutex.Lock()
defer mutex.Unlock()
x := *p
*p++
return x
}
Though you would probably also group the next functionality, its state, and the sync.Mutex into a single struct.
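A minimal sketch of that grouping (my code, not the answerer's):

type counter struct {
	mu sync.Mutex
	n  int
}

// next returns the current value and increments it, all under the lock.
func (c *counter) next() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	x := c.n
	c.n++
	return x
}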
Though there's no reason to do so in this case (a Mutex is better suited for mutual exclusion around a single resource), you can use goroutines and channels to achieve the same effect:
http://play.golang.org/p/RR4TQXf2ct
x := 0
var wg sync.WaitGroup
send := make(chan *int)
recv := make(chan int)
go func() {
	for i := range send {
		x := *i
		*i++
		recv <- x
	}
}()
for i := 0; i < 10; i++ {
	wg.Add(1)
	go func() {
		defer wg.Done()
		send <- &x
		fmt.Println(<-recv)
	}()
}
wg.Wait()
fmt.Println(x)
As @favoretti mentioned, sync/atomic is a way to do it.
But you have to use int32 or int64 rather than int, since int can have different sizes on different platforms.
Here's an example on Playground
package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	done := make(chan int)
	x := int64(0)
	for i := 0; i < 10; i++ {
		go func() {
			y := next(&x)
			fmt.Println(y)
			done <- 0
		}()
	}
	for i := 0; i < 10; i++ {
		<-done
	}
	fmt.Println(x)
}

func next(p *int64) int64 {
	return atomic.AddInt64(p, 1) - 1
}
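Since Go 1.19, sync/atomic also has typed atomics, which sidestep the int32/int64 pointer plumbing entirely. A sketch of the same counter with atomic.Int64:

var x atomic.Int64

func next() int64 {
	// Add returns the new value, so subtract 1 to get the pre-increment value.
	return x.Add(1) - 1
}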
