goroutine blocking and non-blocking usage - go

I am trying to understand how goroutines work. Here is some code:
//parallelSum.go
func sum(a []int, c chan<- int, func_id string) {
sum := 0
for _, n := range a {
sum += n
}
log.Printf("func_id %v is DONE!", func_id)
c <- sum
}
func main() {
ELEM_COUNT := 10000000
test_arr := make([]int, ELEM_COUNT)
for i := 0; i < ELEM_COUNT; i++ {
test_arr[i] = i * 2
}
c1 := make(chan int)
c2 := make(chan int)
go sum(test_arr[:len(test_arr)/2], c1, "1")
go sum(test_arr[len(test_arr)/2:], c2, "2")
x := <-c1
y := <-c2
//x, y := <-c, <-c
log.Printf("x= %v, y = %v, sum = %v", x, y, x+y)
}
The above program runs fine and returns the output. I have an iterative version of the same program:
//iterSum.go
func sumIter(a []int, c *int, func_id string) {
sum := 0
log.Printf("entered the func %s", func_id)
for _, n := range a {
sum += n
}
log.Printf("func_id %v is DONE!", func_id)
*c = sum
}
func main() {
ELEM_COUNT := 10000000
test_arr := make([]int, ELEM_COUNT)
for i := 0; i < ELEM_COUNT; i++ {
test_arr[i] = i * 2
}
var (
i1 int
i2 int
)
sumIter(test_arr[:len(test_arr)/2], &i1, "1")
sumIter(test_arr[len(test_arr)/2:], &i2, "2")
x := i1
y := i2
log.Printf("x= %v, y = %v, sum = %v", x, y, x+y)
}
I ran each program 20 times and averaged the run times. The averages are almost equal. Shouldn't parallelizing make things faster? What am I doing wrong?
Here is the Python program that runs each one 20 times:
import json
import shlex
import subprocess
import time
iterCmd = 'go run iterSum.go'
parallelCmd = 'go run parallelSum.go'
runCount = 20
def analyzeCmd(cmd, runCount):
    runData = []
    print("running cmd (%s) for (%s) times" % (cmd, runCount))
    for i in range(runCount):
        start_time = time.time()
        cmd_out = subprocess.check_call(shlex.split(cmd))
        run_time = time.time() - start_time
        curr_data = {'iteration': i, 'run_time': run_time}
        runData.append(curr_data)
    return runData
iterOut = analyzeCmd(iterCmd, runCount)
parallelOut = analyzeCmd(parallelCmd, runCount)
print("iter cmd data -->")
print(iterOut)
with open('iterResults.json', 'w') as f:
    json.dump(iterOut, f)
print("parallel cmd data -->")
print(parallelOut)
with open('parallelResults.json', 'w') as f:
    json.dump(parallelOut, f)
avg = lambda results: sum(i['run_time'] for i in results) / len(results)
print("average time for iterSum = %3.2f" % (avg(iterOut)))
print("average time for parallelSum = %3.2f" % (avg(parallelOut)))
Here is the output of one run:
average time for iterSum = 0.27
average time for parallelSum = 0.29

So, several problems here. First, your channels aren't buffered in the concurrent example, which means the sends and receives may still have to wait a bit on each other. Second, concurrent doesn't mean parallel. Are you sure these are actually running in parallel and not simply being scheduled on the same OS thread?
That said, your main problem here is that your Python code is using go run for each iteration, which means the vast majority of your recorded "run time" is actually spent compiling and linking your code (go run builds the specified file and then runs it, every time it is invoked). If you want to measure run time, use Go's benchmark system (the testing package and go test -bench), not your own cobbled-together version. You'll get far more accurate results. For example, beyond the compilation overhead, your current setup has no way to tell how much overhead the Python harness and process startup are adding.
Oh, and you should get out of the habit of using reference arguments to functions as a way to "return" values. Go supports multiple returns, so the C style of modifying arguments in-place is generally considered an anti-pattern unless there's a really compelling reason to do it.

Related

I got a memory surge using goroutines for parallel computation, eventually OOM

I just started to learn Go. I am trying to do a sum calculation in Go. The code is below:
func test() {
start := time.Now()
ret := make(chan int)
go foo(1, 100000, ret)
ssum := <- ret
elap := time.Since(start)
fmt.Println(ssum)
fmt.Printf("used time in milli is %d", elap.Milliseconds())
}
func foo(start, end int, ret chan int) {
if start > end {
ret <- 0
return
}
if end - start <= 10000 {
sum := 0
for i := start; i <=end; i++ {
sum += i
}
ret <- sum
return
}
mid := (end - start) / 2
leftRet := make(chan int)
go foo(start, mid, leftRet)
leftNum := <- leftRet
rightRet := make(chan int)
go foo(mid+1, end, rightRet)
rightNum := <- rightRet
ret <- leftNum + rightNum
}
Does the code above do parallel computing? Goroutines are not multiple processes, and not even multiple threads, so I am not sure whether goroutines can be used for parallel computing.
Why did I get a memory surge and OOM?
This is not doing anything in parallel. You create a goroutine, and immediately wait for its return value, where each goroutine performs a computation, writes the result to a channel and returns. So there is no parallelism there. You might get some concurrent execution if you move the channel reads to the line after all goroutine creation, so two goroutines can run.
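Your program is not correct, and that's why it is not terminating. mid is not (end-start)/2, it is (end+start)/2. With the wrong formula, mid can be smaller than start, so whenever end-start is larger than 10000 the right-hand call can end up running with the same start and end values over and over again. For instance, foo(12500, 37499) computes mid = 12499 and then calls foo(12500, 37499) again. The recursion never ends, and every level leaves a blocked goroutine behind, which is where the memory goes. Put a println statement after the mid computation to see what the start and end values are.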

Why is this Go program so slow?

I just read some short Go tutorials and wrote a simple sieve program. It uses a sieve algorithm to print all the prime numbers smaller than 10000, and creates a lot of goroutines in the process. I got the correct results, but the program is very slow (5 seconds on my machine). I also wrote a Lua script and a Python script that implement the same algorithm, and they run a lot faster (both take about 1 second on my machine).
Note that the purpose is to get an idea of goroutine performance compared with coroutines in other languages, for example Lua. The implementation is very inefficient, and some comments pointed out that it is not the correct way to implement the Sieve of Eratosthenes. Yes, that's intentional. Some other replies pointed out that the slowness is caused by print I/O, so I commented out the print lines.
My question is: why is my sieve program implemented in Go so slow?
Here is the code:
package main
import (
"fmt"
"sync"
)
type Sieve struct {
id int;
msg_queue chan int;
wg *sync.WaitGroup;
}
func NewSieve(id int) *Sieve {
sieve := new(Sieve)
sieve.id = id
sieve.msg_queue = make(chan int)
sieve.wg = new(sync.WaitGroup)
sieve.wg.Add(1)
return sieve
}
func (sieve *Sieve) run() {
defer sieve.wg.Done()
myprime := <-sieve.msg_queue
if myprime == 0 {
return
}
// fmt.Printf("Sieve (%d) is for prime number %d.\n", sieve.id, myprime)
next_sieve := NewSieve(sieve.id + 1)
go next_sieve.run()
for {
number := <-sieve.msg_queue
if number == 0 {
next_sieve.msg_queue <- number;
next_sieve.wg.Wait()
return
} else if number % myprime != 0 {
// fmt.Printf("id: %d, number: %d, myprime: %d, number mod myprime: %d\n", sieve.id, number, myprime, number % myprime)
next_sieve.msg_queue <- number
}
}
}
func driver() {
first := NewSieve(2)
go first.run()
for n := 2; n <= 10000; n++ {
first.msg_queue <- n
}
first.msg_queue <- 0
first.wg.Wait()
}
func main() {
driver()
}
As a comparison, here is the code of sieve.lua
function sieve(id)
local myprime = coroutine.yield()
-- print(string.format("Sieve (%d) is for prime number %d", id, myprime))
local next_sieve = coroutine.create(sieve)
coroutine.resume(next_sieve, id + 1)
while true do
local number = coroutine.yield()
if number % myprime ~= 0 then
-- print(string.format("id: %d, number: %d, myprime: %d, number mod myprime: %d", id, number, myprime, number % myprime))
coroutine.resume(next_sieve, number)
end
end
end
function driver()
local first = coroutine.create(sieve)
coroutine.resume(first, 2)
local n
for n = 2, 10000 do
coroutine.resume(first, n)
end
end
driver()
Meaningless microbenchmarks produce meaningless results.
You are timing print I/O.
You are incurring goroutine and channel overhead for a small amount of work.
Here is a prime number sieve program in Go.
Output:
$ go version
go version devel +46be01f4e0 Sun Oct 13 01:48:30 2019 +0000 linux/amd64
$ go build sumprimes.go && time ./sumprimes
5736396
29.96µs
real 0m0.001s
user 0m0.001s
sys 0m0.000s
sumprimes.go:
package main
import (
"fmt"
"time"
)
const (
prime = 0x00
notprime = 0xFF
)
func oddPrimes(n uint64) (sieve []uint8) {
sieve = make([]uint8, (n+1)/2)
sieve[0] = notprime
p := uint64(3)
for i := p * p; i <= n; i = p * p {
for j := i; j <= n; j += 2 * p {
sieve[j/2] = notprime
}
for p += 2; sieve[p/2] == notprime; p += 2 {
}
}
return sieve
}
func sumPrimes(n uint64) uint64 {
sum := uint64(0)
if n >= 2 {
sum += 2
}
for i, p := range oddPrimes(n) {
if p == prime {
sum += 2*uint64(i) + 1
}
}
return sum
}
func main() {
start := time.Now()
var n uint64 = 10000
sum := sumPrimes(n)
fmt.Println(sum)
fmt.Println(time.Since(start))
}
Most of the time is spent in fmt.Printf.
Taking out the line:
fmt.Printf("id: %d, number: %d, myprime: %d, number mod myprime: %d\n", sieve.id, number, myprime, number%myprime)
reduces runtime from ~5.4 seconds to ~0.64 seconds on one test I ran.
Taking out the unnecessary sync.WaitGroups reduces the time a bit further, to ~0.48 seconds. See the version without sync.WaitGroup here. You're still doing a lot of channel operations, which languages with yield-value-from-coroutine operators do not need (though they have their own issues instead). This is not a good way to implement primality testing.

How to send off goroutines in a worker pool

I'm writing an algorithm to break down an image into segments and manipulate it, but the way I'm currently using goroutines isn't quite optimal.
I'd like to split it into a worker pool, firing off goroutines and having each worker take a new job until the image is completed.
I have it split into 8 as such:
var bounds = img.Bounds()
var halfHeight = bounds.Max.Y / 2
var eighthOne = halfHeight / 4
var eighthTwo = eighthOne + eighthOne
var eighthThree = eighthOne + eighthTwo
var eighthFive = halfHeight + eighthOne
var eighthSix = halfHeight + eighthTwo
var eighthSeven = halfHeight + eighthThree
elapsed := time.Now()
go Threshold(pic, c2, 0, eighthOne)
go Threshold(pic, c5, eighthOne, eighthTwo)
go Threshold(pic, c6, eighthTwo, eighthThree)
go Threshold(pic, c7, eighthThree, halfHeight)
go Threshold(pic, c8, halfHeight, eighthFive)
go Threshold(pic, c9, eighthFive, eighthSix)
go Threshold(pic, c10, eighthSix, eighthSeven)
go Threshold(pic, c11, eighthSeven, bounds.Max.Y)
From there I fire off the goroutines one after another. How can I turn this into a worker system?
Thanks
Here is a generic pattern for implementing concurrent image processors. It gives the caller control over how the image is partitioned (splitting the work into n parts) and over the concurrency level (the number of worker goroutines that execute the processing jobs, which may differ from the number of partitions).
See the pprocess func, which implements the whole pattern. It takes a Partitioner and a Processor: the former is a func that returns the n image partitions to operate on, and the latter is a func that processes each partition.
I implemented the splitting you expressed in your code example in the func splitVert, which returns a function that splits an image into n sections along the vertical axis.
To do some actual work I implemented the gray func, a Processor that transforms pixel colors to gray levels (luminance).
Here's the working code:
type MutableImage interface {
image.Image
Set(x, y int, c color.Color)
}
type Processor func(MutableImage, image.Rectangle)
type Partitioner func(image.Image) []image.Rectangle
func pprocess(i image.Image, concurrency int, part Partitioner, proc Processor) image.Image {
m := image.NewRGBA(i.Bounds())
draw.Draw(m, i.Bounds(), i, i.Bounds().Min, draw.Src)
var wg sync.WaitGroup
c := make(chan image.Rectangle, concurrency*2)
for n := 0; n < concurrency; n++ {
wg.Add(1)
go func() {
for r := range c {
proc(m, r)
}
wg.Done()
}()
}
for _, p := range part(i) {
c <- p
}
close(c)
wg.Wait()
return m
}
func gray(i MutableImage, r image.Rectangle) {
for x := r.Min.X; x <= r.Max.X; x++ {
for y := r.Min.Y; y <= r.Max.Y; y++ {
c := i.At(x, y)
r, g, b, _ := c.RGBA()
l := 0.299*float64(r) + 0.587*float64(g) + 0.114*float64(b)
i.Set(x, y, color.Gray{uint8(l / 256)})
}
}
}
func splitVert(c int) Partitioner {
return func(i image.Image) []image.Rectangle {
b := i.Bounds()
s := float64(b.Dy()) / float64(c)
rs := make([]image.Rectangle, c)
for n := 0; n < c; n++ {
m := float64(n)
x0 := b.Min.X
y0 := b.Min.Y + int(0.5+m*s)
x1 := b.Max.X
y1 := b.Min.Y + int(0.5+(m+1)*s)
if n < c-1 {
y1--
}
rs[n] = image.Rect(x0, y0, x1, y1)
}
return rs
}
}
func main() {
i, err := jpeg.Decode(os.Stdin)
if err != nil {
log.Fatalf("decoding image: %v", err)
}
o := pprocess(i, runtime.NumCPU(), splitVert(8), gray)
err = jpeg.Encode(os.Stdout, o, nil)
if err != nil {
log.Fatalf("encoding image: %v", err)
}
}

Saving results from a parallelized goroutine

I am trying to parallelize an operation in Go and save the results in a way that lets me iterate over them and sum them up afterwards.
I have managed to set up the parameters so that no deadlock occurs, and I have confirmed that the operations are working and being saved correctly within the function. When I iterate over the slice of my struct and try to sum up the results of the operation, they all remain 0. I have tried passing by reference, with pointers, and with channels (which causes a deadlock).
I have only found this example for help: https://golang.org/doc/effective_go.html#parallel. But this seems outdated now, as Vector has been deprecated? I also have not found any references explaining how this function (in the example) was constructed (with the func (u Vector) before the name). I tried replacing this with a slice but got compile-time errors.
Any help would be much appreciated. Here are the key parts of my code:
type Job struct {
a int
b int
result *big.Int
}
func choose(jobs []Job, c chan int) {
temp := new(big.Int)
for _,job := range jobs {
job.result = //perform operation on job.a and job.b
//fmt.Println(job.result)
}
c <- 1
}
func main() {
num := 100 //can be very large (why we need big.Int)
n := num
k := 0
const numCPU = 6 //runtime.NumCPU
count := new(big.Int)
// create a 2d slice of jobs, one for each core
jobs := make([][]Job, numCPU)
for (float64(k) <= math.Ceil(float64(num / 2))) {
// add one job to each core, alternating so that
// job set is similar in difficulty
for i := 0; i < numCPU; i++ {
if !(float64(k) <= math.Ceil(float64(num / 2))) {
break
}
jobs[i] = append(jobs[i], Job{n, k, new(big.Int)})
n -= 1
k += 1
}
}
c := make(chan int, numCPU)
for i := 0; i < numCPU; i++ {
go choose(jobs[i], c)
}
// drain the channel
for i := 0; i < numCPU; i++ {
<-c
}
// computations are done
for i := range jobs {
for _,job := range jobs[i] {
//fmt.Println(job.result)
count.Add(count, job.result)
}
}
fmt.Println(count)
}
Here is the code running on the go playground https://play.golang.org/p/X5IYaG36U-
As long as the []Job slice is only modified by one goroutine at a time, there's no reason you can't modify the job in place.
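for i, job := range jobs {
// give each job its own big.Int; Binomial stores into and returns its receiver,
// so reusing one temp would make every result point at the same value
jobs[i].result = new(big.Int).Binomial(int64(job.a), int64(job.b))
}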
https://play.golang.org/p/CcEGsa1fLh
You should also use a WaitGroup, rather than rely on counting tokens in a channel yourself.

GoLang - Sequential vs Concurrent

I have two versions of a factorial program: concurrent and sequential.
Both programs calculate the factorial of 10 a total of 1000000 times.
Factorial Concurrent Processing
package main
import (
"fmt"
//"math/rand"
"sync"
"time"
//"runtime"
)
func main() {
start := time.Now()
printFact(fact(gen(1000000)))
fmt.Println("Current Time:", time.Now(), "Start Time:", start, "Elapsed Time:", time.Since(start))
panic("Error Stack!")
}
func gen(n int) <-chan int {
c := make(chan int)
go func() {
for i := 0; i < n; i++ {
//c <- rand.Intn(10) + 1
c <- 10
}
close(c)
}()
return c
}
func fact(in <-chan int) <-chan int {
out := make(chan int)
var wg sync.WaitGroup
for n := range in {
wg.Add(1)
go func(n int) {
//temp := 1
//for i := n; i > 0; i-- {
// temp *= i
//}
temp := calcFact(n)
out <- temp
wg.Done()
}(n)
}
go func() {
wg.Wait()
close(out)
}()
return out
}
func printFact(in <-chan int) {
//for n := range in {
// fmt.Println("The random Factorial is:", n)
//}
var i int
for range in {
i ++
}
fmt.Println("Count:" , i)
}
func calcFact(c int) int {
if c == 0 {
return 1
} else {
return calcFact(c-1) * c
}
}
//###End of Factorial Concurrent
Factorial Sequential Processing
package main
import (
"fmt"
//"math/rand"
"time"
//"runtime"
)
func main() {
start := time.Now()
//for _, n := range factorial(gen(10000)...) {
// fmt.Println("The random Factorial is:", n)
//}
var i int
for range factorial(gen(1000000)...) {
i++
}
fmt.Println("Count:" , i)
fmt.Println("Current Time:", time.Now(), "Start Time:", start, "Elapsed Time:", time.Since(start))
}
func gen(n int) []int {
var out []int
for i := 0; i < n; i++ {
//out = append(out, rand.Intn(10)+1)
out = append(out, 10)
}
println(len(out))
return out
}
func factorial(val ...int) []int {
var out []int
for _, n := range val {
fa := calcFact(n)
out = append(out, fa)
}
return out
}
func calcFact(c int) int {
if c == 0 {
return 1
} else {
return calcFact(c-1) * c
}
}
//###End of Factorial sequential processing
My assumption was that concurrent processing would be faster than sequential, but the sequential version runs faster than the concurrent one on my Windows machine.
I am using an 8-core i7 with 32 GB RAM.
I am not sure whether there is something wrong in the programs or whether my basic understanding is wrong.
P.S. I am new to Go.
The concurrent version of your program will always be slower than the sequential version. The reason, however, lies in the nature of the problem you are trying to solve.
Your program is concurrent but it is not parallel. Each calcFact call runs in its own goroutine, but the work itself is not divided up. Each goroutine must perform the same computation and output the same value.
It is like having a task that requires some text to be copied a hundred times. You have just one CPU (ignore the cores for now).
When you start a sequential process, you point the CPU to the original text once, and ask it to write it down a hundred times. The CPU has to manage a single task.
With goroutines, the CPU is told that there are a hundred tasks that must be done concurrently. It just so happens that they are all the same task. But the CPU is not smart enough to know that.
So it does the same thing as above. Even though each task is now a hundred times smaller, there is still just one CPU. So the amount of work the CPU has to do is still the same, except with all the added overhead of managing a hundred different things at once. Hence, it loses part of its efficiency.
To see an improvement in performance you'll need proper parallelism. A simple example would be to split the factorial input number roughly in the middle and compute 2 smaller factorials. Then combine them together:
// not an ideal solution
func main() {
ch := make(chan int)
r := 10
result := 1
go fact(r, ch)
for i := range ch {
result *= i
}
fmt.Println(result)
}
func fact(n int, ch chan int) {
p := n/2
q := p + 1
var wg sync.WaitGroup
wg.Add(2)
go func() {
ch <- factPQ(1, p)
wg.Done()
}()
go func() {
ch <- factPQ(q, n)
wg.Done()
}()
go func() {
wg.Wait()
close(ch)
}()
}
func factPQ(p, q int) int {
r := 1
for i := p; i <= q; i++ {
r *= i
}
return r
}
Working code: https://play.golang.org/p/xLHAaoly8H
Now you have two goroutines working towards the same goal and not just repeating the same calculations.
Note about CPU cores:
In your original code, the sequential version runs on a single goroutine, so at any given moment its work executes on one core, although the OS may move that thread between cores. You don't control that placement.
The concurrent version can use several cores, but as mentioned above, the overhead of creating and scheduling a million goroutines and handing every result over an unbuffered channel outweighs the tiny amount of work each goroutine does, which makes the performance come down.
abhink has given a good answer. I would also like to draw attention to Amdahl's Law, which should always be borne in mind when trying to use parallel processing to increase the overall speed of computation. That's not to say "don't make things parallel", but rather: be realistic about expectations and understand the parallel architecture fully.
Go allows us to write concurrent programs. This is related to trying to write faster parallel programs, but the two issues are separate. See Rob Pike's Concurrency is not Parallelism for more info.
