What causes the difference in measured time for goroutine creation?

Consider the following application, designed to measure goroutine creation latency. Assume that we are running with GOMAXPROCS=2.
package main

import "fmt"
import "time"

const numRuns = 10000

type timeRecord struct {
    Ts  time.Time
    Msg string
}

var timeStamps []timeRecord

func threadMain(done chan bool) {
    timeStamps = append(timeStamps, timeRecord{time.Now(), "Inside thread"})
    done <- true
}

func main() {
    timeStamps = make([]timeRecord, 0, numRuns*2)
    done := make(chan bool)
    dummy := 0
    _ = dummy // dummy is only exercised in the variation below; this keeps this version compiling
    for i := 0; i < numRuns; i++ {
        timeStamps = append(timeStamps, timeRecord{time.Now(), "Before creation"})
        go threadMain(done)
        <-done
    }
    // Regularize
    regularizedTime := make([]time.Duration, numRuns*2)
    for i := 0; i < len(timeStamps); i++ {
        regularizedTime[i] = timeStamps[i].Ts.Sub(timeStamps[0].Ts)
    }
    // Fake timetraced
    fmt.Printf("%6d ns (+%6d ns): %s\n", 0, 0, timeStamps[0].Msg)
    for i := 1; i < len(timeStamps); i++ {
        fmt.Printf("%8d ns (+%6d ns): %s\n", regularizedTime[i], (regularizedTime[i]-regularizedTime[i-1]).Nanoseconds(), timeStamps[i].Msg)
    }
}
On my server, this consistently outputs a median delta of roughly 260 ns from "Before creation" to "Inside thread". Now consider the following variation of the main function.
timeStamps = make([]timeRecord, 0, numRuns*2)
done := make(chan bool)
dummy := 0
for i := 0; i < numRuns; i++ {
    timeStamps = append(timeStamps, timeRecord{time.Now(), "Before creation"})
    go threadMain(done)
    for j := 0; j < 1000; j++ {
        dummy += j
    }
    <-done
}
Under this variation, the same delta is roughly 890 ns.
Obviously, the exact numbers are machine-specific, but the difference between the numbers is curious. Logically, if I am measuring between "Before creation" and "Inside thread", adding extra logic after the go statement seems like it should not increase that time, but it does.
Does anyone have a good idea for why the time increase is not occurring in the expected location?

The Go scheduler is cooperative. It can switch from the current goroutine to another one only at certain points in the program, such as function calls and reads or writes on a channel. I expect that the difference you observe is due to the goroutines being started later (only after the added for loop has run).
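One way to probe those switch points is to yield explicitly right after the go statement, so the new goroutine gets a chance to record its timestamp before the busy loop runs. The fragment below is an experiment suggested here, not part of the original question or answer; it assumes the surrounding main from the question plus an extra import of "runtime", and whether it restores the smaller delta depends on the Go version and the machine.

for i := 0; i < numRuns; i++ {
    timeStamps = append(timeStamps, timeRecord{time.Now(), "Before creation"})
    go threadMain(done)
    runtime.Gosched() // explicit scheduling point: give the new goroutine a chance to run before the busy loop
    for j := 0; j < 1000; j++ {
        dummy += j
    }
    <-done
}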

Related

"Matrix multiplication" using goroutines and channels

I have a university project for measuring the time difference of matrix multiplication when I use 1 goroutine, 2 goroutines, 3 and so on. I must use channels. My problem is that no matter how many goroutines I add, the run time is almost always the same. Maybe someone can tell me where the problem is. Maybe the sending takes so long that it dominates the total time. The code is given below.
package main

import (
    "fmt"
    "math/rand"
    "time"
)

const length = 1000

var start time.Time
var rez [length][length]int

func main() {
    const threadlength = 1
    toCalcRow := make(chan []int)
    toCalcColumn := make(chan []int)
    dummy1 := make(chan int)
    dummy2 := make(chan int)
    var row [length + 1]int
    var column [length + 1]int
    var a [length][length]int
    var b [length][length]int
    for i := 0; i < length; i++ {
        for j := 0; j < length; j++ {
            a[i][j] = rand.Intn(10)
            b[i][j] = rand.Intn(10)
        }
    }
    for i := 0; i < threadlength; i++ {
        go Calc(toCalcRow, toCalcColumn, dummy1, dummy2)
    }
    start = time.Now()
    for i := 0; i < length; i++ {
        for j := 0; j < length; j++ {
            row[0] = i
            column[0] = j
            for k := 0; k < length; k++ {
                row[k+1] = a[i][j]
                column[k+1] = b[i][k]
            }
            rowSlices := make([]int, len(row))
            columnSlices := make([]int, len(column))
            copy(rowSlices, row[:])
            copy(columnSlices, column[:])
            toCalcRow <- rowSlices
            toCalcColumn <- columnSlices
        }
    }
    dummy1 <- -1
    for i := 0; i < length; i++ {
        for j := 0; j < length; j++ {
            fmt.Print(rez[i][j])
            fmt.Print(" ")
        }
        fmt.Println(" ")
    }
    <-dummy2
    close(toCalcRow)
    close(toCalcColumn)
    close(dummy1)
}

func Calc(chin1 <-chan []int, chin2 <-chan []int, dummy <-chan int, dummy1 chan<- int) {
loop:
    for {
        select {
        case row := <-chin1:
            column := <-chin2
            var sum [3]int
            sum[0] = row[0]
            sum[1] = column[0]
            for i := 1; i < len(row); i++ {
                sum[2] += row[i] * column[i]
            }
            rez[sum[0]][sum[1]] = sum[2]
        case <-dummy:
            elapsed := time.Since(start)
            fmt.Println("Binomial took ", elapsed)
            dummy1 <- 0
            break loop
        }
    }
    close(dummy1)
}
You don't see a difference because preparing the data to pass to the goroutines is your bottleneck. It is as slow as, or slower than, performing the calculation itself.
Passing a copy of the rows and columns is not a good strategy; it is killing the performance.
The goroutines can read directly from the input matrices, which are read-only. There is no possible race condition here.
The same goes for the output: a goroutine that computes the product of a row and a column writes the result into a distinct cell, so there is no possible race condition there either.
What to do is the following. Define a struct with two fields, one for the row and one for the column to multiply.
Fill a buffered channel with all possible combinations of rows and columns to multiply, from (0,0) to (n-1,m-1).
The goroutines consume the structs from the channel, perform the computation, and write the result directly into the output matrix.
You then also have a done channel to signal to the main goroutine that the computation is finished. When a goroutine has processed the struct (n-1,m-1), it closes the done channel.
The main goroutine waits on the done channel after it has written all the structs. Once the done channel is closed, it prints the elapsed time.
Alternatively, we can use a wait group to wait until all goroutines have finished their computation.
You can then start with one goroutine and increase the number of goroutines to see the impact on the processing time.
See the code:
package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)

type pair struct {
    row, col int
}

const length = 1000

var start time.Time
var rez [length][length]int

func main() {
    const threadlength = 1
    pairs := make(chan pair, 1000)
    var wg sync.WaitGroup
    var a [length][length]int
    var b [length][length]int
    for i := 0; i < length; i++ {
        for j := 0; j < length; j++ {
            a[i][j] = rand.Intn(10)
            b[i][j] = rand.Intn(10)
        }
    }
    wg.Add(threadlength)
    for i := 0; i < threadlength; i++ {
        go Calc(pairs, &a, &b, &rez, &wg)
    }
    start = time.Now()
    for i := 0; i < length; i++ {
        for j := 0; j < length; j++ {
            pairs <- pair{row: i, col: j}
        }
    }
    close(pairs)
    wg.Wait()
    elapsed := time.Since(start)
    fmt.Println("Binomial took ", elapsed)
    for i := 0; i < length; i++ {
        for j := 0; j < length; j++ {
            fmt.Print(rez[i][j])
            fmt.Print(" ")
        }
        fmt.Println(" ")
    }
}

func Calc(pairs chan pair, a, b, rez *[length][length]int, wg *sync.WaitGroup) {
    for {
        pair, ok := <-pairs
        if !ok {
            break
        }
        rez[pair.row][pair.col] = 0
        for i := 0; i < length; i++ {
            rez[pair.row][pair.col] += a[pair.row][i] * b[i][pair.col]
        }
    }
    wg.Done()
}
Your code is quite difficult to follow (calling variables dummy1/dummy2 is confusing, particularly when they get different names in Calc), and adding some comments would make it easier to understand.
Firstly, a bug. After sending the data to be calculated you send dummy1 <- -1, and I believe you expect this to wait for all calculations to be complete. However, that is not necessarily the case when you have multiple goroutines. The channel will be drained by ONE of the goroutines and the timing info printed out; other goroutines will still be running (and may not have finished their calculations).
In terms of timing, I suspect that the way you are sending data to the goroutines slows things down; you send the row and then the column, and because the channels are not buffered the goroutine will block while waiting for the column (switching back to the main goroutine to send the column). This back and forth slows the rate at which your goroutines get data and may well explain why adding extra goroutines has a limited impact (it also becomes dangerous if you use buffered channels, because different goroutines could then pair a row with the wrong column).
I have refactored your code (note there may be bugs and it's far from perfect!) into something that does show a difference (on my computer 1 goroutine = 10s; 5 = 7s):
package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)

const length = 1000

var start time.Time
var rez [length][length]int

// toMultiply will hold details of what the goroutine will be multiplying (one row and one column)
type toMultiply struct {
    rowNo    int
    columnNo int
    row      []int
    column   []int
}

func main() {
    const noOfGoRoutines = 5
    // Build up a matrix of dimensions (length) x (length)
    var a [length][length]int
    var b [length][length]int
    for i := 0; i < length; i++ {
        for j := 0; j < length; j++ {
            a[i][j] = rand.Intn(10)
            b[i][j] = rand.Intn(10)
        }
    }
    // Setup completed so start the clock...
    start = time.Now()
    // Start off noOfGoRoutines goroutines to multiply each row/column
    toCalc := make(chan toMultiply)
    var wg sync.WaitGroup
    wg.Add(noOfGoRoutines)
    for i := 0; i < noOfGoRoutines; i++ {
        go func() {
            Calc(toCalc)
            wg.Done()
        }()
    }
    // Begin the multiplication.
    start = time.Now()
    for i := 0; i < length; i++ {
        for j := 0; j < length; j++ {
            tm := toMultiply{
                rowNo:    i,
                columnNo: j,
                row:      make([]int, length),
                column:   make([]int, length),
            }
            for k := 0; k < length; k++ {
                tm.row[k] = a[i][j]
                tm.column[k] = b[i][k]
            }
            toCalc <- tm
        }
    }
    // All of the data has been sent to the channel; now we need to wait for all of the
    // goroutines to complete
    close(toCalc)
    wg.Wait()
    fmt.Println("Binomial took ", time.Since(start))
    // The full result should be in rez
    for i := 0; i < length; i++ {
        for j := 0; j < length; j++ {
            //fmt.Print(rez[i][j])
            //fmt.Print(" ")
        }
        //fmt.Println(" ")
    }
}

// Calc - Multiply a row from one matrix with a column from another
func Calc(toCalc <-chan toMultiply) {
    for tc := range toCalc {
        var result int
        for i := 0; i < len(tc.row); i++ {
            result += tc.row[i] * tc.column[i]
        }
        // warning - the below should work in this case but be careful writing to global variables from goroutines
        rez[tc.rowNo][tc.columnNo] = result
    }
}
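If you want to double-check the warning in that last comment, the race detector is the standard tool; assuming the program is saved as main.go, run it with go run -race main.go. It should stay quiet here, because each (rowNo, columnNo) cell is written by exactly one goroutine and any later read happens only after wg.Wait().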

Why does a single goroutine run slower than multiple goroutines when runtime.GOMAXPROCS(1) is set?

I just wanted to see how fast goroutines switch context, so I wrote the code below. To my surprise, multiple goroutines run faster than the version that does not need to switch context (I set the program to run on only one CPU core).
package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

func main() {
    runtime.GOMAXPROCS(1)
    t_start := time.Now()
    sum := 0
    for j := 0; j < 10; j++ {
        sum = 0
        for i := 0; i < 100000000; i++ {
            sum += i
        }
    }
    fmt.Println("single goroutine takes ", time.Since(t_start))
    var wg sync.WaitGroup
    t_start = time.Now()
    for j := 0; j < 10; j++ {
        wg.Add(1)
        go func() {
            sum := 0
            for i := 0; i < 100000000; i++ {
                sum += i
            }
            defer wg.Done()
        }()
    }
    wg.Wait()
    fmt.Println("multiple goroutines take ", time.Since(t_start))
}
A single goroutine takes 251.690788ms, multiple goroutines take 254.067156ms
The single goroutine should run faster, because a single goroutine does not need to switch context. However, the result is the opposite: the single-goroutine version is always slower. What is happening in this program?
Your concurrent version does several things the non-concurrent version does not, which will make it slower:
It's creating a new sum value, which must be allocated. Your non-concurrent version just resets the existing value. This probably has a minimal impact, but is a difference.
You're using a waitgroup. Obviously this adds overhead.
The defer in defer wg.Done() also adds overhead, roughly equivalent to an extra function call.
There may well be other subtle differences, too.
So in short: Your benchmarks are just invalid, because you're comparing apples with oranges.
More important: This isn't a useful benchmark in the first place, because it's a completely artificial workload.
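That said, if the goal is to compare the two loop shapes on an equal footing, the testing package's benchmark harness removes most of the measurement pitfalls. The sketch below is not from the original answer; the package name, function names, and constants are illustrative:

package sumbench

import (
    "sync"
    "testing"
)

const n = 100000000

// sumTo gives both variants exactly the same work per iteration.
func sumTo(limit int) int {
    sum := 0
    for i := 0; i < limit; i++ {
        sum += i
    }
    return sum
}

// BenchmarkSequential runs the ten sums one after another.
func BenchmarkSequential(b *testing.B) {
    for k := 0; k < b.N; k++ {
        for j := 0; j < 10; j++ {
            _ = sumTo(n)
        }
    }
}

// BenchmarkGoroutines runs the ten sums in separate goroutines.
func BenchmarkGoroutines(b *testing.B) {
    for k := 0; k < b.N; k++ {
        var wg sync.WaitGroup
        for j := 0; j < 10; j++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                _ = sumTo(n)
            }()
        }
        wg.Wait()
    }
}

Running it with GOMAXPROCS=1 go test -bench . reproduces the single-core setup from the question.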

I'm trying to understand why this case always goes off

I'm trying to understand why, in this select statement, the first case always goes off and does not wait for the channel to be filled. For this program I'm trying to get the program to wait until all the channels have been filled, and whenever a channel is filled by the method it is put in the first available space in the slice of channels.
I tried putting the line <-res[i] in the case statement but for some reason this case always goes off regardless of whether or not the channels have a value.
package main

import (
    "fmt"
    "math"
    "math/rand"
    "time"
)

func numbers(sz int) (res chan float64) {
    res = make(chan float64)
    go func() {
        defer close(res)
        num := 0.0
        time.Sleep(time.Duration(rand.Intn(1000)) * time.Microsecond)
        for i := 0; i < sz; i++ {
            num += math.Sqrt(math.Abs(rand.Float64()))
        }
        num /= float64(sz)
        res <- num
        return
    }()
    return
}

func main() {
    var nGo int
    rand.Seed(42)
    fmt.Print("Number of Go routines: ")
    fmt.Scanf("%d \n", &nGo)
    res := make([]chan float64, nGo)
    j := 0
    for i := 0; i < nGo; i++ {
        res[i] = numbers(1000)
    }
    for true {
        for i := 0; i < nGo; {
            select {
            case <-res[i]:
                {
                    res[j] = res[i] // this line
                    j++
                }
            default:
                i++
            }
        }
        if j == nGo {
            break
        }
    }
    fmt.Println(<-res[nGo-1])
}
The print line should print some float.
<-res[i] in the case statement but for some reason this case always goes off regardless of whether or not the channels have a value
It will only not choose this case if the channel's buffer is full (i.e. a value cannot be sent without blocking). Your channel has a buffer length equal to the number of values you're sending on it, so it will never block, giving it no reason to ever take the default case.
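As a standalone illustration of that rule (not taken from the question's code), here is a tiny program with a send case in a select: the default branch is only taken once the buffer is full.

package main

import "fmt"

func main() {
    ch := make(chan int, 2) // room for two values

    for i := 0; i < 4; i++ {
        select {
        case ch <- i:
            fmt.Println("sent", i) // succeeds while the buffer has space
        default:
            fmt.Println("buffer full, took default for", i)
        }
    }
}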

How to iterate int range concurrently

For purely educational purposes I created a base58 package. It will encode/decode a uint64 using the bitcoin base58 symbol chart, for example:
b58 := Encode(100) // return 2j
num := Decode("2j") // return 100
While creating the first tests I came up with this:
func TestEncode(t *testing.T) {
    var i uint64
    for i = 0; i <= (1<<64 - 1); i++ {
        b58 := Encode(i)
        num := Decode(b58)
        if num != i {
            t.Fatalf("Expecting %d for %s", i, b58)
        }
    }
}
This "naive" implementation, tries to convert all the range from uint64 (From 0 to 18,446,744,073,709,551,615) to base58 and later back to uint64 but takes too much time.
To better understand how go handles concurrency I would like to know how to use channels or goroutines and perform the iteration across the full uint64 range in the most efficient way?
Could data be processed by chunks and in parallel, if yes how to accomplish this?
Thanks in advance.
UPDATE:
As mentioned in the answer by @Adrien, one way is to use t.Parallel(), but that applies only when testing the package. In any case, by implementing it I found that it is noticeably slower; it runs in parallel but there is no speed gain.
I understand that covering the full uint64 range may take years, but what I want to find out is how a channel or goroutine could help speed up the process (testing with a small range such as 1<<16), probably by using something like this https://play.golang.org/p/9U22NfrXeq just as an example.
The question is not about how to test the package; it is about what algorithm or technique could be used to iterate faster by using concurrency.
This functionality is built into the Go testing package, in the form of T.Parallel:
func TestEncode(t *testing.T) {
    var i uint64
    for i = 0; i <= (1<<64 - 1); i++ {
        t.Run(fmt.Sprintf("%d", i), func(t *testing.T) {
            j := i       // Copy to local var - important
            t.Parallel() // Mark test as parallelizable
            b58 := Encode(j)
            num := Decode(b58)
            if num != j {
                t.Fatalf("Expecting %d for %s", j, b58)
            }
        })
    }
}
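How many of these subtests run at once is capped by the -parallel flag of go test, which defaults to GOMAXPROCS, for example:

go test -run TestEncode -parallel 8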
I came up with this solution:
package main

import (
    "fmt"
    "time"

    "github.com/nbari/base58"
)

func encode(i uint64) {
    x := base58.Encode(i)
    fmt.Printf("%d = %s\n", i, x)
    time.Sleep(time.Second)
}

func main() {
    concurrency := 4
    sem := make(chan struct{}, concurrency)
    for i, val := uint64(0), uint64(1<<16); i <= val; i++ {
        sem <- struct{}{}
        go func(i uint64) {
            defer func() { <-sem }()
            encode(i)
        }(i)
    }
    for i := 0; i < cap(sem); i++ {
        sem <- struct{}{}
    }
}
Basically, this starts 4 workers that call the encode function; to make the behavior easier to observe, a sleep is added so that the data is printed in chunks of 4.
Also, this answer helped me to better understand concurrency: https://stackoverflow.com/a/18405460/1135424
If there is a better way please let me know.
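The question also asks whether the range could be processed in chunks. Below is a minimal sketch of that idea, written as a test in the same package as Encode/Decode; the TestEncodeChunked name, the worker count, and the checkRange helper are all made up for illustration, and the usual "fmt", "sync", and "testing" imports are assumed.

// checkRange round-trips every value in [from, to) through Encode/Decode
// and reports the first mismatch it finds.
func checkRange(from, to uint64) error {
    for i := from; i < to; i++ {
        if num := Decode(Encode(i)); num != i {
            return fmt.Errorf("round-trip failed for %d: got %d", i, num)
        }
    }
    return nil
}

func TestEncodeChunked(t *testing.T) {
    const limit = uint64(1 << 16) // small range for experimentation
    const workers = 4
    chunk := limit / workers

    var wg sync.WaitGroup
    for w := uint64(0); w < workers; w++ {
        from, to := w*chunk, (w+1)*chunk
        if w == workers-1 {
            to = limit // the last worker picks up any remainder
        }
        wg.Add(1)
        go func(from, to uint64) {
            defer wg.Done()
            if err := checkRange(from, to); err != nil {
                t.Error(err) // t.Error is safe to call from other goroutines
            }
        }(from, to)
    }
    wg.Wait()
}

Each goroutine owns a disjoint sub-range, so there is no shared state to coordinate beyond the WaitGroup.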

Saving results from a parallelized goroutine

I am trying to parallelize an operation in Go and save the results in a manner that I can iterate over and sum up afterwards.
I have managed to set up the parameters so that no deadlock occurs, and I have confirmed that the operations are working and being saved correctly within the function. When I iterate over the slice of my structs and try to sum up the results of the operation, they all remain 0. I have tried passing by reference, with pointers, and with channels (which causes a deadlock).
I have only found this example for help: https://golang.org/doc/effective_go.html#parallel. But this seems outdated now, as Vector has been deprecated? I also have not found any references to the way the function in the example is declared (with func (u Vector) before the name). I tried replacing this with a slice but got compile-time errors.
Any help would be much appreciated. Here are the key parts of my code:
type job struct {
a int
b int
result *big.Int
}
func choose(jobs []Job, c chan int) {
temp := new(big.Int)
for _,job := range jobs {
job.result = //perform operation on job.a and job.b
//fmt.Println(job.result)
}
c <- 1
}
func main() {
num := 100 //can be very large (why we need big.Int)
n := num
k := 0
const numCPU = 6 //runtime.NumCPU
count := new(big.Int)
// create a 2d slice of jobs, one for each core
jobs := make([][]Job, numCPU)
for (float64(k) <= math.Ceil(float64(num / 2))) {
// add one job to each core, alternating so that
// job set is similar in difficulty
for i := 0; i < numCPU; i++ {
if !(float64(k) <= math.Ceil(float64(num / 2))) {
break
}
jobs[i] = append(jobs[i], Job{n, k, new(big.Int)})
n -= 1
k += 1
}
}
c := make(chan int, numCPU)
for i := 0; i < numCPU; i++ {
go choose(jobs[i], c)
}
// drain the channel
for i := 0; i < numCPU; i++ {
<-c
}
// computations are done
for i := range jobs {
for _,job := range jobs[i] {
//fmt.Println(job.result)
count.Add(count, job.result)
}
}
fmt.Println(count)
}
Here is the code running on the go playground https://play.golang.org/p/X5IYaG36U-
As long as the []Job slice is only modified by one goroutine at a time, there's no reason you can't modify the job in place.
for i, job := range jobs {
    jobs[i].result = temp.Binomial(int64(job.a), int64(job.b))
}
https://play.golang.org/p/CcEGsa1fLh
You should also use a WaitGroup, rather than rely on counting tokens in a channel yourself.
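A minimal sketch of that WaitGroup suggestion, adapted from the question's code (it changes the choose signature, assumes an extra "sync" import, and writes each result into the job's own preallocated big.Int rather than assigning a shared temporary):

func choose(jobs []Job, wg *sync.WaitGroup) {
    defer wg.Done()
    for i := range jobs {
        // store the result in the big.Int allocated for this job in main
        jobs[i].result.Binomial(int64(jobs[i].a), int64(jobs[i].b))
    }
}

And in main, the token channel is replaced by:

var wg sync.WaitGroup
for i := 0; i < numCPU; i++ {
    wg.Add(1)
    go choose(jobs[i], &wg)
}
wg.Wait() // every worker has called Done, so all results are filled in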
