How can one determine the order of receiving from channels? - go

Consider the following example from the Tour of Go.
How can one determine the order of reception from the channels?
Why does x always get the first output from the goroutine?
It sounds reasonable, but I didn't find any documentation about it.
I tried to add some sleep, and x still gets the input from the first executed goroutine.
c := make(chan int)
go sumSleep(s[:len(s)/2], c)
go sum(s[len(s)/2:], c)
x, y := <-c, <-c // receive from c
fmt.Println(x, y, x+y)
The sleep happens before the send on the channel.

Messages are always received in the order they are sent. That is deterministic.
However, the order of execution of any given operation across concurrent goroutines is not deterministic. So if you have two goroutines concurrently sending on a channel, you can't know which will send first and which will send second. The same applies if you have two goroutines receiving on the same channel.

I tried to add some sleep, and x still gets the input from the first executed goroutine
In addition to what @Adrian wrote, in your code x will always get the result of the first receive on c, because of the language rules for tuple assignment:
The assignment proceeds in two phases. First, the operands of index expressions and pointer indirections (including implicit pointer indirections in selectors) on the left and the expressions on the right are all evaluated in the usual order. Second, the assignments are carried out in left-to-right order.
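A minimal sketch of how that plays out (a single sender, so the send order itself is fixed; the channel and values here are illustrative):
package main

import "fmt"

func main() {
    c := make(chan int)
    go func() {
        c <- 1 // first send
        c <- 2 // second send
    }()
    // The operands on the right are evaluated left to right, so the
    // first receive completes before the second: x is always 1.
    x, y := <-c, <-c
    fmt.Println(x, y) // always prints: 1 2
}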

To add a bit to Adrian's answer: we don't know in what order the two goroutines might run. If your sleep comes before your channel-send, and sleeps "long enough",1 that will guarantee that the other goroutine can run to the point of doing its send. If both goroutines run "at the same time" and neither waits (as in the original Tour example), we cannot be sure which one will actually reach its c <- sum line first.
Running the Tour's example on the Go Playground (either directly, or through the Tour website), I actually get:
-5 17 12
in the output window, which (because we know that the -9 is in the 2nd half of the slice) tells us that the second goroutine "got there" (to the channel-send) first. In some sense, that's just luck; but when using the Go Playground, all jobs are run in a fairly deterministic environment, with a single CPU and with cooperative scheduling, so that the results are more predictable. In other words, if the second goroutine got there first on one run, it probably will on the next. If the playground used multiple CPUs and/or a less-deterministic environment, the results might change from one run to the next, but there is no guarantee of that.
In any case, assuming your code does what you say (and I believe it does), this:
go sumSleep(s[:len(s)/2], c)
go sum(s[len(s)/2:], c)
has the first sender wait and the second sender run first. But that's what we already observed actually happened when we let the two routines race. To see a change, we'd need to make the second sender delay.
I made a modified version of the example in the Go Playground here that prints more annotations. With the delay inserted in the 2nd half sum, we see the first-half sum as x:
2nd half: sleeping for 1s
1st half: sleeping for 0s
1st half: sending 17
2nd half: sending -5
17 -5 12
as we can expect since one second is "long enough".
1How long is "long enough"? Well, that depends: how fast are our computers? How much other stuff are they doing? If the computer is fast enough, a delay of a few milliseconds, or even a few nanoseconds, may be enough. If our computer is a really old one or very busy with other higher priority tasks, a few milliseconds might not be enough time. If the problem is sufficiently big, one second might not be enough time. It's often unwise to choose some particular amount of time, if you can control this better by some sort of synchronization operation, and usually you can. For instance, using a sync.WaitGroup variable allows you to wait for n goroutines (for some runtime value of n) to call a Done function before your own goroutine proceeds.
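As a hedged sketch (not the Tour's code), that sync.WaitGroup pattern looks like this:
package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    const n = 2 // number of goroutines to wait for
    wg.Add(n)
    for i := 0; i < n; i++ {
        go func(id int) {
            defer wg.Done() // signal completion
            fmt.Println("goroutine", id, "done")
        }(i)
    }
    wg.Wait() // blocks until all n goroutines have called Done
    fmt.Println("all goroutines finished")
}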
Playground code, copied to StackOverflow for convenience
package main

import (
    "fmt"
    "time"
)

func sum(s []int, c chan int, printme string, delay time.Duration) {
    sum := 0
    for _, v := range s {
        sum += v
    }
    fmt.Printf("%s: sleeping for %v\n", printme, delay)
    time.Sleep(delay)
    fmt.Printf("%s: sending %d\n", printme, sum)
    c <- sum
}

func main() {
    s := []int{7, 2, 8, -9, 4, 0}
    c := make(chan int)
    go sum(s[:len(s)/2], c, "1st half", 0*time.Second)
    go sum(s[len(s)/2:], c, "2nd half", 1*time.Second)
    x, y := <-c, <-c
    fmt.Println(x, y, x+y)
}

Related

Is it safe to write to on-stack variables from a different goroutine, blocking the current one with WaitGroup?

There are various task executors with different properties, and some of them only support non-blocking calls. So I was wondering whether there's a need to use a mutex/channel to safely deliver task results to the calling goroutine, or whether a simple WaitGroup is enough.
For the sake of simplicity, and the specificity of the question, here is an example using a very naive task executor that launches the function directly as a goroutine:
package main

import (
    "sync"
    "testing"

    "github.com/stretchr/testify/assert" // assumed: testify's assert package
)

func TestRace(t *testing.T) {
    var wg sync.WaitGroup
    a, b := 1, 2
    wg.Add(1)
    // this func would be passed to a real executor
    go func() {
        a, b = a+1, b+1
        wg.Done()
    }()
    wg.Wait()
    assert.Equal(t, a, 2)
    assert.Equal(t, b, 3)
}
Execution of the test above with the -race option didn't fail on my machine. However, is that enough of a guarantee? What if the goroutine is executed on a different CPU core, on a different CPU core block (AMD CCX), or on a different CPU in a multi-socket setup?
So, the question is: can I use WaitGroup to provide synchronization (blocking and returning values) for non-blocking executors?
JimB should perhaps provide this as the answer, but I'll copy it from his comments, starting with this one:
The WaitGroup here is to ensure that a, b = a+1, b+1 has executed, so there's no reason to assume it hasn't.
[and]
[T]he guarantees you have are laid out by the go memory model, which is well documented [here]. [Specifically, the combination of wg.Done() and wg.Wait() in the example suffices to guarantee non-racy access to the two variables a and b.]
As long as this question exists, it's probably a good idea to copy Adrian's comment too:
As @JimB noted, if a value is shared between goroutines, it cannot be stack-allocated, so the question is moot (see How are Go closures layed out in memory?). WaitGroup works correctly.
The fact that closure variables are heap-allocated is an implementation detail: it might not be true in the future. But the sync.WaitGroup guarantee will still be true in the future, even if some clever future Go compiler is able to keep those variables on some stack.
("Which stack?" is another question entirely, but one for the hypothetical future clever Go compiler to answer. The WaitGroup and memory model provide the rules.)

In what order are things happening?

I started playing around with Go by going through the Tour of Go on the official site.
I only have basic experience with programming, but on reaching the channels page I started to play around to try to get my head around it, and I've ended up quite confused.
This is what I have the code as:
package main

import "fmt"

func sum(s []int, c chan int) {
    sum := 0
    s[0] = 8
    s = append(s, 20)
    fmt.Println(s)
    for _, v := range s {
        sum += v
    }
    c <- sum // send sum to c
}

func main() {
    s := []int{7, 2, 8, -9, 4, 0}
    c := make(chan int)
    go sum(s[:len(s)/2], c)
    fmt.Println(s[0])
    go sum(s[len(s)/2:], c)
    fmt.Println(s)
    x, y := <-c, <-c // receive from c
    fmt.Println(x, y, x+y)
    fmt.Println(s)
}
And this is the result I get:
7
[8 2 8 20 4 0]
[8 2 8 20]
[8 4 0 20]
26 32 58
[8 2 8 8 4 0]
I understand that creating a slice gives you an underlying array of the needed size, and that passing a slice to a function and modifying an element modifies the underlying array, but I'm not sure in what order the goroutines run.
It prints the first s[0] as 7 even though the function call before it should have modified it to 8, so I'm assuming the goroutine hasn't run yet?
Then printing the entire slice after the second goroutine call shows all the modifications made during the first function call (the first item changed to 8 and the 20 appended).
Yet the next 2 printed lines are the printouts of the slice segments from within the functions, which, logically, given that I just said I saw the modifications done by the function call, means they should have printed out BEFORE that line.
Then I'm not sure how it got the calculations or the final printout of the array.
Can someone familiar with how Go operates explain what the logical progression of this code was?
It prints the first s[0] as 7 even though the function call before it should have modified it to 8, so I'm assuming the goroutine hasn't run yet?
Yes, this interpretation is correct. (But such assumptions should not be made; see the next section.)
Then printing the entire slice after the second goroutine call shows all the modifications made during the first function call (the first item changed to 8 and the 20 appended).
NO. Never ever think along these lines.
The state of your application at this point is not well defined: 1) you started goroutines which modify the state, and you have absolutely no synchronisation between the three goroutines (the main one and the two you started manually); 2) your code is racy, which means it is a) wrong, since racy programs are never correct, and b) results in undefined behaviour.
Yet the next 2 printed lines are the printouts of the slice segments from within the functions, which means they should have printed out BEFORE that line.
More or less, if your program were not racy (see above).
Where your code is racy: main.s is a slice with a backing array, and you subslice it twice when starting the goroutines with go sum(...). Both subslices share the same backing array. Now inside the goroutines you write (s[0] = 8) and append (append(s, 20)) to this subslice, from both goroutines, without synchronisation. The first goroutine's append reuses the (large enough) backing array, which is concurrently used by the second goroutine.
This results in concurrent write without synchronisation to the fourth element of main.s which is racy and thus undefined.
By reading twice from the channel (<-c, <-c) you introduce synchronisation between the two manually started goroutines and the main goroutine, and your program returns to serial execution (but its state is still not well defined, as it is racy).
The Println statements you use to track what is going on are problematic too: they print data whose state is not well defined, because it is accessed concurrently from different goroutines; both may print arbitrary data.
Summing up: your program is racy, its output is undefined, and there is no "reason" or "explanation" why it prints what it prints; it's pure chance.
Always try your code under the race detector (-race) and do not try to make up pseudo-explanations for racy code.
How to fix:
The divide and conquer approach is nice and okay. What is dangerous, and thus easy to get wrong (as happened to you), is modifying the subslice inside the goroutine. While s[0] = 8 is harmless in your case, the append interferes badly with the shared backing array.
Do not modify s inside the goroutine. If you must: Make sure you have a new backing array.
(Do not try to mix slices and concurrency. Both have some subtleties, slices the shared backing array and concurrency the problem of undefined racy behaviour. Especially if you are new to all this. Learn one, then the other, then combine both.)
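As a sketch of that advice applied to the question's sum (assuming you really must modify the data): copy the subslice into a fresh backing array before touching it.
func sum(s []int, c chan int) {
    local := make([]int, len(s))
    copy(local, s) // fresh backing array; writes and appends stay private
    local[0] = 8
    local = append(local, 20)
    sum := 0
    for _, v := range local {
        sum += v
    }
    c <- sum
}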

Do goroutines created in the same goroutine always execute in order?

package main

func main() {
    c := make(chan int)
    for i := 0; i <= 100; i++ {
        i := i
        go func() {
            c <- i
        }()
    }
    for {
        b := <-c
        println(b)
        if b == 100 {
            break
        }
    }
}
The code above creates a goroutine per loop iteration to send its number on channel c, so I just wonder: will these goroutines execute in random order? During my tests the output was always 0 to 100, in order.
No, they are not guaranteed to run in order. With GOMAXPROCS=1 (the default) they appear to, but this is not guaranteed by the language spec.
And when I run your program with GOMAXPROCS=6, the output is non-deterministic:
$ GOMAXPROCS=6 ./test
2
0
1
4
3
5
6
7
8
9
...
On another run, the output was slightly different.
If you want a set of sends on a channel to happen in order, the best solution would be to perform them from the same goroutine.
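A minimal sketch of that solution for the code above: move all the sends into a single goroutine, and the channel's FIFO guarantee does the rest.
package main

func main() {
    c := make(chan int)
    go func() {
        for i := 0; i <= 100; i++ {
            c <- i // all sends from one goroutine: order is guaranteed
        }
    }()
    for i := 0; i <= 100; i++ {
        println(<-c) // always prints 0 through 100 in order
    }
}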
What you observe as "random" behaviour is, more strictly, non-deterministic behaviour.
To understand what is happening here, think about the behaviour of the channel. In this case, it has many goroutines trying to write into the channel, and just one goroutine reading out of the channel.
The reading process is simply sequential and we can disregard it.
There are many concurrent writing processes and they are competing to access a shared resource (the channel). The channel has to make choices about which message it will accept.
When a Communicating Sequential Process (CSP) network makes a choice, it introduces non-determinism. In Go, there are two ways that this kind of choice happens:
concurrent access to one of the ends of a channel, and
select statements.
Your case is the first of these.
CSP is an algebra that allows concurrent behaviours to be analysed and understood. A seminal publication on this is Roscoe and Hoare "The Laws of Occam Programming" https://www.cs.ox.ac.uk/files/3376/PRG53.pdf (similar ideas apply to Go also, although there are small differences).
Surprisingly, the concurrent execution of goroutines is fully deterministic. It's only when choices are made that non-determinism comes in.
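The second kind of choice is easy to demonstrate. In this sketch both channels are ready, and the spec says select picks among ready cases via a uniform pseudo-random choice, so the output varies between runs:
package main

import "fmt"

func main() {
    a := make(chan string, 1)
    b := make(chan string, 1)
    a <- "from a"
    b <- "from b"
    select { // both cases ready: a non-deterministic choice is made
    case msg := <-a:
        fmt.Println(msg)
    case msg := <-b:
        fmt.Println(msg)
    }
}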

How do goroutines behave on a multi-core processor

I am a newbie to the Go language, so please excuse me if my question is very basic. I have written a very simple program:
package main

import (
    "fmt"
    "time"
)

func main() {
    var count int // Default 0
    cptr := &count
    go incr(cptr)
    time.Sleep(100) // note: an untyped 100 here means 100 nanoseconds
    fmt.Println(*cptr)
}

// Increments the value of count through pointer var
func incr(cptr *int) {
    for i := 0; i < 1000; i++ {
        go func() {
            fmt.Println(*cptr)
            *cptr = *cptr + 1
        }()
    }
}
The value of count should increment by one for each iteration of the loop. Consider these cases:
Loop runs 100 times --> value of count is 100 (which is correct, as the loop runs 100 times).
Loop runs >510 times --> value of count is either 508 or 510. This happens even at 100000 iterations.
I am running this on an 8 core processor machine.
First of all: prior to Go 1.5, the runtime ran goroutines on a single processor, using multiple threads only for blocking system calls, unless you told it to use more processors with GOMAXPROCS.
As of Go 1.5, GOMAXPROCS defaults to the number of CPUs (see the Go 1.5 release notes).
Also, the operation *cptr = *cptr + 1 is not guaranteed to be atomic. If you look carefully, it can be split up into 3 operations: fetch old value by dereferencing pointer, increment value, save value into pointer address.
The fact that you're getting 508/510 is due to some magic in the runtime and not defined to stay that way. More information on the behaviour of operations with concurrency can be found in the Go memory model.
You're probably getting the correct values below roughly 510 started goroutines because with that few, none of them happens to be interrupted mid-update (yet).
Generally, what you're trying to do is not recommended in any language, nor is it the "Go way" to do concurrency. A very good example of using channels to synchronize is this code walk: Share Memory By Communicating (rather than communicating by sharing memory).
Here is a little example to show you what I mean: use a channel with a buffer of 1 to store the current number, fetch it from the channel when you need it, change it at will, then put it back for others to use.
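A sketch of that idea (the example itself was not included above, so this is a reconstruction under the stated design):
package main

import (
    "fmt"
    "sync"
)

func main() {
    count := make(chan int, 1) // buffer of 1 holds the current number
    count <- 0                 // store the initial value

    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            v := <-count   // fetch the value; other goroutines now block
            count <- v + 1 // change it and put it back for others
        }()
    }
    wg.Wait()
    fmt.Println(<-count) // always 1000
}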
Your code is racy: you write to the same memory location from different, unsynchronized goroutines without any locking. The result is basically undefined. You must either a) make sure that all the goroutines write one after another in a nice, ordered way, b) protect each write with e.g. a mutex, or c) use atomic operations.
If you write such code: always try it under the race detector, e.g. $ go run -race main.go, and fix all races.
A nice alternative to using channels in this case might be the sync/atomic package, which contains specifically functions for atomically incrementing/decrementing numbers.
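A minimal sketch of that alternative (the counter type and variable names are illustrative):
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var count int64
    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            atomic.AddInt64(&count, 1) // one atomic read-modify-write
        }()
    }
    wg.Wait()
    fmt.Println(atomic.LoadInt64(&count)) // always 1000
}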
You are spawning 500 or 1000 goroutines without synchronization among them. This creates a race condition, which makes the result unpredictable.
Imagine you are working in an office, keeping the expense balance for your boss.
Your boss was in a meeting with 1000 of his subordinates, and in this meeting he said, "I will pay you all a bonus, so we will have to update the record of our expenses." He issued the following commands:
i) Go to Nerve and ask him/her what the current expense balance is
ii) Call my secretary to ask how much bonus you will receive
iii) Add your bonus as an expense to the existing expense balance. Do the math yourself.
iv) Ask Nerve to write the new expense balance on my authority.
All of the 1000 eager participants rushed to record their bonus and created a race condition.
Say 50 of the eager gophers hit Nerve at (almost) the same time. They ask:
i) "What is the current expense balance?"
-- Nerve says $1000 to all of those 50 gophers, as they all asked the same question at (almost) the same time, when the balance was $1000.
ii) The gophers then call the secretary: how much bonus should be paid to me?
The secretary answers, "just $1, to be fair".
iii) The boss said do the math, so they all calculate $1000 + $1 = $1001 as the new expense balance for the company.
iv) They all ask Nerve to write $1001 back to the balance.
Do you see the problem with the eager gophers' method?
There were 50 units of work done, each adding $1 to the existing balance, but the balance did not increase by $50; it increased by only $1.
Hope that clarifies the problem. As for solutions, the other contributors gave very good answers that I believe are sufficient.
All of those approaches failed for me (noobie here), but I found a better way: http://play.golang.org/p/OcMsuUpv2g
I'm using the sync package to solve the problem and wait for all goroutines to finish, without Sleep or channels.
And don't forget to take a look at that awesome post http://devs.cloudimmunity.com/gotchas-and-common-mistakes-in-go-golang/

Speedup problems with go

I wrote a very simple program in Go to test the performance of a parallel program: it factorizes a big semiprime number by trial division. Since no communication is involved, I expected an almost perfect speedup. However, the program seems to scale very badly.
I timed the program with 1, 2, 4, and 8 processes, running on an 8-core (real, not HT) computer, using the system time command. The number I factorized is "28808539627864609". Here are my results:
cores    time (sec)    speedup
1        60.0153       1.00
2        47.358        1.27
4        34.459        1.75
8        28.686        2.10
How can such bad speedups be explained? Is it a bug in my program, or a problem with the Go runtime? How could I get better performance? I'm not talking about the algorithm itself (I know there are better algorithms to factorize semiprime numbers), but about the way I parallelized it.
Here is the source code of my program:
package main

import (
    "big" // pre-Go 1 import path; today this would be "math/big"
    "flag"
    "fmt"
    "runtime"
)

func factorize(n *big.Int, start int, step int, c chan *big.Int) {
    var m big.Int
    i := big.NewInt(int64(start))
    s := big.NewInt(int64(step))
    z := big.NewInt(0)
    for {
        m.Mod(n, i)
        if m.Cmp(z) == 0 {
            c <- i
        }
        i.Add(i, s)
    }
}

func main() {
    var np *int = flag.Int("n", 1, "Number of processes")
    flag.Parse()
    runtime.GOMAXPROCS(*np)
    var n big.Int
    n.SetString(flag.Arg(0), 10) // Uses number given on command line
    c := make(chan *big.Int)
    for i := 0; i < *np; i++ {
        go factorize(&n, 2+i, *np, c)
    }
    fmt.Println(<-c)
}
EDIT
The problem really seems to be related to the Mod function. Replacing it with Rem gives better, but still imperfect, performance and speedup. Replacing it with QuoRem gives 3 times faster performance and a perfect speedup. Conclusion: it seems memory allocation kills parallel performance in Go. Why? Do you have any references about this?
Big.Int methods generally have to allocate memory, usually to hold the result of the computation. The problem is that there is just one heap and all memory operations are serialized. In this program, the numbers are fairly small and the (parallelizable) computation time needed for things like Mod and Add is small compared to the non-parallelizable operations of repeatedly allocating all the tiny little bits of memory.
As far as speeding it up, there is the obvious answer of don't use big.Ints if you don't have to. Your example number happens to fit in 64 bits. If you plan on working with really big big numbers though, the problem will kind of go away on its own. You will spend much more time doing computations, and the time spent in the heap will be relatively much less.
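As a sketch of that suggestion (the name factorize64 and the sqrt bound are additions, not part of the original program): trial division on plain uint64 values allocates nothing in the inner loop.
func factorize64(n, start, step uint64, c chan uint64) {
    // Stops at sqrt(n): for a semiprime, the smaller prime factor
    // lies below this bound and is found by exactly one goroutine.
    for i := start; i*i <= n; i += step {
        if n%i == 0 { // machine-word modulo: no heap allocation
            c <- i
            return
        }
    }
}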
There is a bug in your program, by the way, although it's not related to performance. When you find a factor and return the result on the channel, you send a pointer to the local variable i. This is fine, except that you don't break out of the loop then. The loop in the goroutine continues incrementing i and by the time the main goroutine gets around to fishing the pointer out of the channel and following it, the value is almost certain to be wrong.
After sending i through the channel, i should be replaced with a newly allocated big.Int:
if m.Cmp(z) == 0 {
    c <- i
    i = new(big.Int).Set(i)
}
This is necessary because there is no guarantee when fmt.Println will process the integer received on the line fmt.Println(<-c). It is usual for fmt.Println to cause goroutine switching, so if i isn't replaced with a newly allocated big.Int and the runtime switches back to executing the for-loop in function factorize, then the for-loop will overwrite i before it is printed - in which case the program won't print the 1st integer sent through the channel.
The fact that fmt.Println can cause goroutine switching means that the for-loop in function factorize may potentially consume a lot of CPU time between the moment when the main goroutine receives from channel c and the moment when the main goroutine terminates. Something like this:
run factorize()
<-c in main()
call fmt.Println()
continue running factorize() // Unnecessary CPU time consumed
return from fmt.Println()
return from main() and terminate program
Another reason for the small multi-core speedup is memory allocation. The function (*Int).Mod is internally using (*Int).QuoRem and will create a new big.Int each time it is called. To avoid the memory allocation, use QuoRem directly:
func factorize(n *big.Int, start int, step int, c chan *big.Int) {
    var q, r big.Int
    i := big.NewInt(int64(start))
    s := big.NewInt(int64(step))
    z := big.NewInt(0)
    for {
        q.QuoRem(n, i, &r)
        if r.Cmp(z) == 0 {
            c <- i
            i = new(big.Int).Set(i)
        }
        i.Add(i, s)
    }
}
Unfortunately, the goroutine scheduler in Go release r60.3 contains a bug which prevents this code from using all CPU cores. When the program is started with -n=2 (GOMAXPROCS=2), the runtime will utilize only 1 thread.
The Go weekly release has a better runtime and can utilize 2 threads if n=2 is passed to the program. This gives a speedup of approximately 1.9 on my machine.
Another potential contributing factor to multi-core slowdown has been mentioned in the answer by user "High Performance Mark". If the program is splitting the work into multiple sub-tasks and the result comes only from 1 sub-task, it means that the other sub-tasks may do some "extra work". Running the program with n>=2 may in total consume more CPU time than running the program with n=1.
To learn how much extra work is being done, you may want to (somehow) print out the values of all the i's in all goroutines at the moment the program exits the function main().
I don't read Go, so this is probably an answer to a question which is not what you asked. If so, downvote or delete as you wish.
If you were to plot "time to factorise integer n" against n, you would get a plot that goes up and down somewhat randomly. For any n you choose, there will be an integer in the range 1..n that takes longest to factorise on one processor. If your parallelisation strategy is to distribute the n integers across p processors, one of those processors will take at least the time to factorise the hardest integer, plus the time to factorise the rest of its load.
Perhaps you've done something similar?
