How does goroutines behave on a multi-core processor - parallel-processing

I am a newbie in Go language, so please excuse me if my question is very basic. I have written a very simple code:
func main(){
var count int // Default 0
cptr := &count
go incr(cptr)
time.Sleep(100)
fmt.Println(*cptr)
}
// Increments the value of count through pointer var
func incr(cptr *int) {
for i := 0; i < 1000; i++ {
go func() {
fmt.Println(*cptr)
*cptr = *cptr + 1
}()
}
}
The value of count should increment by one the number of times the loop runs. Consider the cases:
Loop runs for 100 times--> value of count is 100 (Which is correct as the loop runs 100 times).
Loop runs for >510 times --> Value of count is either 508 OR 510. This happens even if it is 100000.
I am running this on an 8 core processor machine.

First of all: prior to Go 1.5 it runs on a single processor, only using multiple threads for blocking system calls. Unless you tell the runtime to use more processors by using GOMAXPROCS.
As of Go 1.5 GOMAXPROCS is set to the number of CPUS. See 6, 7 .
Also, the operation *cptr = *cptr + 1 is not guaranteed to be atomic. If you look carefully, it can be split up into 3 operations: fetch old value by dereferencing pointer, increment value, save value into pointer address.
The fact that you're getting 508/510 is due to some magic in the runtime and not defined to stay that way. More information on the behaviour of operations with concurrency can be found in the Go memory model.
You're probably getting the correct values for <510 started goroutines because any number below these are not (yet) getting interrupted.
Generally, what you're trying to do is neither recommendable in any language, nor the "Go" way to do concurrency. A very good example of using channels to synchronize is this code walk: Share Memory By Communicating (rather than communicating by sharing memory)
Here is a little example to show you what I mean: use a channel with a buffer of 1 to store the current number, fetch it from the channel when you need it, change it at will, then put it back for others to use.

You code is racy: You write to the same memory location from different, unsynchronized goroutines without any locking. The result is basically undefined. You must either a) make sure that all the goroutine writes after each other in a nice, ordered way, or b) protect each write by e.g. e mutex or c) use atomic operations.
If you write such code: Always try it under the race detector like $ go run -race main.go and fix all races.

A nice alternative to using channels in this case might be the sync/atomic package, which contains specifically functions for atomically incrementing/decrementing numbers.

You are spawning 500 or 1000 routines without synchronization among routines. This is creating a race condition, which makes the result un-predictable.
Imagine You are working in an office to account for expense balance for your boss.
Your boss was in a meeting with 1000 of his subordinates and in this meeting he said, "I would pay you all a bonus, so we will have to increment the record of our expense". He issued following command:
i) Go to Nerve and ask him/ her what is the current expense balance
ii) Call my secretary to ask how much bonus you would receive
iii) Add your bonus as an expense to the additional expense balance. Do the math yourself.
iv) Ask Nerve to write the new balance of expense on my authority.
All of the 1000 eager participants rushed to record their bonus and created a race condition.
Say 50 of the eager gophers hit Nerve at the same time (almost), they ask,
i) "What is the current expense balance?
-- Nerve says $1000 to all of those 50 gophers, as they asked at the same question at the same time(almost) when the balance was $1000.
ii) The gophers then called secretary, how much bonus should be paid to me?
Secretary answers, "just $1 to be fair"
iii) Boos said do the math, they all calculates $1000+ $1 = $1001 should be the new cost balance for the company
iv) They all will ask Nerve to put back $1001 to the balance.
You see the problem in the method of Eager gophers?
There are $50 units of computation done, every time the added $1 to the existing balance, but the cost didn't increase by $50; only increased by $1.
Hope that clarifies the problem. Now for the solutions, other contributors gave very good solutions that would be sufficient I believe.

All of that approaches failed to me, noobie here. But i have found a better way http://play.golang.org/p/OcMsuUpv2g
I'm using sync package to solve that problem and wait for all goroutines to finish, without Sleep or Channel.
And don't forget to take a look at that awesome post http://devs.cloudimmunity.com/gotchas-and-common-mistakes-in-go-golang/

Related

Is it safe to read a function pointer concurrently without a lock?

Suppose I have this:
go func() {
for range time.Tick(1 * time.Millisecond) {
a, b = b, a
}
}()
And elsewhere:
i := a // <-- Is this safe?
For this question, it's unimportant what the value of i is with respect to the original a or b. The only question is whether reading a is safe. That is, is it possible for a to be nil, partially assigned, invalid, undefined, ... anything other than a valid value?
I've tried to make it fail but so far it always succeeds (on my Mac).
I haven't been able to find anything specific beyond this quote in the The Go Memory Model doc:
Reads and writes of values larger than a single machine word behave as
multiple machine-word-sized operations in an unspecified order.
Is this implying that a single machine word write is effectively atomic? And, if so, are function pointer writes in Go a single machine word operation?
Update: Here's a properly synchronized solution
Unsynchronized, concurrent access to any variable from multiple goroutines where at least one of them is a write is undefined behavior by The Go Memory Model.
Undefined means what it says: undefined. It may be that your program will work correctly, it may be it will work incorrectly. It may result in losing memory and type safety provided by the Go runtime (see example below). It may even crash your program. Or it may even cause the Earth to explode (probability of that is extremely small, maybe even less than 1e-40, but still...).
This undefined in your case means that yes, i may be nil, partially assigned, invalid, undefined, ... anything other than either a or b. This list is just a tiny subset of all the possible outcomes.
Stop thinking that some data races are (or may be) benign or unharmful. They can be the source of the worst things if left unattended.
Since your code writes to the variable a in one goroutine and reads it in another goroutine (which tries to assign its value to another variable i), it's a data race and as such it's not safe. It doesn't matter if in your tests it works "correctly". One could take your code as a starting point, extend / build on it and result in a catastrophe due to your initially "unharmful" data race.
As related questions, read How safe are Golang maps for concurrent Read/Write operations? and Incorrect synchronization in go lang.
Strongly recommended to read the blog post by Dmitry Vyukov: Benign data races: what could possibly go wrong?
Also a very interesting blog post which shows an example which breaks Go's memory safety with intentional data race: Golang data races to break memory safety
In terms of Race condition, it's not safe. In short my understanding of race condition is when there're more than one asynchronous routine (coroutines, threads, process, goroutines etc.) trying to access the same resource and at least one is a writing operation, so in your example we have 2 goroutines reading and writing variables of type function, I think what's matter from a concurrent point of view is those variables have a memory space somewhere and we're trying to read or write in that portion of memory.
Short answer: just run your example using the -race flag with go run -race
or go build -race and you'll see a detected data race.
The answer to your question, as of today, is that if a and b are not larger than a machine word, i must be equal to a or b. Otherwise, it may contains an unspecified value, that is most likely to be an interleave of different parts from a and b.
The Go memory model, as of the version on June 6, 2022, guarantees that if a program executes a race condition, a memory access of a location not larger than a machine word must be atomic.
Otherwise, a read r of a memory location x that is not larger than a machine word must observe some write w such that r does not happen before w and there is no write w' such that w happens before w' and w' happens before r. That is, each read must observe a value written by a preceding or concurrent write.
The happen-before relationship here is defined in the memory model in the previous section.
The result of a racy read from a larger memory location is unspecified, but it is definitely not undefined as in the realm of C++.
Reads of memory locations larger than a single machine word are encouraged but not required to meet the same semantics as word-sized memory locations, observing a single allowed write w. For performance reasons, implementations may instead treat larger operations as a set of individual machine-word-sized operations in an unspecified order. This means that races on multiword data structures can lead to inconsistent values not corresponding to a single write. When the values depend on the consistency of internal (pointer, length) or (pointer, type) pairs, as can be the case for interface values, maps, slices, and strings in most Go implementations, such races can in turn lead to arbitrary memory corruption.

Does the goroutines created in the same goroutines execute always in order?

package main
func main() {
c:=make(chan int)
for i:=0; i<=100;i++ {
i:=i
go func() {
c<-i
}()
}
for {
b:=<-c
println(b)
if b==100 {
break
}
}
}
The above code created 100 goroutines to insert num to channel c, so I just wonder that, will these goroutines execute in random orders? During my test, the output will always be 1 to 100
No, they are not guaranteed to run in order. With GOMAXPROCS=1 (the default) they appear to, but this is not guaranteed by the language spec.
And when I run your program with GOMAXPROCS=6, the output is non-deterministic:
$ GOMAXPROCS=6 ./test
2
0
1
4
3
5
6
7
8
9
...
On another run, the output was slightly different.
If you want a set of sends on a channel to happen in order, the best solution would be to perform them from the same goroutine.
What you observe as "random" behaviour is, more strictly, non-deterministic behaviour.
To understand what is happening here, think about the behaviour of the channel. In this case, it has many goroutines trying to write into the channel, and just one goroutine reading out of the channel.
The reading process is simply sequential and we can disregard it.
There are many concurrent writing processes and they are competing to access a shared resource (the channel). The channel has to make choices about which message it will accept.
When a Communicating Sequential Process (CSP) network makes a choice, it introduces non-determinism. In Go, there are two ways that this kind of choice happens:
concurrent access to one of the ends of a channel, and
select statements.
Your case is the first of these.
CSP is an algebra that allows concurrent behaviours to be analysed and understood. A seminal publication on this is Roscoe and Hoare "The Laws of Occam Programming" https://www.cs.ox.ac.uk/files/3376/PRG53.pdf (similar ideas apply to Go also, although there are small differences).
Surprisingly, the concurrent execution of goroutines is fully deterministic. It's only when choices are made that non-determinism comes in.

lock free programming - c++ atomic

I am trying to develop the following lock free code (c++11):
int val_max;
std::array<std::atomic<int>, 255> vector;
if (vector[i] > val_max) {
val_max =vector[i];
}
The problem is that when there are many threads (128 threads), the result is not correct, because if for example val_max=1, and three threads ( with vector[i] = 5, 15 and 20) will execute the code in the next line, and it will be a data race..
So, I don't know the best way to solve the problem with the available functions in c++11 (I cannot use mutex, locks), protecting the whole code.
Any suggestions? Thanks in advance!
You need to describe the bigger problem you need to solve, and why you want multiple threads for this.
If you have a lot of data and want to find the maximum, and you want to split that problem, you're doing it wrong. If all threads try to access a shared maximum, not only is it hard to get right, but by the time you have it right, you have fully serialized accesses, thus making the entire thing an exercise in adding complexity and thread overhead to a serial program.
The correct way to make it parallel is to give every thread a chunk of the array to work on (and the array members are not atomics), for which the thread calculates a local maximum, then when all threads are done have one thread find the maximum of the individual results.
Do an atomic fetch of val_max.
If the value fetched is greater than or equal to vector[i], stop, you are done.
Do an atomic compare exchange -- compare val_max to the value you read in step 1 and exchange it for the value of vector[i] if it compares.
If the compare succeeded, stop, you are done.
Go to step 1, you raced with another thread that made forward progress.

Or-equals on constant as reduction operation (ex. value |= 1 ) thread-safe?

Let's say that I have a variable x.
x = 0
I then spawn some number of threads, and each of them may or may not run the following expression WITHOUT the use of atomics.
x |= 1
After all threads have joined with my main thread, the main thread branches on the value.
if(x) { ... } else { ... }
Is it possible for there to be a race condition in this situation? My thoughts say no, because it doesn't seem to matter whether or not a thread is interrupted by another thread between reading and writing 'x' (in both cases, either 'x == 1', or 'x == 1'). That said, I want to make sure I'm not missing something stupid obvious or ridiculously subtle.
Also, if you happen to provide an answer to the contrary, please provide an instruction-by-instruction example!
Context:
I'm trying to, in OpenCL, have my threads indicate the presence or absence of a feature among any of their work-items. If any of the threads indicate the presence of the feature, my host ought to be able to branch on the result. I'm thinking of using the above method. If you guys have a better suggestion, that works too!
Detail:
I'm trying to add early-exit to my OpenCL radix-sort implementation, to skip radix passes if the data is banded (i.e. 'x' above would be x[RADIX] and I'd have all work groups, right after partial reduction of the data, indicate presence or absence of elements in the RADIX bins via 'x').
It may work within a work-group. You will need to insert a barrier before testing x. I'm not sure it will be faster than using atomic increments.
It will not work across several work-groups. Imagine you have 1000 work-groups to run on 20 cores. Typically, only a small number of work-groups can be resident on a single core, for example 4, meaning only 80 work-groups can be in flight inside the GPU at a given time. Once a work-group is done executing, it is retired, and another one is started. Halting a kernel in the middle of execution to wait for all 1000 work-groups to reach the same point is impossible.

Performance difference between iterating once and iterating twice?

Consider something like...
for (int i = 0; i < test.size(); ++i) {
test[i].foo();
test[i].bar();
}
Now consider..
for (int i = 0; i < test.size(); ++i) {
test[i].foo();
}
for (int i = 0; i < test.size(); ++i) {
test[i].bar();
}
Is there a large difference in time spent between these two? I.e. what is the cost of the actual iteration? It seems like the only real operations you are repeating are an increment and a comparison (though I suppose this would become significant for a very large n). Am I missing something?
First, as noted above, if your compiler can't optimize the size() method out so it's just called once, or is nothing more than a single read (no function call overhead), then it will hurt.
There is a second effect you may want to be concerned with, though. If your container size is large enough, then the first case will perform faster. This is because, when it gets to test[i].bar(), test[i] will be cached. The second case, with split loops, will thrash the cache, since test[i] will always need to be reloaded from main memory for each function.
Worse, if your container (std::vector, I'm guessing) has so many items that it won't all fit in memory, and some of it has to live in swap on your disk, then the difference will be huge as you have to load things in from disk twice.
However, there is one final thing that you have to consider: all this only makes a difference if there is no order dependency between the function calls (really, between different objects in the container). Because, if you work it out, the first case does:
test[0].foo();
test[0].bar();
test[1].foo();
test[1].bar();
test[2].foo();
test[2].bar();
// ...
test[test.size()-1].foo();
test[test.size()-1].bar();
while the second does:
test[0].foo();
test[1].foo();
test[2].foo();
// ...
test[test.size()-1].foo();
test[0].bar();
test[1].bar();
test[2].bar();
// ...
test[test.size()-1].bar();
So if your bar() assumes that all foo()'s have run, you will break it if you change the second case to the first. Likewise, if bar() assumes that foo() has not been run on later objects, then moving from the second case to the first will break your code.
So be careful and document what you do.
There are many aspects in such comparison.
First, complexity for both options is O(n), so difference isn't very big anyway. I mean, you must not care about it if you write quite big and complex program with a large n and "heavy" operations .foo() and bar(). So, you must care about it only in case of very small simple programs (this is kind of programs for embedded devices, for example).
Second, it will depend on programming language and compiler. I'm assured that, for instance, most of C++ compilers will optimize your second option to produce same code as for the first one.
Third, if compiler haven't optimized your code, performance difference will heavily depend on the target processor. Consider loop in a term of assembly commands - it will look something like this (pseudo assembly language):
LABEL L1:
do this ;; some commands
call that
IF condition
goto L1
;; some more instructions, ELSE part
I.e. every loop passage is just IF statement. But modern processors don't like IF. This is because processors may rearrange instructions to execute them beforehand or just to avoid idles. With the IF (in fact, conditional goto or jump) instructions, processors do not know if they may rearrange operation or not.
There's also a mechanism called branch predictor. From material of Wikipedia:
branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure.
This "soften" effect of IF's, through if the predictor's guess is wrong, no optimization will be performed.
So, you can see that there's a big amount of conditions for both your options: target language and compiler, target machine, it's processor and branch predictor. This all makes very complex system, and you cannot foresee what exact result you will get. I believe, that if you don't deal with embedded systems or something like that, the best solution is just to use the form which your are more comfortable with.
For your examples you have the additional concern of how expensive .size() is, since it's compared for each time i increments in most languages.
How expensive is it? Well that depends, it's certainly all relative. If .foo() and .bar() are expensive, the cost of the actual iteration is probably minuscule in comparison. If they're pretty lightweight, then it'll be a larger percentage of your execution time. If you want to know about a particular case test it, this is the only way to be sure about your specific scenario.
Personally, I'd go with the single iteration to be on the cheap side for sure (unless you need the .foo() calls to happen before the .bar() calls).
I assume .size() will be constant. Otherwise, the first code example might not give the same as the second one.
Most compilers would probably store .size() in a variable before the loop starts, so the .size() time will be cut down.
Therefore the time of the stuff inside the two for loops will be the same, but the other part will be twice as much.
Performance tag, right.
As long as you are concentrating on the "cost" of this or that minor code segment, you are oblivious to the bigger picture (isolation); and your intention is to justify something that, at a higher level (outside your isolated context), is simply bad practice, and breaks guidelines. The question is too low level and therefore too isolated. A system or program which is set of integrated components will perform much better that a collection of isolated components.
The fact that this or that isolated component (work inside the loop) is fast or faster is irrelevant when the loop itself is repeated unnecessarily, and which would therefore take twice the time.
Given that you have one family car (CPU), why on Earth would you:
sit at home and send your wife out to do her shopping
wait until she returns
take the car, go out and do your shopping
leaving her to wait until you return
If it needs to be stated, you would spend (a) almost half of your hard-earned resources executing one trip and shopping at the same time and (b) have those resources available to have fun together when you get home.
It has nothing to do with the price of petrol at 9:00 on a Saturday, or the time it takes to grind coffee at the café, or cost of each iteration.
Yes, there is a large diff in the time and the resources used. But the cost is not merely in the overhead per iteration; it is in the overall cost of the one organised trip vs the two serial trips.
Performance is about architecture; never doing anything twice (that you can do once), which are the higher levels of organisation; integrated of the parts that make up the whole. It is not about counting pennies at the bowser or cycles per iteration; those are lower orders of organisation; which ajust a collection of fragmented parts (not a systemic whole).
Masseratis cannot get through traffic jams any faster than station wagons.

Resources