Is there a difference in Go between a counter using atomic operations and one using a mutex?

I have seen some discussion lately about whether there is a difference between a counter implemented using atomic increment/load, and one using a mutex to synchronise increment/load.
Are the following counter implementations functionally equivalent?
type Counter interface {
    Inc()
    Load() int64
}

// Atomic Implementation
type AtomicCounter struct {
    counter int64
}

func (c *AtomicCounter) Inc() {
    atomic.AddInt64(&c.counter, 1)
}

func (c *AtomicCounter) Load() int64 {
    return atomic.LoadInt64(&c.counter)
}

// Mutex Implementation
type MutexCounter struct {
    counter int64
    lock    sync.Mutex
}

func (c *MutexCounter) Inc() {
    c.lock.Lock()
    defer c.lock.Unlock()
    c.counter++
}

func (c *MutexCounter) Load() int64 {
    c.lock.Lock()
    defer c.lock.Unlock()
    return c.counter
}
I have run a bunch of test cases (Playground Link) and haven't been able to observe any difference in behaviour. Running the tests on my machine, the numbers get printed out of order for all the PrintAll test functions.
Can someone confirm whether they are equivalent or if there are any edge cases where these are different? Is there a preference to use one technique over the other? The atomic documentation does say it should only be used in special cases.
Update:
The original question that prompted me to ask this was this one; however, it is now on hold, and I feel this aspect deserves its own discussion. In the answers it seemed that a mutex would guarantee correct results, whereas atomics might not, specifically if the program is running on multiple threads. My questions are:
Is it correct that they can produce different results? (See the update below: the answer is yes.)
What causes this behaviour?
What are the tradeoffs between the two approaches?
Another Update:
I've found some code where the two counters behave differently. When run on my machine this function will finish with MutexCounter, but not with AtomicCounter. Don't ask me why you would ever run this code:
func TestCounter(counter Counter) {
    end := make(chan interface{})
    for i := 0; i < 1000; i++ {
        go func() {
            r := rand.New(rand.NewSource(time.Now().UnixNano()))
            for j := 0; j < 10000; j++ {
                k := int64(r.Uint32())
                if k >= 0 {
                    counter.Inc()
                }
            }
        }()
    }
    go func() {
        prevValue := int64(0)
        for counter.Load() != 10000000 { // Sometimes this condition is never met with AtomicCounter.
            val := counter.Load()
            if val%1000000 == 0 && val != prevValue {
                prevValue = val
            }
        }
        end <- true
        fmt.Println("Count:", counter.Load())
    }()
    <-end
}

There is no difference in behavior. There is a difference in performance.
Mutexes are slow, due to the setup and teardown, and due to the fact that they block other goroutines for the duration of the lock.
Atomic operations are fast because they use an atomic CPU instruction where possible, rather than relying on an external lock.
Therefore, whenever it is feasible, atomic operations should be preferred.
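
If you want to quantify that difference rather than take it on faith, here is a minimal benchmark sketch (hypothetical, not from the question) comparing the two increment strategies under contention:

package counters

import (
    "sync"
    "sync/atomic"
    "testing"
)

// BenchmarkAtomicInc increments a shared int64 with sync/atomic.
func BenchmarkAtomicInc(b *testing.B) {
    var n int64
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            atomic.AddInt64(&n, 1)
        }
    })
}

// BenchmarkMutexInc increments a shared int64 under a sync.Mutex.
func BenchmarkMutexInc(b *testing.B) {
    var mu sync.Mutex
    var n int64
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            mu.Lock()
            n++
            mu.Unlock()
        }
    })
}

Run it with go test -bench=. ; the atomic version typically wins, and the gap tends to grow with the number of contending goroutines.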

Alright, I'm going to attempt to self-answer for some closure. Edits are welcome.
There is some discussion about the atomic package here. But to quote the most telling comments:
The very short summary is that if you have to ask, you should probably
avoid the package. Or, read the atomic operations chapter of the
C++11 standard; if you understand how to use those operations safely
in C++, then you are more than capable of using Go's sync/atomic
package.
That said, sticking to atomic.AddInt32 and atomic.LoadInt32 is safe as
long as you are just reporting statistical information, and not
actually relying on the values carrying any meaning about the state of
the different goroutines.
And:
What atomicity does not guarantee, is any ordering of
observability of values. I mean, atomic.AddInt32() does only guarantee
that what this operation stores at &cnt will be exactly *cnt + 1 (with
the value of *cnt being what the CPU executing the active goroutine
fetched from memory when the operation started); it does not provide any
guarantee that if another goroutine will attempt to read this value at
the same time it will fetch that same value *cnt + 1.
On the other hand, mutexes and channels guarantee strict ordering
of accesses to values being shared/passed around (subject to the rules
of Go memory model).
In regards to why the code sample in the question never finishes: the func that reads the counter is in a very tight loop. When using the atomic counter there are no synchronisation events (e.g. mutex lock/unlock, syscalls), which means the goroutine never yields control. The result is that this goroutine starves the thread it is running on and prevents the scheduler from allocating time to any other goroutines assigned to that thread, including the ones that increment the counter, so the counter never reaches 10000000.
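
To illustrate the starvation point, here is a self-contained sketch (my construction, not the question's code). Pinned to one OS thread, the spinning reader only makes progress if it yields; note that Go 1.14 and later preempt tight loops asynchronously, so this mainly bites on older runtimes:

package main

import (
    "fmt"
    "runtime"
    "sync/atomic"
)

func main() {
    runtime.GOMAXPROCS(1) // one thread, as in the starvation scenario

    var n int64
    go func() {
        for i := 0; i < 1000; i++ {
            atomic.AddInt64(&n, 1)
        }
    }()

    for atomic.LoadInt64(&n) != 1000 {
        runtime.Gosched() // yield; without this, an old runtime could spin here forever
    }
    fmt.Println("Count:", atomic.LoadInt64(&n))
}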

Atomics are faster in the common case: the compiler translates each call to a function from the sync/atomic package into a special set of machine instructions that operate essentially at the CPU level. For instance, on x86 architectures, atomic.AddInt64 is translated to a plain ADD-class instruction carrying the LOCK prefix (see this for an example), with the prefix ensuring a coherent view of the updated memory location across all the CPUs in the system.
A mutex is a much more complicated thing, as it ultimately wraps some bit of the OS-specific native thread synchronization API (for instance, on Linux, that's futex).
On the other hand, the Go runtime is pretty much optimized when
it comes to synchronization stuff (which is kinda expected —
given one of the main selling points of Go), and the mutex implementation
tries to avoid hitting the kernel to perform synchronization
between goroutines, if possible, and carry it out completely in
the Go runtime itself.
This might explain no noticeable difference in the timings
in your benchmarks, provided the contention over the mutexes
was reasonably low.
Still, I feel obliged to note, just in case, that atomics and higher-level synchronization facilities are designed to solve different tasks. Say, you can't use atomics to protect some memory state during the execution of a whole function, or, in the general case, even of a single statement.

Related

Is it safe to write to on-stack variables from different go routine blocking current one with WaitGroup?

There are various task executors with different properties, and some of them only support non-blocking calls. So I was wondering whether there's a need to use a mutex/channel to safely deliver task results to the calling goroutine, or whether a simple WaitGroup is enough.
For the sake of simplicity, and to keep the question specific, here is an example using a very naive task executor that launches the function directly as a goroutine:
func TestRace(t *testing.T) {
    var wg sync.WaitGroup
    a, b := 1, 2
    wg.Add(1)
    // this func would be passed to real executor
    go func() {
        a, b = a+1, b+1
        wg.Done()
    }()
    wg.Wait()
    assert.Equal(t, a, 2)
    assert.Equal(t, b, 3)
}
Execution of the test above with the -race option didn't fail on my machine. However, is that enough of a guarantee? What if the goroutine is executed on a different CPU core, on a different CPU core complex (AMD CCX), or on a different CPU in a multi-socket setup?
So, the question is, can I use WaitGroup to provide synchronization (block and return values) for non-blocking executors?
JimB should perhaps provide this as the answer, but I'll copy it from his comments, starting with this one:
The WaitGroup here is to ensure that a, b = a+1, b+1 has executed, so there's no reason to assume it hasn't.
[and]
[T]he guarantees you have are laid out by the go memory model, which is well documented [here]. [Specifically, the combination of wg.Done() and wg.Wait() in the example suffices to guarantee non-racy access to the two variables a and b.]
As long as this question exists, it's probably a good idea to copy Adrian's comment too:
As @JimB noted, if a value is shared between goroutines, it cannot be stack-allocated, so the question is moot (see How are Go closures layed out in memory?). WaitGroup works correctly.
The fact that closure variables are heap-allocated is an implementation detail: it might not be true in the future. But the sync.WaitGroup guarantee will still be true in the future, even if some clever future Go compiler is able to keep those variables on some stack.
("Which stack?" is another question entirely, but one for the hypothetical future clever Go compiler to answer. The WaitGroup and memory model provide the rules.)

Do goroutines created in the same goroutine always execute in order?

package main

func main() {
    c := make(chan int)
    for i := 0; i <= 100; i++ {
        i := i
        go func() {
            c <- i
        }()
    }
    for {
        b := <-c
        println(b)
        if b == 100 {
            break
        }
    }
}
The above code creates 101 goroutines, each sending its number on channel c, so I just wonder: will these goroutines execute in random order? During my tests, the output was always 0 to 100, in order.
No, they are not guaranteed to run in order. With GOMAXPROCS=1 (the default) they appear to, but this is not guaranteed by the language spec.
And when I run your program with GOMAXPROCS=6, the output is non-deterministic:
$ GOMAXPROCS=6 ./test
2
0
1
4
3
5
6
7
8
9
...
On another run, the output was slightly different.
If you want a set of sends on a channel to happen in order, the best solution would be to perform them from the same goroutine.
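A minimal sketch of that single-sender version of your program:

package main

func main() {
    c := make(chan int)
    go func() {
        for i := 0; i <= 100; i++ {
            c <- i // one sender: the sends happen in program order
        }
        close(c)
    }()
    for b := range c {
        println(b) // prints 0 to 100, in order, every run
    }
}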
What you observe as "random" behaviour is, more strictly, non-deterministic behaviour.
To understand what is happening here, think about the behaviour of the channel. In this case, it has many goroutines trying to write into the channel, and just one goroutine reading out of the channel.
The reading process is simply sequential and we can disregard it.
There are many concurrent writing processes and they are competing to access a shared resource (the channel). The channel has to make choices about which message it will accept.
When a Communicating Sequential Process (CSP) network makes a choice, it introduces non-determinism. In Go, there are two ways that this kind of choice happens:
concurrent access to one of the ends of a channel, and
select statements.
Your case is the first of these.
CSP is an algebra that allows concurrent behaviours to be analysed and understood. A seminal publication on this is Roscoe and Hoare "The Laws of Occam Programming" https://www.cs.ox.ac.uk/files/3376/PRG53.pdf (similar ideas apply to Go also, although there are small differences).
Surprisingly, the concurrent execution of goroutines is fully deterministic. It's only when choices are made that non-determinism comes in.
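
The select flavour of choice is easy to demonstrate with a toy sketch: when more than one case is ready, the spec says one is chosen pseudo-randomly:

package main

import "fmt"

func main() {
    a := make(chan string, 1)
    b := make(chan string, 1)
    a <- "from a"
    b <- "from b"
    // Both cases are ready, so the choice below is non-deterministic.
    select {
    case msg := <-a:
        fmt.Println(msg)
    case msg := <-b:
        fmt.Println(msg)
    }
}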

go atomic Load and Store

func resetElectionTimeoutMS(newMin, newMax int) (int, int) {
    oldMin := atomic.LoadInt32(&MinimumElectionTimeoutMS)
    oldMax := atomic.LoadInt32(&maximumElectionTimeoutMS)
    atomic.StoreInt32(&MinimumElectionTimeoutMS, int32(newMin))
    atomic.StoreInt32(&maximumElectionTimeoutMS, int32(newMax))
    return int(oldMin), int(oldMax)
}
I came across a Go function like this.
The thing I am confused about is: why do we need atomic here? What is this preventing?
Thanks.
Atomic functions complete a task in an isolated way where all parts of the task appear to happen instantaneously or don't happen at all.
In this case, LoadInt32 and StoreInt32 ensure that an integer is stored and retrieved in a way where someone loading won't see a partial store. However, both sides need to use the atomic functions for this to work correctly. The raft example appears incorrect for at least two reasons.
Two atomic functions do not act as one atomic function, so reading the old and setting the new in two lines is a race condition. You may read, then someone else sets, then you set and you are returning false information for the previous value before you set it.
Not everyone accessing MinimumElectionTimeoutMS is using atomic operations. This means that the use of atomics in this function is effectively useless.
How would this be fixed?
func resetElectionTimeoutMS(newMin, newMax int) (int, int) {
    oldMin := atomic.SwapInt32(&MinimumElectionTimeoutMS, int32(newMin))
    oldMax := atomic.SwapInt32(&maximumElectionTimeoutMS, int32(newMax))
    return int(oldMin), int(oldMax)
}
This would ensure that oldMin is the minimum that existed before the swap. However, the function as a whole is still not atomic: the returned oldMin and oldMax may come from two different calls, a pair that was never actually passed to resetElectionTimeoutMS together. For that... just use locks.
Each function would also need to be changed to do an atomic load:
func minimumElectionTimeout() time.Duration {
    min := atomic.LoadInt32(&MinimumElectionTimeoutMS)
    return time.Duration(min) * time.Millisecond
}
I recommend you carefully consider the quote VonC mentioned from the golang atomic documentation:
These functions require great care to be used correctly. Except for special, low-level applications, synchronization is better done with channels or the facilities of the sync package.
If you want to understand atomic operations, I recommend you start with http://preshing.com/20130618/atomic-vs-non-atomic-operations/. That covers the load and store operations used in your example. However, there are other uses for atomics: the go atomic package overview goes over some cool stuff like atomic swapping (the example I gave), compare-and-swap (known as CAS), and adding.
A funny quote from the link I gave you:
it’s well-known that on x86, a 32-bit mov instruction is atomic if the memory operand is naturally aligned, but non-atomic otherwise. In other words, atomicity is only guaranteed when the 32-bit integer is located at an address which is an exact multiple of 4.
In other words, on common systems today, the atomic load and store functions used in your example compile down to ordinary loads and stores: they are already atomic. (This is not guaranteed everywhere, though; if you need an operation to be atomic, it is better to say so explicitly.)
Considering that the package atomic provides low-level atomic memory primitives useful for implementing synchronization algorithms, I suppose it was intended to be used as:
MinimumElectionTimeoutMS isn't modified while being stored in oldMin
MinimumElectionTimeoutMS isn't modified while being set to a new value newMin.
But, the package does come with the warning:
These functions require great care to be used correctly.
Except for special, low-level applications, synchronization is better done with channels or the facilities of the sync package.
Share memory by communicating; don't communicate by sharing memory.
In this case (server.go from the Raft distributed consensus protocol), synchronizing directly on the variables might have been deemed faster than putting a Mutex around the whole function.
Except, as Stephen Weinberg's answer illustrates (upvoted), this isn't how you use atomic. It only makes sure that oldMin is accurate while doing the swap.
See another example at "Is the two atomic style code in sync/atomic.once.go necessary?", in relation with the "memory model".
OneOfOne mentions in the comments using atomic CAS as a spinlock (very fast locking):
BenchmarkSpinL-8   2000    708494 ns/op   32315 B/op   2001 allocs/op
BenchmarkMutex-8   1000   1225260 ns/op   78027 B/op   2259 allocs/op
See:
sync/spinlock.go
sync/spinlock_test.go
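In case those links rot, the core idea fits in a few lines. A toy sketch (mine, not the linked code): compare-and-swap flips a flag from 0 to 1, and a loser spins until it succeeds:

package main

import (
    "fmt"
    "runtime"
    "sync"
    "sync/atomic"
)

// SpinLock is a CAS-based toy lock: 0 means unlocked, 1 means locked.
type SpinLock int32

func (l *SpinLock) Lock() {
    for !atomic.CompareAndSwapInt32((*int32)(l), 0, 1) {
        runtime.Gosched() // yield while spinning instead of burning the thread
    }
}

func (l *SpinLock) Unlock() {
    atomic.StoreInt32((*int32)(l), 0)
}

func main() {
    var (
        lock SpinLock
        wg   sync.WaitGroup
        n    int
    )
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            lock.Lock()
            n++ // protected by the spinlock
            lock.Unlock()
        }()
    }
    wg.Wait()
    fmt.Println(n) // 100
}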

How do goroutines behave on a multi-core processor

I am a newbie to the Go language, so please excuse me if my question is very basic. I have written some very simple code:
func main() {
    var count int // Default 0
    cptr := &count
    go incr(cptr)
    time.Sleep(100)
    fmt.Println(*cptr)
}

// Increments the value of count through pointer var
func incr(cptr *int) {
    for i := 0; i < 1000; i++ {
        go func() {
            fmt.Println(*cptr)
            *cptr = *cptr + 1
        }()
    }
}
The value of count should increase by one each time the loop body runs. Consider the cases:
Loop runs 100 times --> value of count is 100 (which is correct, as the loop runs 100 times).
Loop runs more than ~510 times --> value of count is either 508 or 510. This happens even if it runs 100000 times.
I am running this on an 8 core processor machine.
First of all: prior to Go 1.5, the runtime ran goroutines on a single processor, using multiple threads only for blocking system calls, unless you told it to use more processors with GOMAXPROCS. As of Go 1.5, GOMAXPROCS is set to the number of CPUs by default.
Also, the operation *cptr = *cptr + 1 is not guaranteed to be atomic. If you look carefully, it can be split up into 3 operations: fetch old value by dereferencing pointer, increment value, save value into pointer address.
The fact that you're getting 508/510 is due to some magic in the runtime and not defined to stay that way. More information on the behaviour of operations with concurrency can be found in the Go memory model.
You're probably getting the correct values when fewer than about 510 goroutines are started because, below that number, none of them is (yet) being interrupted.
Generally, what you're trying to do is not advisable in any language, nor is it the "Go" way to do concurrency. A very good example of using channels to synchronize is this code walk: Share Memory By Communicating (rather than communicating by sharing memory).
Here is a little example to show you what I mean: use a channel with a buffer of 1 to store the current number, fetch it from the channel when you need it, change it at will, then put it back for others to use.
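A sketch of that pattern (my example, not the linked code walk); the one-slot channel acts as the owner of the value, so only the goroutine currently holding the value can touch it:

package main

import "fmt"

func main() {
    state := make(chan int, 1) // buffer of 1: the channel holds the current value
    state <- 0

    done := make(chan bool)
    for i := 0; i < 1000; i++ {
        go func() {
            v := <-state   // take the value; nobody else can touch it now
            state <- v + 1 // put the updated value back
            done <- true
        }()
    }
    for i := 0; i < 1000; i++ {
        <-done
    }
    fmt.Println(<-state) // always 1000
}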
Your code is racy: you write to the same memory location from different, unsynchronized goroutines without any locking. The result is basically undefined. You must either a) make sure that the goroutines write one after another in a nice, ordered way, b) protect each write with e.g. a mutex, or c) use atomic operations.
If you write such code: always try it under the race detector, e.g. $ go run -race main.go, and fix all races.
A nice alternative to using channels in this case might be the sync/atomic package, which contains specifically functions for atomically incrementing/decrementing numbers.
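Applied to the code in the question, a fixed sketch (mine) using sync/atomic, with a WaitGroup instead of Sleep so the final read happens after every increment:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var count int64
    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            atomic.AddInt64(&count, 1) // one indivisible read-modify-write
        }()
    }
    wg.Wait()
    fmt.Println(atomic.LoadInt64(&count)) // always 1000
}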
You are spawning 500 or 1000 routines without synchronization among them. This creates a race condition, which makes the result unpredictable.
Imagine You are working in an office to account for expense balance for your boss.
Your boss was in a meeting with 1000 of his subordinates, and in this meeting he said, "I will pay you all a bonus, so we will have to increment the record of our expenses." He issued the following command:
i) Go to Nerve and ask him/her what the current expense balance is.
ii) Call my secretary to ask how much bonus you will receive.
iii) Add your bonus as an expense to the expense balance. Do the math yourself.
iv) Ask Nerve to write the new expense balance on my authority.
All of the 1000 eager participants rushed to record their bonus and created a race condition.
Say 50 of the eager gophers hit Nerve at (almost) the same time. They ask:
i) "What is the current expense balance?"
-- Nerve says $1000 to all 50 gophers, as they all asked the same question at (almost) the same time, while the balance was $1000.
ii) The gophers then call the secretary: "How much bonus should be paid to me?"
The secretary answers, "just $1, to be fair".
iii) The boss said to do the math, so they all calculate $1000 + $1 = $1001 as the new expense balance for the company.
iv) They all ask Nerve to write $1001 back as the balance.
Do you see the problem in the eager gophers' method?
50 units of computation were done, each adding $1 to the existing balance, yet the expense increased by only $1, not by $50.
Hope that clarifies the problem. As for solutions, other contributors have given very good ones that I believe are sufficient.
All of those approaches failed for me (noobie here), but I have found a better way: http://play.golang.org/p/OcMsuUpv2g
I'm using the sync package to solve the problem and wait for all goroutines to finish, without Sleep or a channel.
And don't forget to take a look at this awesome post: http://devs.cloudimmunity.com/gotchas-and-common-mistakes-in-go-golang/

Speedup problems with Go

I wrote a very simple program in Go to test the performance of a parallel program: it factorizes a big semiprime number by trial division. Since no communication is involved, I expected an almost perfect speedup. However, the program seems to scale very badly.
I timed the program with 1, 2, 4, and 8 processes, running on an 8-core (real, not HT) computer, using the system time command. The number I factorized is "28808539627864609". Here are my results:
cores   time (sec)   speedup
    1      60.0153      1.00
    2      47.358       1.27
    4      34.459       1.75
    8      28.686       2.10
How to explain such bad speedups? Is it a bug in my program, or is it a problem with go runtime? How could I get better performances? I'm not talking about the algorithm by itself (I know there are better algorithms to factorize semiprime numbers), but about the way I parallelized it.
Here is the source code of my program:
package main

import (
    "big"
    "flag"
    "fmt"
    "runtime"
)

func factorize(n *big.Int, start int, step int, c chan *big.Int) {
    var m big.Int
    i := big.NewInt(int64(start))
    s := big.NewInt(int64(step))
    z := big.NewInt(0)
    for {
        m.Mod(n, i)
        if m.Cmp(z) == 0 {
            c <- i
        }
        i.Add(i, s)
    }
}

func main() {
    var np *int = flag.Int("n", 1, "Number of processes")
    flag.Parse()
    runtime.GOMAXPROCS(*np)
    var n big.Int
    n.SetString(flag.Arg(0), 10) // Uses number given on command line
    c := make(chan *big.Int)
    for i := 0; i < *np; i++ {
        go factorize(&n, 2+i, *np, c)
    }
    fmt.Println(<-c)
}
EDIT
The problem really seems to be related to the Mod function. Replacing it with Rem gives better, but still imperfect, performance and speedup. Replacing it with QuoRem gives 3 times faster performance and a perfect speedup. Conclusion: it seems that memory allocation kills parallel performance in Go. Why? Do you have any references about this?
Big.Int methods generally have to allocate memory, usually to hold the result of the computation. The problem is that there is just one heap and all memory operations are serialized. In this program, the numbers are fairly small and the (parallelizable) computation time needed for things like Mod and Add is small compared to the non-parallelizable operations of repeatedly allocating all the tiny little bits of memory.
As far as speeding it up, there is the obvious answer of don't use big.Ints if you don't have to. Your example number happens to fit in 64 bits. If you plan on working with really big big numbers though, the problem will kind of go away on its own. You will spend much more time doing computations, and the time spent in the heap will be relatively much less.
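For illustration, here is a sketch (mine, in current Go style rather than the r60-era API) of the same search using plain uint64 arithmetic; the hot loop is a single machine division with no heap traffic:

package main

import (
    "fmt"
    "runtime"
)

// factorize sends the first divisor of n it finds in start, start+step, ...
func factorize(n, start, step uint64, c chan<- uint64) {
    for i := start; i <= n; i += step {
        if n%i == 0 { // plain machine division: no allocation per iteration
            c <- i
            return
        }
    }
}

func main() {
    np := runtime.NumCPU()
    n := uint64(28808539627864609)
    c := make(chan uint64, np) // buffered so late finishers don't block
    for i := 0; i < np; i++ {
        go factorize(n, uint64(2+i), uint64(np), c)
    }
    fmt.Println(<-c) // first divisor reported by any worker
}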
There is a bug in your program, by the way, although it's not related to performance. When you find a factor and return the result on the channel, you send a pointer to the local variable i. This is fine, except that you don't break out of the loop then. The loop in the goroutine continues incrementing i and by the time the main goroutine gets around to fishing the pointer out of the channel and following it, the value is almost certain to be wrong.
After sending i through the channel, i should be replaced with a newly allocated big.Int:
if m.Cmp(z) == 0 {
    c <- i
    i = new(big.Int).Set(i)
}
This is necessary because there is no guarantee when fmt.Println will process the integer received on the line fmt.Println(<-c). It isn't unusual for fmt.Println to cause a goroutine switch, so if i isn't replaced with a newly allocated big.Int and the run-time switches back to executing the for-loop in function factorize, then the for-loop will overwrite i before it is printed - in which case the program won't print out the 1st integer sent through the channel.
The fact that fmt.Println can cause goroutine switching means that the for-loop in function factorize may potentially consume a lot of CPU time between the moment when the main goroutine receives from channel c and the moment when the main goroutine terminates. Something like this:
run factorize()
<-c in main()
call fmt.Println()
continue running factorize() // Unnecessary CPU time consumed
return from fmt.Println()
return from main() and terminate program
Another reason for the small multi-core speedup is memory allocation. The function (*Int).Mod is internally using (*Int).QuoRem and will create a new big.Int each time it is called. To avoid the memory allocation, use QuoRem directly:
func factorize(n *big.Int, start int, step int, c chan *big.Int) {
    var q, r big.Int
    i := big.NewInt(int64(start))
    s := big.NewInt(int64(step))
    z := big.NewInt(0)
    for {
        q.QuoRem(n, i, &r)
        if r.Cmp(z) == 0 {
            c <- i
            i = new(big.Int).Set(i)
        }
        i.Add(i, s)
    }
}
Unfortunately, the goroutine scheduler in Go release r60.3 contains a bug which prevents this code from using all CPU cores. When the program is started with -n=2 (GOMAXPROCS=2), the run-time will utilize only 1 thread.
The Go weekly release has a better run-time and can utilize 2 threads if -n=2 is passed to the program. This gives a speedup of approximately 1.9 on my machine.
Another potential contributing factor to multi-core slowdown has been mentioned in the answer by user "High Performance Mark". If the program is splitting the work into multiple sub-tasks and the result comes only from 1 sub-task, it means that the other sub-tasks may do some "extra work". Running the program with n>=2 may in total consume more CPU time than running the program with n=1.
To learn how much extra work is being done, you may want to (somehow) print out the values of all the i's in all goroutines at the moment the program is exiting the function main().
I don't read go so this is probably the answer to a question which is not what you asked. If so, downvote or delete as you wish.
If you were to make a plot of 'time to factorise integer n' against 'n' you would get a plot that goes up and down somewhat randomly. For any n you choose there will be an integer in the range 1..n that takes longest to factorise on one processor. If your parallelisation strategy is to distribute the n integers across p processors one of those processors will take at least the time to factorise the hardest integer, then the time to factorise the rest of its load.
Perhaps you've done something similar?
