Suppose I have this:
go func() {
    for range time.Tick(1 * time.Millisecond) {
        a, b = b, a
    }
}()
And elsewhere:
i := a // <-- Is this safe?
For this question, it's unimportant what the value of i is with respect to the original a or b. The only question is whether reading a is safe. That is, is it possible for a to be nil, partially assigned, invalid, undefined, ... anything other than a valid value?
I've tried to make it fail but so far it always succeeds (on my Mac).
I haven't been able to find anything specific beyond this quote in The Go Memory Model doc:
Reads and writes of values larger than a single machine word behave as
multiple machine-word-sized operations in an unspecified order.
Is this implying that a single machine word write is effectively atomic? And, if so, are function pointer writes in Go a single machine word operation?
Update: Here's a properly synchronized solution
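For illustration, a minimal mutex-based sketch of one such approach (not necessarily the linked solution; the mu, swap, and read names and the function type are assumptions):

import "sync"

var (
    mu   sync.Mutex
    a, b func() // the two swapped values; the exact type is an assumption
)

// swap replaces the racy a, b = b, a from the question.
func swap() {
    mu.Lock()
    a, b = b, a
    mu.Unlock()
}

// read replaces the racy i := a.
func read() func() {
    mu.Lock()
    defer mu.Unlock()
    return a
}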
Unsynchronized, concurrent access to any variable from multiple goroutines where at least one of them is a write is undefined behavior by The Go Memory Model.
Undefined means what it says: undefined. It may be that your program will work correctly, it may be it will work incorrectly. It may result in losing memory and type safety provided by the Go runtime (see example below). It may even crash your program. Or it may even cause the Earth to explode (probability of that is extremely small, maybe even less than 1e-40, but still...).
This undefined in your case means that yes, i may be nil, partially assigned, invalid, undefined, ... anything other than either a or b. This list is just a tiny subset of all the possible outcomes.
Stop thinking that some data races are (or may be) benign or unharmful. They can be the source of the worst things if left unattended.
Since your code writes to the variable a in one goroutine and reads it in another goroutine (which tries to assign its value to another variable i), it's a data race and as such it's not safe. It doesn't matter if in your tests it works "correctly". One could take your code as a starting point, extend / build on it and result in a catastrophe due to your initially "unharmful" data race.
As related questions, read How safe are Golang maps for concurrent Read/Write operations? and Incorrect synchronization in go lang.
Strongly recommended to read the blog post by Dmitry Vyukov: Benign data races: what could possibly go wrong?
Also a very interesting blog post which shows an example which breaks Go's memory safety with intentional data race: Golang data races to break memory safety
In terms of race conditions, it's not safe. In short, my understanding of a race condition is that it occurs when more than one asynchronous routine (coroutine, thread, process, goroutine, etc.) tries to access the same resource and at least one of the accesses is a write. In your example we have two goroutines reading and writing variables of function type; what matters from a concurrency point of view is that those variables occupy some memory, and both goroutines try to read or write that portion of memory.
Short answer: just run your example with the -race flag, using go run -race or go build -race, and you'll see the data race detected.
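For example, a self-contained version of the question's scenario (the concrete func() values and the main wiring are assumptions) that the detector flags immediately:

package main

import (
    "fmt"
    "time"
)

func main() {
    a := func() { fmt.Println("a") }
    b := func() { fmt.Println("b") }

    go func() {
        for range time.Tick(1 * time.Millisecond) {
            a, b = b, a // write side of the race
        }
    }()

    time.Sleep(5 * time.Millisecond)
    i := a // read side of the race; go run -race reports this line
    i()
}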
The answer to your question, as of today, is that if a and b are no larger than a machine word, i must be equal to a or b. Otherwise, it may contain an unspecified value, most likely an interleaving of parts of a and b.
The Go memory model, as of the June 6, 2022 revision, guarantees that even in a program containing a data race, an access to a memory location no larger than a machine word is effectively atomic:
Otherwise, a read r of a memory location x that is not larger than a machine word must observe some write w such that r does not happen before w and there is no write w' such that w happens before w' and w' happens before r. That is, each read must observe a value written by a preceding or concurrent write.
The happens-before relation used here is defined in the previous section of the memory model.
The result of a racy read from a larger memory location is unspecified, but it is definitely not undefined as in the realm of C++.
Reads of memory locations larger than a single machine word are encouraged but not required to meet the same semantics as word-sized memory locations, observing a single allowed write w. For performance reasons, implementations may instead treat larger operations as a set of individual machine-word-sized operations in an unspecified order. This means that races on multiword data structures can lead to inconsistent values not corresponding to a single write. When the values depend on the consistency of internal (pointer, length) or (pointer, type) pairs, as can be the case for interface values, maps, slices, and strings in most Go implementations, such races can in turn lead to arbitrary memory corruption.
Please don't mark this as a duplicate question; it is specific to Go and requests advice on best practice when declaring variables to store large byte arrays read from a channel.
Forgive me for this dumb question, but the reason for it is just my curiosity about the best practice for writing a high-performance stream consumer that reads large byte arrays from multiple channels (although premature optimization is the root of all evil, this is more of a curiosity). I have read answers about a similar scenario specific to C here, but I am asking specifically about Go, since it is a garbage-collected language and its documentation here says "From a correctness standpoint, you don't need to know where the variable is allocated".
If I have the following code to read from a channel,
for {
    select {
    case msg := <-stream.Messages():
        ...snip...
    }
}
Variable msg is within the scope of the case statement.
What happens once it goes out of the scope of the case statement? Since it is declared in the same function, and the messages can be large byte slices, will the variable be stored on the heap or the stack? If on the heap, will it be garbage collected, or does the stack pointer come into the picture?
Since this is inside an infinite for loop, and each message is a large byte slice, is creating the variable and allocating memory on every iteration an overhead? Or should I declare the variable ahead of time and keep overwriting it on every iteration, so that if garbage collection is involved (which I am not sure about), I can reduce the garbage?
Shouldn't I be bothered about it at all?
Thank you.
Shouldn't I be bothered about it at all?
No.
(And once it bothers you: profile.)
If the channel value type is a slice, the value of the variable msg is just the slice descriptor, which is small (see https://blog.golang.org/go-slices-usage-and-internals). The array that contains the data the slice refers to will have been allocated elsewhere before the slice was placed on the channel. Assuming the value must survive after the function that allocated it returns, it will be on the heap. Note that the contents of the slice are not actually being moved or copied by the channel receive operation.
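A small sketch of that point (the 1 MiB size and the channel wiring are made up for illustration): only the slice header crosses the channel, not the megabyte behind it.

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    big := make([]byte, 1<<20) // 1 MiB backing array, allocated once
    ch := make(chan []byte, 1)
    ch <- big // copies only the header (pointer, length, capacity)
    msg := <-ch
    fmt.Println(unsafe.Sizeof(msg), len(msg)) // 24 1048576 on 64-bit platforms
}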
Once the value of msg becomes unreachable (by the variable going out of scope or being assigned a different value), assuming there are no other references to the array underlying the slice, it will be subject to garbage collection.
It's hard to say whether some amount of optimization would be helpful without knowing more about how the program works.
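If it ever does bother you, a benchmark sketch along these lines (the buffer size and the process stand-in are hypothetical) is one way to measure whether reusing a buffer beats allocating per iteration; run it with go test -bench=. -benchmem:

package stream_test

import "testing"

var sink []byte // package-level sink so the buffers escape and allocations show up

func process(b []byte) { sink = b } // hypothetical stand-in for real work

func BenchmarkAllocPerIteration(b *testing.B) {
    for i := 0; i < b.N; i++ {
        buf := make([]byte, 64*1024) // fresh allocation every message
        process(buf)
    }
}

func BenchmarkReuseBuffer(b *testing.B) {
    buf := make([]byte, 64*1024) // allocated once, overwritten each iteration
    for i := 0; i < b.N; i++ {
        process(buf)
    }
}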
func resetElectionTimeoutMS(newMin, newMax int) (int, int) {
    oldMin := atomic.LoadInt32(&MinimumElectionTimeoutMS)
    oldMax := atomic.LoadInt32(&maximumElectionTimeoutMS)
    atomic.StoreInt32(&MinimumElectionTimeoutMS, int32(newMin))
    atomic.StoreInt32(&maximumElectionTimeoutMS, int32(newMax))
    return int(oldMin), int(oldMax)
}
I came across a Go function like this.
What I am confused about is: why do we need atomic here? What is it protecting against?
Thanks.
Atomic functions complete a task in an isolated way where all parts of the task appear to happen instantaneously or don't happen at all.
In this case, LoadInt32 and StoreInt32 ensure that an integer is stored and retrieved in a way where someone loading won't get a partial store. However, you need both sides to use atomic functions for this to function correctly. The raft example appears incorrect for at least two reasons.
Two atomic operations do not compose into one atomic operation, so reading the old value and setting the new one on two separate lines is a race condition: another goroutine may store a value between your load and your store, and the "old" value you return will then be stale.
Not everyone accessing MinimumElectionTimeoutMS is using atomic operations. This means that the use of atomics in this function is effectively useless.
How would this be fixed?
func resetElectionTimeoutMS(newMin, newMax int) (int, int) {
    oldMin := atomic.SwapInt32(&MinimumElectionTimeoutMS, int32(newMin))
    oldMax := atomic.SwapInt32(&maximumElectionTimeoutMS, int32(newMax))
    return int(oldMin), int(oldMax)
}
This ensures that oldMin is the minimum that existed just before the swap. However, the entire function is still not atomic: the final outcome could be an oldMin and oldMax pair that was never actually passed together to resetElectionTimeoutMS. For that... just use locks (a sketch follows below).
Each function would also need to be changed to do an atomic load:
func minimumElectionTimeout() time.Duration {
    min := atomic.LoadInt32(&MinimumElectionTimeoutMS)
    return time.Duration(min) * time.Millisecond
}
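And if you do go the lock route suggested above, a sketch might look like this (the electionTimeoutMu name is an assumption; every other access to these variables would have to hold the same lock):

var electionTimeoutMu sync.Mutex

func resetElectionTimeoutMS(newMin, newMax int) (int, int) {
    electionTimeoutMu.Lock()
    defer electionTimeoutMu.Unlock()
    oldMin := MinimumElectionTimeoutMS
    oldMax := maximumElectionTimeoutMS
    MinimumElectionTimeoutMS = int32(newMin)
    maximumElectionTimeoutMS = int32(newMax)
    return int(oldMin), int(oldMax)
}

This makes the whole function atomic: the returned pair is always one that actually existed together.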
I recommend you carefully consider the quote VonC mentioned from the golang atomic documentation:
These functions require great care to be used correctly. Except for special, low-level applications, synchronization is better done with channels or the facilities of the sync package.
If you want to understand atomic operations, I recommend you start with http://preshing.com/20130618/atomic-vs-non-atomic-operations/. That goes over the load and store operations used in your example. However, there are other uses for atomics. The go atomic package overview goes over some cool stuff like atomic swapping (the example I gave), compare and swap (known as CAS), and Adding.
A funny quote from the link I gave you:
it’s well-known that on x86, a 32-bit mov instruction is atomic if the memory operand is naturally aligned, but non-atomic otherwise. In other words, atomicity is only guaranteed when the 32-bit integer is located at an address which is an exact multiple of 4.
In other words, on common systems today, the atomic functions used in your example are effectively no-ops: the plain loads and stores are already atomic. That is not guaranteed, though; if you need an operation to be atomic, it is better to say so explicitly.
Considering that the atomic package provides low-level atomic memory primitives useful for implementing synchronization algorithms, I suppose it was intended to ensure that:
MinimumElectionTimeoutMS isn't modified while being stored in oldMin
MinimumElectionTimeoutMS isn't modified while being set to a new value newMin.
But, the package does come with the warning:
These functions require great care to be used correctly.
Except for special, low-level applications, synchronization is better done with channels or the facilities of the sync package.
Share memory by communicating; don't communicate by sharing memory.
In this case (server.go from a Raft distributed consensus protocol implementation), synchronizing directly on the variables might have been deemed faster than putting a Mutex around the whole function.
Except, as Stephen Weinberg's answer illustrates (upvoted), this isn't how you use atomic. It only makes sure that oldMin is accurate at the moment of the swap.
See another example at "Is the two atomic style code in sync/atomic.once.go necessary?", in relation with the "memory model".
OneOfOne mentions in the comments using atomic CAS as a spinlock (very fast locking):
BenchmarkSpinL-8 2000 708494 ns/op 32315 B/op 2001 allocs/op
BenchmarkMutex-8 1000 1225260 ns/op 78027 B/op 2259 allocs/op
See:
sync/spinlock.go
sync/spinlock_test.go
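For reference, a minimal spin-lock sketch in that style (illustrative only; the spinLock name is made up, and sync.Mutex is almost always the better choice):

import (
    "runtime"
    "sync/atomic"
)

type spinLock int32

func (l *spinLock) Lock() {
    for !atomic.CompareAndSwapInt32((*int32)(l), 0, 1) {
        runtime.Gosched() // yield instead of burning CPU while waiting
    }
}

func (l *spinLock) Unlock() {
    atomic.StoreInt32((*int32)(l), 0)
}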
I read that both nonatomic and atomic are thread-unsafe, but that nonatomic is faster because it allows asynchronous (unsynchronized) access, while atomic is slower because it synchronizes access.
An atomic property in Objective C guarantees that you will never see partial writes.
That is, if two threads concurrently write values A and B to the same variable X, then a concurrent read on that same variable will either yield the initial value of X, or A or B. With nonatomic that guarantee is no longer given. You may get any value, including values that you never explicitly wrote to that variable.
The reason for this is that with nonatomic, the reading thread may read the variable while another thread is in the middle of writing it. So part of what you read comes from the old value while another part comes from the new value.
The comment about them both being thread-unsafe refers to the fact that no additional guarantees are given beyond that. Apple's docs give the following example here:
Consider an XYZPerson object in which both a person’s first and last names are changed using atomic accessors from one thread. If another thread accesses both names at the same time, the atomic getter methods will return complete strings (without crashing), but there’s no guarantee that those values will be the right names relative to each other. If the first name is accessed before the change, but the last name is accessed after the change, you’ll end up with an inconsistent, mismatched pair of names.
A purist might argue that this definition of thread-safety is overly strict. Technically speaking, atomic already takes care of data races and ordering, which is all you need from a language designer's point of view.
From an application-logic point of view, on the other hand, the aforementioned first-name/last-name example clearly constitutes a bug. Additional synchronization is required to get rid of the undesired behavior. In this application-specific view the class XYZPerson is not thread-safe. But here we are talking about a different level of thread-safety than the one the language designer has in mind.
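The same pitfall can be sketched in Go (a hypothetical person type, with atomic.Value playing the role of the atomic accessors): each individual read is tear-free, yet a reader can still observe a mismatched pair.

import "sync/atomic"

type person struct {
    first, last atomic.Value // each holds a complete string
}

func (p *person) rename(first, last string) {
    p.first.Store(first)
    // a reader running right here sees the new first name with the old last name
    p.last.Store(last)
}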
I have a function that I use to look up a value based on an index. The value takes some time to calculate, so I want to compute it with ParallelMap. It references another, similar function that returns a list of expressions, also based on an index.
However, when I set it all up in a seemingly reasonable fashion, I see some very bizarre behaviour. First, the function appears to work, albeit very slowly. For large indices, however, the processor activity in Taskmangler stays entirely at zero for an extended period of time (i.e. 2-4 minutes), during which all instances of Mathematica are seemingly inert. Then, without the slightest blip of CPU use, a result appears. Is this another case of Mathematica spukhafte Fernwirkung ("spooky action at a distance")?
That is, I want to create a variable/function that stores an expression, here a list of integers (ListOfInts), and then on the parallel workers I want to perform some function on that expression (here I apply a set of replacement rules and take the Min). I want the result of that function to also be indexed by the same index under another variable/function (IndexedFunk), whose result is then available back on the main instance of Mathematica:
(*some arbitrary rules that will convert some of the integers to negative values:*)
rulez=Dispatch[Thread[Rule[Range[222],-Range[222]]]];
maxIndex = 333;
Clear[ListOfInts]
Scan[(ListOfInts[#]=RandomInteger[{1,999},55])&,Range[maxIndex]]
(*just for safety's sake:*)
DistributeDefinitions[rulez, ListOfInts]
Clear[IndexedFunk]
(*I believe I have to have at least one value of IndexedFunk defined before I Share the definition to the workers:*)
IndexedFunk[1]=Min[ListOfInts[1]]/.rulez
(*... and this should let me retrieve the values back on the primary instance of MMA:*)
SetSharedFunction[IndexedFunk]
(*Now, here is the mysterious part: this just sits there on my multiprocessor machine for many minutes until suddenly a result appears. If I up maxIndex to say 99999 (and of course re-execute the above code again) then the effect can more clearly be seen.*)
AbsoluteTiming[Short[ParallelMap[(IndexedFunk[#]=Min[ListOfInts[#]/.rulez])&, Range[maxIndex]]]]
I believe this is a bug, but I am still trying to figure out Mathematica's parallel functionality, so I can't be too confident in this conclusion. Despite being depressingly slow, it is nonetheless impressive in its ability to perform calculations without apparently requiring a CPU to do so.
I thought perhaps it was due to whatever communications protocol is used between the master and slave processes: perhaps it is so slow that the processors merely appear to be doing nothing when in fact they are waiting to send the next bit of some definition or other. In that case I thought ParallelMap[..., Method->"CoarsestGrained"] would be of some use. But no, that doesn't work either.
A question: "Am I doing something obviously wrong, or is this a bug?"
I am afraid you are. The problem is with the shared definition of the variable. Mathematica maintains a single coherent value of a shared variable across all kernels, and therefore that variable becomes a single point of huge contention. The CPU is idle because the kernels line up in a queue waiting for IndexedFunk, and most of the time is spent in interprocess or inter-machine communication. Go figure.
By the way, there is no function SetSharedDefinition in any Mathematica version I know of. You probably intended to write SetSharedVariable. But remove that evil call anyway! To avoid contention, return results from the parallelized computation as a list of pairs, and then assemble them into downvalues of your variable at the main kernel:
Clear[IndexedFunk]
Scan[(IndexedFunk[#[[1]]] = #[[2]]) &,
  ParallelMap[{#, Min[ListOfInts[#] /. rulez]} &, Range[maxIndex]]
]
ParallelMap takes care of distributing definitions automagically, so the call to DistributeDefinitions is superfluous. (As a minor note, that call is not correct as written, since it omits the maxIndex variable, but the omission is automatically taken care of by ParallelMap in this particular case.)
EDIT, NB: the automatic distribution of definitions applies only to version 8 of Mathematica. Thanks @MikeHoneychurch for the correction.
Is the following construct thread-safe, assuming that the elements of foo are aligned and sized properly so that there is no word tearing? If not, why not?
Note: The code below is a toy example of what I want to do, not my actual real world scenario. Obviously, there are better ways of coding the observable behavior in my example.
uint[] foo;
// Fill foo with data.

// In thread one:
for (uint i = 0; i < foo.length; i++) {
    if (foo[i] < SOME_NUMBER) {
        foo[i] = MAGIC_VAL;
    }
}

// In thread two:
for (uint i = 0; i < foo.length; i++) {
    if (foo[i] < SOME_OTHER_NUMBER) {
        foo[i] = MAGIC_VAL;
    }
}
This obviously looks unsafe at first glance, so I'll highlight why I think it could be safe:
The only two options are for an element of foo to be unchanged or to be set to MAGIC_VAL.
If thread two sees foo[i] in an intermediate state while it's being updated, only two things can happen: The intermediate state is < SOME_OTHER_NUMBER or it's not. If it is < SOME_OTHER_NUMBER, thread two will also try to set it to MAGIC_VAL. If not, thread two will do nothing.
Edit: Also, what if foo is a long or a double or something, so that updating it can't be done atomically? You may still assume that alignment, etc. is such that updating one element of foo will not affect any other element. Also, the whole point of multithreading in this case is performance, so any type of locking would defeat this.
On a modern multicore processor your code is NOT thread-safe (at least in most languages) without a memory barrier. Simply put, without explicit barriers each thread can see an entirely different copy of foo from its caches.
Say your two threads run at some point in time, and at some later point a third thread reads foo. That third thread could see a foo that is completely uninitialized, or the foo of either of the other two threads, or some mix of both, depending on what has happened with CPU memory caching.
My advice: don't try to be "smart" about concurrency; always try to be "safe". Smart will bite you every time. The broken double-checked locking article has some eye-opening insights into what can happen with memory access and instruction reordering in the absence of memory barriers (though it is specifically about Java and its (changing) memory model, it's insightful for any language).
You have to be really on top of your language's specified memory model to shortcut barriers. For example, Java allows a variable to be tagged volatile, which combined with a type which is documented as having atomic assignment, can allow unsynchronized assignment and fetch by forcing them through to main memory (so the thread is not observing/updating cached copies).
You can do this safely and locklessly with a compare-and-swap operation. What you've got looks thread safe but the compiler might create a writeback of the unchanged value under some circumstances, which will cause one thread to step on the other.
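A sketch of that compare-and-swap approach, written in Go rather than the question's language (the markIfBelow name and the uint32 element type are assumptions):

import "sync/atomic"

// markIfBelow sets foo[i] to magic only if its current value is below limit,
// retrying when another thread changes the element between the load and the CAS.
func markIfBelow(foo []uint32, i int, limit, magic uint32) {
    for {
        old := atomic.LoadUint32(&foo[i])
        if old >= limit {
            return
        }
        if atomic.CompareAndSwapUint32(&foo[i], old, magic) {
            return // no concurrent write was silently overwritten
        }
    }
}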
Also you're probably not getting as much performance as you think out of doing this, because having both threads writing to the same contiguous memory like this will cause a storm of MESI transitions inside the CPU's cache, each of which is quite slow. For more details on multithread memory coherence you can look at section 3.3.4 of Ulrich Drepper's "What Every Programmer Should Know About Memory".
If reads and writes to each array element are atomic (i.e. they're aligned properly with no word tearing as you mentioned), then there shouldn't be any problems in this code. If foo[i] is less than either of SOME_NUMBER or SOME_OTHER_NUMBER, then at least one thread (possibly both) will set it to MAGIC_VAL at some point; otherwise, it will be untouched. With atomic reads and writes, there are no other possibilities.
However, since your situation is more complicated, be very, very careful: make sure that foo[i] is truly read only once per iteration and stored in a local variable. If you read it more than once during the same iteration, you could get inconsistent results. Even the slightest change to this code could immediately make it unsafe with race conditions, so comment it heavily with big red warning signs.
It's bad practice; you should never be in a state where two threads access the same variable at the same time, regardless of the consequences. The example you give is oversimplified; any moderately complex example will almost always have problems associated with it.
Remember: Semaphores are your friend!
That particular example is thread-safe.
There are no intermediate states really involved here.
That particular program would not get confused.
I would suggest a Mutex on the array, though.