Thread safety of simultaneous updates of a variable to the same value - performance

Is the following construct thread-safe, assuming that the elements of foo are aligned and sized properly so that there is no word tearing? If not, why not?
Note: The code below is a toy example of what I want to do, not my actual real world scenario. Obviously, there are better ways of coding the observable behavior in my example.
uint[] foo;
// Fill foo with data.
// In thread one:
for (uint i = 0; i < foo.length; i++) {
    if (foo[i] < SOME_NUMBER) {
        foo[i] = MAGIC_VAL;
    }
}
// In thread two:
for (uint i = 0; i < foo.length; i++) {
    if (foo[i] < SOME_OTHER_NUMBER) {
        foo[i] = MAGIC_VAL;
    }
}
This obviously looks unsafe at first glance, so I'll highlight why I think it could be safe:
The only two options are for an element of foo to be unchanged or to be set to MAGIC_VAL.
If thread two sees foo[i] in an intermediate state while it's being updated, only two things can happen: The intermediate state is < SOME_OTHER_NUMBER or it's not. If it is < SOME_OTHER_NUMBER, thread two will also try to set it to MAGIC_VAL. If not, thread two will do nothing.
Edit: Also, what if foo is a long or a double or something, so that updating it can't be done atomically? You may still assume that alignment, etc. is such that updating one element of foo will not affect any other element. Also, the whole point of multithreading in this case is performance, so any type of locking would defeat this.

On a modern multicore processor your code is NOT thread-safe (at least in most languages) without a memory barrier. Simply put, without explicit barriers each thread can see an entirely different copy of foo in its caches.
Say your two threads run at some point in time, and at some later point a third thread reads foo: it could see a foo that is completely uninitialized, the foo of either of the other two threads, or some mix of both, depending on what has happened with CPU memory caching.
My advice - don't try to be "smart" about concurrency, always try to be "safe". Smart will bite you every time. The broken double-checked locking article has some eye-opening insights into what can happen with memory access and instruction reordering in the absence of memory barriers (though it is specifically about Java and its (changing) memory model, it's insightful for any language).
You have to be really on top of your language's specified memory model to shortcut barriers. For example, Java allows a variable to be tagged volatile, which combined with a type which is documented as having atomic assignment, can allow unsynchronized assignment and fetch by forcing them through to main memory (so the thread is not observing/updating cached copies).
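For illustration only (this is my sketch, not part of the original answer), here is the same loop in C++ with std::atomic elements; SOME_NUMBER and MAGIC_VAL are the question's placeholders:
#include <atomic>
#include <vector>

constexpr unsigned SOME_NUMBER = 100;   // placeholder threshold from the question
constexpr unsigned MAGIC_VAL = 0xDEAD;  // placeholder sentinel from the question

// Every load and store of a std::atomic element is indivisible and, by
// default, sequentially consistent, so no thread can observe a torn value
// and all threads agree on a single order of the stores.
void mark(std::vector<std::atomic<unsigned>>& foo, unsigned threshold) {
    for (auto& elem : foo) {
        if (elem.load() < threshold) {
            elem.store(MAGIC_VAL);
        }
    }
}
Thread one would call mark(foo, SOME_NUMBER) and thread two mark(foo, SOME_OTHER_NUMBER), as in the question.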

You can do this safely and locklessly with a compare-and-swap operation. What you've got looks thread safe but the compiler might create a writeback of the unchanged value under some circumstances, which will cause one thread to step on the other.
Also you're probably not getting as much performance as you think out of doing this, because having both threads writing to the same contiguous memory like this will cause a storm of MESI transitions inside the CPU's cache, each of which is quite slow. For more details on multithread memory coherence you can look at section 3.3.4 of Ulrich Drepper's "What Every Programmer Should Know About Memory".
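A minimal sketch of that compare-and-swap approach, assuming C++ and std::atomic (my illustration; threshold and magic stand in for the question's constants):
#include <atomic>

// Writes magic only if the element still holds the value we just observed,
// so a spurious writeback of an unchanged value cannot occur.
bool mark_if_below(std::atomic<unsigned>& elem, unsigned threshold, unsigned magic) {
    unsigned observed = elem.load();
    while (observed < threshold) {
        if (elem.compare_exchange_weak(observed, magic)) {
            return true;  // we won the race and stored magic
        }
        // On failure, observed is refreshed with the current value; retry.
    }
    return false;  // the value was already at or above the threshold
}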

If reads and writes to each array element are atomic (i.e. they're aligned properly with no word tearing as you mentioned), then there shouldn't be any problems in this code. If foo[i] is less than either of SOME_NUMBER or SOME_OTHER_NUMBER, then at least one thread (possibly both) will set it to MAGIC_VAL at some point; otherwise, it will be untouched. With atomic reads and writes, there are no other possibilities.
However, since your situation is more complicated, be very, very careful -- make sure that foo[i] is truly read only once per iteration and stored in a local variable. If you read it more than once during the same iteration, you could get inconsistent results. Even the slightest change you make to this code could immediately make it unsafe with race conditions, so comment heavily in the code with big red warning signs.
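A sketch of that read-once discipline (my addition, using the question's placeholder names as parameters):
#include <cstddef>

void mark_read_once(unsigned* foo, std::size_t len, unsigned threshold, unsigned magic) {
    for (std::size_t i = 0; i < len; i++) {
        const unsigned value = foo[i];  // the single read, captured in a local
        if (value < threshold) {        // every later decision uses the local copy,
            foo[i] = magic;             // never a second read of foo[i]
        }
    }
}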

It's bad practice; you should never be in a state where two threads are accessing the same variable at the same time, regardless of the consequences. The example you give is oversimplified; any moderately complex example will almost always have problems associated with it.
Remember: Semaphores are your friend!

That particular example is thread-safe.
There are no intermediate states really involved here.
That particular program would not get confused.
I would suggest a Mutex on the array, though.
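A sketch of that suggestion (my addition; note the asker ruled out locking for performance reasons, so this trades the parallel speedup for simplicity):
#include <cstddef>
#include <mutex>

std::mutex foo_mutex;  // guards every element of foo

void mark_locked(unsigned* foo, std::size_t len, unsigned threshold, unsigned magic) {
    std::lock_guard<std::mutex> guard(foo_mutex);  // one thread at a time
    for (std::size_t i = 0; i < len; i++) {
        if (foo[i] < threshold) {
            foo[i] = magic;
        }
    }
}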

Related

Is it safe to read a function pointer concurrently without a lock?

Suppose I have this:
go func() {
    for range time.Tick(1 * time.Millisecond) {
        a, b = b, a
    }
}()
And elsewhere:
i := a // <-- Is this safe?
For this question, it's unimportant what the value of i is with respect to the original a or b. The only question is whether reading a is safe. That is, is it possible for a to be nil, partially assigned, invalid, undefined, ... anything other than a valid value?
I've tried to make it fail but so far it always succeeds (on my Mac).
I haven't been able to find anything specific beyond this quote in The Go Memory Model doc:
Reads and writes of values larger than a single machine word behave as
multiple machine-word-sized operations in an unspecified order.
Is this implying that a single machine word write is effectively atomic? And, if so, are function pointer writes in Go a single machine word operation?
Update: Here's a properly synchronized solution
Unsynchronized, concurrent access to any variable from multiple goroutines where at least one of them is a write is undefined behavior by The Go Memory Model.
Undefined means what it says: undefined. It may be that your program will work correctly, or it may work incorrectly. It may result in losing the memory and type safety provided by the Go runtime (see the example below). It may even crash your program. Or it may even cause the Earth to explode (the probability of that is extremely small, maybe even less than 1e-40, but still...).
This undefined in your case means that yes, i may be nil, partially assigned, invalid, undefined, ... anything other than either a or b. This list is just a tiny subset of all the possible outcomes.
Stop thinking that some data races are (or may be) benign or harmless. They can be the source of the worst bugs if left unattended.
Since your code writes the variable a in one goroutine and reads it in another goroutine (which tries to assign its value to another variable i), it's a data race and as such it's not safe. It doesn't matter if it works "correctly" in your tests. Someone could take your code as a starting point, extend and build on it, and end up with a catastrophe due to the initially "harmless" data race.
As related questions, read How safe are Golang maps for concurrent Read/Write operations? and Incorrect synchronization in go lang.
I strongly recommend reading the blog post by Dmitry Vyukov: Benign data races: what could possibly go wrong?
Also very interesting is a blog post which shows an example that breaks Go's memory safety with an intentional data race: Golang data races to break memory safety.
In terms of race conditions, it's not safe. In short, a race condition occurs when more than one asynchronous routine (coroutine, thread, process, goroutine, etc.) tries to access the same resource and at least one access is a write. In your example, two goroutines read and write variables of function type; what matters from a concurrency point of view is that those variables occupy memory somewhere and both goroutines are reading or writing that portion of memory.
Short answer: just run your example using the -race flag with go run -race or go build -race and you'll see a detected data race.
The answer to your question, as of today, is that if a and b are not larger than a machine word, i must be equal to a or b. Otherwise, it may contain an unspecified value, most likely an interleaving of different parts of a and b.
The Go memory model, as of the version of June 6, 2022, guarantees this even for programs containing data races: a memory access of a location not larger than a machine word behaves atomically. It states:
Otherwise, a read r of a memory location x that is not larger than a machine word must observe some write w such that r does not happen before w and there is no write w' such that w happens before w' and w' happens before r. That is, each read must observe a value written by a preceding or concurrent write.
The happens-before relation here is defined in the previous section of the memory model.
The result of a racy read from a larger memory location is unspecified, but it is definitely not undefined as in the realm of C++.
Reads of memory locations larger than a single machine word are encouraged but not required to meet the same semantics as word-sized memory locations, observing a single allowed write w. For performance reasons, implementations may instead treat larger operations as a set of individual machine-word-sized operations in an unspecified order. This means that races on multiword data structures can lead to inconsistent values not corresponding to a single write. When the values depend on the consistency of internal (pointer, length) or (pointer, type) pairs, as can be the case for interface values, maps, slices, and strings in most Go implementations, such races can in turn lead to arbitrary memory corruption.

are nonatomic and atomic thread unsafe in objective c?

I read that nonatomic and atomic are both thread-unsafe, but that nonatomic is faster because it allows faster, asynchronous access, while atomic is slower because it allows only slower, synchronized access.
An atomic property in Objective C guarantees that you will never see partial writes.
That is, if two threads concurrently write values A and B to the same variable X, then a concurrent read on that same variable will either yield the initial value of X, or A or B. With nonatomic that guarantee is no longer given. You may get any value, including values that you never explicitly wrote to that variable.
The reason for this is that with nonatomic, the reading thread may read the variable while another thread is in the middle of writing it. So part of what you read comes from the old value while another part comes from the new value.
The comment about them both being thread-unsafe refers to the fact that no additional guarantees are given beyond that. Apple's docs give the following example here:
Consider an XYZPerson object in which both a person's first and last names are changed using atomic accessors from one thread. If another thread accesses both names at the same time, the atomic getter methods will return complete strings (without crashing), but there's no guarantee that those values will be the right names relative to each other. If the first name is accessed before the change, but the last name is accessed after the change, you'll end up with an inconsistent, mismatched pair of names.
A purist might argue that this definition of thread-safety is overly strict. Technically speaking, atomic already takes care of data races and ordering, which is all you need from a language designer's point of view.
From an application-logic point of view, on the other hand, the aforementioned first-name/last-name example clearly constitutes a bug. Additional synchronization is required to get rid of the undesired behavior. In this application-specific view the class XYZPerson is not thread-safe. But here we are talking about a different level of thread-safety than the one the language designer has in mind.
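The same distinction can be shown outside Objective-C. Here is a C++ sketch (my addition, with integer ids standing in for the name strings) in which each field is individually atomic, yet the pair can be observed in a mixed state:
#include <atomic>
#include <utility>

struct Person {
    std::atomic<int> firstNameId{0};  // stand-in for the first name
    std::atomic<int> lastNameId{0};   // stand-in for the last name

    void rename(int first, int last) {
        firstNameId.store(first);
        // A reader running exactly here sees the new first name with the
        // old last name: atomic per field, inconsistent as a pair.
        lastNameId.store(last);
    }

    std::pair<int, int> read() const {
        return {firstNameId.load(), lastNameId.load()};
    }
};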

Or-equals on constant as reduction operation (ex. value |= 1 ) thread-safe?

Let's say that I have a variable x.
x = 0
I then spawn some number of threads, and each of them may or may not run the following expression WITHOUT the use of atomics.
x |= 1
After all threads have joined with my main thread, the main thread branches on the value.
if(x) { ... } else { ... }
Is it possible for there to be a race condition in this situation? My thoughts say no, because it doesn't seem to matter whether or not a thread is interrupted by another thread between reading and writing 'x' (in both cases, either 'x == 1', or 'x == 1'). That said, I want to make sure I'm not missing something stupidly obvious or ridiculously subtle.
Also, if you happen to provide an answer to the contrary, please provide an instruction-by-instruction example!
Context:
I'm trying to, in OpenCL, have my threads indicate the presence or absence of a feature among any of their work-items. If any of the threads indicate the presence of the feature, my host ought to be able to branch on the result. I'm thinking of using the above method. If you guys have a better suggestion, that works too!
Detail:
I'm trying to add early-exit to my OpenCL radix-sort implementation, to skip radix passes if the data is banded (i.e. 'x' above would be x[RADIX] and I'd have all work groups, right after partial reduction of the data, indicate presence or absence of elements in the RADIX bins via 'x').
It may work within a work-group. You will need to insert a barrier before testing x. I'm not sure it will be faster than using atomic increments.
It will not work across several work-groups. Imagine you have 1000 work-groups to run on 20 cores. Typically, only a small number of work-groups can be resident on a single core, for example 4, meaning only 80 work-groups can be in flight inside the GPU at a given time. Once a work-group is done executing, it is retired, and another one is started. Halting a kernel in the middle of execution to wait for all 1000 work-groups to reach the same point is impossible.
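For comparison, outside OpenCL the same flag pattern can be made race-free with an atomic fetch-or. A C++ sketch (my addition, not the asker's OpenCL code):
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<unsigned> x{0};

    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t) {
        threads.emplace_back([&x] {
            // fetch_or performs the read-modify-write as one indivisible
            // step, so no thread's update can be lost.
            x.fetch_or(1u);
        });
    }
    for (auto& th : threads) th.join();

    // join() synchronizes the main thread with all the writers, so this
    // branch on x matches the question's "branch after all threads join".
    if (x.load()) {
        std::puts("feature present");
    } else {
        std::puts("feature absent");
    }
}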

What is atomic in boost / C++0x / C++1x / computer sciences?

What is atomic in C/C++ programming?
I just visited the dear cppreference.com (well, I don't take the title for granted, but wait for my story to finish), and the home page changed to describe some of the new C++0x/C++1x features (let's call it C+++, okay?).
There was something mysterious there, never before seen by my zombie programmer's eye: the new <atomic>.
I guess its purpose is not to program atomic bombs or black holes (but I highly doubt this could have ANY connection with black holes; I don't know how those two words slipped in here), but I'd like to know:
What is the purpose of this feature? Is it a type? A function? Is it a data container? Is it related to threads? Might it have some relation to Python's "import antigravity"? I mean, we are programming here, we're not bloody physicists or semanticists!
Atomic refers to something which is not divisible.
An atomic expression is one that is actually executed by a single operation.
For example, a++ is not atomic: to execute it, you must first read the value of a, then add 1 to it, then store the result back into a.
Reading the value of an int, by contrast, should be atomic.
Atomicity is important in shared-memory parallel computations (e.g. when using threads), because it tells you that an expression will give you the result you're expecting no matter what the other threads are doing.
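A minimal sketch of what the new <atomic> header gives you, assuming a C++11 compiler (my addition):
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> counter{0};  // an atomic type from the new <atomic> header

void work() {
    for (int i = 0; i < 100000; ++i) {
        ++counter;  // one indivisible read-modify-write, unlike ++ on a plain int
    }
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::cout << counter << '\n';  // always 200000; with a plain int, increments could be lost
}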
AFAIK you could use atomic functions to create your own semaphores etc. The name atomic comes from atom: you can't break it into anything smaller, so those function calls can't be "broken apart" and paused by the operating system. This is for thread programming.
It is intended for multithreading. It prevents concurrent threads from interleaving their operations. An atomic operation is an indivisible operation: you can't observe such an operation half-done from any thread in the system; it's either done or not done. With an atomic operation you cannot get a data race between threads. As a real-world analogy, think not of physics but of semaphores and other traffic signals on roads: cars are threads, roads carry the rules, locations are the data, and the semaphores are atomic. You don't need semaphores when there is only one car on all the roads, right?

Performance difference between iterating once and iterating twice?

Consider something like...
for (int i = 0; i < test.size(); ++i) {
    test[i].foo();
    test[i].bar();
}
Now consider..
for (int i = 0; i < test.size(); ++i) {
    test[i].foo();
}
for (int i = 0; i < test.size(); ++i) {
    test[i].bar();
}
Is there a large difference in time spent between these two? I.e. what is the cost of the actual iteration? It seems like the only real operations you are repeating are an increment and a comparison (though I suppose this would become significant for a very large n). Am I missing something?
First, as noted above, if your compiler can't optimize the size() method so that it's called just once, or reduced to nothing more than a single read (no function-call overhead), then it will hurt.
There is a second effect you may want to be concerned with, though. If your container size is large enough, then the first case will perform faster. This is because, when it gets to test[i].bar(), test[i] will be cached. The second case, with split loops, will thrash the cache, since test[i] will always need to be reloaded from main memory for each function.
Worse, if your container (std::vector, I'm guessing) has so many items that it won't all fit in memory, and some of it has to live in swap on your disk, then the difference will be huge as you have to load things in from disk twice.
However, there is one final thing that you have to consider: all this only makes a difference if there is no order dependency between the function calls (really, between different objects in the container). Because, if you work it out, the first case does:
test[0].foo();
test[0].bar();
test[1].foo();
test[1].bar();
test[2].foo();
test[2].bar();
// ...
test[test.size()-1].foo();
test[test.size()-1].bar();
while the second does:
test[0].foo();
test[1].foo();
test[2].foo();
// ...
test[test.size()-1].foo();
test[0].bar();
test[1].bar();
test[2].bar();
// ...
test[test.size()-1].bar();
So if your bar() assumes that all foo()'s have run, you will break it if you change the second case to the first. Likewise, if bar() assumes that foo() has not been run on later objects, then moving from the second case to the first will break your code.
So be careful and document what you do.
There are many aspects to such a comparison.
First, the complexity of both options is O(n), so the difference isn't very big anyway. You need not care about it if you are writing a fairly big, complex program with a large n and "heavy" operations foo() and bar(). You should care about it only for very small, simple programs (the kind written for embedded devices, for example).
Second, it will depend on the programming language and compiler. I'm fairly sure that, for instance, most C++ compilers will optimize your second option to produce the same code as the first one.
Third, if the compiler hasn't optimized your code, the performance difference will depend heavily on the target processor. Consider the loop in terms of assembly instructions - it will look something like this (pseudo-assembly language):
LABEL L1:
    do this        ;; some instructions
    call that
    IF condition
        goto L1
    ;; some more instructions (the ELSE part)
I.e. every loop pass ends in an IF statement. But modern processors don't like IFs. Processors may rearrange instructions to execute them ahead of time or simply to avoid idling; with an IF (in fact, a conditional goto or jump) instruction, the processor does not know whether it may rearrange operations or not.
There's also a mechanism called a branch predictor. From Wikipedia:
a branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure.
This softens the effect of IFs, though if the predictor's guess is wrong, no optimization is gained.
So, you can see that there is a great number of conditions affecting both your options: target language and compiler, target machine, its processor, and its branch predictor. All this makes for a very complex system, and you cannot foresee exactly what result you will get. I believe that if you aren't dealing with embedded systems or something like that, the best solution is just to use the form you are more comfortable with.
For your examples you have the additional concern of how expensive .size() is, since in most languages it is re-evaluated for the comparison each time i increments.
How expensive is it? Well that depends, it's certainly all relative. If .foo() and .bar() are expensive, the cost of the actual iteration is probably minuscule in comparison. If they're pretty lightweight, then it'll be a larger percentage of your execution time. If you want to know about a particular case test it, this is the only way to be sure about your specific scenario.
Personally, I'd go with the single iteration to be on the cheap side for sure (unless you need the .foo() calls to happen before the .bar() calls).
I assume .size() will be constant. Otherwise, the first code example might not give the same result as the second one.
Most compilers would probably store .size() in a variable before the loop starts, so the .size() time will be cut down.
Therefore the time spent on the work inside the two for loops will be the same, but the loop overhead will be twice as much.
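A sketch of that hoisting (my addition; Container stands for any std::vector-like type holding the question's objects):
#include <cstddef>

// Hoist the size lookup so it is evaluated once rather than per iteration.
template <typename Container>
void process(Container& test) {
    const std::size_t n = test.size();
    for (std::size_t i = 0; i < n; ++i) {
        test[i].foo();
        test[i].bar();
    }
}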
Performance tag, right.
As long as you are concentrating on the "cost" of this or that minor code segment, you are oblivious to the bigger picture (isolation); and your intention is to justify something that, at a higher level (outside your isolated context), is simply bad practice and breaks guidelines. The question is too low-level and therefore too isolated. A system or program which is a set of integrated components will perform much better than a collection of isolated components.
The fact that this or that isolated component (work inside the loop) is fast or faster is irrelevant when the loop itself is repeated unnecessarily, and which would therefore take twice the time.
Given that you have one family car (CPU), why on Earth would you:
sit at home and send your wife out to do her shopping
wait until she returns
take the car, go out and do your shopping
leave her to wait until you return
If it needs to be stated: by taking one trip and doing the shopping together, you would (a) spend almost half of your hard-earned resources and (b) have those resources available to have fun together when you get home.
It has nothing to do with the price of petrol at 9:00 on a Saturday, or the time it takes to grind coffee at the café, or the cost of each iteration.
Yes, there is a large difference in the time and resources used. But the cost is not merely in the overhead per iteration; it is in the overall cost of the one organised trip vs the two serial trips.
Performance is about architecture: never doing anything twice (that you can do once). That is the higher level of organisation, the integration of the parts that make up the whole. It is not about counting pennies at the bowser or cycles per iteration; those are lower orders of organisation, which merely adjust a collection of fragmented parts (not a systemic whole).
Maseratis cannot get through traffic jams any faster than station wagons.
