How to understand acquire and release semantics? - windows

I found three functions on MSDN:
1. InterlockedDecrement()
2. InterlockedDecrementAcquire()
3. InterlockedDecrementRelease()
I know these functions decrement a value as an atomic operation, but I don't know the difference between the three.

I'll take a stab at that.
Something to remember is that the compiler, or the CPU itself, may reorder memory reads and writes if they appear to not deal with each other.
This is useful, for instance, if you have some code that is updating a structure:
if ( playerMoved ) {
    playerPos.X += dx;
    playerPos.Y += dy;
    // Keep the player above the world's surface.
    if ( playerPos.Z + dz > 0 ) {
        playerPos.Z += dz;
    }
    else {
        playerPos.Z = 0;
    }
}
Most of the above statements may be reordered because there's no data dependency between them. In fact, a superscalar CPU may execute most of those statements simultaneously, or may start working on the Z section sooner, since it doesn't affect X or Y but might take longer.
Here's the problem with that: let's say that you're attempting lock-free programming. You want to perform a whole bunch of memory writes to, say, fill in a shared queue. You signal that you're done by finally writing to a flag.
Well, since that flag appears to have nothing to do with the rest of the work being done, the compiler and the CPU may reorder those instructions, and now you may set your 'done' flag before you've actually committed the rest of the structure to memory, and now your "lock-free" queue doesn't work.
This is where acquire and release ordering semantics come into play. I signal that I'm starting work by setting a flag with acquire semantics, and the CPU guarantees that any memory operations I perform after that instruction actually stay after it. I signal that I'm done by setting a flag with release semantics, and the CPU guarantees that any memory operations I performed before the release actually stay before it.
Normally, one would do this using explicit locks - mutexes, semaphores, etc, in which the CPU already knows it has to pay attention to memory ordering. The point of attempting to create 'lock free' data structures is to provide data structures that are thread safe (for some meaning of thread safe), that don't use explicit locks (because they are very slow).
Creating lock-free data structures is possible on a CPU or compiler that doesn't support acquire/release ordering semantics, but it usually means that some slower memory ordering semantic is used. For instance, you could issue a full memory barrier - everything that came before this instruction has to actually be committed before this instruction, and everything that came after this instruction has to be committed actually after this instruction. But that might mean that I wait for a bunch of irrelevant memory writes from earlier in the instruction stream (perhaps a function call prologue) that have nothing to do with the memory safety I'm trying to implement.
Acquire says "only worry about stuff after me". Release says "only worry about stuff before me". Combining the two gets you most of the way to a full memory barrier; a true full barrier additionally stops an earlier store from being reordered with a later load.
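As a rough illustration of the "acquire when I start, release when I'm done" pattern described above, here is a minimal C++11 sketch of a spinlock. It uses std::atomic rather than the Interlocked* functions from the question, and the names SpinLock and run_spinlock_demo are made up for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical minimal spinlock; illustrative only, not production code.
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Acquire: accesses inside the critical section cannot move above this.
        while (locked.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Release: accesses inside the critical section cannot move below this.
        locked.store(false, std::memory_order_release);
    }
};

// Several threads increment a plain (non-atomic) counter under the lock.
long run_spinlock_demo(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;  // protected by the acquire/release pair
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling run_spinlock_demo(4, 10000) should return 40000, because every increment is bracketed by the acquire in lock() and the release in unlock().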

http://preshing.com/20120913/acquire-and-release-semantics/
Acquire semantics is a property which can only apply to operations
which read from shared memory, whether they are read-modify-write
operations or plain loads. The operation is then considered a
read-acquire. Acquire semantics prevent memory reordering of the
read-acquire with any read or write operation which follows it in
program order.
Release semantics is a property which can only apply to operations
which write to shared memory, whether they are read-modify-write
operations or plain stores. The operation is then considered a
write-release. Release semantics prevent memory reordering of the
write-release with any read or write operation which precedes it in
program order.
(um... but don't ask me what does it mean exactly)

Related

Why we do not use barriers in User space

I am reading about memory barriers and what I can summarize is that they prevent instruction re-ordering done by compilers.
So in user space, let's say I have
b = 0;
main(){
    a = 10;
    b = 20;
    c = add(a,b);
}
Can the compiler reorder this code so that the b = 20 assignment happens after c = add() is called?
Why do we not use barriers in this case? Am I missing something fundamental here?
Is virtual memory exempt from any reordering?
Extending the question further:
In a network driver:
/*
 * Writing to TxStatus triggers a DMA transfer of the data
 * copied to tp->tx_buf[entry] above. Use a memory barrier
 * to make sure that the device sees the updated data.
 */
wmb();
RTL_W32_F (TxStatus0 + (entry * sizeof (u32)),
           tp->tx_flag | max(len, (unsigned int)ETH_ZLEN));
When the comment says the device sees the updated data, how does this relate to the multi-threaded theory behind the use of barriers?
Short answer
Memory barriers are used less frequently in user mode code than kernel mode code because user mode code tends to use higher level abstractions (for example pthread synchronization operations).
Additional details
There are two things to consider when analyzing the possible ordering of operations:
1. What order the thread that is executing the code will see the operations in
2. What order other threads will see the operations in
In your example the compiler cannot reorder b=20 to occur after c=add(a,b) because the c=add(a,b) operation uses the results of b=20. However, it may be possible for the compiler to reorder these operations so that other threads see the memory location associated with c change before the memory location associated with b changes.
Whether this would actually happen or not depends on the memory consistency model that is implemented by the hardware.
As for when the compiler might do reordering you could imagine adding another variable as follows:
b = 0;
main(){
    a = 10;
    b = 20;
    d = 30;
    c = add(a,b);
}
In this case the compiler would be free to move the d=30 assignment to occur after c=add(a,b).
However, this entire example is too simplistic. The program doesn't do anything and the compiler can eliminate all the operations and does not need to write anything to memory.
Addendum: Memory reordering example
In a multiprocessor environment multiple threads can see memory operations occur in different orders. The Intel Software Developer's Manual has some examples in Volume 3, section 8.2.3, including one where loads and stores can be reordered.
There is also a good blog post that provides some more detail about this example.
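To make the kind of reordering discussed here concrete, the following is a small C++11 sketch of the classic "store buffering" test: each thread stores to one variable and then loads the other. The function name count_both_zero is invented for this sketch. With memory_order_seq_cst, as used below, both loads returning 0 in the same round is not allowed; with weaker orderings that outcome is permitted, and it can be observed even on x86.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the store-buffering test `rounds` times and returns how often both
// threads read 0. With seq_cst on every access the answer must be 0.
int count_both_zero(int rounds) {
    assert(rounds >= 0);
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread a([&] {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread b([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        a.join();
        b.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Switching the four accesses to memory_order_relaxed (or release/acquire) removes the guarantee, although how often the reordering actually shows up depends on the hardware and timing.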
The thread running the code will always act as if the effects of the source lines of its own code happened in program order. This "as-if" rule is what enables most compiler optimizations.
Within a single thread, out-of-order CPUs track dependencies to give a thread the illusion that all its instructions executed in program order. The globally-visible (to threads on other cores) effects may be seen out-of-order by other cores, though.
Memory barriers (as part of locking, or on their own) are only needed in code that interacts with other threads through shared memory.
Compilers can similarly do any reordering / hoisting they want, as long as the results are the same. The C++ memory model is very weak, so compile-time reordering is possible even when targeting an x86 CPU. (But of course not reordering that produces different results within the local thread.) C11 <stdatomic.h> and the equivalent C++11 std::atomic are the best way to tell the compiler about any ordering requirements you have for the global visibility of operations. On x86, this usually just results in putting store instructions in source order, but the default memory_order_seq_cst needs an MFENCE on each store to prevent StoreLoad reordering for full sequential consistency.
In kernel code, memory barriers are also common to make sure that stores to memory-mapped I/O registers happen in a required order. The reasoning is the same: to order the globally-visible effects on memory of a sequence of stores and loads. The difference is that the observer is an I/O device, not a thread on another CPU. The fact that cores interact with each other through a cache coherency protocol is irrelevant.
The compiler cannot reorder (nor can the runtime or the CPU) so that b=20 is after the c=add(), since that would change the semantics of the method and that is not permissible.
I would say that for the compiler (or runtime or CPU) to act as you describe would make the behaviour random, which would be a bad thing.
This restriction on reordering applies only within the thread executing the code. As @GabrielSouthern points out, the order in which the stores become globally visible is not guaranteed if a, b, and c are all global variables.

Why aren't out of order CPUs troublesome?

I've recently learned about out-of-order execution CPUs from this link: https://en.wikipedia.org/wiki/Out-of-order_execution
There is something that I can't quite understand. Why aren't these kinds of CPUs troublesome? I mean, if I have instructions executing out of order, even if they apply to different data, won't I be able to reach a situation where data is not updated according to the program order?
I mean, if I have something like:
x = 1;
y = 2;
x = x+y;
print x;
print y;
what prevents the "print y" instruction from being executed before the "print x"?
Maybe I'm getting something wrong about this kind of CPU. Can you explain it to me?
Thanks in advance
In an out-of-order processor, the instructions are executed out-of-order but committed in order. This means that externally visible state becomes visible in order. Writes to memory (including I/O operations) are buffered locally so that they can be released to the external memory system in order, similarly register results are stored locally using register renaming. This allows the local processor to use the early speculative values without violating externally visible ordering. On incorrect speculation, a rollback mechanism is used to restore state back to a previous known-valid state.
Note that technically commitment of results to processor-core-external state does not have to be in order as long as the out-of-order result is non-speculative and does not violate ordering guarantees. With weak memory consistency models, this could (in theory) allow values to become externally visible out-of-order. (I/O is required to be in-order, so the print example would still be required to commit in order.) Furthermore, if other cores know that values are speculative in nature or order, the values could be made externally visible out-of-order (again, in theory) and consumers of the values would rollback on incorrect speculation. (This is extending "externally visible" from external to the single processor core to external to some larger system component which is aware of speculation and supports rollback.)
(In very impractical theory, extending speculative realization to the human computer interface would be possible if the definition of the interface allowed such glitches (i.e., the human corrects for wrongly speculated values and ordering). However, since humans are even slower than most I/O devices (speculations would be resolved on a shorter time scale than is significant) and such an extension of speculation would be extremely complex and generally undesirable, it is unlikely ever to be broadly used.)
@paul-a-clayton already gave a great answer. I would like to mention one more thing.
Earlier out-of-order architectures were notoriously troublesome. The fundamental problem was that they couldn't guarantee precise interrupts. This changed in the late 80s/early 90s due to the solutions proposed in Smith & Pleszkun's seminal paper. They introduced the idea of a reorder buffer which allows instructions to issue out of order but commit in order.

Which std::sync::atomic::Ordering to use?

All the methods of std::sync::atomic::AtomicBool take a memory ordering (Relaxed, Release, Acquire, AcqRel, and SeqCst), which I have not used before. Under what circumstances should these values be used? The documentation uses confusing “load” and “store” terms which I don’t really understand. For example:
A producer thread mutates some state held by a Mutex, then calls AtomicBool::compare_and_swap(false, true, ordering) (to coalesce invalidations), and if it swapped, posts an “invalidate” message to a concurrent queue (e.g. mpsc or a winapi PostMessage). A consumer thread resets the AtomicBool, reads from the queue, and reads the state held by the Mutex. Can the producer use Relaxed ordering because it is preceded by a mutex, or must it use Release? Can the consumer use store(false, Relaxed), or must it use compare_and_swap(true, false, Acquire) to receive the changes from the mutex?
What if the producer and consumer share a RefCell instead of a Mutex?
I'm not an expert on this, and it's really complicated, so please feel free to critique my post. As pointed out by mdh.heydari, cppreference.com has much better documentation of orderings than Rust (C++ has an almost identical API).
For your question
You'd need to use "release" ordering in your producer and "acquire" ordering in your consumer. This ensures that the data mutation occurs before the AtomicBool is set to true.
If your queue is asynchronous, then the consumer will need to keep trying to read from it in a loop, since the producer could get interrupted between setting the AtomicBool and putting something in the queue.
If the producer code might run multiple times before client runs, then you can't use RefCell because they could mutate the data while the client is reading it. Otherwise it's fine.
There are other better and simpler ways to implement this pattern, but I assume you were just giving it as an example.
What are orderings?
The different orderings have to do with what another thread sees happen when an atomic operation occurs. Compilers and CPUs are normally both allowed to reorder instructions in order to optimize code, and the orderings affect how much they're allowed to reorder instructions.
You could just always use SeqCst, which basically guarantees everyone will see that instruction as having occurred wherever you put it relative to other instructions, but in some cases if you specify a less restrictive ordering then LLVM and the CPU can better optimize your code.
You should think of these orderings as applying to a memory location (instead of applying to an instruction).
Ordering Types
Relaxed Ordering
There are no constraints besides any modification to the memory location being atomic (so it either happens completely or not at all). This is fine for something like a counter if the values retrieved by/set by individual threads don't matter as long as they're atomic.
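A minimal sketch of the counter case, written in C++ since the answer notes the API is almost identical to Rust's. The function name count_events is hypothetical. Only atomicity matters here, so every access uses memory_order_relaxed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each thread bumps a shared counter `per_thread` times with relaxed ordering.
long count_events(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    std::atomic<long> hits{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&hits, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                // No ordering constraints, but the increment itself is atomic.
                hits.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    // join() synchronizes with each thread's completion, so the final load
    // below observes every increment even though it is relaxed.
    for (auto& th : pool) th.join();
    return hits.load(std::memory_order_relaxed);
}
```

No increments are lost, so count_events(4, 25000) returns 100000. What relaxed does not give you is any promise about the order in which other memory locations become visible.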
Acquire Ordering
This constraint says that any variable reads that occur in your code after "acquire" is applied can't be reordered to occur before it. So, say in your code you read some shared memory location and get value X, which was stored in that memory location at time T, and then you apply the "acquire" constraint. Any memory locations that you read from after applying the constraint will have the value they had at time T or later.
This is probably what most people would expect to happen intuitively, but because a CPU and optimizer are allowed to reorder instructions as long as they don't change the result, it isn't guaranteed.
In order for "acquire" to be useful, it has to be paired with "release", because otherwise there's no guarantee that the other thread didn't reorder its write instructions that were supposed to occur at time T to an earlier time.
Acquire-reading the flag value you're looking for means you won't see a stale value somewhere else that was actually changed by a write before the release-store to the flag.
Release Ordering
This constraint says that any variable writes that occur in your code before "release" is applied can't be reordered to occur after it. So, say in your code you write to a few shared memory locations and then set some memory location t at time T, and then you apply the "release" constraint. Any writes that appear in your code before "release" is applied are guaranteed to have occurred before it.
Again, this is what most people would expect to happen intuitively, but it isn't guaranteed without constraints.
If the other thread trying to read value X doesn't use "acquire", then it isn't guaranteed to see the new value with respect to changes in other variable values. So it could get the new value, but it might not see new values for any other shared variables. Also keep in mind that testing is hard. Some hardware won't in practice show re-ordering with some unsafe code, so problems can go undetected.
Jeff Preshing wrote a nice explanation of acquire and release semantics, so read that if this isn't clear.
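The release/acquire pairing described above can be sketched in C++ as follows. The names pass_message, payload and ready are invented for the example. The producer writes ordinary data and then release-stores a flag; the consumer acquire-loads the flag and only then reads the data.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hands `value` from a producer thread to a consumer thread through a plain
// int, using an atomic flag with release/acquire ordering to publish it.
int pass_message(int value) {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag that publishes the data
    int seen = -1;

    std::thread producer([&] {
        payload = value;                               // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {  // 3. wait for the flag
            std::this_thread::yield();
        }
        seen = payload;  // 4. the acquire guarantees this reads `value`
    });
    producer.join();
    consumer.join();
    assert(seen == value);
    return seen;
}
```

If either side used Relaxed instead, step 4 would be a data race on payload: the consumer could see the flag set and still read a stale value.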
AcqRel Ordering
This does both Acquire and Release ordering (i.e. both restrictions apply). I'm not sure when this is necessary - it might be helpful in situations with 3 or more threads if some use Release, some use Acquire, and some do both, but I'm not really sure.
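One place where AcqRel commonly appears is on read-modify-write operations such as a reference-count decrement. The sketch below is in C++, and Shared, release_ref and run_refcount_demo are hypothetical names. The release half makes each owner's earlier writes to the object visible, and the acquire half lets whichever thread drops the last reference see all of them before it frees the object.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct Shared {
    std::atomic<int> refs;
    int data;
    Shared(int owners, int value) : refs(owners), data(value) {}
};

// Drops one reference; returns true if this call destroyed the object.
bool release_ref(Shared* s, std::atomic<int>& destroyed) {
    // fetch_sub returns the previous count, so 1 means we were the last owner.
    if (s->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete s;
        destroyed.fetch_add(1, std::memory_order_relaxed);
        return true;
    }
    return false;
}

// `owners` threads each drop one reference; returns how many deletes happened.
int run_refcount_demo(int owners) {
    assert(owners > 0);
    std::atomic<int> destroyed{0};
    Shared* s = new Shared(owners, 7);
    std::vector<std::thread> pool;
    for (int t = 0; t < owners; ++t) {
        pool.emplace_back([s, &destroyed] { release_ref(s, destroyed); });
    }
    for (auto& th : pool) th.join();
    return destroyed.load();
}
```

However the threads interleave, exactly one of them sees the previous count of 1, so run_refcount_demo(8) returns 1.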
SeqCst Ordering
This is the most restrictive and, therefore, the slowest option. It forces memory accesses to appear to occur in one, identical order to every thread. This requires an MFENCE instruction on x86 on all writes to atomic variables (full memory barrier, including StoreLoad), while the weaker orderings don't. (SeqCst loads don't require a barrier on x86, as you can see in this C++ compiler output.)
Read-Modify-Write accesses, like atomic increment, or compare-and-swap, are done on x86 with locked instructions, which are already full memory barriers. If you care at all about compiling to efficient code on non-x86 targets, it makes sense to avoid SeqCst when you can, even for atomic read-modify-write ops. There are cases where it's needed, though.
For more examples of how atomic semantics turn into ASM, see this larger set of simple functions on C++ atomic variables. I know this is a Rust question, but it's supposed to have basically the same API as C++. godbolt can target x86, ARM, ARM64, and PowerPC. Interestingly, ARM64 has load-acquire (ldar) and store-release (stlr) instructions, so it doesn't always have to use separate barrier instructions.
By the way, x86 CPUs are always "strongly ordered" by default, which means they always act as if at least AcqRel mode was set. So for x86, orderings weaker than SeqCst only affect how LLVM's optimizer behaves. ARM, on the other hand, is weakly ordered. Relaxed is set by default, to allow the compiler full freedom to reorder things, and to not require extra barrier instructions on weakly-ordered CPUs.

Go atomic and memory order

I am porting a lock-free queue from C++11 to Go, and I came across things such as
auto currentRead = writeIndex.load(std::memory_order_relaxed);
and in some cases std::memory_order_release and std::memory_order_acquire.
The equivalent of the above in C11 is something like
unsigned long currentRead = atomic_load_explicit(&q->writeIndex,memory_order_relaxed);
The meaning of those is described here.
Is there an equivalent to such a thing in Go, or do I just use something like
var currentRead uint64 = atomic.LoadUint64(&q.writeIndex)
After porting, I benchmarked it, and just using LoadUint64 seems to work as expected but is orders of magnitude slower. I wonder how much effect those specialized ops have on performance.
Further info from the link I attached:
memory_order_relaxed: Relaxed operation: there are no synchronization or ordering constraints, only atomicity is required of this operation.
memory_order_consume: A load operation with this memory order performs a consume operation on the affected memory location: no reads in the current thread dependent on the value currently loaded can be reordered before this load. This ensures that writes to data-dependent variables in other threads that release the same atomic variable are visible in the current thread. On most platforms, this affects compiler optimizations only.
memory_order_acquire: A load operation with this memory order performs the acquire operation on the affected memory location: no memory accesses in the current thread can be reordered before this load. This ensures that all writes in other threads that release the same atomic variable are visible in the current thread.
memory_order_release: A store operation with this memory order performs the release operation: no memory accesses in the current thread can be reordered after this store. This ensures that all writes in the current thread are visible in other threads that acquire the same atomic variable, and writes that carry a dependency into the atomic variable become visible in other threads that consume the same atomic.
You need to read The Go Memory Model
You'll discover that Go has nothing like the control that you have in C++ - there isn't a direct translation of the C++ features in your post. This is a deliberate design decision by the Go authors - the Go motto is Do not communicate by sharing memory; instead, share memory by communicating.
Assuming that the standard go channel isn't good enough for what you want to do, you'll have 2 choices for each memory access, using the facilities in sync/atomic or not, and whether you need to use them or not will depend on a careful reading of the Go Memory Model and analysis of your code which only you can do.
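For reference when doing that analysis, here is a sketch of where the explicit orderings typically sit in a single-producer/single-consumer ring buffer on the C++ side. This is not the asker's queue; SpscQueue, transfer_sum and the member names are hypothetical. In a Go port each of these loads and stores would become a plain sync/atomic call, since that package takes no ordering argument.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical bounded queue for exactly one producer and one consumer.
template <typename T, std::size_t N>
class SpscQueue {
    T slots[N];
    std::atomic<std::size_t> readIndex{0};   // written only by the consumer
    std::atomic<std::size_t> writeIndex{0};  // written only by the producer
public:
    bool push(const T& v) {
        // Relaxed is enough for the index this thread itself owns.
        std::size_t w = writeIndex.load(std::memory_order_relaxed);
        std::size_t next = (w + 1) % N;
        if (next == readIndex.load(std::memory_order_acquire)) return false;  // full
        slots[w] = v;
        writeIndex.store(next, std::memory_order_release);  // publish the slot
        return true;
    }
    bool pop(T& out) {
        std::size_t r = readIndex.load(std::memory_order_relaxed);
        if (r == writeIndex.load(std::memory_order_acquire)) return false;  // empty
        out = slots[r];
        readIndex.store((r + 1) % N, std::memory_order_release);  // free the slot
        return true;
    }
};

// Sends 1..count through the queue and returns the sum the consumer saw.
long transfer_sum(int count) {
    assert(count >= 0);
    SpscQueue<int, 64> q;
    long sum = 0;
    std::thread producer([&] {
        for (int i = 1; i <= count; ++i) {
            while (!q.push(i)) std::this_thread::yield();
        }
    });
    std::thread consumer([&] {
        int v = 0;
        for (int i = 0; i < count; ++i) {
            while (!q.pop(v)) std::this_thread::yield();
            sum += v;
        }
    });
    producer.join();
    consumer.join();
    return sum;
}
```

The relaxed loads are the ones on the index that the calling thread itself owns; the acquire/release pairs are what hand a slot from one thread to the other.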

Atomicity, Volatility and Thread Safety in Windows

It's my understanding of atomicity that it's used to make sure a value will be read/written in whole rather than in parts. For example, a 64-bit value that is really two 32-bit DWORDs (assume x86 here) must be atomic when shared between threads so that both DWORDs are read/written at the same time. That way one thread can't read half variable that's not updated. How do you guarantee atomicity?
Furthermore it's my understanding that volatility does not guarantee thread safety at all. Is that true?
I've seen it implied many places that simply being atomic/volatile is thread-safe. I don't see how that is. Won't I need a memory barrier as well to ensure that any values, atomic or otherwise, are read/written before they can actually be guaranteed to be read/written in the other thread?
So for example let's say I create a thread suspended, do some calculations to change some values to a struct available to the thread and then resume, for example:
HANDLE hThread = CreateThread(NULL, 0, thread_entry, (void *)&data, CREATE_SUSPENDED, NULL);
data->val64 = SomeCalculation();
ResumeThread(hThread);
I suppose this would depend on any memory barriers in ResumeThread? Should I do an interlocked exchange for val64? What if the thread were running, how does that change things?
I'm sure I'm asking a lot here but basically what I'm trying to figure out is what I asked in the title: a good explanation for atomicity, volatility and thread safety in Windows. Thanks
it's used to make sure a value will be read/written in whole
That's just a small part of atomicity. At its core it means "uninterruptible", an instruction on a processor whose side-effects cannot be interleaved with another instruction. By design, a memory update is atomic when it can be executed with a single memory-bus cycle. Which requires the address of the memory location to be aligned so that a single cycle can update it. An unaligned access requires extra work, part of the bytes written by one cycle and part by another. Now it is not uninterruptible anymore.
Getting aligned updates is pretty easy, it is a guarantee provided by the compiler. Or, more broadly, by the memory model implemented by the compiler. Which simply chooses memory addresses that are aligned, sometimes intentionally leaving unused gaps of a few bytes to get the next variable aligned. An update to a variable that's larger than the native word size of the processor is generally not atomic unless the processor provides a special instruction for it.
But much more important are the kind of processor instructions you need to make threading work. Every processor implements a variant of the CAS instruction, compare-and-swap. It is the core atomic instruction you need to implement synchronization. Higher level synchronization primitives, like monitors (aka condition variables), mutexes, signals, critical sections and semaphores are all built on top of that core instruction.
That's the minimum; a processor usually provides extra ones to make simple operations atomic. Like incrementing a variable, at its core an interruptible operation since it requires a read-modify-write operation. Having a need for it to be atomic is very common; most any C++ program relies on it, for example to implement reference counting.
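As a small illustration of building an atomic operation out of compare-and-swap, here is a C++11 sketch of an "atomic maximum", which std::atomic does not provide directly in C++11. The names atomic_store_max and max_of_thread_ids are made up for this example; on Windows the analogous primitive would be one of the InterlockedCompareExchange functions.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Atomically raises `target` to at least `candidate` using a CAS loop.
void atomic_store_max(std::atomic<int>& target, int candidate) {
    int current = target.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak writes the value it found into
    // `current`, so the loop re-checks against the latest value.
    while (current < candidate &&
           !target.compare_exchange_weak(current, candidate,
                                         std::memory_order_relaxed)) {
    }
}

// Each of `threads` threads proposes its own id (1..threads) as the maximum.
int max_of_thread_ids(int threads) {
    assert(threads > 0);
    std::atomic<int> best{0};
    std::vector<std::thread> pool;
    for (int t = 1; t <= threads; ++t) {
        pool.emplace_back([&best, t] { atomic_store_max(best, t); });
    }
    for (auto& th : pool) th.join();
    return best.load();
}
```

Whatever order the threads run in, max_of_thread_ids(8) returns 8: a thread whose CAS fails simply retries against the newer value or gives up because its candidate is no longer larger.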
volatility does not guarantee thread safety at all
It doesn't. It is an attribute that dates from much easier times, back when machines only had a single processor core. It only affects code generation, in particular the way a code optimizer tries to eliminate memory accesses and use a copy of the value in a processor register instead. Makes a big, big difference to code execution speed, reading a value from a register is easily 3 times faster than having to read it from memory.
Applying volatile ensures that the code optimizer does not consider the value in the register to be accurate and forces it to read memory again. It truly only matters on the kind of memory values that are not stable by themselves, devices that expose their registers through memory-mapped I/O. It has been abused heavily since that core meaning to try to put semantics on top of processors with a weak memory model, Itanium being the most egregious example. What you get with volatile today is strongly dependent on the specific compiler and runtime you use. Never use it for thread-safety, always use a synchronization primitive instead.
simply being atomic/volatile is thread-safe
Programming would be much simpler if that was true. Atomic operations only cover the very simple operations, a real program often needs to keep an entire object thread-safe. Having all its members updated atomically and never expose a view of the object that is partially updated. Something as simple as iterating a list is a core example, you can't have another thread modifying the list while you are looking at its elements. That's when you need to reach for the higher-level synchronization primitives, the kind that can block code until it is safe to proceed.
Real programs often suffer from this synchronization need and exhibit Amdahl's law behavior. In other words, adding an extra thread does not actually make the program faster, and sometimes actually makes it slower. Whoever finds a better mousetrap for this is guaranteed a Nobel; we're still waiting.
In general, C and C++ don't give any guarantees about how reading or writing a 'volatile' object behaves in multithreaded programs. (C++11 made threads part of the standard and added std::atomic for this purpose, but it still gives 'volatile' no inter-thread guarantees; traditionally threads have not been part of standard C or C++ at all.) Using volatile and making assumptions about atomicity and cache-coherence in code that's meant to be portable is a problem. It's a crap-shoot as to whether a particular compiler and platform will treat accesses to 'volatile' objects in a thread-safe way.
The general rule is: 'volatile' is not enough to ensure thread safe access. You should use some platform-provided mechanism (usually some functions or synchronisation objects) to access thread-shared values safely.
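Where a C++11 compiler is available, a portable alternative to a volatile flag is std::atomic. The sketch below uses hypothetical names (run_until_stopped, stop, iterations) and shows a worker loop that is stopped from another thread. Declaring the flag as a volatile bool instead would make this a data race in standard C++.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Lets a worker spin until it has done at least `min_iterations` loops,
// then asks it to stop and returns how many loops it actually ran.
long run_until_stopped(long min_iterations) {
    assert(min_iterations >= 0);
    std::atomic<bool> stop{false};    // atomic, not volatile
    std::atomic<long> iterations{0};
    std::thread worker([&] {
        while (!stop.load(std::memory_order_acquire)) {
            iterations.fetch_add(1, std::memory_order_relaxed);
        }
    });
    // Wait until the worker has made some progress.
    while (iterations.load(std::memory_order_relaxed) < min_iterations) {
        std::this_thread::yield();
    }
    stop.store(true, std::memory_order_release);
    worker.join();
    return iterations.load(std::memory_order_relaxed);
}
```

The worker is guaranteed to observe the store to stop eventually, and the returned count is at least min_iterations.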
Now, specifically on Windows, specifically with the VC++ 2005+ compiler, and specifically on x86 and x64 systems, accessing a primitive object (like an int) can be made thread-safe if:
On 64- and 32-bit Windows, the object has to be a 32-bit type, and it has to be 32-bit aligned.
On 64-bit Windows, the object may also be a 64-bit type, and it has to be 64-bit aligned.
It must be declared volatile.
If those are true, then accesses to the object will be volatile, atomic and be surrounded by instructions that ensure cache-coherency. The size and alignment conditions must be met so that the compiler makes code that performs atomic operations when accessing the object. Declaring the object volatile ensures that the compiler doesn't make code optimisations related to caching previous values it may have read into a register and ensures that code generated includes appropriate memory barrier instructions when it's accessed.
Even so, you're probably still better off using something like the Interlocked* functions for accessing small things, and bog standard synchronisation objects like Mutexes or CriticalSections for larger objects and data structures. Ideally, get libraries for and use data structures that already include appropriate locks. Let your libraries & OS do the hard work as much as possible!
In your example, I expect you do need to use a thread-safe access to update val64 whether the thread is started yet or not.
If the thread was already running, then you would definitely need some kind of thread-safe write to val64, either using InterlockedExchange64 or similar, or by acquiring and releasing some kind of synchronisation object which will perform appropriate memory barrier instructions. Similarly, the thread would need to use a thread-safe accessor to read it as well.
In the case where the thread hasn't been resumed yet, it's a bit less clear. It's possible that ResumeThread might use or act like a synchronisation function and do the memory barrier operations, but the documentation doesn't specify that it does, so it is better to assume that it doesn't.
References:
On atomicity of 32- and 64- bit aligned types... https://msdn.microsoft.com/en-us/library/windows/desktop/ms684122%28v=vs.85%29.aspx
On 'volatile' including memory fences... https://msdn.microsoft.com/en-us/library/windows/desktop/ms686355%28v=vs.85%29.aspx
