Which std::sync::atomic::Ordering to use?

All the methods of std::sync::atomic::AtomicBool take a memory ordering (Relaxed, Release, Acquire, AcqRel, and SeqCst), which I have not used before. Under what circumstances should these values be used? The documentation uses confusing “load” and “store” terms which I don’t really understand. For example:
A producer thread mutates some state held by a Mutex, then calls AtomicBool::compare_and_swap(false, true, ordering) (to coalesce invalidations), and if it swapped, posts an “invalidate” message to a concurrent queue (e.g. mpsc or a winapi PostMessage). A consumer thread resets the AtomicBool, reads from the queue, and reads the state held by the Mutex. Can the producer use Relaxed ordering because it is preceded by a mutex, or must it use Release? Can the consumer use store(false, Relaxed), or must it use compare_and_swap(true, false, Acquire) to receive the changes from the mutex?
What if the producer and consumer share a RefCell instead of a Mutex?

I'm not an expert on this, and it's really complicated, so please feel free to critique my post. As pointed out by mdh.heydari, cppreference.com has much better documentation of orderings than Rust (C++ has an almost identical API).
For your question
You'd need to use "release" ordering in your producer and "acquire" ordering in your consumer. This ensures that the data mutation is visible to the consumer by the time it sees the AtomicBool set to true.
If your queue is asynchronous, then the consumer will need to keep trying to read from it in a loop, since the producer could get interrupted between setting the AtomicBool and putting something in the queue.
If the producer code might run multiple times before the consumer runs, then you can't use RefCell, because the producer could mutate the data while the consumer is reading it. Otherwise it's fine.
There are other better and simpler ways to implement this pattern, but I assume you were just giving it as an example.
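To make that concrete, here is a rough sketch of those orderings in Rust, using compare_exchange (the current replacement for compare_and_swap) and an mpsc channel as the queue; the names and structure are mine, not taken from your code:

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

fn main() {
    let state = Arc::new(Mutex::new(0u64));
    let dirty = Arc::new(AtomicBool::new(false));
    let (tx, rx) = mpsc::channel();

    // Producer: mutate the state, then post at most one "invalidate"
    // message until the consumer clears the flag again.
    {
        let (state, dirty) = (Arc::clone(&state), Arc::clone(&dirty));
        thread::spawn(move || {
            *state.lock().unwrap() += 1;
            // Release: the state mutation happens-before any thread that
            // acquire-reads this flag as true.
            if dirty
                .compare_exchange(false, true, Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                tx.send(()).unwrap(); // the "invalidate" message
            }
        });
    }

    // Consumer: clear the flag with Acquire, then read the state.
    for _msg in rx {
        dirty.swap(false, Ordering::Acquire);
        println!("state is now {}", *state.lock().unwrap());
    }
}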
What are orderings?
The different orderings have to do with what another thread sees happen when an atomic operation occurs. Compilers and CPUs are normally both allowed to reorder instructions in order to optimize code, and the orderings affect how much they're allowed to reorder instructions.
You could just always use SeqCst, which basically guarantees everyone will see that instruction as having occurred wherever you put it relative to other instructions, but in some cases if you specify a less restrictive ordering then LLVM and the CPU can better optimize your code.
You should think of these orderings as applying to a memory location (instead of applying to an instruction).
Ordering Types
Relaxed Ordering
There are no constraints besides any modification to the memory location being atomic (so it either happens completely or not at all). This is fine for something like a counter if the values retrieved by/set by individual threads don't matter as long as they're atomic.
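For example, a minimal sketch of such a counter (the thread and iteration counts are made up for illustration):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// A shared hit counter: we only care that each increment is atomic and
// that the final total adds up, not about ordering with other memory.
static HITS: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..1_000 {
                    HITS.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(HITS.load(Ordering::Relaxed), 4_000);
}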
Acquire Ordering
This constraint says that any variable reads that occur in your code after "acquire" is applied can't be reordered to occur before it. So, say in your code you read some shared memory location and get value X, which was stored in that memory location at time T, and then you apply the "acquire" constraint. Any memory locations that you read from after applying the constraint will have the value they had at time T or later.
This is probably what most people would expect to happen intuitively, but because a CPU and optimizer are allowed to reorder instructions as long as they don't change the result, it isn't guaranteed.
In order for "acquire" to be useful, it has to be paired with "release", because otherwise there's no guarantee that the other thread didn't reorder its write instructions that were supposed to occur at time T to an earlier time.
Acquire-reading the flag value you're looking for means you won't see a stale value somewhere else that was actually changed by a write before the release-store to the flag.
Release Ordering
This constraint says that any variable writes that occur in your code before "release" is applied can't be reordered to occur after it. So, say in your code you write to a few shared memory locations and then, at time T, do a store with the "release" constraint to some flag location. Any writes that appear in your code before that release-store are guaranteed to have occurred before it.
Again, this is what most people would expect to happen intuitively, but it isn't guaranteed without constraints.
If the other thread trying to read value X doesn't use "acquire", then it isn't guaranteed to see the new value with respect to changes in other variable values. So it could get the new value, but it might not see new values for any other shared variables. Also keep in mind that testing is hard. Some hardware won't in practice show re-ordering with some unsafe code, so problems can go undetected.
Jeff Preshing wrote a nice explanation of acquire and release semantics, so read that if this isn't clear.
AcqRel Ordering
This does both Acquire and Release ordering (i.e. both restrictions apply). I'm not sure when this is necessary - it might be helpful in situations with 3 or more threads where some release, some acquire, and some do both, but I'm not really sure.
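One place it does come up is a read-modify-write that both publishes your own writes and takes over data another thread published. A hedged sketch of my own (the single-slot "mailbox" is a made-up example, not from the question):

use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// A single-slot "mailbox": each caller publishes a boxed message and takes
// whatever message was there before. The swap needs AcqRel because it both
// releases our writes (so the next swapper sees the new message's contents)
// and acquires the writes that produced the message we take out.
static SLOT: AtomicPtr<String> = AtomicPtr::new(ptr::null_mut());

pub fn post_and_take(msg: String) -> Option<String> {
    let new = Box::into_raw(Box::new(msg));
    let old = SLOT.swap(new, Ordering::AcqRel);
    if old.is_null() {
        None
    } else {
        // Safe: the Acquire half of AcqRel synchronizes with the Release
        // half of the swap that published `old`.
        let taken = unsafe { Box::from_raw(old) };
        Some(*taken)
    }
}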
SeqCst Ordering
This is the most restrictive and, therefore, the slowest option. It forces memory accesses to appear to occur in one identical order to every thread. This requires an MFENCE instruction on x86 on all writes to atomic variables (a full memory barrier, including StoreLoad), while the weaker orderings don't. (SeqCst loads don't require a barrier on x86, as you can see in this C++ compiler output.)
Read-Modify-Write accesses, like atomic increment, or compare-and-swap, are done on x86 with locked instructions, which are already full memory barriers. If you care at all about compiling to efficient code on non-x86 targets, it makes sense to avoid SeqCst when you can, even for atomic read-modify-write ops. There are cases where it's needed, though.
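A classic case where SeqCst is needed is the store/load ("Dekker-style") pattern, where each thread stores its own flag and then loads the other thread's flag; Release/Acquire alone would still allow the store to be reordered after the load. A small sketch of my own:

use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Each thread stores its own flag, then loads the other's. With SeqCst it
// is impossible for both threads to read `false` and both "win"; with only
// Release/Acquire the store could be reordered after the load (StoreLoad
// reordering), and both could win.
static FLAG_A: AtomicBool = AtomicBool::new(false);
static FLAG_B: AtomicBool = AtomicBool::new(false);

fn main() {
    let t1 = thread::spawn(|| {
        FLAG_A.store(true, Ordering::SeqCst);
        if !FLAG_B.load(Ordering::SeqCst) {
            println!("thread 1 won"); // at most one thread prints
        }
    });
    let t2 = thread::spawn(|| {
        FLAG_B.store(true, Ordering::SeqCst);
        if !FLAG_A.load(Ordering::SeqCst) {
            println!("thread 2 won");
        }
    });
    t1.join().unwrap();
    t2.join().unwrap();
}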
For more examples of how atomic semantics turn into ASM, see this larger set of simple functions on C++ atomic variables. I know this is a Rust question, but it's supposed to have basically the same API as C++. godbolt can target x86, ARM, ARM64, and PowerPC. Interestingly, ARM64 has load-acquire (ldar) and store-release (stlr) instructions, so it doesn't always have to use separate barrier instructions.
By the way, x86 CPUs are always "strongly ordered": every load already behaves as an acquire and every store as a release, so on x86 the weaker orderings mostly just affect how much freedom LLVM's optimizer has. ARM, on the other hand, is weakly ordered, so Relaxed there lets both the compiler and the CPU reorder freely and avoids extra barrier instructions on those weakly-ordered CPUs.

Related

CUDA critical sections, thread/warp execution model and NVCC compiler decisions

Recently I posted this question about a critical section; here is a similar question. In both questions the given answer says that whether the code "works" or not is up to the compiler, because the order of the various paths of execution is up to the compiler.
To elaborate on the rest of the question, I need the following excerpts from the CUDA Programming Guide:
... Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently....
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path....
The execution context (program counters, registers, etc.) for each warp processed by a multiprocessor is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.
What I understand from these three excerpts is that threads can diverge freely from the rest, that all the branch possibilities will be serialized if threads of a warp diverge, and that a taken branch executes until completion. And that is why the questions mentioned above end in deadlock: the ordering of the execution paths imposed by the compiler results in taking a branch that never gets the lock.
Now the questions are: shouldn't the compiler always put the branches in the order written by the user? Is there a high-level way to enforce that order? I know the compiler can optimize and reorder instructions, etc., but it should not fundamentally change the logic of the code (yes, there are exceptions, like some memory accesses without the volatile keyword, but that is why the keyword exists: to give control to the user).
Edit
The main point of this question is not about critical sections; it is about the compiler. For example, in the first link, a compilation flag drastically changes the logic of the code: one version "works" and the other doesn't. What bothers me is that all the references only say "be careful", and say nothing about undefined behaviour from the nvcc compiler.
I believe the order of execution is not set, nor guaranteed, by the CUDA compiler. It's the hardware that sets it - as far as I can recall.
Thus,
the compiler shouldn't always put the branches in the order written by the user?
It doesn't control execution order anyway
is there a high level way to enforce the order?
Just the synchronization instructions like __syncthreads().
The compiler... should not fundamentally change the logic of the code
The semantics of CUDA code is not the same as for C++ code... sequential execution of if branches is not part of the semantics.
I realize this answer may not be satisfying to you, but that's how things stand, for better or for worse.

What is the point of MESI on Intel 64 and IA-32

The point of MESI is to retain a notion of a shared memory system.
However, with store buffers, things are complicated:
Memory is coherent downstream of the point where data hits the MESI-implementing caches.
However, upstream of that, each core may disagree on what is in memory location X, dependent on what is in each core's local store buffer.
As such, it seems like, from the viewpoint of each core, that the state of memory is different - it is not coherent.
So, why do we bother "partially" enforcing coherency with MESI?
Edit: A substantial edit was made, after some further narrowing of what was really confusing me. I have tried to keep the general notion of the question the same, to preserve the relevance of the great answers received.
The point of MESI on x86 is the same as on pretty much any multiple core/CPU system: to enforce cache consistency. There is no "partial coherency" used for the cache coherency part of the equation on x86: the caches are fully coherent. The possible re-orderings, then, are a result of both the coherent caching system and the interaction with core-local components such as the load/store subsystem (especially store buffers) and other out-of-order machinery.
The result of that interaction is the architected strong memory model that x86 provides, with only limited re-ordering. Without coherent caches, you couldn't reasonably implement this model at all, or almost any model that was anything other than completely weak1.
Your question seems to embed the assumption that the only possible states are "coherent" and "everything else". Also, there is some mixing of the ideas of cache coherency (which deals mostly with the caches specifically, and is mostly a hidden detail) and the memory consistency model, which is architecturally defined and will be implemented by each architecture2. Wikipedia explains that one difference between cache coherency and memory consistency is that the rules for the former apply only to one location at a time, whereas consistency rules apply across locations. In practice, the more important distinction is that the memory consistency model is the only architecturally documented one.
Briefly, Intel (and AMD likewise) define a specific memory consistency model, x86-TSO3 - which is relatively strong as far as memory models go, but is still weaker than sequential consistency. The primary behaviors weakened compared to sequential consistency are:
That later loads can pass earlier stores.
That stores can be seen in a different order from the total store order, but only by cores that performed one of the stores.
In order to implement this memory model, various parts must play by the rules to achieve it. On all recent x86, this means ordered load and store buffers, which avoid the disallowed re-orderings. The use of a store buffer results in the two re-orderings mentioned above: without allowing those, the implementation would be very restricted and probably much slower. In practice it also means fully coherent data caches, since many of the guarantees (e.g., no load-load reordering) would be very difficult to implement without that.
To wrap it all up:
Memory consistency is different than cache coherency: the former is what is documented and forms part of the programming model.
In practice, x86 implementations have fully coherent caches, which helps them implement their x86-TSO memory model, which is fairly strong but weaker than sequential consistency.
Finally, perhaps the answer you were looking for, in different words: a memory model weaker than sequential consistency is still very useful since you can program against it, and in the case you need sequential consistency for some particular operations(s) you insert the right memory barriers4.
If you program against a language-supplied memory model, such as Java's or C++11's, you don't need to worry about the hardware specifics, but rather about the language memory model, and the compiler inserts the barriers required to match the language memory model semantics to the hardware one. The stronger the hardware model, the fewer barriers required.
1 If your memory model was completely weak, i.e., not really placing any restrictions on cross-core reordering, I suppose you could implement it directly on a non-cache coherent system in a cheap way for normal operations, but then memory barriers potentially become very expensive since they would need to flush a potentially large part of the local private cache.
2 Various chips may implement it differently internally, and in particular some chips may implement stronger semantics than the model (i.e., some allowed re-orderings can never be observed), but absent bugs none will implement a weaker one.
3 This is the name given to it in that paper, which I used because Intel themselves don't give it a name, and the paper is a more formal definition than the one Intel gives, which is a less formal model presented as a series of litmus tests.
4 In practice on x86 you usually use locked instructions (using the lock prefix) rather than separate barriers, although standalone barriers exist also. Here I'll just use the term barriers to refer to both standalone barriers and the barrier semantics embedded in locked instructions.
Re: your edit, which seems to be a new question: right, store-forwarding "violates" coherency. A core can see its own stores earlier than any other core can see them. The store buffer is not coherent.
The x86 memory ordering rules require that loads become globally visible in program order, but allows a core to load data from its own stores before they become globally visible. It doesn't have to pretend it waited and check for memory-order mis-speculation, like it does in other cases of doing loads earlier than the memory model says it should.
Also related; Can x86 reorder a narrow store with a wider load that fully contains it? is a specific example of the store buffer + store-forwarding violating the usual memory-ordering rules. See this collection of mailing list posts by Linus Torvalds explaining the effect of store-forwarding on memory ordering (and how it means that the proposed locking scheme doesn't work).
Without coherency at all, how would you atomically increment a shared counter, or implement other atomic read-modify-write operations that are essential for implementing locks, or for use directly in lockless code. (See Can num++ be atomic for 'int num'?).
lock add [shared_counter], 1 in multiple threads at the same time never loses any counts on actual x86, because the lock prefix makes a core keep exclusive ownership of the cache line from the load until the store commits to L1d (and thus becomes globally visible).
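At the source level that's just an atomic fetch_add. A tiny hedged sketch in Rust (the same idea as C++ std::atomic) that you could paste into godbolt to see the lock-prefixed instruction, typically lock add when the old value is ignored and lock xadd when it's used:

use std::sync::atomic::{AtomicUsize, Ordering};

// On x86 this read-modify-write compiles to a lock-prefixed instruction,
// which keeps exclusive ownership of the cache line from the load until
// the store commits, so no increments are lost.
pub fn hit(counter: &AtomicUsize) {
    counter.fetch_add(1, Ordering::SeqCst);
}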
A system without coherent caches would let each thread increment its own copy of a shared counter, and then the final value in memory would come from whichever thread last flushed that line.
Allowing different caches to hold conflicting data for the same line long-term even while other loads/stores happened, and across memory barriers, would allow all sorts of weirdness.
It would also violate the assumption that a pure store becomes visible to other cores promptly. If you didn't have coherency at all, then cores could keep using their cached copy of a shared variable. So if you wanted readers to notice updates, you'd have to clflush before every read of the shared variable, making the common case expensive (when nobody has modified the data since you last checked).
MESI is like a push notification system, instead of forcing every reader to re-validate their cache on every read.
MESI (or coherency in general) allows things like RCU (Read-Copy-Update) to have zero overhead for readers (compared to single threaded) in the case where the shared data structure hasn't been modified. See https://lwn.net/Articles/262464/, and https://en.wikipedia.org/wiki/Read-copy-update. The basic idea is that instead of locking a data structure, a writer copies the whole thing, modifies the copy, and then updates a shared pointer to point to the new version. So readers are always entirely wait-free; they just dereference an (atomic) pointer, and data stays hot in their L1d caches.
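Here is a deliberately oversimplified sketch of that publish-by-pointer-swing idea in Rust; the Config type is made up, and old versions are simply leaked, because deciding when it is safe to free them (grace periods) is exactly the hard part real RCU implementations solve:

use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Hypothetical shared configuration; the real point is the pointer swing.
pub struct Config {
    max_connections: usize,
}

static CURRENT: AtomicPtr<Config> = AtomicPtr::new(ptr::null_mut());

pub fn publish(new: Config) {
    let new = Box::into_raw(Box::new(new));
    // Release: readers that acquire-load the new pointer see a fully
    // built Config. The old version is intentionally leaked here, since
    // knowing when it is safe to free it is what grace periods are for.
    let _old = CURRENT.swap(new, Ordering::Release);
}

pub fn max_connections() -> Option<usize> {
    let p = CURRENT.load(Ordering::Acquire); // pairs with the Release swap
    if p.is_null() {
        None
    } else {
        Some(unsafe { (*p).max_connections })
    }
}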
Hardware-supported coherency is extremely valuable, and almost every shared-memory SMP architecture uses it. Even ISAs with much weaker memory-ordering rules than x86, like PowerPC, use MESI.

Why aren't out of order CPUs troublesome?

I've recently learned about out-of-order execution CPUs from this link: https://en.wikipedia.org/wiki/Out-of-order_execution
There is something that I can't quite understand. Why aren't these kinds of CPUs troublesome? I mean, if I have instructions executing out of order, even if they apply to different data, won't I be able to reach a situation where data is not updated according to the program order?
I mean, if I have something like:
x = 1;
y = 2;
x = x+y;
print x;
print y;
what prevents the "print y" instruction from being executed before the "print x"?
Maybe I'm getting something wrong about this kind of CPU; can you explain it to me?
Thanks in advance
In an out-of-order processor, the instructions are executed out-of-order but committed in order. This means that externally visible state becomes visible in order. Writes to memory (including I/O operations) are buffered locally so that they can be released to the external memory system in order, similarly register results are stored locally using register renaming. This allows the local processor to use the early speculative values without violating externally visible ordering. On incorrect speculation, a rollback mechanism is used to restore state back to a previous known-valid state.
Note that technically commitment of results to processor-core-external state does not have to be in order as long as the out-of-order result is non-speculative and does not violate ordering guarantees. With weak memory consistency models, this could (in theory) allow values to become externally visible out-of-order. (I/O is required to be in-order, so the print example would still be required to commit in order.) Furthermore, if other cores know that values are speculative in nature or order, the values could be made externally visible out-of-order (again, in theory) and consumers of the values would rollback on incorrect speculation. (This is extending "externally visible" from external to the single processor core to external to some larger system component which is aware of speculation and supports rollback.)
(In very impractical theory, extending speculative realization to the human computer interface would be possible if the definition of the interface allowed such glitches (i.e., the human corrects for wrongly speculated values and ordering). However, since humans are even slower than most I/O devices (speculations would be resolved on a shorter time scale than is significant) and such an extension of speculation would be extremely complex and generally undesirable, it is unlikely ever to be broadly used.)
Paul A. Clayton already gave a great answer. I would like to mention one more thing.
Earlier out-of-order architectures were notoriously troublesome. The fundamental problem was that they couldn't guarantee precise interrupts. This changed in the late 80s/early 90s due to the solutions proposed in Smith & Pleszkun's seminal paper. They introduced the idea of a reorder buffer which allows instructions to issue out of order but commit in order.

Atomicity, Volatility and Thread Safety in Windows

It's my understanding of atomicity that it's used to make sure a value will be read/written in whole rather than in parts. For example, a 64-bit value that is really two 32-bit DWORDs (assume x86 here) must be atomic when shared between threads so that both DWORDs are read/written at the same time. That way one thread can't read a variable when only half of it has been updated. How do you guarantee atomicity?
Furthermore it's my understanding that volatility does not guarantee thread safety at all. Is that true?
I've seen it implied many places that simply being atomic/volatile is thread-safe. I don't see how that is. Won't I need a memory barrier as well to ensure that any values, atomic or otherwise, are read/written before they can actually be guaranteed to be read/written in the other thread?
So for example let's say I create a thread suspended, do some calculations to change some values to a struct available to the thread and then resume, for example:
HANDLE hThread = CreateThread(NULL, 0, thread_entry, (void *)&data, CREATE_SUSPENDED, NULL);
data->val64 = SomeCalculation();
ResumeThread(hThread);
I suppose this would depend on any memory barriers in ResumeThread? Should I do an interlocked exchange for val64? What if the thread were running, how does that change things?
I'm sure I'm asking a lot here but basically what I'm trying to figure out is what I asked in the title: a good explanation for atomicity, volatility and thread safety in Windows. Thanks
it's used to make sure a value will be read/written in whole
That's just a small part of atomicity. At its core it means "uninterruptible", an instruction on a processor whose side-effects cannot be interleaved with another instruction. By design, a memory update is atomic when it can be executed with a single memory-bus cycle. Which requires the address of the memory location to be aligned so that a single cycle can update it. An unaligned access requires extra work, part of the bytes written by one cycle and part by another. Now it is not uninterruptible anymore.
Getting aligned updates is pretty easy, it is a guarantee provided by the compiler. Or, more broadly, by the memory model implemented by the compiler. Which simply chooses memory addresses that are aligned, sometimes intentionally leaving unused gaps of a few bytes to get the next variable aligned. An update to a variable that's larger than the native word size of the processor can never be atomic.
But much more important are the kind of processor instructions you need to make threading work. Every processor implements a variant of the CAS instruction, compare-and-swap. It is the core atomic instruction you need to implement synchronization. Higher level synchronization primitives, like monitors (aka condition variables), mutexes, signals, critical sections and semaphores are all built on top of that core instruction.
That's the minimum; a processor usually provides extra ones to make simple operations atomic. Like incrementing a variable, at its core an interruptible operation since it requires a read-modify-write operation. Needing it to be atomic is very common; most any C++ program relies on it, for example to implement reference counting.
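To illustrate that reference-counting use, here is a sketch of my own in Rust's portable atomics rather than Windows Interlocked* calls; the orderings follow the usual Arc/shared_ptr pattern:

use std::sync::atomic::{fence, AtomicUsize, Ordering};

// A minimal atomic reference count in the style of Arc/shared_ptr.
pub struct RefCounted {
    count: AtomicUsize,
    // ... the payload would live here ...
}

pub fn retain(obj: &RefCounted) {
    // Relaxed is enough: we already hold a reference, so the object
    // cannot be freed out from under us while we increment.
    obj.count.fetch_add(1, Ordering::Relaxed);
}

// Returns true when the caller just dropped the last reference and may
// destroy the object.
pub fn release(obj: &RefCounted) -> bool {
    if obj.count.fetch_sub(1, Ordering::Release) == 1 {
        // Acquire fence: make every other thread's last use of the object
        // visible to us before the object is torn down.
        fence(Ordering::Acquire);
        return true;
    }
    false
}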
volatility does not guarantee thread safety at all
It doesn't. It is an attribute that dates from much easier times, back when machines only had a single processor core. It only affects code generation, in particular the way a code optimizer tries to eliminate memory accesses and use a copy of the value in a processor register instead. Makes a big, big difference to code execution speed, reading a value from a register is easily 3 times faster than having to read it from memory.
Applying volatile ensures that the code optimizer does not consider the value in the register to be accurate and forces it to read memory again. It truly only matters on the kind of memory values that are not stable by themselves, devices that expose their registers through memory-mapped I/O. It has been abused heavily since that core meaning to try to put semantics on top of processors with a weak memory model, Itanium being the most egregious example. What you get with volatile today is strongly dependent on the specific compiler and runtime you use. Never use it for thread-safety, always use a synchronization primitive instead.
simply being atomic/volatile is thread-safe
Programming would be much simpler if that was true. Atomic operations only cover the very simple operations, a real program often needs to keep an entire object thread-safe. Having all its members updated atomically and never expose a view of the object that is partially updated. Something as simple as iterating a list is a core example, you can't have another thread modifying the list while you are looking at its elements. That's when you need to reach for the higher-level synchronization primitives, the kind that can block code until it is safe to proceed.
Real programs often suffer from this synchronization need and exhibit Amdahl's law behavior. In other words, adding an extra thread does not actually make the program faster, and sometimes actually makes it slower. Whoever finds a better mouse-trap for this is guaranteed a Nobel; we're still waiting.
In general, C and C++ don't give any guarantees about how reading or writing a 'volatile' object behaves in multithreaded programs. (The 'new' C++11 probably does, since it now includes threads as part of the standard, but traditionally threads have not been part of standard C or C++.) Using volatile and making assumptions about atomicity and cache-coherence in code that's meant to be portable is a problem. It's a crap-shoot as to whether a particular compiler and platform will treat accesses to 'volatile' objects in a thread-safe way.
The general rule is: 'volatile' is not enough to ensure thread safe access. You should use some platform-provided mechanism (usually some functions or synchronisation objects) to access thread-shared values safely.
Now, specifically on Windows, specifically with the VC++ 2005+ compiler, and specifically on x86 and x64 systems, accessing a primitive object (like an int) can be made thread-safe if:
On 64- and 32-bit Windows, the object has to be a 32-bit type, and it has to be 32-bit aligned.
On 64-bit Windows, the object may also be a 64-bit type, and it has to be 64-bit aligned.
It must be declared volatile.
If those are true, then accesses to the object will be volatile, atomic and be surrounded by instructions that ensure cache-coherency. The size and alignment conditions must be met so that the compiler makes code that performs atomic operations when accessing the object. Declaring the object volatile ensures that the compiler doesn't make code optimisations related to caching previous values it may have read into a register and ensures that code generated includes appropriate memory barrier instructions when it's accessed.
Even so, you're probably still better off using something like the Interlocked* functions for accessing small things, and bog standard synchronisation objects like Mutexes or CriticalSections for larger objects and data structures. Ideally, get libraries for and use data structures that already include appropriate locks. Let your libraries & OS do the hard work as much as possible!
In your example, I expect you do need to use a thread-safe access to update val64 whether the thread is started yet or not.
If the thread was already running, then you would definitely need some kind of thread-safe write to val64, either using InterlockedExchange64 or similar, or by acquiring and releasing some kind of synchronisation object which will perform appropriate memory barrier instructions. Similarly, the thread would need to use a thread-safe accessor to read it as well.
In the case where the thread hasn't been resumed yet, it's a bit less clear. It's possible that ResumeThread might use or act like a synchronisation function and do the memory barrier operations, but the documentation doesn't specify that it does, so it is better to assume that it doesn't.
References:
On atomicity of 32- and 64- bit aligned types... https://msdn.microsoft.com/en-us/library/windows/desktop/ms684122%28v=vs.85%29.aspx
On 'volatile' including memory fences... https://msdn.microsoft.com/en-us/library/windows/desktop/ms686355%28v=vs.85%29.aspx

How to understand acquire and release semantics?

I found three functions on MSDN, listed below:
1. InterlockedDecrement()
2. InterlockedDecrementAcquire()
3. InterlockedDecrementRelease()
I know those functions are used to decrement a value as an atomic operation, but I don't know the distinction between the three.
(um... but don't ask me what it means exactly)
I'll take a stab at that.
Something to remember is that the compiler, or the CPU itself, may reorder memory reads and writes if they appear to not deal with each other.
This is useful, for instance, if you have some code that, maybe is updating a structure:
if ( playerMoved ) {
    playerPos.X += dx;
    playerPos.Y += dy;
    // Keep the player above the world's surface.
    if ( playerPos.Z + dz > 0 ) {
        playerPos.Z += dz;
    }
    else {
        playerPos.Z = 0;
    }
}
Most of the above statements may be reordered because there's no data dependency between them; in fact, a superscalar CPU may execute several of those statements simultaneously, or might start working on the Z section sooner, since it doesn't affect X or Y but might take longer.
Here's the problem with that - let's say that you're attempting lock-free programming. You want to perform a whole bunch of memory writes to, maybe, fill in a shared queue. You signal that you're done by finally writing to a flag.
Well, since that flag appears to have nothing to do with the rest of the work being done, the compiler and the CPU may reorder those instructions, so you may set your 'done' flag before you've actually committed the rest of the structure to memory - and now your "lock-free" queue doesn't work.
This is where Acquire and Release ordering semantics come into play. I mark that I'm starting work by reading a flag (or doing a read-modify-write on it) with Acquire semantics, and the CPU guarantees that any memory games I play after that instruction actually stay below that instruction. I mark that I'm done by writing a flag with Release semantics, and the CPU guarantees that any memory games I had done just before the release actually stay before the release.
Normally, one would do this using explicit locks - mutexes, semaphores, etc, in which the CPU already knows it has to pay attention to memory ordering. The point of attempting to create 'lock free' data structures is to provide data structures that are thread safe (for some meaning of thread safe), that don't use explicit locks (because they are very slow).
Creating lock-free data structures is possible on a CPU or compiler that doesn't support acquire/release ordering semantics, but it usually means that some slower memory ordering semantic is used. For instance, you could issue a full memory barrier - everything that came before this instruction has to actually be committed before this instruction, and everything that came after this instruction has to be committed actually after this instruction. But that might mean that I wait for a bunch of actually irrelevant memory writes from earlier in the instruction stream (perhaps function call prologue) that have nothing to do with the memory safety I'm trying to implement.
Acquire says "only worry about stuff after me". Release says "only worry about stuff before me". Combining those both is a full memory barrier.
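Here is a small sketch of that "fill in the data, then publish a flag" pattern, written with Rust's portable atomics rather than the Interlocked*Acquire/Release variants, purely to show where Release and Acquire go; the names are made up:

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

pub fn producer() {
    DATA.store(42, Ordering::Relaxed);     // fill in the payload first
    READY.store(true, Ordering::Release);  // then publish the "done" flag
}

pub fn consumer() -> Option<u64> {
    if READY.load(Ordering::Acquire) {     // pairs with the Release store
        Some(DATA.load(Ordering::Relaxed)) // guaranteed to see 42
    } else {
        None
    }
}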
http://preshing.com/20120913/acquire-and-release-semantics/
Acquire semantics is a property which can only apply to operations which read from shared memory, whether they are read-modify-write operations or plain loads. The operation is then considered a read-acquire. Acquire semantics prevent memory reordering of the read-acquire with any read or write operation which follows it in program order.
Release semantics is a property which can only apply to operations which write to shared memory, whether they are read-modify-write operations or plain stores. The operation is then considered a write-release. Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order.
