Go atomic and memory order

Go atomic and memory order - c++11

I am porting a lock free queue from c++11 to go and i came across things such as
auto currentRead = writeIndex.load(std::memory_order_relaxed);
and in some cases std::memory_order_release and std::memory_order_aqcuire
also the equivelent for the above in c11 is something like
unsigned long currentRead = atomic_load_explicit(&q->writeIndex,memory_order_relaxed);
the meaning of those is described here
is there an equivalent to such thing in go or do i just use something like
var currentRead uint64 = atomic.LoadUint64(&q.writeIndex)
after porting i benchmarked and just using LoadUint64 it seems to work as expected but orders of magnitude slower and i wonder how much effect dose those specialized ops have on performance.
further info from the link i attached
memory_order_relaxed:Relaxed operation: there are no synchronization
or ordering constraints, only atomicity is required of this operation.
memory_order_consume:A load operation with this memory order performs
a consume operation on the affected memory location: no reads in the
current thread dependent on the value currently loaded can be
reordered before this load. This ensures that writes to data-dependent
variables in other threads that release the same atomic variable are
visible in the current thread. On most platforms, this affects
compiler optimizations only.
memory_order_acquire:A load operation with this memory order performs the acquire operation on the affected memory location: no
memory accesses in the current thread can be reordered before this
load. This ensures that all writes in other threads that release the
same atomic variable are visible in the current thread.
memory_order_release:A store operation with this memory order performs the release operation: no memory accesses in the current
thread can be reordered after this store. This ensures that all writes
in the current thread are visible in other threads that acquire or the
same atomic variable and writes that carry a dependency into the
atomic variable become visible in other threads that consume the same
atomic.

You need to read The Go Memory Model
You'll discover that Go has nothing like the control that you have in C++ - there isn't a direct translation of the C++ features in your post. This is a deliberate design decision by the Go authors - the Go motto is Do not communicate by sharing memory; instead, share memory by communicating.
Assuming that the standard go channel isn't good enough for what you want to do, you'll have 2 choices for each memory access, using the facilities in sync/atomic or not, and whether you need to use them or not will depend on a careful reading of the Go Memory Model and analysis of your code which only you can do.

Related

Is cache coherency only an issue when storing and not when loading?

I came across this code emission for x64 were "Atomic Load" is using a simple movq whereas "Atomic Store" is using xchgq.
This link explains that Atomic Load/Stores on aligned addresses are atomic by default. I'm assuming that's why Atomic Load in the above link is using a simple movq.
I have the following questions;
Is Atomic Store using a xchgq (which enables LOCK by default) to fix any issues with cache lines? essentially it's making sure all cache lines are updated properly? If cache line wasn't an issue they could have just used movq?
Does it also mean cache coherency is only an issue when Storing? As Load above is not using a locked instruction?

No, seq_cst stores use xchg (or mov + mfence but that's slower on recent CPUs) for ordering wrt. other operations. release or relaxed atomic stores can just use mov and will still be promptly visible to other cores. (Not before later loads in this thread might have executed, though.)
Cache coherence isn't the cause of memory-reordering, that's local to each core. (For x86, the memory model is program order + a store buffer with store-forwarding. It's the store buffer that causes stores to not become visible until after the store instruction has retired from out-of-order exec.)
The answer you linked which says "if I set this to true (or false), no other thread will read a different value after I've set it" (that's not quite such a certainty - you need a "lock" prefix to guarantee that). is somewhat misleading. They mean that (implicit-lock) xchg includes a full memory barrier, so no code in the storing thread can access memory until after the store is actually committed to cache, globally visible.
A clearer way to state that is that it makes this thread wait without doing anything until the store is visible. i.e. stall this thread until the store buffer has finished committing all previous stores. That would eventually happen on its own. So it's really about ordering of this thread relative to store visibility, not other threads. Other threads (cores) can locally do their own early loading / late storing, although on x86 all loads happen in program order. That's why I commented on that answer you linked to disagree with the way it was presenting things.
Can a speculatively executed CPU branch contain opcodes that access RAM? (What a store buffer does)
C++ How is release-and-acquire achieved on x86 only using MOV? discusses cache-coherency and limits on local reordering being enough to give release/acquire synchronization.
Why does a std::atomic store with sequential consistency use XCHG?
Acquire-release on x86
Which is a better write barrier on x86: lock+addl or xchgl? - shows in more detail why we need xchg or a separate memory barrier for a seq_cst store.
https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/ (It's talking about Java volatile, which is like C++ std::atomic with memory_order_seq_cst.
How does memory reordering help processors and compilers?
Does atomic read guarantees reading of the latest value? - people often get hung up on "latest value" guarantees. Don't. Acquire/release just works, and stronger orders or memory barriers don't make stores visible to other cores sooner in any significant way.

Why we do not use barriers in User space

I am reading about memory barriers and what I can summarize is that they prevent instruction re-ordering done by compilers.
So in User space memory lets say I have
b = 0;
main(){
a = 10;
b = 20;
c = add(a,b);
}
Can the compiler reorder this code so that b = 20 assignment happens after c = add() is called.
Why we do not use barriers in this case ? Am I missing some fundamental here.
Does Virtual memory is exempted from any re ordering ?
Extending the Question further:
In Network driver:
1742 /*
1743 * Writing to TxStatus triggers a DMA transfer of the data
1744 * copied to tp->tx_buf[entry] above. Use a memory barrier
1745 * to make sure that the device sees the updated data.
1746 */
1747 wmb();
1748 RTL_W32_F (TxStatus0 + (entry * sizeof (u32)),
1749 tp->tx_flag | max(len, (unsigned int)ETH_ZLEN));
1750
When he says devices see the updated data... How to relate this with the multi threaded theory for usage of barriers.

Short answer
Memory barriers are used less frequently in user mode code than kernel mode code because user mode code tends to use higher level abstractions (for example pthread synchronization operations).
Additional details
There are two things to consider when analyzing the possible ordering of operations:
What order the thread that is executing the code will see the operations in
What order other threads will see the operations in
In your example the compiler cannot reorder b=20 to occur after c=add(a,b) because the c=add(a,b) operation uses the results of b=20. However, it may be possible for the compiler to reorder these operations so that other threads see the memory location associated with c change before the memory location associated with b changes.
Whether this would actually happen or not depends on the memory consistency model that is implemented by the hardware.
As for when the compiler might do reordering you could imagine adding another variable as follows:
b = 0;
main(){
a = 10;
b = 20;
d = 30;
c = add(a,b);
}
In this case the compiler would be free to move the d=30 assignment to occur after c=add(a,b).
However, this entire example is too simplistic. The program doesn't do anything and the compiler can eliminate all the operations and does not need to write anything to memory.
Addendum: Memory reordering example
In a multiprocessor environment multiple threads can see memory operations occur in different orders. The Intel Software Developer's Manual has some examples in Volume 3 section 8.2.3. I've copied a screenshot below that shows an example where loads and stores can be reordered.
There is also a good blog post that provides some more detail about this example.

The thread running the code will always act as if the effects of the source lines of its own code happened in program order. This is as if rule is what enables most compiler optimizations.
Within a single thread, out-of-order CPUs track dependencies to give a thread the illusion that all its instructions executed in program order. The globally-visible (to threads on other cores) effects may be seen out-of-order by other cores, though.
Memory barriers (as part of locking, or on their own) are only needed in code that interacts with other threads through shared memory.
Compilers can similarly do any reordering / hoisting they want, as long as the results are the same. The C++ memory model is very weak, so compile-time reordering is possible even when targeting an x86 CPU. (But of course not reordering that produces different results within the local thread.) C11 <stdatomic.h> and the equivalent C++11 std::atomic are the best way to tell the compiler about any ordering requirements you have for the global visibility of operations. On x86, this usually just results in putting store instructions in source order, but the default memory_order_seq_cst needs an MFENCE on each store to prevent StoreLoad reordering for full sequential consistency.
In kernel code, memory barriers are also common to make sure that stores to memory-mapped I/O registers happen in a required order. The reasoning is the same: to order the globally-visible effects on memory of a sequence of stores and loads. The difference is that the observer is an I/O device, not a thread on another CPU. The fact that cores interact with each other through a cache coherency protocol is irrelevant.

The compiler cannot reorder (nor can the runtime or the cpu) so that b=20 is after the c=add()since that would change the semantics of the method and that is not permissible.
I would say that for the compiler (or runtime or cpu) to act as you describe would make the behaviour random, which would be a bad thing.
This restriction on reordering applies only within the thread executing the code. As #GabrielSouthern points out, the ordering of the stores becoming globally visible is not guaranteed, if a, b, and c are all global variables.

Atomicity, Volatility and Thread Safety in Windows

It's my understanding of atomicity that it's used to make sure a value will be read/written in whole rather than in parts. For example, a 64-bit value that is really two 32-bit DWORDs (assume x86 here) must be atomic when shared between threads so that both DWORDs are read/written at the same time. That way one thread can't read half variable that's not updated. How do you guarantee atomicity?
Furthermore it's my understanding that volatility does not guarantee thread safety at all. Is that true?
I've seen it implied many places that simply being atomic/volatile is thread-safe. I don't see how that is. Won't I need a memory barrier as well to ensure that any values, atomic or otherwise, are read/written before they can actually be guaranteed to be read/written in the other thread?
So for example let's say I create a thread suspended, do some calculations to change some values to a struct available to the thread and then resume, for example:
HANDLE hThread = CreateThread(NULL, 0, thread_entry, (void *)&data, CREATE_SUSPENDED, NULL);
data->val64 = SomeCalculation();
ResumeThread(hThread);
I suppose this would depend on any memory barriers in ResumeThread? Should I do an interlocked exchange for val64? What if the thread were running, how does that change things?
I'm sure I'm asking a lot here but basically what I'm trying to figure out is what I asked in the title: a good explanation for atomicity, volatility and thread safety in Windows. Thanks

it's used to make sure a value will be read/written in whole
That's just a small part of atomicity. At its core it means "uninterruptible", an instruction on a processor whose side-effects cannot be interleaved with another instruction. By design, a memory update is atomic when it can be executed with a single memory-bus cycle. Which requires the address of the memory location to be aligned so that a single cycle can update it. An unaligned access requires extra work, part of the bytes written by one cycle and part by another. Now it is not uninterruptible anymore.
Getting aligned updates is pretty easy, it is a guarantee provided by the compiler. Or, more broadly, by the memory model implemented by the compiler. Which simply chooses memory addresses that are aligned, sometimes intentionally leaving unused gaps of a few bytes to get the next variable aligned. An update to a variable that's larger than the native word size of the processor can never be atomic.
But much more important are the kind of processor instructions you need to make threading work. Every processor implements a variant of the CAS instruction, compare-and-swap. It is the core atomic instruction you need to implement synchronization. Higher level synchronization primitives, like monitors (aka condition variables), mutexes, signals, critical sections and semaphores are all built on top of that core instruction.
That's the minimum, a processor usually provide extra ones to make simple operations atomic. Like incrementing a variable, at its core an interruptible operation since it requires a read-modify-write operation. Having a need for it be atomic is very common, most any C++ program relies on it for example to implement reference counting.
volatility does not guarantee thread safety at all
It doesn't. It is an attribute that dates from much easier times, back when machines only had a single processor core. It only affects code generation, in particular the way a code optimizer tries to eliminate memory accesses and use a copy of the value in a processor register instead. Makes a big, big difference to code execution speed, reading a value from a register is easily 3 times faster than having to read it from memory.
Applying volatile ensures that the code optimizer does not consider the value in the register to be accurate and forces it to read memory again. It truly only matters on the kind of memory values that are not stable by themselves, devices that expose their registers through memory-mapped I/O. It has been abused heavily since that core meaning to try to put semantics on top of processors with a weak memory model, Itanium being the most egregious example. What you get with volatile today is strongly dependent on the specific compiler and runtime you use. Never use it for thread-safety, always use a synchronization primitive instead.
simply being atomic/volatile is thread-safe
Programming would be much simpler if that was true. Atomic operations only cover the very simple operations, a real program often needs to keep an entire object thread-safe. Having all its members updated atomically and never expose a view of the object that is partially updated. Something as simple as iterating a list is a core example, you can't have another thread modifying the list while you are looking at its elements. That's when you need to reach for the higher-level synchronization primitives, the kind that can block code until it is safe to proceed.
Real programs often suffer from this synchronization need and exhibit Amdahls' law behavior. In other words, adding an extra thread does not actually make the program faster. Sometimes actually making it slower. Whomever finds a better mouse-trap for this is guaranteed a Nobel, we're still waiting.

In general, C and C++ don't give any guarantees about how reading or writing a 'volatile' object behaves in multithreaded programs. (The 'new' C++11 probably does since it now includes threads as part of the standard, but tradiationally threads have not been part of standard C or C++.) Using volatile and making assumptions about atomicity and cache-coherence in code that's meant to be portable is a problem. It's a crap-shoot as to whether a particular compiler and platform will treat accesses to 'volatile' objects in a thread-safe way.
The general rule is: 'volatile' is not enough to ensure thread safe access. You should use some platform-provided mechanism (usually some functions or synchronisation objects) to access thread-shared values safely.
Now, specifically on Windows, specifically with the VC++ 2005+ compiler, and specifically on x86 and x64 systems, accessing a primitive object (like an int) can be made thread-safe if:
On 64- and 32-bit Windows, the object has to be a 32-bit type, and it has to be 32-bit aligned.
On 64-bit Windows, the object may also be a 64-bit type, and it has to be 64-bit aligned.
It must be declared volatile.
If those are true, then accesses to the object will be volatile, atomic and be surrounded by instructions that ensure cache-coherency. The size and alignment conditions must be met so that the compiler makes code that performs atomic operations when accessing the object. Declaring the object volatile ensures that the compiler doesn't make code optimisations related to caching previous values it may have read into a register and ensures that code generated includes appropriate memory barrier instructions when it's accessed.
Even so, you're probably still better off using something like the Interlocked* functions for accessing small things, and bog standard synchronisation objects like Mutexes or CriticalSections for larger objects and data structures. Ideally, get libraries for and use data structures that already include appropriate locks. Let your libraries & OS do the hard work as much as possible!
In your example, I expect you do need to use a thread-safe access to update val64 whether the thread is started yet or not.
If the thread was already running, then you would definitely need some kind of thread-safe write to val64, either using InterchangeExchange64 or similar, or by acquiring and releasing some kind of synchronisation object which will perform appropriate memory barrier instructions. Similarly, the thread would need to use a thread-safe accessor to read it as well.
In the case where the thread hasn't been resumed yet, it's a bit less clear. It's possible that ResumeThread might use or act like a synchronisation function and do the memory barrier operations, but the documentation doesn't specify that it does, so it is better to assume that it doesn't.
References:
On atomicity of 32- and 64- bit aligned types... https://msdn.microsoft.com/en-us/library/windows/desktop/ms684122%28v=vs.85%29.aspx
On 'volatile' including memory fences... https://msdn.microsoft.com/en-us/library/windows/desktop/ms686355%28v=vs.85%29.aspx

How to understand acquire and release semantics?

I found out three function from MSDN , below:
1.InterlockedDecrement().
2.InterlockedDecrementAcquire().
3.InterlockedDecrementRelease().
I knew those fucntion use to decrement a value as an atomic operation, but i don't know distinction between the three function

(um... but don't ask me what does it mean exactly)
I'll take a stab at that.
Something to remember is that the compiler, or the CPU itself, may reorder memory reads and writes if they appear to not deal with each other.
This is useful, for instance, if you have some code that, maybe is updating a structure:
if ( playerMoved ) {
playerPos.X += dx;
playerPos.Y += dy;
// Keep the player above the world's surface.
if ( playerPos.Z + dz > 0 ) {
playerPos.Z += dz;
}
else {
playerPos.Z = 0;
}
}
Most of above statements may be reordered because there's no data dependency between them, in fact, a superscalar CPU may execute most of those statements simultaneously, or maybe would start working on the Z section sooner, since it doesn't affect X or Y, but might take longer.
Here's the problem with that - lets say that you're attempting lock-free programming. You want to perform a whole bunch of memory writes to, maybe, fill in a shared queue. You signal that you're done by finally writing to a flag.
Well, since that flag appears to have nothing to do with the rest of the work being done, the compiler and the CPU may reorder those instructions, and now you may set your 'done' flag before you've actually committed the rest of the structure to memory, and now your "lock-free" queue doesn't work.
This is where Acquire and Release ordering semantics come into play. I set that I'm doing work by setting a flag or so with an Acquire semantic, and the CPU guarantees that any memory games I play after that instruction stay actually below that instruction. I set that I'm done by setting a flag or so with a Release semantic, and the CPU guarantees that any memory games I had done just before the release actually stay before the release.
Normally, one would do this using explicit locks - mutexes, semaphores, etc, in which the CPU already knows it has to pay attention to memory ordering. The point of attempting to create 'lock free' data structures is to provide data structures that are thread safe (for some meaning of thread safe), that don't use explicit locks (because they are very slow).
Creating lock-free data structures is possible on a CPU or compiler that doesn't support acquire/release ordering semantics, but it usually means that some slower memory ordering semantic is used. For instance, you could issue a full memory barrier - everything that came before this instruction has to actually be committed before this instruction, and everything that came after this instruction has to be committed actually after this instruction. But that might mean that I wait for a bunch of actually irrelevant memory writes from earlier in the instruction stream (perhaps function call prologue) that has nothing to do with the memory safety I'm trying to implement.
Acquire says "only worry about stuff after me". Release says "only worry about stuff before me". Combining those both is a full memory barrier.

http://preshing.com/20120913/acquire-and-release-semantics/
Acquire semantics is a property which can only apply to operations
which read from shared memory, whether they are read-modify-write
operations or plain loads. The operation is then considered a
read-acquire. Acquire semantics prevent memory reordering of the
read-acquire with any read or write operation which follows it in
program order.
Release semantics is a property which can only apply to operations
which write to shared memory, whether they are read-modify-write
operations or plain stores. The operation is then considered a
write-release. Release semantics prevent memory reordering of the
write-release with any read or write operation which precedes it in
program order.
(um... but don't ask me what does it mean exactly)

How does CUDA handle multiple updates to memory address?

I have written a CUDA kernel in which each thread makes an update to a particular memory address (with int size). Some threads might want to update this address simultaneously.
How does CUDA handle this? Does the operation become atomic? Does this increase the latency of my application in any way? If so, how?

The operation does not become atomic, and it is essentially undefined behavior. When two or more threads write to the same location, one of the values will end up in the location, but there is no way to predict which one.
It can be especially problematic if you are reading and writing, such as to increment a variable.
CUDA provides a set of atomic operations to help.
You may also use other coding techniques such as parallel reductions, to help when there are multiple updates to the same location, such as finding a max or min value.
If you don't care about the order of updates, it should not be a performance issue for newer GPUs which automatically condense writes or reads to a single location in global memory or shared memory, but this is also not specified behavior.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio