He is example of atomic_read implementation:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
Also, should we explicitly use memory barriers for atomic operations on arm?
He is example of atomic_read implementation:
A problematic one actually, which assumes that a cast is not a nop, which isn't guaranteed.
Also, should we explicitly use memory barriers for atomic operations
on arm?
Probably. It depends on what you are doing and what you are expecting.
Yes, the casting to volatile is to prevent the compiler from assuming the value of v cannot change. As for using memory barriers, the GCC builtins already allow you to specify the memory ordering you desire, no need to do it manually: https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gcc/_005f_005fatomic-Builtins.html#g_t_005f_005fatomic-Builtins
The default behavior on GCC is to use __ATOMIC_SEQ_CST which will emit the barriers necessary on Arm to make sure your atomics execute in the order you place them in the code. To optimize performance on Arm, you will want to consider using weaker semantics to allow the compiler to elide barriers and let the hardware execute faster. For more information on the types of memory barriers the Arm architecture has, see https://developer.arm.com/docs/den0024/latest/memory-ordering/barriers.
Related
I've been reading on compiler optimizations vs CPU optimizations, and volatile vs memory barriers.
One thing which isn't clear to me is that my current understanding is that CPU optimizations and compiler optimizations are orthogonal. I.e. can occur independently of each other.
However, the article volatile considered harmful makes the point that volatile should not be used. Linus's post makes similar claims. The main reasoning, IIUC, is that marking a variable as volatile disables all compiler optimizations when accessing that variable (i.e. even if they are not harmful), while still not providing protection against memory reorderings. Essentially, the main point is that it's not the data that should be handled with care, but rather a particular access pattern needs to be handled with care.
Now, the volatile considered harmful article gives the following example of a busy loop waiting for a flag:
while (my_variable != what_i_want) {}
and makes the point that the compiler can optimize the access to my_variable so that it only occurs once and not in a loop. The solution, so the article claims, is the following:
while (my_variable != what_i_want)
cpu_relax();
It is said that cpu_relax acts as a compiler barrier (earlier versions of the article said that it's a memory barrier).
I have several gaps here:
1) Is the implication that gcc has special knowledge of the cpu_relax call, and that it translates to a hint to both the compiler and the CPU?
2) Is the same true for other instructions such as smb_mb() and the likes?
3) How does that work, given that cpu_relax is essentially defined as a C macro? If I manually expand cpu_relax will gcc still respect it as a compiler barrier? How can I know which calls are respected by gcc?
4) What is the scope of cpu_relax as far as gcc is concerned? In other words, what's the scope of reads that cannot be optimized by gcc when it sees the cpu_relax instruction? From the CPU's perspective, the scope is wide (memory barriers place a mark in the read or write buffer). I would guess gcc uses a smaller scope - perhaps the C scope?
Yes, gcc has special knowledge of the semantics of cpu_relax or whatever it expands to, and must translate it to something for which the hardware will respect the semantics too.
Yes, any kind of memory fencing primitive needs special respect by the compiler and hardware.
Look at what the macro expands to, e.g. compile with "gcc -E" and examine the output. You'll have to read the compiler documentation to find out the semantics of the primitives.
The scope of a memory fence is as wide as the scope the compiler might move a load or store across. A non-optimizing compiler that never moves loads or stores across a subroutine call might not need to pay much attention to a memory fence that is represented as a subroutine call. An optimizing compiler that does interprocedural optimization across translation units would need to track a memory fence across a much bigger scope.
There are a number subtle questions related to cpu and smp concurrency in your questions which will require you to look at the kernel code. Here are some quick ideas to get you started on the research specifically for the x86 architecture.
The idea is that you are trying to perform a concurrency operation where your kernel task (see kernel source sched.h for struct task_struct) is in a tight loop comparing my_variable with a local variable until it is changed by another kernel task (or change asynchronously by a hardware device!) This is a common pattern in the kernel.
The kernel has been ported to a number of architectures and each has a specific set of machine instructions to handle concurrency. For x86, cpu_relax maps to the PAUSE machine instruction. It allows an x86 CPU to more efficiently run a spinlock so that the lock variable update is more readily visible by the spinning CPU. GCC will execute the function/macro just like any other function. If cpu_relax is removed from the loop then gcc CAN consider the loop as non-functional and remove it. Look at the Intel X86 Software Manuals for the PAUSE instruction.
smp_mb is an x86 memory fence instruction that flushes the memory cache. One CPU can change my_variable in its cache but it will not be visible to other CPUs. smp_mb provides on-demand cache coherency. Look at the Intel X86 Software Manuals for MFENCE/LFENCE instructions.
Note that smp_mb() flushes the CPU cache so it CAN be an expensive operation. Current Intel CPUs have huge caches (~6MB).
If you expand cpu_relax on an x86, it will show asm volatile("rep; nop" ::: "memory"). This is NOT a compiler barrier but code that GCC will not optimize out. See the barrier macro, which is asm volatile("": : : "memory") for the GCC hint.
I'm not clear what you mean by "scope of cpu_relax". Some possible ideas: It's the PAUSE machine instruction, similar to ADD or MOV. PAUSE will affect only the current CPU. PAUSE allows for more efficient cache coherency between CPUs.
I just looked at the PAUSE instruction a little more - an additional property is it prevents the CPU from doing out-of-order memory speculation when leaving a tight loop/spinlock. I'm not clear what THAT means but I suppose it could briefly indicate a false value in a variable? Still a lot of questions....
The Eigen web site says:
Explicit vectorization is performed for SSE 2/3/4, AVX, FMA, AVX512, ARM NEON (32-bit and 64-bit), PowerPC AltiVec/VSX (32-bit and 64-bit) instruction sets, and now S390x SIMD (ZVector) with graceful fallback to non-vectorized code.
Does that mean if you compile, e.g. with FMA, and the CPU you are running on does not support it it will fall back to completely unvectorised code? Or will it fall back to the best vectorisation available?
If not, is there any way to have Eigen compile for all or several SIMD ISAs and automatically pick the best at runtime?
Edit: To be clear I'm talking about run-time fallbacks.
There is absolutely no run-time dispatching in Eigen. Everything that happens happens at compile-time. This is where you need to make all of the choices, either via preprocessor macros that control the library's behavior, or using optimization settings in your C++ compiler.
In order to implement run-time dispatching, you would either need to check and see what CPU features were supported on each and every call into the library and branch into the applicable implementation, or you would need to do this check once at launch and set up a table of function pointers (or some other method to facilitate dynamic dispatching). Eigen can't do the latter because it is a header-only library (just a collection of types and functions), with no "main" function that gets called upon initialization where all of this setup code could be localized. So the only option would be the former, which would result in a significant performance penalty. The whole point of this library is speed; introducing this type of performance penalty to each of the library's functions would be a disaster.
The documentation also contains a breakdown of how Eigen works, using a simple example. This page says:
The goal of this page is to understand how Eigen compiles it, assuming that SSE2 vectorization is enabled (GCC option -msse2).
which lends further credence to the claim that static, compile-time options determine how the library will work.
Whichever instruction set you choose to target, the generated code will have those instructions in it. If you try to execute that code on a processor that does not support these instructions (for example, you compile Eigen with AVX optimizations enabled, but you run it on an Intel Nehalem processor that doesn't support the AVX instruction set), then you will get an invalid instruction exception (presented to your program in whatever form the operating system passes CPU exceptions through). This will happen as soon as your CPU encounters an unrecognized/unsupported instruction, which will probably be deep in the bowels of one of the Eigen functions (i.e., not immediately upon startup).
However, as I said in the comments, there is a fallback mechanism of sorts, but it is a static one, all done at compile time. As the documentation indicates, Eigen supports multiple vector instruction sets. If you choose a lowest-common-denominator instruction set, like SSE2, you will still get some degree of vectorization. For example, although SSE3 may provide a specialized instruction for some particular task, if you can't target SSE3, all hope is not lost. There is, in many places, code that uses a series of SSE2 instructions to accomplish the same task. These are probably not as efficient as if SSE3 were available, but they'll still be faster than having no vector code whatsoever. You can see examples of this by digging into the source code, specifically the arch folder that contains the tidbits that have been specifically optimized for the different instruction sets (often by the use of intrinsics). In other words, you get the best code it can give you for your target architecture/instruction set.
I have 4096 bytes of shared memory allocated. How to treat it as an array of std::atomic<uint64_t> objects?
The final goal is to place array of 64-bit variables to shared memory and perform __sync_fetch_and_add (GCC built-in) on these variables. But I would prefer using native C++11 code instead of using GCC built-ins. So how to use the allocated memory as an std::atomic objects? Should I invoke placement new on 512 counters? What If std::atomic's constructor require additional memory allocations in some imeplementations? Should I consider the aligning of std::atomic object in shared memory?
With C++20, you can use std::atomic_ref if you can make sure that your objects are suitably aligned. For C++17 and anything older, boost::atomic_ref might work as well, though I haven't tested it.
If you don't want to use Boost, then compiler builtins are the only solution left. In that case, you should prefer the __atomic builtins over the old __sync functions, as stated on the GCC documentation page for atomics.
It's my understanding of atomicity that it's used to make sure a value will be read/written in whole rather than in parts. For example, a 64-bit value that is really two 32-bit DWORDs (assume x86 here) must be atomic when shared between threads so that both DWORDs are read/written at the same time. That way one thread can't read half variable that's not updated. How do you guarantee atomicity?
Furthermore it's my understanding that volatility does not guarantee thread safety at all. Is that true?
I've seen it implied many places that simply being atomic/volatile is thread-safe. I don't see how that is. Won't I need a memory barrier as well to ensure that any values, atomic or otherwise, are read/written before they can actually be guaranteed to be read/written in the other thread?
So for example let's say I create a thread suspended, do some calculations to change some values to a struct available to the thread and then resume, for example:
HANDLE hThread = CreateThread(NULL, 0, thread_entry, (void *)&data, CREATE_SUSPENDED, NULL);
data->val64 = SomeCalculation();
ResumeThread(hThread);
I suppose this would depend on any memory barriers in ResumeThread? Should I do an interlocked exchange for val64? What if the thread were running, how does that change things?
I'm sure I'm asking a lot here but basically what I'm trying to figure out is what I asked in the title: a good explanation for atomicity, volatility and thread safety in Windows. Thanks
it's used to make sure a value will be read/written in whole
That's just a small part of atomicity. At its core it means "uninterruptible", an instruction on a processor whose side-effects cannot be interleaved with another instruction. By design, a memory update is atomic when it can be executed with a single memory-bus cycle. Which requires the address of the memory location to be aligned so that a single cycle can update it. An unaligned access requires extra work, part of the bytes written by one cycle and part by another. Now it is not uninterruptible anymore.
Getting aligned updates is pretty easy, it is a guarantee provided by the compiler. Or, more broadly, by the memory model implemented by the compiler. Which simply chooses memory addresses that are aligned, sometimes intentionally leaving unused gaps of a few bytes to get the next variable aligned. An update to a variable that's larger than the native word size of the processor can never be atomic.
But much more important are the kind of processor instructions you need to make threading work. Every processor implements a variant of the CAS instruction, compare-and-swap. It is the core atomic instruction you need to implement synchronization. Higher level synchronization primitives, like monitors (aka condition variables), mutexes, signals, critical sections and semaphores are all built on top of that core instruction.
That's the minimum, a processor usually provide extra ones to make simple operations atomic. Like incrementing a variable, at its core an interruptible operation since it requires a read-modify-write operation. Having a need for it be atomic is very common, most any C++ program relies on it for example to implement reference counting.
volatility does not guarantee thread safety at all
It doesn't. It is an attribute that dates from much easier times, back when machines only had a single processor core. It only affects code generation, in particular the way a code optimizer tries to eliminate memory accesses and use a copy of the value in a processor register instead. Makes a big, big difference to code execution speed, reading a value from a register is easily 3 times faster than having to read it from memory.
Applying volatile ensures that the code optimizer does not consider the value in the register to be accurate and forces it to read memory again. It truly only matters on the kind of memory values that are not stable by themselves, devices that expose their registers through memory-mapped I/O. It has been abused heavily since that core meaning to try to put semantics on top of processors with a weak memory model, Itanium being the most egregious example. What you get with volatile today is strongly dependent on the specific compiler and runtime you use. Never use it for thread-safety, always use a synchronization primitive instead.
simply being atomic/volatile is thread-safe
Programming would be much simpler if that was true. Atomic operations only cover the very simple operations, a real program often needs to keep an entire object thread-safe. Having all its members updated atomically and never expose a view of the object that is partially updated. Something as simple as iterating a list is a core example, you can't have another thread modifying the list while you are looking at its elements. That's when you need to reach for the higher-level synchronization primitives, the kind that can block code until it is safe to proceed.
Real programs often suffer from this synchronization need and exhibit Amdahls' law behavior. In other words, adding an extra thread does not actually make the program faster. Sometimes actually making it slower. Whomever finds a better mouse-trap for this is guaranteed a Nobel, we're still waiting.
In general, C and C++ don't give any guarantees about how reading or writing a 'volatile' object behaves in multithreaded programs. (The 'new' C++11 probably does since it now includes threads as part of the standard, but tradiationally threads have not been part of standard C or C++.) Using volatile and making assumptions about atomicity and cache-coherence in code that's meant to be portable is a problem. It's a crap-shoot as to whether a particular compiler and platform will treat accesses to 'volatile' objects in a thread-safe way.
The general rule is: 'volatile' is not enough to ensure thread safe access. You should use some platform-provided mechanism (usually some functions or synchronisation objects) to access thread-shared values safely.
Now, specifically on Windows, specifically with the VC++ 2005+ compiler, and specifically on x86 and x64 systems, accessing a primitive object (like an int) can be made thread-safe if:
On 64- and 32-bit Windows, the object has to be a 32-bit type, and it has to be 32-bit aligned.
On 64-bit Windows, the object may also be a 64-bit type, and it has to be 64-bit aligned.
It must be declared volatile.
If those are true, then accesses to the object will be volatile, atomic and be surrounded by instructions that ensure cache-coherency. The size and alignment conditions must be met so that the compiler makes code that performs atomic operations when accessing the object. Declaring the object volatile ensures that the compiler doesn't make code optimisations related to caching previous values it may have read into a register and ensures that code generated includes appropriate memory barrier instructions when it's accessed.
Even so, you're probably still better off using something like the Interlocked* functions for accessing small things, and bog standard synchronisation objects like Mutexes or CriticalSections for larger objects and data structures. Ideally, get libraries for and use data structures that already include appropriate locks. Let your libraries & OS do the hard work as much as possible!
In your example, I expect you do need to use a thread-safe access to update val64 whether the thread is started yet or not.
If the thread was already running, then you would definitely need some kind of thread-safe write to val64, either using InterchangeExchange64 or similar, or by acquiring and releasing some kind of synchronisation object which will perform appropriate memory barrier instructions. Similarly, the thread would need to use a thread-safe accessor to read it as well.
In the case where the thread hasn't been resumed yet, it's a bit less clear. It's possible that ResumeThread might use or act like a synchronisation function and do the memory barrier operations, but the documentation doesn't specify that it does, so it is better to assume that it doesn't.
References:
On atomicity of 32- and 64- bit aligned types... https://msdn.microsoft.com/en-us/library/windows/desktop/ms684122%28v=vs.85%29.aspx
On 'volatile' including memory fences... https://msdn.microsoft.com/en-us/library/windows/desktop/ms686355%28v=vs.85%29.aspx
As far as I have understood, mfence is a hardware memory barrier while asm volatile ("" : : : "memory") is a compiler barrier. But,can asm volatile ("" : : : "memory") be used in place of mfence.
The reason I have got confused is this link
Well, a memory barrier is only needed on architectures that have weak memory ordering. x86 and x64 don't have weak memory ordering. on x86/x64 all stores have a release fence and all loads have an acquire fence. so, you should only really need asm volatile ("" : : : "memory")
For a good overview of both Intel and AMD as well as references to the relavent manufacturer specs, see http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
Generally things like "volatile" are used on a per-field basis where loads and stores to that field are natively atomic. Where loads and stores to a field are already atomic (i.e. the "operation" in question is a load or a store to a single field and thus the entire operation is atomic) the volatile field modifier or memory barriers are not needed on x86/x64. Portable code notwithstanding.
When it comes to "operations" that are not atomic--e.g. loads or stores to a field that is larger than a native word or loads or stores to multiple fields within an "operation"--a means by which the operation can be viewed as atomic are required regardless of CPU architecture. generally this is done by means of a synchronization primitive like a mutex. Mutexes (the ones I've used) include memory barriers to avoid issues like processor reordering so you don't have to add extra memory barrier instructions. I generally consider not using synchronization primitives a premature optimization; but, the nature of premature optimization is, of course, 97% of the time :)
Where you don't use a synchronization primitive and you're dealing with a multi-field invariant, memory barriers that ensure the processor does not reorder stores and loads to different memory locations is important.
Now, in terms of not issuing an "mfence" instruction in asm volatile but using "memory" in the clobber list. From what I've been able to read
If your assembler instructions access memory in an unpredictable fashion, add `memory' to the list of clobbered registers. This will cause GCC to not keep memory values cached in registers across the assembler instruction and not optimize stores or loads to that memory.
When they say "GCC" and don't mention anything about the CPU, this means it applies to only the compiler. The lack of "mfence" means there is no CPU memory barrier. You can verify this by disassembling the resulting binary. If no "mfence" instruction is issued (depending on the target platform) then it's clear the CPU is not being told to issue a memory fence.
Depending on the platform you're on and what you're trying to do, there maybe something "better" or more clear... portability not withstanding.
asm volatile ("" ::: "memory") is just a compiler barrier.
asm volatile ("mfence" ::: "memory") is both a compiler barrier and MFENCE
__sync_synchronize() is also a compiler barrier and a full memory barrier.
so asm volatile ("" ::: "memory") will not prevent CPU reordering independent data instructions per se. As pointed out x86-64 has a strong memory model, but StoreLoad reordering is still possible. If a full memory barrier is needed for your algorithm to work then you neeed __sync_synchronize
There are two reorderings, one is compiler reordering, the other one is CPU reordering.
x86/x64 has a relatively strong memory model, but on x86/x64 StoreLoad reordering (later loads passing earlier stores) CAN happen.
see http://en.wikipedia.org/wiki/Memory_ordering
asm volatile ("" ::: "memory") is just a compiler barrier.
asm volatile ("mfence" ::: "memory") is both a compiler barrier and CPU barrier.
that means, only use a compiler barrier, you can only prevent compiler reordering, but you can not prevent CPU reordering. that means there is no reordering when compiling source code, but reordering can happen in runtime.
So, it depends your needs, which one to use.