1) Is the following declaration of a naturally aligned pointer:
alignas(sizeof(void *)) volatile void * p;
equivalent to
std::atomic<void *>
in C++11?
2) More precisely, is it correct to assume that a pointer of this type will work in the same way as std::atomic<void *> in C++11?
No, volatile does not guarantee that the location will be written or read atomically, just that the compiler can't optimise out multiple reads and writes.
On certain architectures the processor will read or write atomically if the location is aligned correctly, but that is not universal, nor even guaranteed across a family of processors. Where it can, the internal implementation of atomic will take advantage of architectural features and atomic instruction modifiers, so why not use atomic if you mean atomic?
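For comparison, here is a minimal sketch of what the question's pointer looks like as an actual atomic (the function names are illustrative):
#include <atomic>

std::atomic<void *> p{nullptr};

void publish(void *q) {
    // atomic store with release ordering, guaranteed on every architecture
    p.store(q, std::memory_order_release);
}

void *consume() {
    // atomic load with acquire ordering; pairs with the store above
    return p.load(std::memory_order_acquire);
}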
Related
I am using the C volatile keyword in combination with x86 memory ordering guarantees (writes are ordered with writes, and reads are ordered with reads) to implement a barrier-free message queue. Does gcc provide a builtin function that efficiently copies data from one volatile array to another?
I.e., is there a builtin or otherwise efficient function that we could call, as memcpy_volatile is used in the following example,
uint8_t volatile *dest = ...;
uint8_t volatile const *src = ...;
size_t len = ...;
memcpy_volatile(dest, src, len);
rather than writing a naive loop?
This question is NOT about the popularity of barrier-free C programs. I am perfectly aware of barrier-based alternatives. This question is, therefore, also NOT a duplicate of any question where the answer is "use barrier primitives".
This question is also NOT a duplicate of similar questions that are not specific to x86/gcc, where of course the answer is "there's no general mechanism that works on all platforms".
Additional Detail
memcpy_volatile is not expected to be atomic. The ordering of operations within memcpy_volatile does not matter. What matters is that if memcpy_volatile(dest, ...) is done before advertising the dest pointer to another thread (via another volatile variable) then the sequence (data write, pointer write) must appear in the same order to the other thread. So if the other thread sees the new pointer (dest) then it must also see the data that was copied to *dest. This is the essential requirement for barrier-free queue implementation.
memcpy_volatile is not expected to be atomic. ... What matters is that if memcpy_volatile(dest, ...) is done before advertising the dest pointer to another thread (via another volatile variable) then the sequence (data write, pointer write) must appear in the same order to the other thread. ...
OK, that makes the problem solvable: you're just "publishing" the memcpy stores via release/acquire synchronization.
The buffers don't need to be volatile, then, except as one way to ensure compile-time ordering before some other volatile store. Because volatile operations are only guaranteed ordered (at compile time) wrt. other volatile operations. Since it's not being concurrently accessed while you're storing, the possible gotchas in Who's afraid of a big bad optimizing compiler? aren't a factor.
To hack this into your hand-rolled atomics with volatile, use GNU C asm("" ::: "memory") as a compiler memory barrier to block compile-time reordering between the release-store and the memcpy.
uint8_t volatile *volatile shared_var;  // note: the pointer object itself must be volatile
memcpy((char*)dest, (const char*)src, len);
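// compiler barrier: blocks compile-time reordering between the memcpy and the pointer store below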
asm("" ::: "memory");
shared_var = dest; // release-store
But really you're just making it inconvenient for yourself by avoiding C11 stdatomic.h for atomic_store_explicit(&shared_var, dest, memory_order_release) or GNU C __atomic_store_n(&shared_var, dest, __ATOMIC_RELEASE), which are ordered wrt. non-atomic accesses like a memcpy. Using a memory_order other than the default seq_cst will let it compile with no overhead for x86, to the same asm you get from volatile.
The compiler knows x86's memory ordering rules, and will take advantage of them by not using any extra barriers except for seq_cst stores. (Atomic RMWs on x86 are always full barriers, but you can't do those using volatile.)
Avoid RMW operations like x++ if you don't actually need atomicity for the whole operation; on a volatile x, x++ is more like atomic_store_explicit(&x, 1+atomic_load_explicit(&x, memory_order_acquire), memory_order_release); which is a big pain to type, but often you'd want to load into a tmp variable anyway.
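As a minimal sketch of that pitfall (the variable names are illustrative), contrast a volatile increment with a real atomic RMW using the same GNU C __atomic builtins:
volatile int hits;   // hand-rolled volatile style
int hits2;           // plain variable, accessed only via __atomic builtins

void count_volatile(void) {
    hits++;  // compiles to a separate load and store; a concurrent increment can be lost
}

void count_atomic(void) {
    __atomic_fetch_add(&hits2, 1, __ATOMIC_RELAXED);  // single atomic RMW (lock add on x86)
}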
If you're willing to use GNU C features like asm("" ::: "memory"), you can use its __atomic built-ins instead, without even having to change your variable declarations like you would for stdatomic.h.
uint8_t volatile *volatile shared_var;
memcpy((char*)dest, (const char*)src, len);
// a release-store is ordered after all previous stuff in this thread
__atomic_store_n(&shared_var, dest, __ATOMIC_RELEASE);
As a bonus, doing it this way makes your code portable to non-x86 ISAs, e.g. AArch64 where it could compile the release-store to stlr. (And no separate barrier could be that efficient.)
The key point is that there's no down-side to the generated asm for x86.
As in When to use volatile with multi threading? - never. Use atomic with memory_order_relaxed, or with acquire / release to get C-level guarantees equivalent to x86 hardware memory-ordering.
The compilation flag -fmerge-all-constants merges identical constants into a single variable. I keep reading that this results in non-conforming code, and Linus Torvalds wrote that it's inexcusable, but why?
What can possibly happen when you merge two or more identical constant variables?
There are times when programs declare a constant object because they need something with a unique address, and there are times when any address that points to storage holding the proper sequence of byte values would be equally usable. In C, if one writes:
char const * const HelloShared1 = "Hello";
char const * const HelloShared2 = "Hello";
char const HelloUnique1[] = "Hello";
char const HelloUnique2[] = "Hello";
a compiler would have to reserve space for at least three copies of the word Hello, followed by a zero byte. The names HelloUnique1 and HelloUnique2 would refer to two of those copies, and the names HelloShared1 and HelloShared2 would need to identify storage that was distinct from that used by HelloUnique1 and HelloUnique2, but HelloShared1 and HelloShared2 could at the compiler's convenience identify the same storage.
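To see why merging the unique objects is non-conforming, consider this minimal sketch: distinct objects must have distinct addresses, so a strictly conforming program may rely on the comparison below being false, yet -fmerge-all-constants can make it true.
#include <stdio.h>

char const HelloUnique1[] = "Hello";
char const HelloUnique2[] = "Hello";

int main(void)
{
    // Must print 0 under the Standard; may print 1 with -fmerge-all-constants,
    // which can fold both arrays into one object.
    printf("%d\n", HelloUnique1 == HelloUnique2);
    return 0;
}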
Unfortunately, while the C and C++ Standards usefully provide two ways of specifying objects that hold string literal data, so as to allow programmers to indicate when multiple copies of the same information may be placed in the same storage, they fail to specify any means of requesting the same semantics for any other kind of constant data. For most kinds of applications, situations where a program would care about whether two objects share the same address are far less common than those where using the same storage for constant objects holding the same data would be advantageous.
Being able to invite an implementation to make optimizations which would not otherwise be allowable by the Standard is useful, if one recognizes that programs should not be expected to be compatible with all optimizations, nor vice versa, if compiler writers do a good job of documenting what kinds of programs different optimizations are compatible with, and if programmers enable only those optimizations that are known to be compatible with their code.
Fundamentally, optimizations that assume programs won't do X will be useful for applications that don't involve doing X, but at best counter-productive for those that do. The described optimizations fall into this category. I wouldn't see any basis for complaining about a compiler that makes such optimizations available but doesn't enable them by default. On the other hand, some people regard any program that isn't compatible with every imaginable optimization as "broken".
The malloc() and calloc() functions return a pointer to the allocated memory, which is suitably aligned for any built-in type.
Now, GCC provides built-in support for vector types through the syntax[1]
template<class T>
using vec4_t __attribute__((vector_size(4 * sizeof(T)))) = T;
And it is possible to do arithmetic operations on two vec4_t<float> values directly, or, with AVX, on two vec4_t<double>; yet malloc cannot be used for vec4_t<float> on 32-bit platforms, or for vec4_t<double> on 64-bit platforms, because malloc's guaranteed alignment there (typically 8 and 16 bytes respectively) is smaller than the 16 and 32 bytes these types require.
Now there is P0035R1, which sort of solves the problem of dynamic allocation for over-aligned types. But why is it not better to blame malloc instead of adding more variants to operator new?[2] Indeed, the compiler has built-in support for these types, so one would expect malloc to work properly.
[1] In C, I would have to specify the type for each typedef
[2] Page-aligned memory is another issue
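For reference, a minimal sketch of what P0035R1 enables (C++17 aligned operator new), alongside C11/C++17 aligned_alloc as the malloc-side fix; the concrete type and sizes are illustrative:
#include <cstdlib>

typedef double vec4d __attribute__((vector_size(32)));  // requires 32-byte alignment

int main()
{
    // C++17: new for an over-aligned type calls operator new(size_t, align_val_t)
    vec4d *a = new vec4d[8];
    delete[] a;

    // C11/C++17: the requested size must be a multiple of the alignment
    vec4d *b = static_cast<vec4d *>(std::aligned_alloc(alignof(vec4d), 8 * sizeof(vec4d)));
    std::free(b);
    return 0;
}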
Which is more efficient: passing an SSE vector by value or by reference?
typedef float v4sf __attribute__ ((vector_size(16)));
//Pass by reference
void doStuff(v4sf& foo);
//Pass by value
v4sf doStuff(v4sf foo);
On one hand, v4sf is large: 16 bytes.
On the other hand, we can deal with these values as if they were single-element data, and passing by reference may introduce a level of indirection.
Typically SIMD functions which take vector parameters are relatively small and performance-critical, which usually means they should be inlined. Once inlined it doesn't really matter whether you pass by value, pointer or reference, as the compiler will optimise away unnecessary copies or dereferences.
One further point: if you think you might ever need to port your code to Windows then you will almost certainly want to use references, as there are some inane ABI restrictions which limit how many vector parameters you can pass (by value), even when the function is inlined.
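To illustrate the inlining point, a minimal sketch using GCC's vector extensions (the function names are made up); once inlined, both versions typically compile to a single addps with no loads or stores for the parameters:
typedef float v4sf __attribute__((vector_size(16)));

// By-value and by-reference variants of the same operation
static inline v4sf addByValue(v4sf a, v4sf b)             { return a + b; }
static inline v4sf addByRef(const v4sf &a, const v4sf &b) { return a + b; }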
I am working on ARM optimizations using the NEON intrinsics, from C++ code. I understand and master most of the typing issues, but I am stuck on this one:
The instruction vzip_u8 returns a uint8x8x2_t value (in fact an array of two uint8x8_t). I want to assign the returned value to a plain uint16x8_t. I see no appropriate vreinterpretq intrinsic to achieve that, and simple casts are rejected.
Some definitions to answer clearly...
NEON has 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide).
The NEON unit can view the same register bank as:
sixteen 128-bit quadword registers, Q0-Q15
thirty-two 64-bit doubleword registers, D0-D31.
uint16x8_t is a type which requires 128-bit storage, thus it needs to live in a quadword register.
The ARM NEON intrinsics define a vector array data type in the ARM® C Language Extensions:
... for use in load and store operations, in
table-lookup operations, and as the result type of operations that return a pair of vectors.
vzip instruction
... interleaves the elements of two vectors.
vzip Dd, Dm
and has an intrinsic like
uint8x8x2_t vzip_u8 (uint8x8_t, uint8x8_t)
From these we can conclude that uint8x8x2_t is actually a list of two arbitrarily numbered doubleword registers, because the vzip instruction doesn't impose any requirement on the ordering of its input registers.
Now the answer is...
uint8x8x2_t can occupy two non-consecutive doubleword registers, while uint16x8_t is a data structure consisting of two consecutive doubleword registers, the first of which has an even index (D0-D31 -> Q0-Q15).
Because of this, you can't cast a vector array data type holding two doubleword registers to a quadword register... easily.
The compiler may be smart enough to assist you, or you can simply force the conversion; however, I would check the resulting assembly for correctness as well as performance.
You can construct a 128-bit vector from two 64-bit vectors using the vcombine_* intrinsics. Thus, you can achieve what you want like this:
#include <arm_neon.h>
uint8x16_t f(uint8x8_t a, uint8x8_t b)
{
    // vzip_u8 interleaves a and b into a pair of 64-bit vectors
    uint8x8x2_t tmp = vzip_u8(a, b);
    // vcombine_u8 concatenates the pair into one 128-bit vector
    uint8x16_t result = vcombine_u8(tmp.val[0], tmp.val[1]);
    return result;
}
I have found a workaround: given that the val member of the uint8x8x2_t type is an array, it can be treated as a pointer. Casting and dereferencing the pointer works! [Whereas taking the address of the temporary raises an "address of temporary" warning.]
uint16x8_t Value = *(uint16x8_t *)vzip_u8(arg0, arg1).val;
It turns out that this compiles and executes as it should (at least in the cases I have tried). I haven't looked at the assembly code, so I cannot guarantee that it is implemented properly (I mean keeping the value in a register instead of writing/reading to/from memory).
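If you want the uint16x8_t without the pointer cast, here is a minimal sketch combining the vcombine_u8 answer above with a vreinterpretq (the function name is illustrative):
#include <arm_neon.h>

uint16x8_t zip_as_u16(uint8x8_t a, uint8x8_t b)
{
    uint8x8x2_t z = vzip_u8(a, b);
    // vreinterpretq only relabels the 128-bit value; it emits no instructions
    return vreinterpretq_u16_u8(vcombine_u8(z.val[0], z.val[1]));
}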
I was facing the same kind of problem, so I introduced a flexible data type.
I can now define the following:
typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc.
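For the curious, a hypothetical sketch of what such a wrapper might look like (the real NeonVectorType is more elaborate): it holds a base vector and converts bytewise to and from any NEON type of the same width.
#include <arm_neon.h>
#include <cstring>

template <class Base>
struct NeonVectorSketch {  // hypothetical name; stands in for NeonVectorType
    Base data;
    template <class V>
    NeonVectorSketch(const V &v) {
        static_assert(sizeof(V) == sizeof(Base), "width mismatch");
        std::memcpy(&data, &v, sizeof data);  // bytewise conversion in
    }
    template <class V>
    operator V() const {
        static_assert(sizeof(V) == sizeof(Base), "width mismatch");
        V v;
        std::memcpy(&v, &data, sizeof v);     // bytewise conversion out
        return v;
    }
};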
It's a bug in GCC (now fixed) in the 4.5 and 4.6 series.
Bugzilla link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48252
Please take the fix from the bug report, apply it to the GCC source, and rebuild.