What is a `built-in type` - gcc

The malloc() and calloc() functions return a pointer to the allocated memory, which is suitably aligned for any built-in type.
Now, GCC provides built-in support for vector types through the syntax[1]
template<class T>
using vec4_t __attribute__ ((vector_size (4*sizeof(T))))=T;
And it is possible to do arithmetic on two vec4_t<float> directly, or on vec4_t<double> with AVX, yet malloc cannot be relied on: its guaranteed alignment is too small for the 16-byte vec4_t<float> on 32-bit platforms, and for the 32-byte vec4_t<double> on 64-bit platforms.
Now there is P0035R1, which sort of solves the problem of dynamic allocation for over-aligned types. But why not put the blame on malloc instead of adding more variants of operator new?[2] Indeed, the compiler has built-in support for these types, so one would expect malloc to work properly with them.
[1] In C, I would have to specify the type for each typedef
[2] Page-aligned memory is another issue
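Not from the question, but for context: a minimal sketch of the usual workaround in the absence of P0035R1 is to ask the allocator for the alignment explicitly instead of relying on malloc (posix_memalign here; C11 aligned_alloc would serve the same purpose). The vec4d typedef and alloc_vec4d helper below are illustrative only.

#include <cstdlib>   // posix_memalign, free
#include <new>       // std::bad_alloc

// A concrete 4-double vector: 32 bytes, over-aligned beyond what malloc guarantees.
typedef double vec4d __attribute__ ((vector_size (32)));

// Allocate n over-aligned vectors; plain malloc may return storage that is
// only 16-byte aligned, so request the alignment explicitly.
vec4d* alloc_vec4d(std::size_t n)
{
    void* p = nullptr;
    if (posix_memalign(&p, alignof(vec4d), n * sizeof(vec4d)) != 0)
        throw std::bad_alloc();
    return static_cast<vec4d*>(p);
}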

Related

memcpy for volatile arrays in gcc C on x86?

I am using the C volatile keyword in combination with x86 memory ordering guarantees (writes are ordered with writes, and reads are ordered with reads) to implement a barrier-free message queue. Does gcc provide a builtin function that efficiently copies data from one volatile array to another?
That is, is there a builtin or otherwise efficient function that we could call the way memcpy_volatile is used in the following example,
uint8_t volatile * dest = ...;
uint8_t volatile const* src = ...;
int len;
memcpy_volatile(dest, src, len);
rather than writing a naive loop?
This question is NOT about the popularity of barrier-free C programs. I am perfectly aware of barrier-based alternatives. This question is, therefore, also NOT a duplicate of any question where the answer is "use barrier primitives".
This question is also NOT a duplicate of similar questions that are not specific to x86/gcc, where of course the answer is "there's no general mechanism that works on all platforms".
Additional Detail
memcpy_volatile is not expected to be atomic. The ordering of operations within memcpy_volatile does not matter. What matters is that if memcpy_volatile(dest, ...) is done before advertising the dest pointer to another thread (via another volatile variable) then the sequence (data write, pointer write) must appear in the same order to the other thread. So if the other thread sees the new pointer (dest) then it must also see the data that was copied to *dest. This is the essential requirement for barrier-free queue implementation.
memcpy_volatile is not expected to be atomic. ... What matters is that if memcpy_volatile(dest, ...) is done before advertising the dest pointer to another thread (via another volatile variable) then the sequence (data write, pointer write) must appear in the same order to the other thread. ...
Ok, that makes the problem solvable, you're just "publishing" the memcpy stores via release/acquire synchronization.
The buffers don't need to be volatile, then, except as one way to ensure compile-time ordering before some other volatile store, because volatile operations are only guaranteed to be ordered (at compile time) with respect to other volatile operations. Since the buffer isn't being concurrently accessed while you're storing, the possible gotchas in Who's afraid of a big bad optimizing compiler? aren't a factor.
To hack this into your hand-rolled atomics with volatile, use GNU C asm("" ::: "memory") as a compiler memory barrier to block compile-time reordering between the release-store and the memcpy.
volatile uint8_t *shared_var;
memcpy((char*)dest, (const char*)src, len);
asm("" ::: "memory");
shared_var = dest; // release-store
But really you're just making it inconvenient for yourself by avoiding C11 stdatomic.h for atomic_store_explicit(&shared_var, dest, memory_order_release) or GNU C __atomic_store_n(&shared_var, dest, __ATOMIC_RELEASE), which are ordered wrt. non-atomic accesses like a memcpy. Using a memory_order other than the default seq_cst will let it compile with no overhead for x86, to the same asm you get from volatile.
The compiler knows x86's memory ordering rules, and will take advantage of them by not using any extra barriers except for seq_cst stores. (Atomic RMWs on x86 are always full barriers, but you can't do those using volatile.)
Avoid RMW operations like x++ if you don't actually need atomicity for the whole operation; volatile x++ is more like atomic_store_explicit(&x, 1+atomic_load_explicit(&x, memory_order_acquire), memory_order_release); which is a big pain to type, but often you'd want to load into a tmp variable anyway.
If you're willing to use GNU C features like asm("" ::: "memory"), you can use its __atomic built-ins instead, without even having to change your variable declarations like you would for stdatomic.h.
volatile uint8_t *shared_var;
memcpy((char*)dest, (const char*)src, len);
// a release-store is ordered after all previous stuff in this thread
__atomic_store_n(&shared_var, dest, __ATOMIC_RELEASE);
As a bonus, doing it this way makes your code portable to non-x86 ISAs, e.g. AArch64 where it could compile the release-store to stlr. (And no separate barrier could be that efficient.)
The key point is that there's no down-side to the generated asm for x86.
As in When to use volatile with multi threading? - never. Use atomic with memory_order_relaxed, or with acquire / release to get C-level guarantees equivalent to x86 hardware memory-ordering.
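For completeness, a sketch of the matching reader side under the same assumptions (the reader spins until the pointer has been published; shared_var is the variable from the snippets above):

#include <stdint.h>
#include <stddef.h>

extern volatile uint8_t *shared_var;   // the pointer the writer publishes

// Acquire-load the pointer; once it is non-null, everything the writer stored
// before its release-store (including the memcpy'd bytes) is visible here.
const volatile uint8_t *wait_for_message(void)
{
    const volatile uint8_t *p;
    do {
        p = __atomic_load_n(&shared_var, __ATOMIC_ACQUIRE);
    } while (p == NULL);
    return p;
}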

Natural alignment + volatile = atomic in C++11?

1) Is the following declaration of a naturally aligned pointer:
alignas(sizeof(void *)) volatile void * p;
equivalent to
std::atomic<void *>
in C++11?
2) Saying more exactly, is it correct to assume that this type of pointer will work in the same way as std::atomic in C++11?
No, volatile does not guarantee that the location will be written or read atomically, just that the compiler can't optimise away multiple reads and writes.
On certain architectures, the processor will atomically read or write if aligned correctly, but that is not universal or even guaranteed through a family of processors. Where it can, the internal implementation of atomic will take advantage of architectural features and atomic instruction modifiers, so why not use atomic, if you mean atomic?
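If atomicity is what is wanted, a minimal sketch of the std::atomic<void *> replacement (names are illustrative):

#include <atomic>

std::atomic<void *> p{nullptr};

void publish(void *obj)
{
    p.store(obj, std::memory_order_release);   // atomic store with ordering guarantees
}

void *consume()
{
    return p.load(std::memory_order_acquire);  // atomic load, pairs with the store
}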

Pass v4sf by value or reference

Which is more efficient: passing an SSE vector by value or by reference?
typedef float v4sf __attribute__ ((vector_size(16)));
//Pass by reference
void doStuff(v4sf& foo);
//Pass by value
v4sf doStuff(v4sf foo);
On one hand, v4sf is large: 16 bytes.
On the other hand, we can treat these values as if they were single-element data, and passing by reference may introduce one level of indirection.
Typically SIMD functions which take vector parameters are relatively small and performance-critical, which usually means they should be inlined. Once inlined it doesn't really matter whether you pass by value, pointer or reference, as the compiler will optimise away unnecessary copies or dereferences.
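To illustrate that point, a small sketch (the scale_add helper is hypothetical): once such a function is inlined with optimization enabled, the vector stays in a register whether the parameter is declared by value or by reference.

typedef float v4sf __attribute__ ((vector_size(16)));

// Small, hot helper: after inlining the compiler keeps foo in an XMM register,
// so by-value vs. by-reference makes no difference to the generated code.
static inline v4sf scale_add(v4sf foo, v4sf scale, v4sf bias)
{
    return foo * scale + bias;   // element-wise, via GCC vector extensions
}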
One further point: if you think you might ever need to port your code to Windows then you will almost certainly want to use references, as there are some inane ABI restrictions which limit how many vector parameters you can pass (by value), even when the function is inlined.

Data type compatibility with NEON intrinsics

I am working on ARM optimizations using the NEON intrinsics, from C++ code. I understand and master most of the typing issues, but I am stuck on this one:
The instruction vzip_u8 returns a uint8x8x2_t value (in fact an array of two uint8x8_t). I want to assign the returned value to a plain uint16x8_t. I see no appropriate vreinterpretq intrinsic to achieve that, and simple casts are rejected.
Some definitions to answer clearly...
NEON has 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide).
The NEON unit can view the same register bank as:
sixteen 128-bit quadword registers, Q0-Q15
thirty-two 64-bit doubleword registers, D0-D31.
uint16x8_t is a type which requires 128-bit storage, thus it needs to be in a quadword register.
The ARM NEON intrinsics have a definition called vector array data type in the ARM® C Language Extensions:
... for use in load and store operations, in
table-lookup operations, and as the result type of operations that return a pair of vectors.
vzip instruction
... interleaves the elements of two vectors.
vzip Dd, Dm
and has an intrinsic like
uint8x8x2_t vzip_u8 (uint8x8_t, uint8x8_t)
From these we can conclude that uint8x8x2_t is actually a list of two arbitrarily numbered doubleword registers, because the vzip instruction doesn't place any requirement on the order of its input registers.
Now the answer is...
uint8x8x2_t can hold two non-consecutive doubleword registers, while uint16x8_t is a data structure consisting of two consecutive doubleword registers, the first of which has an even index (D0-D31 -> Q0-Q15).
Because of this you can't cast a vector array data type holding two doubleword registers to a quadword register... easily.
The compiler may be smart enough to assist you, or you can just force the conversion; however, I would check the resulting assembly for correctness as well as performance.
You can construct a 128 bit vector from two 64 bit vectors using the vcombine_* intrinsics. Thus, you can achieve what you want like this.
#include <arm_neon.h>

uint8x16_t f(uint8x8_t a, uint8x8_t b)
{
    uint8x8x2_t tmp = vzip_u8(a, b);
    uint8x16_t result;
    result = vcombine_u8(tmp.val[0], tmp.val[1]);
    return result;
}
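If the goal is specifically a uint16x8_t, as in the question, the combined 128-bit vector can then be reinterpreted; a sketch building on the function above (vreinterpretq_u16_u8 only relabels the element type, it does not change the bits):

#include <arm_neon.h>

uint16x8_t zip_to_u16(uint8x8_t a, uint8x8_t b)
{
    uint8x8x2_t tmp = vzip_u8(a, b);
    // combine the two D registers into one Q register, then relabel the lanes
    return vreinterpretq_u16_u8(vcombine_u8(tmp.val[0], tmp.val[1]));
}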
I have found a workaround: given that the val member of the uint8x8x2_t type is an array, it is therefore seen as a pointer. Casting and dereferencing the pointer works! [Whereas taking the address of the data raises an "address of temporary" warning.]
uint16x8_t Value= *(uint16x8_t*)vzip_u8(arg0, arg1).val;
It turns out that this compiles and executes as it should (at least in the case I have tried). I haven't looked at the assembly code, so I cannot guarantee it is implemented efficiently (I mean keeping the value in a register instead of writing/reading to/from memory).
I was facing the same kind of problem, so I introduced a flexible data type.
I can now therefore define the following:
typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc.
It's a bug in GCC (now fixed) in the 4.5 and 4.6 series.
Bugzilla link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48252
Please take the fix from this bug, apply it to the GCC source, and rebuild.

How to use Gcc 4.6.0 libquadmath and __float128 on x86 and x86_64

I have a medium-size C99 program which uses the long double type (80 bit) for floating-point computation. I want to improve precision with the new GCC 4.6 extension __float128. As I understand it, it is software-emulated 128-bit precision math.
How should I convert my program from the classic 80-bit long double to 128-bit quad floats with software emulation of full precision?
What do I need to change? Compiler flags, sources?
My program reads full-precision values with strtod, does a lot of different operations on them (like + - * /, sin, cos, exp and others from <math.h>) and printfs them.
PS: despite the fact that __float128 is declared only for Fortran (REAL*16), libquadmath is written in C and uses __float128. I'm unsure whether GCC will convert operations on __float128 into runtime library calls, and I'm unsure how to migrate from long double to __float128 in my sources.
PPS: There is documentation on the GCC C-language mode: http://gcc.gnu.org/onlinedocs/gcc/Floating-Types.html
"GNU C compiler supports ... 128 bit (TFmode) floating types. Support for additional types includes the arithmetic operators: add, subtract, multiply, divide; unary arithmetic operators; relational operators; equality operators ... __float128 types are supported on i386, x86_64"
How should I convert my program from the classic 80-bit long double to 128-bit quad floats with software emulation of full precision? What do I need to change? Compiler flags, sources?
You need recent software: a GCC version with support for the __float128 type (4.6 and newer) and libquadmath (supported only on x86 and x86_64 targets; on IA64 and HPPA with newer GCC). You should add the linker flag -lquadmath (a "cannot find -lquadmath" error will show that you have no libquadmath installed).
Add the #include <quadmath.h> header to get the macro and function definitions.
You should modify all long double variable definitions to __float128.
Complex variables may be changed to __complex128 type (quadmath.h) or directly with typedef _Complex float __attribute__((mode(TC))) _Complex128;
All simple arithmetic operations are automatically handled by GCC (converted to calls of helper functions like __*tf3()).
If you use any macro like LDBL_*, replace them with FLT128_* (full list http://gcc.gnu.org/onlinedocs/libquadmath/Typedef-and-constants.html#Typedef-and-constants)
If you need some specific constants like pi (M_PI) or e (M_E) with quadruple precision, use predefined constants with q suffix (M_*q), like M_PIq and M_Eq (full list http://gcc.gnu.org/onlinedocs/libquadmath/Typedef-and-constants.html#Typedef-and-constants)
User-defined constants may be written with Q suffix, like 1.3000011111111Q
All math function calls should be replaced with *q versions, like sqrtq(), sinq() (full list http://gcc.gnu.org/onlinedocs/libquadmath/Math-Library-Routines.html#Math-Library-Routines)
Reading quad-float from string should be done with __float128 strtoflt128 (const char *s, char **sp) - http://gcc.gnu.org/onlinedocs/libquadmath/strtoflt128.html#strtoflt128 (Warning, in older libquadmaths there may be some bugs in strtoflt128, do a double check)
Printing the __float128 is done with the help of the quadmath_snprintf function. On Linux distributions with recent glibc the function will be automagically registered by libquadmath to handle the Q (possibly also q) length modifier of the a, A, e, E, f, F, g, G conversion specifiers in all printfs/sprintfs, just as L is used for long double. Example: printf("%Qe", 1.2Q), http://gcc.gnu.org/onlinedocs/libquadmath/quadmath_005fsnprintf.html#quadmath_005fsnprintf
You should also know that since 4.6, gfortran will use the __float128 type for DOUBLE PRECISION if the option -fdefault-real-8 was given and -fdefault-double-8 was not. This may be a problem, since the 128-bit type is much slower than the standard long double on many platforms due to software computation. (Thanks to the post by glennklockwood: http://glennklockwood.blogspot.com/2014/02/linux-perf-libquadmath-and-gfortrans.html)
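Putting the steps above together, a minimal sketch (illustrative values; compile and link with -lquadmath):

#include <stdio.h>
#include <quadmath.h>   /* link with -lquadmath */

int main(void)
{
    /* read a quad-precision value from a string */
    __float128 x = strtoflt128("1.3000011111111", NULL);

    /* arithmetic is handled by the compiler; math functions use the *q variants */
    __float128 y = sinq(x) + sqrtq(x) * M_PIq;

    /* print via quadmath_snprintf and the Q length modifier */
    char buf[128];
    quadmath_snprintf(buf, sizeof buf, "%.30Qe", y);
    printf("%s\n", buf);
    return 0;
}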

Resources