memcpy for volatile arrays in gcc C on x86? - gcc

I am using the C volatile keyword in combination with x86 memory ordering guarantees (writes are ordered with writes, and reads are ordered with reads) to implement a barrier-free message queue. Does gcc provide a builtin function that efficiently copies data from one volatile array to another?
I.e. is there a builtin/efficient function that we could call as memcpy_volatile is used in the following example?
uint8_t volatile *dest = ...;
uint8_t volatile const *src = ...;
size_t len = ...;
memcpy_volatile(dest, src, len);
rather than writing a naive loop?
This question is NOT about the popularity of barrier-free C programs. I am perfectly aware of barrier-based alternatives. This question is, therefore, also NOT a duplicate of any question where the answer is "use barrier primitives".
This question is also NOT a duplicate of similar questions that are not specific to x86/gcc, where of course the answer is "there's no general mechanism that works on all platforms".
Additional Detail
memcpy_volatile is not expected to be atomic. The ordering of operations within memcpy_volatile does not matter. What matters is that if memcpy_volatile(dest, ...) is done before advertising the dest pointer to another thread (via another volatile variable) then the sequence (data write, pointer write) must appear in the same order to the other thread. So if the other thread sees the new pointer (dest) then it must also see the data that was copied to *dest. This is the essential requirement for barrier-free queue implementation.

memcpy_volatile is not expected to be atomic. ... What matters is that if memcpy_volatile(dest, ...) is done before advertising the dest pointer to another thread (via another volatile variable) then the sequence (data write, pointer write) must appear in the same order to the other thread. ...
OK, that makes the problem solvable: you're just "publishing" the memcpy stores via release/acquire synchronization.
The buffers don't need to be volatile, then, except as one way to ensure compile-time ordering before some other volatile store, because volatile operations are only guaranteed ordered (at compile time) wrt. other volatile operations. Since the buffer isn't being concurrently accessed while you're storing, the possible gotchas in Who's afraid of a big bad optimizing compiler? aren't a factor.
To hack this into your hand-rolled atomics with volatile, use GNU C asm("" ::: "memory") as a compiler memory barrier to block compile-time reordering between the release-store and the memcpy.
volatile uint8_t *shared_var;

memcpy((char*)dest, (const char*)src, len);  // casts drop volatile: nobody else reads the buffer yet
asm("" ::: "memory");                        // compiler barrier: no compile-time reordering across this
shared_var = dest;                           // release-store: publish the pointer
But really you're just making it inconvenient for yourself by avoiding C11 stdatomic.h for atomic_store_explicit(&shared_var, dest, memory_order_release) or GNU C __atomic_store_n(&shared_var, dest, __ATOMIC_RELEASE), which are ordered wrt. non-atomic accesses like a memcpy. Using a memory_order other than the default seq_cst will let it compile with no overhead for x86, to the same asm you get from volatile.
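For concreteness, a minimal C11 sketch of that publish pattern (the publish function and variable names are illustrative, not from the question):
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

static uint8_t *_Atomic shared_var;

void publish(uint8_t *dest, const uint8_t *src, size_t len) {
    memcpy(dest, src, len);                       // plain, non-atomic stores
    atomic_store_explicit(&shared_var, dest,
                          memory_order_release);  // ordered after the memcpy
}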
The compiler knows x86's memory ordering rules, and will take advantage of them by not using any extra barriers except for seq_cst stores. (Atomic RMWs on x86 are always full barriers, but you can't do those using volatile.)
Avoid RMW operations like x++ if you don't actually need atomicity for the whole operation; volatile x++ is more like atomic_store_explicit(&x, 1+atomic_load_explicit(&x, memory_order_acquire), memory_order_release); which is a big pain to type, but often you'd want to load into a tmp variable anyway.
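A minimal sketch of that load-into-a-temporary pattern with C11 atomics (function name illustrative; this is not an atomic increment, since another thread could store between the load and the store):
#include <stdatomic.h>

void bump(atomic_int *x) {
    int tmp = atomic_load_explicit(x, memory_order_acquire);  // separately-atomic load
    atomic_store_explicit(x, tmp + 1, memory_order_release);  // separately-atomic store
}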
If you're willing to use GNU C features like asm("" ::: "memory"), you can use its __atomic built-ins instead, without even having to change your variable declarations like you would for stdatomic.h.
volatile uint8_t *shared_var;
memcpy((char*)dest, (const char*)src, len);
// a release-store is ordered after all previous stuff in this thread
__atomic_store_n(&shared_var, dest, __ATOMIC_RELEASE);
As a bonus, doing it this way makes your code portable to non-x86 ISAs, e.g. AArch64 where it could compile the release-store to stlr. (And no separate barrier could be that efficient.)
The key point is that there's no down-side to the generated asm for x86.
As in When to use volatile with multi threading? - never. Use atomic with memory_order_relaxed, or with acquire/release to get C-level guarantees equivalent to x86 hardware memory ordering.

Related

What may be the problem with Clang's -fmerge-all-constants?

The compilation flag -fmerge-all-constants merges identical constants into a single variable. I keep reading that this results in non-conforming code, and Linus Torvalds wrote that it's inexcusable, but why?
What can possibly happen when you merge two or more identical constant variables?
There are times when programs declare a constant object because they need something with a unique address, and there are times when any address that points to storage holding the proper sequence of byte values would be equally usable. In C, if one writes:
char const * const HelloShared1 = "Hello";
char const * const HelloShared2 = "Hello";
char const HelloUnique1[] = "Hello";
char const HelloUnique2[] = "Hello";
a compiler would have to reserve space for at least three copies of the word Hello, followed by a zero byte. The names HelloUnique1 and HelloUnique2 would refer to two of those copies, and the names HelloShared1 and HelloShared2 would need to identify storage that was distinct from that used by HelloUnique1 and HelloUnique2, but HelloShared1 and HelloShared2 could at the compiler's convenience identify the same storage.
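To make the hazard concrete, here is a minimal sketch (illustrative, not from the original question) of why merging the array objects would be non-conforming:
#include <stdio.h>

static const char HelloUnique1[] = "Hello";
static const char HelloUnique2[] = "Hello";

int main(void) {
    /* The Standard guarantees distinct objects have distinct addresses,
       so this must print "distinct". Under -fmerge-all-constants the two
       arrays may share storage, and the comparison can become true. */
    if (HelloUnique1 == HelloUnique2)
        puts("merged (non-conforming)");
    else
        puts("distinct (as the Standard requires)");
    return 0;
}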
Unfortunately, while the C and C++ Standards usefully provide two ways of specifying objects that hold string-literal data, so as to allow programmers to indicate when multiple copies of the same information may be placed in the same storage, they fail to provide any means of requesting the same semantics for any other kind of constant data. For most kinds of applications, situations where a program would care about whether two objects share the same address would be far less common than those where using the same storage for constant objects holding the same data would be advantageous.
Being able to invite an implementation to make optimizations which would not be allowable by the Standard is useful, if one recognizes that programs should not be expected to be compatible with all optimizations, nor vice versa, and if compiler writers do a good job of documenting what kinds of programs each optimization is compatible with, so that programmers can enable only the optimizations that are known to be compatible with their code.
Fundamentally, optimizations that assume programs won't do X will be useful for applications that don't involve doing X, but at best counter-productive for those that do. The described optimizations fall into this category. I wouldn't see any basis for complaining about a compiler that makes such optimizations available but doesn't enable them by default. On the other hand, some people regard any program that isn't compatible with every imaginable optimization as "broken".

Natural alignment + volatile = atomic in C++11?

1) Is the following declaration of a naturally aligned pointer:
alignas(sizeof(void *)) volatile void * p;
equivalent to
std::atomic<void *>
in C++11?
2) Saying more exactly, is it correct to assume that this type of pointer will work in the same way as std::atomic in C++11?
No, volatile does not guarantee that the location will be written or read atomically, just that the compiler can't optimise out multiple reads and writes.
On certain architectures, the processor will atomically read or write if aligned correctly, but that is not universal or even guaranteed through a family of processors. Where it can, the internal implementation of atomic will take advantage of architectural features and atomic instruction modifiers, so why not use atomic, if you mean atomic?
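A minimal sketch of the suggested alternative (function names illustrative): std::atomic<void *> gives the atomicity and ordering guarantees that the volatile declaration does not.
#include <atomic>

std::atomic<void *> p{nullptr};

void publish(void *data) {
    p.store(data, std::memory_order_release);  // atomic store, ordered
}

void *consume() {
    return p.load(std::memory_order_acquire);  // atomic load, ordered
}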

Pass v4sf by value or reference

Which is more efficient: passing an SSE vector by value or by reference?
typedef float v4sf __attribute__ ((vector_size(16)));
//Pass by reference
void doStuff(v4sf& foo);
//Pass by value
v4sf doStuff(v4sf foo);
On one hand, v4sf is large: 16 bytes.
On the other hand, we can treat these values like single-element data, and passing by reference may introduce a level of indirection.
Typically SIMD functions which take vector parameters are relatively small and performance-critical, which usually means they should be inlined. Once inlined it doesn't really matter whether you pass by value, pointer or reference, as the compiler will optimise away unnecessary copies or dereferences.
One further point: if you think you might ever need to port your code to Windows then you will almost certainly want to use references, as there are some inane ABI restrictions which limit how many vector parameters you can pass (by value), even when the function is inlined.
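A minimal sketch of that portable style (function name illustrative): pass by const reference and rely on inlining.
typedef float v4sf __attribute__ ((vector_size(16)));

// Passing by const reference sidesteps the Windows x64 limits on by-value
// vector parameters; once the function is inlined, the indirection
// disappears entirely.
static inline v4sf add(const v4sf &a, const v4sf &b) {
    return a + b;
}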

How to force gcc to use all SSE (or AVX) registers?

I'm trying to write some computationally intensive code for Windows x64 target, with SSE or the new AVX instructions, compiling in GCC 4.5.2 and 4.6.1, MinGW64 (TDM GCC build, and some custom build). My compiler options are -O3 -mavx. (-m64 is implied)
In short, I want to perform some lengthy computation on 4 3D vectors of packed floats. That requires 4x3=12 xmm or ymm registers for storage, and 2 or 3 registers for temporary results. This should IMHO fit snugly in the 16 SSE (or AVX) registers available on 64-bit targets. However, GCC produces very suboptimal code with register spilling, using only registers xmm0-xmm10 and shuffling data from and onto the stack. My question is:
Is there a way to convince GCC to use all the registers xmm0-xmm15?
To fix ideas, consider the following SSE code (for illustration only):
void example(vect<__m128> q1, vect<__m128> q2, vect<__m128>& a1, vect<__m128>& a2) {
    for (int i = 0; i < 10; i++) {
        vect<__m128> v = q2 - q1;
        a1 += v;
        // a2 -= v;
        q2 *= _mm_set1_ps(2.);
    }
}
Here vect<__m128> is simply a struct of 3 __m128, with natural addition and multiplication by scalar. When the line a2 -= v is commented out, i.e. we need only 3x3 registers for storage since we are ignoring a2, the produced code is indeed straightforward with no moves; everything is performed in registers xmm0-xmm10. When I uncomment a2 -= v, the code is pretty awful, with a lot of shuffling between registers and the stack, even though the compiler could just use registers xmm11-xmm13 or so.
I actually haven't seen GCC use any of the registers xmm11-xmm15 anywhere in my code yet. What am I doing wrong? I understand that they are callee-saved registers, but the save/restore overhead would be completely justified by the simpler loop code.
Two points:
First, you're making a lot of assumptions. Register spilling is pretty cheap on x86 CPUs (due to fast L1 caches and register shadowing and other tricks), and the registers only available in 64-bit mode are more costly to access (in terms of larger instruction encodings), so it may just be that GCC's version is as fast as, or faster than, the one you want.
Second, GCC, like any compiler, does the best register allocation it can. There's no "please do better register allocation" option, because if there were, it'd always be enabled. The compiler isn't trying to spite you. (Register allocation is an NP-complete problem, as I recall, so the compiler will never be able to guarantee a perfect solution; the best it can do is approximate.)
So, if you want better register allocation, you basically have two options:
write a better register allocator, and patch it into GCC, or
bypass GCC and rewrite the function in assembly, so you can control exactly which registers are used when.
Actually, what you see aren't spills; it is gcc operating on a1 and a2 in memory because it can't know whether they are aliased. If you declare the last two parameters as vect<__m128>& __restrict__, GCC can and will register-allocate a1 and a2, as sketched below.
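A minimal sketch of that fix applied to the question's example (assuming the same vect<__m128> struct of three __m128 that the question describes):
#include <xmmintrin.h>

// __restrict__ promises GCC that a1 and a2 alias nothing else here, so it
// can keep them in xmm registers for the whole loop instead of reloading
// from memory on every iteration.
void example(vect<__m128> q1, vect<__m128> q2,
             vect<__m128>& __restrict__ a1, vect<__m128>& __restrict__ a2) {
    for (int i = 0; i < 10; i++) {
        vect<__m128> v = q2 - q1;
        a1 += v;
        a2 -= v;
        q2 *= _mm_set1_ps(2.f);
    }
}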

Does it change performance to use a non-int counter in a loop?

I'm just curious and can't find the answer anywhere. Usually, we use an integer for a counter in a loop, e.g. in C/C++:
for (int i=0; i<100; ++i)
But we can also use a short integer or even a char. My question is: does it change the performance? It's a few bytes less, so the memory savings are negligible. It just intrigues me whether I do any harm by using a char if I know that the counter won't exceed 100.
Probably using the "natural" integer size for the platform will provide the best performance. In C++ this is usually int. However, the difference is likely to be small and you are unlikely to find that this is the performance bottleneck.
Depends on the architecture. On the PowerPC, there's usually a massive performance penalty involved in using anything other than int (or whatever the native word size is) -- eg, don't use short or char. Float is right out, too.
You should time this on your particular architecture because it varies, but in my test cases there was ~20% slowdown from using short instead of int.
I can't provide a citation, but I've heard that you often do incur a little performance overhead by using a short or char.
The memory savings are nonexistent since it's a temporary stack variable. The memory it lives in will almost certainly already be allocated, and you probably won't save anything by using something shorter, because the next variable will likely want to be aligned to a larger boundary anyway.
You can use whatever legal type you want in a for; it doesn't have to be integral or even built in. For example, you can use iterators as well:
for (std::vector<std::string>::iterator s = myStrings.begin(); myStrings.end() != s; ++s)
{
    ...
}
Whether or not it will have an impact on performance comes down to a question of how the operators you use are implemented. So in the above example that means end(), operator!=() and operator++().
This is not really an answer. I'm just exploring what Crashworks said about the PowerPC. As others have pointed out already, using a type that maps to the native word size should yield the shortest code and the best performance.
$ cat loop.c
extern void bar();
void foo()
{
    int i;
    for (i = 0; i < 42; ++i)
        bar();
}
$ powerpc-eabi-gcc -S -O3 -o - loop.c
        ...
.L5:
        bl bar
        addic. 31,31,-1
        bge+ 0,.L5
It is quite different with short i instead of int i, and it looks like it won't perform as well either.
.L5:
        bl bar
        addi 3,31,1
        extsh 31,3
        cmpwi 7,31,41
        ble+ 7,.L5
No, it really shouldn't impact performance.
It probably would have been quicker to type in a quick program (you did the most complex line already) and profile it, than ask this question here. :-)
FWIW, in languages that use bignums by default (Python, Lisp, etc.), I've never seen a profile where a loop counter was the bottleneck. Checking the type tag is not that expensive -- a couple instructions at most -- but probably bigger than the difference between a (fix)int and a short int.
Probably not as long as you don't do it with a float or a double. Since memory is cheap you would probably be best off just using an int.
An unsigned or size_t should, in theory, give you better results (easy, people: we are trying to optimise here despite those shouting "premature" and "root of all evil". It's the new trend).
However, it does have its drawbacks, primarily the classic screw-up: unsigned wraparound, sketched below.
Google devs seem to avoid unsigned too, but it is a pain to fight against std or boost, whose size types are unsigned.
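A minimal sketch of that classic screw-up (function names illustrative): counting down with an unsigned type wraps around instead of going negative.
#include <stddef.h>

void process(size_t i);

void backwards(size_t n) {
    // BUG: i >= 0 is always true for an unsigned type, so when i reaches 0
    // the next --i wraps to SIZE_MAX and the loop never terminates.
    for (size_t i = n - 1; i >= 0; --i)
        process(i);
}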
If you compile your program with optimization (e.g., gcc -O), it doesn't matter. The compiler will allocate an integer register to the value and never store it in memory or on the stack. If your loop calls a routine, gcc will allocate one of the registers r14-r31, which any called routine will save and restore. So use int, because that causes the least surprise to whoever reads your code.
