Loop versioning with GCC

I am working on auto-vectorization with GCC. I am not in a position to use intrinsics or attributes due to a customer requirement (I cannot rely on user input to support vectorization).
If the alignment information of an array that can be vectorized is unknown, GCC invokes a 'loop versioning' pass. Loop versioning is performed when loop vectorization is done on trees. When a loop is identified as vectorizable, but a constraint on data alignment or data dependence hinders it (because it cannot be determined at compile time), two versions of the loop are generated: a vectorized and a non-vectorized version, along with runtime checks for alignment or dependence that control which version is executed.
My question is: how do I enforce the alignment? If I have found a loop that is vectorizable, I do not want two versions of the loop to be generated just because alignment information is missing.
For example, consider the code below:
short a[15];
short b[15];
short c[15];
int i;

void foo()
{
    for (i = 0; i < 15; i++)
    {
        a[i] = b[i];
    }
}
Tree dump (options: -fdump-tree-optimized -ftree-vectorize)
<SNIP>
vector short int * vect_pa.49;
vector short int * vect_pb.42;
vector short int * vect_pa.35;
vector short int * vect_pb.30;
<bb 2>:
vect_pb.30 = (vector short int *) &b;
vect_pa.35 = (vector short int *) &a;
if (((signed char) vect_pa.35 | (signed char) vect_pb.30) & 3 == 0) ;; <== (A)
goto <bb 3>;
else
goto <bb 4>;
<bb 3>:
</SNIP>
At 'bb 3' the vectorized version of the loop is generated; at 'bb 4' the non-vectorized version is generated. The choice is made by checking the alignment (statement 'A'). Now, without using intrinsics or other attributes, how can I get only the vectorized code (without this runtime alignment check)?

If the data in question is being allocated statically, then you can use the aligned attribute that GCC supports (__attribute__((aligned(16)))) to specify that it should be aligned to the necessary boundary. If you are dynamically allocating these arrays, you can over-allocate by the alignment value, and then bump the returned pointer up to the alignment you need.
You can also use the posix_memalign() function if you're on a system that supports it. Finally, note that malloc() will always allocate memory aligned to the size of the largest built-in type, generally 8 bytes for a double. If you don't need better than that, then malloc should suffice.
Edit: If you modify your allocation code to force that check to be true (i.e. over-allocate, as suggested above), the compiler should oblige by not conditionalizing the loop code. If you needed alignment to an 8-byte boundary, as it seems, that would be something like a = (a + 7) & ~7;.
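For dynamic allocation, a minimal sketch of that over-allocate-and-bump approach could look like the following (the helper name is made up; 8 matches the boundary mentioned above, and a stricter boundary such as 16 works the same way):
#include <cstdint>
#include <cstdlib>

// Over-allocate by (align - 1) bytes, then round the returned pointer up.
// The raw pointer is handed back so it can be passed to free() later.
short *alloc_aligned_shorts(std::size_t count, void **raw_out)
{
    const std::size_t align = 8;
    void *raw = std::malloc(count * sizeof(short) + align - 1);
    *raw_out = raw;
    if (raw == nullptr)
        return nullptr;
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>(raw);
    p = (p + align - 1) & ~(std::uintptr_t)(align - 1); // bump up to the boundary
    return reinterpret_cast<short *>(p);
}
On systems that have posix_memalign(), as mentioned above, the library does this bookkeeping for you.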

I get only one version of the loop, using your exact code with these options: gcc -march=core2 -c -O2 -fdump-tree-optimized -ftree-vectorize vec.c
My version of GCC is gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu8).
GCC is doing something clever here. It forces the arrays a and b to be 16-byte aligned. It doesn't do that to c, presumably because c is never used in a vectorizable loop.

Related

Why does the pseudocode of _mm_insert_ps calculate %8?

Within the Intel Intrinsics Guide, the following pseudocode is given for the operation of _mm_insert_ps:
FOR j := 0 to 3
    i := j*32
    IF imm8[j%8]
        dst[i+31:i] := 0
    ELSE
        dst[i+31:i] := tmp2[i+31:i]
    FI
ENDFOR
The access into imm8 confuses me: IF imm8[j%8]. As j is within the range 0..3, the modulo 8 part doesn't seem to do anything. Does this maybe signal a conversion that I am not aware of? Or is % not "modulo" in this case?
Seems like a pointless modulo.
Intel's documentation for the corresponding asm instruction, insertps, doesn't use any % modulo operations in its pseudocode. It uses ZMASK ← imm8[3:0] and then basically unrolls the part that the intrinsics-guide pseudocode expresses as a loop, with checks like
IF (ZMASK[2] = 1) THEN DEST[95:64] ← 00000000H
ELSE DEST[95:64] ← TMP2[95:64]
This is just showing how the low 4 bits of the immediate perform zero-masking on the 4 dword elements of the final result, after the insert of an element from another vector, or a scalar in memory.
(There's no intrinsic for insert directly from memory; you'd need an intrinsic for movss and then hope the compiler folds that load into a memory operand for insertps. With a memory source, imm8[7:6] are ignored, just taking that scalar dword as the element to insert (that's the ELSE COUNT_S←0 in the asm pseudocode), but then everything else works the same, including the zero-masking you're asking about.)
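As a concrete illustration of those low four bits (this example is mine, not from the guide; it needs SSE4.1, e.g. compile with -msse4.1): the constant 0x21 encodes COUNT_S = 0 (bits 7:6), COUNT_D = 2 (bits 5:4) and ZMASK = 0b0001 (bits 3:0), i.e. insert element 0 of b into element 2 of a and then zero element 0 of the result.
#include <smmintrin.h>
#include <cstdio>

int main()
{
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);
    __m128 r = _mm_insert_ps(a, b, 0x21); // insert b[0] into lane 2, zero lane 0
    float out[4];
    _mm_storeu_ps(out, r);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); // prints: 0 2 10 4
}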

Fast saturating integer conversion?

I'm wondering if there is any fast bit twiddling trick to do a saturating conversion from a 64-bit unsigned value to a 32-bit unsigned value (it would be nice if it is generalized to other widths but that's the main width I care about). Most of the resources I've been able to find googling have been for saturating arithmetic operations.
A saturating conversion would take a 64-bit unsigned value, and return either the value unmodified as a 32-bit value or 2^32-1 if the input value is greater than 2^32-1. Note that this is not what the default C casting truncating behaviour does.
I can imagine doing something like:
Test if upper half has any bit set
If so create a 32-bit mask with all bits set, otherwise create a mask with all bits unset
Bitwise-or lower half with mask
But I don't know how to quickly generate the mask. I tried straightforward branching implementations in Godbolt to see if the compiler would generate a clever branchless implementation for me but no luck.
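For what it's worth, a minimal sketch of that mask idea (the function name is made up; it assumes a 64-bit unsigned input as described):
#include <cstdint>

// If the upper 32 bits are non-zero, mask is all ones, so OR-ing it into the
// truncated value saturates to UINT32_MAX; otherwise mask is 0 and the value
// passes through unchanged.
uint32_t saturate_u64_to_u32(uint64_t num)
{
    uint32_t mask = -(uint32_t)((num >> 32) != 0);
    return (uint32_t)num | mask;
}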
The branching implementation example:
#include <stdint.h>
#include <limits.h>

uint32_t square(uint64_t num) {
    return num > UINT32_MAX ? UINT32_MAX : num;
}
Edit: my mistake, the issue was that Godbolt was not set to use optimizations.
You don't need to do any fancy bit twiddling trick to do this. The following function should be enough for compilers to generate efficient code:
uint32_t saturate(uint64_t value) {
    return value > UINT32_MAX ? UINT32_MAX : value;
}
This contains a conditional expression, but most common CPUs, like AMD/Intel and Arm ones, have conditional move instructions. So they will test the value for overflowing 32 bits, and based on the test they will either replace it with UINT32_MAX or leave it alone. For example, on 64-bit Arm processors this function will be compiled by GCC to:
saturate:
    mov     x1, 4294967295
    cmp     x0, x1
    csel    x0, x0, x1, ls
    ret
Note that you must enable compiler optimizations to get the above result.
A way to do this without relying on conditional moves is
((-(x >> 32)) | (x << 32)) >> 32
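Wrapped in a function for clarity (a sketch of the same expression):
#include <cstdint>

// -(x >> 32) is all ones whenever the upper half of x is non-zero, so the
// final shift yields 0xFFFFFFFF; otherwise it yields the low 32 bits of x,
// which (x << 32) parked in the upper half.
uint32_t saturate_no_cmov(uint64_t x)
{
    return (uint32_t)(((-(x >> 32)) | (x << 32)) >> 32);
}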

gcc loop unrolling oddity

In the course of writing a "not-equal scan" for Boolean arrays,
I ended up writing this loop:
// Heckman recursive doubling
#ifdef STRENGTHREDUCTION // Haswell/gcc does not like the multiply
for( s=1; s<BITSINWORD; s=s*2) {
#else // STRENGTHREDUCTION
for( s=1; s<BITSINWORD; s=s+s) {
#endif // STRENGTHREDUCTION
w = w XOR ( w >> s);
}
What I observed was that gcc WOULD unroll the s=s*2 loop,
but not the s=s+s loop. This is slightly non-intuitive, as
the loop-count analysis for addition should, IMO, be simpler
than for multiply. I suspect that gcc DOES know the s=s+s
loop count, and is merely being coy.
Does anyone know if there is some good reason for this
behavior on gcc's part?
I am asking this out of curiosity...
[The unrolled version, BTW, ran a fair bit slower than the loop.]
Thanks,
Robert
This is interesting.
First guess
My first guess would be that gcc's loop unroll analysis expects the addition case to benefit less from loop unrolling because s grows more slowly.
I experimented with the following code:
#include <stdio.h>

int main(int argc, char **args) {
    int s;
    int w = 255;
    for (s = 1; s < 32; s = s * 2)
    {
        w = w ^ (w >> s);
    }
    printf("%d", w); // To prevent everything from being optimized away
    return 0;
}
And another version that is the same except the loop has s = s + s. I find that gcc 4.9.2 unrolls the loop in the multiplicative version but not the additive one. This is compiling with
gcc -S -O3 test.c
So my first guess is that gcc assumes the additive version, if unrolled, would result in more bytes of code than would fit in the icache and therefore does not optimize. However, changing the loop condition from s < 32 to s < 4 in the additive version still doesn't result in unrolling, even though it seems gcc should easily recognize that there are very few iterations of the loop.
My next attempt (going back to s < 32 as the condition) is to explicitly tell gcc to unroll loops up to 100 times:
gcc -S -O3 -fverbose-asm --param max-unroll-times=100 test.c
This still produces a loop in the assembly. Trying to allow more instructions in unrolled loops with --param max-unrolled-insns retains the loop as well. Therefore, we can pretty much eliminate the possibility that gcc thinks it's inefficient to unroll.
Interestingly, trying to compile with clang at -O3 immediately unrolls the loop. clang is known to unroll more aggressively, but this doesn't seem like a satisfying answer.
I can get gcc to unroll the additive loop by making it add a constant and not s itself, that is, I do s = s + 2. Then the loop unrolls.
Second guess
That leads me to theorize that gcc is unable to understand how many iterations the loop will run for (necessary for unrolling) if the loop's increase value depends on the counter's value more than once. I change the loop as follows:
for (s = 2; s < 32; s = s*s)
And it does not unroll with gcc, while clang unrolls it. So my best guess, in the end, is that gcc fails to calculate the number of iterations when the loop's increment statement is of the form s = s (op) s.
Compilers routinely perform strength reduction, so I would expect that
gcc would use it here, replacing s*2 by s+s, at which point the forms of both
source code expressions would match.
If that is not the case, then I think it is a bug in gcc. The analysis
to compute the loop count using s+s is (marginally) simpler than that
using s*2, so I would expect that gcc would be (marginally)
more likely to unroll the s+s case.

StoreStore reordering happens when compiling C++ for x86

while(true) {
    int x(0), y(0);
    std::thread t0([&x, &y]() {
        x=1;
        y=3;
    });
    std::thread t1([&x, &y]() {
        std::cout << "(" << y << ", " <<x <<")" << std::endl;
    });
    t0.join();
    t1.join();
}
Firstly, I know that it is UB because of the data race.
But, I expected only the following outputs:
(3,1), (0,1), (0,0)
I was convinced that it was not possible to get (3,0), but I did. So I am confused; after all, x86 doesn't allow StoreStore reordering.
So x = 1 should be globally visible before y = 3.
I suppose that, from a theoretical point of view, the output (3,0) is impossible because of the x86 memory model, and that it appeared because of the UB. But I am not sure. Please explain.
What else besides StoreStore reordering could explain getting (3,0)?
You're writing in C++, which has a weak memory model. You didn't do anything to prevent reordering at compile-time.
If you look at the asm, you'll probably find that the stores happen in the opposite order from the source, and/or that the loads happen in the opposite order from what you expect.
The loads don't have any ordering in the source: the compiler can load x before y if it wants to, even if they were std::atomic types:
t2 <- x(0)
t1 -> x(1)
t1 -> y(3)
t2 <- y(3)
This isn't even "re"ordering, since there was no defined order in the first place:
std::cout << "(" << y << ", " <<x <<")" << std::endl; doesn't necessarily evaluate y before x. The << operator has left-to-right associativity, and operator overloading is just syntactic sugar for
op<<( op<<(op<<(y),x), endl); // omitting the string constants.
Since the order of evaluation of function arguments is undefined (even if we're talking about nested function calls), the compiler is free to evaluate x before evaluating op<<(y). IIRC, gcc often just evaluates right to left, matching the order of pushing args onto the stack if necessary. The answers on the linked question indicate that that's often the case. But of course that behaviour is in no way guaranteed by anything.
The order they're loaded is undefined even if they were std::atomic. I'm not sure if there's a sequence point between the evaluation of x and y. If not, then it would be the same as if you evaluated x+y: The compiler is free to evaluate the operands in any order because they're unsequenced. If there is a sequence point, then there is an order but it's undefined which order (i.e. they're indeterminately sequenced).
Slightly related: gcc doesn't reorder non-inline function calls in expression evaluation, to take advantage of the fact that C leaves the order of evaluation unspecified. I assume after inlining it does optimize better, but in this case you haven't given it any reason to favour loading y before x.
How to do it correctly
The key point is that it doesn't matter exactly why the compiler decided to reorder, just that it's allowed to. If you don't impose all the necessary ordering requirements, your code is buggy, full-stop. It doesn't matter if it happens to work with some compilers with some specific surrounding code; that just means it's a latent bug.
See http://en.cppreference.com/w/cpp/atomic/atomic for docs on how/why this works:
// Safe version, which should compile to the asm you expected.
while(true) {
    int x(0); // should be atomic, too, because it can be read+written at the same time. You can use memory_order_relaxed, though.
    std::atomic<int> y(0);
    std::thread t0([&x, &y]() {
        x=1;
        // std::atomic_thread_fence(std::memory_order_release); // A StoreStore fence is an alternative to using a release-store
        y.store(3, std::memory_order_release);
    });
    std::thread t1([&x, &y]() {
        int tx, ty;
        ty = y.load(std::memory_order_acquire);
        // std::atomic_thread_fence(std::memory_order_acquire); // A LoadLoad fence is an alternative to using an acquire-load
        tx = x;
        std::cout << ty + tx << "\n"; // Don't use endl, we don't need to force a buffer flush here.
    });
    t0.join();
    t1.join();
}
For Acquire/Release semantics to give you the ordering you want, the last store has to be the release-store, and the acquire-load has to be the first load. That's why I made y a std::atomic, even though you're setting x to 0 or 1 more like a flag.
If you don't want to use release/acquire, you could put a StoreStore fence between the stores and a LoadLoad fence between the loads. On x86, this would just prevent compile-time reordering, but on ARM you'd get a memory-barrier instruction. (Note that y still technically needs to be atomic to obey C's data-race rules, but you can use std::memory_order_relaxed on it.)
Actually, even with Release/Acquire ordering for y, x should be atomic as well. The load of x still happens even if we see y==0. So reading x in thread 2 is not synchronized with writing y in thread 1, so it's UB. In practice, int loads/stores on x86 (and most other architectures) are atomic. But remember that std::atomic implies other semantics, like the fact that the value can be changed asynchronously by other threads.
The hardware-reordering test could run a lot faster if you looped inside one thread storing i and -i or something, and looped inside the other thread checking that abs(y) is always >= abs(x). Creating and destroying two threads per test is a lot of overhead.
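A rough sketch of that kind of test harness (my adaptation, not the exact i / -i scheme described above; both threads use memory_order_relaxed so the accesses compile to plain MOVs on x86, although strictly speaking relaxed operations still permit compile-time reordering):
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};

int main()
{
    std::thread writer([] {
        for (int i = 1; i <= 100000000; ++i) {
            x.store(i, std::memory_order_relaxed); // first store
            y.store(i, std::memory_order_relaxed); // second store
        }
    });
    std::thread reader([] {
        for (int n = 0; n < 100000000; ++n) {
            int ty = y.load(std::memory_order_relaxed); // first load
            int tx = x.load(std::memory_order_relaxed); // second load
            if (tx < ty) // x is stored before y, so on x86 tx should never lag behind ty
                std::printf("reordering observed: x=%d y=%d\n", tx, ty);
        }
    });
    writer.join();
    reader.join();
}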
Of course, to get this right, you have to know how to use C to generate the asm you want (or write in asm directly).

Strange pointer arithmetic

I came across some very strange behaviour with pointer arithmetic. I am developing a program to read an SD card from an LPC2148 using the ARM GNU toolchain (on Linux). A sector of my SD card contains data (in hex) like this (checked with the Linux "xxd" command):
fe 2a 01 34 21 45 aa 35 90 75 52 78
When printing individual bytes, it prints perfectly.
char *ch = buffer; /* char buffer[512]; */
for (i = 0; i < 12; i++)
    debug("%x ", *ch++);
Here the debug function sends output over the UART.
However, pointer arithmetic, especially adding an offset that is not a multiple of 4, gives very strange results.
uint32_t *p; // uint32_t is typedef to unsigned long.
p = (uint32_t*)((char*)buffer + 0);
debug("%x ", *p); // prints 34012afe // correct
p = (uint32_t*)((char*)buffer + 4);
debug("%x ", *p); // prints 35aa4521 // correct
p = (uint32_t*)((char*)buffer + 2);
debug("%x ", *p); // prints 0134fe2a // TOO STRANGE??
Am I choosing a wrong compiler option? Please help.
I tried the optimization options -O0 and -Os, but no change.
I considered little/big endian issues, but here I am getting unexpected data (from previous bytes) and no byte-order reversal.
For this to work, your CPU architecture would have to support unaligned load and store operations.
To the best of my knowledge, it doesn't (and I've been using the STM32, which is an ARM Cortex-based MCU).
If you try to read a uint32_t value from an address which is not divisible by the size of uint32_t (i.e. not divisible by 4), then in the "good" case you will just get the wrong output.
I'm not sure what the address of your buffer is, but at least one of the three uint32_t reads that you describe in your question requires the processor to perform an unaligned load operation.
On STM32, you would get a memory-access violation (resulting in a hard-fault exception).
The data-sheet should provide a description of your processor's expected behavior.
UPDATE:
Even if your processor does support unaligned load and store operations, you should try to avoid using them, as it might affect the overall running time (in comparison with "normal" load and store operations).
So in either case, you should make sure that whenever you perform a memory access (read or write) operation of size N, the target address is divisible by N. For example:
uint8_t  x = *(uint8_t*)y;  // 'y' must point to a memory address divisible by 1
uint16_t x = *(uint16_t*)y; // 'y' must point to a memory address divisible by 2
uint32_t x = *(uint32_t*)y; // 'y' must point to a memory address divisible by 4
uint64_t x = *(uint64_t*)y; // 'y' must point to a memory address divisible by 8
In order to ensure this with your data structures, always define them so that every field x is located at an offset which is divisible by sizeof(x). For example:
typedef struct
{
    uint16_t a; // offset 0, divisible by sizeof(uint16_t), which is 2
    uint8_t  b; // offset 2, divisible by sizeof(uint8_t), which is 1
    uint8_t  c; // offset 3, divisible by sizeof(uint8_t), which is 1
    uint32_t d; // offset 4, divisible by sizeof(uint32_t), which is 4
    uint64_t e; // offset 8, divisible by sizeof(uint64_t), which is 8
} myStruct_t;
Please note that this does not guarantee that your data structure is "safe", and you still have to make sure that every myStruct_t* variable that you are using is pointing to a memory address divisible by the size of the largest field (in the example above, 8).
SUMMARY:
There are two basic rules that you need to follow:
Every instance of your structure must be located at a memory address which is divisible by the size of the largest field in the structure.
Each field in your structure must be located at an offset (within the structure) which is divisible by the size of that field itself.
Exceptions:
Rule #1 may be violated if the CPU architecture supports unaligned load and store operations. Nevertheless, such operations are usually less efficient (requiring the compiler to add NOPs "in between"). Ideally, one should strive to follow rule #1 even if the compiler does support unaligned operations, and let the compiler know that the data is well aligned (using a dedicated #pragma), in order to allow the compiler to use aligned operations where possible.
Rule #2 may be violated if the compiler automatically generates the required padding. This, of course, changes the size of each instance of the structure. It is advisable to always use explicit padding (instead of relying on the current compiler, which may be replaced at some later point in time).
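For instance, a sketch of rule #2 with explicit padding (the field names here are made up for illustration):
#include <cstdint>

typedef struct
{
    uint8_t  flags;  // offset 0
    uint8_t  pad[3]; // explicit padding so that 'value' starts at offset 4
    uint32_t value;  // offset 4, divisible by sizeof(uint32_t), which is 4
} record_t;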
LDR is the ARM instruction to load data. You have told the compiler that the pointer points to a 32-bit value, but it is not aligned properly. You pay the price. Here is the LDR documentation:
If the address is not word-aligned, the loaded value is rotated right by 8 times the value of bits [1:0].
See: 4.2.1. LDR and STR, words and unsigned bytes, especially the section Address alignment for word transfers.
Basically your code behaves like this:
uint32_t v = *(uint32_t*)((char*)buffer + 0); // the CPU performs an aligned load of the word at offset 0
v = (v >> 16) | (v << 16);                    // ...and rotates it by 8 * (2 & 3) = 16 bits
debug("%x ", v); // prints 0134fe2a
but it is encoded as one instruction on the ARM. This behavior is dependent on the ARM CPU type and possibly co-processor values. It is also highly non-portable code.
It's called "undefined behavior". Your code is casting a value which is not a valid unsigned long * into an unsigned long *. The semantics of that operation are undefined behavior, which means pretty much anything can happen*.
In this case, the reason two of your examples behaved as you expected is because you got lucky and buffer happened to be word-aligned. Your third example was not as lucky (if it was, the other two would not have been), so you ended up with a pointer with extra garbage in the 2 least significant bits. Depending on the version of ARM you are using, that could result in an unaligned read (which it appears is what you were hoping for), or it could result in an aligned read (using the most significant 30 bits) and a rotation (word rotated by the number of bytes indicated in the least significant 2 bits). It looks pretty clear that the later is what happened in your 3rd example.
Anyway, technically, all 3 of your example outputs are correct. It would also be correct for the program to crash on all 3 of them.
Basically, don't do that.
A safer alternative is to write the bytes into a uint32_t. Something like:
uint32_t w;
memcpy(&w, buffer, 4);
debug("%x ", w);
memcpy(&w, buffer+4, 4);
debug("%x ", w);
memcpy(&w, buffer+2, 4);
debug("%x ", w);
Of course, that's still assuming sizeof(uint32_t) == 4 && CHAR_BIT == 8, but that's a much safer assumption. (I.e., it should work on pretty much any machine with 8-bit bytes.)
