Auto vectorization on double and ffast-math - gcc

Why is it mandatory to use -ffast-math with g++ to achieve the vectorization of loops using doubles? I don't like -ffast-math because I don't want to lose precision.

You don't necessarily lose precision with -ffast-math. It only affects the handling of NaN, Inf, etc., and the order in which operations are performed.
If you have a specific piece of code where you do not want GCC to reorder or simplify computations, you can mark variables as being used with an asm statement.
For instance, the following code performs a rounding operation on f. However, the two f += g and f -= g operations are likely to get optimised away by gcc:
static double moo(double f, double g)
{
    g *= 4503599627370496.0; // 2 ** 52
    f += g;
    f -= g;
    return f;
}
On x86_64, you can use this asm statement to instruct GCC not to perform that optimisation:
static double moo(double f, double g)
{
    g *= 4503599627370496.0; // 2 ** 52
    f += g;
    __asm__("" : "+x" (f));
    f -= g;
    return f;
}
You will need to adapt this for each architecture, unfortunately. On PowerPC, use +f instead of +x.
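For the curious, this is the classic add-and-subtract rounding trick: with g = 1.0 and |f| < 2**52, adding 2**52 pushes the fraction bits of f off the end of the significand, so what survives is f rounded to an integer in the current rounding mode. A usage sketch (the call values are mine):

    double r = moo(2.7, 1.0); // r == 3.0 under the default round-to-nearest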

Very likely because vectorization means that you may get different results, or may miss floating-point signals/exceptions.
If you're compiling for 32-bit x86, then gcc and g++ default to using the x87 for floating-point math; on 64-bit they default to SSE. However, the x87 can and will produce different values for the same computation, so it's unlikely g++ will consider vectorizing if it can't guarantee that you will get the same results, unless you use -ffast-math or some of the flags it turns on.
Basically it comes down to this: the floating-point environment for vectorized code may not be the same as the one for non-vectorized code, sometimes in ways that are important. If the differences don't matter to you, try something like
-fno-math-errno -fno-trapping-math -fno-signaling-nans -fno-rounding-math
but first look up those options and make sure that they won't affect your program's correctness. -ffinite-math-only may help as well.

Because -ffast-math enables operand reordering, which allows much more code to be vectorized.
For example, to calculate
sum = a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + … a[99]
the compiler is required to do the additions sequentially without -ffast-math, because floating-point addition is not associative:
Is floating point addition commutative and associative?
Is floating point addition commutative in C++?
Are floating point operations in C associative?
Is Floating point addition and multiplication associative?
That's the same reason compilers can't optimize a*a*a*a*a*a to (a*a*a)*(a*a*a) without -ffast-math.
That means no vectorization is available, unless the target has very efficient horizontal vector adds.
However, if -ffast-math is enabled, the expression can be calculated like this (see A7. Auto-Vectorization):
sum0 = a[0] + a[4] + a[ 8] + … a[96]
sum1 = a[1] + a[5] + a[ 9] + … a[97]
sum2 = a[2] + a[6] + a[10] + … a[98]
sum3 = a[3] + a[7] + a[11] + … a[99]
sum' = sum0 + sum1 + sum2 + sum3
Now the compiler can vectorize it easily by adding each column in parallel and then doing a horizontal add at the end.
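In C, the reassociated form corresponds to a sketch like this (the function name is mine; n is assumed divisible by 4):

double sum4(const double *a, int n)
{
    double sum0 = 0.0, sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        sum0 += a[i];     // column 0: a[0], a[4], a[8], ...
        sum1 += a[i + 1]; // column 1
        sum2 += a[i + 2]; // column 2
        sum3 += a[i + 3]; // column 3
    }
    return sum0 + sum1 + sum2 + sum3; // horizontal add at the end
}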
Does sum' == sum? Only if (a[0]+a[4]+…) + (a[1]+a[5]+…) + (a[2]+a[6]+…) + (a[3]+a[7]+…) == a[0] + a[1] + a[2] + …, which holds under associativity, and floating-point addition does not obey associativity all of the time. Specifying /fp:fast (MSVC's counterpart to -ffast-math) lets the compiler transform your code to run faster: up to 4 times faster, for this simple calculation.
Do You Prefer Fast or Precise? - A7. Auto-Vectorization
In gcc it may be enabled by the -fassociative-math flag.
Further reading:
Semantics of Floating Point Math in GCC
What does gcc's ffast-math actually do?

To enable auto-vectorization with gcc, -ffast-math is not actually necessary. See https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using:
To enable vectorization of floating point reductions use -ffast-math or -fassociative-math.
Using -fassociative-math should be sufficient.
This has been the case since 2007, see https://gcc.gnu.org/projects/tree-ssa/vectorization.html#oldnews
-fassociative-math can be used instead of -ffast-math to enable vectorization of reductions of floats (2007-09-04).
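As a concrete sketch (the function name is mine; note that GCC requires both -fno-signed-zeros and -fno-trapping-math to be in effect for -fassociative-math to work), a plain reduction like this is exactly the kind of loop those flags unlock:

#include <stddef.h>

// Without reassociation, the strict left-to-right order of the additions
// blocks vectorizing the accumulator.
double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

Compiled with something like gcc -O3 -fassociative-math -fno-signed-zeros -fno-trapping-math -S sum.c, the loop should use packed additions; with plain -O3 it stays scalar.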

gcc loop unrolling oddity

In the course of writing a "not-equal scan" for Boolean arrays, I ended up writing this loop:
// Heckman recursive doubling
#ifdef STRENGTHREDUCTION // Haswell/gcc does not like the multiply
for( s=1; s<BITSINWORD; s=s*2) {
#else // STRENGTHREDUCTION
for( s=1; s<BITSINWORD; s=s+s) {
#endif // STRENGTHREDUCTION
w = w XOR ( w >> s);
}
What I observed was that gcc WOULD unroll the s=s*2 loop, but not the s=s+s loop. This is slightly non-intuitive, as the loop-count analysis for addition should, IMO, be simpler than for multiply. I suspect that gcc DOES know the s=s+s loop count, and is merely being coy.
Does anyone know if there is some good reason for this behavior on gcc's part? I am asking this out of curiosity... [The unrolled version, BTW, ran a fair bit slower than the loop.]
Thanks,
Robert
This is interesting.
First guess
My first guess would be that gcc's loop unroll analysis expects the addition case to benefit less from loop unrolling because s grows more slowly.
I experimented with the following code:
#include <stdio.h>

int main(int argc, char **args) {
    int s;
    int w = 255;
    for (s = 1; s < 32; s = s * 2)
    {
        w = w ^ (w >> s);
    }
    printf("%d", w); // To prevent everything from being optimized away
    return 0;
}
And another version that is the same except the loop has s = s + s. I find that gcc 4.9.2 unrolls the loop in the multiplicative version but not the additive one. This is compiling with
gcc -S -O3 test.c
So my first guess is that gcc assumes the additive version, if unrolled, would result in more bytes of code than fit in the icache and therefore does not optimize it. However, changing the loop condition from s < 32 to s < 4 in the additive version still doesn't result in unrolling, even though it seems gcc should easily recognize that there are very few iterations of the loop.
My next attempt (going back to s < 32 as the condition) is to explicitly tell gcc to unroll loops up to 100 times:
gcc -S -O3 -fverbose-asm --param max-unroll-times=100 test.c
This still produces a loop in the assembly. Trying to allow more instructions in unrolled loops with --param max-unrolled-insns retains the loop as well. Therefore, we can pretty much eliminate the possibility that gcc thinks it's inefficient to unroll.
Interestingly, trying to compile with clang at -O3 immediately unrolls the loop. clang is known to unroll more aggressively, but this doesn't seem like a satisfying answer.
I can get gcc to unroll the additive loop by making it add a constant and not s itself, that is, I do s = s + 2. Then the loop unrolls.
Second guess
That leads me to theorize that gcc is unable to understand how many iterations the loop will run for (necessary for unrolling) if the loop's increase value depends on the counter's value more than once. I change the loop as follows:
for (s = 2; s < 32; s = s*s)
And it does not unroll with gcc, while clang unrolls it. So my best guess, in the end, is that gcc fails to calculate the number of iterations when the loop's increment statement is of the form s = s (op) s.
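For anyone who wants to reproduce the experiment, here is a sketch with the increment variant selected on the command line (the INC macro is mine; behavior as observed above with gcc 4.9.2):

#include <stdio.h>

// Select the increment at compile time, e.g.:
//   gcc -S -O3 "-DINC=s = s * 2" test.c   (multiplicative: unrolled)
//   gcc -S -O3 "-DINC=s = s + s" test.c   (additive: left as a loop)
#ifndef INC
#define INC s = s * 2
#endif

int main(void)
{
    int s;
    int w = 255;
    for (s = 1; s < 32; INC)
    {
        w = w ^ (w >> s);
    }
    printf("%d", w); // To prevent everything from being optimized away
    return 0;
}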
Compilers routinely perform strength reduction, so I would expect gcc to use it here, replacing s*2 by s+s, at which point the forms of both source code expressions would match.
If that is not the case, then I think it is a bug in gcc. The analysis to compute the loop count using s+s is (marginally) simpler than that using s*2, so I would expect gcc to be (marginally) more likely to unroll the s+s case.

Fortran multidimensional sub-array performance

While manipulating and assigning sub-arrays within multidimensional arrays in Fortran 90, I stumbled across an interesting performance quirk.
Fortran 90 introduced the ability to manipulate sub-sections of arrays, and I have seen a few places which recommend that array operations be performed using this "slicing" method instead of loops. For instance, if I have to add two arrays, a and b, of size 10, it is better to write:
c(1:10) = a(1:10) + b(1:10)
or
c = a + b
Instead of
do i = 1, 10
    c(i) = a(i) + b(i)
end do
I tried this method for simple one dimensional and two dimensional arrays and found it to be faster with the "slicing" notation. However, things began to get a little interesting when assigning such results within multidimensional arrays.
First of all, I must apologize for my rather crude performance measuring exercise. I am not even sure if the method I have adopted is the right way to time and test codes, but I am fairly confident about the qualitative results of the test.
program main
    implicit none
    integer, parameter :: mSize = 10000
    integer :: i, j
    integer :: pCnt, nCnt, cntRt, cntMx
    integer, dimension(mSize, mSize) :: a, b
    integer, dimension(mSize, mSize, 3) :: c

    pCnt = 0
    call SYSTEM_CLOCK(nCnt, cntRt, cntMx)
    print *, "First call: ", nCnt-pCnt
    pCnt = nCnt

    do j = 1, mSize
        do i = 1, mSize
            a(i, j) = i*j
            b(i, j) = i+j
        end do
    end do

    call SYSTEM_CLOCK(nCnt, cntRt, cntMx)
    print *, "Created Matrices: ", nCnt-pCnt
    pCnt = nCnt

    ! OPERATIONS BY SLICING NOTATION
    !c(1:mSize, 1:mSize, 1) = a + b
    !c(1:mSize, 1:mSize, 2) = a - b
    !c(1:mSize, 1:mSize, 3) = a * b

    ! OPERATIONS WITH LOOP
    do j = 1, mSize
        do i = 1, mSize
            c(i, j, 1) = a(i, j) + b(i, j)
            c(i, j, 2) = a(i, j) - b(i, j)
            c(i, j, 3) = a(i, j) * b(i, j)
        end do
    end do

    call SYSTEM_CLOCK(nCnt, cntRt, cntMx)
    print *, "Added Matrices: ", nCnt-pCnt
    pCnt = nCnt
end program main
As can be seen, I have two methods of operating upon two large 2D arrays and assigning the results into a 3D array. I was heavily in favour of using the slicing notation, as it helped me write shorter and more elegant-looking code. But upon observing how severely sluggish my code was, I was forced to re-examine the benefit of slicing notation over calculating within loops.
I ran the above code with and without the -O3 flag, using GNU Fortran 4.8.4 on Ubuntu 14.04.

Without the -O3 flag:
a. Slicing notation: 5 runs - 843, 842, 842, 841, 859 (average 845.4)
b. Looped calculation: 5 runs - 1713, 1713, 1723, 1711, 1713 (average 1714.6)

With the -O3 flag:
a. Slicing notation: 5 runs - 545, 545, 544, 544, 548 (average 545.2)
b. Looped calculation: 5 runs - 479, 477, 475, 472, 472 (average 475.0)
I found it very interesting that without the -O3 flag, the slicing notation performed way better than the loops. However, with the -O3 flag this advantage vanishes completely; on the contrary, the slicing notation becomes detrimental in this case.
In fact, with my rather large 3D parallel computation code, this is turning out to be a significant bottleneck. I strongly suspect that the formation of array temporaries during the assignment of a lower-dimensional array to a higher-dimensional array is the culprit here. But why did the optimization flag fail to optimize the assignment in this case?
Moreover, I feel that blaming the -O3 flag is not a respectable thing to do. So are array temporaries really the culprit? Is there something else I may be missing? Any insight will be extremely helpful in speeding up my code. Thanks!
When doing any performance comparison, you have to compare apples with apples and oranges with oranges. What I mean is that you are not really comparing the same thing. The two versions are totally different even though they produce the same result.
What comes into play here is memory management; think of cache misses during the operation. If you turn the loop version into 3 different loops, as suggested by haraldkl, you will certainly get similar performance.
What happens is that when you combine the 3 assignments in the same loop, there is a lot of cache reuse on the right-hand side, since all 3 statements share the same variables there. Each element of a or b is loaded into the cache and into registers only once for the loop version, while for the array-operation version each element of a or b gets loaded 3 times. That is what makes the difference. The larger the arrays, the larger the difference, because you will get more cache misses and more reloading of elements into registers.
I don't know what the compiler really does, so this is not really an answer, but it is too much text for a comment...
My suspicion is that the compiler expands the array notation into something like this:
do j = 1, mSize
    do i = 1, mSize
        c(i, j, 1) = a(i, j) + b(i, j)
    end do
end do

do j = 1, mSize
    do i = 1, mSize
        c(i, j, 2) = a(i, j) - b(i, j)
    end do
end do

do j = 1, mSize
    do i = 1, mSize
        c(i, j, 3) = a(i, j) * b(i, j)
    end do
end do
Of course, the compiler might still fuse these loops back together if written like that, so you might need to confuse it a little more, for example by writing something of c to the screen between the loops.

implicit SIMD (SSE/AVX) broadcasts with GCC

I have managed to convert most of my SIMD code to use the vector extensions of GCC. However, I have not found a good solution for doing a broadcast as follows:
__m256 areg0 = _mm256_broadcast_ss(&a[i]);
I want to do
__m256 areg0 = a[i];
If you see my answer at Multiplying vector by constant using SSE, I managed to get broadcasts working with another SIMD register. The following works:
__m256 x,y;
y = x + 3.14159f; // broadcast x + 3.14159
y = 3.14159f*x; // broadcast 3.14159*x
but this won't work:
__m256 x;
x = 3.14159f; //should broadcast 3.14159 but does not work
How can I do this with GCC?
I think there is currently no direct way and you have to work around it using the syntax you already noticed:
__m256 zero={};
__m256 x=zero+3.14159f;
It may change in the future if we can agree on a good syntax, see PR 55726.
Note that if you want to create a vector { s, s, ... s } with a non-constant float s, the technique above only works with integers, or with floats and -fno-signed-zeros. You can tweak it to __m256 x=s-zero; and it will work unless you use -frounding-math. A last version, suggested by Z boson, is __m256 x=(zero+1.f)*s; which should work in most cases (except possibly with a compiler paranoid about sNaN).
It turns out that with a precise floating-point model (the default; e.g. -O3 without -ffast-math), GCC cannot simplify x+0 to x due to signed zero, so x = zero+3.14159f produces inefficient code. However, GCC can simplify 1.0*x to just x, so the efficient solution in this case is:
__m256 x = ((__m256){} + 1)*3.14159f;
https://godbolt.org/g/5QAQkC
See this answer for more details.
A simpler solution is just x = 3.14159f - (__m256){} because x - 0 = x irrespective of signed zero.
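Putting it together, here is a self-contained sketch of the (zero + 1.f)*s variant (the file and function names are mine; built with something like gcc -O3 -mavx bcast.c, the multiply should fold into a single broadcast):

#include <immintrin.h>
#include <stdio.h>

// Broadcast a (possibly non-constant) float into all 8 lanes using only
// GCC's vector extensions, no intrinsics.
static __m256 broadcast(float s)
{
    return ((__m256){} + 1.0f) * s; // 1*s == s, so every lane becomes s
}

int main(void)
{
    __m256 x = broadcast(3.14159f);
    for (int i = 0; i < 8; i++)
        printf("%f\n", x[i]); // GCC allows subscripting vector types
    return 0;
}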

BLAS subroutines dgemm, dgemv and ddot don't work with scalars?

I have a Fortran subroutine which uses the BLAS subroutines dgemm, dgemv and ddot, which calculate matrix * matrix, matrix * vector and vector * vector products. I have m * m matrices and m * 1 vectors. In some cases m=1. It seems that those subroutines don't work well in those cases. They don't give errors, but there seems to be some numerical instability in the results. So I have to write something like:
if(m>1) then
    vtuni(i,t) = yt(i,t) - ct(i,t) - ddot(m, zt(i,1:m,(t-1)*tvar(3)+1), 1, arec, 1)
else
    vtuni(i,t) = yt(i,t) - ct(i,t) - zt(i,1,(t-1)*tvar(3)+1)*arec(1)
end if
So my actual question is: am I right that those BLAS subroutines don't work properly when m=1, or is there just something wrong in my code? Can the compiler affect this? I'm using gfortran.
BLAS routines are supposed to behave correctly with objects of size 1. I don't think it can depend on the compiler, but it could possibly depend on the implementation of BLAS you're relying on (though I'd consider that a bug in the implementation). The reference (read: not target-optimised) implementation of BLAS, which can be found on Netlib, handles that case fine.
I've done some testing on both arrays of size 1, and size-1 slices of a larger array (as in your own code), and they both work fine:
$ cat a.f90
implicit none
double precision :: u(1), v(1)
double precision, external :: ddot
u(:) = 2
v(:) = 3
print *, ddot(1, u, 1, v, 1)
end
$ gfortran a.f90 -lblas && ./a.out
6.0000000000000000
$ cat b.f90
implicit none
double precision, allocatable :: u(:,:,:), v(:)
double precision, external :: ddot
integer :: i, j
allocate(u(3,1,3),v(1))
u(:,:,:) = 2
v(:) = 3
i = 2
j = 2
print *, ddot(1, u(i,1:1,j), 1, v, 1)
end
$ gfortran b.f90 -lblas && ./a.out
6.0000000000000000
Things I'd consider to debug this problem further:
Check that your declaration of ddot is correct
Substitute the reference BLAS for your optimised one, to check if it changes anything (you can just compile and link in the reference ddot.f)

Loop versioning with GCC

I am working on auto-vectorization with GCC. I am not in a position to use intrinsics or attributes due to customer requirements. (I cannot get user input to support vectorization.)
If the alignment information of the array that can be vectorized is unknown, GCC invokes a pass for 'loop versioning'. Loop versioning will be performed when loop vectorization is done on trees. When a loop is identified to be vectorizable, and the constraint on data alignment or data dependence is hindering it, (because they cannot be determined at compile time), then two versions of the loop will be generated. These are the vectorized and non-vectorized versions of the loop along with runtime checks for alignment or dependence to control which version is executed.
My question is: how do we enforce the alignment? If I have found a loop that is vectorizable, I should not have to generate two versions of the loop because of missing alignment information.
For example, consider the code below:
short a[15]; short b[15]; short c[15];
int i;

void foo()
{
    for (i=0; i<15; i++)
    {
        a[i] = b[i];
    }
}
Tree dump (options: -fdump-tree-optimized -ftree-vectorize)
<SNIP>
vector short int * vect_pa.49;
vector short int * vect_pb.42;
vector short int * vect_pa.35;
vector short int * vect_pb.30;

<bb 2>:
vect_pb.30 = (vector short int *) &b;
vect_pa.35 = (vector short int *) &a;
if (((signed char) vect_pa.35 | (signed char) vect_pb.30) & 3 == 0) ;; <== (A)
    goto <bb 3>;
else
    goto <bb 4>;

<bb 3>:
</SNIP>
At <bb 3>, the vectorized version of the loop is generated; at <bb 4>, the non-vectorized version. Which one runs is decided by checking the alignment (statement 'A'). Now, without using intrinsics or other attributes, how can I get only the vectorized code (without this runtime alignment check)?
If the data in question is being allocated statically, then you can use the aligned attribute that GCC supports (__attribute__((aligned(N)))) to specify that it should be aligned to the necessary boundary. If you are dynamically allocating these arrays, you can over-allocate by the alignment value, and then bump the returned pointer up to the alignment you need.
You can also use the posix_memalign() function if you're on a system that supports it. Finally, note that malloc() will always allocate memory aligned to the size of the largest built-in type, generally 8 bytes for a double. If you don't need better than that, then malloc should suffice.
Edit: If you modify your allocation code to force that check to be true (i.e. over-allocate, as suggested above), the compiler should oblige by not conditionalizing the loop code. For alignment to an 8-byte boundary, as it seems you need, that would be something like a = (a + 7) & ~7;.
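As a sketch of both routes (the array sizes and function names are mine; __builtin_assume_aligned is GCC 4.7+, so it postdates the compiler versions discussed in this thread):

#include <stddef.h>

// Statically allocated arrays with guaranteed 16-byte alignment: the
// vectorizer can prove alignment at compile time and should not need a
// runtime versioning check.
short a[16] __attribute__((aligned(16)));
short b[16] __attribute__((aligned(16)));

void copy_static(void)
{
    for (int i = 0; i < 16; i++)
        a[i] = b[i];
}

// For pointers whose alignment the compiler cannot see (e.g. from an
// over-allocated malloc buffer bumped up by hand), promise it instead:
void copy_dynamic(short *pa, const short *pb, size_t n)
{
    short *qa = __builtin_assume_aligned(pa, 16);
    const short *qb = __builtin_assume_aligned(pb, 16);
    for (size_t i = 0; i < n; i++)
        qa[i] = qb[i];
}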
I get only one version of the loop, using your exact code with these options: gcc -march=core2 -c -O2 -fdump-tree-optimized -ftree-vectorize vec.c
My version of GCC is gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu8).
GCC is doing something clever here. It forces the arrays a and b to be 16-byte aligned. It doesn't do that to c, presumably because c is never used in a vectorizable loop.
