I would like to understand why GCC does not autovectorize the following loop unless I pass -ffinite-math-only. As I understand it, and according to the GCC manual, the optimization should only require -funsafe-math-optimizations:
If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=neon), note that floating-point operations are not generated by GCC's auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON
hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.
In particular, the flag lets the compiler assume associative math, so that it can first accumulate into 4 partial sums. The code seems pretty straightforward:
template<typename SumType = double>
class UipLineResult {
public:
SumType sqsum;
SumType dcsum;
float pkp;
float pkn;
public:
UipLineResult() {
clear();
}
void clear() {
sqsum = 0;
dcsum = 0;
pkp = -std::numeric_limits<float>::max();
pkn = +std::numeric_limits<float>::max();
}
};
Loop that is not vectorized
static void addSamplesLine(const float* ss, UipLineResult<>* line) {
UipLineResult<float> intermediate;
for(int idx = 0; idx < 120; idx++) {
float s = ss[idx];
intermediate.sqsum += s * s;
intermediate.dcsum += s;
intermediate.pkp = intermediate.pkp < s ? s : intermediate.pkp;
intermediate.pkn = intermediate.pkn > s ? s : intermediate.pkn;
}
line->addIntermediate(&intermediate);
}
For example, the squared addition looks like
intermediate.sqsum += s * s;
107da: ee47 6aa7 vmla.f32 s13, s15, s15
With -ffinite-math-only this becomes
intermediate.sqsum += s * s;
1054c: ef40 6df0 vmla.f32 q11, q8, q8
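One plausible contributor to the -ffinite-math-only requirement, offered as an assumption that the post itself does not confirm: the pkp and pkn updates are written as comparisons, and any comparison with NaN is false, so the ternary is not a symmetric max/min. A minimal sketch of that asymmetry (update_peak and peak_update_is_order_sensitive are illustrative names, not part of the original code):

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Same shape as the pkp update in the loop: keep `peak` unless `s` is larger.
float update_peak(float peak, float s) {
    return peak < s ? s : peak;
}

// Shows that the result depends on which operand is NaN, so the expression
// cannot be treated as a plain, order-independent "max" while NaNs are possible.
bool peak_update_is_order_sensitive() {
    const float nan = std::numeric_limits<float>::quiet_NaN();
    const float keeps_number = update_peak(1.0f, nan);  // 1.0f < NaN is false
    const float keeps_nan = update_peak(nan, 1.0f);     // NaN < 1.0f is false
    return keeps_number == 1.0f && std::isnan(keeps_nan);
}
```

With -ffinite-math-only the compiler may assume that no operand is NaN or infinite, which removes this distinction.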
Compiler flags
-funsafe-math-optimizations -ffinite-math-only -mcpu=cortex-a9 -mfpu=neon
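The reordering into 4 partial sums mentioned above can be written out by hand. This is only a sketch of what associative math permits, not the code GCC emits; sum_sequential and sum_four_lanes are hypothetical names, and the sketch assumes the element count is a multiple of 4:

```cpp
#include <cassert>
#include <cstddef>

// Strict left-to-right sum: the order the source code spells out.
float sum_sequential(const float* s, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += s[i];
    return acc;
}

// Hand-written reordering: four independent partial sums (one per lane of a
// 128-bit register), combined once after the loop.
float sum_four_lanes(const float* s, std::size_t n) {
    assert(n % 4 == 0);  // sketch only: no remainder loop
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < n; i += 4) {
        for (int k = 0; k < 4; ++k) lane[k] += s[i + k];
    }
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

For inputs whose partial results are all exactly representable the two functions agree; in general they can round differently, which is why the compiler needs permission to reassociate.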
Related
I have the following 4x4 matrix-vector multiply code:
double const __restrict__ a[16];
double const __restrict__ x[4];
double __restrict__ y[4];
//#pragma GCC unroll 1 - does not work either
#pragma GCC nounroll
for ( int j = 0; j < 4; ++j )
{
double const* __restrict__ aj = a + j * 4;
double const xj = x[j];
#pragma GCC ivdep
for ( int i = 0; i < 4; ++i )
{
y[i] += aj[i] * xj;
}
}
I compile with -O3 -mavx flags. The inner loop is vectorized (single FMAD). However, gcc (7.2) keeps unrolling the outer loop 4 times, unless I use -O2 or lower optimization.
Is there a way to override -O3 unrolling of a particular loop?
NB. Similar #pragma nounroll works if I use Intel icc.
According to the documentation, #pragma GCC unroll 1 is supposed to work, if you place it just so. If it doesn't then you should submit a bug report.
Alternatively, you can use a function attribute to set optimizations, I think:
void myfn () __attribute__((optimize("no-unroll-loops")));
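A sketch of where the two hints would go, under some assumptions: as far as I know, #pragma GCC unroll only exists from GCC 8 onward (so not in 7.2), and whether the optimize attribute also suppresses the complete unrolling done at -O3 should be checked in the generated assembly. matvec4 and NO_UNROLL_FN are hypothetical names:

```cpp
#include <cassert>

// Apply the function-level hint only on real GCC; other compilers skip it.
#if defined(__GNUC__) && !defined(__clang__)
#define NO_UNROLL_FN __attribute__((optimize("no-unroll-loops")))
#else
#define NO_UNROLL_FN
#endif

// y += A * x for a 4x4 matrix stored column by column in a[16].
NO_UNROLL_FN void matvec4(const double* a, const double* x, double* y) {
    assert(a && x && y);
    // The pragma must sit immediately before the loop it applies to.
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8
#pragma GCC unroll 1
#endif
    for (int j = 0; j < 4; ++j) {
        const double* aj = a + j * 4;
        const double xj = x[j];
        for (int i = 0; i < 4; ++i) {
            y[i] += aj[i] * xj;
        }
    }
}
```

The result is the same with or without the hints; only the shape of the generated code is meant to change.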
For concise functions, when you need neither full nor partial loop unrolling, try the following function attribute:
__attribute__((optimize("Os")))
Consider the following code:
void foo(float* __restrict__ a)
{
int i; float val;
for (i = 0; i < 100; i++) {
val = 2 * i;
a[i] = val;
}
}
void bar(float* __restrict__ a)
{
int i; float val = 0.0;
for (i = 0; i < 100; i++) {
a[i] = val;
val += 2.0;
}
}
They're based on Examples 7.26a and 7.26b in Agner Fog's Optimizing software in C++ and should do the same thing; bar is more "efficient" as written in the sense that we don't do an integer-to-float conversion at every iteration, but rather a float addition which is cheaper (on x86_64).
Here are the clang and gcc results on these two functions (with no vectorization and unrolling).
Question: It seems to me that the optimization of replacing a multiplication by the loop index with an addition of a constant value - when this is beneficial - should be carried out by compilers, even if (or perhaps especially if) there's a type conversion involved. Why is this not happening for these two functions?
Note that if we use ints rather than floats:
void foo(int* __restrict__ a)
{
int i; int val = 0;
for (i = 0; i < 100; i++) {
val = 2 * i;
a[i] = val;
}
}
void bar(int* __restrict__ a)
{
int i; int val = 0;
for (i = 0; i < 100; i++) {
a[i] = val;
val += 2;
}
}
Both clang and gcc perform the expected optimization, albeit not quite in the same way (see this question).
You are looking for induction variable optimization on floating-point numbers. This optimization is generally unsafe in floating-point land because it changes program semantics. In your example it would work because both the initial value (0.0) and the step (2.0) can be represented exactly in IEEE format, but this is a rare case in practice.
It could be enabled under -ffast-math, but it seems this wasn't considered an important case in GCC, as it rejects non-integral induction variables early on (see tree-scalar-evolution.c).
If you believe this is an important use case, you might consider filing a request at GCC Bugzilla.
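To see why the rewrite changes semantics in general, here is a small sketch that compares the two loop shapes for an arbitrary step. The function names are hypothetical, and the sketch assumes ordinary IEEE single-precision evaluation without -ffast-math:

```cpp
#include <cassert>
#include <vector>

// foo-style: convert the index and multiply on every iteration.
std::vector<float> fill_by_multiply(float step, int n) {
    assert(n >= 0);
    std::vector<float> out(n);
    for (int i = 0; i < n; ++i) out[i] = static_cast<float>(i) * step;
    return out;
}

// bar-style: carry a floating-point induction variable and add the step.
std::vector<float> fill_by_accumulate(float step, int n) {
    assert(n >= 0);
    std::vector<float> out(n);
    float val = 0.0f;
    for (int i = 0; i < n; ++i) {
        out[i] = val;
        val += step;  // rounding error accumulates across iterations
    }
    return out;
}

// Counts the positions where the two variants disagree.
int count_mismatches(float step, int n) {
    const std::vector<float> m = fill_by_multiply(step, n);
    const std::vector<float> a = fill_by_accumulate(step, n);
    int diff = 0;
    for (int i = 0; i < n; ++i) diff += (m[i] != a[i]) ? 1 : 0;
    return diff;
}
```

With a step of 2.0f every intermediate value is exact and the two variants match; with a step such as 0.1f the accumulated version drifts away from the multiplied one after a few iterations.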
I've been using the ConjugateGradient solver in Eigen 3.2 and decided to try upgrading to Eigen 3.3.3 with the hope of benefiting from the new multi-threading features.
Sadly, the solver seems slower (~10%) when I enable -fopenmp with GCC 4.8.4. Looking at xosview, I see that all 8 cpus are being used, yet performance is slower...
After some testing, I discovered that if I disable compiler optimization (use -O0 instead of -O3), then -fopenmp does speed up the solver by ~50%.
Of course, it's not really worth disabling optimization just to benefit from multi-threading, since that would be even slower overall.
Following advice from https://stackoverflow.com/a/42135567/7974125, I am storing the full sparse matrix and passing Lower|Upper as the UpLo parameter.
I've also tried each of the 3 preconditioners and also tried using RowMajor matrices, to no avail.
Is there anything else to try to get the full benefits of both multi-threading and compiler optimization?
I cannot post my actual code, but this is a quick test using the Laplacian example from Eigen's documentation, except for some changes to use ConjugateGradient instead of SimplicialCholesky. (Both of these solvers work with SPD matrices.)
#include <Eigen/Sparse>
#include <bench/BenchTimer.h>
#include <iostream>
#include <vector>
using namespace Eigen;
using namespace std;
// Use RowMajor to make use of multi-threading
typedef SparseMatrix<double, RowMajor> SpMat;
typedef Triplet<double> T;
// Assemble sparse matrix from
// https://eigen.tuxfamily.org/dox/TutorialSparse_example_details.html
void insertCoefficient(int id, int i, int j, double w, vector<T>& coeffs,
VectorXd& b, const VectorXd& boundary)
{
int n = int(boundary.size());
int id1 = i+j*n;
if(i==-1 || i==n) b(id) -= w * boundary(j); // constrained coefficient
else if(j==-1 || j==n) b(id) -= w * boundary(i); // constrained coefficient
else coeffs.push_back(T(id,id1,w)); // unknown coefficient
}
void buildProblem(vector<T>& coefficients, VectorXd& b, int n)
{
b.setZero();
ArrayXd boundary = ArrayXd::LinSpaced(n, 0,M_PI).sin().pow(2);
for(int j=0; j<n; ++j)
{
for(int i=0; i<n; ++i)
{
int id = i+j*n;
insertCoefficient(id, i-1,j, -1, coefficients, b, boundary);
insertCoefficient(id, i+1,j, -1, coefficients, b, boundary);
insertCoefficient(id, i,j-1, -1, coefficients, b, boundary);
insertCoefficient(id, i,j+1, -1, coefficients, b, boundary);
insertCoefficient(id, i,j, 4, coefficients, b, boundary);
}
}
}
int main()
{
int n = 300; // size of the image
int m = n*n; // number of unknowns (=number of pixels)
// Assembly:
vector<T> coefficients; // list of non-zeros coefficients
VectorXd b(m); // the right hand side-vector resulting from the constraints
buildProblem(coefficients, b, n);
SpMat A(m,m);
A.setFromTriplets(coefficients.begin(), coefficients.end());
// Solving:
// Use ConjugateGradient with Lower|Upper as the UpLo template parameter to make use of multi-threading
BenchTimer t;
t.reset(); t.start();
ConjugateGradient<SpMat, Lower|Upper> solver(A);
VectorXd x = solver.solve(b); // use the factorization to solve for the given right hand side
t.stop();
cout << "Real time: " << t.value(1) << endl; // 0=CPU_TIMER, 1=REAL_TIMER
return 0;
}
Resulting output:
// No optimization, without OpenMP
g++ cg.cpp -O0 -I./eigen -o cg
./cg
Real time: 23.9473
// No optimization, with OpenMP
g++ cg.cpp -O0 -I./eigen -fopenmp -o cg
./cg
Real time: 17.6621
// -O3 optimization, without OpenMP
g++ cg.cpp -O3 -I./eigen -o cg
./cg
Real time: 0.924272
// -O3 optimization, with OpenMP
g++ cg.cpp -O3 -I./eigen -fopenmp -o cg
./cg
Real time: 1.04809
Your problem is too small to expect any benefit from multi-threading. Sparse matrices are expected to be at least one order of magnitude larger. Eigen's code should be adjusted to reduce the number of threads in this case.
Moreover, I guess that you only have 4 physical cores, so running with OMP_NUM_THREADS=4 ./cg might help.
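The thread limit can also be set from inside the program rather than through the environment. A minimal sketch, assuming the build uses -fopenmp; cap_openmp_threads is a hypothetical helper, and Eigen additionally provides Eigen::setNbThreads for its own kernels:

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Caps the OpenMP thread count for later parallel regions and returns the
// maximum team size now in effect (1 when built without OpenMP).
int cap_openmp_threads(int wanted) {
    assert(wanted >= 1);
#ifdef _OPENMP
    omp_set_num_threads(wanted);
    return omp_get_max_threads();
#else
    (void)wanted;  // no OpenMP runtime: everything runs on one thread
    return 1;
#endif
}
```

Calling cap_openmp_threads(4) before constructing the solver would play the role of OMP_NUM_THREADS=4 on the command line.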
I have a few for loops that do saturated arithmetic operations.
For instance:
Implementation of saturated add in my case is as follows:
static void addsat(Vector &R, Vector &A, Vector &B)
{
int32_t a, b, r;
int32_t max_add;
int32_t min_add;
const int32_t SAT_VALUE = (1<<(16-1))-1;
const int32_t SAT_VALUE2 = (-SAT_VALUE - 1);
const int32_t sat_cond = (SAT_VALUE <= 0x7fffffff);
const uint32_t SAT = 0xffffffff >> 16;
for (int i=0; i<R.length; i++)
{
a = static_cast<uint32_t>(A.data[i]);
b = static_cast<uint32_t>(B.data[i]);
max_add = (int32_t)0x7fffffff - a;
min_add = (int32_t)0x80000000 - a;
r = (a>0 && b>max_add) ? 0x7fffffff : a + b;
r = (a<0 && b<min_add) ? 0x80000000 : a + b;
if ( sat_cond == 1)
{
std_max(r,r,SAT_VALUE2);
std_min(r,r,SAT_VALUE);
}
else
{
r = static_cast<uint16_t> (static_cast<int32_t> (r));
}
R.data[i] = static_cast<uint16_t>(r);
}
}
I see that there is a paddsat intrinsic on x86 that would be the perfect fit for this loop. The code does get auto-vectorized, but as a combination of multiple operations that mirrors my code. What is the best way to write this loop so that the auto-vectorizer recognizes it as a single saturating add?
Vector structure is:
struct V {
static constexpr int length = 32;
unsigned short data[32];
};
Compiler used is clang 3.8 and code was compiled for AVX2 Haswell x86-64 architecture.
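One common way to state a saturating add is to widen, add, and clamp, which gives the optimizer a single compact pattern to look at. The sketch below assumes signed 16-bit elements; Vec16, sat16 and addsat16 are hypothetical names, and whether a particular compiler version turns this into one saturating-add instruction is not guaranteed and should be checked in the assembly:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the Vector type, with signed 16-bit elements.
struct Vec16 {
    static constexpr int length = 32;
    std::int16_t data[length];
};

// Clamp a widened sum back into the int16_t range.
inline std::int16_t sat16(std::int32_t wide) {
    assert(wide >= -65536 && wide <= 65534);  // range of int16 + int16
    wide = std::max<std::int32_t>(wide, INT16_MIN);
    wide = std::min<std::int32_t>(wide, INT16_MAX);
    return static_cast<std::int16_t>(wide);
}

// Element-wise saturating add: r[i] = clamp(a[i] + b[i]).
void addsat16(Vec16& r, const Vec16& a, const Vec16& b) {
    for (int i = 0; i < Vec16::length; ++i) {
        r.data[i] = sat16(std::int32_t(a.data[i]) + std::int32_t(b.data[i]));
    }
}
```

If the pattern is not recognized, the SSE2 intrinsic _mm_adds_epi16 expresses the same operation explicitly on x86.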
I'm trying to learn how to exploit vectorization with gcc. I followed this tutorial by Erik Holk (with source code here).
I just modified it to use double. I used this dot product to compute the multiplication of randomly generated 1200x1200 square matrices of doubles (300x300 double4). I checked that the results are the same. But what really surprised me is that the simple dot product was actually 10x faster than my manually vectorized one.
Maybe double4 is too big for SSE (would it need AVX2?). But I would expect that even when gcc cannot find a suitable instruction for handling a double4 at once, it could still exploit the explicit information that the data come in big chunks for auto-vectorization.
Details:
The results were:
dot_simple:
time elapsed 1.90000 [s] for 1.728000e+09 evaluations => 9.094737e+08 [ops/s]
dot_SSE:
time elapsed 15.78000 [s] for 1.728000e+09 evaluations => 1.095057e+08 [ops/s]
I used gcc 4.6.3 on an Intel® Core™ i5 CPU 750 @ 2.67GHz × 4 with these options: -std=c99 -O3 -ftree-vectorize -unroll-loops --param max-unroll-times=4 -ffast-math, or with just -O2 (the result was the same).
I did it using python/scipy.weave() for convenience, but I hope it doesn't change anything
The code:
double dot_simple( int n, double *a, double *b ){
double dot = 0;
for (int i=0; i<n; i++){
dot += a[i]*b[i];
}
return dot;
}
and the one explicitly using gcc vector extensions:
double dot_SSE( int n, double *a, double *b ){
const int VECTOR_SIZE = 4;
typedef double double4 __attribute__ ((vector_size (sizeof(double) * VECTOR_SIZE)));
double4 sum4 = {0};
double4* a4 = (double4 *)a;
double4* b4 = (double4 *)b;
for (int i=0; i<n; i++){
sum4 += *a4 * *b4 ;
a4++; b4++;
//sum4 += a4[i] * b4[i];
}
union { double4 sum4_; double sum[VECTOR_SIZE]; };
sum4_ = sum4;
return sum[0]+sum[1]+sum[2]+sum[3];
}
Then I used it to multiply the 300x300 random matrices to measure performance:
void mmul( int n, double* A, double* B, double* C ){
int n4 = n*4;
for (int i=0; i<n4; i++){
for (int j=0; j<n4; j++){
double* Ai = A + n4*i;
double* Bj = B + n4*j;
C[ i*n4 + j ] = dot_SSE( n, Ai, Bj );
//C[ i*n4 + j ] = dot_simple( n4, Ai, Bj );
ijsum++;
}
}
}
scipy weave code:
def mmul_2(A, B, C, __force__=0 ):
code = r''' mmul( NA[0]/4, A, B, C ); '''
weave_options = {
'extra_compile_args': ['-std=c99 -O3 -ftree-vectorize -unroll-loops --param max-unroll-times=4 -ffast-math'],
'compiler' : 'gcc', 'force' : __force__ }
return weave.inline(code, ['A','B','C'], verbose=3, headers=['"vectortest.h"'],include_dirs=['.'], **weave_options )
One of the main problems is that in your function dot_SSE you loop over n items when you should only loop over n/2 items (or n/4 with AVX).
To fix this with GCC's vector extensions you can do this:
double dot_double2(int n, double *a, double *b ) {
typedef double double2 __attribute__ ((vector_size (16)));
double2 sum2 = {};
int i;
double2* a2 = (double2*)a;
double2* b2 = (double2*)b;
for(i=0; i<n/2; i++) {
sum2 += a2[i]*b2[i];
}
double dot = sum2[0] + sum2[1];
for(i*=2;i<n; i++) dot +=a[i]*b[i];
return dot;
}
The other problem with your code is that it has a dependency chain. Your CPU can do a simultaneous SSE addition and multiplication but only for independent data paths. To fix this you need to unroll the loop. The following code unrolls the loop by 2 (but you probably need to unroll by three for the best results).
double dot_double2_unroll2(int n, double *a, double *b ) {
typedef double double2 __attribute__ ((vector_size (16)));
double2 sum2_v1 = {};
double2 sum2_v2 = {};
int i;
double2* a2 = (double2*)a;
double2* b2 = (double2*)b;
for(i=0; i<n/4; i++) {
sum2_v1 += a2[2*i+0]*b2[2*i+0];
sum2_v2 += a2[2*i+1]*b2[2*i+1];
}
double dot = sum2_v1[0] + sum2_v1[1] + sum2_v2[0] + sum2_v2[1];
for(i*=4;i<n; i++) dot +=a[i]*b[i];
return dot;
}
Here is a version using double4 which I think is really what you wanted with your original dot_SSE function. It's ideal for AVX (though it still needs to be unrolled) but it will still work with SSE2 as well. In fact with SSE it seems GCC breaks it into two chains which effectively unrolls the loop by 2.
double dot_double4(int n, double *a, double *b ) {
typedef double double4 __attribute__ ((vector_size (32)));
double4 sum4 = {};
int i;
double4* a4 = (double4*)a;
double4* b4 = (double4*)b;
for(i=0; i<n/4; i++) {
sum4 += a4[i]*b4[i];
}
double dot = sum4[0] + sum4[1] + sum4[2] + sum4[3];
for(i*=4;i<n; i++) dot +=a[i]*b[i];
return dot;
}
If you compile this with FMA it will generate FMA3 instructions. I tested all these functions here (you can edit and compile the code yourself as well) http://coliru.stacked-crooked.com/a/273268902c76b116
Note that using SSE/AVX for a single dot product in matrix multiplication is not the optimal use of SIMD. You should do two (four) dot products at once with SSE (AVX) for double-precision floating point.
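A minimal sketch of the "two dot products at once" idea with GCC's vector extensions, assuming B is stored row-major with row stride ldb so that two neighbouring columns sit side by side in memory. dot_pair and its parameters are hypothetical names, and the vectors are built from scalars to sidestep alignment questions:

```cpp
#include <cassert>

typedef double double2 __attribute__((vector_size(16)));

// Computes two dot products in one pass: row `a` (length n) against two
// adjacent columns of `b`. Each vector lane accumulates a different dot
// product, so no horizontal add is needed at the end.
void dot_pair(int n, const double* a, const double* b, int ldb, double* out) {
    assert(n >= 0 && ldb >= 2);
    double2 sum = {0.0, 0.0};
    for (int k = 0; k < n; ++k) {
        const double2 ak = {a[k], a[k]};                    // broadcast a[k]
        const double2 bk = {b[k * ldb], b[k * ldb + 1]};    // two columns of row k
        sum += ak * bk;
    }
    out[0] = sum[0];
    out[1] = sum[1];
}
```

In a matrix multiply this would be called once per pair of output columns; the AVX analogue would use a 32-byte vector type and four columns, and as with the functions above it would still benefit from unrolling to break the dependency chain.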