Short of checking the actual assembly produced, is there any way to determine what platform-specific instructions will be utilised by OpenMP, for a given use case?
For example, I've identified pcmpeqq i.e. 64-bit integer word equality (SSE 4.1) as the desirable instruction rather than pcmpeqd i.e. 32-bit word equality (SSE 2). Is there any way to know that OpenMP 4.0 will produce the former and not the latter? (spec does not address such specifics.)
The only way to ever guarantee that any compiler will ever emit a particular assembly instruction is to hardcode it. There's no spec in the world that constrains the compiler to generate specific instructions for a given language feature.
Having said that, if support for SSE4.1 or better is specified implicitly or explicitly on the command line, it would greatly surprise me if many compilers emitted SSE2 instructions in situations where the later instructions would work.
Checking the assembly isn't difficult:
$ cat foo.c
#include <stdio.h>
int main(int argc, char **argv) {
const int n=128;
long x[n];
long y[n];
for (int i=0; i<n/2; i++) {
x[i] = y[i] = 1;
x[i+n/2] = 2;
y[i+n/2] = 2;
}
#pragma omp simd
for (int i=0; i<n; i++)
x[i] = (x[i] == y[i]);
for (int i=0; i<n; i++)
printf("%d: %ld\n", i, x[i]);
return 0;
}
$ icc -openmp -msse4.1 -o foo41.s foo.c -S -std=c99 -qopt-report-phase=vec -qopt-report=2
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
$ icc -openmp -msse2 -o foo2.s foo.c -S -std=c99 -qopt-report-phase=vec -qopt-report=2 -o foo2.s
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
And sure enough:
$ grep pcmp foo41.s
pcmpeqq (%rax,%rsi,8), %xmm0 #18.25
$ grep pcmp foo2.s
pcmpeqd (%rax,%rsi,8), %xmm2 #18.25
Related
I'm benchmarking different matrix multiply forms with different optimization levels (for teaching purposes) and I detected a strange behavior in gcc autovectorization. It fails to vectorize when arrays are parameters (see mxmp) but is able to vectorize when arrays are global variables (see mxmg)
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
but behaviour was the same with older gcc versions
Compiling options:
gcc -O3 -mavx2 -mfma
#define N 1024
float A[N][N], B[N][N], C[N][N];
void mxmp(float A[N][N], float B[N][N], float C[N][N]) {
int i,j,k;
for (i=0; i<N; i++)
for (j=0; j<N; j++)
for (k=0; k<N; k++)
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
void mxmg() {
int i,j,k;
for (i=0; i<N; i++)
for (j=0; j<N; j++)
for (k=0; k<N; k++)
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
main(){
mxmg();
mxmp(A, B, C);
}
I expected the compiler to do the same in both functions however mxmp requires about 10 times the execution time of mxmg. Exploring the assembly code it just happens that gcc is able to autovectorize mxmg (when arrays are global variables) but fails to vectorize mxmp (where arrays are parameters).
Tried the same with kij form and it's able to vectorize both functions.
I need help to discover why gcc has this behavior. And how to help gcc (pragmas, compile options, atributes, ...) to properly vectorize mxmp function.
Thanks
When the arrays are global, the compiler can easily see that they are disjoint memory regions. When they are function parameters, you could call mxmp(A,A,A), so it has to assume that writing to C may modify A or B, which could affect later iterations and complicates vectorization. Of course the compiler could inline or do other things to know it in your particular case...
You can explicitly specify the lack of aliasing with restrict:
void mxmp(float A[restrict N][N], float B[restrict N][N], float C[restrict N][N]) {
I have the following 4x4 matrix-vector multiply code:
double const __restrict__ a[16];
double const __restrict__ x[4];
double __restrict__ y[4];
//#pragma GCC unroll 1 - does not work either
#pragma GCC nounroll
for ( int j = 0; j < 4; ++j )
{
double const* __restrict__ aj = a + j * 4;
double const xj = x[j];
#pragma GCC ivdep
for ( int i = 0; i < 4; ++i )
{
y[i] += aj[i] * xj;
}
}
I compile with -O3 -mavx flags. The inner loop is vectorized (single FMAD). However, gcc (7.2) keeps unrolling the outer loop 4 times, unless I use -O2 or lower optimization.
Is there a way to override -O3 unrolling of a particular loop?
NB. Similar #pragma nounroll works if I use Intel icc.
According to the documentation, #pragma GCC unroll 1 is supposed to work, if you place it just so. If it doesn't then you should submit a bug report.
Alternatively, you can use a function attribute to set optimizations, I think:
void myfn () __attribute__((optimize("no-unroll-loops")));
For concise functions
sans full and partial loop unrolling
when required
the following function attribute
please try.
__attribute__((optimize("Os")))
I am trying to use lapack functions from C.
Here is some test code, copied from this question
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include "clapack.h"
#include "cblas.h"
void invertMatrix(float *a, unsigned int height){
int info, ipiv[height];
info = clapack_sgetrf(CblasColMajor, height, height, a, height, ipiv);
info = clapack_sgetri(CblasColMajor, height, a, height, ipiv);
}
void displayMatrix(float *a, unsigned int height, unsigned int width)
{
int i, j;
for(i = 0; i < height; i++){
for(j = 0; j < width; j++)
{
printf("%1.3f ", a[height*j + i]);
}
printf("\n");
}
printf("\n");
}
int main(int argc, char *argv[])
{
int i;
float a[9], b[9], c[9];
srand(time(NULL));
for(i = 0; i < 9; i++)
{
a[i] = 1.0f*rand()/RAND_MAX;
b[i] = a[i];
}
displayMatrix(a, 3, 3);
return 0;
}
I compile this with gcc:
gcc -o test test.c \
-lblas -llapack -lf2c
n.b.: I've tried those libraries in various orders, I've also tried others libs like latlas, lcblas, lgfortran, etc.
The error message is:
/tmp//cc8JMnRT.o: In function `invertMatrix':
test.c:(.text+0x94): undefined reference to `clapack_sgetrf'
test.c:(.text+0xb4): undefined reference to `clapack_sgetri'
collect2: error: ld returned 1 exit status
clapack.h is found and included (installed as part of atlas). clapack.h includes the offending functions --- so how can they not be found?
The symbols are actually in the library libalapack (found using strings). However, adding -lalapack to the gcc command seems to require adding -lcblas (lots of undefined cblas_* references). Installing cblas automatically uninstalls atlas, which removes clapack.h.
So, this feels like some kind of dependency hell.
I am on FreeBSD 10 amd64, all the relevant libraries seem to be installed and on the right paths.
Any help much appreciated.
Thanks
Ivan
I uninstalled everything remotely relevant --- blas, cblas, lapack, atlas, etc. --- then reinstalled atlas (from ports) alone, and then the lapack and blas packages.
This time around, /usr/local/lib contained a new lib file: libcblas.so --- previous random installations must have deleted it.
The gcc line that compiles is now:
gcc -o test test.c \
-llapack -lblas -lalapack -lcblas
Changing the order of the -l arguments doesn't seem to make any difference.
I am not the best when it comes to compiling/writing makefiles.
I am trying to write a program that uses both GSL and OpenMP.
I have no problem using GSL and OpenMP separately, but I'm having issues using both. For instance, I can compile the GSL program
http://www.gnu.org/software/gsl/manual/html_node/An-Example-Program.html
By typing
$gcc -c Bessel.c
$gcc Bessel.o -lgsl -lgslcblas -lm
$./a.out
and it works.
I was also able to compile the program that uses OpenMP that I found here:
Starting a thread for each inner loop in OpenMP
In this case I typed
$gcc -fopenmp test_omp.c
$./a.out
And I got what I wanted (all 4 threads I have were used).
However, when I simply write a program that combines the two codes
#include <stdio.h>
#include <gsl/gsl_sf_bessel.h>
#include <omp.h>
int
main (void)
{
double x = 5.0;
double y = gsl_sf_bessel_J0 (x);
printf ("J0(%g) = %.18e\n", x, y);
int dimension = 4;
int i = 0;
int j = 0;
#pragma omp parallel private(i, j)
for (i =0; i < dimension; i++)
for (j = 0; j < dimension; j++)
printf("i=%d, jjj=%d, thread = %d\n", i, j, omp_get_thread_num());
return 0;
}
Then I try to compile to typing
$gcc -c Bessel_omp_test.c
$gcc Bessel_omp_test.o -fopenmp -lgsl -lgslcblas -lm
$./a.out
The GSL part works (The Bessel function is computed), but only one thread is used for the OpenMP part. I'm not sure what's wrong here...
You missed the worksharing directive for in your OpenMP part. It should be:
// Just in case GSL modifies the number of threads
omp_set_num_threads(omp_get_max_threads());
omp_set_dynamic(0);
#pragma omp parallel for private(i, j)
for (i =0; i < dimension; i++)
for (j = 0; j < dimension; j++)
printf("i=%d, jjj=%d, thread = %d\n", i, j, omp_get_thread_num());
Edit: To summarise the discussion in the comments below, the OP failed to supply -fopenmp during the compilation phase. That prevented GCC from recognising the OpenMP directives and thus no paralle code was generated.
IMHO, it's incorrect to declare the variables i and j as shared. Try declaring them private. Otherwise, each thread would get the same j and j++ would generate a race condition among threads.
This is my first try on OpenMP, but cannot get speedup on it. The machine is Linux amd_64.
I coded the following code:
printf ("nt = %d\n", nt);
omp_set_num_threads(nt);
int i, j, s;
#pragma omp parallel for private(j,s)
for (i=0; i<10000; i++)
{
for (j=0; j<100000; j++)
{
s++;
}
}
And the compile with
g++ tempomp.cpp -o tomp -lgomp
And run it with different nthreads, no speedup:
nt = 1
elapsed time =2.670000
nt = 2
elapsed time =2.670000
nt = 12
elapsed time =2.670000
Any ideas?
I think you need to add the flag -fopenmp to your compiler:
g++ tempomp.cpp -o tomp -lgomp -fopenmp
When -fopenmp is used, the compiler will generate parallel code
based on the OpenMP directives encountered.
-lgomp loads libraries of the Gnu OpenMP Project.
How many cores do your machine have?