Compiling using GSL and OpenMP - gcc

I am not the best when it comes to compiling/writing makefiles.
I am trying to write a program that uses both GSL and OpenMP.
I have no problem using GSL and OpenMP separately, but I'm having issues using both. For instance, I can compile the GSL program
http://www.gnu.org/software/gsl/manual/html_node/An-Example-Program.html
by typing
$gcc -c Bessel.c
$gcc Bessel.o -lgsl -lgslcblas -lm
$./a.out
and it works.
I was also able to compile the program that uses OpenMP that I found here:
Starting a thread for each inner loop in OpenMP
In this case I typed
$gcc -fopenmp test_omp.c
$./a.out
And I got what I wanted (all 4 threads I have were used).
However, when I simply write a program that combines the two codes
#include <stdio.h>
#include <gsl/gsl_sf_bessel.h>
#include <omp.h>

int
main (void)
{
    double x = 5.0;
    double y = gsl_sf_bessel_J0 (x);
    printf ("J0(%g) = %.18e\n", x, y);

    int dimension = 4;
    int i = 0;
    int j = 0;

    #pragma omp parallel private(i, j)
    for (i = 0; i < dimension; i++)
        for (j = 0; j < dimension; j++)
            printf("i=%d, jjj=%d, thread = %d\n", i, j, omp_get_thread_num());

    return 0;
}
Then I try to compile by typing
$gcc -c Bessel_omp_test.c
$gcc Bessel_omp_test.o -fopenmp -lgsl -lgslcblas -lm
$./a.out
The GSL part works (The Bessel function is computed), but only one thread is used for the OpenMP part. I'm not sure what's wrong here...

You missed the for worksharing directive in your OpenMP part. It should be:
// Just in case GSL modifies the number of threads
omp_set_num_threads(omp_get_max_threads());
omp_set_dynamic(0);

#pragma omp parallel for private(i, j)
for (i = 0; i < dimension; i++)
    for (j = 0; j < dimension; j++)
        printf("i=%d, jjj=%d, thread = %d\n", i, j, omp_get_thread_num());
Edit: To summarise the discussion in the comments below, the OP failed to supply -fopenmp during the compilation phase. That prevented GCC from recognising the OpenMP directives, and thus no parallel code was generated.
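With the two-step build from the question, that means -fopenmp has to appear in both invocations (compile and link), for example:
$gcc -fopenmp -c Bessel_omp_test.c
$gcc -fopenmp Bessel_omp_test.o -lgsl -lgslcblas -lm
$./a.out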

IMHO, it's incorrect to declare the variables i and j as shared. Try declaring them private. Otherwise, each thread would get the same j and j++ would generate a race condition among threads.
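Putting both pieces of advice together, a minimal sketch of the fixed loop (assuming dimension as declared in the question) could look like this, with the index variables declared inside the for statements so they are private automatically:
#pragma omp parallel for
for (int i = 0; i < dimension; i++)
    for (int j = 0; j < dimension; j++)
        printf("i=%d, j=%d, thread = %d\n", i, j, omp_get_thread_num());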

Related

Gcc autovectorization weird behaviour in matrix multiply when arrays are function parameters

I'm benchmarking different matrix multiply forms with different optimization levels (for teaching purposes) and I detected a strange behavior in gcc autovectorization. It fails to vectorize when arrays are parameters (see mxmp) but is able to vectorize when arrays are global variables (see mxmg).
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
but behaviour was the same with older gcc versions
Compiling options:
gcc -O3 -mavx2 -mfma
#define N 1024

float A[N][N], B[N][N], C[N][N];

void mxmp(float A[N][N], float B[N][N], float C[N][N]) {
    int i, j, k;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
}

void mxmg() {
    int i, j, k;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
}

int main() {
    mxmg();
    mxmp(A, B, C);
}
I expected the compiler to do the same in both functions; however, mxmp requires about 10 times the execution time of mxmg. Exploring the assembly code, it turns out that gcc is able to autovectorize mxmg (when the arrays are global variables) but fails to vectorize mxmp (where the arrays are parameters).
I tried the same with the kij form and gcc is able to vectorize both functions.
I need help to discover why gcc behaves this way, and how to help gcc (pragmas, compile options, attributes, ...) to properly vectorize the mxmp function.
Thanks
When the arrays are global, the compiler can easily see that they are disjoint memory regions. When they are function parameters, you could call mxmp(A,A,A), so it has to assume that writing to C may modify A or B, which could affect later iterations and complicates vectorization. Of course, the compiler could inline the call or do other analysis to prove this in your particular case...
You can explicitly specify the lack of aliasing with restrict:
void mxmp(float A[restrict N][N], float B[restrict N][N], float C[restrict N][N]) {
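As a quick sketch (not from the original answer), the whole restrict-qualified function would then read as follows; gcc's -fopt-info-vec output can be used to confirm whether the loops were actually vectorized:
void mxmp(float A[restrict N][N], float B[restrict N][N], float C[restrict N][N]) {
    int i, j, k;
    /* With restrict, the compiler may assume A, B and C do not overlap,
       so the inner loops become candidates for autovectorization. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
}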

Disable unrolling of a particular loop in GCC

I have the following 4x4 matrix-vector multiply code:
double const __restrict__ a[16];
double const __restrict__ x[4];
double __restrict__ y[4];

//#pragma GCC unroll 1 - does not work either
#pragma GCC nounroll
for ( int j = 0; j < 4; ++j )
{
    double const* __restrict__ aj = a + j * 4;
    double const xj = x[j];
    #pragma GCC ivdep
    for ( int i = 0; i < 4; ++i )
    {
        y[i] += aj[i] * xj;
    }
}
I compile with -O3 -mavx flags. The inner loop is vectorized (single FMAD). However, gcc (7.2) keeps unrolling the outer loop 4 times, unless I use -O2 or lower optimization.
Is there a way to override -O3 unrolling of a particular loop?
N.B.: a similar #pragma nounroll works if I use Intel icc.
According to the documentation, #pragma GCC unroll 1 is supposed to work, if you place it just so. If it doesn't then you should submit a bug report.
Alternatively, you can use a function attribute to set optimizations, I think:
void myfn () __attribute__((optimize("no-unroll-loops")));
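For instance (a sketch only; the function name mvmul4 is made up here and is not from the original post), attaching the attribute to the definition that contains the loop might look like:
__attribute__((optimize("no-unroll-loops")))
void mvmul4( double const* __restrict__ a,
             double const* __restrict__ x,
             double* __restrict__ y )
{
    for ( int j = 0; j < 4; ++j )
    {
        double const* __restrict__ aj = a + j * 4;
        double const xj = x[j];
        for ( int i = 0; i < 4; ++i )
            y[i] += aj[i] * xj;
    }
}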
For concise functions where full and partial loop unrolling should be suppressed, you can try the following function attribute:
__attribute__((optimize("Os")))

Knowing what SIMD instructions OpenMP 4.0 will produce?

Short of checking the actual assembly produced, is there any way to determine what platform-specific instructions will be utilised by OpenMP, for a given use case?
For example, I've identified pcmpeqq i.e. 64-bit integer word equality (SSE 4.1) as the desirable instruction rather than pcmpeqd i.e. 32-bit word equality (SSE 2). Is there any way to know that OpenMP 4.0 will produce the former and not the latter? (spec does not address such specifics.)
The only way to ever guarantee that any compiler will ever emit a particular assembly instruction is to hardcode it. There's no spec in the world that constrains the compiler to generate specific instructions for a given language feature.
Having said that, if support for SSE4.1 or better is specified implicitly or explicitly on the command line, it would greatly surprise me if many compilers emitted SSE2 instructions in situations where the later instructions would work.
Checking the assembly isn't difficult:
$ cat foo.c
#include <stdio.h>

int main(int argc, char **argv) {
    const int n=128;
    long x[n];
    long y[n];

    for (int i=0; i<n/2; i++) {
        x[i] = y[i] = 1;
        x[i+n/2] = 2;
        y[i+n/2] = 2;
    }

    #pragma omp simd
    for (int i=0; i<n; i++)
        x[i] = (x[i] == y[i]);

    for (int i=0; i<n; i++)
        printf("%d: %ld\n", i, x[i]);

    return 0;
}
$ icc -openmp -msse4.1 -o foo41.s foo.c -S -std=c99 -qopt-report-phase=vec -qopt-report=2
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
$ icc -openmp -msse2 -o foo2.s foo.c -S -std=c99 -qopt-report-phase=vec -qopt-report=2
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
And sure enough:
$ grep pcmp foo41.s
pcmpeqq (%rax,%rsi,8), %xmm0 #18.25
$ grep pcmp foo2.s
pcmpeqd (%rax,%rsi,8), %xmm2 #18.25
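The same kind of check should work with gcc (a sketch, not from the original answer; the instruction selection can of course differ between compilers and versions):
$ gcc -fopenmp -msse4.1 -O2 -std=c99 -S -o foo41_gcc.s foo.c
$ gcc -fopenmp -msse2 -O2 -std=c99 -S -o foo2_gcc.s foo.c
$ grep pcmp foo41_gcc.s foo2_gcc.s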

Failing to link c code to lapack: undefined reference

I am trying to use lapack functions from C.
Here is some test code, copied from this question
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include "clapack.h"
#include "cblas.h"

void invertMatrix(float *a, unsigned int height)
{
    int info, ipiv[height];
    info = clapack_sgetrf(CblasColMajor, height, height, a, height, ipiv);
    info = clapack_sgetri(CblasColMajor, height, a, height, ipiv);
}

void displayMatrix(float *a, unsigned int height, unsigned int width)
{
    int i, j;
    for(i = 0; i < height; i++){
        for(j = 0; j < width; j++)
        {
            printf("%1.3f ", a[height*j + i]);
        }
        printf("\n");
    }
    printf("\n");
}

int main(int argc, char *argv[])
{
    int i;
    float a[9], b[9], c[9];
    srand(time(NULL));
    for(i = 0; i < 9; i++)
    {
        a[i] = 1.0f*rand()/RAND_MAX;
        b[i] = a[i];
    }
    displayMatrix(a, 3, 3);
    return 0;
}
I compile this with gcc:
gcc -o test test.c \
-lblas -llapack -lf2c
n.b.: I've tried those libraries in various orders, and I've also tried other libs like latlas, lcblas, lgfortran, etc.
The error message is:
/tmp//cc8JMnRT.o: In function `invertMatrix':
test.c:(.text+0x94): undefined reference to `clapack_sgetrf'
test.c:(.text+0xb4): undefined reference to `clapack_sgetri'
collect2: error: ld returned 1 exit status
clapack.h is found and included (installed as part of atlas). clapack.h declares the offending functions --- so how can they not be found?
The symbols are actually in the library libalapack (found using strings). However, adding -lalapack to the gcc command seems to require adding -lcblas (lots of undefined cblas_* references). Installing cblas automatically uninstalls atlas, which removes clapack.h.
So, this feels like some kind of dependency hell.
I am on FreeBSD 10 amd64, all the relevant libraries seem to be installed and on the right paths.
Any help much appreciated.
Thanks
Ivan
I uninstalled everything remotely relevant --- blas, cblas, lapack, atlas, etc. --- then reinstalled atlas (from ports) alone, and then the lapack and blas packages.
This time around, /usr/local/lib contained a new lib file: libcblas.so --- previous random installations must have deleted it.
The gcc line that compiles is now:
gcc -o test test.c \
-llapack -lblas -lalapack -lcblas
Changing the order of the -l arguments doesn't seem to make any difference.
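If it is ever unclear which installed library actually exports the symbols, a check along these lines can confirm it (the library path here is an assumption; adjust it to your system):
$ nm -D /usr/local/lib/libalapack.so | grep clapack_sgetrf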

Open MP with gsl_matrix is slower than sequential

I've created a simple C program using GSL (the GNU Scientific Library) and OpenMP. In this simple program, I want to test the execution time for the sequential and parallel versions. Here is the program, main.c.
#include "omp.h"
#include <stdio.h>
#include <gsl/gsl_matrix.h>
#include <time.h>
int main()
{
omp_set_num_threads(4);
int n1=10000, n2=10000;
gsl_matrix *A = gsl_matrix_alloc(n1, n2);
int i,j;
struct timeval tv1, tv2, tv3, tv4;
gettimeofday(&tv1, 0);
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}
gettimeofday(&tv2, 0);
long elapsed = (tv2.tv_sec-tv1.tv_sec)*1000000 + tv2.tv_usec-tv1.tv_usec;
printf("Sequential Duration:%ldms\n", elapsed);
gettimeofday(&tv3, 0);
#pragma omp parallel for private(i,j)
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}
gettimeofday(&tv4, 0);
elapsed = (tv4.tv_sec-tv3.tv_sec)*1000000 + tv4.tv_usec-tv3.tv_usec;
printf(" Parallel Duration:%ldms\n", elapsed);
return 0;
}
Then I compiled the above code, using this command:
gcc -fopenmp main.c -o test -lgsl -lgslcblas -lm
Here is the program's result:
Sequential Duration:11980106ms
Parallel Duration:20624043ms
Why is the parallel part slower than the sequential part? How can I optimize this code? Thanks
As you have written it, the j variable is shared between all threads, so the threads are constantly overwriting each other's state, leading to them iterating over values they have already covered.
You should always minimize the scope of variables when trying to parallelize with OpenMP. Either move the declaration of j into the loop or mark it as private explicitly:
#pragma omp parallel for private(j)
Also, clock counts the processor time, not the real time; you probably want to use gettimeofday.
Your matrix is too small to benefit much from parallelization; the threading overhead will dominate. Increase it to ~10000x10000 to start seeing something.
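Following the scoping advice above, a minimal sketch of the parallel loop with the index variables declared inside the loops (so they are private without an explicit clause) might look like this:
#pragma omp parallel for
for (int i = 0; i < n1; i++)
{
    for (int j = 0; j < n2; j++)
    {
        gsl_matrix_set(A, i, j, i*j*1000000);
    }
}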
The problem here is that you do not know what the procedure gsl_matrix_set does with A. You do not know if it is thread safe. To change one element in that matrix, you supply the whole matrix to the routine instead of only the indices of the element. This smells like false sharing (see e.g. this answer).
I would try this instead
gsl_matrix_set(A[i][j],i*j*1000000);
If that does not work and what you are interested in is only the time difference between serial and parallel I would just do
A[i][j] = i*j*1000000
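Note that a gsl_matrix pointer cannot be indexed as A[i][j]; if direct element access is wanted for the timing comparison, a sketch using the matrix's data pointer and row stride (the data and tda fields of gsl_matrix), assuming the usual dense layout, would be:
A->data[i * A->tda + j] = i * j * 1000000;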
In the thread part, try this:
#pragma omp parallel private(i,j)
for(i=0; i<n1; i++)
{
    for(j=0; j<n2; j++)
    {
        gsl_matrix_set(A, i, j, i*j*1000000);
    }
}
or
#pragma omp parallel for
for(i=0; i<n1; i++)
{
    for(j=0; j<n2; j++)
    {
        gsl_matrix_set(A, i, j, i*j*1000000);
    }
}
