Unexplainable LLVM vs GCC OpenMP differences - gcc

Consider the following matrix multiplication code:
#define BLOCKING 64

void mat_mult_ijk_blocked(
    const int m,
    const int n,
    const int p,
    real a[restrict m][p],
    const real b[m][n],
    const real c[n][p]) {

  for(int i=0; i<m; i++) {
    for(int k=0; k<p; k++) {
      a[i][k] = 0.0;
    }
  }

  #pragma omp parallel for
  for(int block_i=0; block_i<m; block_i += BLOCKING) {
    for(int block_j=0; block_j<n; block_j += BLOCKING) {
      for(int i=block_i; i<min(block_i + BLOCKING, m); i++) {
        for(int j=block_j; j<min(block_j + BLOCKING, n); j++) {
          real w = b[i][j];
          for(int k=0; k<p; k++) {
            a[i][k] += w * c[j][k];
          }
        }
      }
    }
  }
}
For 1800×2200×1400 multiplication:
With clang without -fopenmp, the code takes 3.2s on my machine.
With -fopenmp:
OMP_NUM_THREADS=4 1.6s
OMP_NUM_THREADS=3 2.1s
OMP_NUM_THREADS=2 1.6s
OMP_NUM_THREADS=1 2.6s
This machine has 2 cores with hyperthreading, so this seems reasonable.
Now with gcc-5.4.0:
without -fopenmp 2.8s
with -fopenmp
OMP_NUM_THREADS=4 5.1s
OMP_NUM_THREADS=3 4.6s
OMP_NUM_THREADS=2 4.9s
OMP_NUM_THREADS=1 10s
What am I doing wrong that gives such poor performance with gcc?
The performance with a single OpenMP thread is considerably worse than that of the original single-threaded code.
UPDATE - with gcc-6.2
without -fopenmp 3s
with -fopenmp
OMP_NUM_THREADS=4 2.23s
OMP_NUM_THREADS=3 1.89s
OMP_NUM_THREADS=2 1.61s
OMP_NUM_THREADS=1 2.9s
Clearly, the results are now similar to those with LLVM. What is the reason for this?
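For context, a minimal stand-alone driver along these lines reproduces the measurements; the typedef of real, the min helper, and the test data are assumptions, since they are not shown in the question:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef double real;                              /* assumption: real == double */
#define min(a, b) ((a) < (b) ? (a) : (b))         /* assumption: simple min helper */

void mat_mult_ijk_blocked(const int m, const int n, const int p,
                          real a[restrict m][p], const real b[m][n], const real c[n][p]);

int main(void) {
    const int m = 1800, n = 2200, p = 1400;
    real (*a)[p] = malloc(sizeof(real[m][p]));
    real (*b)[n] = malloc(sizeof(real[m][n]));
    real (*c)[p] = malloc(sizeof(real[n][p]));
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            b[i][j] = (real)(i + j);              /* arbitrary test data */
    for (int j = 0; j < n; j++)
        for (int k = 0; k < p; k++)
            c[j][k] = (real)(j - k);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);          /* works with and without -fopenmp */
    mat_mult_ijk_blocked(m, n, p, a, b, c);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("elapsed: %.2f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9);

    free(a); free(b); free(c);
    return 0;
}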

Related

Why do I get 128 teams/thread blocks and 96 threads in each teams/thread blocks using #pragma omp target team distribute parallel for in OpenMP?

I am running this code on Ubuntu 18.04 with the clang/llvm compiler and an Nvidia GTX 1070 GPU:
#pragma omp target data map(to: A,B) map(from: C)
{
    #pragma omp target teams distribute
    for(int n=0; n<Row; n++)
    {
        int team_id = omp_get_team_num();
        #pragma omp parallel for default(shared) schedule(auto)
        for(int j = 0; j < Col; j++)
        {
            int thread_id = omp_get_thread_num();
            printf("Iteration= c[ %d ][ %d ], Team=%d, Thread=%d\n", n, j, team_id, thread_id);
            C[n][j] = A[n][j] + B[n][j];
        }
    }
}
In the above code, the maximum team number printed is 127 and the maximum thread number is 95 (i.e. 128 teams and 96 threads per team).
compile flags: clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_61 -Wall -O3 debug.cpp -o debug
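If the goal is to control those counts rather than accept the runtime defaults, OpenMP provides the num_teams and thread_limit clauses (plus num_threads on the inner parallel region). A minimal sketch, placed inside the same target data region as above, with purely illustrative values that the runtime and the GPU may still cap:

/* Sketch: request an explicit launch geometry; the implementation may
   still grant fewer teams/threads than asked for. */
#pragma omp target teams distribute num_teams(256) thread_limit(128)
for (int n = 0; n < Row; n++)
{
    #pragma omp parallel for num_threads(128)
    for (int j = 0; j < Col; j++)
    {
        C[n][j] = A[n][j] + B[n][j];
    }
}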

Matrix Multiplication OpenMP Counter-Intuitive Results

I am currently porting some code over to OpenMP at my place of work. One of the tasks I am doing is figuring out how to speed up matrix multiplication for one of our applications.
The matrices are stored in row-major format, so A[i*cols +j] gives the A_i_j element of the matrix A.
The code looks like this (uncommenting the pragma parallelises the code):
#include <omp.h>
#include <iostream>
#include <iomanip>
#include <stdio.h>

#define NUM_THREADS 8
#define size 500
#define num_iter 10

int main (int argc, char *argv[])
{
    // omp_set_num_threads(NUM_THREADS);
    int *A = new int [size*size];
    int *B = new int [size*size];
    int *C = new int [size*size];

    for (int i=0; i<size; i++)
    {
        for (int j=0; j<size; j++)
        {
            A[i*size+j] = j*1;
            B[i*size+j] = i*j+2;
            C[i*size+j] = 0;
        }
    }

    double total_time = 0;
    double start = 0;
    for (int t=0; t<num_iter; t++)
    {
        start = omp_get_wtime();
        int i, k;
        // #pragma omp parallel for num_threads(10) private(i, k) collapse(2) schedule(dynamic)
        for (int j=0; j<size; j++)
        {
            for (i=0; i<size; i++)
            {
                for (k=0; k<size; k++)
                {
                    C[i*size+j] += A[i*size+k] * B[k*size+j];
                }
            }
        }
        total_time += omp_get_wtime() - start;
    }

    std::cout << std::setprecision(5) << total_time/num_iter << std::endl;

    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}
What is confusing me is the following: why is dynamic scheduling faster than static scheduling for this task? Timing the runs and taking an average shows that static scheduling is slower, which to me is a bit counterintuitive since each thread is doing the same amount of work.
Also, am I correctly speeding up my matrix multiplication code?
Parallel matrix multiplication is non-trivial (have you even considered cache-blocking?). Your best bet is likely to use a BLAS library for this, rather than writing it yourself. (Remember, "The best code is the code I do not have to write".)
Wikipedia: Basic Linear Algebra Subprograms points to many implementations, a lot of which (including Intel Math Kernel Library) have free licenses.
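To illustrate the BLAS route, here is a minimal sketch using the CBLAS interface; it assumes double-precision matrices rather than the int matrices above, and links against whichever CBLAS implementation is installed (e.g. OpenBLAS or MKL):

#include <cblas.h>

/* C = A * B for row-major n x n matrices of doubles. */
void matmul_blas(const double *A, const double *B, double *C, int n)
{
    /* alpha = 1, beta = 0: overwrite C with A*B. The library handles
       blocking, vectorisation and (in threaded builds) parallelism. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,
                     B, n,
                0.0, C, n);
}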

Using "unsigned long long" as iteration-range in for-loop using OpenMP

If I do this it works fine:
#pragma omp parallel for
for (int i = 1; i <= 200; i++) { ... }
this still works fine
#pragma omp parallel for
for (unsigned long long i = 1; i <= 200; i++) { ... }
but this isn't working:
#pragma omp parallel for
for (unsigned long long i = 1; i <= LLONG_MAX; i++) { ... }
-> compiler error: invalid controlling predicate
LLONG_MAX is coming from
#include <limits.h>
g++ --version -> g++ (tdm64-1) 5.1.0
It is said that OpenMP 3.0 can handle unsigned integer types.
I searched a lot for this issue without success; the examples I found all use int as the iteration variable.
Does someone know a solution?
I changed the program to:
unsigned long long n = ULLONG_MAX;
#pragma omp parallel for
for (unsigned long long i = 1; i < n; i++) { ... }
It seems to work now. Thank you, Jeff, for the hint.
I tried before with:
for (auto i = 1; i < n; i++) { ... }
-> no error, but the loop didn't produce any output, which seemed very strange. (With auto, i is deduced as int from the literal 1, so the loop does not iterate over the intended unsigned long long range at all, and overflowing the int counter is undefined behaviour.)
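For completeness, the full 1 .. LLONG_MAX range can still be covered while keeping the strict '<' form that worked above. A minimal sketch; upper is an assumed helper variable and the body is elided as in the question:

#include <limits.h>

/* Keep the bound in a variable and use '<' so the loop stays in the
   canonical form that the compiler accepted above. */
unsigned long long upper = (unsigned long long)LLONG_MAX + 1ULL;  /* one past LLONG_MAX */
#pragma omp parallel for
for (unsigned long long i = 1; i < upper; i++) { /* ... */ }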

Open MP with gsl_matrix is slower than sequential

I've created a simple C program using GSL (the GNU Scientific Library) and OpenMP. In this simple program, I want to compare the execution time of the sequential and the parallel version. Here is the program, main.c:
#include "omp.h"
#include <stdio.h>
#include <gsl/gsl_matrix.h>
#include <time.h>
int main()
{
omp_set_num_threads(4);
int n1=10000, n2=10000;
gsl_matrix *A = gsl_matrix_alloc(n1, n2);
int i,j;
struct timeval tv1, tv2, tv3, tv4;
gettimeofday(&tv1, 0);
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}
gettimeofday(&tv2, 0);
long elapsed = (tv2.tv_sec-tv1.tv_sec)*1000000 + tv2.tv_usec-tv1.tv_usec;
printf("Sequential Duration:%ldms\n", elapsed);
gettimeofday(&tv3, 0);
#pragma omp parallel for private(i,j)
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}
gettimeofday(&tv4, 0);
elapsed = (tv4.tv_sec-tv3.tv_sec)*1000000 + tv4.tv_usec-tv3.tv_usec;
printf(" Parallel Duration:%ldms\n", elapsed);
return 0;
}
Then I compiled the above code, using this command:
gcc -fopenmp main.c -o test -lgsl -lgslcblas -lm
Here is the program's result:
Sequential Duration:11980106ms
Parallel Duration:20624043ms
Why is the parallel part slower than the sequential part? How can I optimize this code? Thanks.
As you have written it, the j variable is shared between all threads, so the threads constantly overwrite each other's state and end up iterating over values that have already been covered.
You should always minimize the scope of variables when trying to parallelize with OpenMP. Either move the declaration of j into the loop or mark it as private explicitly:
#pragma omp parallel for private(j)
Also, clock() counts processor time, not wall-clock time; you probably want to use gettimeofday().
Your matrix is too small to benefit much from parallelization; the threading overhead will dominate. Increase it to ~10000x10000 to start seeing a difference.
The problem here is that you do not know what the procedure gsl_matrix_set does with A. You do not know whether it is thread safe. To change one element in that matrix you supply the whole matrix to the routine instead of only the indices of the element. This smells of false sharing (see e.g. this answer).
I would try direct element access instead (a gsl_matrix stores its elements in the data array with row stride tda):
A->data[i * A->tda + j] = i*j*1000000;
If that does not work, and all you are interested in is the time difference between serial and parallel, I would just use a plain 2-D array instead of a gsl_matrix:
a[i][j] = i*j*1000000;
In the thread part, try this:
#pragma omp parallel for private(i,j)
for(i=0; i<n1; i++)
{
    for(j=0; j<n2; j++)
    {
        gsl_matrix_set(A, i, j, i*j*1000000);
    }
}
or
#pragma omp parallel for
for(i=0; i<n1; i++)
{
    for(j=0; j<n2; j++)
    {
        gsl_matrix_set(A, i, j, i*j*1000000);
    }
}
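Pulling these suggestions together, here is a minimal corrected sketch (an assumption-laden illustration, not the original program): the loop indices are declared inside the loops so each thread gets its own copies, timing uses omp_get_wtime(), and the stored value is computed in double to avoid the int overflow in i*j*1000000.

#include <omp.h>
#include <stdio.h>
#include <gsl/gsl_matrix.h>

int main(void)
{
    int n1 = 10000, n2 = 10000;
    gsl_matrix *A = gsl_matrix_alloc(n1, n2);

    double t0 = omp_get_wtime();
    /* Each thread writes a distinct block of rows; gsl_matrix_set just
       stores into A->data[i*A->tda + j], so distinct elements are safe. */
    #pragma omp parallel for
    for (int i = 0; i < n1; i++) {
        for (int j = 0; j < n2; j++) {
            gsl_matrix_set(A, i, j, (double)i * j * 1000000.0);
        }
    }
    printf("Parallel fill: %.3f s\n", omp_get_wtime() - t0);

    gsl_matrix_free(A);
    return 0;
}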

Compiling using GSL and OpenMP

I am not the best when it comes to compiling/writing makefiles.
I am trying to write a program that uses both GSL and OpenMP.
I have no problem using GSL and OpenMP separately, but I'm having issues using both. For instance, I can compile the GSL program
http://www.gnu.org/software/gsl/manual/html_node/An-Example-Program.html
By typing
$gcc -c Bessel.c
$gcc Bessel.o -lgsl -lgslcblas -lm
$./a.out
and it works.
I was also able to compile the program that uses OpenMP that I found here:
Starting a thread for each inner loop in OpenMP
In this case I typed
$gcc -fopenmp test_omp.c
$./a.out
And I got what I wanted (all 4 threads I have were used).
However, when I simply write a program that combines the two codes
#include <stdio.h>
#include <gsl/gsl_sf_bessel.h>
#include <omp.h>

int
main (void)
{
    double x = 5.0;
    double y = gsl_sf_bessel_J0 (x);
    printf ("J0(%g) = %.18e\n", x, y);

    int dimension = 4;
    int i = 0;
    int j = 0;
    #pragma omp parallel private(i, j)
    for (i = 0; i < dimension; i++)
        for (j = 0; j < dimension; j++)
            printf("i=%d, jjj=%d, thread = %d\n", i, j, omp_get_thread_num());

    return 0;
}
Then I try to compile by typing
$gcc -c Bessel_omp_test.c
$gcc Bessel_omp_test.o -fopenmp -lgsl -lgslcblas -lm
$./a.out
The GSL part works (The Bessel function is computed), but only one thread is used for the OpenMP part. I'm not sure what's wrong here...
You missed the worksharing directive for in your OpenMP part. It should be:
// Just in case GSL modifies the number of threads
omp_set_num_threads(omp_get_max_threads());
omp_set_dynamic(0);

#pragma omp parallel for private(i, j)
for (i = 0; i < dimension; i++)
    for (j = 0; j < dimension; j++)
        printf("i=%d, jjj=%d, thread = %d\n", i, j, omp_get_thread_num());
Edit: To summarise the discussion in the comments below, the OP failed to supply -fopenmp during the compilation phase. That prevented GCC from recognising the OpenMP directives and thus no parallel code was generated.
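For reference, a compile sequence that passes -fopenmp at both the compile and the link step (assuming the same file name as above) looks like this:
$gcc -fopenmp -c Bessel_omp_test.c
$gcc -fopenmp Bessel_omp_test.o -lgsl -lgslcblas -lm
$./a.out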
IMHO, it's incorrect to declare the variables i and j as shared. Try declaring them private. Otherwise, each thread would get the same j and j++ would generate a race condition among threads.

Resources