error in OpenMP program execution - openmp

I am trying to sum all the elements of an array which is initialized in the same code. As every element is independent from each other, I tried to perform the sum in parallel. My code is shown below:
int main(int argc, char** argv)
{
cout.precision(20);
double sumre=0.,Mre[11];
int n=11;
for(int i=0; i<n; i++)
Mre[i]=2.*exp(-10*M_PI*i/(1.*n));
#pragma omp parallel for reduction(+:sumre)
for(int i=0; i<n; i++)
{
sumre+=Mre[i];
}
cout<<sumre<<"\n";
}
which I compile and run with:
g++ -O3 -o sum sumparallel.cpp -fopenmp
./sum
respectively. My problem is that the output differs every time I run it. Sometimes it gives
2.1220129388411006488
or
2.1220129388411002047
Does anyone have an idea what is happening here?

Some of these comments hint at the problem here, but there could be two distinct issues
Double precision numbers do not have 20 decimal precision
If you want to print the maximum precision of sumre, use something like this
#include <float.h>
int maint(int argc, char* argv[])
{
...
printf("%.*g", DBL_DECIMAL_DIG, number);
return 0;
}
Floating point arithmetic is non-commutative
The effect of this property is roundoff error. In fact, the function that you have defined, a gaussian, is especially prone to roundoff for summation. Considering that workload distribution of OpenMP parallel for is undefined, you may receive different answers when you run it. To get around this you may use the kahan summation algorithm. An implementation with OpenMP would look something like this:
...
double sum = 0.0, c = 0.0;
#pragma omp parallel for reduction(+:sum, +:c)
for(i = 0; i < n; i++)
{
double y = Mre[i] - c;
double t = sum + y;
c = (t - sum) - y;
sum = t;
}
sum = sum - c;
...

Related

Difference between mutual exclusion like atomic and reduction in OpenMP

I'm am following video lectures of Tim Mattson on OpenMP and there was one exercise to find errors in provided code that count area of the Mandelbrot. So here is the solution that was provided:
#define NPOINTS 1000
#define MAXITER 1000
void testpoint(struct d_complex);
struct d_complex{
double r;
double i;
};
struct d_complex c;
int numoutside = 0;
int main(){
int i,j;
double area, error, eps = 1.0e-5;
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps)
for(i = 0; i<NPOINTS; i++){
for(j=0; j < NPOINTS; j++){
c.r = -2.0+2.5*(double)(i)/(double)(NPOINTS)+eps;
c.i = 1.125*(double)(j)/(double)(NPOINTS)+eps;
testpoint(c);
}
}
area=2.0*2.5*1.125*(double)(NPOINTS*NPOINTS-numoutside)/(double)(NPOINTS*NPOINTS);
error=area/(double)NPOINTS;
printf("Area of Mandlebrot set = %12.8f +/- %12.8f\n",area,error);
printf("Correct answer should be around 1.510659\n");
}
void testpoint(struct d_complex c){
// Does the iteration z=z*z+c, until |z| > 2 when point is known to be outside set
// If loop count reaches MAXITER, point is considered to be inside the set
struct d_complex z;
int iter;
double temp;
z=c;
for (iter=0; iter<MAXITER; iter++){
temp = (z.r*z.r)-(z.i*z.i)+c.r;
z.i = z.r*z.i*2+c.i;
z.r = temp;
if ((z.r*z.r+z.i*z.i)>4.0) {
#pragma omp atomic
numoutside++;
break;
}
}
}
The question I have is, could we use reduction in #pragma omp parallel of variable numoutside like:
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps) reduction(+:numoutside)
without atomic construct in testpoint function?
I tested the function without atomic, and the result was different from the one I got in the first place. Why does that happen? And while I understand the concept of mutual exclusion and use of it because of race conditioning, isn't reduction just another form of solving that problem with private variables?
Thank You in advance.

Matrix Multiplication OpenMP Counter-Intuitive Results

I am currently porting some code over to OpenMP at my place of work. One of the tasks I am doing is figuring out how to speed up matrix multiplication for one of our applications.
The matrices are stored in row-major format, so A[i*cols +j] gives the A_i_j element of the matrix A.
The code looks like this (uncommenting the pragma parallelises the code):
#include <omp.h>
#include <iostream>
#include <iomanip>
#include <stdio.h>
#define NUM_THREADS 8
#define size 500
#define num_iter 10
int main (int argc, char *argv[])
{
// omp_set_num_threads(NUM_THREADS);
int *A = new int [size*size];
int *B = new int [size*size];
int *C = new int [size*size];
for (int i=0; i<size; i++)
{
for (int j=0; j<size; j++)
{
A[i*size+j] = j*1;
B[i*size+j] = i*j+2;
C[i*size+j] = 0;
}
}
double total_time = 0;
double start = 0;
for (int t=0; t<num_iter; t++)
{
start = omp_get_wtime();
int i, k;
// #pragma omp parallel for num_threads(10) private(i, k) collapse(2) schedule(dynamic)
for (int j=0; j<size; j++)
{
for (i=0; i<size; i++)
{
for (k=0; k<size; k++)
{
C[i*size+j] += A[i*size+k] * B[k*size+j];
}
}
}
total_time += omp_get_wtime() - start;
}
std::setprecision(5);
std::cout << total_time/num_iter << std::endl;
delete[] A;
delete[] B;
delete[] C;
return 0;
}
What is confusing me is the following: why is dynamic scheduling faster than static scheduling for this task? Timing the runs and taking an average shows that static scheduling is slower, which to me is a bit counterintuitive since each thread is doing the same amount of work.
Also, am I correctly speeding up my matrix multiplication code?
Parallel matrix multiplication is non-trivial (have you even considered cache-blocking?). Your best bet is likely to be to use a BLAS Library for this, rather than writing it yourself. (Remember, "The best code is the code I do not have to write").
Wikipedia: Basic Linear Algebra Subprograms points to many implementations, a lot of which (including Intel Math Kernel Library) have free licenses.

mandelbrot using openMP

// return 1 if in set, 0 otherwise
int inset(double real, double img, int maxiter){
double z_real = real;
double z_img = img;
for(int iters = 0; iters < maxiter; iters++){
double z2_real = z_real*z_real-z_img*z_img;
double z2_img = 2.0*z_real*z_img;
z_real = z2_real + real;
z_img = z2_img + img;
if(z_real*z_real + z_img*z_img > 4.0) return 0;
}
return 1;
}
// count the number of points in the set, within the region
int mandelbrotSetCount(double real_lower, double real_upper, double img_lower, double img_upper, int num, int maxiter){
int count=0;
double real_step = (real_upper-real_lower)/num;
double img_step = (img_upper-img_lower)/num;
for(int real=0; real<=num; real++){
for(int img=0; img<=num; img++){
count+=inset(real_lower+real*real_step,img_lower+img*img_step,maxiter);
}
}
return count;
}
// main
int main(int argc, char *argv[]){
double real_lower;
double real_upper;
double img_lower;
double img_upper;
int num;
int maxiter;
int num_regions = (argc-1)/6;
for(int region=0;region<num_regions;region++){
// scan the arguments
sscanf(argv[region*6+1],"%lf",&real_lower);
sscanf(argv[region*6+2],"%lf",&real_upper);
sscanf(argv[region*6+3],"%lf",&img_lower);
sscanf(argv[region*6+4],"%lf",&img_upper);
sscanf(argv[region*6+5],"%i",&num);
sscanf(argv[region*6+6],"%i",&maxiter);
printf("%d\n",mandelbrotSetCount(real_lower,real_upper,img_lower,img_upper,num,maxiter));
}
return EXIT_SUCCESS;
}
I need to convert the above code into openMP. I know how to do it for a single matrix or image but i have to do it for 2 images at the same time
the arguments are as follows
$./mandelbrot -2.0 1.0 -1.0 1.0 100 10000 -1 1.0 0.0 1.0 100 10000
Any suggestion how to divide the work in to different threads for the two images and then further divide work for each image.
thanks in advance
If you want to process multiple images at a time, you need to add a #pragma omp parallel for into the loop in the main body such as:
#pragma omp parallel for private(real_lower, real_upper, img_lower, img_upper, num, maxiter)
for(int region=0;region<num_regions;region++){
// scan the arguments
sscanf(argv[region*6+1],"%lf",&real_lower);
sscanf(argv[region*6+2],"%lf",&real_upper);
sscanf(argv[region*6+3],"%lf",&img_lower);
sscanf(argv[region*6+4],"%lf",&img_upper);
sscanf(argv[region*6+5],"%i",&num);
sscanf(argv[region*6+6],"%i",&maxiter);
printf("%d\n",mandelbrotSetCount(real_lower,real_upper,img_lower,img_upper,num,maxiter));
}
Notice that some variables need to be classified as private (i.e. each thread has its own copy).
Now, if you want additional parallelism you need nested OpenMP (see nested and NESTED_OMP in OpenMP specification) as the work will be spawned by OpenMP threads -- but note that nesting may not give you a performance boost always.
In this case, what about adding a #pragma omp parallel for (with the appropriate reduction clause so that each thread accumulates into count) into the mandelbrotSetCount routine such as
// count the number of points in the set, within the region
int mandelbrotSetCount(double real_lower, double real_upper, double img_lower, double img_upper, int num, int maxiter)
{
int count=0;
double real_step = (real_upper-real_lower)/num;
double img_step = (img_upper-img_lower)/num;
#pragma omp parallel for reduction(+:count)
for(int real=0; real<=num; real++){
for(int img=0; img<=num; img++){
count+=inset(real_lower+real*real_step,img_lower+img*img_step,maxiter);
}
}
return count;
}
The whole approach would split images between threads first and then the rest of the available threads would be able to split the loop iterations in this routine among all the available threads each time you invoke the routine.
EDIT
As user Hristo suggest's on the comments, the mandelBrotSetCount routine might be unbalanced (the best reason is that the user simply requests a different number of maxiter) on each invocation. One way to address this performance issue might be to use dynamic thread scheduling in the routine. So rather than having
#pragma omp parallel for reduction(+:count)
we might want to have
#pragma omp parallel for reduction(+:count) schedule(dynamic,N)
and here N should be a relatively small value (and likely larger than 1).

Openacc error ibgomp: while loading libgomp-plugin-host_nonshm.so.1: libgomp-plugin-host_nonshm.so.1: cannot

I want to compile an easy openacc sample (it was attached) , it was correctly compiled but when i run it got an error :
compile with : gcc-5 -fopenacc accVetAdd.c -lm
run with : ./a.out
got error in runtime
error: libgomp: while loading libgomp-plugin-host_nonshm.so.1: libgomp-plugin-host_nonshm.so.1: cannot open shared object file: No such file or directory
I google it and find only one page! then i ask how to fix this problem?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main(int argc, char* argv[])
{
// Size of vectors
int n = 10000;
// Input vectors
double *restrict a;
double *restrict b;
// Output vector
double *restrict c;
// Size, in bytes, of each vector
size_t bytes = n*sizeof(double);
// Allocate memory for each vector
a = (double*)malloc(bytes);
b = (double*)malloc(bytes);
c = (double*)malloc(bytes);
// Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
int i;
for (i = 0; i<n; i++) {
a[i] = sin(i)*sin(i);
b[i] = cos(i)*cos(i);
}
// sum component wise and save result into vector c
#pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
for (i = 0; i<n; i++) {
c[i] = a[i] + b[i];
}
// Sum up vector c and print result divided by n, this should equal 1 within error
double sum = 0.0;
for (i = 0; i<n; i++) {
sum += c[i];
}
sum = sum / n;
printf("final result: %f\n", sum);
// Release memory
free(a);
free(b);
free(c);
return 0;
}
libgomp dynamically loads shared object files for the plugins it supports, such as the one implementing the host_nonshm device. If they're installed in a non-standard directory (that is, not in the system's default search path), you need to tell the dynamic linker where to look for these shared object files: either compile with -Wl,-rpath,[...], or set the LD_LIBRARY_PATH environment variable.

How to parallelize an array shift with OpenMP?

How can I parallelize an array shift with OpenMP?
I've tryed a few things but didn't get any accurate results for the following example (which rotates the elements of an array of Carteira objects, for a permutation algorithm):
void rotaciona(int i)
{
Carteira aux = this->carteira[i];
for(int c = i; c < this->size - 1; c++)
{
this->carteira[c] = this->carteira[c+1];
}
this->carteira[this->size-1] = aux;
}
Thank you very much!
This is an example of a loop with loop-carried dependencies, and so can't be easily parallelized as written because the tasks (each iteration of the loop) aren't independent. Breaking the dependency can vary from a trivial modification to the completely impossible
(eg, an iteration loop).
Here, the case is somewhat in between. The issue with doing this in parallel is that you need to find out what your rightmost value is going to be before your neighbour changes the value. The OMP for construct doesn't expose to you which loop iterations values will be "yours", so I don't think you can use the OpenMP for worksharing construct to break up the loop. However, you can do it yourself; but it requires a lot more code, and it won't nicely reduce to the serial case any more.
But still, an example of how to do this is shown below. You have to break the loop up yourself, and then get your rightmost value. An OpenMP barrier ensures that no one starts modifying values until all the threads have cached their new rightmost value.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char **argv) {
int i;
char *array;
const int n=27;
array = malloc(n * sizeof(char) );
for (i=0; i<n-1; i++)
array[i] = 'A'+i;
array[n-1] = '\0';
printf("Array pre-shift = <%s>\n",array);
#pragma omp parallel default(none) shared(array) private(i)
{
int nthreads = omp_get_num_threads();
int tid = omp_get_thread_num();
int blocksize = (n-2)/nthreads;
int start = tid*blocksize;
int end = start + blocksize - 1;
if (tid == nthreads-1) end = n-2;
/* we are responsible for values start...end */
char rightval = array[end+1];
#pragma omp barrier
for (i=start; i<end; i++)
array[i] = array[i+1];
array[end] = rightval;
}
printf("Array post-shift = <%s>\n",array);
return 0;
}
Though your sample doesn't show any explicit openmp pragma's, I don't think it could work easily:
you are doing an in-place operation with overlapping regions.
If you split the loop in chunks, you'll have race conditions at the boundaries (because el[n] gets copied from el[n+1], which might already have been updated in another thread).
I suggest that you do manual chunking (which can be done), but I suspect that openmp parallel for is not flexible enough (haven't tried), so you could just have a parallell region that does the work in chunks, and fixup the boundary elements after a thread barrier/end of parallel block
Other thoughts:
if your values are POD, you can use memmove instead
if you can, simply switch to a list
.
std::list<Carteira> items(3000);
// rotation is now simply:
items.push_back(items.front());
items.erase(items.begin());

Resources