Ways to parallelize this using OpenMP?

Can anybody suggest the best way to parallelize this using OpenMP? The program aborts when I run this code.
void grayerode(int **img, int height, int width, int filterheight,
               int filterwidth, int iterations, int pixrange)
{
    int maxlabel = 0;
    int fh, fw, iters, pixval = 0, i, j, s, k;
    int fhlimit = filterheight / 2;
    int fwlimit = filterwidth / 2;
    int **smoothedlabels;
    allocate_2D_int_matrix(&smoothedlabels, height, width);

    #pragma omp parallel for shared(smoothedlabels, height, width, k)
    for (i = 0; i < height; i++)
        for (j = 0; j < width; j++)
            smoothedlabels[i][j] = img[i][j];

    int *labeltemp = (int *)malloc(pixrange * sizeof(int));
    for (s = 0; s < pixrange; s++)
        labeltemp[s] = 0;

    for (iters = 0; iters < iterations; iters++) {
        #pragma omp parallel for private(i, j, labeltemp)
        for (i = fhlimit; i < height - fhlimit; i++) {
            for (j = fwlimit; j < width - fwlimit; j++) {
                for (fh = -fhlimit; fh <= fhlimit; fh++)
                    for (fw = -fwlimit; fw <= fwlimit; fw++) {
                        labeltemp[img[i+fh][j+fw]]++;
                    }
                for (s = 0; s < pixrange; s++) {
                    if (labeltemp[s] > maxlabel) {
                        maxlabel = labeltemp[s];
                        pixval = s;
                    }
                }
                smoothedlabels[i][j] = pixval;
                for (s = 0; s < pixrange; s++)
                    labeltemp[s] = 0;
                maxlabel = 0;
            }
        }
    }

    for (i = 0; i < height; i++)
        for (j = 0; j < width; j++)
            img[i][j] = smoothedlabels[i][j];

    free_2D_int_matrix(&smoothedlabels);
    free(labeltemp);
    return;
}

A few things:
You are not declaring private variables correctly. One example of doing it the correct way in your code:

#pragma omp parallel for private(i, j) shared(smoothedlabels, img, width, height)
for (i = 0; i < height; i++)
    for (j = 0; j < width; j++)
        smoothedlabels[i][j] = img[i][j];

It is important that j remains private, or each thread will try to change its value, giving you unexpected behaviour. (Note: i is actually made private implicitly by the parallel for construct, but I always prefer to state it explicitly for better readability.)
Try avoiding 2D arrays because they restrict your ability to parallelize. In the same example you could do the following:

#pragma omp parallel for private(i) shared(width, height, smoothedlabels, img)
for (i = 0; i < width * height; i++)
    smoothedlabels[i] = img[i];

This will parallelize the entire loop for you rather than just the outer loop. You can order your 1D array either column-wise or row-wise.
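For instance, with a row-major layout the mapping from a 2D index to the flat index is just i*width + j. A minimal sketch, reusing the variables from grayerode (the flat buffer img_flat is illustrative, not part of the original code):

/* Row-major flattening: element (i, j) of a height x width image lives
   at flat index i*width + j in one contiguous allocation. */
int *img_flat = (int *)malloc(height * width * sizeof(int));
for (i = 0; i < height; i++)
    for (j = 0; j < width; j++)
        img_flat[i*width + j] = img[i][j];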
Same thing goes for the rest of the loops - just apply the same concept.
Later in your code, for example, you have the following:

for (fh = -fhlimit; fh <= fhlimit; fh++)
    for (fw = -fwlimit; fw <= fwlimit; fw++) {
        labeltemp[img[i+fh][j+fw]]++;
    }

If you do not declare fh and fw private, you will get unexpected behaviour for the same reason that not declaring j private above would give you unexpected behaviour. A corrected pragma for that loop nest is sketched below.
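As a rough sketch, assuming the surrounding variables of grayerode, and addressing only the data-sharing of the loop indices rather than every issue in the original code:

/* All loop indices are private so each thread iterates independently.
   Note: labeltemp is still shared here and needs per-thread handling,
   which is a separate issue from the loop indices. */
#pragma omp parallel for private(i, j, fh, fw) shared(img, smoothedlabels, height, width, fhlimit, fwlimit)
for (i = fhlimit; i < height - fhlimit; i++) {
    for (j = fwlimit; j < width - fwlimit; j++) {
        for (fh = -fhlimit; fh <= fhlimit; fh++)
            for (fw = -fwlimit; fw <= fwlimit; fw++) {
                labeltemp[img[i+fh][j+fw]]++;
            }
        /* ... rest of the per-pixel work ... */
    }
}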

Related

-ta=tesla:deepcopy flag and #pragma acc shape

I just found out about the deepcopy flag. Until now I've always used -ta=tesla:managed to handle deep copy, and I would like to explore the alternative.
I read this article: https://www.pgroup.com/blogs/posts/deep-copy-beta.htm which is well written, but I think it does not cover my case. I have a structure of this type:
typedef struct Data_ {
    double ****Vc;
    double ****Uc;
} Data;
The shape of these two arrays is not defined by an element of the struct itself, but by elements of another structure, and those are themselves defined only during the execution of the program.
How can I use #pragma acc shape(Vc, Uc) in this case?
Without this pragma, and copying the structure as follows:
int main() {
    Data data;
    Initialize(&data);
}

int Initialize(Data *data) {
    data->Uc = ARRAY_4D(ntot[KDIR], ntot[JDIR], ntot[IDIR], NVAR, double);
    data->Vc = ARRAY_4D(NVAR, ntot[KDIR], ntot[JDIR], ntot[IDIR], double);
    #pragma acc enter data copyin(data)
    PrimToCons3D(data->Vc, data->Uc, grid, NULL);
}

void PrimToCons3D(double ****V, double ****U, Grid *grid, RBox *box) {
    #pragma acc parallel loop collapse(3) present(V[:NVAR][:nx3_tot][:nx2_tot][:nx1_tot])
    for (k = kbeg; k <= kend; k++) {
        for (j = jbeg; j <= jend; j++) {
            for (i = ibeg; i <= iend; i++) {
                double v[NVAR];
                #pragma acc loop
                for (nv = 0; nv < NVAR; nv++) v[nv] = V[nv][k][j][i];
            }
I get
FATAL ERROR: data in PRESENT clause was not found on device 1: name=V host:0x1fd2b80
file:/home/Prova/Src/mappers3D.c PrimToCons3D line:140
Btw, this same code works fine with -ta=tesla:managed.
Since you don't provide a full reproducing example, I wasn't able to test this, but it would look something like:
typedef struct Data_ {
    int i, j, k, l;
    double ****Vc;
    double ****Uc;
    #pragma acc shape(Vc[0:k][0:j][0:i][0:l])
    #pragma acc shape(Uc[0:k][0:j][0:i][0:l])
} Data;

int Initialize(Data *data) {
    data->i = ntot[IDIR];
    data->j = ntot[JDIR];
    data->k = ntot[KDIR];
    data->l = NVAR;
    data->Uc = ARRAY_4D(ntot[KDIR], ntot[JDIR], ntot[IDIR], NVAR, double);
    data->Vc = ARRAY_4D(NVAR, ntot[KDIR], ntot[JDIR], ntot[IDIR], double);
    #pragma acc enter data copyin(data)
    PrimToCons3D(data->Vc, data->Uc, grid, NULL);
}

void PrimToCons3D(double ****V, double ****U, Grid *grid, RBox *box) {
    int kbeg, jbeg, ibeg, kend, jend, iend;
    #pragma acc parallel loop collapse(3) present(V, U)
    for (int k = kbeg; k <= kend; k++) {
        for (int j = jbeg; j <= jend; j++) {
            for (int i = ibeg; i <= iend; i++) {
Though keep in mind that the "shape" and "policy" directives were not adopted by the OpenACC standard, and we (the NVHPC compiler team) only did a Beta version, which we have not maintained.
It is probably better to do a manual deep copy, which will be standards compliant; I can help with that if you can provide a reproducer which includes how you're doing the array allocation, i.e. "ARRAY_4D". A rough sketch of the manual deep-copy pattern is below.
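For reference, a minimal sketch of the manual deep-copy pattern, assuming the extents are known at the copy point and that ARRAY_4D produces the usual pointer-to-pointer layout (the extent names nv, nk, nj, ni are placeholders, not from the original code):

/* Copy the struct itself, then each 4D member. NVHPC accepts multi-dimensional
   subscript triplets on pointer-to-pointer arrays in data clauses and performs
   the pointer fix-up (attach) on the device. */
#pragma acc enter data copyin(data[0:1])
#pragma acc enter data copyin(data->Vc[0:nv][0:nk][0:nj][0:ni])
#pragma acc enter data copyin(data->Uc[0:nk][0:nj][0:ni][0:nv])

/* ... compute regions can now use present(data->Vc, data->Uc) ... */

#pragma acc exit data delete(data->Vc, data->Uc, data[0:1])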

Difference between mutual exclusion like atomic and reduction in OpenMP

I am following the video lectures of Tim Mattson on OpenMP, and there was one exercise to find errors in provided code that computes the area of the Mandelbrot set. Here is the solution that was provided:
#include <stdio.h>
#include <omp.h>

#define NPOINTS 1000
#define MAXITER 1000

struct d_complex {
    double r;
    double i;
};

void testpoint(struct d_complex);

struct d_complex c;
int numoutside = 0;

int main() {
    int i, j;
    double area, error, eps = 1.0e-5;

    #pragma omp parallel for default(shared) private(c, j) firstprivate(eps)
    for (i = 0; i < NPOINTS; i++) {
        for (j = 0; j < NPOINTS; j++) {
            c.r = -2.0 + 2.5*(double)(i)/(double)(NPOINTS) + eps;
            c.i = 1.125*(double)(j)/(double)(NPOINTS) + eps;
            testpoint(c);
        }
    }

    area = 2.0*2.5*1.125*(double)(NPOINTS*NPOINTS - numoutside)/(double)(NPOINTS*NPOINTS);
    error = area/(double)NPOINTS;
    printf("Area of Mandlebrot set = %12.8f +/- %12.8f\n", area, error);
    printf("Correct answer should be around 1.510659\n");
}

void testpoint(struct d_complex c) {
    // Does the iteration z = z*z + c until |z| > 2, when the point is known to be outside the set.
    // If the loop count reaches MAXITER, the point is considered to be inside the set.
    struct d_complex z;
    int iter;
    double temp;

    z = c;
    for (iter = 0; iter < MAXITER; iter++) {
        temp = (z.r*z.r) - (z.i*z.i) + c.r;
        z.i = z.r*z.i*2 + c.i;
        z.r = temp;
        if ((z.r*z.r + z.i*z.i) > 4.0) {
            #pragma omp atomic
            numoutside++;
            break;
        }
    }
}
The question I have is: could we use a reduction on the variable numoutside in the #pragma omp parallel for, like this:

#pragma omp parallel for default(shared) private(c,j) firstprivate(eps) reduction(+:numoutside)

without the atomic construct in the testpoint function?
I tested the code without the atomic, and the result was different from the one I got in the first place. Why does that happen? And while I understand the concept of mutual exclusion and its use because of race conditions, isn't reduction just another way of solving that problem with private variables?
Thank You in advance.

Output container from openmp parallel loop

Which critical-section style is better when collecting an output container?
// Insert into the output container one object at a time.
vector<float> output;
#pragma omp parallel for
for (int i = 0; i < 1000000; ++i)
{
    float value = // compute something complicated
    #pragma omp critical
    {
        output.push_back(value);
    }
}

// Insert object into per-thread container; later aggregate those containers.
vector<float> output;
#pragma omp parallel
{
    vector<float> per_thread;
    #pragma omp for
    for (int i = 0; i < 1000000; ++i)
    {
        float value = // compute something complicated
        per_thread.push_back(value);
    }
    #pragma omp critical
    {
        output.insert(output.end(), per_thread.begin(), per_thread.end());
    }
}
EDIT: the above examples were misleading because they indicated that each iteration pushes exactly one item, which is not true in my case. Here are more accurate examples:
// Insert into the output container one object at a time.
vector<float> output;
#pragma omp parallel for
for (int i = 0; i < 1000000; ++i)
{
    int k = // compute number of items
    for (int j = 0; j < k; ++j)
    {
        float value = // compute something complicated
        #pragma omp critical
        {
            output.push_back(value);
        }
    }
}

// Insert object into per-thread container; later aggregate those containers.
vector<float> output;
#pragma omp parallel
{
    vector<float> per_thread;
    #pragma omp for
    for (int i = 0; i < 1000000; ++i)
    {
        int k = // compute number of items
        for (int j = 0; j < k; ++j)
        {
            float value = // compute something complicated
            per_thread.push_back(value);
        }
    }
    #pragma omp critical
    {
        output.insert(output.end(), per_thread.begin(), per_thread.end());
    }
}
If you always insert exactly one item per parallel iteration, the proper way is:

std::vector<float> output(1000000);
#pragma omp parallel for
for (int i = 0; i < 1000000; ++i)
{
    float value = // compute something complicated
    output[i] = value;
}

It is thread-safe to assign to distinct elements of a std::vector concurrently (which is guaranteed here because all i are different), and there is no significant false sharing in this case.
If you do not insert exactly one item per parallel iteration, either version is basically correct.
Your first version, with a critical section inside the loop, can be very slow; note that if the computation is really expensive, it may still be fine overall.
The per-thread container with a manual reduction is generally fine. Of course it makes the order of the result non-deterministic. You could streamline this by using a user-defined reduction, as sketched below.
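A minimal sketch of that approach, assuming a placeholder compute_items function (not part of the original question) that produces a varying number of values per iteration:

// Combiner that appends one thread's partial vector to another; the private
// copies start out default-constructed (empty).
#pragma omp declare reduction(vec_merge : vector<float> : \
    omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

vector<float> output;
#pragma omp parallel for reduction(vec_merge : output)
for (int i = 0; i < 1000000; ++i)
{
    vector<float> items = compute_items(i);  // placeholder for the real work
    output.insert(output.end(), items.begin(), items.end());
}

Inside the loop, output refers to each thread's private copy, and the partial vectors are concatenated at the end of the loop. The result order is still non-deterministic, just as with the manual per-thread version.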

generating random variables with openmp in c++

How can I generate random variables in parallel (is it efficient? is it possible?) with my linear congruential generator below:
double* uniform(long N)
{
    long i, j;
    long a = 16807;
    long long m = (((long long)1) << 31) - 1;
    long I[N];
    double *U;
    #pragma omp parallel for firstprivate(i)
    for (j = 0; j < N; j++)
    {
        if (i == 0)
        {
            int y = omp_get_thread_num(); // undefined ref error here
            I[y];
            i++;
        }
        else
        {
            I[j] = (a*I[j-1]) % m;
        }
    }
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        U[i] = (double)I[i]/(m + 1.0);
    return U;
}
My goal is to generate two variables to use in another function (Box-Muller method):
double* gauss(long int N)
{
    double *X, *Y, *U;
    X = generator(N/2);
    Y = generator(N/2);
    #pragma omp parallel for
    for (i = 0; i < N/2; i++)
    {
        U[2*i]   = sqrt(-2 * log(X[i]))*sin(Y[i]*2*3.14);
        U[2*i+1] = sqrt(-2 * log(X[i]))*cos(Y[i]*2*3.14);
    }
    return U;
}
What I want to know is: how can I get different seeds when generating the uniform variables with the function uniform?
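For reference, one common pattern (not from the original thread) is to give each thread its own generator state, seeded from its thread number; a rough sketch under that assumption:

// Rough sketch: each thread advances an independent LCG stream seeded from its
// thread number. Note that seeding one LCG differently per thread does not
// guarantee non-overlapping streams; it only illustrates per-thread seeding.
#include <omp.h>

void fill_uniform(double *U, long N)
{
    const long long a = 16807;
    const long long m = (((long long)1) << 31) - 1;
    #pragma omp parallel
    {
        long long state = 12345 + 7919 * (long long)omp_get_thread_num(); // per-thread seed
        #pragma omp for
        for (long j = 0; j < N; j++) {
            state = (a * state) % m;
            U[j] = (double)state / (m + 1.0);
        }
    }
}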

OpenMP: How to copy back value of firstprivate variable back to global

I am new to OpenMP and I am stuck on a basic operation. Here is sample code for my question.
#include <omp.h>

int main(void)
{
    int A[16] = {1,2,3,4,5 ...... 16};
    #pragma omp parallel for firstprivate(A)
    for (int i = 0; i < 4; i++)
    {
        for (int j = 0; j < 4; j++)
        {
            A[i*4+j] = Process(A[i*4+j]);
        }
    }
}
As is evident, the value of A is local to each thread. However, at the end, I want to write back the part of A calculated by each thread to the corresponding position in the global variable A. How can this be accomplished?
Simply make A shared. This is fine, because all loop iterations operate on separate elements of A. Remember that OpenMP is shared memory programming.
You can do so explicitly by using shared instead of firstprivate, or simply remove the clause:

int A[16] = {1,2,3,4,5 ...... 16};
#pragma omp parallel for
for (int i = 0; i < 4; i++)

By default, all variables declared outside of the parallel region are shared. You can find an extended description in this answer. A minimal complete sketch is shown below.
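Putting it together, a minimal complete sketch of the shared version, with a placeholder Process function (not from the original question):

#include <stdio.h>

// Placeholder for the real per-element work.
static int Process(int x) { return x * 2; }

int main(void)
{
    int A[16] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};

    // A is shared by default; each iteration writes distinct elements,
    // so no synchronization is needed.
    #pragma omp parallel for
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            A[i*4 + j] = Process(A[i*4 + j]);

    for (int i = 0; i < 16; i++)
        printf("%d ", A[i]);
    printf("\n");
    return 0;
}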
