Output container from openmp parallel loop - parallel-processing

Which critical section style is better when collecting output container?
// Insert into the output container one object at a time.
vector<float> output;
#pragma omp parallel for
for(int i=0; i<1000000; ++i)
{
float value = // compute something complicated
#pragma omp critical
{
output.push_back(value);
}
}
// Insert object into per-thread container; later aggregate those containers.
vector<float> output;
#pragma omp parallel
{
vector<float> per_thread;
#pragma omp for
for(int i=0; i<1000000; ++i)
{
float value = // compute something complicated
per_thread.push_back(value);
}
#pragma omp critical
{
output.insert(output.end(), per_thread.begin(), per_thread.end());
}
}
EDIT: the above examples were misleading because they indicated that each iteration pushes exactly one item, which is not true in my case. Here are more accurate examples:
// Insert into the output container one object at a time.
vector<float> output;
#pragma omp parallel for
for(int i=0; i<1000000; ++i)
{
int k = // compute number of items
for( int j=0; j<k; ++j)
{
float value = // compute something complicated
#pragma omp critical
{
output.push_back(value);
}
}
}
// Insert object into per-thread container; later aggregate those containers.
vector<float> output;
#pragma omp parallel
{
vector<float> per_thread;
#pragma omp for
for(int i=0; i<1000000; ++i)
{
int k = // compute number of items
for( int j=0; j<k; ++j)
{
float value = // compute something complicated
per_thread.push_back(value);
}
}
#pragma omp critical
{
output.insert(output.end(), per_thread.begin(), per_thread.end());
}
}

If you always insert exactly one item per parallel iteration, the proper way is:
std::vector<float> output(1000000);
#pragma omp parallel for
for(int i=0; i<1000000; ++i)
{
float value = // compute something complicated
output[i] = value;
}
It is threadsafe to assign distinct elements of std::vector (which is guaranteed because all i are different). And there is no significant false-sharing in this case.
If you do not insert exactly one item per parallel iteration either version is basically correct.
Your first version using a critical in the loop can be very slow - note that if the computation is really slow, it may still be fine overall.
The per-thread container / manual reduction is generally fine. Of course it makes the order of the result non-deterministic. You could streamline this by using a user-defined reduction.

Related

OpenMP device offload reduction to existing device memory location

How do I tell OpenMP device offload to use an existing location in device memory for a reduction? I want to avoid data movement to/from device. Results will only be accessed on the device.
Here's my code
void reduce(const double *mi, const double *xi, const double *yi,
double *mo, double *xo, double *yo, long n)
{
#pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
for (long i = 0; i < n; ++i)
{
mo[0] += mi[i];
xo[0] += mi[i]*xi[i];
yo[0] += mi[i]*yi[i];
}
#pragma omp target is_device_ptr(mo,xo,yo)
{
xo[0] /= mo[0];
yo[0] /= mo[0];
}
}
with this code and clang++ 15 targeting nvidia ptx, I'm getting the error:
test.cpp:6:109: error: reduction variable cannot be in a is_device_ptr clause in '#pragma omp target teams distribute parallel for' directive
#pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
^
test.cpp:6:67: note: defined as reduction
#pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
^
You cannot use array subscripts in reduction clause. That's non-conforming code. Please try something along these lines:
#include <stdio.h>
int main(int argc, char * argv[]) {
double sum = 0;
#pragma omp target data map(tofrom:sum)
{
for (int t = 0; t < 10; t++) {
#pragma omp target teams distribute parallel for map(tofrom:sum) reduction(+:sum)
for (int j = 0; j < 10000; j++) {
sum += 1;
}
}
}
printf("sum=%lf\n", sum);
return 0;
}
With the the target data construct you can allocate a buffer for the reduction variable on the GPU. The target construct's reduction clause will then reduce the value into that buffered variable and will only transfer the variable back from the GPU at the closing curly brace of the target data construct.
It is possible to do this all on the device. A couple of key pieces of info were required:
the map clause's map-type is used to optimize the copies to/from the device and disable unnecessary copies. The alloc map-type disables both copies to and from the device.
variables have only one instance in the device. map clauses for variables already mapped by an enclosing target data or target enter data by default do not result in copies to/from the device.
With that the solution is as follows:
// --------------------------------------------------------------------------
void reduce(const double *mi, const double *xi, const double *yi,
double *mo, double *xo, double *yo, long n)
{
double m, x, y;
#pragma omp target enter data map(alloc: m,x,y)
#pragma omp target map(alloc: m,x,y)
{
m = 0.;
x = 0.;
y = 0.;
}
#pragma omp target teams distribute parallel for reduction(+: m,x,y), \
is_device_ptr(mi,xi,yi), map(alloc: m,x,y)
for (long i = 0; i < n; ++i)
{
m += mi[i];
x += mi[i]*xi[i];
y += mi[i]*yi[i];
}
#pragma omp target is_device_ptr(mo,xo,yo), map(alloc: m,x,y)
{
mo[0] = m;
xo[0] = x/m;
yo[0] = y/m;
}
#pragma omp target exit data map(release: m,x,y)
}
BEWARE this is a tentative, unverified answer, since I don't have the target, and Compiler Explorer doesn.t seem to have a gcc which has offload enabled. Hence this is untested.
However, you can clearly try this for yourself!
I suggest splitting the directives, and adding explicit scalar locals for the reduction.
So your code would look something like this
void reduce(const double *mi, const double *xi, const double *yi,
double *mo, double *xo, double *yo, long n)
{
#pragma omp target is_device_ptr(mi,xi,yi,mo,xo,yo)
{
double mTotal = 0.0;
double xTotal = 0.0;
double yTotal = 0.0;
#pragma omp teams distribute parallel for reduction(+: mTotal, xTotal, yTotal)
for (long i = 0; i < n; ++i)
{
mTotal += mi[i];
xTotal += mi[i]*xi[i];
yTotal += mi[i]*yi[i];
}
mo[0] = mTotal;
xo[0] = xTotal/mTotal;
yo[0] = yTotal/mTotal;
}
}
That compiles OK for the host, but, as above, YOUR MILEAGE MAY VARY

Difference between mutual exclusion like atomic and reduction in OpenMP

I'm am following video lectures of Tim Mattson on OpenMP and there was one exercise to find errors in provided code that count area of the Mandelbrot. So here is the solution that was provided:
#define NPOINTS 1000
#define MAXITER 1000
void testpoint(struct d_complex);
struct d_complex{
double r;
double i;
};
struct d_complex c;
int numoutside = 0;
int main(){
int i,j;
double area, error, eps = 1.0e-5;
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps)
for(i = 0; i<NPOINTS; i++){
for(j=0; j < NPOINTS; j++){
c.r = -2.0+2.5*(double)(i)/(double)(NPOINTS)+eps;
c.i = 1.125*(double)(j)/(double)(NPOINTS)+eps;
testpoint(c);
}
}
area=2.0*2.5*1.125*(double)(NPOINTS*NPOINTS-numoutside)/(double)(NPOINTS*NPOINTS);
error=area/(double)NPOINTS;
printf("Area of Mandlebrot set = %12.8f +/- %12.8f\n",area,error);
printf("Correct answer should be around 1.510659\n");
}
void testpoint(struct d_complex c){
// Does the iteration z=z*z+c, until |z| > 2 when point is known to be outside set
// If loop count reaches MAXITER, point is considered to be inside the set
struct d_complex z;
int iter;
double temp;
z=c;
for (iter=0; iter<MAXITER; iter++){
temp = (z.r*z.r)-(z.i*z.i)+c.r;
z.i = z.r*z.i*2+c.i;
z.r = temp;
if ((z.r*z.r+z.i*z.i)>4.0) {
#pragma omp atomic
numoutside++;
break;
}
}
}
The question I have is, could we use reduction in #pragma omp parallel of variable numoutside like:
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps) reduction(+:numoutside)
without atomic construct in testpoint function?
I tested the function without atomic, and the result was different from the one I got in the first place. Why does that happen? And while I understand the concept of mutual exclusion and use of it because of race conditioning, isn't reduction just another form of solving that problem with private variables?
Thank You in advance.

OpenMP: How to copy back value of firstprivate variable back to global

I am new to OpenMP and I am stuck with a basic operation. Here is a sample code for my question.
#include <omp.h>
int main(void)
{
int A[16] = {1,2,3,4,5 ...... 16};
#pragma omp parallel for firstprivate(A)
for(int i = 0; i < 4; i++)
{
for(int j = 0; j < 4; j++)
{
A[i*4+j] = Process(A[i*4+j]);
}
}
}
As evident,value of A is local to each thread. However, at the end, I want to write back part of A calculated by each threadto the corresponding position in global variable A. How this can be accomplished?
Simply make A shared. This is fine, because all loop iterations operate on separate elements of A. Remember that OpenMP is shared memory programming.
You can do so explicitly by using shared instead of firstprivate, or simply remove the declaration:
int A[16] = {1,2,3,4,5 ...... 16};
#pragma omp parallel for
for(int i = 0; i < 4; i++)
By default all variables declared outside of the parallel region. You can find an extended exemplary description in this answer.

Open MP with gsl_matrix is slower than sequential

I've created a simple c program using gsl(GNU Scienctific Library) and open mp. In this simple program, I want to test the execution time for sequential and parallel. Here is the program snippets, main.c.
#include "omp.h"
#include <stdio.h>
#include <gsl/gsl_matrix.h>
#include <time.h>
int main()
{
omp_set_num_threads(4);
int n1=10000, n2=10000;
gsl_matrix *A = gsl_matrix_alloc(n1, n2);
int i,j;
struct timeval tv1, tv2, tv3, tv4;
gettimeofday(&tv1, 0);
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}
gettimeofday(&tv2, 0);
long elapsed = (tv2.tv_sec-tv1.tv_sec)*1000000 + tv2.tv_usec-tv1.tv_usec;
printf("Sequential Duration:%ldms\n", elapsed);
gettimeofday(&tv3, 0);
#pragma omp parallel for private(i,j)
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}
gettimeofday(&tv4, 0);
elapsed = (tv4.tv_sec-tv3.tv_sec)*1000000 + tv4.tv_usec-tv3.tv_usec;
printf(" Parallel Duration:%ldms\n", elapsed);
return 0;
}
Then I compiled the above code, using this command:
gcc -fopenmp main.c -o test -lgsl -lgslcblas -lm
Here is the program's result:
Sequential Duration:11980106ms
Parallel Duration:20624043ms
Why, the parallel part slower than the sequential part. How can I optimize this code? Thanks
as you have written it the j variable is shared between all threads so the threads are overwritting other threads state constantly, leading to them iterating values they have already covered.
You should always minimize the scope of variables when trying to parallelize with openmp. Either move the scope of j into the loop or mark it as private explicitly:
#pragma omp parallel for private(j)
also clock counts the processor time not the real time, you probably want to use gettimeofday
you matrix is too small to benefit much from parallelization, the threading overhead will dominate. Increase it to ~10000x10000 to start seeing something.
The problem here is that you do not know what the procedure gsl_matrix_set does with A. You do not know if it is thread safe. To change one element in that matrix you supply the whole matrix to the routine instead of only the indices of the element. This smells by false sharing (see e.g. this answer).
I would try this instead
gsl_matrix_set(A[i][j],i*j*1000000);
If that does not work and what you are interested in is only the time difference between serial and parallel I would just do
A[i][j] = i*j*1000000
In the thread part, try this:
#pragma omp parallel private(i,j)
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}
or
#pragma omp parallel for
for(i=0; i<n1; i++)
{
for(j=0; j<n2; j++)
{
gsl_matrix_set(A, i, j, i*j*1000000);
}
}

How to create an `omp parallel for` with synchronization (`barrier`) of all threads in the middle with OpenMP

I have two functions, do_step_one(i) and do_step_two(i), for i from 0 to N-1.
Currently, I have this (sequential) code:
for(unsigned int i=0; i<N; i++) {
do_step_one(i);
}
for(unsigned int i=0; i<N; i++) {
do_step_two(i);
}
Each call of do_step_one() and do_step2() can be done in any order and in parallel, but any do_step_two() needs the end of all the do_step_one() to start (it use do_step_one() results).
I tried the following :
#omp parallel for
for(unsigned int i=0; i<N; i++) {
do_step_one(i);
#omp barrier
do_step_two(i);
}
But gcc complains
convolve_slices.c:21: warning: barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region.
What do I misunderstand? How to solve that issue?
Just a side note, if you want to make sure the threads are not recreated, separate the declaration of parallel and declaration of for:
#pragma omp parallel
{
#pragma omp for
for(unsigned int i=0; i<N; i++){
do_step_one(i);
}
//implicit barrier here
#pragma omp for
for(unsigned int i=0; i<N; i++){
do_step_two(i);
}
}
One problem I see with this code, is that the code does not comply with the spec :)
If you need all do_step_one()'s to end, you'll need something like the following:
#pragma omp parallel for
for(unsigned int i=0; i<N; i++){
do_step_one(i);
}
#pragma omp parallel for
for(unsigned int i=0; i<N; i++){
do_step_two(i);
}
The result of this would be a parallelism of the first for, and then a parallelism of the second for.

Resources