OpenMP double for loop

I'd like to use OpenMP to multi-thread my code. Here is a simple example I wrote:
vector<Vector3f> a;
int i, j;

for (i = 0; i < 10; i++)
{
    Vector3f b;
    #pragma omp parallel for private(j)
    for (j = 0; j < 3; j++)
    {
        b[j] = j;
    }
    a.push_back(b);
}

for (i = 0; i < 10; i++)
{
    cout << a[i] << endl;
}
I want to change it so that it works like:
parallel for1
{
for2
}
or
for1
{
parallel for2
}
The code works when the #pragma line is deleted, but it does not work when I use it. What's the problem?
///////// Added
Actually, I am using OpenMP in a more complicated example, a double for loop. Here, too, it works well when I do not apply OpenMP, but when I apply it, the error occurs at the vector push_back line:
vector<Class> B;
for 1
{
    #pragma omp parallel for private(j)
    parallel for j
    {
        Class A;
        B.push_back(A); // error!!!!!!!
    }
}
If I remove the B.push_back(A) line, it also works when applying OpenMP. I could not find the exact error message, but I guess it is an exception related to the vector. The debugger stops at:
void _Reallocate(size_type _Count)
{ // move to array of exactly _Count elements
pointer _Ptr = this->_Getal().allocate(_Count);
_TRY_BEGIN
_Umove(this->_Myfirst, this->_Mylast, _Ptr);

std::vector::push_back is not thread safe; you cannot call it from multiple threads without any protection against race conditions.
Instead, prepare the vector so that its size is already correct and then insert the elements via operator[].
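For example, a minimal sketch of that pre-sized approach for the second snippet might look like this (the loop bounds numOuter and numInner are hypothetical names, and it assumes Class is default-constructible; each (i, j) pair writes its own slot, so no synchronization is needed):

vector<Class> B(numOuter * numInner);   // sized once, before any threads run

for (int i = 0; i < numOuter; i++)
{
    #pragma omp parallel for
    for (int j = 0; j < numInner; j++)
    {
        Class A;
        B[i * numInner + j] = A;        // disjoint indices: no two threads touch the same element
    }
}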
Alternatively you can protect the insertion with a critical region:
#pragma omp critical
B.push_back(A);
This way only one thread at a time will do the insertion, which will fix the error but slow down the code.
In general, I don't think you are approaching the parallelization the right way, but there is no way to give better advice without a clearer and more representative problem description.

Related

Difference between mutual exclusion like atomic and reduction in OpenMP

I am following Tim Mattson's video lectures on OpenMP, and one exercise was to find the errors in provided code that computes the area of the Mandelbrot set. Here is the solution that was provided:
#define NPOINTS 1000
#define MAXITER 1000

void testpoint(struct d_complex);

struct d_complex{
    double r;
    double i;
};

struct d_complex c;
int numoutside = 0;

int main(){
    int i,j;
    double area, error, eps = 1.0e-5;

    #pragma omp parallel for default(shared) private(c,j) firstprivate(eps)
    for(i = 0; i<NPOINTS; i++){
        for(j=0; j < NPOINTS; j++){
            c.r = -2.0+2.5*(double)(i)/(double)(NPOINTS)+eps;
            c.i = 1.125*(double)(j)/(double)(NPOINTS)+eps;
            testpoint(c);
        }
    }

    area=2.0*2.5*1.125*(double)(NPOINTS*NPOINTS-numoutside)/(double)(NPOINTS*NPOINTS);
    error=area/(double)NPOINTS;
    printf("Area of Mandlebrot set = %12.8f +/- %12.8f\n",area,error);
    printf("Correct answer should be around 1.510659\n");
}

void testpoint(struct d_complex c){
    // Does the iteration z=z*z+c, until |z| > 2 when point is known to be outside set
    // If loop count reaches MAXITER, point is considered to be inside the set
    struct d_complex z;
    int iter;
    double temp;
    z=c;
    for (iter=0; iter<MAXITER; iter++){
        temp = (z.r*z.r)-(z.i*z.i)+c.r;
        z.i = z.r*z.i*2+c.i;
        z.r = temp;
        if ((z.r*z.r+z.i*z.i)>4.0) {
            #pragma omp atomic
            numoutside++;
            break;
        }
    }
}
The question I have is: could we use a reduction on the variable numoutside in the #pragma omp parallel for, like this:
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps) reduction(+:numoutside)
without the atomic construct in the testpoint function?
I tested the function without atomic, and the result was different from the one I got in the first place. Why does that happen? And while I understand the concept of mutual exclusion and its use to avoid race conditions, isn't reduction just another way of solving that problem with private variables?
Thank you in advance.

Parallel programming with OpenMP race condition not working

void ompClassifyToClusteres(Point* points, Cluster* clusteres, int numOfPoints, int numOfClusteres, int myid) {
    int i, j;
    Cluster closestCluster;
    double closestDistance;
    double tempDistance;

    omp_set_num_threads(OMP_NUM_OF_THREADS);

    #pragma omp parallel private(j)
    {
        #pragma omp for
        for (i = 0; i < numOfPoints; i++) {
            closestCluster = clusteres[0];
            closestDistance = distanceFromClusterCenter(points[i], closestCluster);
            for (j = 1; j <= numOfClusteres; j++) {
                tempDistance = distanceFromClusterCenter(points[i], clusteres[j]);
                if (tempDistance < closestDistance) {
                    closestCluster = clusteres[j];
                    closestDistance = tempDistance;
                }
            }
            points[i].clusterId = closestCluster.id;
        }
    }
    printPoints(points, numOfPoints);
}
Output: (screenshot not shown)
I'm trying to classify points into clusters for the K-means algorithm.
I get this output (ignore the checks) in one execution and the right results in the next execution, and so on.
I tried putting some variables in private, but it didn't work.
I'll just say that these 3 points should be classified to cluster 0; I'm guessing there's a race or something, but I can't figure it out.
Yes, there is a race condition. tempDistance, closestCluster, and closestDistance should also be private. A good check is to ask yourself: would these variables need to be different for each iteration of the for loop if the iterations ran at the same time?
You can make them private with the private() clause, like you did with j, or just declare them within the outer for loop.
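A minimal sketch of the second option, with the per-point temporaries declared inside the loop body so each iteration (and therefore each thread) gets its own copies. It assumes clusteres holds numOfClusteres entries, so the inner loop here runs with j < numOfClusteres:

#pragma omp parallel for
for (int i = 0; i < numOfPoints; i++) {
    Cluster closestCluster = clusteres[0];          // declared inside the loop: private per iteration
    double closestDistance = distanceFromClusterCenter(points[i], closestCluster);
    for (int j = 1; j < numOfClusteres; j++) {
        double tempDistance = distanceFromClusterCenter(points[i], clusteres[j]);
        if (tempDistance < closestDistance) {
            closestCluster = clusteres[j];
            closestDistance = tempDistance;
        }
    }
    points[i].clusterId = closestCluster.id;        // only element i is written here: no race
}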

shared arrays in OpenMP

I'm trying to parallelize a piece of C++ code with OpenMP, but I'm facing a problem: my parallelized code is not faster than the serial one.
I think I understand the cause, but I'm not able to solve it.
The structure of my code is like this:
int vec1 [M];
int vec2 [N];
...initialization of vec1 and vec2...

for (int it=0; it < tot_iterations; it++) {
    if ( (it+1)%2 != 0 ) {
        #pragma omp parallel for
        for (int j=0 ; j < N ; j++) {
            ....code involving a call to a function to which I'm passing as a parameter vec1.....
            if (something) { vec2[j]=vec2[j]-1;}
        }
    }
    else {
        #pragma omp parallel for
        for (int i=0 ; i < M ; i++) {
            ....code involving a call to a function to which I'm passing as a parameter vec2.....
            if (something) { vec1[i]=vec1[i]-1;}
        }
    }
}
I thought that maybe my parallelized code is slower because multiple threads want to access the same shared array and one has to wait until another has finished, but I'm not sure how things really work. I can't make vec1 and vec2 private, though, since the updates wouldn't be seen in the other iterations...
How can I improve this?
The issue you mention, multiple threads accessing the same array, is called "false sharing". Unless your array is small, it should not be the bottleneck here, because #pragma omp parallel for uses static scheduling in the default implementation (with gcc at least), so each thread should access most of the array without contention, unless your "...code involving a call to a function to which I'm passing as a parameter vec2....." really accesses a lot of elements in the array.
Case 1: You do not access most elements of the array in this part of the code
Is M big enough to make parallelism useful?
Can you move the parallelism to the outer loop (with one loop for vec1 only and the other for vec2 only)?
Try moving the parallel region outward:
int vec1 [M];
int vec2 [N];
...initialization of vec1 and vec2...

#pragma omp parallel
for (int it=0; it < tot_iterations; it++) {
    if ( (it+1)%2 != 0 ) {
        #pragma omp for
        for (int j=0 ; j < N ; j++) {
            ....code involving a call to a function to which I'm passing as a parameter vec1.....
            if (something) { vec2[j]=vec2[j]-1;}
        }
    }
    else {
        #pragma omp for
        for (int i=0 ; i < M ; i++) {
            ....code involving a call to a function to which I'm passing as a parameter vec2.....
            if (something) { vec1[i]=vec1[i]-1;}
        }
    }
}
This should not change much, but some implementations have a costly parallel region creation.
Case 2: You access every element with every thread
I would say you can't do that if you perform updates; otherwise you may have concurrency issues, because you have an order dependency in the loop.

Nested data environment with different subparts of the same array

Here is my question about OpenACC.
I read the APIs (v1 and v2), and the behavior of nested data environments with different subparts of the same array is unclear to me.
Code example:
#pragma acc data pcopyin(a[0:20])
{
    #pragma acc data pcopyin(a[100:20])
    {
        #pragma acc parallel loop
        for(i=0; i<20; i++) {
            a[i] = i;
            a[i+100] = i;
        }
    }
}
My understanding is that this should work (or at least the two acc data parts):
The first pragma checks if a[0,20] is on the accelerator
NO -> data are allocated on the device and transferred
The second pragma checks if a[100,120] is on the accelerator
The pointer a is on the accelerator, but not the data from a[100,120]
The data are allocated on the device and transferred
I tried this kind of thing with the CAPS compiler (v3.3.0, which is the only one available right now on my test machine), and the second pragma acc data returns an error (my second subarray doesn't have the correct shape).
So what happens in my test (I suppose) is that the pointer "a" was found on the accelerator, but the shape associated with it ([0:20]) is not the same as in my second pragma ([100:20]).
Is this the normal behavior planned in the API, or should my example work?
Moreover, if this is supposed to work, is there some sort of coherence between the subparts of the same array (somehow, they will be positioned as on the host and I will be able to put a[i] += a[100+i] in my kernel)?
The present test only checks whether "a" is on the device. Hence, when the second data region is encountered, "a" is already on the device, but only partially. A better method is to add a pointer that points into "a" and reference that pointer on the device. Something like:
#include <stdio.h>

int main () {
    int a[200];
    int *b;
    int i;

    for(i=0; i<200; i++) a[i] = 0;
    b=a+100;

    #pragma acc data pcopy(a[0:20])
    {
        #pragma acc data pcopy(b[0:20])
        {
            #pragma acc parallel loop
            for(i=0; i<20; i++) {
                a[i] = i;
                b[i] = i;
            }
        }
    }

    for(i=0; i<22; i++) printf("%d = %d \n", i, a[i]);
    for(i=100; i<122; i++) printf("%d = %d \n", i, a[i]);
    return 0;
}
If you had just copied "a[100:20]", then accessing outside this range would be considered a programmer error.
Hope this helps,
Mat

openmp: difference in combining 2 for loops & not combining

What is the difference between combining two for loops and parallelizing them together, and parallelizing them separately?
Example
1. not parallelizing together
#pragma omp parallel for
for(i = 0; i < 100; i++) {
    //.... some code
}

#pragma omp parallel for
for(i = 0; i < 1000; i++) {
    //.... some code
}
2. parallelizing together
#pragma omp parallel
{
    #pragma omp for
    for(i = 0; i < 100; i++) {
        //.... some code
    }

    #pragma omp for
    for(i = 0; i < 1000; i++) {
        //.... some code
    }
}
Which code is better, and why?
One might expect a small win from the second, because in the first you fork/join (or the functional equivalent) the OMP threads twice rather than once. Whether it makes any actual difference for your code is an empirical question best answered by measurement.
The second can also have a more significant advantage if the work in the two loops is independent, you can start the second at any time, and there's reason to expect some load imbalance in the first loop. In that case, you can add a nowait clause to the first omp for and, rather than all threads waiting until that loop ends, whoever is done first can immediately start working on the second loop. Alternatively, you could put the two chunks of code each in a section, or in a task. In general, you have a lot of control over what threads do and how they do it within a parallel section; once you end the parallel section, you lose that flexibility - everything has to join together and you're done.
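As a rough sketch of the nowait idea, under the assumption that the two loops really are independent:

#pragma omp parallel
{
    #pragma omp for nowait          // no barrier here: threads that finish early move straight on
    for(i = 0; i < 100; i++) {
        //.... some code
    }

    #pragma omp for                 // implicit barrier at the end of this loop as usual
    for(i = 0; i < 1000; i++) {
        //.... some code
    }
}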
