How to traverse a two-dimensional vector with OpenMP - parallel-processing

I'm new to OpenMP and got an error that I can't fix.
Suppose I have a two-dimensional vector:
vector<vector<int>> a{{...}, {...}, ...};
I want to traverse it as
#pragma omp parallel for collapse(2)
for (int i = 0; i < a.size(); i++){
    for (int j = 0; j < a[i].size(); j++){
        work(a[i][j]);
    }
}
However, there is an error: condition expression refers to iteration variable ‘i’.
So how can I traverse the two-dimensional vector correctly?

The problem is that the end condition of the second loop depends on the first loop's index variable (a[i].size()). Only OpenMP 5.0 (or above) supports such non-rectangular collapsed loops, so with an earlier OpenMP version you cannot use the collapse(2) clause here. Just remove the collapse(2) clause and it will work.
Note that if a[i].size() is the same for all i, you can easily remove this dependency.
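A minimal sketch of the collapse-free version (work() and the contents of a are placeholders from the question):
// Parallelize only the outer loop; each thread then handles whole rows,
// so the ragged inner bound a[i].size() is no longer a problem.
#pragma omp parallel for
for (size_t i = 0; i < a.size(); i++) {
    for (size_t j = 0; j < a[i].size(); j++) {
        work(a[i][j]);
    }
}
If every row has the same length, you could instead hoist something like const size_t cols = a[0].size(); out of the loops and keep collapse(2) with j < cols as the inner condition.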

Related

while loop getting stuck - Openmp

I was trying to implement a piece of parallel code and tried to synchronize the threads using an array of flags, as shown below:
// flags array set to zero initially
#pragma omp parallel for num_threads (n_threads) schedule(static, 1)
for(int i = 0; i < n; i ++){
    for(int j = 0; j < i; j++) {
        while(!flag[j]);
        y[i] -= L[i][j]*y[j];
    }
    y[i] /= L[i][i];
    flag[i] = 1;
}
However, the code always gets stuck after a few iterations when I compile it with gcc -O3 -fopenmp <file_name>. I have tried different numbers of threads (2, 4, 8), and all of them lead to the loop getting stuck. By putting print statements inside critical sections, I figured out that even though the value of flag[i] gets updated to 1, the while loop still gets stuck, or maybe there is some other problem with the code that I am not aware of.
I also figured out that if I do something inside the while loop, like printf("Hello\n"), the problem goes away. I think there is some problem with memory consistency across threads, but I do not know how to resolve it. Any help would be appreciated.
Edit: The single-threaded code I am trying to parallelise is
for(int i=0; i<n; i++){
    for(int j=0; j < i; j++){
        y[i]-=L[i][j]*y[j];
    }
    y[i]/=L[i][i];
}
You have a data race in your code, which is easy to fix, but the bigger problem is that you also have a loop-carried dependency: the result of your code depends on the order of execution. Try reversing the i loop without OpenMP; you will get a different result, so your code cannot be parallelized efficiently.
One possibility is to parallelize the j loop, but the workload inside this loop is very small, so the OpenMP overheads will be significantly bigger than the speed gained by parallelization.
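For the record, the race on flag[] itself would normally be expressed with atomic accesses and release/acquire ordering, as in the sketch below (OpenMP 5.0 memory-order clauses assumed; it only fixes the data race, does not remove the loop-carried dependency discussed above, and the busy-wait remains inefficient):
#pragma omp parallel for num_threads(n_threads) schedule(static, 1)
for (int i = 0; i < n; i++) {
    for (int j = 0; j < i; j++) {
        int ready = 0;
        while (!ready) {              // spin until y[j] has been published
            #pragma omp atomic read acquire
            ready = flag[j];
        }
        y[i] -= L[i][j]*y[j];
    }
    y[i] /= L[i][i];
    #pragma omp atomic write release
    flag[i] = 1;                      // publish y[i] for the waiting threads
}
With earlier OpenMP versions the same effect needs #pragma omp atomic combined with explicit #pragma omp flush, which is easy to get wrong.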
EDIT: In the case of your updated code, I suggest forgetting about parallelization (because of the loop-carried dependency) and making sure that the inner loop is properly vectorized, so I suggest the following:
for(int i = 0; i < n; i ++){
    double sum_yi=y[i];
    #pragma GCC ivdep
    for(int j = 0; j < i; j++) {
        sum_yi -= L[i][j]*y[j];
    }
    y[i] = sum_yi/L[i][i];
}
#pragma GCC ivdep tells the compiler that there is no loop-carried dependency in the loop, so it can vectorize it safely. Do not forget to inform the compiler about the vectorization capabilities of your processor (e.g. use the -mavx2 flag if your processor is AVX2 capable).
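For example, a compile command along these lines (assuming GCC, an AVX2-capable CPU, and a placeholder file name) also prints a report of which loops were vectorized:
gcc -O3 -mavx2 -fopt-info-vec solve.c -o solve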

OpenMP collapse parallel for with parallel max-reduction?

I have the following nested loops that I want to collapse into one for parallelization. Unfortunately, the inner loop is a max-reduction rather than a standard for loop, so the collapse(2) directive apparently can't be used here. Is there any way to collapse these two loops anyway? Thanks!
(note that s is the number of sublists and n is the length of each sublist and suppose n >> s)
#pragma omp parallel for default(shared) private(i,j)
for (i=0; i<n; i++) {
    rank[i] = 0;
    for (j=0; j<s; j++)
        if (rank[i] < sublistrank[j][i])
            rank[i] = sublistrank[j][i];
}
In this code the best idea is not to parallelize the inner loop at all, but to make sure it is properly vectorized. The inner loop does not access memory contiguously, which prevents vectorization and results in poor cache utilization. You should rewrite your code to ensure contiguous memory access (e.g. change the order of indices and use sublistrank[i][j] instead of sublistrank[j][i]).
It is also beneficial to use a temporary variable for the comparisons and assign it to rank[i] after the loop.
Another comment: always declare your variables in their minimum required scope; it also helps the compiler to create more optimized code. Putting it together, your code should look something like this (assuming you use unsigned int for rank and the loop variables):
#pragma omp parallel for default(none) shared(sublistrank, rank)
for (unsigned int i=0; i<n; i++) {
    unsigned int max=0;
    for (unsigned int j=0; j<s; j++)
        if (max < sublistrank[i][j])
            max = sublistrank[i][j];
    rank[i]=max;
}
I have compared your code and this code on Compiler Explorer. You can see that the compiler is able to vectorize the new code, but not the old one.
Note also that if n is small, the parallel overhead may be bigger than the benefit of parallelization.
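If that is a concern, one option is to make the parallelization conditional with an if clause, as in this sketch (the threshold is arbitrary and would need tuning; n and s are assumed to be ordinary variables):
// Only spin up the thread team when the trip count is large enough to pay off.
#pragma omp parallel for default(none) shared(sublistrank, rank, n, s) if(n > 10000)
for (unsigned int i = 0; i < n; i++) {
    unsigned int max = 0;
    for (unsigned int j = 0; j < s; j++)
        if (max < sublistrank[i][j])
            max = sublistrank[i][j];
    rank[i] = max;
}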

OpenACC: nested loops with private variable

I want to accelerate these nested loops. Because of the dimension of v (NMAX = MAX(NX1, NX2, NX3)), I understand that there can be a conflict in the parallelization of the two outer loops. I tried to use the private clause:
static double **v;
if (v == NULL) {
    v = ARRAY_2D(NMAX_POINT, NVAR, double);
}
#pragma acc parallel loop present(V, U) private(v[:NMAX_POINT][:NVAR])
for (k = kbeg; k <= kend; k++){ g_k = k;
    #pragma acc loop
    for (j = jbeg; j <= jend; j++){ g_j = j;
        #pragma acc loop collapse(2)
        for (i = ibeg; i <= iend; i++) {
            for (nv = 0; nv < NVAR; nv++){
                v[i][nv] = V[nv][k][j][i];
            }
        }
        #pragma acc routine(PrimToCons) seq
        PrimToCons (v, U[k][j], ibeg, iend);
    }
}
I get these errors:
Generating present(V[:][:][:][:],U[:][:][:][:])
Generating Tesla code
144, #pragma acc loop seq
146, #pragma acc loop seq
151, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
154, /* blockIdx.x threadIdx.x collapsed */
144, Accelerator restriction: induction variable live-out from loop: g_k
Complex loop carried dependence of v->-> prevents parallelization
146, Accelerator restriction: induction variable live-out from loop: g_j
Loop carried dependence due to exposed use of v prevents parallelization
Complex loop carried dependence of V->->->->,v->-> prevents parallelization
g_k and g_j are extern int. I've never seen the message "induction variable live-out from loop" before.
EDIT:
I modified the loop as suggested, but it still doesn't work:
#pragma acc parallel loop collapse(2) present(U, V) private(v[:NMAX_POINT][:NVAR])
for (k = kbeg; k <= kend; k++){
    for (j = jbeg; j <= jend; j++){
        #pragma acc loop collapse(2)
        for (i = ibeg; i <= iend; i++) {
            for (nv = 0; nv < NVAR; nv++){
                v[i][nv] = V[nv][k][j][i];
            }
        }
        PrimToCons (v, U[k][j], ibeg, iend, g_gamma);
    }
}
I get this error:
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
It's as if the compiler cannot find v, U, or V, but in the main function I use these directives:
#pragma acc enter data copyin(data)
#pragma acc enter data copyin(data.Vc[:NVAR][:NX3_TOT][:NX2_TOT][NX1_TOT], data.Uc[:NX3_TOT][:NX2_TOT][NX1_TOT][:NVAR])
data.Vc and data.Uc are V and U in the routine I want to parallelize.
"g_k and g_j are extern int. I've never seen the message "induction variable live-out from loop" before."
When run in parallel, the order in which the loop iterations are executed is non-deterministic. Hence the values of g_k and g_j upon exiting the loop would be those of whichever iteration happens to execute last. This creates a dependency, since in order to get correct answers (i.e. answers that agree with those from a serial run), the "k" and "j" loops must be run sequentially.
If "g_k" and "g_j" were local variables, the compiler would implicitly privatize them in order to remove this dependency. However, since they are global variables, it must assume other portions of the code use the results and hence can't assume they can be made private. If you know the variables aren't used elsewhere, you can fix this issue by adding them to your "private" clause. Note that it doesn't appear these variables are used in the loop itself, so they could be removed and simply assigned the values "kend" and "jend" outside of the loop.
Unless "g_k" and "g_j" are used in the "PrimToCons" subroutine? In that case, you have a bigger problem: this would cause a race condition, in that the variables' values may be updated by other threads and no longer be the values expected by the subroutine. In that case, the fix would be to pass "k" and "j" as arguments to "PrimToCons" and not use "g_k" and "g_j".
As for "v", it should be private to the "j" loop as well, not just the "k" loop. To fix this, I'd recommend adding a "collapse(2)" clause to the "k" loop's pragma and removing the loop directive on the "j" loop.

Blitz++, armadillo and OpenMP very slow

I have been trying for a very long time to learn how to parallelise, and I have been reading lots of notes on OpenMP. So I tried to use it, and the result I get is that all the places where I tried to parallelise are 5 times slower than the serial case, and I am wondering why...
My code is the following:
toevaluate is a Blitz matrix with two columns and rows rows.
storecallj and storecallk are just two Blitz vectors that I use to store the calls and avoid extra function calls.
matrix is a square Armadillo matrix with cols columns (cols = rows) and rows rows (I will use it later for something else, and it is more convenient to define it as an Armadillo matrix).
externfunction1 is a function defined elsewhere which computes the (x,y) values of the function f(a,b), where (a,b) is the input and (x,y) is the output.
problem is a string variable and normal is a boolean.
resultk and resultj are (x,y) vectors storing the output of those functions.
externfunction2 is another function defined elsewhere which computes a double by evaluating funcci (i = 1, 2), which is a Blitz polynomial. This polynomial is evaluated at a double, innerprod, to give another double, wvaluei (i = 1, 2).
I think these are all the generalities. The code is below.
{
    omp_set_dynamic(0);
    OMP_NUM_THREADS=4;
    omp_set_num_threads(OMP_NUM_THREADS);
    int chunk = int(floor(cols/OMP_NUM_THREADS));
    #pragma omp parallel shared(matrix,storecallj,toevaluate,resultj,cols,rows, storecallk,atzero,innerprod,wvalue1,wvalue2,funcc1, funcc2,storediff,checking,problem,normal,chunk) private(tid,j,k)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            printf("Initializing parallel process...\n");
        }
        #pragma omp for collapse(2) schedule (dynamic, chunk) nowait
        for(j=0; j<cols; ++j)
        {
            for(k=0; k<rows; ++k)
            {
                storecallj = toevaluate(j, All);
                externfunction1(problem,normal,storecallj,resultj);
                storecallk = toevaluate(k, All);
                storediff = storecallj-storecallk;
                if(k==j){
                    Matrix(j,k)=-atzero*sum(resultj*resultj);
                }else{
                    innerprod =sqrt(sum(storediff* storediff));
                    checking=1.0-c* innerprod;
                    if(checking>0.0)
                    {
                        externfunction1(problem,normal, storecallk,resultk);
                        externfunction2(c, innerprod, funcc1, wvalue1);
                        externfunction2(c, innerprod, funcc2, wvalue2);
                        Matrix(j,k)=-wvalue2*sum(storediff*resultj)*sum(storediff*resultk)-wvalue1*sum(resultj*resultk);
                    }
                }
            }
        }
    }
}
and it is very slow. Another thing that I have tried is:
for(j=0; j<cols; ++j)
{
    storecallj = toevaluate(j, All);
    externfunction1(problem,normal,storecallj,resultj);
    #pragma omp for collapse(2) schedule (dynamic, chunk) nowait
    for(k=0; k<rows; ++k)
    {
        storecallk = toevaluate(k, All);
        storediff = storecallj-storecallk;
        ...
I am using dynamic scheduling because I want to avoid problems in case the total number of points to evaluate is not a multiple of 4.
Could someone please give me a hand to understand why this is so slow?

OpenMP: cannot change the value of reduction variable

After I read that the initial value of a reduction variable is set according to the operator used for the reduction, I decided that instead of remembering these default values it would be better to initialize it explicitly. So I modified the code in the question by Totonga as follows:
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
#pragma omp parallel private(x) reduction(+:sum)
{
    sum = 0.;
    #pragma omp for schedule(static)
    for (int i=0; i<num_steps; ++i)
    {
        x = (i+0.5)*dx;
        sum += 4./(1.+x*x);
    }
}
But it turns out that no matter whether I write sum = 0. or sum = 123.456, the code produces the same result (using the gcc-4.5.2 compiler). Can somebody please explain to me why (with a reference to the OpenMP standard, if possible)? Thanks in advance to everybody.
P.S. Since some people object to initializing the reduction variable, I think it makes sense to expand the question a little. The code below works as expected: I initialize the reduction variable and obtain a result which DOES depend on MY initial value.
int sum;
#pragma omp parallel reduction(+:sum)
{
    sum = 1;
}
printf("Reduction sum = %d\n",sum);
The printed result will be the number of cores, and not 0.
P.P.S. I have to update my question again. User Gilles gave an insightful comment: "And upon exit of the parallel region, these local values will be reduced using the + operator, and with the initial value of the variable, prior to entering the section."
Well, the following code gives me the result 3.142592653598146, which is a badly calculated pi, instead of the expected 103.141592653598146 (the initial code was giving me an excellent value of pi = 3.141592653598146):
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
sum = 100.;
#pragma omp parallel private(x) reduction(+:sum)
{
    #pragma omp for schedule(static)
    for (int i=0; i<num_steps; ++i)
    {
        x = (i+0.5)*dx;
        sum += 4./(1.+x*x);
    }
}
Why would you want to do that? This is just begging with all your soul for trouble. The reduction clause and the way the local variables it uses are initialised are defined for a reason, and the idea is that you don't need to remember these initialisation values precisely because they are already right.
However, in your code, the behaviour is undefined. Let's see why...
Let's assume your initial code is this:
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
sum = 0.;
for (int i=0; i<num_steps; ++i) {
    x = (i+0.5)*dx;
    sum += 4./(1.+x*x);
}
Well, the "normal" way of parallelising it with OpenMP would be:
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
sum = 0.;
#pragma omp parallel for reduction(+:sum) private(x)
for (int i=0; i<num_steps; ++i) {
    x = (i+0.5)*dx;
    sum += 4./(1.+x*x);
}
Pretty straightforward, isn't it?
Now, when instead of that, you do:
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
#pragma omp parallel private(x) reduction(+:sum)
{
    sum = 0.;
    #pragma omp for schedule(static)
    for (int i=0; i<num_steps; ++i)
    {
        x = (i+0.5)*dx;
        sum += 4./(1.+x*x);
    }
}
You have a problem... The reason is that upon entry into the parallel region, sum hadn't been initialised. So when you declare omp parallel reduction(+:sum), you create a per-thread private version of sum, initialised to the "logical" initial value corresponding to the operator of your reduction clause, namely 0 here because you asked for a + reduction. And upon exit of the parallel region, these local values are reduced using the + operator, together with the initial value of the variable prior to entering the section. See this for reference:
The reduction clause specifies a reduction-identifier and one or more list items. For each list item, a private copy is created in each implicit task or SIMD lane, and is initialized with the initializer value of the reduction-identifier. After the end of the region, the original list item is updated with the values of the private copies using the combiner associated with the reduction-identifier.
So in summary, upon exit you have the equivalent of sum += sum_local_0 + sum_local_1 + ... + sum_local_nbthreadsMinusOne
Therefore, since in your code sum doesn't have any initial value, its value upon exit of the parallel region isn't defined either, and can be whatever...
Now let's imagine you did indeed initialise it... Then, if instead of using the right initialiser inside the parallel region (like your sum=0.; in the code above), you used for whatever reason sum=1.; instead, then the final sum won't just be incremented by 1, but by 1 times the number of threads used inside the parallel region, since the extra value is counted as many times as there are threads.
So in conclusion, just use reduction clauses and variables the "expected"/"naïve" way; that will spare you, and the people who come after you to maintain your code, a lot of trouble.
Edit: It looks like my point was not clear enough, so I'll try to explain it better:
this code:
int sum;
#pragma omp parallel reduction(+:sum)
{
    sum = 1;
}
printf("Reduction sum = %d\n",sum);
has undefined behaviour because it is equivalent to:
int sum, numthreads;
#pragma omp parallel
#pragma omp single
numthreads = omp_get_num_threads();
sum += numthreads; // value of sum is undefined since it never was initialised
printf("Reduction sum = %d\n",sum);
Now, this code is valid:
int sum = 0; //here, sum has been initialised
#pragma omp parallel reduction(+:sum)
{
    sum = 1;
}
printf("Reduction sum = %d\n",sum);
To convince yourself, just read the snippet of the standard I gave:
After the end of the region, the original list item is updated with the values of the private copies using the combiner associated with the reduction-identifier.
So the reduction uses the combination of the private reduction variables and the original value to perform the final reduction upon exit. So if the original value wasn't set, the final value is undefined as well. And the fact that your compiler happens to give you a value that seems right does not make the code right.
Is that clearer now?
