In my code I define the lower and upper bounds of different computational
regions by using a structure,
typedef struct RBox_{
    int ibeg;
    int iend;
    int jbeg;
    int jend;
    int kbeg;
    int kend;
} RBox;
I have then introduced the following macro,
#define BOX_LOOP(box, k,j,i) for (k = (box)->kbeg; k <= (box)->kend; k++) \
                             for (j = (box)->jbeg; j <= (box)->jend; j++) \
                             for (i = (box)->ibeg; i <= (box)->iend; i++)
(where box is a pointer to an RBox structure) to perform loops as follows:
#pragma acc parallel loop collapse(3) present(box, data)
BOX_LOOP(&box, k,j,i){
    A[k][j][i] = ...
}
My question is: is employing the macro completely equivalent to writing the loops explicitly, as below?
ibeg = box->ibeg; iend = box->iend;
jbeg = box->jbeg; jend = box->jend;
kbeg = box->kbeg; kend = box->kend;
#pragma acc parallel loop collapse(3) present(box, data)
for (k = kbeg; k <= kend; k++){
    for (j = jbeg; j <= jend; j++){
        for (i = ibeg; i <= iend; i++){
            A[k][j][i] = ...
}}}
Furthermore, are macros portable to different versions of the nvc compiler?
Preprocessor directives and user-defined macros are part of the C99 language standard, which nvc (as well as its predecessor, pgcc) has supported for quite some time (~20 years). So yes, the macro will be portable to all versions of nvc.
The preprocessing step occurs very early in the compilation process. Only after the macros are applied does the compiler process the OpenACC pragmas. So yes, using the macro above is equivalent to explicitly writing out the loops.
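For illustration, expanding BOX_LOOP(box, k, j, i) by hand (a sketch, with box an RBox* as in the explicit version, and k, j, i already declared) gives exactly the three tightly nested loops the collapse(3) clause expects:
#pragma acc parallel loop collapse(3) present(box, data)
for (k = (box)->kbeg; k <= (box)->kend; k++)
    for (j = (box)->jbeg; j <= (box)->jend; j++)
        for (i = (box)->ibeg; i <= (box)->iend; i++){
            A[k][j][i] = ...
        }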
Since the macro is expanded by the preprocessor, which runs before the OpenACC directives are interpreted, I would expect this to work exactly as you hope. Out of curiosity, what are you hoping to accomplish by using a macro rather than writing these loops in a function?
Related
I just found out about the deepcopy flag. Until this moment I've always used -ta=tesla:managed to handle deep copy and I would like to explore the alternative.
I read this article: https://www.pgroup.com/blogs/posts/deep-copy-beta.htm which is well written but I think it does not cover my case. I have a structure of this type:
typedef struct Data_{
    double ****Vc;
    double ****Uc;
} Data;
The shape of these two arrays is not defined by elements of the struct itself, but by elements of another structure, and those are themselves only defined during the execution of the program.
How can I use the #pragma acc shape(Vc, Uc) in this case?
Without this pragma, copying the structure as follows:
int main(){
    Data data;
    Initialize(&data);
}

int Initialize(Data *data){
    data->Uc = ARRAY_4D(ntot[KDIR], ntot[JDIR], ntot[IDIR], NVAR, double);
    data->Vc = ARRAY_4D(NVAR, ntot[KDIR], ntot[JDIR], ntot[IDIR], double);
    #pragma acc enter data copyin(data)
    PrimToCons3D(data->Vc, data->Uc, grid, NULL);
}
void PrimToCons3D(double ****V, double ****U, Grid *grid, RBox *box){
    #pragma acc parallel loop collapse(3) present(V[:NVAR][:nx3_tot][:nx2_tot][:nx1_tot])
    for (k = kbeg; k <= kend; k++){
        for (j = jbeg; j <= jend; j++){
            for (i = ibeg; i <= iend; i++){
                double v[NVAR];
                #pragma acc loop
                for (nv = 0; nv < NVAR; nv++) v[nv] = V[nv][k][j][i];
            }
I get
FATAL ERROR: data in PRESENT clause was not found on device 1: name=V host:0x1fd2b80
file:/home/Prova/Src/mappers3D.c PrimToCons3D line:140
Btw, this same code works fine with -ta=tesla:managed.
Since you don't provide a full reproducing example, I wasn't able to test this, but it would look something like:
typedef struct Data_{
    int i, j, k, l;
    double ****Vc;
    double ****Uc;
    #pragma acc shape(Vc[0:l][0:k][0:j][0:i])
    #pragma acc shape(Uc[0:k][0:j][0:i][0:l])
} Data;

int Initialize(Data *data){
    /* i, j, k, l are members of Data itself, so they are set on the struct,
       not on the Vc/Uc pointers */
    data->i = ntot[IDIR];
    data->j = ntot[JDIR];
    data->k = ntot[KDIR];
    data->l = NVAR;
    data->Uc = ARRAY_4D(ntot[KDIR], ntot[JDIR], ntot[IDIR], NVAR, double);
    data->Vc = ARRAY_4D(NVAR, ntot[KDIR], ntot[JDIR], ntot[IDIR], double);
    #pragma acc enter data copyin(data)
    PrimToCons3D(data->Vc, data->Uc, grid, NULL);
}
void PrimToCons3D(double ****V, double ****U, Grid *grid, RBox *box){
    int kbeg, jbeg, ibeg, kend, jend, iend;
    #pragma acc parallel loop collapse(3) present(V, U)
    for (int k = kbeg; k <= kend; k++){
        for (int j = jbeg; j <= jend; j++){
            for (int i = ibeg; i <= iend; i++){
Though keep in mind that the "shape" and "policy" directives were not adopted by the OpenACC standard, and we (the NVHPC compiler team) only did a Beta version, which we have not maintained.
It is probably better to do a manual deep copy, which will be standard compliant. I can help with that if you can provide a reproducer that includes how you're doing the array allocation, i.e. "ARRAY_4D".
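For reference, a manual deep copy of one of these members would look something like the sketch below. This is only an illustration, not tested against your code: it assumes ARRAY_4D builds a jagged array of pointers with the extents shown above, and it would replace the single "enter data copyin(data)" in Initialize.
/* Manual deep copy of data->Uc. Each copyin of a child block also attaches
   it to its (already present) parent pointer, per OpenACC 2.6+ semantics. */
#pragma acc enter data copyin(data[0:1])
#pragma acc enter data copyin(data->Uc[0:ntot[KDIR]])
for (int k = 0; k < ntot[KDIR]; k++){
    #pragma acc enter data copyin(data->Uc[k][0:ntot[JDIR]])
    for (int j = 0; j < ntot[JDIR]; j++){
        #pragma acc enter data copyin(data->Uc[k][j][0:ntot[IDIR]])
        for (int i = 0; i < ntot[IDIR]; i++){
            #pragma acc enter data copyin(data->Uc[k][j][i][0:NVAR])
        }
    }
}
If ARRAY_4D actually allocates the values in one contiguous block (as such macros often do), it is usually much faster to copyin that block once and only deep-copy the pointer arrays, rather than issuing a copyin per innermost row as above.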
I was trying to implement a piece of parallel code and tried to synchronize the threads using an array of flags, as shown below:
// flags array set to zero initially
#pragma omp parallel for num_threads(n_threads) schedule(static, 1)
for(int i = 0; i < n; i++){
    for(int j = 0; j < i; j++) {
        while(!flag[j]);
        y[i] -= L[i][j]*y[j];
    }
    y[i] /= L[i][i];
    flag[i] = 1;
}
However, the code always gets stuck after a few iterations when I compile it with gcc -O3 -fopenmp <file_name>. I have tried different numbers of threads (2, 4, 8) and all of them lead to the loop getting stuck. By putting print statements inside critical sections, I figured out that even though the value of flag[i] gets updated to 1, the while loop is still stuck, or maybe there is some other problem with the code that I am not aware of.
I also figured out that if I do something inside the while block, like printf("Hello\n"), the problem goes away. I think there is some problem with memory consistency across threads, but I do not know how to resolve it. Any help would be appreciated.
Edit: The single threaded code I am trying to parallelise is
for(int i = 0; i < n; i++){
    for(int j = 0; j < i; j++){
        y[i] -= L[i][j]*y[j];
    }
    y[i] /= L[i][i];
}
You have a data race in your code, which is easy to fix, but the bigger problem is that you also have a loop-carried dependency: the result of your code depends on the order of execution. Try reversing the i loop without OpenMP and you will get a different result, so your code cannot be parallelized efficiently.
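For completeness, the data-race part alone could be addressed by making the flag accesses atomic, along these lines (a sketch using OpenMP 5.0 acquire/release atomics, which needs a recent compiler; it does not remove the ordering problem described above, so threads still end up waiting on one another):
// flag[] zero-initialized, as in the question
#pragma omp parallel for num_threads(n_threads) schedule(static, 1)
for(int i = 0; i < n; i++){
    for(int j = 0; j < i; j++){
        int ready = 0;
        while(!ready){
            // acquire read: once flag[j] is seen as 1, the write to y[j] is visible
            #pragma omp atomic read acquire
            ready = flag[j];
        }
        y[i] -= L[i][j]*y[j];
    }
    y[i] /= L[i][i];
    // release write: publishes y[i] before the flag is raised
    #pragma omp atomic write release
    flag[i] = 1;
}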
One possibility is to parallelize the j loop (sketched below), but the workload inside this loop is very small, so the OpenMP overhead will be significantly bigger than the speed gained by parallelization.
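A minimal sketch of that variant, assuming the same y, L and n as in the question:
for(int i = 0; i < n; i++){
    double sum_yi = y[i];
    // only the dependence-free inner sum is parallelized; the per-iteration
    // thread management overhead usually outweighs the gain
    #pragma omp parallel for reduction(-:sum_yi)
    for(int j = 0; j < i; j++){
        sum_yi -= L[i][j]*y[j];
    }
    y[i] = sum_yi/L[i][i];
}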
EDIT: In the case of your updated code, I suggest forgetting about parallelization (because of the loop-carried dependency) and making sure that the inner loop is properly vectorized, so I suggest the following:
for(int i = 0; i < n; i++){
    double sum_yi = y[i];
    #pragma GCC ivdep
    for(int j = 0; j < i; j++) {
        sum_yi -= L[i][j]*y[j];
    }
    y[i] = sum_yi/L[i][i];
}
#pragma GCC ivdep tells the compiler that there is no loop-carried dependency in the loop, so it can vectorize it safely. Do not forget to inform the compiler about the vectorization capabilities of your processor (e.g. use the -mavx2 flag if your processor is AVX2 capable).
I want to accelerate these nested loops. Because of the dimensions of v (NMAX = MAX(NX1, NX2, NX3)), I understand that there can be a conflict in the parallelization of the two outer loops. I tried to use the private clause:
static double **v;
if (v == NULL) {
    v = ARRAY_2D(NMAX_POINT, NVAR, double);
}
#pragma acc parallel loop present(V, U) private(v[:NMAX_POINT][:NVAR])
for (k = kbeg; k <= kend; k++){ g_k = k;
    #pragma acc loop
    for (j = jbeg; j <= jend; j++){ g_j = j;
        #pragma acc loop collapse(2)
        for (i = ibeg; i <= iend; i++) {
            for (nv = 0; nv < NVAR; nv++){
                v[i][nv] = V[nv][k][j][i];
            }}
        #pragma acc routine(PrimToCons) seq
        PrimToCons (v, U[k][j], ibeg, iend);
    }}
I get these errors:
Generating present(V[:][:][:][:],U[:][:][:][:])
Generating Tesla code
144, #pragma acc loop seq
146, #pragma acc loop seq
151, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
154, /* blockIdx.x threadIdx.x collapsed */
144, Accelerator restriction: induction variable live-out from loop: g_k
Complex loop carried dependence of v->-> prevents parallelization
146, Accelerator restriction: induction variable live-out from loop: g_j
Loop carried dependence due to exposed use of v prevents parallelization
Complex loop carried dependence of V->->->->,v->-> prevents parallelization
g_k and g_j are extern int. I've never seen the message "induction variable live-out from loop" before.
EDIT:
I modified the loop as suggested, but it still doesn't work:
#pragma acc parallel loop collapse(2) present(U, V) private(v[:NMAX_POINT][:NVAR])
for (k = kbeg; k <= kend; k++){
    for (j = jbeg; j <= jend; j++){
        #pragma acc loop collapse(2)
        for (i = ibeg; i <= iend; i++) {
            for (nv = 0; nv < NVAR; nv++){
                v[i][nv] = V[nv][k][j][i];
            }}
        PrimToCons (v, U[k][j], ibeg, iend, g_gamma);
    }}
I get this error:
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
It's as if the compiler cannot find v, U or V, but in the main function I use these directives:
#pragma acc enter data copyin(data)
#pragma acc enter data copyin(data.Vc[:NVAR][:NX3_TOT][:NX2_TOT][NX1_TOT], data.Uc[:NX3_TOT][:NX2_TOT][NX1_TOT][:NVAR])
data.Vc and data.Uc are the V and U in the routine I want to parallelize.
g_k and g_j are extern int. I've never seen the message "induction variable live-out from loop" before.
When run in parallel, the order in which the loop iterations are executed is non-deterministic, hence the values of g_k and g_j on exiting the loop would be those of whichever iteration happens to execute last. This creates a dependency: in order to get correct answers (i.e. answers that agree with those from a serial run), the "k" and "j" loops must be run sequentially.
If "g_k" and "g_j" were local variables, then the compiler would implicitly privatize them in order to remove this dependency. However since they are global variables, it must assume other portions of the code uses the results and hence can't assume they can be made private. If you know the variables aren't used elsewhere, then you can fix this issue by adding them to your "private" clause. Note, it doesn't appear that these variables are used the loop itself so could be removed and just assigned the values "kend" and "jend" outside of the loop.
Unless "g_k" and "g_j" are used in the "PrimToCons" subroutine? In that case, you have a bigger problem in that this would cause a race condition in that the variables values may be updated by other threads and no longer be the value expected by the subroutine. In this case, the fix would be to pass "k" and "j" as arguments to "PrimToCons" and not use "g_k" and "g_j".
As for "v", it should be private to the "j" loop as well, not just the "k" loop. To fix, the I'd recommend adding a "collapse(2)" clause to the "k" loop's pragma and remove the loop directive about the "j" loop.
OpenMP directives and braces can appear together in code. Is there any coding-style guideline for indenting nested OpenMP directives?
e.g.:
#pragma omp parallel
for (int i = 0; i < N; i++) {
    code1();
    # pragma omp for // Should this line be indented?
    for (int j = 0; j < M; j++) {
        code2();
        # pragma omp critical {
            code3(); // Should this block and the brackets be indented?
        }
    }
    code4();
}
From an OpenMP perspective there's no real guideline about how to indent the code.
The way I write the code would look like this:
#pragma omp parallel
for (int i = 0; i < N; i++) {
    code1();
#pragma omp for // Should this line be indented?
    for (int j = 0; j < M; j++) {
        code2();
#pragma omp critical
        { // this curly brace needs to go on its own line
            code3(); // Should this block and the brackets be indented?
        }
    }
    code4();
}
So, the pragmas start in the first column and the base language code follows whatever style you are using. The rationale is that if you deleted all the OpenMP pragmas, you would still get "pretty" base language code.
I also seem to recall that compiler pragmas have to have their '#' in the first column. I'll leave it to others to correct my memory on this, as I'm not sure whether ISO C/C++ actually requires it. I haven't seen any compiler lately that would enforce it.
I compute the sum from 0 to N = 100 using OpenMP. Specifically, I use the for directive with the firstprivate and lastprivate clauses to obtain the value of s from the last iteration at each thread and sum it up. The logic seems right to me, but this code sums up to 1122, while the correct result is 4950. Does anyone know why?
Thanks.
#include <stdio.h>

#define N 100

int main(){
    int s = 0;
    int i;
    #pragma omp parallel num_threads(8) //shared(s) private(i)
    {
        // s = 0;
        #pragma omp for firstprivate(s) lastprivate(s)
        for(i = 0; i < N; i++)
            s += i;
    }
    printf("sum = %d\n", s);
    return 1;
}
Edit: I don't think my question is a duplicate of this question. That question is about the difference between firstprivate and lastprivate versus private, while in my case I don't have such a problem. My question is about whether the use of lastprivate and firstprivate in this very specific example is proper. I think this question can benefit some people who have misunderstood lastprivate, as I did.