OpenACC: nested loops with private variable

I want to accelerate these nested loops. Because of the dimension of v (NMAX = MAX(NX1, NX2, NX3)), I understand that there can be a conflict in the parallelization of the two outer loops. I tried to use the private clause:
static double **v;
if (v == NULL) {
    v = ARRAY_2D(NMAX_POINT, NVAR, double);
}

#pragma acc parallel loop present(V, U) private(v[:NMAX_POINT][:NVAR])
for (k = kbeg; k <= kend; k++){ g_k = k;
    #pragma acc loop
    for (j = jbeg; j <= jend; j++){ g_j = j;
        #pragma acc loop collapse(2)
        for (i = ibeg; i <= iend; i++) {
            for (nv = 0; nv < NVAR; nv++){
                v[i][nv] = V[nv][k][j][i];
            }
        }
        #pragma acc routine(PrimToCons) seq
        PrimToCons (v, U[k][j], ibeg, iend);
    }
}
I get these errors:
Generating present(V[:][:][:][:],U[:][:][:][:])
Generating Tesla code
144, #pragma acc loop seq
146, #pragma acc loop seq
151, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
154, /* blockIdx.x threadIdx.x collapsed */
144, Accelerator restriction: induction variable live-out from loop: g_k
Complex loop carried dependence of v->-> prevents parallelization
146, Accelerator restriction: induction variable live-out from loop: g_j
Loop carried dependence due to exposed use of v prevents parallelization
Complex loop carried dependence of V->->->->,v->-> prevents parallelization
g_k and g_j are extern int. I've never seen the message "induction variable live-out from loop" before.
EDIT:
I modified the loop as suggested, but it still doesn't work:
#pragma acc parallel loop collapse(2) present(U, V) private(v[:NMAX_POINT][:NVAR])
for (k = kbeg; k <= kend; k++){
    for (j = jbeg; j <= jend; j++){
        #pragma acc loop collapse(2)
        for (i = ibeg; i <= iend; i++) {
            for (nv = 0; nv < NVAR; nv++){
                v[i][nv] = V[nv][k][j][i];
            }
        }
        PrimToCons (v, U[k][j], ibeg, iend, g_gamma);
    }
}
I get this error:
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
It's as if the compiler cannot find v, U or V, but in the main function I use these directives:
#pragma acc enter data copyin(data)
#pragma acc enter data copyin(data.Vc[:NVAR][:NX3_TOT][:NX2_TOT][NX1_TOT], data.Uc[:NX3_TOT][:NX2_TOT][NX1_TOT][:NVAR])
data.Vc and data.Uc are V and U in the routine I want to parallelize.

g_k and g_j are extern int. I've never seen the message "induction variable live-out from loop" before.
When run in parallel, the order in which the loop iterations are executed is non-deterministic. Hence, the values of g_k and g_j after exiting the loop would be whatever values were set by the iteration that happened to run last. This creates a dependency: in order to get correct answers (i.e. answers that agree with those of a serial run), the "k" and "j" loops must be run sequentially.
If "g_k" and "g_j" were local variables, then the compiler would implicitly privatize them in order to remove this dependency. However since they are global variables, it must assume other portions of the code uses the results and hence can't assume they can be made private. If you know the variables aren't used elsewhere, then you can fix this issue by adding them to your "private" clause. Note, it doesn't appear that these variables are used the loop itself so could be removed and just assigned the values "kend" and "jend" outside of the loop.
Unless "g_k" and "g_j" are used in the "PrimToCons" subroutine? In that case, you have a bigger problem in that this would cause a race condition in that the variables values may be updated by other threads and no longer be the value expected by the subroutine. In this case, the fix would be to pass "k" and "j" as arguments to "PrimToCons" and not use "g_k" and "g_j".
As for "v", it should be private to the "j" loop as well, not just the "k" loop. To fix, the I'd recommend adding a "collapse(2)" clause to the "k" loop's pragma and remove the loop directive about the "j" loop.

Are macros (always) compatible and portable with OpenACC?

In my code I define the lower and upper bounds of different computational regions by using a structure,
typedef struct RBox_ {
    int ibeg;
    int iend;
    int jbeg;
    int jend;
    int kbeg;
    int kend;
} RBox;
I have then introduced the following macro,
#define BOX_LOOP(box, k,j,i) for (k = (box)->kbeg; k <= (box)->kend; k++) \
for (j = (box)->jbeg; j <= (box)->jend; j++) \
for (i = (box)->ibeg; i <= (box)->iend; i++)
(where box is a pointer to a RBox structure) to perform loops as follows:
#pragma acc parallel loop collapse(3) present(box, data)
BOX_LOOP(&box, k,j,i){
A[k][j][i] = ...
}
My question is: is employing the macro totally equivalent to writing the loop explicitly as below?
ibeg = box->ibeg; iend = box->iend;
jbeg = box->jbeg; jend = box->jend;
kbeg = box->kbeg; kend = box->kend;

#pragma acc parallel loop collapse(3) present(box, data)
for (k = kbeg; k <= kend; k++){
    for (j = jbeg; j <= jend; j++){
        for (i = ibeg; i <= iend; i++){
            A[k][j][i] = ...
        }
    }
}
Furthermore, are macros portable to different versions of the nvc compiler?
Preprocessor directives and user-defined macros are part of the C99 language standard, which nvc (as well as its predecessor, pgcc) has supported for quite some time (~20 years). So yes, macros will be portable to all versions of nvc.
The preprocessing step occurs very early in the compilation process. Only after the macros are applied does the compiler process the OpenACC pragmas. So yes, using the macro above is equivalent to explicitly writing out the loops.
Since the macro is expanded by the preprocessor, which runs before the OpenACC directives are interpreted, I would expect this to work exactly how you hope. What are you hoping to accomplish by writing these loops as a macro rather than as a function?
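For illustration, this is roughly what the compiler sees after the preprocessor expands BOX_LOOP (a sketch; the actual whitespace will differ, and the loop body is the placeholder from the question):
#pragma acc parallel loop collapse(3) present(box, data)
for (k = (box)->kbeg; k <= (box)->kend; k++)
    for (j = (box)->jbeg; j <= (box)->jend; j++)
        for (i = (box)->ibeg; i <= (box)->iend; i++){
            A[k][j][i] = ...
        }
The pragma itself is untouched by the preprocessor, so collapse(3) applies to the same triple loop nest in both versions. If you want to check on your side, preprocessed output (e.g. from nvc -E) should show exactly these expanded loops.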

How to traverse two-dimensional vector by OpenMP

I'm new to OpenMP and got an error that I can't fix.
Suppose I have a two-dimensional vector:
vector<vector<int>> a{{...}, {...}, ...};
I want to traverse it as
#pragma omp parallel for collapse(2)
for (int i = 0; i < a.size(); i++){
    for (int j = 0; j < a[i].size(); j++){
        work(a[i][j]);
    }
}
However, there is an error: condition expression refers to iteration variable ‘i’.
So how can I traverse the two-dimensional vector correctly?
The problem is that the end condition of the second loop depends on the first loop's index variable (a[i].size()). Only OpenMP 5.0 (or above) supports such non-rectangular collapsed loops, so if you use an earlier OpenMP version you cannot use the collapse(2) clause here. Just remove the collapse(2) clause and it will work.
Note that, if a[i].size() is the same for all i, then you can easily remove this dependency.
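For example, here is a sketch of both options, reusing "a" and "work" from the question (the second variant additionally assumes that every inner vector has the same length):
// Option 1: drop collapse(2); only the outer loop is distributed across threads.
#pragma omp parallel for
for (int i = 0; i < (int)a.size(); i++){
    for (int j = 0; j < (int)a[i].size(); j++){
        work(a[i][j]);
    }
}

// Option 2: if every row has the same length, hoist the inner bound so the
// nest becomes rectangular and collapse(2) is allowed again.
const int rows = (int)a.size();
const int cols = rows > 0 ? (int)a[0].size() : 0;
#pragma omp parallel for collapse(2)
for (int i = 0; i < rows; i++){
    for (int j = 0; j < cols; j++){
        work(a[i][j]);
    }
}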

Unexpected slowdown using omp

I'm using OMP to try to get some speedup in a small kernel. It's basically just querying a vector of unordered_sets for membership. I tried to make an optimization, but surprisingly I got a slowdown, and am really curious why.
My first pass was:
vector<unordered_set<uint16_t> > setList = getData();

#pragma omp parallel for default(shared) private(i, j) schedule(dynamic, 50)
for(i = 0; i < size; i++){
    for(j = 0; j < 500; j++){
        count = count + setList[i].count(val[j]);
    }
}
Then I thought I could maybe get a speedup by moving the setList[i] subexpression up one level of nesting and saving it in a temporary variable, by doing the following:
#pragma omp parallel for default(shared) private(i, j, currSet) schedule(dynamic, 50)
for(i = 0; i < size; i++){
    currSet = setList[i];
    for(j = 0; j < 500; j++){
        count = count + currSet.count(val[j]);
    }
}
I had thought this would save a load on each iteration of the "j" loop and give a speedup, but it actually SLOWED DOWN by about 3x. By this I mean the entire kernel took about 3 times as long to run. Thoughts on why this would occur?
Thanks!
Adding up a few integers is really not enough work to warrant starting threads for.
If you forget to add the reduction clause, you'll suffer from true sharing: all threads want to update that count variable at the same time. This makes all cores fight for the cache line containing that variable, which will considerably hurt your performance.
I just noticed that you set the schedule to be dynamic. You shouldn't. This workload can be divided at compile time already. So don't specify a schedule.
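Applying both suggestions to the original kernel, a sketch could look like this (setList, val and size come from the question; the type of count isn't shown there, so long long is an assumption):
// Each thread accumulates into its own private copy of count; the copies are
// combined with + once at the end of the parallel region. The default (static)
// schedule splits the iterations evenly with no runtime scheduling cost.
long long count = 0;
#pragma omp parallel for reduction(+:count)
for (int i = 0; i < size; i++){
    for (int j = 0; j < 500; j++){
        count += setList[i].count(val[j]);
    }
}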
As has already been stated, inter-loop dependencies, i.e. threads waiting for data from other threads, or data being accessed by multiple threads in succession, can slow a parallelized program down and should be avoided as a rule of thumb. Built-in constructs like reductions collect the individual results and combine them in an optimized fashion.
Here is a good example of a reduction being used in a case similar to yours, from the University of Utah:
int array[8] = { 1, 1, 1, 1, 1, 1, 1, 1};
int sum = 0, i;

#pragma omp parallel for reduction(+:sum)
for (i = 0; i < 8; i++) {
    sum += array[i];
}
printf("total %d\n", sum);
source: http://www.eng.utah.edu/~cs4960-01/lecture9.pdf
As an aside: variables declared inside a parallel region are automatically private, and loop iteration counters of OpenMP loop constructs are private by default, so in both cases it is not necessary to declare i private.
see wikipedia: https://en.wikipedia.org/wiki/OpenMP#Data_sharing_attribute_clauses
Data sharing attribute clauses
shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.
private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.
see stack exchange answer here: OpenMP: are local variables automatically private?
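As a small self-contained illustration of those defaults (a sketch, not taken from the cited sources): the loop counter below is private automatically because it is the iteration variable of the worksharing loop, and the variable declared inside the loop body is private simply because it is local to the region.
#include <cstdio>
#include <omp.h>

int main(){
    #pragma omp parallel for              // i: loop counter, private by default
    for (int i = 0; i < 8; i++){
        int local = i * i;                // declared inside: private to each thread
        std::printf("thread %d handles i=%d (local=%d)\n",
                    omp_get_thread_num(), i, local);
    }
    return 0;
}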

shared arrays in OpenMP

I'm trying to parallelize a piece of C++ code with OpenMP, but I'm facing some problems.
In fact, my parallelized code is not faster than the serial one.
I think I have understood the cause of this, but I'm not able to solve it.
The structure of my code is like this:
int vec1 [M];
int vec2 [N];

...initialization of vec1 and vec2...

for (int it=0; it < tot_iterations; it++) {
    if ( (it+1)%2 != 0 ) {
        #pragma omp parallel for
        for (int j=0 ; j < N ; j++) {
            ....code involving a call to a function to which I'm passing as a parameter vec1.....
            if (something) { vec2[j]=vec2[j]-1;}
        }
    }
    else {
        #pragma omp parallel for
        for (int i=0 ; i < M ; i++) {
            ....code involving a call to a function to which I'm passing as a parameter vec2.....
            if (something) { vec1[i]=vec1[i]-1;}
        }
    }
}
I thought that maybe my parallelized code is slower because multiple threads want to access the same shared array and one has to wait until another has finished, but I'm not sure how things really work. I also can't make vec1 and vec2 private, since the updates wouldn't be seen in the other iterations...
How can I improve it?
The issue you describe, where multiple threads access the same array, is called "false sharing". Unless your array is small, it should not be the bottleneck here, because #pragma omp parallel for uses static scheduling in the default implementation (with gcc at least), so each thread accesses mostly its own part of the array without contention, unless your "...code involving a call to a function to which I'm passing as a parameter vec2..." really accesses a lot of elements spread across the array.
Case 1: You do not access most elements in the array in this part of the code
Is M big enough to make parallelism useful?
Can you move the parallelism to the outer loop (with one loop for vec1 only and one for vec2 only)?
Try moving the parallel region:
int vec1 [M];
int vec2 [N];

...initialization of vec1 and vec2...

#pragma omp parallel
for (int it=0; it < tot_iterations; it++) {
    if ( (it+1)%2 != 0 ) {
        #pragma omp for
        for (int j=0 ; j < N ; j++) {
            ....code involving a call to a function to which I'm passing as a parameter vec1.....
            if (something) { vec2[j]=vec2[j]-1;}
        }
    }
    else {
        #pragma omp for
        for (int i=0 ; i < M ; i++) {
            ....code involving a call to a function to which I'm passing as a parameter vec2.....
            if (something) { vec1[i]=vec1[i]-1;}
        }
    }
}
This should not change much, but some implementations have costly parallel region creation, so creating the region once outside the iteration loop can help.
Case 2: You access every element with every thread
I would say you can't parallelize this if you perform updates; otherwise you may have concurrency issues, since you have an order dependency in the loop.

OpenMP: cannot change the value of reduction variable

After I read that the initial value of a reduction variable is set according to the operator used for the reduction, I decided that instead of remembering these default values it would be better to initialize it explicitly. So I modified the code in question by Totonga as follows:
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;

#pragma omp parallel private(x) reduction(+:sum)
{
    sum = 0.;

    #pragma omp for schedule(static)
    for (int i=0; i<num_steps; ++i)
    {
        x = (i+0.5)*dx;
        sum += 4./(1.+x*x);
    }
}
But it turns out that no matter whether I write sum = 0. or sum = 123.456, the code produces the same result (using the gcc-4.5.2 compiler). Can somebody please explain why? (With a reference to the OpenMP standard, if possible.) Thanks in advance to everybody.
P.S. Since some people object to initializing the reduction variable, I think it makes sense to expand the question a little. The code below works as expected: I initialize the reduction variable and obtain a result which DOES depend on MY initial value.
int sum;

#pragma omp parallel reduction(+:sum)
{
    sum = 1;
}
printf("Reduction sum = %d\n",sum);
The printed result will be the number of cores, and not 0.
P.P.S. I have to update my question again. User Gilles gave an insightful comment: "And upon exit of the parallel region, these local values will be reduced using the + operator, and with the initial value of the variable, prior to entering the section."
Well, the following code gives me the result 3.142592653598146, which is a badly calculated value of pi, instead of the expected 103.141592653598146 (the initial code was giving me the excellent value pi = 3.141592653598146):
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
sum = 100.;

#pragma omp parallel private(x) reduction(+:sum)
{
    #pragma omp for schedule(static)
    for (int i=0; i<num_steps; ++i)
    {
        x = (i+0.5)*dx;
        sum += 4./(1.+x*x);
    }
}
Why would you want to do that? This is just begging with all your soul for trouble. The reduction clause and the way the local copies are initialised are defined for a reason, and the idea is that you don't need to remember these initialisation values precisely because they are already right.
However, in your code, the behaviour is undefined. Let's see why...
Let's assume your initial code is this:
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;

sum = 0.;
for (int i=0; i<num_steps; ++i) {
    x = (i+0.5)*dx;
    sum += 4./(1.+x*x);
}
Well, the "normal" way of parallelising it with OpenMP would be:
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;

sum = 0.;
#pragma omp parallel for reduction(+:sum) private(x)
for (int i=0; i<num_steps; ++i) {
    x = (i+0.5)*dx;
    sum += 4./(1.+x*x);
}
Pretty straightforward, isn't it?
Now, when instead of that, you do:
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;

#pragma omp parallel private(x) reduction(+:sum)
{
    sum = 0.;

    #pragma omp for schedule(static)
    for (int i=0; i<num_steps; ++i)
    {
        x = (i+0.5)*dx;
        sum += 4./(1.+x*x);
    }
}
You have a problem... The reason is that upon entry into the parallel region, sum hadn't been initialised. So when you declare omp parallel reduction(+:sum), you create a per-thread private version of sum, initialised to the "logical" initial value corresponding to the operator of your reduction clause, namely 0 here because you asked for a + reduction. And upon exit of the parallel region, these local values are reduced using the + operator, together with the value the variable had prior to entering the region. See this excerpt from the standard for reference:
The reduction clause specifies a reduction-identifier and one or more list items. For each list item, a private copy is created in each implicit task or SIMD lane, and is initialized with the initializer value of the reduction-identifier. After the end of the region, the original list item is updated with the values of the private copies using the combiner associated with the reduction-identifier.
So in summary, upon exit you have the equivalent of sum += sum_local_0 + sum_local_1 + ... + sum_local_nbthreadsMinusOne.
Therefore, since in your code sum doesn't have any initial value, its value upon exit of the parallel region isn't defined either, and can be anything...
Now let's imagine you did indeed initialise it... Then, if instead of using the right initialiser inside the parallel region (like your sum=0.; in the code above), you used for whatever reason sum=1.; instead, the final sum won't just be incremented by 1, but by 1 times the number of threads used inside the parallel region, since the extra value is counted once per thread.
So in conclusion, just use reduction clauses and variables the "expected"/"naïve" way; that will spare you, and the people who come after you to maintain your code, a lot of trouble.
Edit: It looks like my point was not clear enough, so I'll try to explain it better:
this code:
int sum;
#pragma omp parallel reduction(+:sum)
{
    sum = 1;
}
printf("Reduction sum = %d\n",sum);
has undefined behaviour, because it is equivalent to:
int sum, numthreads;

#pragma omp parallel
#pragma omp single
numthreads = omp_get_num_threads();

sum += numthreads; // value of sum is undefined since it was never initialised
printf("Reduction sum = %d\n",sum);
Now, this code is valid:
int sum = 0; // here, sum has been initialised
#pragma omp parallel reduction(+:sum)
{
    sum = 1;
}
printf("Reduction sum = %d\n",sum);
To convince yourself, just read the snippet of the standard I gave:
After the end of the region, the original list item is updated with the values of the private copies using the combiner associated with the reduction-identifier.
So the reduction uses the combination of the private reduction variables and the original value to perform the final reduction upon exit. So if the original value wasn't set, the final value is undefined as well. And just because your compiler happens to give you a value that seems right does not mean the code is right.
Is that clearer now?
