Reduction clause in OpenMP when parallel and for belong to two different statements - openmp

I am confused with the behavior of the reduction clause because when
I compile
int main() {
//default(none) shared(suma)
int suma = 0;
#pragma parallel omp default(shared) num_threads(2)
{
#pragma omp for reduction(+:suma)
for (int n = 0; n<20; n++ ) {
for ( int j = 0; j < 30; j++ ) {
suma += 1;
}
}
}
printf("suma = %d\n", suma);
assert( suma == 20*30);
return 0;
}
the compiler says that I have to declare suma as shared whereas I explicitly specified it to be shared.
reduction.cpp:28:33: error: reduction variable must be shared
#pragma omp for reduction(+:suma)
^
What drives me mad is that when I write code that is supposedly incorrect (over-parallelized) it works
int main() {
//default(none) shared(suma)
int suma = 0;
#pragma parallel omp default(shared) num_threads(2)
{
#pragma parallel omp for reduction(+:suma)
for (int n = 0; n<20; n++ ) {
for ( int j = 0; j < 30; j++ ) {
suma += 1;
}
}
}
printf("suma = %d\n", suma);
assert( suma == 20*30);
return 0;
}
At this point I don't know if I am missing something or if there is a bug in the compiler. Anyway, I am compiling with clang++ -fopenmp reduction.cpp
clang version 7.0.0 (https://git.llvm.org/git/clang.git/ bb7269ae797f282e27e47eb4ebedfa6abe826e9e) (https://git.llvm.org/git/llvm.git/ 37d8f03a3676034f21f0a652359ec4ace8d0521f)
Target: x86_64-unknown-linux-gnu
Thread model: posix

Related

Using OpenMP "for simd" in matrix-vector multiplication?

I'm currently trying to get my matrix-vector multiplication function to compare favorably with BLAS by combining #pragma omp for with #pragma omp simd, but it's not getting any speedup improvement than if I were to just use the for construct. How do I properly vectorize the inner loop with OpenMP's SIMD construct?
vector dot(const matrix& A, const vector& x)
{
assert(A.shape(1) == x.size());
vector y = xt::zeros<double>({A.shape(0)});
int i, j;
#pragma omp parallel shared(A, x, y) private(i, j)
{
#pragma omp for // schedule(static)
for (i = 0; i < y.size(); i++) { // row major
#pragma omp simd
for (j = 0; j < x.size(); j++) {
y(i) += A(i, j) * x(j);
}
}
}
return y;
}
Your directive is incorrect because there would introduce in a race condition (on y(i)). You should use a reduction in this case. Here is an example:
vector dot(const matrix& A, const vector& x)
{
assert(A.shape(1) == x.size());
vector y = xt::zeros<double>({A.shape(0)});
int i, j;
#pragma omp parallel shared(A, x, y) private(i, j)
{
#pragma omp for // schedule(static)
for (i = 0; i < y.size(); i++) { // row major
decltype(y(0)) sum = 0;
#pragma omp simd reduction(+:sum)
for (j = 0; j < x.size(); j++) {
sum += A(i, j) * x(j);
}
y(i) += sum;
}
}
return y;
}
Note that it may not be necessary faster because some compilers are able to automatically vectorize the code (ICC for example). GCC and Clang often fail to perform (advanced) SIMD reductions automatically and such a directive help them a bit. You can check the assembly code to check how the code is vectorized or enable vectorization reports (see here for GCC).

openacc create data while running inside a kernels

I'm having a task that is to be accelerated by OpenACC. I need to do dynamic memory allocation within a kernel computation. I've built a simpler demo for it as following.
#include <iostream>
using namespace std;
#pragma acc routine seq
int *routine(int init) {
int *ptr;
#pragma acc data create(ptr[:10])
for (int i = 0; i < 10; ++i) {
ptr[i] = init + i;
}
return ptr;
}
void print_array(int *arr) {
for (int i = 0; i < 10; ++i) {
cout << arr[i] << " ";
}
cout << endl;
}
int main(void) {
int *arrs[5];
#pragma acc kernels
for (int i = 0; i < 5; ++i) {
arrs[i] = routine(i);
}
for (int i = 0; i < 5; ++i) {
print_array(arrs[i]);
}
return 0;
}
In this demo, I'm trying to call the routine while running inside a kernel construct. The routine procedure wants to create some data within the GPU and put some values into it.
While I can compile the code, but it reports runtime problems as following.
lisanhu#lisanhu-XPS-15-9550:create_and_copyout$ pgc++ -o test main.cc -acc -Minfo=accel
routine(int):
6, Generating acc routine seq
main:
23, Generating implicit copyout(arrs[:])
26, Accelerator restriction: size of the GPU copy of arrs is unknown
Loop is parallelizable
Generating implicit copy(arrs[:][:])
Accelerator kernel generated
Generating Tesla code
26, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
lisanhu#lisanhu-XPS-15-9550:create_and_copyout$ ./test
call to cuStreamSynchronize returned error 715: Illegal instruction
I'm wondering what I should do to accomplish this task (dynamically allocating memory within processing of a kernel construct). Really appreciate it if you could help.
This is untested, and probably very slow, but this might do what you need it to.
int main() {
const int num = 20;
int a[x] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 0};
int* sizes = (int *)malloc(num * sizeof(int));
int *ptrs[num];
int* temp, *temp2;
int sum;
int* finished = (int *)malloc(num * sizeof(int));
for (int x = 0; x < num; ++x){
finished[x] = 0;
}
#pragma acc kernels copyin(a[0:10]) copyout(ptrs[:num][:1]) async(num*2+1)
{
#pragma acc loop private(temp)
for (int i = 0; i < num; ++i){
#pragma acc loop seq async(i)
for (int j = 0; j < 1; ++j){
temp = ptrs[x];
sizes[i] = ...
}
while (ptrs[x] != x);
ptrs[x] = routine(a, sizes[i]);
}
}
while (true){
sum = 0;
for (int x = 0; x < num; ++x){
sum += finished[x];
}
if (sum == num){
break;
}
for (int x = 0; x < num; ++x){
if (acc_async_test(x) != 0 && finished[x] == 0){
finished[x] = 1;
#pragma acc update host(sizes[x:1])
temp = (int *)malloc(size[x] * sizeof(int));
#pragma acc enter data copyin(temp[0:x])
temp2 = acc_deviceptr(temp);
ptrs[x] = temp2;
#pragma acc update device(ptrs[x:1][0:1])
}
}
}
}

openmp, for loop parallelization and critical zone error

I am new to OpenMP and I am using it to implement the Sieve of Eratosthenes, My code are:
int check_eratothenes(int *p, int pn, int n)
{
int count = 0;
bool* out = new bool[int(pow(pn, 2))];
memset(out, 0, pow(pn, 2));
#pragma omp parallel
for (int i = 0; i < n; i ++)
{
int j = floor((pn + 1) / p[i]) * p[i];
#pragma omp critical
while (j <= pow(pn, 2))
{
out[j] = 1;
j += p[i];
}
}
#pragma omp parallel
for (int i = pn+1; i < pow(pn, 2); i ++)
{
#pragma omp critical
if (out[i] == 0)
{
//cout << i << " ";
count ++;
}
}
return count;
}
But, the above OpenMP pragma is wrong. It can be complied but when it runs, it takes a lot of time to get the result, so it press CTRL + C to stop. And I felt at a loss on how to solve it . Since there are many loops and if statements.
Thanks in advance.

openmp parallel for over struct members?

Is it possible to parallelize loop over stuct members with OpenMP?
I tried the following with GCC
point_t p;
double sum;
#pragma omp parallel for private(p) reduction(+: sum)
for (p.x = 0; p.x < N; p.x++) {
for (p.y = 0; p.y < N; p.y++) {
sum += foo(p);
}
}
But that gives me a compile error
error: expected iteration declaration or initialization before ‘p
Is this a GCC bug or is it not part of the OpenMP specs?
I don't think this is allowed in OpenMP; parallel for needs to loop over a variable, not a general lvalue. Do
int x, y; // or whatever you store in a point_t
double sum;
#pragma omp parallel for reduction(+:sum)
for (x=0; x<N; x++)
for (y=0; y<N; y++) {
point_t p(x, y); // assuming C++
sum += foo(p);
}

Gaussian Elimination in OpenMP

Gaussian Elimination in OpenMP. I am new to openmp and wondering if I used my pragmas and barrier at correct places. my x values are different each time. Are they supposed to be the same??
#include <stdio.h>
int num;
double mm[6][7];
void gaussElimination();
int main() {
int i, j;
int k, s;
FILE *f = fopen("matrix.in", "r");
fscanf(f, "%d", &num);
for (i=0; i<num; ++i)
for (j=0; j<num+1; ++j)
fscanf(f, "%f", &mm[i][j]);
fclose(f);
for (i=0; i < num; i++)
for(j=0; j <num; j++);
gaussElimination();
for(k=0; k < num; ++k) {
for(s = 0; s < num+1; ++s)
printf("%3.2f\t", mm[k][s]);
printf("\n");
}
return 0;
}
void gaussElimination() {
int i, j, k, max;
double R;
// #pragma omp parallel for private (i, j)
for( i=0; i < num; ++i) {
max = i;
for(j= i+1; j < num; ++j)
if(mm[j][i] > mm[max][i])
max =j;
for(j=0; j < num+1; ++j) {
R = mm[max][j];
mm[max][j] = mm[i][j];
mm[i][j] = R;
}
#pragma omp parallel for private ( i, j)
for(j=num; j>= i; --j)
for(k=i+1; k <num; ++k)
mm[k][j] -= mm[k][i]/mm[i][i] * mm[i][j];
}
#pragma omp barrier
for(i = num-1; i >=0; --i) {
mm[i][num] = mm[i][num] / mm[i][i];
mm[i][i] = 1;
#pragma omp barrier
for(j= i - 1; j >= 0; --j) {
mm[j][num] -= mm[j][i] * mm[i][num];
mm[j][i] = 0;
}
#pragma omp barrier
}
}
With the current code, you have placed the OpenMP pragam on the the j and k loops. However, you have a private(i,j), which makes the variables i and j private (with no initial values). This should be private(j,k), because the j and k loop variables need to be private and i needs to be shared (since it is the loop bound for the j loop). The OpenMP barriers are not doing anything.

Resources