Is it possible to parallelize a loop over struct members with OpenMP?
I tried the following with GCC:
point_t p;
double sum;
#pragma omp parallel for private(p) reduction(+: sum)
for (p.x = 0; p.x < N; p.x++) {
    for (p.y = 0; p.y < N; p.y++) {
        sum += foo(p);
    }
}
But that gives me a compile error:
error: expected iteration declaration or initialization before ‘p’
Is this a GCC bug or is it not part of the OpenMP specs?
I don't think this is allowed in OpenMP: the loop in a parallel for has to iterate over a plain variable of integer (or pointer / random-access iterator) type, not a general lvalue such as a struct member. Do something like this instead:
int x, y; // or whatever you store in a point_t
double sum = 0;
// y must be private; x is the loop variable and is privatized automatically
#pragma omp parallel for private(y) reduction(+:sum)
for (x = 0; x < N; x++)
    for (y = 0; y < N; y++) {
        point_t p(x, y); // assuming C++
        sum += foo(p);
    }
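If you want both loop levels to contribute to the parallelism, another option is collapse(2), which makes OpenMP treat the two perfectly nested loops as a single iteration space. A sketch along the same lines as above, with point_t, foo and N taken from the question:
double sum = 0.0;
#pragma omp parallel for collapse(2) reduction(+:sum)
for (int x = 0; x < N; x++) {
    for (int y = 0; y < N; y++) {
        point_t p(x, y); // assuming C++, as above
        sum += foo(p);
    }
}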
I'm trying to use OpenMP to parallelize my code. However, the code didn't speed up; it was actually 10 times slower.
code:
int N = 10000;
int i, count = 0, d;
double x, y;
#pragma omp parallel for shared(N) private(i,x,y) reduction(+:count)
for (i = 0; i < N; i++) {
    x = rand() / ((double)RAND_MAX + 1);
    y = rand() / ((double)RAND_MAX + 1);
    if (x*x + y*y < 1) {
        ++count;
    }
}
double pi = 4.0 * count / N;
I think it might be because of the if statement?
Thanks for any help!
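The if statement is probably not the problem; the usual culprit in this kind of loop is rand(), which updates shared internal state (glibc, for example, protects it with a lock), so the threads serialize on every call and contend for the same data. One common fix is a per-thread generator state, e.g. rand_r() on POSIX systems; a minimal sketch (the seeding scheme is only illustrative):
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const int N = 10000;
    int count = 0;
    #pragma omp parallel reduction(+:count)
    {
        unsigned int seed = 1234u + omp_get_thread_num(); // private per-thread seed
        #pragma omp for
        for (int i = 0; i < N; i++) {
            double x = rand_r(&seed) / ((double)RAND_MAX + 1);
            double y = rand_r(&seed) / ((double)RAND_MAX + 1);
            if (x*x + y*y < 1) {
                ++count;
            }
        }
    }
    printf("pi ~= %f\n", 4.0 * count / N);
    return 0;
}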
I'm trying to compile a simple OpenACC benchmark:
void foo(const float * restrict a, int a_stride, float * restrict c, int c_stride) {
#pragma acc parallel copyin(a[0:a_stride*256]) copyout(c[0:c_stride*256])
#pragma acc loop vector(128)
    {
        for (int i = 0; i < 256; ++i) {
            float sum = 0;
            for (int j = 0; j < 256; ++j) {
                sum += *(a + a_stride * i + j);
            }
            *(c + c_stride * i) = sum;
        }
    }
}
with the Nvidia HPC SDK 21.5 and run into an error:
$ nvc++ -S tmp.cc -Wall -Wextra -O2 -acc -acclibs -Minfo=all -g -gpu=cc80
NVC++-S-0155-Illegal context for gang(num:) or worker(num:) or vector(length:) (tmp.cc: 7)
NVC++/x86-64 Linux 21.5-0: compilation completed with severe errors
Any idea what may be causing this? From what I can tell, my syntax for vector(128) is legal.
It's illegal OpenACC syntax to use "vector(value)" with a parallel construct. You need to use a "vector_length" clause on the parallel directive to define the vector length. The reason is that "parallel" defines a single compute region to be offloaded, and hence all vector loops in this region need to have the same vector length.
You can use "vector(value)" only with a "kernels" construct, since the compiler can then split the region into multiple kernels, each with a different vector length.
Option 1:
% cat test.c
void foo(const float * restrict a, int a_stride, float * restrict c, int c_stride) {
#pragma acc parallel vector_length(128) copyin(a[0:a_stride*256]) copyout(c[0:c_stride*256])
#pragma acc loop vector
    {
        for (int i = 0; i < 256; ++i) {
            float sum = 0;
            for (int j = 0; j < 256; ++j) {
                sum += *(a + a_stride * i + j);
            }
            *(c + c_stride * i) = sum;
        }
    }
}
% nvc -acc -c test.c -Minfo=accel
foo:
4, Generating copyout(c[:c_stride*256]) [if not already present]
Generating copyin(a[:a_stride*256]) [if not already present]
Generating Tesla code
5, #pragma acc loop vector(128) /* threadIdx.x */
7, #pragma acc loop seq
5, Loop is parallelizable
7, Loop is parallelizable
Option 2:
% cat test.c
void foo(const float * restrict a, int a_stride, float * restrict c, int c_stride) {
#pragma acc kernels copyin(a[0:a_stride*256]) copyout(c[0:c_stride*256])
#pragma acc loop independent vector(128)
    {
        for (int i = 0; i < 256; ++i) {
            float sum = 0;
            for (int j = 0; j < 256; ++j) {
                sum += *(a + a_stride * i + j);
            }
            *(c + c_stride * i) = sum;
        }
    }
}
% nvc -acc -c test.c -Minfo=accel
foo:
4, Generating copyout(c[:c_stride*256]) [if not already present]
Generating copyin(a[:a_stride*256]) [if not already present]
5, Loop is parallelizable
Generating Tesla code
5, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
7, #pragma acc loop seq
7, Loop is parallelizable
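One difference between the two -Minfo listings above is the schedule of the outer loop: Option 1 gets only vector(128) (a single gang), while Option 2 gets gang, vector(128). If you want the parallel version to spread the outer loop over multiple gangs as well, gang can be added to the loop directive; a sketch of that variation on Option 1 (not compiled here):
#pragma acc parallel vector_length(128) copyin(a[0:a_stride*256]) copyout(c[0:c_stride*256])
#pragma acc loop gang vector
for (int i = 0; i < 256; ++i) {
    float sum = 0;
    for (int j = 0; j < 256; ++j) {
        sum += *(a + a_stride * i + j);
    }
    *(c + c_stride * i) = sum;
}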
I'm currently trying to get my matrix-vector multiplication function to compare favorably with BLAS by combining #pragma omp for with #pragma omp simd, but it isn't getting any more speedup than if I just use the for construct. How do I properly vectorize the inner loop with OpenMP's SIMD construct?
vector dot(const matrix& A, const vector& x)
{
    assert(A.shape(1) == x.size());
    vector y = xt::zeros<double>({A.shape(0)});
    int i, j;
    #pragma omp parallel shared(A, x, y) private(i, j)
    {
        #pragma omp for // schedule(static)
        for (i = 0; i < y.size(); i++) { // row major
            #pragma omp simd
            for (j = 0; j < x.size(); j++) {
                y(i) += A(i, j) * x(j);
            }
        }
    }
    return y;
}
Your directive is incorrect because it would introduce a race condition (on y(i)). You should use a reduction in this case. Here is an example:
vector dot(const matrix& A, const vector& x)
{
    assert(A.shape(1) == x.size());
    vector y = xt::zeros<double>({A.shape(0)});
    int i, j;
    #pragma omp parallel shared(A, x, y) private(i, j)
    {
        #pragma omp for // schedule(static)
        for (i = 0; i < y.size(); i++) { // row major
            double sum = 0.0; // local accumulator; y holds doubles (xt::zeros<double>)
            #pragma omp simd reduction(+:sum)
            for (j = 0; j < x.size(); j++) {
                sum += A(i, j) * x(j);
            }
            y(i) += sum;
        }
    }
    return y;
}
Note that it may not necessarily be faster, because some compilers are able to vectorize this code automatically (ICC, for example). GCC and Clang often fail to perform (advanced) SIMD reductions automatically, and such a directive helps them a bit. You can check the assembly to see how the code is vectorized, or enable vectorization reports.
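For the reports, GCC has -fopt-info-vec (vectorized loops) and -fopt-info-vec-missed (missed opportunities), and Clang has -Rpass=loop-vectorize and -Rpass-missed=loop-vectorize; the exact flag behavior is worth double-checking against your compiler version. For example (the file name is just a placeholder for wherever the function above lives):
g++ -O3 -march=native -fopenmp -fopt-info-vec-missed dot.cpp
clang++ -O3 -march=native -fopenmp -Rpass-missed=loop-vectorize dot.cpp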
I am confused by the behavior of the reduction clause, because when I compile
#include <cstdio>
#include <cassert>

int main() {
    //default(none) shared(suma)
    int suma = 0;
    #pragma parallel omp default(shared) num_threads(2)
    {
        #pragma omp for reduction(+:suma)
        for (int n = 0; n < 20; n++) {
            for (int j = 0; j < 30; j++) {
                suma += 1;
            }
        }
    }
    printf("suma = %d\n", suma);
    assert(suma == 20*30);
    return 0;
}
the compiler says that I have to declare suma as shared, even though I explicitly specified it to be shared:
reduction.cpp:28:33: error: reduction variable must be shared
#pragma omp for reduction(+:suma)
^
What drives me mad is that when I write code that is supposedly incorrect (over-parallelized), it works:
#include <cstdio>
#include <cassert>

int main() {
    //default(none) shared(suma)
    int suma = 0;
    #pragma parallel omp default(shared) num_threads(2)
    {
        #pragma parallel omp for reduction(+:suma)
        for (int n = 0; n < 20; n++) {
            for (int j = 0; j < 30; j++) {
                suma += 1;
            }
        }
    }
    printf("suma = %d\n", suma);
    assert(suma == 20*30);
    return 0;
}
At this point I don't know if I am missing something or if there is a bug in the compiler. Anyway, I am compiling with clang++ -fopenmp reduction.cpp
clang version 7.0.0 (https://git.llvm.org/git/clang.git/ bb7269ae797f282e27e47eb4ebedfa6abe826e9e) (https://git.llvm.org/git/llvm.git/ 37d8f03a3676034f21f0a652359ec4ace8d0521f)
Target: x86_64-unknown-linux-gnu
Thread model: posix
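One thing that stands out in both snippets: the directive is spelled #pragma parallel omp ... rather than #pragma omp parallel ..., which is not a recognized OpenMP directive, so the compiler ignores it (typically with at most an unknown-pragma warning). In the first snippet that leaves the #pragma omp for reduction(+:suma) without the enclosing parallel region it expects, which is likely what the "reduction variable must be shared" diagnostic is reacting to, and in the second snippet both pragmas are ignored, so the loop runs entirely sequentially, which would explain why it "works". A sketch with the directive order fixed:
#include <cstdio>
#include <cassert>

int main() {
    int suma = 0;
    // note the order: "omp parallel", not "parallel omp"
    #pragma omp parallel default(shared) num_threads(2)
    {
        #pragma omp for reduction(+:suma)
        for (int n = 0; n < 20; n++) {
            for (int j = 0; j < 30; j++) {
                suma += 1;
            }
        }
    }
    printf("suma = %d\n", suma);
    assert(suma == 20*30);
    return 0;
}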
I'm trying to use my Nvidia GeForce GT 740M for parallel programming with OpenMP and the clang-3.8 compiler.
When the code runs in parallel on the CPU, I get the desired result. However, when it runs on the GPU, the results are essentially random numbers.
Therefore, I figured that I'm not distributing my thread teams correctly and that there might be some data races. I guess I have to structure my for loops differently, but I have no idea where the mistake could be.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    const int n = 100; float a = 3.0f; float b = 2.0f;
    float *x = (float *) malloc(n * sizeof(float));
    float *y = (float *) malloc(n * sizeof(float));
    int i;
    int j;
    int k;
    double start;
    double end;
    start = omp_get_wtime();
    for (k = 0; k < n; k++) {
        x[k] = 2.0f;
        y[k] = 3.0f;
    }
    #pragma omp target data map(to:x[0:n]) map(tofrom:y[0:n]) map(to:i) map(to:j)
    {
        #pragma omp target teams
        #pragma omp distribute
        for (i = 0; i < n; i++) {
            #pragma omp parallel for
            for (j = 0; j < n; j++) {
                y[j] = a*x[j] + y[j];
            }
        }
    }
    end = omp_get_wtime();
    printf("Work took %f seconds.\n", end - start);
    free(x); free(y);
    return 0;
}
I guess that it might have something to do with the architecture of my GPU, so I'm adding this:
I'm fairly new to the topic, so thanks for your help :)
Yes, there is a race here. Different teams are reading and writing the same elements of the array 'y'. Perhaps you want something like this?
for (i = 0; i < n; i++) {
    #pragma omp target teams distribute parallel for
    for (j = 0; j < n; j++) {
        y[j] = a*x[j] + y[j];
    }
}
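Since this launches the inner loop as a separate target region on every iteration of i, it may also be worth keeping the target data region from the question around the outer loop, so x and y are mapped once and stay resident on the device across all n launches; a sketch combining the two (same variables as in the question):
#pragma omp target data map(to:x[0:n]) map(tofrom:y[0:n])
{
    for (i = 0; i < n; i++) {
        // x and y are already present on the device, so each launch reuses them
        #pragma omp target teams distribute parallel for
        for (j = 0; j < n; j++) {
            y[j] = a*x[j] + y[j];
        }
    }
}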