I am trying to use OpenMP to parallelize my code. However, the code didn't speed up; it actually ran about 10 times slower.
code:
int N = 10000;
int i, count = 0;
double x, y;
#pragma omp parallel for shared(N) private(i,x,y) reduction(+:count)
for (i = 0; i < N; i++) {
    x = rand() / ((double)RAND_MAX + 1);
    y = rand() / ((double)RAND_MAX + 1);
    if (x*x + y*y < 1) {
        ++count;
    }
}
double pi = 4.0 * count / N;
I think it might be because of the if statement?
Thanks for any help!
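For what it's worth, a common cause of a slowdown like this is rand() itself rather than the if statement: rand() keeps hidden shared state and is typically serialized internally, so the threads queue up on every call. Here is a minimal sketch of the same loop using the POSIX rand_r() with one seed per thread (the seed values are arbitrary; needs <stdlib.h> and <omp.h>):
int N = 10000;
int count = 0;
#pragma omp parallel reduction(+:count)
{
    unsigned int seed = 1234u + omp_get_thread_num();   /* per-thread RNG state */
    #pragma omp for
    for (int i = 0; i < N; i++) {
        double x = rand_r(&seed) / ((double)RAND_MAX + 1);
        double y = rand_r(&seed) / ((double)RAND_MAX + 1);
        if (x*x + y*y < 1)
            ++count;
    }
}
double pi = 4.0 * count / N;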
I have been trying to parallelize this loop with OpenMP:
#define AX(i,j,k) (Ax[((k)*n+(j))*n+(i)])
for (int k = k1; k < k2; ++k) {
    for (int j = j1; j < j2; ++j) {
        for (int i = i1; i < i2; ++i) {
            double xx = AX(i,j,k);
            double xn = (i > 0) ? AX(i-1,j,k) : 0;
            double xe = (j > 0) ? AX(i,j-1,k) : 0;
            double xu = (k > 0) ? AX(i,j,k-1) : 0;
            AX(i,j,k) = (xx+xn+xe+xu)/6*w;
        }
    }
}
#undef AX
I put this at the top of this code:
#pragma omp parallel for private (k,j,i) shared(Ax)
I noticed, however, that the #pragma is not working correctly: the function is faster, but it generates inconsistent results (probably due to data dependencies).
I probably have to add another clause or change something in the code, but I have no idea what.
EDIT:
Okay, thank you, I understand why it is not working, but I tried what you said and unfortunately it is still not working. I now know what the problem is, but I don't know how to solve it.
void ssor_forward_sweep(int n, int i1, int i2, int j1, int j2, int k1, int k2,
                        double* restrict Ax, double w)
{
    int k, j, i;
    double* AxL = malloc(n * sizeof(double));
    for (int a = 0; a < n; a++) {
        AxL[a] = Ax[a];
    }
    #define AX(i,j,k) (Ax[((k)*n+(j))*n+(i)])
    #define AXL(i,j,k) (AxL[((k)*n+(j))*n+(i)])
    #pragma omp parallel for private (k,j,i) shared(Ax)
    for (k = k1; k < k2; ++k) {
        for (j = j1; j < j2; ++j) {
            for (i = i1; i < i2; ++i) {
                double xx = AXL(i,j,k);
                double xn = (i > 0) ? AXL(i-1,j,k) : 0;
                double xe = (j > 0) ? AXL(i,j-1,k) : 0;
                double xu = (k > 0) ? AXL(i,j,k-1) : 0;
                AX(i,j,k) = (xx+xn+xe+xu)/6*w;
                //AXL(i,j,k) = (xx+xn+xe+xu)/6*w;
            }
        }
    }
    #undef AX
    #undef AXL
}
I know that there is still a problem with data dependencies, but I don't know how to solve it; the modified values aren't being taken into account when computing the new ones. There may also be a problem with how I am copying the data.
When I say it is not working, I mean there is no output at all (no error and no output); it just crashes immediately.
Hope someone can help me!
Thank you so much for the help!
Best regards,
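A likely cause of the crash: the AXL(i,j,k) macro expands to AxL[((k)*n+(j))*n+(i)], so the temporary buffer must hold n*n*n doubles, but it is allocated (and copied) with room for only n. A sketch of the allocation sized for the full grid (assuming the whole cube is meant to be duplicated; memcpy needs <string.h>):
size_t total = (size_t)n * n * n;               /* AXL indexes an n x n x n grid */
double* AxL = malloc(total * sizeof(double));
memcpy(AxL, Ax, total * sizeof(double));        /* copy the full grid */
Even with the copy fixed, the data dependence remains: each AX(i,j,k) update is meant to read the already-updated neighbours at i-1, j-1 and k-1, which is exactly what makes a Gauss-Seidel-style sweep sequential (and why reading only old values from a copy changes the results). One standard workaround is a wavefront ordering: every point on the plane i+j+k == d depends only on planes with smaller d, so the points within one plane can be updated in parallel. A minimal sketch of that idea, reusing the AX macro from the question:
for (int d = i1 + j1 + k1; d <= (i2-1) + (j2-1) + (k2-1); ++d) {
    #pragma omp parallel for collapse(2)
    for (int k = k1; k < k2; ++k) {
        for (int j = j1; j < j2; ++j) {
            int i = d - k - j;                  /* stay on the plane i+j+k == d */
            if (i >= i1 && i < i2) {
                double xx = AX(i,j,k);
                double xn = (i > 0) ? AX(i-1,j,k) : 0;
                double xe = (j > 0) ? AX(i,j-1,k) : 0;
                double xu = (k > 0) ? AX(i,j,k-1) : 0;
                AX(i,j,k) = (xx+xn+xe+xu)/6*w;
            }
        }
    }
}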
I found Intel's performance suggestion for the Xeon Phi about the collapse clause in OpenMP.
#pragma omp parallel for collapse(2)
for (i = 0; i < imax; i++) {
    for (j = 0; j < jmax; j++) a[j + jmax*i] = 1.;
}
Modified example for better performance:
#pragma omp parallel for collapse(2)
for (i = 0; i < imax; i++) {
    for (j = 0; j < jmax; j++) a[k++] = 1.;
}
I tested both cases in Fortran with similar code on a regular CPU using GFortran 4.8, and both produced the correct result. A test of similar Fortran code for the latter case does not pass under GFortran 5.2.0 or Intel 14.0.
But as far as I understand, the loop body under OpenMP should avoid loop-sequence-dependent variables (which in this case is k), so why does the latter case produce a correct result and even better performance?
Here's the equivalent code for the two approaches once the collapse clause has been applied: the compiler fuses the nest into a single loop and derives the indices from the combined counter. You can see the second one is simpler.
for (int k = 0; k < imax*jmax; k++) {
    int i = k / jmax;
    int j = k % jmax;
    a[j + jmax*i] = 1.;
}

for (int k = 0; k < imax*jmax; k++) {
    a[k] = 1.;
}
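In other words, after collapsing, the linear index comes from the combined loop counter instead of being carried across iterations in k. So a dependence-free way to write the faster form by hand would simply be the fused loop; a sketch:
#pragma omp parallel for
for (int k = 0; k < imax*jmax; k++)
    a[k] = 1.;    /* index derived from the loop counter; no carried k++ */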
I am new to OpenMP and I am using it to implement the Sieve of Eratosthenes. My code is:
int check_eratothenes(int *p, int pn, int n)
{
    int count = 0;
    bool* out = new bool[int(pow(pn, 2))];
    memset(out, 0, pow(pn, 2));
    #pragma omp parallel
    for (int i = 0; i < n; i++)
    {
        int j = floor((pn + 1) / p[i]) * p[i];
        #pragma omp critical
        while (j <= pow(pn, 2))
        {
            out[j] = 1;
            j += p[i];
        }
    }
    #pragma omp parallel
    for (int i = pn+1; i < pow(pn, 2); i++)
    {
        #pragma omp critical
        if (out[i] == 0)
        {
            //cout << i << " ";
            count++;
        }
    }
    return count;
}
But the above OpenMP pragmas are wrong. The code compiles, but when it runs it takes so long to get a result that I press CTRL+C to stop it. I am at a loss as to how to solve this, since there are many loops and if statements.
Thanks in advance.
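If it helps, here is a minimal sketch of how the pragmas could be rearranged (same signature as above, but an illustration rather than the only possible fix). A bare #pragma omp parallel makes every thread execute the entire following loop, so it needs to be parallel for to split the iterations; the critical sections, which serialize all the work, can be dropped in favour of a reduction for the counting loop:
int check_eratothenes(int *p, int pn, int n)
{
    const int limit = pn * pn;           // integer bound instead of pow()
    bool *out = new bool[limit + 1]();   // value-initialized to false
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        // Each thread strikes out the multiples of its own primes.
        // Two threads may both write 1 to the same slot, which is
        // harmless in practice here.
        for (int j = ((pn + 1) / p[i]) * p[i]; j <= limit; j += p[i])
            out[j] = true;
    }
    int count = 0;
    #pragma omp parallel for reduction(+:count)
    for (int i = pn + 1; i < limit; i++)
        if (!out[i])
            count++;
    delete[] out;
    return count;
}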
Is it possible to parallelize a loop over struct members with OpenMP?
I tried the following with GCC:
point_t p;
double sum;
#pragma omp parallel for private(p) reduction(+: sum)
for (p.x = 0; p.x < N; p.x++) {
    for (p.y = 0; p.y < N; p.y++) {
        sum += foo(p);
    }
}
But that gives me a compile error:
error: expected iteration declaration or initialization before ‘p
Is this a GCC bug or is it not part of the OpenMP specs?
I don't think this is allowed in OpenMP; parallel for needs to loop over a variable, not a general lvalue. Do something like:
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int x = 0; x < N; x++)          // int, or whatever you store in a point_t
    for (int y = 0; y < N; y++) {    // declared inside the loop so each thread gets its own y
        point_t p(x, y);             // assuming C++
        sum += foo(p);
    }
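If your compiler supports OpenMP 3.0 or later, a sketch of an alternative (using the same point_t and foo as above) is to collapse the two integer loops into one parallel iteration space:
double sum = 0;
#pragma omp parallel for collapse(2) reduction(+:sum)
for (int x = 0; x < N; x++)
    for (int y = 0; y < N; y++) {    // perfectly nested, so collapse(2) is legal
        point_t p(x, y);
        sum += foo(p);
    }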
I am using GCC's implementation of OpenMP to try to parallelize a program. Basically, the assignment is to add OpenMP pragmas to obtain speedup on a program that finds amicable numbers.
The original serial program was given (shown below, except for the 3 lines I added, marked with comments at the end). We have to parallelize just the outer loop first, then just the inner loop. The outer loop was easy, and I get close to ideal speedup for a given number of processors. For the inner loop, I get much worse performance than the original serial program. Basically, what I am trying to do is a reduction on the sum variable.
Looking at the CPU usage, I am only using ~30% per core. What could be causing this? Is the program continually making new threads every time it hits the omp parallel for clause? Is there just that much more overhead in doing a barrier for the reduction? Or could it be a memory access issue (e.g. cache thrashing)? From what I have read, with most implementations of OpenMP the threads get reused over time (e.g. pooled), so I am not so sure the first problem is what is wrong.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

#define numThread 2

int main(int argc, char* argv[]) {
    int ser[29], end, i, j, a, limit, als;
    als = atoi(argv[1]);
    limit = atoi(argv[2]);
    for (i = 2; i < limit; i++) {
        ser[0] = i;
        for (a = 1; a <= als; a++) {
            ser[a] = 1;
            int prev = ser[a-1];
            if ((prev > i) || (a == 1)) {
                end = sqrt(prev);
                int sum = 0; //added this
                #pragma omp parallel for reduction(+:sum) num_threads(numThread) //added this
                for (j = 2; j <= end; j++) {
                    if (prev % j == 0) {
                        sum += j;
                        sum += prev / j;
                    }
                }
                ser[a] = sum + 1; //added this
            }
        }
        if (ser[als] == i) {
            printf("%d", i);
            for (j = 1; j < als; j++) {
                printf(", %d", ser[j]);
            }
            printf("\n");
        }
    }
}
OpenMP thread teams are instantiated on entering the parallel section. This means that thread creation is indeed repeated every time the inner loop starts.
To enable reuse of threads, use a larger parallel section (to control the lifetime of the team) and specifically control the parallelism of the outer/inner loops, like so:
Execution time for test.exe 1 1000000 went down from 43s to 22s with this fix (and the number of threads observed reflects the numThread define value + 1).
PS: Perhaps stating the obvious, but it would not appear that parallelizing the inner loop is a sound performance measure. That is likely the whole point of this exercise, though, so I won't critique the question for it.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

#define numThread 2

int main(int argc, char* argv[]) {
    int ser[29], end, i, j, a, limit, als;
    als = atoi(argv[1]);
    limit = atoi(argv[2]);
    #pragma omp parallel num_threads(numThread)  // team created once, out here
    {
        #pragma omp single                       // one thread drives the sequential outer loop
        for (i = 2; i < limit; i++) {
            ser[0] = i;
            for (a = 1; a <= als; a++) {
                ser[a] = 1;
                int prev = ser[a-1];
                if ((prev > i) || (a == 1)) {
                    end = sqrt(prev);
                    int sum = 0; //added this
                    #pragma omp parallel for reduction(+:sum) //added this
                    for (j = 2; j <= end; j++) {
                        if (prev % j == 0) {
                            sum += j;
                            sum += prev / j;
                        }
                    }
                    ser[a] = sum + 1; //added this
                }
            }
            if (ser[als] == i) {
                printf("%d", i);
                for (j = 1; j < als; j++) {
                    printf(", %d", ser[j]);
                }
                printf("\n");
            }
        }
    }
}
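If in doubt about how many threads each level actually gets (and whether the team is being reused), a small self-contained sketch using standard OpenMP introspection calls can show it; note that nested parallelism may need to be enabled (e.g. OMP_NESTED=true) for the inner region to get more than one thread:
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(2)
    {
        #pragma omp single
        {
            #pragma omp parallel for
            for (int j = 0; j < 4; j++)
                printf("level %d: thread %d of %d\n",
                       omp_get_level(), omp_get_thread_num(), omp_get_num_threads());
        }
    }
    return 0;
}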