I am following Tim Mattson's video lectures on OpenMP, and one exercise was to find the errors in provided code that computes the area of the Mandelbrot set. Here is the solution that was provided:
#include <stdio.h>
#define NPOINTS 1000
#define MAXITER 1000
struct d_complex{
double r;
double i;
};
void testpoint(struct d_complex);
struct d_complex c;
int numoutside = 0;
int main(){
int i,j;
double area, error, eps = 1.0e-5;
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps)
for(i = 0; i<NPOINTS; i++){
for(j=0; j < NPOINTS; j++){
c.r = -2.0+2.5*(double)(i)/(double)(NPOINTS)+eps;
c.i = 1.125*(double)(j)/(double)(NPOINTS)+eps;
testpoint(c);
}
}
area=2.0*2.5*1.125*(double)(NPOINTS*NPOINTS-numoutside)/(double)(NPOINTS*NPOINTS);
error=area/(double)NPOINTS;
printf("Area of Mandlebrot set = %12.8f +/- %12.8f\n",area,error);
printf("Correct answer should be around 1.510659\n");
}
void testpoint(struct d_complex c){
// Does the iteration z=z*z+c, until |z| > 2 when point is known to be outside set
// If loop count reaches MAXITER, point is considered to be inside the set
struct d_complex z;
int iter;
double temp;
z=c;
for (iter=0; iter<MAXITER; iter++){
temp = (z.r*z.r)-(z.i*z.i)+c.r;
z.i = z.r*z.i*2+c.i;
z.r = temp;
if ((z.r*z.r+z.i*z.i)>4.0) {
#pragma omp atomic
numoutside++;
break;
}
}
}
My question is: could we use a reduction on the variable numoutside in the #pragma omp parallel for directive, like this:
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps) reduction(+:numoutside)
without the atomic construct in the testpoint function?
I tested the function without atomic, and the result was different from the one I got in the first place. Why does that happen? And while I understand the concept of mutual exclusion and why it is needed to avoid race conditions, isn't a reduction just another way of solving that problem, using private variables?
Thank you in advance.
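On the reduction question: a reduction clause privatizes only the references to numoutside that appear lexically inside the parallel for construct. testpoint refers to the file-scope variable from a called routine, where it stays shared, so with the atomic removed the increments race on the original variable while the private reduction copies stay at zero. A minimal sketch of one way to make the reduction work, assuming testpoint is changed to return a flag instead of touching a global (the names is_outside, count_outside and mandelbrot_area are hypothetical, not from the lecture code):

```cpp
#include <cassert>

struct d_complex {
    double r;
    double i;
};

// Hypothetical variant of testpoint: returns 1 if the point escapes
// (|z| > 2) within max_iter iterations, 0 otherwise. Touches no shared state.
int is_outside(d_complex c, int max_iter) {
    d_complex z = c;
    for (int iter = 0; iter < max_iter; iter++) {
        double temp = (z.r * z.r) - (z.i * z.i) + c.r;
        z.i = z.r * z.i * 2 + c.i;
        z.r = temp;
        if ((z.r * z.r + z.i * z.i) > 4.0) {
            return 1;
        }
    }
    return 0;
}

// Counts grid points outside the set on an npoints x npoints grid.
int count_outside(int npoints, int max_iter) {
    assert(npoints > 0 && max_iter > 0);
    const double eps = 1.0e-5;
    int numoutside = 0;
    // The increment is lexically inside the construct, so each thread
    // updates its own private copy; the copies are summed at the end.
    #pragma omp parallel for reduction(+:numoutside)
    for (int i = 0; i < npoints; i++) {
        for (int j = 0; j < npoints; j++) {
            d_complex c;  // declared in the loop body, hence private
            c.r = -2.0 + 2.5 * (double)i / (double)npoints + eps;
            c.i = 1.125 * (double)j / (double)npoints + eps;
            numoutside += is_outside(c, max_iter);
        }
    }
    return numoutside;
}

// Area estimate using the same formula as the code in the question.
double mandelbrot_area(int npoints, int max_iter) {
    const int numoutside = count_outside(npoints, max_iter);
    return 2.0 * 2.5 * 1.125 * (double)(npoints * npoints - numoutside)
           / (double)(npoints * npoints);
}
```

Calling mandelbrot_area(NPOINTS, MAXITER) would stand in for the loop and area computation in main. Because the pragma is the only OpenMP dependency, the sketch also compiles and runs serially when OpenMP is not enabled.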
I am currently porting some code over to OpenMP at my place of work. One of the tasks I am doing is figuring out how to speed up matrix multiplication for one of our applications.
The matrices are stored in row-major format, so A[i*cols +j] gives the A_i_j element of the matrix A.
The code looks like this (uncommenting the pragma parallelises the code):
#include <omp.h>
#include <iostream>
#include <iomanip>
#include <stdio.h>
#define NUM_THREADS 8
#define size 500
#define num_iter 10
int main (int argc, char *argv[])
{
// omp_set_num_threads(NUM_THREADS);
int *A = new int [size*size];
int *B = new int [size*size];
int *C = new int [size*size];
for (int i=0; i<size; i++)
{
for (int j=0; j<size; j++)
{
A[i*size+j] = j*1;
B[i*size+j] = i*j+2;
C[i*size+j] = 0;
}
}
double total_time = 0;
double start = 0;
for (int t=0; t<num_iter; t++)
{
start = omp_get_wtime();
int i, k;
// #pragma omp parallel for num_threads(10) private(i, k) collapse(2) schedule(dynamic)
for (int j=0; j<size; j++)
{
for (i=0; i<size; i++)
{
for (k=0; k<size; k++)
{
C[i*size+j] += A[i*size+k] * B[k*size+j];
}
}
}
total_time += omp_get_wtime() - start;
}
std::cout << std::setprecision(5) << total_time/num_iter << std::endl;
delete[] A;
delete[] B;
delete[] C;
return 0;
}
What is confusing me is the following: why is dynamic scheduling faster than static scheduling for this task? Timing the runs and taking an average shows that static scheduling is slower, which to me is a bit counterintuitive since each thread is doing the same amount of work.
Also, am I correctly speeding up my matrix multiplication code?
Parallel matrix multiplication is non-trivial (have you considered cache-blocking?). Your best bet is likely to use a BLAS library for this rather than writing it yourself. (Remember, "The best code is the code I do not have to write").
The Wikipedia article on Basic Linear Algebra Subprograms points to many implementations, a lot of which (including Intel Math Kernel Library) have free licenses.
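To illustrate what cache-blocking means here, the following is a rough sketch, not a tuned implementation and not a substitute for a BLAS. It multiplies n x n row-major matrices tile by tile, parallelizes over row tiles of C, and uses an i-k-j inner order so that B and C are walked along rows. The function name matmul_blocked and the default tile size of 64 are assumptions made for illustration; no claim is made about how it compares with the timings in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for n x n row-major matrices, processed in tile x tile blocks.
// The default tile size is an assumed starting point, not a tuned value.
void matmul_blocked(const std::vector<int>& A, const std::vector<int>& B,
                    std::vector<int>& C, int n, int tile = 64) {
    assert(n > 0 && tile > 0);
    const std::size_t total = static_cast<std::size_t>(n) * n;
    assert(A.size() == total && B.size() == total && C.size() == total);
    std::fill(C.begin(), C.end(), 0);

    // Each thread owns whole row tiles of C, so no two threads ever
    // write the same element and no synchronization is needed.
    #pragma omp parallel for schedule(static)
    for (int ii = 0; ii < n; ii += tile) {
        const int i_end = std::min(ii + tile, n);
        for (int kk = 0; kk < n; kk += tile) {
            const int k_end = std::min(kk + tile, n);
            for (int jj = 0; jj < n; jj += tile) {
                const int j_end = std::min(jj + tile, n);
                for (int i = ii; i < i_end; i++) {
                    for (int k = kk; k < k_end; k++) {
                        const int a = A[i * n + k];
                        // Innermost loop is contiguous in both B and C.
                        for (int j = jj; j < j_end; j++) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

Unlike the loop in the question, this version resets C on every call, so repeated timing iterations all do the same work.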
Which critical section style is better when collecting results into an output container?
// Insert into the output container one object at a time.
vector<float> output;
#pragma omp parallel for
for(int i=0; i<1000000; ++i)
{
float value = // compute something complicated
#pragma omp critical
{
output.push_back(value);
}
}
// Insert object into per-thread container; later aggregate those containers.
vector<float> output;
#pragma omp parallel
{
vector<float> per_thread;
#pragma omp for
for(int i=0; i<1000000; ++i)
{
float value = // compute something complicated
per_thread.push_back(value);
}
#pragma omp critical
{
output.insert(output.end(), per_thread.begin(), per_thread.end());
}
}
EDIT: the above examples were misleading because they indicated that each iteration pushes exactly one item, which is not true in my case. Here are more accurate examples:
// Insert into the output container one object at a time.
vector<float> output;
#pragma omp parallel for
for(int i=0; i<1000000; ++i)
{
int k = // compute number of items
for( int j=0; j<k; ++j)
{
float value = // compute something complicated
#pragma omp critical
{
output.push_back(value);
}
}
}
// Insert object into per-thread container; later aggregate those containers.
vector<float> output;
#pragma omp parallel
{
vector<float> per_thread;
#pragma omp for
for(int i=0; i<1000000; ++i)
{
int k = // compute number of items
for( int j=0; j<k; ++j)
{
float value = // compute something complicated
per_thread.push_back(value);
}
}
#pragma omp critical
{
output.insert(output.end(), per_thread.begin(), per_thread.end());
}
}
If you always insert exactly one item per parallel iteration, the proper way is:
std::vector<float> output(1000000);
#pragma omp parallel for
for(int i=0; i<1000000; ++i)
{
float value = // compute something complicated
output[i] = value;
}
It is threadsafe to assign distinct elements of std::vector (which is guaranteed because all i are different). And there is no significant false-sharing in this case.
If you do not insert exactly one item per parallel iteration, either version is basically correct.
Your first version using a critical in the loop can be very slow - note that if the computation is really slow, it may still be fine overall.
The per-thread container / manual reduction is generally fine. Of course it makes the order of the result non-deterministic. You could streamline this by using a user-defined reduction.
I am new to OpenMP and I am stuck on a basic operation. Here is some sample code for my question.
#include <omp.h>
int main(void)
{
int A[16] = {1,2,3,4,5 ...... 16};
#pragma omp parallel for firstprivate(A)
for(int i = 0; i < 4; i++)
{
for(int j = 0; j < 4; j++)
{
A[i*4+j] = Process(A[i*4+j]);
}
}
}
As is evident, the value of A is local to each thread. However, at the end, I want to write the part of A calculated by each thread back to the corresponding position in the global variable A. How can this be accomplished?
Simply make A shared. This is fine, because all loop iterations operate on separate elements of A. Remember that OpenMP is shared memory programming.
You can do so explicitly by using shared instead of firstprivate, or simply remove the declaration:
int A[16] = {1,2,3,4,5 ...... 16};
#pragma omp parallel for
for(int i = 0; i < 4; i++)
By default, all variables declared outside of the parallel region are shared. You can find an extended exemplary description in this answer.
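Put together, a complete version of the shared-array approach might look like the sketch below. Process is not shown in the question, so the squaring function here is only a hypothetical stand-in, and process_all is an illustrative wrapper name.

```cpp
#include <cassert>

// Hypothetical stand-in for the question's Process(); the real one is not shown.
int Process(int x) { return x * x; }

// Applies Process in place to every element of a rows x cols row-major array.
void process_all(int* A, int rows, int cols) {
    assert(A != nullptr && rows >= 0 && cols >= 0);
    // A is shared by default. i is private as the parallel loop variable,
    // and j is private because it is declared inside the loop body.
    // Every (i, j) pair maps to a distinct element, so writes never collide.
    #pragma omp parallel for
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            A[i * cols + j] = Process(A[i * cols + j]);
        }
    }
}
```

For the 16-element array in the question, the call would be process_all(A, 4, 4), after which A itself holds the results with no copy-back step.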
#include <stdio.h>
#include <omp.h>
int main()
{
int i, key=85, tid;
int a[100] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95};
#pragma omp parallel num_threads(2) private(i, tid)
{
tid = omp_get_thread_num();
#pragma omp for
for(i=0; i<100; i++)
if(a[i] == key)
{
printf("Key found. Position = %d by thread %d \n", i+1, tid);
}
}
return 0;
}
Here is my parallel program. I'm using GCC on Fedora, and the system is dual-core.
I need to compare the sequential and parallel programs for linear search and show that the parallel one is better than the sequential one.
Do I need to add the user and sys times to calculate the execution time for both the sequential and the parallel program (as the latter uses two cores)?
Please help me out. Thanks in advance.
It costs some time to set up the parallel environment. Try a much larger array. You should certainly see a speedup.
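On the timing question: speedup shows up in elapsed (wall-clock) time. User time is summed over all threads, so user plus sys will generally not shrink when a second core is used. A minimal sketch of measuring wall-clock time around a parallel search, assuming a large std::vector rather than the 100-element array; find_first and time_search are hypothetical helper names, and std::chrono is used here in place of any OpenMP timing routine:

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Returns the lowest index holding key, or -1 if key is absent.
long find_first(const std::vector<int>& a, int key) {
    const long n = static_cast<long>(a.size());
    long pos = n;  // sentinel meaning "not found"
    // Each thread tracks the smallest matching index it sees;
    // the min reduction combines the per-thread results.
    #pragma omp parallel for reduction(min : pos)
    for (long i = 0; i < n; i++) {
        if (a[i] == key && i < pos) {
            pos = i;
        }
    }
    return pos == n ? -1 : pos;
}

// Wall-clock seconds taken by one call to find_first.
double time_search(const std::vector<int>& a, int key) {
    const auto start = std::chrono::steady_clock::now();
    const long pos = find_first(a, key);
    const auto stop = std::chrono::steady_clock::now();
    assert(pos >= -1);  // also keeps the search result in use
    return std::chrono::duration<double>(stop - start).count();
}
```

Running time_search on the same large array once with OpenMP enabled and once without (or with a single thread) gives the two numbers to compare; averaging several runs reduces noise.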