mandelbrot using openMP - openmp

// return 1 if in set, 0 otherwise
int inset(double real, double img, int maxiter){
double z_real = real;
double z_img = img;
for(int iters = 0; iters < maxiter; iters++){
double z2_real = z_real*z_real-z_img*z_img;
double z2_img = 2.0*z_real*z_img;
z_real = z2_real + real;
z_img = z2_img + img;
if(z_real*z_real + z_img*z_img > 4.0) return 0;
}
return 1;
}
// count the number of points in the set, within the region
int mandelbrotSetCount(double real_lower, double real_upper, double img_lower, double img_upper, int num, int maxiter){
int count=0;
double real_step = (real_upper-real_lower)/num;
double img_step = (img_upper-img_lower)/num;
for(int real=0; real<=num; real++){
for(int img=0; img<=num; img++){
count+=inset(real_lower+real*real_step,img_lower+img*img_step,maxiter);
}
}
return count;
}
// main
int main(int argc, char *argv[]){
double real_lower;
double real_upper;
double img_lower;
double img_upper;
int num;
int maxiter;
int num_regions = (argc-1)/6;
for(int region=0;region<num_regions;region++){
// scan the arguments
sscanf(argv[region*6+1],"%lf",&real_lower);
sscanf(argv[region*6+2],"%lf",&real_upper);
sscanf(argv[region*6+3],"%lf",&img_lower);
sscanf(argv[region*6+4],"%lf",&img_upper);
sscanf(argv[region*6+5],"%i",&num);
sscanf(argv[region*6+6],"%i",&maxiter);
printf("%d\n",mandelbrotSetCount(real_lower,real_upper,img_lower,img_upper,num,maxiter));
}
return EXIT_SUCCESS;
}
I need to convert the above code into openMP. I know how to do it for a single matrix or image but i have to do it for 2 images at the same time
the arguments are as follows
$./mandelbrot -2.0 1.0 -1.0 1.0 100 10000 -1 1.0 0.0 1.0 100 10000
Any suggestion how to divide the work in to different threads for the two images and then further divide work for each image.
thanks in advance

If you want to process multiple images at a time, you need to add a #pragma omp parallel for into the loop in the main body such as:
#pragma omp parallel for private(real_lower, real_upper, img_lower, img_upper, num, maxiter)
for(int region=0;region<num_regions;region++){
// scan the arguments
sscanf(argv[region*6+1],"%lf",&real_lower);
sscanf(argv[region*6+2],"%lf",&real_upper);
sscanf(argv[region*6+3],"%lf",&img_lower);
sscanf(argv[region*6+4],"%lf",&img_upper);
sscanf(argv[region*6+5],"%i",&num);
sscanf(argv[region*6+6],"%i",&maxiter);
printf("%d\n",mandelbrotSetCount(real_lower,real_upper,img_lower,img_upper,num,maxiter));
}
Notice that some variables need to be classified as private (i.e. each thread has its own copy).
Now, if you want additional parallelism you need nested OpenMP (see nested and NESTED_OMP in OpenMP specification) as the work will be spawned by OpenMP threads -- but note that nesting may not give you a performance boost always.
In this case, what about adding a #pragma omp parallel for (with the appropriate reduction clause so that each thread accumulates into count) into the mandelbrotSetCount routine such as
// count the number of points in the set, within the region
int mandelbrotSetCount(double real_lower, double real_upper, double img_lower, double img_upper, int num, int maxiter)
{
int count=0;
double real_step = (real_upper-real_lower)/num;
double img_step = (img_upper-img_lower)/num;
#pragma omp parallel for reduction(+:count)
for(int real=0; real<=num; real++){
for(int img=0; img<=num; img++){
count+=inset(real_lower+real*real_step,img_lower+img*img_step,maxiter);
}
}
return count;
}
The whole approach would split images between threads first and then the rest of the available threads would be able to split the loop iterations in this routine among all the available threads each time you invoke the routine.
EDIT
As user Hristo suggest's on the comments, the mandelBrotSetCount routine might be unbalanced (the best reason is that the user simply requests a different number of maxiter) on each invocation. One way to address this performance issue might be to use dynamic thread scheduling in the routine. So rather than having
#pragma omp parallel for reduction(+:count)
we might want to have
#pragma omp parallel for reduction(+:count) schedule(dynamic,N)
and here N should be a relatively small value (and likely larger than 1).

Related

Difference between mutual exclusion like atomic and reduction in OpenMP

I'm am following video lectures of Tim Mattson on OpenMP and there was one exercise to find errors in provided code that count area of the Mandelbrot. So here is the solution that was provided:
#define NPOINTS 1000
#define MAXITER 1000
void testpoint(struct d_complex);
struct d_complex{
double r;
double i;
};
struct d_complex c;
int numoutside = 0;
int main(){
int i,j;
double area, error, eps = 1.0e-5;
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps)
for(i = 0; i<NPOINTS; i++){
for(j=0; j < NPOINTS; j++){
c.r = -2.0+2.5*(double)(i)/(double)(NPOINTS)+eps;
c.i = 1.125*(double)(j)/(double)(NPOINTS)+eps;
testpoint(c);
}
}
area=2.0*2.5*1.125*(double)(NPOINTS*NPOINTS-numoutside)/(double)(NPOINTS*NPOINTS);
error=area/(double)NPOINTS;
printf("Area of Mandlebrot set = %12.8f +/- %12.8f\n",area,error);
printf("Correct answer should be around 1.510659\n");
}
void testpoint(struct d_complex c){
// Does the iteration z=z*z+c, until |z| > 2 when point is known to be outside set
// If loop count reaches MAXITER, point is considered to be inside the set
struct d_complex z;
int iter;
double temp;
z=c;
for (iter=0; iter<MAXITER; iter++){
temp = (z.r*z.r)-(z.i*z.i)+c.r;
z.i = z.r*z.i*2+c.i;
z.r = temp;
if ((z.r*z.r+z.i*z.i)>4.0) {
#pragma omp atomic
numoutside++;
break;
}
}
}
The question I have is, could we use reduction in #pragma omp parallel of variable numoutside like:
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps) reduction(+:numoutside)
without atomic construct in testpoint function?
I tested the function without atomic, and the result was different from the one I got in the first place. Why does that happen? And while I understand the concept of mutual exclusion and use of it because of race conditioning, isn't reduction just another form of solving that problem with private variables?
Thank You in advance.

error in OpenMP program execution

I am trying to sum all the elements of an array which is initialized in the same code. As every element is independent from each other, I tried to perform the sum in parallel. My code is shown below:
int main(int argc, char** argv)
{
cout.precision(20);
double sumre=0.,Mre[11];
int n=11;
for(int i=0; i<n; i++)
Mre[i]=2.*exp(-10*M_PI*i/(1.*n));
#pragma omp parallel for reduction(+:sumre)
for(int i=0; i<n; i++)
{
sumre+=Mre[i];
}
cout<<sumre<<"\n";
}
which I compile and run with:
g++ -O3 -o sum sumparallel.cpp -fopenmp
./sum
respectively. My problem is that the output differs every time I run it. Sometimes it gives
2.1220129388411006488
or
2.1220129388411002047
Does anyone have an idea what is happening here?
Some of these comments hint at the problem here, but there could be two distinct issues
Double precision numbers do not have 20 decimal precision
If you want to print the maximum precision of sumre, use something like this
#include <float.h>
int maint(int argc, char* argv[])
{
...
printf("%.*g", DBL_DECIMAL_DIG, number);
return 0;
}
Floating point arithmetic is non-commutative
The effect of this property is roundoff error. In fact, the function that you have defined, a gaussian, is especially prone to roundoff for summation. Considering that workload distribution of OpenMP parallel for is undefined, you may receive different answers when you run it. To get around this you may use the kahan summation algorithm. An implementation with OpenMP would look something like this:
...
double sum = 0.0, c = 0.0;
#pragma omp parallel for reduction(+:sum, +:c)
for(i = 0; i < n; i++)
{
double y = Mre[i] - c;
double t = sum + y;
c = (t - sum) - y;
sum = t;
}
sum = sum - c;
...

OpenMP: cannot change the value of reduction variable

After I had read that the initial value of reduction variable is set according to the operator used for reduction, I decided that instead of remembering these default values it is better to initialize it explicitly. So I modified the code in question by Totonga as follows
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
#pragma omp parallel private(x) reduction(+:sum)
{
sum = 0.;
#pragma omp for schedule(static)
for (int i=0; i<num_steps; ++i)
{
x = (i+0.5)*dx;
sum += 4./(1.+x*x);
}
}
But it turns out that no matter whether I write sum = 0. or sum = 123.456 the code produces the same result (used gcc-4.5.2 compiler). Can somebody, please, explain me why? (with a reference to openmp standard, if possible) Thanks in advance to everybody.
P.S. since some people object initializing reduction variable, I think it makes sense to expand a question a little. The code below works as expected: I initialize reduction variable and obtain result, which DOES depend on MY initial value
int sum;
#pragma omp parallel reduction(+:sum)
{
sum = 1;
}
printf("Reduction sum = %d\n",sum);
The printed result will be the number of cores, and not 0.
P.P.S I have to update my question again. User Gilles gave an insightful comment: And upon exit of the parallel region, these local values will be reduced using the + operator, and with the initial value of the variable, prior to entering the section.
Well, the following code gives me the result 3.142592653598146, which is badly calculated pi instead of expected 103.141592653598146 (the initial code was giving me excellent value of pi=3.141592653598146)
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
sum = 100.;
#pragma omp parallel private(x) reduction(+:sum)
{
#pragma omp for schedule(static)
for (int i=0; i<num_steps; ++i)
{
x = (i+0.5)*dx;
sum += 4./(1.+x*x);
}
}
Why would you want to do that? This is just begging with all your soul for troubles. The reduction clause and the way the local variables used are initialised are defined for a reason, and the idea is that you don't need to remember these initialisation value just because they are already right.
However, in your code, the behaviour is undefined. Let's see why...
Let's assume your initial code is this:
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
sum = 0.;
for (int i=0; i<num_steps; ++i) {
x = (i+0.5)*dx;
sum += 4./(1.+x*x);
}
Well, the "normal" way of parallelising it with OpenMP would be:
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
sum = 0.;
#pragma omp parallel for reduction(+:sum) private(x)
for (int i=0; i<num_steps; ++i) {
x = (i+0.5)*dx;
sum += 4./(1.+x*x);
}
Pretty straightforward, isn't it?
Now, when instead of that, you do:
const int num_steps = 100000;
double x, sum, dx = 1./num_steps;
#pragma omp parallel private(x) reduction(+:sum)
{
sum = 0.;
#pragma omp for schedule(static)
for (int i=0; i<num_steps; ++i)
{
x = (i+0.5)*dx;
sum += 4./(1.+x*x);
}
}
You have a problem... The reason is that upon entry into the parallel region, sum hadn't been initialised. So when you declare omp parallel reduction(+:sum), you create a per-thread private version of sum, initialised to the "logical" initial value corresponding to the operator of you reduction clause, namely 0 here because you asked for a + reduction. And upon exit of the parallel region, these local values will be reduced using the + operator, and with the initial value of the variable, prior to entering the section. See this for reference:
The reduction clause specifies a reduction-identifier and one or more
list items. For each list item, a private copy is created in each
implicit task or SIMD lane, and is initialized with the initializer
value of the reduction-identifier. After the end of the region, the
original list item is updated with the values of the private copies
using the combiner associated with the reduction-identifier
So in summary, upon exit you have the equivalent of sum += sum_local_0 + sum_local_1 + ... sum_local_nbthreadsMinusOne
Therefore, since in your code, sum doesn't have any initial value, its value upon exit of the parallel region isn't defined as well, and can be whatever...
Now let's imagine you did indeed initialise it... Then, if instead of using the right initialiser inside the parallel region (like your sum=0.; in the hereinabove code), you used for whatever reason sum=1.; instead, then the final sum won't be just incremented by 1, but by 1 times the number of threads used inside the parallel region, since the extra value will be counted as many times as there are of threads.
So in conclusion, just use reduction clauses and variables the "expected"/"naïve" way, that will spare you and the people coming after for maintaining your code a lot of troubles.
Edit: It looks like my point was not clear enough, so I'll try to explain it better:
this code:
int sum;
#pragma omp parallel reduction(+:sum)
{
sum = 1;
}
printf("Reduction sum = %d\n",sum);
Has an undefined behaviour because it is equivalent to:
int sum, numthreads;
#pragma omp parallel
#pragma omp single
numthreads = omp_get_num_threads();
sum += numthreads; // value of sum is undefined since it never was initialised
printf("Reduction sum = %d\n",sum);
Now, this code is valid:
int sum = 0; //here, sum has been initialised
#pragma omp parallel reduction(+:sum)
{
sum = 1;
}
printf("Reduction sum = %d\n",sum);
To convince yourself, just read the snippet of the standard I gave:
After the end of the region, the
original list item is updated with the values of the private copies
using the combiner associated with the reduction-identifier
So the reduction uses the combination of the private reduction variables and the original value to perform the final reduction upon exit. So if the original value wasn't set, the final value is undefined as well. And that's not because for some reason your compiler gives you a value that seems right, that the code is right.
Is that clearer now?

OpenMP program is slower than sequential one

When I try the following code
double start = omp_get_wtime();
long i;
#pragma omp parallel for
for (i = 0; i <= 1000000000; i++) {
double x = rand();
}
double end = omp_get_wtime();
printf("%f\n", end - start);
Execution time is about 168 seconds, while the sequential version only spends 20 seconds.
I'm still a newbie in parallel programming. How could I get a parallel version that's faster that the sequential one?
The random number generator rand(3) uses global state variables (hidden in the (g)libc implementation). Access to them from multiple threads leads to cache issues and also is not thread safe. You should use the rand_r(3) call with seed parameter private to the thread:
long i;
unsigned seed;
#pragma omp parallel private(seed)
{
// Initialise the random number generator with different seed in each thread
// The following constants are chosen arbitrarily... use something more sensible
seed = 25234 + 17*omp_get_thread_num();
#pragma omp for
for (i = 0; i <= 1000000000; i++) {
double x = rand_r(&seed);
}
}
Note that this will produce different stream of random numbers when executed in parallel than when executed in serial. I would also recommend erand48(3) as a better (pseudo-)random number source.

How to parallelize an array shift with OpenMP?

How can I parallelize an array shift with OpenMP?
I've tryed a few things but didn't get any accurate results for the following example (which rotates the elements of an array of Carteira objects, for a permutation algorithm):
void rotaciona(int i)
{
Carteira aux = this->carteira[i];
for(int c = i; c < this->size - 1; c++)
{
this->carteira[c] = this->carteira[c+1];
}
this->carteira[this->size-1] = aux;
}
Thank you very much!
This is an example of a loop with loop-carried dependencies, and so can't be easily parallelized as written because the tasks (each iteration of the loop) aren't independent. Breaking the dependency can vary from a trivial modification to the completely impossible
(eg, an iteration loop).
Here, the case is somewhat in between. The issue with doing this in parallel is that you need to find out what your rightmost value is going to be before your neighbour changes the value. The OMP for construct doesn't expose to you which loop iterations values will be "yours", so I don't think you can use the OpenMP for worksharing construct to break up the loop. However, you can do it yourself; but it requires a lot more code, and it won't nicely reduce to the serial case any more.
But still, an example of how to do this is shown below. You have to break the loop up yourself, and then get your rightmost value. An OpenMP barrier ensures that no one starts modifying values until all the threads have cached their new rightmost value.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char **argv) {
int i;
char *array;
const int n=27;
array = malloc(n * sizeof(char) );
for (i=0; i<n-1; i++)
array[i] = 'A'+i;
array[n-1] = '\0';
printf("Array pre-shift = <%s>\n",array);
#pragma omp parallel default(none) shared(array) private(i)
{
int nthreads = omp_get_num_threads();
int tid = omp_get_thread_num();
int blocksize = (n-2)/nthreads;
int start = tid*blocksize;
int end = start + blocksize - 1;
if (tid == nthreads-1) end = n-2;
/* we are responsible for values start...end */
char rightval = array[end+1];
#pragma omp barrier
for (i=start; i<end; i++)
array[i] = array[i+1];
array[end] = rightval;
}
printf("Array post-shift = <%s>\n",array);
return 0;
}
Though your sample doesn't show any explicit openmp pragma's, I don't think it could work easily:
you are doing an in-place operation with overlapping regions.
If you split the loop in chunks, you'll have race conditions at the boundaries (because el[n] gets copied from el[n+1], which might already have been updated in another thread).
I suggest that you do manual chunking (which can be done), but I suspect that openmp parallel for is not flexible enough (haven't tried), so you could just have a parallell region that does the work in chunks, and fixup the boundary elements after a thread barrier/end of parallel block
Other thoughts:
if your values are POD, you can use memmove instead
if you can, simply switch to a list
.
std::list<Carteira> items(3000);
// rotation is now simply:
items.push_back(items.front());
items.erase(items.begin());

Resources