How can I make this loop parallel in OpenMP?

Hi everyone, I got a question on an exam about parallel programming that I could not solve.
Can someone help me with this?
Question: For the following code segment, use OpenMP pragmas to make the loop parallel, or explain why the code segment is not suitable for parallel execution:
flag = 0;
for (i = 0; (i < n) & (!flag); i++) {
    a[i] = 2.3 * i;
    if (a[i] < b[i]) flag = 1;
}

As written, the loop cannot trivially be parallelised with OpenMP, because the test-expr in the loop (i.e. (i<n) & (!flag)) does not conform to the OpenMP restrictions on the canonical loop form:
test-expr
One of the following:
var relational-op ub
ub relational-op var
relational-op
One of the following: <, <=, >, >=, !=
(OpenMP standard.)
At a semantic level this is because the test on flag prevents the loop iteration count from being determinable at entry to the loop, which is what OpenMP requires.
In recent OpenMP standards there is a cancel construct which could be used here, something like this (untested, uncompiled code). Note that cancellation only takes effect if it is enabled at run time, e.g. by setting the OMP_CANCELLATION environment variable to true.
bool flag = false;
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    a[i] = 2.3 * i;
    if (a[i] < b[i]) {
        #pragma omp atomic write
        flag = true;
    }
    // Threads give up their remaining iterations once some thread has set the
    // flag. (Strictly, this read of flag should also be atomic.)
    #pragma omp cancel for if (flag)
}
However it seems unlikely that a loop with this little work in it will be profitable for parallelisation in a real code (rather than an exam question).
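If cancellation is not available (or not enabled), another option is to give up on the early exit, run all n iterations, and merely record whether the condition ever triggered, e.g. with a logical-OR reduction. This is only a sketch and its semantics differ slightly from the original loop: every a[i] gets computed instead of stopping at the first element for which a[i] < b[i].
int flag = 0;
#pragma omp parallel for reduction(||:flag)
for (int i = 0; i < n; i++) {
    a[i] = 2.3 * i;          // every element is written, unlike the serial loop
    if (a[i] < b[i])
        flag = 1;            // per-thread copies of flag are OR-ed together at the end
}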

Related

OpenMP collapse parallel for with parallel max-reduction?

I have the following nested loops that I want to collapse into one for parallelization. Unfortunately the inner loop is a max-reduction rather than a standard for loop, so the collapse(2) directive apparently can't be used here. Is there any way to collapse these two loops anyway? Thanks!
(note that s is the number of sublists and n is the length of each sublist and suppose n >> s)
#pragma omp parallel for default(shared) private(i,j)
for (i = 0; i < n; i++) {
    rank[i] = 0;
    for (j = 0; j < s; j++)
        if (rank[i] < sublistrank[j][i])
            rank[i] = sublistrank[j][i];
}
In this code the best idea is not to parallelize the inner loop at all, but to make sure it is properly vectorized. The inner loop does not access memory contiguously, which prevents vectorization and results in poor cache utilization. You should rewrite your code to ensure contiguous memory access (e.g. change the order of the indices and use sublistrank[i][j] instead of sublistrank[j][i]).
It is also beneficial to use a temporary variable for the comparisons and assign it to rank[i] after the loop.
Another comment: always use your variables in the minimum required scope; it helps the compiler create more optimized code. Putting it all together, your code should look something like this (assuming you use unsigned int for rank and the loop variables):
#pragma omp parallel for default(none) shared(sublistrank, rank, n, s)
for (unsigned int i = 0; i < n; i++) {
    unsigned int max = 0;
    for (unsigned int j = 0; j < s; j++)
        if (max < sublistrank[i][j])
            max = sublistrank[i][j];
    rank[i] = max;
}
I have compared your code and this code on Compiler Explorer. You can see that the compiler is able to vectorize the new version, but not the old one.
Note also that if n is small, the parallel overhead may be bigger than the benefit of parallelization.
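If you would rather be explicit about the vectorization than rely on the auto-vectorizer, an omp simd reduction on the inner loop is one option worth measuring. A sketch, assuming the transposed sublistrank[i][j] layout used above:
#pragma omp parallel for default(none) shared(sublistrank, rank, n, s)
for (unsigned int i = 0; i < n; i++) {
    unsigned int m = 0;
    // Ask the compiler explicitly to vectorize the max-reduction.
    #pragma omp simd reduction(max:m)
    for (unsigned int j = 0; j < s; j++)
        if (m < sublistrank[i][j])
            m = sublistrank[i][j];
    rank[i] = m;
}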

How to balance the thread number in nested case when using OpenMP?

This fabulous post teaches me a lot, but I still have a question. For the following code:
double multiply(std::vector<double> const& a, std::vector<double> const& b){
    double tmp(0);
    int active_levels = omp_get_active_level();
    #pragma omp parallel for reduction(+:tmp) if(active_levels < 1)
    for (unsigned int i = 0; i < a.size(); i++) {
        tmp += a[i] + b[i];
    }
    return tmp;
}
If multiply() is called from another parallel part:
#pragma omp parallel for
for (int i = 0; i < count; i++) {
    multiply(a[i], b[i]);
}
Because the number of outer loop iterations depends on the count variable, this is reasonable when count is a big number. But if count is only 1 and our server is a many-core machine (e.g. it has 512 cores), then the multiply() function only generates 1 thread, so in this case the server is under-utilized. BTW, the answer also mentioned:
In any case, writing such code is a bad practice. You should simply leave the parallel regions as they are and allow the end user choose whether nested parallelism should be enabled or not.
So how do I balance the number of threads in the nested case when using OpenMP?
Consider using OpenMP tasks (omp taskloop within one parallel region and an intermediate omp single). This allows you to use the OpenMP threads flexibly across the different nesting levels instead of manually defining the number of threads for each level or oversubscribing OS threads.
However, this comes at increased scheduling cost. At the end of the day there is no perfect solution that will always do best; you will have to keep measuring and analyzing your performance on practical inputs.
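As an illustration, here is a minimal, untested sketch of that parallel / single / taskloop pattern applied to the multiply() example above. The grainsize value and the array sizes in main are arbitrary assumptions, and the reduction clause on taskloop requires OpenMP 5.0.
#include <omp.h>
#include <cstddef>
#include <cstdio>
#include <vector>

double multiply(std::vector<double> const& a, std::vector<double> const& b) {
    double tmp = 0.0;
    // The inner taskloop creates tasks that any idle thread of the enclosing
    // team can execute, so this level scales even when the outer loop has only
    // one iteration. The reduction clause on taskloop needs OpenMP 5.0.
    #pragma omp taskloop reduction(+:tmp) grainsize(4096) shared(a, b)
    for (std::size_t i = 0; i < a.size(); i++)
        tmp += a[i] + b[i];
    return tmp;
}

int main() {
    int count = 1;   // even with a single outer iteration the whole team is kept busy
    std::vector<std::vector<double>> a(count, std::vector<double>(1 << 22, 1.0));
    std::vector<std::vector<double>> b(count, std::vector<double>(1 << 22, 2.0));
    std::vector<double> result(count);

    #pragma omp parallel        // create the thread team once
    #pragma omp single          // one thread generates the outer tasks...
    #pragma omp taskloop        // ...which are then distributed over the team
    for (int i = 0; i < count; i++)
        result[i] = multiply(a[i], b[i]);

    std::printf("%f\n", result[0]);   // use the result so it is not optimized away
    return 0;
}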

Avoiding race condition in OpenMP?

I was reading about OpenMP and shared memory programming and came across this pseudocode that has an integer x and two threads:
thread 1
x++;
thread 2
x--;
This will lead to a race condition, but it can be avoided. I want to avoid it using OpenMP; how should it be done?
This is how I think it can be avoided:
int x;
#pragma omp parallel shared(x) num_threads(2)
{
    int tid = omp_get_thread_num();
    if (tid == 1)
        x++;
    else
        x--;
}
I know that eliminating the race condition will lead to correct execution but also to poor performance, but I don't know why.
If more than one thread is modifying x, the code is at risk of a race condition. Taking a simplified version of your example:
int main()
{
    int x = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            ++x;
        }
        #pragma omp section
        {
            --x;
        }
    }
    return x;
}
The two threads modifying x may be interleaved with each other, meaning that the result will not necessarily be zero.
One way to protect the modifications is to wrap the read-modify-write code in a critical region.
Another, suitable for the simple operations here, is to mark the ++ and -- lines with #pragma omp atomic; that uses platform-native atomic instructions where they exist, which is lightweight compared to a critical region.
Another approach that usually works (but isn't strictly guaranteed by OpenMP) is to change the type of x to a standard atomic type. Simply changing it from int to std::atomic<int> gives you indivisible ++ and -- operators, which you can use here.
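For illustration, a minimal sketch of the atomic variant of the example above:
#include <cstdio>

int main()
{
    int x = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            // The read-modify-write is now indivisible.
            #pragma omp atomic update
            ++x;
        }
        #pragma omp section
        {
            // Likewise for the decrement.
            #pragma omp atomic update
            --x;
        }
    }
    std::printf("%d\n", x);   // always prints 0
    return 0;
}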

parallel 'task's inside an already parallelized 'for' loop in OpenMP

[Background: OpenMP v4+ on Intel's icc compiler]
I want to parallelize tasks inside a loop that is already parallelized. I saw quite a few queries on subjects close to this one, e.g.:
Parallel sections in OpenMP using a loop
Doing a section with one thread and a for-loop with multiple threads
and others with more concentrated wisdom still.
But I could not get a definite answer, other than a compile-time error message when trying it.
Code:
#pragma omp parallel for private(a,bd) reduction(+:sum)
for (int i = 0; i < 128; i++) {
    a = i % 2;
    for (int j = a; j < 128; j = j + 2) {
        u_n = 0.25 * ( u[ i*128 + (j-3) ] +
                       u[ i*128 + (j+3) ] +
                       u[ (i-1)*128 + j ] +
                       u[ (i+1)*128 + j ] );
        // #pragma omp single nowait
        // {
        //     #pragma omp task shared(sum1) firstprivate(i,j)
        //     sum1 = (u[i*128+(j-3)] + u[i*128+(j-2)] + u[i*128+(j-1)])/3;
        //     #pragma omp task shared(sum2) firstprivate(i,j)
        //     sum2 = (u[i*128+(j+3)] + u[i*128+(j+2)] + u[i*128+(j+1)])/3;
        //     #pragma omp task shared(sum3) firstprivate(i,j)
        //     sum3 = (u[(i-1)*128+j] + u[(i-2)*128+j] + u[(i-3)*128+j])/3;
        //     #pragma omp task shared(sum4) firstprivate(i,j)
        //     sum4 = (u[(i+1)*128+j] + u[(i+2)*128+j] + u[(i+3)*128+j])/3;
        // }
        // #pragma omp taskwait
        // {
        //     u_n = 0.25*(sum1+sum2+sum3+sum4);
        // }
        bd = u_n - u[i*128 + j];
        sum += bd * bd;
        u[i*128 + j] = u_n;
    }
}
In the above code, I tried replacing the u_n = 0.25 * (...); line with the 15 commented lines, to try not only to parallelize the iterations over the two for loops, but also to achieve a degree of parallelism on each of the 4 calculations (sum1 to sum4) involving the array u[].
The compile error is fairly explicit:
error: the OpenMP "single" pragma must not be enclosed by the
"parallel for" pragma
Is there a way around this so I can optimize that calculation further with OpenMP?
The single worksharing construct within a loop worksharing construct is prohibited by the standard, but you don't need it there.
The usual parallel -> single -> task setup for tasking is there to ensure that you have a thread team set up for your tasks (parallel) but that each task is only spawned once (single). You don't need the latter in a parallel for context, because each iteration is already executed only once, so you can spawn tasks directly within the loop, as in the sketch below. This seems to have the expected behaviour on both the GNU and Intel compilers, i.e. threads that have completed their own loop iterations do help other threads to execute their tasks.
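For illustration only, spawning tasks directly within the worksharing loop looks roughly like this; do_heavy_work is a hypothetical stand-in for a sizeable chunk of per-iteration work:
#include <cmath>
#include <cstdio>

// Hypothetical placeholder for a sizeable piece of work per iteration.
static double do_heavy_work(int i) {
    double s = 0.0;
    for (int k = 0; k < 1000000; k++)
        s += std::sin(i + k * 1e-6);
    return s;
}

int main() {
    double results[128];

    #pragma omp parallel for
    for (int i = 0; i < 128; i++) {
        // Each iteration spawns one task. Idle threads of the same team can
        // pick these tasks up, and every task is guaranteed to have finished
        // by the implicit barrier at the end of the parallel region.
        #pragma omp task firstprivate(i) shared(results)
        results[i] = do_heavy_work(i);
    }

    std::printf("%f\n", results[0]);   // use the results so they are not optimized away
    return 0;
}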
However, that is a bad idea to do in your case. A tiny computation such as the one of sum1 will be much faster on it's own compared to the overhead of spawning a task.
Removing all pragmas except for the parallel for, this is a very reasonable parallelization. Before further optimizing the calculation, you should measure! In particularly, you are interested in whether all your available threads are always computing something, or whether some threads finish early and wait for others (load imbalance). To measure, you should look for a parallel performance analysis tool for your platform. If that is the case, you can address it with scheduling policies, or possibly by nested parallelism in the inner loop.
A full discussion of the performance of your code is more complex, and requires a minimal, complete and verifiable example, a detailed system description, and actual measured performance numbers.

Simple openmp call for loop not working

I am writing some software that would definitely benefit from OpenMP. I am new to OpenMP, and while testing some very basic test code (see below) I noticed that the execution times are much longer with OpenMP activated (the #pragma line). Any insight is much appreciated.
int main()
{
    int number = 200;
    int max = 2000000;
    for (int t = 1; t < max; t++)
    {
        double fac = 0.0;
        #pragma omp parallel for reduction(+:fac)
        for (int n = 2; n <= number; n++)
            fac += 1;
    }
    return 0;
}
As currently written the code enters the parallel region max times. The overhead of entering a parallel region in an OpenMP program is small, but you incur it 2000000 times. You don't actually tell us what the run times are, but I can readily believe that this makes them much longer than in the serial version. I suggest you wrap the outer loop in a parallel region, not the inner loop.
Take care when you rewrite your code to ensure that the payload inside the parallel region is significant and that it returns some value(s) to the program outside the parallel region. Without these steps a crafty optimising compiler can determine that a loop returns nothing to the rest of the program and simply optimise it away.
Also insert some timing instructions (use omp_get_wtime), rerun your code and, if matters are still not satisfactory, update your question with the new information you gather.
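A minimal sketch of timing a region with omp_get_wtime (the region being timed is just a placeholder here):
#include <cstdio>
#include <omp.h>

int main()
{
    double t0 = omp_get_wtime();   // wall-clock time before the region of interest

    // ... the code you want to time goes here ...

    double t1 = omp_get_wtime();   // wall-clock time after it
    std::printf("elapsed: %f s\n", t1 - t0);
    return 0;
}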
Here is an improved version that actually works as intended. It parallelizes the outer loop rather than the inner one. When compiled without OpenMP support it takes 1.49 s, with OpenMP 0.48 s.
int main()
{
    int number = 200;
    int max = 2000000;
    #pragma omp parallel for
    for (int t = 1; t < max; t++)
    {
        double fac = 0.0;
        for (int n = 2; n <= number; n++)
            fac += 1;
    }
    return 0;
}
