OpenMP: task_reduction=reduction? What is 'in_reduction'?

Is 'task_reduction' the same as 'reduction([ reduction-modifier, ] reduction-identifier : list)' with the task reduction-modifier?
If it is the same, why do we need 'task_reduction'?
And what does 'in_reduction' do?
The text says 'The in_reduction clause specifies that a task participates in a reduction'.
But what does that mean? 'in_reduction' takes the same arguments as reduction:
in_reduction(identifier : list)
But if we can put reduction variables in 'list', what does that have to do with 'a task participates in a reduction'?
I can picture how reduction works, but I can't picture what 'in_reduction' adds.
Why do we need it?
======================================
I made an example. This code should give the sum of the numbers at even indices.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[]) {
    int Array[10] = {0,1,2,3,4,5,6,7,8,9};
    int Array_length = 10;
    int counter = 0;
    int result = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup task_reduction(+:result)
        // #pragma omp parallel reduction(task, +:result) -- would this work the same as taskgroup task_reduction?
        {
            while (Array_length != counter) {
                if (counter % 2 == 0) {
                    #pragma omp task in_reduction(+:result)
                    {
                        result += Array[counter];
                    }
                } else {
                    result += Array[counter];
                }
                counter = counter + 1;
            }
        }
    }
    printf("The sum of all array elements is equal to %d.\n", result);
}
I also made an illustration to visualize my understanding.
So task_reduction creates a reduction, local results can be contributed to it, and only tasks with in_reduction participate in that reduction?
If I understand correctly, this code should give 20 as the result. However, it gives 45, which is the sum of 0 to 9.
Where did I make a mistake?
By the way, what happens if I don't write 'in_reduction' at all? Would the result then be 0?

The way task reduction works is that a task needs to know where to contribute its local result. So, what you have to do is have a taskgroup that "creates" the reduction and then have tasks contribute to it:
void example() {
    int result = 0;
    #pragma omp parallel   // create parallel team
    #pragma omp single     // have only one task creator
    {
        #pragma omp taskgroup task_reduction(+:result)
        {
            while (have_to_create_tasks()) {
                #pragma omp task in_reduction(+:result)
                {   // this task contributes to the reduction
                    result = do_something();
                }
                #pragma omp task firstprivate(result)
                {   // this task does not contribute to the reduction
                    result = do_something_else();
                }
            }
        }
    }
}
So, the in_reduction clause is needed for a task to contribute to a reduction that has been created by a task_reduction clause of the enclosing taskgroup region.
The reduction clause cannot be used with the task construct; it applies only to worksharing constructs and other loop constructs.
The only tasking construct that has a reduction clause is the taskloop construct, which uses it as a shortcut for a hidden task_reduction construct that encloses all the tasks it creates, each of which then carries a hidden in_reduction clause, too.
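For illustration, here is a minimal sketch of what that shortcut looks like in practice (my own example, not from the question; it needs a compiler with OpenMP 5.0 support):
#include <stdio.h>
#include <omp.h>

int main(void) {
    int data[100], sum = 0;
    for (int i = 0; i < 100; ++i) data[i] = i;

    #pragma omp parallel
    #pragma omp single
    {
        // reduction on taskloop acts like a hidden task_reduction on the
        // implicit taskgroup plus a hidden in_reduction on every generated task
        #pragma omp taskloop reduction(+:sum)
        for (int i = 0; i < 100; ++i)
            sum += data[i];
    }
    printf("sum = %d\n", sum);  // expected: 4950
    return 0;
}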
UPDATE (to cover the edits by the original poster):
The problem with the code is that there are now two things happening (see the inline comments in your updated code):
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[]) {
    int Array[10] = {0,1,2,3,4,5,6,7,8,9};
    int Array_length = 10;
    int counter = 0;
    int result = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup task_reduction(+:result)
        {
            // "result" is a shared variable in this parallel region
            while (Array_length != counter) {
                if (counter % 2 == 0) {
                    #pragma omp task in_reduction(+:result)
                    {
                        // This task will contribute to the reduction result
                        // as you would expect.
                        result += Array[counter];
                    }
                } else {
                    // This addition to "result" is performed by the "single"
                    // thread and thus hits the shared variable. You can see
                    // this when you print the address of "result" here
                    // and before the parallel region.
                    result += Array[counter];
                }
                counter = counter + 1;
            }
        } // Here the "single" thread waits for the taskgroup to complete
          // and the reduction to happen. So, here the shared variable
          // "result" is added to the value of "result" coming from the
          // task reduction. So, result = 25 from the "single" thread and
          // result = 20 are added up to result = 45.
    }
    printf("The sum of all array elements is equal to %d.\n", result);
}
The addition at the end of the taskgroup also looks like a race condition, as the updates coming from the single thread and the update coming from the end of the taskgroup (the reduction) are not synchronized. I guess the race does not show up because the code runs too quickly to clearly expose it.
To fix the code, you'd also have to put a task construct around the update for the odd numbers, like so:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[]) {
    int Array[10] = {0,1,2,3,4,5,6,7,8,9};
    int Array_length = 10;
    int counter = 0;
    int result = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup task_reduction(+:result)
        {
            // "result" is a shared variable in this parallel region
            while (Array_length != counter) {
                if (counter % 2 == 0) {
                    #pragma omp task in_reduction(+:result)
                    {
                        // This task will contribute to the reduction result
                        // as you would expect.
                        result += Array[counter];
                    }
                } else {
                    #pragma omp task firstprivate(result)
                    {
                        // "result" is now a task-local variable that is not
                        // shared. If you remove the firstprivate, then the
                        // race condition on the shared variable "result" is
                        // back.
                        result += Array[counter];
                    }
                }
                counter = counter + 1;
            }
        } // Here the "single" thread waits for the taskgroup to complete
          // and the reduction to happen. Now only the tasks with in_reduction
          // contribute to the shared "result", so the program prints 20.
    }
    printf("The sum of all array elements is equal to %d.\n", result);
}
In my first answer, I failed to add a proper firstprivate or private clause to the task. I'm sorry about that.

Related

What kind of problem do we have with this program (OpenMP), and how do we avoid it?

Following is my code; I want to know why a race condition happens here and how to solve it.
#include <iostream>
#include <omp.h>   // needed for omp_get_thread_num()

int main() {
    int a = 123;
    #pragma omp parallel num_threads(2)
    {
        int thread_id = omp_get_thread_num();
        int b = (thread_id + 1) * 10;
        a += b;
    }
    std::cout << "a = " << a << "\n";
    return 0;
}
A race condition occurs when two or more threads access shared data and at least one of them changes its value at the same time. In your code this line causes a race condition:
a += b;
a is a shared variable updated by 2 threads simultaneously, so the final result may be incorrect. Note that, depending on the hardware used, a possible race condition does not necessarily mean that a data race will actually occur, so the result may happen to be correct, but it is still a semantic error in your code.
To fix it you have two options:
Use an atomic operation:
    #pragma omp atomic
    a += b;
Or use a reduction:
    #pragma omp parallel num_threads(2) reduction(+:a)
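For completeness, a minimal full version using the reduction clause might look like this (a sketch based on the question's code):
#include <iostream>
#include <omp.h>

int main() {
    int a = 123;
    // Each thread gets a private copy of "a" initialized to 0 (the identity
    // for +); the private copies are combined with the original value when
    // the parallel region ends.
    #pragma omp parallel num_threads(2) reduction(+:a)
    {
        int thread_id = omp_get_thread_num();
        int b = (thread_id + 1) * 10;
        a += b;
    }
    std::cout << "a = " << a << "\n";  // 123 + 10 + 20 = 153
    return 0;
}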

How to: Bitonic Sort with OpenMP

I'm quite new to OpenMP and I have this assignment for school. I think the problem is in bitonicMerge. I have been trying a lot of variations and possibilities, and the "best solution" I found is the following:
void sort() {
    #pragma omp parallel
    {
        #pragma omp single
        recBitonicSort(0, N, ASCENDING);
    }
}

void recBitonicSort(int lo, int cnt, int dir) {
    if (cnt > 1) {
        int k = cnt / 2;
        #pragma omp task if(cnt > 1024)  // elements vary from 2^12 to 2^24
        recBitonicSort(lo, k, ASCENDING);
        #pragma omp task if(cnt > 1024)
        recBitonicSort(lo + k, k, DESCENDING);
        #pragma omp taskwait
        bitonicMerge(lo, cnt, dir);
    }
}

void bitonicMerge(int lo, int cnt, int dir) {
    if (cnt > 1) {
        int k = cnt / 2;
        int i;
        #pragma omp parallel num_threads(p)
        {
            #pragma omp for schedule(static) nowait
            for (i = lo; i < lo + k; i++)
            {
                //printf("Num of threads: %d\n", omp_get_num_threads());
                compare(i, i + k, dir);
            }
            #pragma omp single
            {
                #pragma omp task if(cnt > 1024)
                bitonicMerge(lo, k, dir);
                #pragma omp task if(cnt > 1024)
                bitonicMerge(lo + k, k, dir);
            }
        }
    }
}
The code works but has a time cost (the imperative bitonic sort takes 0.5 s, the recursive one takes 7-8 s with elements = 2^20 and maxthreads = 8). I am aware that the printf reports only 1 thread, probably because recBitonicSort assigns a single thread to bitonicMerge, but I can't find a better solution.
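One possible direction (a sketch, not a tested answer): a likely source of the slowdown is the nested parallel region opened inside every bitonicMerge call. A version that stays inside the existing team and only uses tasks could look roughly like this, reusing compare() and the 1024 cutoff from the question:
void bitonicMerge(int lo, int cnt, int dir) {
    if (cnt > 1) {
        int k = cnt / 2;
        // Compare/exchange sweep done by the task that is already running,
        // instead of opening a new nested parallel region.
        for (int i = lo; i < lo + k; i++)
            compare(i, i + k, dir);
        // The two half-merges are independent, so they can be child tasks;
        // the if() clause avoids creating tasks for small subproblems.
        #pragma omp task if(cnt > 1024)
        bitonicMerge(lo, k, dir);
        #pragma omp task if(cnt > 1024)
        bitonicMerge(lo + k, k, dir);
        #pragma omp taskwait  // make sure the whole merge is done before returning
    }
}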

OpenMP, dependency graph

I took some of my old OpenMP exercises to practice a little bit, but I am having difficulty finding the solution for one in particular.
The goal is to write the simplest OpenMP code that corresponds to the dependency graph.
The graphs are visible here: http://imgur.com/a/8qkYb
The first one is simple.
It corresponds to the following code:
#pragma omp parallel
{
    #pragma omp simple
    {
        #pragma omp task
        {
            A1();
            A2();
        }
        #pragma omp task
        {
            B1();
            B2();
        }
        #pragma omp task
        {
            C1();
            C2();
        }
    }
}
The second one is still easy.
#pragma omp parallel
{
    #pragma omp simple
    {
        #pragma omp task
        {
            A1();
        }
        #pragma omp task
        {
            B1();
        }
        #pragma omp task
        {
            C1();
        }
        #pragma omp barrier
        A2();
        B2();
        C2();
    }
}
And now comes the last one, which is bugging me quite a bit because the number of dependencies is unequal across the function calls. I thought there was a way to explicitly state which task you should be waiting for, but I can't find what I'm looking for in the OpenMP documentation.
If anyone has an explanation for this question, I would be very grateful, because I've been thinking about it for more than a month now.
First of all, there is no #pragma omp simple in the OpenMP 4.5 specification.
I assume you meant #pragma omp single.
If so, #pragma omp barrier is a bad idea inside a single region, since only one thread executes the code and would wait for all the other threads, which do not execute the region.
Additionally, in the second one A2, B2 and C2 are not executed in parallel as tasks anymore.
To your actual question:
What you are looking for seems to be the depend clause for task constructs, see the OpenMP Specification, p. 169.
There is a pretty good explanation of the depend clause and how it works by Massimiliano for this question.
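As a minimal illustration of the depend clause (my own toy example, not from the linked answer): the second task cannot start before the first one has finished, because both name x in their depend clauses.
#include <stdio.h>
#include <omp.h>

int main(void) {
    int x = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)
        x = 42;                    // producer task writes x

        #pragma omp task depend(in: x)
        printf("x = %d\n", x);     // consumer task runs only after the producer
    }
    return 0;
}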
The last example is not that complex once you understand what is going on: each task at index n of iteration t depends on the task at index n of iteration t-1 AND on its neighbors (indices n-1 and n+1 of iteration t-1). This pattern is known as a Jacobi stencil. It is very common in partial differential equation solvers.
As Henkersmann said, the easiest option is to use the depend clause of the OpenMP task construct:
int val_a[N], val_b[N];

#pragma omp parallel
#pragma omp single
{
    int *a = val_a;
    int *b = val_b;
    for (int t = 0; t < T; ++t) {
        // Unroll the inner loop for the boundary cases
        #pragma omp task depend(in: a[0], a[1]) depend(out: b[0])
        stencil(b, a, 0);

        for (int i = 1; i < N-1; ++i) {
            #pragma omp task depend(in: a[i-1], a[i], a[i+1]) \
                             depend(out: b[i])
            stencil(b, a, i);
        }

        #pragma omp task depend(in: a[N-2], a[N-1]) depend(out: b[N-1])
        stencil(b, a, N-1);

        // Swap the pointers for the next iteration
        int *tmp = a;
        a = b;
        b = tmp;
    }
    #pragma omp taskwait
}
As you may see, OpenMP task dependences are point-to-point; that means you cannot express them in terms of array regions.
Another option, a bit cleaner for this specific case, is to enforce the dependences indirectly, using a barrier:
int a[N], b[N];

#pragma omp parallel
for (int t = 0; t < T; ++t) {
    #pragma omp for
    for (int i = 0; i < N; ++i) {
        stencil(b, a, i);
    }
    // (The swap of a and b between iterations is omitted here for brevity.)
}
This second case performs a synchronization barrier every time the inner loop finishes. The synchronization granularity is coarser, in the sense that you have only one synchronization point per outer-loop iteration. However, if the stencil function is long and unbalanced, it is probably worth using tasks.

Thread safe counter using OpenMP

What are solutions to the hazards in doing this?
#include <iostream>
#include <unistd.h>
#include <cstdlib>
#include <ctime>

int main() {
    int k = 0;
    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        if (i % 2) { /** Conditional based on i **/
            #pragma omp atomic
            k++;
            usleep(1000 * ((float)std::rand() / RAND_MAX));
            #pragma omp task
            std::cout << k << std::endl; /** Some sort of task **/
        }
    }
    return 0;
}
I need all ks to be unique. What would be a better way of doing this?
Edit
Notice how this question refers to an aggregate
In particular I want to spawn tasks based on a shared variable. I run the risk of having a race condition.
Consider: thread 2 completes, evaluates the conditional as true, and increments k before thread 1 has spawned all of its tasks.
Edit edit
I tried to force a race condition. It wasn't obvious without the sleep, but there are in fact problems. How can I overcome them?
Here's a quick solution:
...
#pragma omp atomic
k++;
int c = k;
...
but I'd like a guarantee.
Tangential. Why doesn't this implementation work?
...
int c;
#pragma omp crtical
{
    k++;
    c = k;
}
...
At the end of the function, the value printed by std::cout << k; is consistently less than the expected 50000.
I hate to answer my question so quickly, but I found a solution for this particular instance.
As of OpenMP 3.1 there is the "atomic capture" pragma.
Its use case is for problems just like this. The resulting code:
#include <iostream>
#include <unistd.h>
#include <cstdlib>
#include <ctime>

int main() {
    int k = 0;
    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        if (i % 2) { /** Conditional based on i **/
            int c;
            #pragma omp atomic capture
            {
                c = k;
                k++;
            }
            usleep(1000 * ((float)std::rand() / RAND_MAX));
            #pragma omp task
            std::cout << c << std::endl; /** Some sort of task **/
        }
    }
    std::cout << k << std::endl;
    std::cout.flush();
    return 0;
}
I will leave this question open in case someone would like to contribute ideas or code-architecture suggestions for avoiding these problems, or reasons why the #pragma omp crtical didn't work.

Parallelizing a series of independent sequential lines of code

What is the best way to execute multiple lines of code in parallel if they are not dependent on each other? (I'm using OpenMP.)
Pseudo code:
database->connect()
openfile("stuff.txt")
ping("stackoverflow.com")
x = 2;
y = a + b;
The only way I can come up with is:
#pragma omp parallel for
for (i = 0; i < 5; i++)
    switch (i) {
        case 0: database->connect(); break;
        ...
I haven't tried it, but I also remember that you're not supposed to use break while using OpenMP.
So I'm assuming that the individual things you listed as independent tasks were just examples. If they really are things like y = a + b, then as #chrisaycock and #ejd have said, they're too small for this sort of parallelism (e.g. thread-based, as opposed to ILP or something) to actually take advantage of the concurrency, due to overheads. But if they are bigger operations, the way to do task-based parallelism in OpenMP is with the task directive, e.g.:
#include <stdio.h>
#include <omp.h>
#include <unistd.h>

void work(int *v) {
    *v = omp_get_thread_num();
    sleep(1);
}

int main(int argc, char **argv)
{
    int a, b, c;
    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp task shared(a) default(none)
            work(&a);
            #pragma omp task shared(b) default(none)
            work(&b);
            #pragma omp task shared(c) default(none)
            work(&c);
        }
    }
    printf("a,b,c = %d,%d,%d\n", a, b, c);
    return 0;
}
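If the set of independent pieces of work is fixed at compile time, the sections construct is another option. A sketch of the same three calls using sections instead of single + task (reusing work(), a, b and c from the example above), replacing the parallel region there:
#pragma omp parallel sections
{
    #pragma omp section
    work(&a);
    #pragma omp section
    work(&b);
    #pragma omp section
    work(&c);
}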
