How to do OpenMP reduction (sum) inside parallel region? (Result is needed on master thread only).
Algorithm prototype:
#pragma omp parallel
{
t = omp_get_thread_num();
while iterate
{
float f = get_local_result(t);
// fsum is required on master only
float fsum = // ? - SUM of f
if (t == 0):
MPI_Bcast(&fsum, ...);
}
If I have the OpenMP region inside the while iterate loop, the parallel region overhead at each iteration kills the performance...
Here is the simplest way to do this:
float sharedFsum = 0.f;
float masterFsum;
#pragma omp parallel
{
const int t = omp_get_thread_num();
while(iteration_condition)
{
float f = get_local_result(t);
// Manual reduction
#pragma omp atomic update
sharedFsum += f;
// Ensure the reduction is completed
#pragma omp barrier
#pragma omp master
{
MPI_Bcast(&sharedFsum, ...);
// Reset the shared accumulator for the next iteration
sharedFsum = 0.f;
}
// Ensure no other threads update sharedFsum during the MPI_Bcast
#pragma omp barrier
}
}
The atomic operations can be costly if you have a lot of threads (e.g. hundreds). A better approach is to let the runtime perform the reduction for you.
Here is a better version:
float sharedFsum = 0;
#pragma omp parallel
{
const int threadCount = omp_get_num_threads();
float masterFsum;
while(iteration_condition)
{
// Execute get_local_result on each thread and
// perform the reduction into sharedFsum
#pragma omp for reduction(+:sharedFsum) schedule(static,1)
for(int i=0 ; i<threadCount ; ++i)
sharedFsum += get_local_result(i);
#pragma omp master
{
MPI_Bcast(&sharedFsum, ...);
// sharedFsum must be reinitialized for the next iteration
sharedFsum = 0.f;
}
// Ensure no other threads update sharedFsum during the MPI_Bcast
#pragma omp barrier
}
}
Side notes:
t is not protected in your code; use private(t) on the #pragma omp parallel directive to avoid undefined behavior due to a race condition. Alternatively, you can declare the variable inside the parallel region so it is scoped per thread (see the sketch after these notes).
#pragma omp master should be preferred to a conditional on the thread ID.
parallel region overhead at each iteration kills the performance...
Most of the time this is due to either (implicit) synchronizations/communications or a work imbalance.
The code above may have the same problem since it is quite synchronous.
If it makes sense in your application, you can make it a bit less synchronous (and thus possibly faster) by removing or moving barriers, depending on the relative speed of MPI_Bcast and get_local_result. However, this is far from easy to do correctly. One way to do it is to use OpenMP tasks and multi-buffering.
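For reference, here is a minimal sketch of the first two notes combined. It is only an illustration (the answer's code above already uses the scoped-variable form):

#include <omp.h>

int t;
#pragma omp parallel private(t)   // every thread gets its own copy of t
{
    t = omp_get_thread_num();
    // ...or simply declare it inside the region instead:
    // const int t = omp_get_thread_num();

    #pragma omp master
    {
        // executed by thread 0 only; note that master has no implied barrier,
        // so add an explicit one if the other threads must wait for this block
    }
}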
Related
I am trying to make a fast parallel loop. In each iteration of the loop, I build an array which is costly so I want it distributed over many threads. After the array is built, I use it to update a matrix. Here it gets tricky because the matrix is common to all threads so only 1 thread can modify parts of the matrix at one time, but when I work on the matrix, it turns out I can distribute that work too since I can work on different parts of the matrix at the same time.
Here is what I currently am doing:
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
... build array bi ...
#pragma omp critical
{
update_matrix(A, bi)
}
}
...
subroutine update_matrix(A, b)
{
printf("id0 = %d\n", omp_get_thread_num());
#pragma omp parallel sections
{
#pragma omp section
{
printf("id1 = %d\n", omp_get_thread_num());
modify columns 1 to j of A using b
}
#pragma omp section
{
printf("id2 = %d\n", omp_get_thread_num());
modify columns j+1 to k of A using b
}
}
}
The problem is that the two different sections of the update_matrix() routine are not being parallelized. The output I get looks like this:
id0 = 19
id1 = 0
id2 = 0
id0 = 5
id1 = 0
id2 = 0
...
So the two sections are being executed by the same thread (0). I tried removing the #pragma omp critical in the main loop but it gives the same result. Does anyone know what I'm doing wrong?
#pragma omp parallel sections does not work there because you are already inside a parallel region created by the #pragma omp parallel for directive. Unless you have enabled nested parallelism with omp_set_nested(1);, the inner parallel sections region is executed serially by the thread that encounters it, which is why both sections report thread number 0 of the (one-thread) nested team.
Please note that even with nesting enabled it is not necessarily efficient: spawning new threads has an overhead cost which may not be worth it if the update_matrix part is not CPU-intensive enough.
You have several options:
Forget about it. If the non-critical part of the loop is where most of the computation happens and you already have as many threads as CPUs, spawning extra threads for a few simple operations will do no good. Just remove the parallel sections directive from the subroutine.
Try enabling nesting with omp_set_nested(1); (a sketch follows).
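A minimal sketch of that option, keeping the rest of the loop exactly as in the question (newer OpenMP versions express the same thing with omp_set_max_active_levels or the OMP_NESTED / OMP_MAX_ACTIVE_LEVELS environment variables):

#include <omp.h>

omp_set_nested(1);              // allow nested parallel regions
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
    // ... build array bi ...
    #pragma omp critical
    update_matrix(A, bi);       // its inner parallel sections now get their own thread team
}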
Another option, which comes at the cost of a double synchronization overhead, is to use named critical sections. There may be only one thread in critical section ONE_TO_J and one in critical section J_TO_K, so up to two threads may update the matrix in parallel. This is costly in terms of synchronization overhead.
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
... build array bi ...
update_matrix(A, bi); // not critical
}
...
subroutine update_matrix(A, b)
{
printf("id0 = %d\n", omp_get_thread_num());
#pragma omp critical(ONE_TO_J)
{
printf("id1 = %d\n", omp_get_thread_num());
modify columns 1 to j of A using b
}
#pragma omp critical(J_TO_K)
{
printf("id2 = %d\n", omp_get_thread_num());
modify columns j+1 to k of A using b
}
}
Or use atomic operations to edit the matrix, if this is suitable.
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
... build array bi ...
update_matrix(A, bi); // not critical
}
...
subroutine update_matrix(A, b)
{
float tmp;
printf("id0 = %d\n", omp_get_thread_num());
for (int row=0; row<max_row;row++)
for (int column=0;column<k;column++){
tmp = some_function(b, row, column);
#pragma omp atomic
A[column][row]+=tmp;
}
}
By the way, data is stored in row-major order in C, so you should be updating the matrix row by row rather than column by column. This will reduce false sharing and improve the memory-access performance of the algorithm.
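Concretely, keeping the A[column][row] indexing of the snippet above (so the last index is the one that addresses contiguous memory), a sketch of one cache-friendlier variant simply makes the innermost loop vary that index:

for (int column = 0; column < k; column++)
    for (int row = 0; row < max_row; row++) {
        float tmp = some_function(b, row, column);
        #pragma omp atomic
        A[column][row] += tmp;   // consecutive inner iterations touch adjacent memory
    }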
I took some of my old OpenMP exercises to practice a little bit, but I have difficulties finding the solution for one in particular.
The goal is to write the simplest OpenMP code that corresponds to a given dependency graph.
The graphs are visible here: http://imgur.com/a/8qkYb
The first one is simple.
It corresponds to the following code:
#pragma omp parallel
{
#pragma omp simple
{
#pragma omp task
{
A1();
A2();
}
#pragma omp task
{
B1();
B2();
}
#pragma omp task
{
C1();
C2();
}
}
}
The second one is still easy.
#pragma omp parallel
{
#pragma omp simple
{
#pragma omp task
{
A1();
}
#pragma omp task
{
B1();
}
#pragma omp task
{
C1();
}
#pragma omp barrier
A2();
B2();
C2();
}
}
And now comes the last one…
which is bugging me quite a bit because the number of dependencies is unequal across the function calls. I thought there was a way to explicitly state which task you should be waiting for, but I can't find what I'm looking for in the OpenMP documentation.
If anyone has an explanation for this question, I will be very grateful, because I've been thinking about it for more than a month now.
First of all, there is no #pragma omp simple in the OpenMP 4.5 specification.
I assume you meant #pragma omp single.
If so, #pragma omp barrier is a bad idea inside a single region, since only one thread executes that code and would wait at the barrier for all the other threads, which never enter the region.
Additionally, in the second one A2, B2 and C2 are not executed in parallel as tasks anymore.
To your actual question:
What you are looking for seems to be the depend clause for task constructs, see the OpenMP Specification, p. 169.
There is a pretty good explanation of the depend clause and how it works by Massimiliano for this question.
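For illustration, here is a minimal producer/consumer sketch of the depend clause; produce and consume are hypothetical stand-ins for your own functions:

int x = 0;
#pragma omp parallel
#pragma omp single
{
    #pragma omp task shared(x) depend(out: x)   // producer task
    x = produce();

    #pragma omp task shared(x) depend(in: x)    // consumer task, scheduled only after the producer finishes
    consume(x);
}   // the implicit barrier at the end waits for both tasks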
The last example is not that complex once you understand what is going on there: the task for element n at iteration t depends on the task for the same element at iteration t-1 AND on the tasks for its neighbors, elements n-1 and n+1, at iteration t-1. This pattern is known as a Jacobi stencil. It is very common in partial differential equation solvers.
As Henkersmann said, the easiest option is to use the depend clause of the OpenMP task construct:
int val_a[N], val_b[N];
#pragma omp parallel
#pragma omp single
{
int *a = val_a;
int *b = val_b;
for( int t = 0; t < T; ++t ) {
// Unroll the inner loop for the boundary cases
#pragma omp task depend(in:a[0], a[1]) depend(out:b[0])
stencil(b, a, 0);
for( int i = 1; i < N-1; ++i ) {
#pragma omp task depend(in:a[i-1],a[i],a[i+1]) \
depend(out:b[i])
stencil(b, a, i);
}
#pragma omp task depend(in:a[N-2],a[N-1]) depend(out:b[N-1])
stencil(b, a, N-1);
// Swap the pointers for the next iteration
int *tmp = a;
a = b;
b = tmp;
}
#pragma omp taskwait
}
As you may see, OpenMP task dependences are point-to-point, which means you cannot express them in terms of array regions.
Another option, a bit cleaner for this specific case, is to enforce the dependences indirectly, using a barrier:
int val_a[N], val_b[N];
#pragma omp parallel
{
int *a = val_a;
int *b = val_b;
for( int t = 0; t < T; ++t ) {
#pragma omp for
for( int i = 0; i < N; ++i )
stencil(b, a, i);
// The implicit barrier of the for construct separates the time steps;
// swap the (per-thread) pointers for the next iteration
int *tmp = a; a = b; b = tmp;
}
}
This second version performs a synchronization barrier every time the inner loop finishes. The synchronization granularity is coarser, in the sense that you have only one synchronization point per outer loop iteration. However, if the stencil function is long and unbalanced, it is probably worth using tasks.
I want to parallelize this kind of loop. Note that each "calc_block" uses data obtained in the previous iteration.
for (i=0 ; i<MAX_ITER; i++){
norma1 = calc_block1();
norma2 = calc_block2();
norma3 = calc_block3();
norma4 = calc_block4();
norma = norma1+norma2+norma3+norma4;
...some calc...
if(norma<eps)break;
}
I tried this, but the speedup is quite small, ~1.2:
for (i=0 ; i<MAX_ITER; i++){
#pragma omp parallel sections
{
#pragma omp section
norma1 = calc_block1();
#pragma omp section
norma2 = calc_block2();
#pragma omp section
norma3 = calc_block3();
#pragma omp section
norma4 = calc_block4();
}
norma = norma1+norma2+norma3+norma4;
...some calc...
if(norma<eps)break;
}
I think this happens because of the overhead of using sections inside the loop, but I don't know how to fix it...
Thanks in advance!
You could reduce the overhead by moving the entire loop inside the parallel region. Thus the threads in the pool used to implement the team would only get "woken up" once. It is a bit tricky and involves careful consideration of variable sharing classes:
#pragma omp parallel private(i,...) num_threads(4)
{
for (i = 0; i < MAX_ITER; i++)
{
#pragma omp sections
{
#pragma omp section
norma1 = calc_block1();
#pragma omp section
norma2 = calc_block2();
#pragma omp section
norma3 = calc_block3();
#pragma omp section
norma4 = calc_block4();
}
#pragma omp single
{
norma = norma1 + norma2 + norma3 + norma4;
// ... some calc ..
}
if (norma < eps) break;
}
}
Both the sections and the single constructs have implicit barriers at their ends, hence the threads will synchronise before going into the next loop iteration. The single construct reproduces the previously serial part of your program. The ... part in the private clause should list as many variables as possible that are only relevant to ... some calc .... The idea is to run the serial part with thread-local variables, since access to shared variables is slower with most OpenMP implementations.
Note that often the speed-up might not be linear for a completely different reason. For example, calc_blockX() (with X being 1, 2, 3 or 4) might have too low a compute intensity and therefore require very high memory bandwidth. If the memory subsystem is not able to feed all 4 threads at the same time, the speed-up will be less than 4. An example of such a case - this question.
I have this piece of code that is parallelized:
int i,n; double pi,x;
double area=0.0;
#pragma omp parallel for private(x) reduction (+:area)
for(i=0; i<n; i++){
x= (i+0.5)/n;
area+= 4.0/(1.0+x*x);
}
pi = area/n;
It is said that the reduction will remove the race condition that could happen if we didn't use a reduction. Still, I'm wondering: do we need to add lastprivate for area, since it's used outside the parallel loop and would otherwise not be visible outside of it? Or does the reduction cover this as well?
Reduction takes care of making a private copy of area for each thread. Once the parallel region ends, the private copies are combined into the shared area in a thread-safe way. In other words, the area that is exposed afterwards is an aggregate of the private areas of all threads.
thread 1 - private area = compute(x)
thread 2 - private area = compute(y)
thread 3 - private area = compute(z)
reduction step - public area = area<thread1> + area<thread2> + area<thread3> ...
You do not need lastprivate. To help you understand how reductions are done I think it's useful to see how this can be done with atomic. The following code
float sum = 0.0f;
#pragma omp parallel for reduction (+:sum)
for(int i=0; i<N; i++) {
sum += //
}
is equivalent to
float sum = 0.0f;
#pragma omp parallel
{
float sum_private = 0.0f;
#pragma omp for nowait
for(int i=0; i<N; i++) {
sum_private += //
}
#pragma omp atomic
sum += sum_private;
}
Although this alternative has more code, it is helpful for showing how to use more complicated operators. One limitation of atomic is that it only supports a few basic operators. If you want to use a more complicated operator (such as an SSE/AVX addition) then you can replace atomic with critical; see the related question on reduction with OpenMP with SSE/AVX below.
I'm wondering if SSE/AVX operations such as addition and multiplication can be an atomic operation? The reason I ask this is that in OpenMP the atomic construct only works on a limited set of operators. It does not work on for example SSE/AVX additions.
Let's assume I had a datatype float4 that corresponds to a SSE register and that the addition operator is defined for float4 to do an SSE addition. In OpenMP I could do a reduction over an array with the following code:
float4 sum4 = 0.0f; //sets all four values to zero
#pragma omp parallel
{
float4 sum_private = 0.0f;
#pragma omp for nowait
for(int i=0; i<N; i+=4) {
float4 val = float4().load(&array[i]); //load four floats into a SSE register
sum_private += val; //sum_private = _mm_addps(val,sum_private)
}
#pragma omp critical
sum4 += sum_private;
}
float sum = horizontal_sum(sum4); //sum4[0] + sum4[1] + sum4[2] + sum4[3]
But atomic is faster than critical in general, and my instinct tells me SSE/AVX operations should be atomic (even if OpenMP does not support it). Is this a limitation of OpenMP? Could I use, for example, Intel Threading Building Blocks or pthreads to do this as an atomic operation?
Edit: Based on Jim Cownie's comment, I created a new function which is the best solution. I verified that it gives the correct result:
float sum = 0.0f;
#pragma omp parallel reduction(+:sum)
{
Vec4f sum4 = 0.0f;
#pragma omp for nowait
for(int i=0; i<N; i+=4) {
Vec4f val = Vec4f().load(&A[i]); //load four floats into a SSE register
sum4 += val; //sum4 = _mm_addps(val,sum4)
}
sum += horizontal_add(sum4);
}
Edit: based on comments by Jim Cownie and by Mystical in the thread OpenMP atomic _mm_add_pd, I realize now that the reduction implementation in OpenMP does not necessarily use atomic operators, and it's best to rely on OpenMP's reduction implementation rather than trying to do it with atomic.
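As a side note, OpenMP 4.0 and later also let you declare a user-defined reduction for a vector type, so the runtime handles the combination itself. A minimal sketch, assuming the Vec4f wrapper and horizontal_add from the vector class library used in the snippet above:

// user-defined reduction for the Vec4f SIMD wrapper (requires OpenMP 4.0+)
#pragma omp declare reduction(vsum : Vec4f : omp_out += omp_in) \
    initializer(omp_priv = Vec4f(0.0f))

Vec4f sum4(0.0f);
#pragma omp parallel for reduction(vsum : sum4)
for (int i = 0; i < N; i += 4) {
    sum4 += Vec4f().load(&A[i]);   // per-thread copies of sum4 are combined at the end
}
float sum = horizontal_add(sum4);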
SSE & AVX in general are not atomic operations (but multiword CAS would sure be sweet).
You can use the combinable class template in TBB or PPL for more general-purpose reductions and thread-local initializations; think of it as a synchronized hash table indexed by thread id. It works just fine with OpenMP and doesn't spin up any extra threads on its own.
You can find examples on the tbb site and on msdn.
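A minimal sketch of what that looks like with tbb::combinable, reusing the Vec4f type, the array A and horizontal_add from the snippets above; the OpenMP loop provides the parallelism, TBB only provides the thread-local storage and the final combine:

#include <tbb/combinable.h>

tbb::combinable<Vec4f> partial([]{ return Vec4f(0.0f); });  // one zero-initialized Vec4f per thread

#pragma omp parallel for
for (int i = 0; i < N; i += 4) {
    partial.local() += Vec4f().load(&A[i]);   // each thread updates its own copy
}

// combine the per-thread partial sums on the calling thread
Vec4f sum4 = partial.combine([](Vec4f a, Vec4f b) { return a + b; });
float sum = horizontal_add(sum4);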
Regarding the comment, consider this code:
x = x + 5
You should really think of it as the following, particularly when multiple threads are involved:
while( true ){
oldValue = x
desiredValue = oldValue + 5
// the comparison and the conditional store together are the atomic compare-and-swap
if( x == oldValue ){
x = desiredValue
break // success, the update took effect
}
// otherwise another thread changed x in the meantime, so retry
}
make sense?
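For completeness, here is a minimal sketch of the same retry loop written with C++ std::atomic. Note this only works for a single float; there is no portable equivalent for a whole SSE/AVX register, which is exactly the point above:

#include <atomic>

void add5(std::atomic<float>& x) {
    float oldValue = x.load();
    // retry until no other thread modified x between the load and the CAS;
    // on failure, compare_exchange_weak reloads oldValue with the current value of x
    while (!x.compare_exchange_weak(oldValue, oldValue + 5.0f)) {
        // empty: just try again with the refreshed oldValue
    }
}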