I use OpenMP as:
#pragma omp parallel for reduction(+:average_stroke_width)
for(int i = 0; i < GB_name.size(); ++i) {...}
I know I can use :
#pragma omp parallel for num_threads(thread)
for(int index = 0; index < GB_name.size(); ++index){...}
How can I control the thread number when I use reduction?
Both clauses can be used together:
#pragma omp parallel for reduction(+:average_stroke_width) num_threads(thread)
for(int i = 0; i < GB_name.size(); ++i) {...}
Note that reduction involves all threads, so you cannot have a parallel loop with 8 threads and then perform a reduction with 4 threads only. Reduction combines the local values in all threads and therefore all of them need to participate.
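As a minimal, self-contained sketch of the combined directive: the function name sum_with_threads and its parameters are illustrative and not from the question. Note that num_threads is a request, so the runtime may provide fewer threads than asked for; the reduction still spans whatever team is actually created.

```c
/* Sum an array with an OpenMP reduction on a caller-chosen thread count.
   sum_with_threads, values, count and thread_count are illustrative names. */
double sum_with_threads(const double *values, int count, int thread_count)
{
    double total = 0.0;

    /* reduction and num_threads are independent clauses on one directive:
       each thread of the team gets a private copy of total, and all
       private copies are combined into the shared total after the loop. */
    #pragma omp parallel for reduction(+:total) num_threads(thread_count)
    for (int i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}
```

Calling sum_with_threads(v, n, 4) and sum_with_threads(v, n, 1) should give the same sum up to floating-point rounding, since only the size of the team changes.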
Related
How to do OpenMP reduction (sum) inside parallel region? (Result is needed on master thread only).
Algorithm prototype:
#pragma omp parallel
{
t = omp_get_thread_num();
while iterate
{
float f = get_local_result(t);
// fsum is required on master only
float fsum = // ? - SUM of f
if (t == 0):
MPI_Bcast(&fsum, ...);
}
If I put the OpenMP parallel region inside the while iterate loop, the parallel region overhead at each iteration kills the performance...
Here is the simplest way to do this:
float sharedFsum = 0.f;
float masterFsum;
#pragma omp parallel
{
const int t = omp_get_thread_num();
while(iteration_condition)
{
float f = get_local_result(t);
// Manual reduction
#pragma omp atomic update
sharedFsum += f;
// Ensure the reduction is completed
#pragma omp barrier
#pragma omp master
{
MPI_Bcast(&sharedFsum, ...);
// sharedFsum must be reset before the next iteration accumulates into it
sharedFsum = 0.f;
}
// Ensure no other threads update sharedFsum during the MPI_Bcast
#pragma omp barrier
}
}
The atomic operations can be costly if you have a lot of threads (e.g. hundreds). A better approach is to let the runtime perform the reduction for you.
Here is a better version:
float sharedFsum = 0;
#pragma omp parallel
{
const int threadCount = omp_get_num_threads();
float masterFsum;
while(iteration_condition)
{
// Execute get_local_result on each thread and
// perform the reduction into sharedFsum
#pragma omp for reduction(+:sharedFsum) schedule(static,1)
for(int i=0 ; i<threadCount ; ++i)
sharedFsum += get_local_result(i);
#pragma omp master
{
MPI_Bcast(&sharedFsum, ...);
// sharedFsum must be reinitialized for the next iteration
sharedFsum = 0.f;
}
// Ensure no other threads update sharedFsum during the MPI_Bcast
#pragma omp barrier
}
}
Side notes:
t is not protected in your code: use private(t) on the #pragma omp parallel directive to avoid undefined behavior due to a race condition. Alternatively, you can use scoped variables, i.e. declare t inside the parallel region.
#pragma omp master should be preferred to a conditional on the thread ID.
parallel region overhead at each iteration kills the performance...
Most of the time this is due to either (implicit) synchronizations/communications or a work imbalance.
The code above may have the same problem since it is quite synchronous.
If it makes sense in your application, you can make it a bit less synchronous (and thus possibly faster) by removing or moving barriers, depending on the relative speed of MPI_Bcast and get_local_result. However, this is far from easy to do correctly. One way to do it is to use OpenMP tasks and multi-buffering.
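The runtime-reduction approach can be tried without MPI using the following sketch. It is not the answer's exact code: get_local_result is a hypothetical stand-in that returns a fixed value per slot, run_rounds and its parameters are made-up names, and the place where MPI_Bcast would go is only marked by a comment.

```c
/* Hypothetical stand-in for get_local_result(): one partial result per slot. */
static float get_local_result(int slot)
{
    return (float)(slot + 1);
}

/* Runs `iterations` rounds inside ONE parallel region and returns the sum
   the master thread saw in the last round. All names are illustrative. */
float run_rounds(int iterations, int slots)
{
    float shared_sum = 0.0f;  /* reduction target, shared by the team     */
    float last_sum = 0.0f;    /* value the master would broadcast per round */

    #pragma omp parallel
    {
        for (int iter = 0; iter < iterations; ++iter) {
            /* The implicit barrier at the end of this worksharing loop
               guarantees the reduction is complete before the master reads. */
            #pragma omp for reduction(+:shared_sum)
            for (int i = 0; i < slots; ++i) {
                shared_sum += get_local_result(i);
            }

            #pragma omp master
            {
                last_sum = shared_sum;  /* MPI_Bcast(&shared_sum, ...) would go here */
                shared_sum = 0.0f;      /* reset for the next round */
            }
            /* Keep other threads from starting the next reduction early. */
            #pragma omp barrier
        }
    }
    return last_sum;
}
```

The team is created once, so the per-iteration cost is two barriers (one implicit, one explicit) rather than a full fork/join.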
I am trying to make a fast parallel loop. In each iteration of the loop, I build an array which is costly so I want it distributed over many threads. After the array is built, I use it to update a matrix. Here it gets tricky because the matrix is common to all threads so only 1 thread can modify parts of the matrix at one time, but when I work on the matrix, it turns out I can distribute that work too since I can work on different parts of the matrix at the same time.
Here is what I currently am doing:
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
... build array bi ...
#pragma omp critical
{
update_matrix(A, bi)
}
}
...
subroutine update_matrix(A, b)
{
printf("id0 = %d\n", omp_get_thread_num());
#pragma omp parallel sections
{
#pragma omp section
{
printf("id1 = %d\n", omp_get_thread_num());
modify columns 1 to j of A using b
}
#pragma omp section
{
printf("id2 = %d\n", omp_get_thread_num());
modify columns j+1 to k of A using b
}
}
}
The problem is that the two different sections of the update_matrix() routine are not being parallelized. The output I get looks like this:
id0 = 19
id1 = 0
id2 = 0
id0 = 5
id1 = 0
id2 = 0
...
So the two sections are being executed by the same thread (0). I tried removing the #pragma omp critical in the main loop but it gives the same result. Does anyone know what I'm doing wrong?
#pragma omp parallel sections does not run in parallel there because you are already inside a parallel region created by the #pragma omp parallel for directive. Unless you have enabled nested parallelism with omp_set_nested(1);, the nested parallel sections region runs with a team of just one thread.
Please note that nesting is not necessarily efficient: spawning new threads has an overhead cost which may not be worth paying if the update_matrix part is not very CPU intensive.
You have several options:
Forget about that. If the non-critical part of the loop is really what takes most of the computation and you already have as many threads as CPUs, spawning extra threads for a simple operation will do no good. Just remove the parallel sections directive in the subroutine.
Try enabling nesting with omp_set_nested(1);
Another option, which comes at the cost of a double synchronization overhead, would be to use named critical sections. There may be only one thread in critical section ONE_TO_J and one in critical section J_TO_K, so up to two threads may update the matrix in parallel. This is costly in terms of synchronization overhead.
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
... build array bi ...
update_matrix(A, bi); // not critical
}
...
subroutine update_matrix(A, b)
{
printf("id0 = %d\n", omp_get_thread_num());
#pragma omp critical(ONE_TO_J)
{
printf("id1 = %d\n", omp_get_thread_num());
modify columns 1 to j of A using b
}
#pragma omp critical(J_TO_K)
{
printf("id2 = %d\n", omp_get_thread_num());
modify columns j+1 to k of A using b
}
}
Or use atomic operations to edit the matrix, if this is suitable.
#pragma omp parallel for
for (i = 0; i < n; ++i)
{
... build array bi ...
update_matrix(A, bi); // not critical
}
...
subroutine update_matrix(A, b)
{
float tmp;
printf("id0 = %d\n", omp_get_thread_num());
for (int row=0; row<max_row;row++)
for (int column=0;column<k;column++){
tmp = some_function(b, row, column);
#pragma omp atomic
A[column][row]+=tmp;
}
}
By the way, data is stored in row major order in C, so you should be updating the matrix row by row rather than column by column. This will prevent false-sharing and will improve the algorithm memory-access performance.
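The atomic variant with a row-major traversal might look like the sketch below. The fixed ROWS/COLS sizes, the accumulate function and the stand-in contribution (i + 1) are assumptions made for illustration; they replace the elided "build array bi" and some_function parts of the original.

```c
#define ROWS 4
#define COLS 3

/* Every loop iteration i adds a contribution to each element of the shared
   matrix A. accumulate, ROWS and COLS are illustrative names. */
void accumulate(float A[ROWS][COLS], int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        /* Row index outermost, column index innermost: consecutive accesses
           touch adjacent addresses in C's row-major layout. */
        for (int row = 0; row < ROWS; ++row) {
            for (int col = 0; col < COLS; ++col) {
                float tmp = (float)(i + 1);  /* stand-in for some_function(b, row, col) */
                /* Protects only this single update, not the whole matrix. */
                #pragma omp atomic
                A[row][col] += tmp;
            }
        }
    }
}
```

Whether atomics beat a critical section here depends on how much contention there is on individual elements, so it is worth measuring both.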
I want to parallelize this kind of loop. Note that each "calc_block" uses data obtained in the previous iteration.
for (i=0 ; i<MAX_ITER; i++){
norma1 = calc_block1();
norma2 = calc_block2();
norma3 = calc_block3();
norma4 = calc_block4();
norma = norma1+norma2+norma3+norma4;
...some calc...
if(norma<eps)break;
}
I tried this, but the speedup is quite small, about 1.2:
for (i=0 ; i<MAX_ITER; i++){
#pragma omp parallel sections
{
#pragma omp section
norma1 = calc_block1();
#pragma omp section
norma2 = calc_block2();
#pragma omp section
norma3 = calc_block3();
#pragma omp section
norma4 = calc_block4();
}
norma = norma1+norma2+norma3+norma4;
...some calc...
if(norma<eps)break;
}
I think it happens because of the overhead of using sections inside the loop, but I don't know how to fix it...
Thanks in advance!
You could reduce the overhead by moving the entire loop inside the parallel region. That way the threads in the pool used to implement the team are only woken up once. It is a bit tricky and involves careful consideration of variable sharing classes:
#pragma omp parallel private(i,...) num_threads(4)
{
for (i = 0; i < MAX_ITER; i++)
{
#pragma omp sections
{
#pragma omp section
norma1 = calc_block1();
#pragma omp section
norma2 = calc_block2();
#pragma omp section
norma3 = calc_block3();
#pragma omp section
norma4 = calc_block4();
}
#pragma omp single
{
norma = norma1 + norma2 + norma3 + norma4;
// ... some calc ..
}
if (norma < eps) break;
}
}
Both sections and single constructs have implicit barriers at their ends, hence the threads synchronise before going into the next loop iteration. The single construct reproduces the previously serial part of your program. The ... part in the private clause should list as many as possible of the variables that are only relevant to ... some calc .... The idea is to run the serial part with thread-local variables, since access to shared variables is slower with most OpenMP implementations.
Note that the speed-up is often not linear for a completely different reason. For example, calc_blockX() (with X being 1, 2, 3 or 4) might have too low a compute intensity and therefore require very high memory bandwidth. If the memory subsystem is not able to feed all 4 threads at the same time, the speed-up will be less than 4. An example of such a case: this question.
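A compilable version of the loop-inside-the-region pattern could look like this. The calc_block function is a hypothetical stand-in for calc_block1() to calc_block4() (it simply returns id / (iter + 1) so the norm shrinks over time), and iterate_until_small is a made-up wrapper that reports how many iterations ran.

```c
/* Hypothetical stand-in for calc_block1..4: shrinks as iter grows. */
static double calc_block(int id, int iter)
{
    return (double)id / (double)(iter + 1);
}

/* Returns the number of iterations executed before norma < eps
   (or max_iter if that never happens). Names are illustrative. */
int iterate_until_small(double eps, int max_iter)
{
    double norma1 = 0.0, norma2 = 0.0, norma3 = 0.0, norma4 = 0.0;
    double norma = 0.0;       /* shared: written in single, read by all */
    int iterations_done = 0;

    #pragma omp parallel num_threads(4)
    {
        /* i is declared inside the region, so each thread has its own copy. */
        for (int i = 0; i < max_iter; ++i) {
            #pragma omp sections
            {
                #pragma omp section
                norma1 = calc_block(1, i);
                #pragma omp section
                norma2 = calc_block(2, i);
                #pragma omp section
                norma3 = calc_block(3, i);
                #pragma omp section
                norma4 = calc_block(4, i);
            }   /* implicit barrier: all four results are ready */

            #pragma omp single
            {
                norma = norma1 + norma2 + norma3 + norma4;
                iterations_done = i + 1;
            }   /* implicit barrier: every thread sees the same norma */

            /* All threads take the same branch, so they leave together. */
            if (norma < eps) break;
        }
    }
    return iterations_done;
}
```

The break is safe only because every thread reads the same norma after the barrier of the single construct; if threads could disagree, some would skip a worksharing construct that others enter.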
I have this piece of code that is parallelized.
int i,n; double pi,x;
double area=0.0;
#pragma omp parallel for private(x) reduction (+:area)
for(i=0; i<n; i++){
x= (i+0.5)/n;
area+= 4.0/(1.0+x*x);
}
pi = area/n;
It is said that the reduction will remove the race condition that could happen if we didn't use a reduction. Still, I'm wondering whether we need to add lastprivate for area, since it is used outside the parallel loop and will not be visible outside of it. Or does the reduction cover this as well?
Reduction takes care of making a private copy of area for each thread. Once the parallel region ends area is reduced in one atomic operation. In other words the area that is exposed is an aggregate of all private areas of each thread.
thread 1 - private area = compute(x)
thread 2 - private area = compute(y)
thread 3 - private area = compute(z)
reduction step - public area = area<thread1> + area<thread2> + area<thread3> ...
You do not need lastprivate. To help you understand how reductions are done I think it's useful to see how this can be done with atomic. The following code
float sum = 0.0f;
#pragma omp parallel for reduction (+:sum)
for(int i=0; i<N; i++) {
sum += //
}
is equivalent to
float sum = 0.0f;
#pragma omp parallel
{
float sum_private = 0.0f;
#pragma omp for nowait
for(int i=0; i<N; i++) {
sum_private += //
}
#pragma omp atomic
sum += sum_private;
}
Although this alternative has more code, it is helpful for showing how to use more complicated operators. One limitation when using reduction is that atomic only supports a few basic operators. If you want to use a more complicated operator (such as an SSE/AVX addition), you can replace atomic with critical; see "reduction with OpenMP with SSE/AVX".
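To illustrate the critical variant without depending on SSE/AVX intrinsics, here is a sketch that reduces a small 4-lane struct, which a single atomic update cannot express. The vec4 type and sum_lanes function are hypothetical, and count is assumed to be a multiple of 4.

```c
/* Illustrative 4-lane value standing in for an SSE register. */
typedef struct { float lane[4]; } vec4;

/* Lane-wise sum of data[0..count-1], interpreted as groups of 4 floats.
   Assumes count is a multiple of 4. Names are illustrative. */
vec4 sum_lanes(const float *data, int count)
{
    vec4 total = {{0.0f, 0.0f, 0.0f, 0.0f}};

    #pragma omp parallel
    {
        /* Each thread accumulates into its own private copy first. */
        vec4 local = {{0.0f, 0.0f, 0.0f, 0.0f}};

        #pragma omp for nowait
        for (int i = 0; i < count / 4; ++i) {
            for (int k = 0; k < 4; ++k) {
                local.lane[k] += data[4 * i + k];
            }
        }

        /* critical instead of atomic: the combine step is a multi-element
           update, so it needs mutual exclusion around the whole block. */
        #pragma omp critical
        {
            for (int k = 0; k < 4; ++k) {
                total.lane[k] += local.lane[k];
            }
        }
    }
    return total;
}
```

Each thread enters the critical section only once, so its cost is small compared to the loop as long as the loop does a reasonable amount of work.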
What is the difference between parallelizing two for loops together and parallelizing them separately?
Example
1. not parallelizing together
#pragma omp parallel for
for(i = 0; i < 100; i++) {
//.... some code
}
#pragma omp parallel for
for(i = 0; i < 1000; i++) {
//.... some code
}
2. parallelizing together
#pragma omp parallel
{
#pragma omp for
for(i = 0; i < 100; i++) {
//.... some code
}
#pragma omp for
for(i = 0; i < 1000; i++) {
//.... some code
}
}
Which code is better, and why?
One might expect a small win in the second, because one is fork/joining (or the functional equivalent) the OMP threads twice, rather than once. Whether it makes any actual difference for your code is an empirical question best answered by measurement.
The second can also have a more significant advantage if the work in the two loops is independent, so that you can start the second at any time, and there's reason to expect some load imbalance in the first loop. In that case, you can add a nowait clause to the first omp for and, rather than all threads waiting until the for loop ends, whoever's done first can immediately go on to start working on the second loop. Or, one could put the two chunks of code each in a section, or task. In general, you have a lot of control over what threads do and how they do it within a parallel region; whereas once you end the parallel region, you lose that flexibility: everything has to join together and you're done.
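A minimal sketch of the nowait variant, assuming the two loops really are independent (they write to different arrays and neither reads the other's output). The function fill_both and the loop bodies are placeholders for the "some code" parts of the question.

```c
/* Two independent loops sharing one parallel region.
   fill_both, a, b, na and nb are illustrative names. */
void fill_both(int *a, int na, int *b, int nb)
{
    #pragma omp parallel
    {
        /* nowait removes the barrier at the end of this loop, so a thread
           that finishes its share early moves straight on to the second loop. */
        #pragma omp for nowait
        for (int i = 0; i < na; ++i) {
            a[i] = 2 * i;
        }

        /* Safe only because this loop does not depend on a[]. */
        #pragma omp for
        for (int i = 0; i < nb; ++i) {
            b[i] = i + 1;
        }
    }   /* one join for both loops */
}
```

If the second loop did read a[], the nowait would have to go, since a thread could otherwise read elements another thread has not written yet.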