Is it better to use the collapse clause - c++11

I am never sure which possibility I should choose to parallelize nested for loops.
For example I have the following code snippet:
#pragma omp parallel for schedule(static)
for(int b=0; b<bSize; b++)
    for(int n=0; n<N; n++) o[n + b*N] = in[n];

#pragma omp parallel for collapse(2) schedule(static)
for(int b=0; b<bSize; b++)
    for(int n=0; n<N; n++) o[n + b*N] = in[n];
In the first snippet I use a plain parallel for (with schedule(static) because of the first-touch policy). In some codes I have seen people mostly use the collapse clause to parallelize nested for loops, while in other codes it is never used and the nested loops are parallelized with a simple parallel for. Is this more a habit, or is there a difference between the two versions? Is there a reason some people never use collapse(n)?

As with everything in HPC, the answer is "It depends..."
Here it will depend on:
- how big your machine is and how big bSize and N are
- what the content of the inner loop is
For static scheduling of iterations which all run in the same amount of time, unless you can guarantee that number of iterations being work-shared divides by the number of threads, you need to ensure that the number of available iterations is ~10x the number of threads to guarantee 90% efficiency because of potential imbalance. Therefore if you have a 16 core machine you want >160 iterations. If "bSize" is small, then using collapse to generate more available parallelism will help performance. (In the worst case, imagine that "bSize" is smaller than the number of threads!)
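To put rough numbers on that imbalance argument, here is a small sketch (the helper name is my own, not from the answer) that computes the load-balance efficiency of a static schedule with equal-cost iterations:

```cpp
// Hypothetical helper: with a static schedule and equal-cost iterations,
// efficiency is limited by the thread that receives the most iterations.
// The largest chunk is the ceiling of iterations/threads; the mean work
// per thread is iterations/threads exactly.
double staticScheduleEfficiency(int iterations, int threads) {
    int largestChunk = (iterations + threads - 1) / threads; // ceiling division
    double meanChunk = double(iterations) / threads;
    return meanChunk / largestChunk; // fraction of time all threads stay busy
}
```

With 161 iterations on 16 threads the busiest thread gets 11 iterations while the mean is about 10.1, giving roughly 91% efficiency; that is where the ~10x rule of thumb comes from.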
On the other hand, as @tim18 points out, if you can vectorize the inner loop while still maintaining enough parallelism, that may be the better thing to do.
On the third hand, there is nothing to stop you doing both:
#pragma omp parallel for simd collapse(2)
for(int b=0; b<bSize; b++)
    for(int n=0; n<N; n++) o[n + b*N] = in[n];
If your inner loop really is this small (and vectorizable) then you certainly want to vectorize it, since, unlike parallelism, vectorization can reduce the total CPU time you use, rather than just moving it between cores.


Differences between `#pragma omp parallel for collapse` and `#pragma omp parallel for`

Firstly, the question might be slightly misleading; I understand the main differences between a parallel region with the collapse clause and one without it. Say I want to transpose a matrix, and there are the following two methods: first, a parallel for with a SIMD directive for the inner loop, and second, a method using the collapse(2) clause:
#pragma omp parallel for
for(int i=0; i<rows; i++){
    #pragma omp simd
    for(int j=0; j<columns; j++){
        *(output + j * rows + i) = *(input + i * columns + j);
    }
}

#pragma omp parallel for collapse(2)
for(int i=0; i<rows; i++){
    for(int j=0; j<columns; j++){
        *(output + j * rows + i) = *(input + i * columns + j);
    }
}
Out of the two above, which implementation would be more efficient and faster, especially in terms of caching? Is there any way to ascertain that just by looking at the implementations?
And given that all the loop counters are independent of each other, can one set a basic guideline as to when to use which?
TIA
TL;DR: both implementations are quite inefficient. The second one will likely be slower than the first in practice, although it could theoretically scale better.
The first implementation is unlikely to be vectorized because the accesses are not contiguous in memory. Both GCC 10 and Clang 11 generate inefficient code.
The point is that OpenMP provides no high-level SIMD construct to deal with data transposition! Thus, if you want to do it efficiently, you probably need to get your hands dirty and do it yourself (or use an external library that does it for you).
The second implementation could be significantly slower than the first because the loop iterator is linearized, often resulting in more instructions being executed in the hot path. Some implementations (e.g. Clang 11 and ICC 19, but not GCC 10) even use a very slow modulus operation (i.e. a div instruction) to do so, resulting in a much slower loop.
The second implementation should, however, theoretically scale better than the first, because the collapse clause provides more parallelism. Indeed, in the first implementation there are only `rows` rows of work to share between n threads. So, if you work on massively parallel machines or on wide rectangular matrices, with n not so small compared to rows, this could cause some work imbalance, or even thread starvation.
Why both implementations are inefficient
The two implementations are inefficient because of the memory access pattern. Indeed, on big matrices, the writes to output are not contiguous and will cause many cache misses. A full cache line (64 bytes on the most common architectures) will be written while only a few bytes of it are actually needed. If columns is a power of two, cache thrashing will occur and further decrease performance.
One solution to mitigate these issues is to use tiling. Here is an example:
#include <cassert>

// Assume rows and columns are nice for sake of clarity ;)
constexpr int tileSize = 8;
assert(rows % tileSize == 0);
assert(columns % tileSize == 0);

// Note the collapse clause is needed here for scalability and
// the collapse overhead is mitigated by the inner loops.
#pragma omp parallel for collapse(2)
for(int i=0; i<rows; i+=tileSize)
{
    for(int j=0; j<columns; j+=tileSize)
    {
        for(int ti=i; ti<i+tileSize; ++ti)
        {
            for(int tj=j; tj<j+tileSize; ++tj)
            {
                output[tj * rows + ti] = input[ti * columns + tj];
            }
        }
    }
}
The above code should be faster, but it is still not optimal. Successfully writing a fast transposition code is challenging. Here are some suggestions to improve it:
- use a temporary tile buffer to improve the memory access pattern (so the compiler can use fast SIMD instructions)
- use square tiles to improve the use of the cache
- use multi-level tiling to improve the use of the L2/L3 cache, or use a Z-tiling approach
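The first suggestion might look like the following sketch (the function name is mine; it assumes double elements, row-major storage, and dimensions divisible by the tile size). The tile is filled with contiguous loads, transposed in a small buffer that stays in L1 cache, then written back with contiguous stores:

```cpp
#include <vector>

constexpr int tileSize = 8;

// Tiled transpose with a per-tile local buffer (illustrative, not optimal).
// Requires rows % tileSize == 0 and columns % tileSize == 0.
void transposeTiled(const double* input, double* output, int rows, int columns) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < rows; i += tileSize) {
        for (int j = 0; j < columns; j += tileSize) {
            double tile[tileSize][tileSize];
            // Contiguous reads from input into the local tile.
            for (int ti = 0; ti < tileSize; ++ti)
                for (int tj = 0; tj < tileSize; ++tj)
                    tile[ti][tj] = input[(i + ti) * columns + (j + tj)];
            // Contiguous writes of the transposed tile into output.
            for (int tj = 0; tj < tileSize; ++tj)
                for (int ti = 0; ti < tileSize; ++ti)
                    output[(j + tj) * rows + (i + ti)] = tile[ti][tj];
        }
    }
}
```

Compiled without OpenMP the pragma is simply ignored and the function runs serially, so the code is easy to test either way.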
Alternatively, you can simply use a fast BLAS implementation that provides well-optimized matrix transposition functions (not all do, but AFAIK OpenBLAS and the MKL do).
PS: I assumed matrices are stored in a row-major order.

openmp parallelizing code with an internal for loop

I'm trying to write code that runs on parallel hardware using MPI and OpenMP. I have the following code piece:
#pragma omp parallel for private(k, temp_r)
for(j=0; j<size; j++){
    temp_r = b[j];
    for(k=0; k<rows; k++){
        temp_r = temp_r - A[j*rows + k] * x[k];
    }
    r[j] = temp_r;
}
I know this code could be further improved because the internal for loop is a reduction. I can do a reduction for one for loop, but I'm not sure how to go about it here since there are two for loops involved. Any insight would be helpful.
If size >> #CPUs, then using a reduction for the inner loop will only reduce performance: a reduction needs an extra log(#CPUs) steps compared to a serial for.
Thus parallelizing this code any further will not gain improvement and will probably harm it.
It would, however, improve performance if size < #CPUs. This is because you will have fewer work-chunks than CPUs.
Cache optimizations are also not viable. Each basic op (temp_r = temp_r - A[j*rows + k] * x[k]) requires reading two values (A[j*rows + k] and x[k]), one of which (A[j*rows + k]) is exclusive to that op, which means it is not in the cache.
If you are working on an out-of-order execution CPU (which you probably are), you will not gain any improvement from trying to improve the cache locality of reading the x array, because the CPU will also have to wait for the second read and will issue both simultaneously (it only starts the op once both values are ready).
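For reference, the outer-loop-only version the answer recommends can be packaged like this (a sketch; the wrapper name and the std::vector signature are mine, the loop body is the asker's):

```cpp
#include <vector>

// Residual r = b - A*x with only the outer loop parallelized, as in the
// question. A is stored row-major: `size` rows of length `rows`.
// Each thread owns its temp_r, so the inner dot product needs no reduction.
std::vector<double> residual(const std::vector<double>& A,
                             const std::vector<double>& b,
                             const std::vector<double>& x,
                             int size, int rows) {
    std::vector<double> r(size);
    #pragma omp parallel for
    for (int j = 0; j < size; j++) {
        double temp_r = b[j];
        for (int k = 0; k < rows; k++)
            temp_r -= A[j * rows + k] * x[k];
        r[j] = temp_r;
    }
    return r;
}
```
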

Does an OpenMP ordered for always assign parts of the loop to threads in order, too?

Background
I am relying on OpenMP parallelization and pseudo-random number generation in my program, but at the same time I would like the results to be perfectly replicable if desired (provided the same number of threads).
I'm seeding a thread_local PRNG for each thread separately like this,
{
    std::minstd_rand master{};
    #pragma omp parallel for ordered
    for(int j = 0; j < omp_get_num_threads(); j++)
    {
        #pragma omp ordered
        global::tl_rng.seed(master());
    }
}
and I've come up with the following way of producing count of some elements and putting them all in an array at the end in a deterministic order (results of thread 0 first, of thread 1 next etc.)
std::vector<Element> all{};
...
#pragma omp parallel if(parallel)
{
    std::vector<Element> tmp{};
    tmp.reserve(count/omp_get_num_threads() + 1);
    // generation loop
    #pragma omp for
    for(size_t j = 0; j < count; j++)
        tmp.push_back(generateElement(global::tl_rng));
    // collection loop
    #pragma omp for ordered
    for(int j = 0; j < omp_get_num_threads(); j++)
    {
        #pragma omp ordered
        all.insert(all.end(),
                   std::make_move_iterator(tmp.begin()),
                   std::make_move_iterator(tmp.end()));
    }
}
The question
This seems to work, but I'm not sure if it's reliable (read: portable). Specifically, if, for example, the second thread is done with its share of the main loop early because its generateElement() calls happened to return quickly, won't it technically be allowed to pick the first iteration of the collecting loop? With my compiler that does not happen and it's always thread 0 doing j = 0, thread 1 doing j = 1, etc., as intended. Does that follow from the standard, or is it allowed to be compiler-specific behaviour?
I could not find much about the ordered clause in the for directive except that it is required if the loop contains an ordered directive inside. Does it always guarantee that the threads will split the loop from the start in increasing thread_num? Where does it say so in citable sources? Or do I have to make my "generation" loop ordered as well – does it actually make a difference (performance- or logic-wise) when there's no ordered directive in it?
Please don't answer by experience, or by how OpenMP would logically be implemented. I'd like to be backed by the standard.
No, the code in its current state is not portable. It will work only if the default loop schedule is static, that is, the iteration space is divided into count / #threads contiguous chunks and then assigned to the threads in the order of their thread ID with a guaranteed mapping between chunk and thread ID. But the OpenMP specification does not prescribe any default schedule and leaves it to the implementation to pick one. Many implementations use static, but that is not guaranteed to always be the case.
If you add schedule(static) to all loop constructs, then the combination of the ordered clause and the ordered construct within each loop body will ensure that thread 0 receives the first chunk of iterations and is also the first to execute the ordered construct. For the loops that run over the number of threads, the chunk size will be one, i.e. each thread will execute exactly one iteration, and the order of the iterations of the parallel loop will match that of a sequential loop. The 1:1 mapping of iteration number to thread ID done by the static schedule will then ensure the behaviour you are aiming for.
Note that if the first loop, where you initialise the thread-local PRNGs, is in a different parallel region, you must ensure that both parallel regions execute with the same number of threads, e.g., by disabling dynamic team sizing (omp_set_dynamic(0);) or by explicit application of the num_threads clause.
As to the significance of the ordered clause + construct, it does not influence the assignment of iterations to threads, but it synchronises the threads and makes sure that the physical execution order will match the logical one. A statically scheduled loop without an ordered clause will still assign iteration 0 to thread 0, but there will be no guarantee that some other thread won't execute its loop body ahead of thread 0. Also, any code in the loop body outside of the ordered construct is still allowed to execute concurrently and out of order - see here for a more detailed explanation.
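A minimal sketch of the fixed pattern, with schedule(static) added as described (int elements stand in for the asker's Element type, j*j for generateElement(); the #ifdef stub lets the code also build without OpenMP):

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_num_threads() { return 1; }
#endif

// Generation plus ordered collection. schedule(static) guarantees a fixed
// mapping of iterations to thread IDs, and the ordered construct makes the
// inserts happen in iteration (hence thread-ID) order, so the final vector
// is in sequential order regardless of which thread finishes first.
std::vector<int> generateAndCollect(int count) {
    std::vector<int> all;
    #pragma omp parallel
    {
        std::vector<int> tmp;
        // generation loop: static schedule fixes which iterations each thread owns
        #pragma omp for schedule(static)
        for (int j = 0; j < count; j++)
            tmp.push_back(j * j);           // stand-in for generateElement()
        // collection loop: one iteration per thread, run in thread-ID order
        #pragma omp for schedule(static) ordered
        for (int j = 0; j < omp_get_num_threads(); j++) {
            #pragma omp ordered
            all.insert(all.end(), tmp.begin(), tmp.end());
        }
    }
    return all;
}
```
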

Reduction in Openmp returns different results with the same number of threads in my code

My code with OpenMP using "reduction" doesn't return the same results from run to run.
Case 1: using "reduction"
sum = 0;
omp_set_num_threads(4);
#pragma omp parallel for reduction(+:sum)
for(ii = 0; ii < 100; ii++)
    sum = sum + func(ii);
where func(ii) has side effects. In fact, func(ii) uses another function, calcul(), which can lead to a race condition in parallel execution. I think the calcul() function may be the reason for this problem. When I use "critical" the result is always the same, but that solution is not good for performance.
Case 2: using "critical"
sum = 0;
#pragma omp parallel for
for(ii = 0; ii < 100; ii++)
{
    #pragma omp critical
    sum = sum + func(ii);
}
with the func(ii) function:
int func(int val)
{
    read_file(val);
    calcul(); /* calculate something from read_file(val) */
    return val_fin;
}
Please help me resolve this. Thanks a lot!
The reason you're getting poor performance in the second case is that the entire loop body is inside a critical section, so nothing can actually execute in parallel.
Since you say there are some race conditions in the calcul function, consider putting a critical section just on that line inside func. That way, the files can be read in parallel (which may be the I/O that is slowing down your execution anyway).
If the performance is still poor, you will need to look into the nested calcul function and try to identify the race conditions.
Basically, you want to push any critical sections down as far as possible or eliminate them entirely. If it comes down to very simple updates to shared variables, in some cases you can use the OpenMP atomic pragma instead, which has better performance but is much less flexible.
Even if everything in the code is correct, you still might get different results from the OpenMP reduction due to the non-associativity of floating-point addition.
To be able to reproduce the same result for a given number of threads, you need to implement the reduction yourself by storing the partial sum of each thread in a shared array. After the parallel region, the master thread can add these results. This approach implies that the threads always execute the same iterations, i.e. a static scheduling policy.
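That manual reduction might look like the following sketch (the function name is mine; the #ifdef stubs let it build without OpenMP):

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Reproducible sum: each thread accumulates locally, stores its partial
// result into its own slot of a shared array, and the final pass adds the
// slots in a fixed order. schedule(static) pins the same iterations to the
// same threads on every run, so the result is identical for a given thread
// count.
double reproducibleSum(const std::vector<double>& values) {
    std::vector<double> partial(omp_get_max_threads(), 0.0);
    #pragma omp parallel
    {
        double local = 0.0;
        #pragma omp for schedule(static)
        for (int ii = 0; ii < (int)values.size(); ii++)
            local += values[ii];
        partial[omp_get_thread_num()] = local; // one store per thread
    }
    double sum = 0.0;
    for (double p : partial)                   // deterministic summation order
        sum += p;
    return sum;
}
```

Accumulating into a local variable and storing once at the end also avoids false sharing on the partial array.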
Related question:
Order of execution in Reduction Operation in OpenMP

OpenMP parallel for does not work

I'm studying OpenMP now, and I have a question. The runtime of the following code and of the same code without the parallel section is statistically equal, even though all threads are accessing the function. I tried to look at some guides on the internet, but it did not help. So the question is: what is wrong with this parallel section?
int sumArrayParallel( )
{
    int i = 0;
    int sum = 0;
    #pragma omp parallel for
    for (i = 0; i < arraySize; ++i)
    {
        cout << omp_get_thread_num() << " ";
        sum += testArray[i];
    }
    return sum;
}
There are two very common causes of OpenMP codes failing to exhibit improved performance over their serial counterparts:
The work being done is not sufficient to outweigh the overhead of parallel computation. Think of there being a cost, in time, for setting up a team of threads, for distributing work to them, and for gathering results from them. Unless this cost is less than the time saved by parallelising the computation, an OpenMP code, even if correct, will not show any speed-up and may show the opposite. You haven't shown us the numbers, so do the calculations on this yourself.
The programmer imposes serial operation on the parallel program, perhaps by wrapping data access inside memory fences, perhaps by accessing platform resources which are inherently serial. I suspect (but my knowledge of C++ is lousy) that your writing to cout may inadvertently serialise that part of your computation.
Of course, you can have a mixture of these two problems, too much serialisation and not enough work, resulting in disappointing performance.
For further reading this page on Intel's website is useful, and not just for beginners.
I think, though, that you have a more serious problem with your code than its poor parallel performance. Does the OpenMP version produce the correct sum? Since you have made no specific provision, sum is shared by all threads and they will race for access to it. While learning OpenMP it is a very good idea to attach the clause default(none) to your parallel regions and to take responsibility for defining the shared/private status of each variable in each region. Then, once you are fluent in OpenMP, you will know why it makes sense to continue to use the default(none) clause.
Even if you reply "Yes, the code does produce the correct result", the data race exists and your program can't be trusted. Data races are funny like that: they don't show up in any of the tests you run, then, once you roll out your code into production, bang! and egg all over your face.
However, you seem to be rolling your own reduction and OpenMP provides the tools for doing this. Investigate the reduction clause in your OpenMP references. If I read your code correctly, and taking into account the advice above, you could rewrite the loop to
#pragma omp parallel for default(none) shared(arraySize, testArray) private(i) reduction(+:sum)
for (i = 0; i < arraySize; ++i)
{
    sum += testArray[i];
}
(Note that sum appears only in the reduction clause; listing a variable in both shared and reduction is an error.)
In a nutshell, using the reduction clause tells OpenMP to sort out the problems of summing a single value from work distributed across threads, avoiding race conditions etc.
Since OpenMP makes loop iteration variables private by default you could omit the clause private(i) from the directive without too much risk. Even better though might be to declare it inside the for statement:
#pragma omp parallel for default(none) shared(arraySize, testArray) reduction(+:sum)
for (int i = 0; i < arraySize; ++i)
Variables declared inside parallel regions are (leaving aside some special cases) always private.
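Putting it together, a compilable version of the corrected function might look like this (a sketch that takes the array as a parameter in place of the asker's globals):

```cpp
#include <vector>

// Corrected reduction: sum is listed only in reduction(+:sum), so each
// thread works on a private copy and OpenMP combines the copies at the end
// of the loop. The loop variable i, declared in the for statement, is
// private automatically.
int sumArrayParallel(const std::vector<int>& testArray) {
    int sum = 0;
    int n = (int)testArray.size();
    #pragma omp parallel for default(none) shared(testArray, n) reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += testArray[i];
    return sum;
}
```

Compiled without -fopenmp the pragma is ignored and the function simply runs serially, returning the same sum.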
