The difference between these two OpenMP constructs - openmp

Is there any reason to use 2nd construct if I have only 1 for loop and nothing else? Thank you!
#pragma omp parallel for
// for loop goes here

#pragma omp parallel
{
#pragma omp for
// for loop goes here
}

With most implementations the first structure will only have one implicit barrier, while the second may have two (depending on how good the implementation is at removing redundant barriers). If the implementation is good though, you shouldn't see any difference between the two.

I totally second what ejd said.
I would add the fact that one may use the nowait clause so that the threads do not synchronize at the end of the parallel loop.
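As a rough sketch of where the separate form becomes useful: with two independent loops in one parallel region, nowait on the first lets threads move on without waiting at its end. The function name fill_both and the array contents are made up for illustration.

```c
/* Hypothetical helper: fills two independent arrays inside one parallel region. */
void fill_both(double *a, double *b, int n)
{
    #pragma omp parallel
    {
        /* nowait: threads that finish their share early skip the barrier
           at the end of this loop and start on the next one. */
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * i;

        /* Safe only because this loop never reads a[]. */
        #pragma omp for
        for (int i = 0; i < n; i++)
            b[i] = i + 1.0;
    }
}
```

The thread team is created once and reused for both loops, which the combined parallel for form cannot express.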

Related

pragma omp for of atomic operations on a histogram

I'm having trouble in efficiently parallelizing the next line of code:
# pragma omp for nowait
for (int i = 0; i < M; i++) {
# pragma omp atomic
centroids[points[i].cluster].points_in_cluster++;
}
This runs, I guess due to the omp for overhead, slower than this:
# pragma omp single nowait
for (int i = 0; i < M; i++) {
centroids[points[i].cluster].points_in_cluster++;
}
Is there any way to make this go faster?
Theory
While atomics are certainly better than locks or critical regions due to their implementation in hardware on most platforms, they are still in general to be avoided if possible as they do not scale well, i.e. increasing the number of threads will create more atomic collisions and therefore more overhead. Further hardware-/implementation-specific bottlenecks due to atomics are described in the comments below the question and this answer by @PeterCordes.
The alternative to atomics is a parallel reduction algorithm. Assuming that there are many more points than centroids, one can use OpenMP's reduction clause to let every thread have a private version of centroids. These private histograms will be consolidated in an implementation-defined fashion after filling them.
There is no guarantee that this technique is faster than using atomics in every possible case. It could depend not only on the size of the two index spaces, but also on the data, as it determines the number of collisions when using atomics. A proper parallel reduction algorithm is in general still expected to scale better to large numbers of threads.
Practice
The problem with using reduction in your code is the Array-of-Structs (AoS) data layout. Specifying
# pragma omp for reduction(+: centroids[0:num_centroids])
will produce an error at build time, as the compiler does not know how to reduce the user-defined type of centroids. Specifying
# pragma omp for reduction(+: centroids[0:num_centroids].points_in_cluster)
does not work either as it is not a valid OpenMP array section.
One can try to use a custom reduction here, but I do not know how to combine a user-defined reduction with OpenMP array sections (see the edit at the end). Also it could be very inefficient to create all the unused variables in the centroid struct on every thread.
With a Struct-of-Arrays (SoA) data layout you would just have a plain integer buffer, e.g. int *points_in_clusters, which could then be used in the following way (assuming that there are num_centroids elements in centroids and now points_in_clusters):
# pragma omp for nowait reduction(+: points_in_clusters[0:num_centroids])
for (int i = 0; i < M; i++) {
points_in_clusters[points[i].cluster]++;
}
If you cannot just change the data layout, you could still use some scratch space for the OpenMP reduction and afterwards copy the results back to the centroid structs in another loop. But this additional copy operation could eat into the savings from using reduction in the first place.
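A minimal sketch of that scratch-space variant, assuming OpenMP 4.5 or later for array-section reductions. The struct layouts (centroid_t, point_t) and the function name count_points are assumptions for illustration; only the field names cluster and points_in_cluster come from the question.

```c
#include <stdlib.h>

/* Assumed layouts; the real structs presumably carry more fields. */
typedef struct { double x, y; int points_in_cluster; } centroid_t;
typedef struct { double x, y; int cluster; } point_t;

/* Hypothetical helper: returns 0 on success, -1 if the scratch buffer
   cannot be allocated. */
int count_points(centroid_t *centroids, int num_centroids,
                 const point_t *points, int M)
{
    /* Plain integer scratch histogram, zero-initialised. */
    int *counts = calloc((size_t)num_centroids, sizeof *counts);
    if (!counts)
        return -1;

    /* Each thread fills a private copy of counts[0:num_centroids];
       the copies are summed into counts at the end of the loop. */
    #pragma omp parallel for reduction(+: counts[0:num_centroids])
    for (int i = 0; i < M; i++)
        counts[points[i].cluster]++;

    /* Extra pass: copy the results back into the AoS layout. */
    for (int c = 0; c < num_centroids; c++)
        centroids[c].points_in_cluster += counts[c];

    free(counts);
    return 0;
}
```

The copy-back loop costs O(num_centroids), so it should be cheap relative to the counting loop as long as there are far fewer centroids than points.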
Using SoA also has benefits for (auto-) vectorization (of other loops) and potentially improves cache locality for regular access patterns. AoS on the other hand can be better for cache locality when encountering random access patterns (e.g. most sorting algorithms if the comparison makes use of multiple variables from the struct).
PS: Be careful with nowait. Does the following work really not depend on the resulting points_in_cluster?
EDIT: I removed my alternative implementation using a user-defined reduction operator as it was not working. I seem to have fixed the problem, but I do not have enough confidence in this implementation (performance- and correctness-wise) to add it back into the answer. Feel free to improve upon the linked code and post another answer.

Same class, 2 programs, different OpenMP speedups; MSVC2017

I have a C++ class, several of whose functions have OpenMP parallel for loops. I'm building it into two apps with MSVC2017, and find that one of those functions runs differently in the 2 apps. The function has two separate parallel for loops. In one build, the VS debugger shows them both using 7 cores for a solid second while processing a block of test data; in the other, it shows just two blips of multicore usage, presumably at the beginning of each parallel section, but only 1 processor runs most of the time.
These functions are deep inside the code for the class, which is identical in the 2 apps. The builds have the same compiler and linker options so far as I can see. I generate the projects with CMake and never modify them by hand.
Can anyone suggest possible reasons for this behavior? I am fully aware of other ways to parallelize code, so please don't tell me about those. I am just looking for expertise on OpenMP under MSVC.
I expect the two calls are passing in significantly different amounts of work. Consider (example, trivial, typed into this post, not compiled, not the way to write this!) code like
void scale(int n, double *d, double f) {
#pragma omp parallel for
for (int i=0; i<n; i++)
d[i] = d[i] * f;
}
If invoked with a large vector where n == 10000, you'll get some parallelism and many threads working. If called with n == 3 there's obviously only work for three threads!
If you use #pragma omp parallel for schedule(dynamic) it's quite possible that even with ten or twenty iterations a single thread will execute most of them.
In summary: context matters.

Must ordered be at the end?

#pragma omp parallel for ordered
for (int i = 0; i < n; ++i) {
... code happens nicely in parallel here ...
#pragma omp ordered
{
.. one at a time in order of i, as expected, good ...
}
... single threaded here but I expected parallel ...
}
I expected the next thread to enter the ordered section as soon as this thread left the ordered section. But the next thread only enters the ordered section when the for loop's body ends. So the code after the ordered section ends goes serially.
The OpenMP 4.0 manual contains :
The ordered construct specifies a structured block in a loop region
that will be executed in the order of the loop iterations. This
sequentializes and orders the code within an ordered region while
allowing code outside the region to run in parallel.
Where I've added the bold. I'm reading "outside" to include after the ordered section ends.
Is this expected? Must the ordered section in fact be last?
I've searched for an answer and did find one other place where someone observed similar nearly 2 years ago : https://stackoverflow.com/a/32078625/403310 :
Testing with gfortran 5.2, it appears everything after the ordered
region is executed in order for each loop iteration, so having the
ordered block at the beginning of the loop leads to serial performance
while having the ordered block at the end of the loop does not have
this implication as the code before the block is parallelized. Testing
with ifort 15 is not as dramatic but I would still recommend
structuring your code so your ordered block occurs after any code than
needs parallelization in a loop iteration rather than before.
I'm using gcc 5.4.0 on Ubuntu 16.04.
Many thanks.
There is no need for the ordered region to be last. The behavior you observe is implementation dependent, and a known flaw in libgomp (the OpenMP runtime library from gcc). I suppose this behavior is tolerated by the standard though clearly not optimal.
Technically, the compiler produces the following code from the annotations:
#pragma omp parallel for ordered
for (int i = 0; i < n; ++i) {
... code happens nicely in parallel here ...
GOMP_ordered_start();
{
.. one at a time in order of i, as expected, good ...
}
GOMP_ordered_end();
... single threaded here but I expected parallel ...
GOMP_loop_ordered_static_next();
}
Unfortunately, GOMP_ordered_end is implemented as follows:
/* This function is called by user code when encountering the end of an
ORDERED block. With the current ORDERED implementation there's nothing
for us to do.
However, the current implementation has a flaw in that it does not allow
the next thread into the ORDERED section immediately after the current
thread exits the ORDERED section in its last iteration. The existance
of this function allows the implementation to change. */
void
GOMP_ordered_end (void)
{
}
I speculate that there just never was a significant use case for this, given that ordered is probably most commonly used like this:
#pragma omp parallel for ordered
for (...) {
result = expensive_computation();
#pragma omp ordered
{
append(results, result);
}
}
The OpenMP runtime from the Intel compiler does not suffer from this flaw.
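For reference, a compilable sketch of that common pattern, where the ordered block comes last and the libgomp behavior therefore does not hurt. expensive_computation is a stand-in and collect_in_order is a made-up name.

```c
/* Stand-in for real work; assumed to be free of side effects. */
static int expensive_computation(int i)
{
    return i * i;
}

/* Hypothetical helper: stores one result per iteration, in iteration order.
   Returns the number of results written. */
int collect_in_order(int *results, int n)
{
    int count = 0;

    #pragma omp parallel for ordered
    for (int i = 0; i < n; i++) {
        int r = expensive_computation(i);   /* runs in parallel */

        #pragma omp ordered
        {
            /* One thread at a time, in increasing order of i, so the
               shared counter needs no extra protection here. */
            results[count++] = r;
        }
    }
    return count;
}
```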

Reduction in Openmp returns different results with the same number of threads in my code

My code with openmp using "reduction" doesn't return the same results from run to run.
Case 1: using "reduction"
sum = 0;
omp_set_num_threads(4);
#pragma omp parallel for reduction(+:sum)
for(ii = 0; ii < 100; ii++)
sum = sum + func(ii);
where func(ii) has side effects. In fact, func(ii) calls another function, calcul(), which can lead to a race condition in parallel execution. I think the calcul() function may be the reason for this problem. When I use "critical" instead, the result is always the same, but this solution is not good for performance.
Case 2: using "critical"
sum = 0;
#pragma omp parallel for
for(ii = 0; ii < 100; ii++)
{
#pragma omp critical
sum = sum + func(ii);
}
with the func(ii) function
func(int val)
{
read_file(val);
calcul(); /*calculate something from reading_file(val)*/
return val_fin;
}
Could you please help me resolve this?
Thanks a lot!
The reason you're getting poor performance in the second case is the entire loop body is in a critical, so it can't actually execute anything in parallel.
Since you say there are some race conditions in the calcul function, consider putting a critical section just on that line inside func. That way, the files can be read in parallel (which may be the I/O that is slowing down your execution anyway).
If the performance is still poor, you will need to look into the nested calcul function and try to identify the race conditions.
Basically, you want to push any critical sections down as far as possible or eliminate them entirely. If it comes down to very simple updates to shared variables, in some cases you can use the OpenMP atomic pragma instead, which has better performance but is much less flexible.
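To illustrate the last point, here is a small sketch in which only a single shared update is protected with atomic while the rest of the loop body runs in parallel. The function count_even and its data are invented for this example and are not from the question.

```c
/* Hypothetical example: count the even values in v. */
int count_even(const int *v, int n)
{
    int hits = 0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (v[i] % 2 == 0) {
            /* atomic guards just this one increment, not the whole body;
               it only applies to simple updates like this. */
            #pragma omp atomic
            hits++;
        }
    }
    return hits;
}
```

For a plain counter like this a reduction(+:hits) clause would likely be the better choice; the sketch is only meant to show how narrow the protected region can be.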
Even if everything in the code is correct, you still might get different results from the OpenMP reduction if sum is a floating-point variable: floating-point addition is not associative, and the order in which the partial sums are combined is not specified.
To be able to reproduce the same result for a given number of threads, you need to implement the reduction yourself by storing the partial sum of each thread in a shared array. After the parallel region, the master thread can add these results. This approach implies that the threads always execute the same iterations, i.e. a static scheduling policy.
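A sketch of a hand-written reduction along these lines. Note that it is a variation on the idea above: instead of one slot per thread it uses a fixed number of blocks (NUM_BLOCKS, an arbitrary assumption), so it does not need the thread id and the result does not depend on the thread count either. func here is a side-effect-free stand-in, not the func from the question.

```c
#define NUM_BLOCKS 8   /* assumed; fixed independently of the thread count */

/* Stand-in for the real per-iteration work. */
static double func(int i)
{
    return 1.0 / (i + 1);
}

/* Hypothetical helper: sums func(0) .. func(n-1) in a fixed order. */
double blocked_sum(int n)
{
    double partial[NUM_BLOCKS] = {0.0};

    /* Each block is summed serially by whichever thread picks it up. */
    #pragma omp parallel for schedule(static)
    for (int b = 0; b < NUM_BLOCKS; b++) {
        int lo = (int)((long long)n * b / NUM_BLOCKS);
        int hi = (int)((long long)n * (b + 1) / NUM_BLOCKS);
        double s = 0.0;
        for (int i = lo; i < hi; i++)
            s += func(i);
        partial[b] = s;
    }

    /* Combine the partial sums serially, always in the same order. */
    double sum = 0.0;
    for (int b = 0; b < NUM_BLOCKS; b++)
        sum += partial[b];
    return sum;
}
```

Because both the order inside each block and the order of the final combination are fixed, repeated runs should give bitwise identical sums, though the value can still differ slightly from a purely serial loop.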
Related question:
Order of execution in Reduction Operation in OpenMP

Open mp parallel for does not work

I'm studying OpenMP now, and I have a question. The run time of the following code and of the same code without a parallel section is statistically equal, even though all threads are accessing the function. I tried to look at some guides on the internet, but it did not help. So the question is, what is wrong with this parallel section?
int sumArrayParallel( )
{
int i = 0;
int sum = 0;
#pragma omp parallel for
for (i = 0; i < arraySize; ++i)
{
cout << omp_get_thread_num() << " ";
sum += testArray[i];
}
return sum;
}
There are two very common causes of OpenMP codes failing to exhibit improved performance over their serial counterparts:
1. The work being done is not sufficient to outweigh the overhead of parallel computation. Think of there being a cost, in time, for setting up a team of threads, for distributing work to them, for gathering results from them. Unless this cost is less than the time saved by parallelising the computation, an OpenMP code, even if correct, will not show any speed-up and may show the opposite. You haven't shown us the numbers, so do the calculations on this yourself.
2. The programmer imposes serial operation on the parallel program, perhaps by wrapping data access inside memory fences, perhaps by accessing platform resources which are inherently serial. I suspect (but my knowledge of C is lousy) that your writing to cout may inadvertently serialise that part of your computation.
Of course, you can have a mixture of these two problems, too much serialisation and not enough work, resulting in disappointing performance.
For further reading this page on Intel's website is useful, and not just for beginners.
I think, though, that you have a more serious problem with your code than its poor parallel performance. Does the OpenMP version produce the correct sum? Since you have made no specific provision, sum is shared by all threads and they will race for access to it. While learning OpenMP it is a very good idea to attach the clause default(none) to your parallel regions and to take responsibility for defining the shared/private status of each variable in each region. Then, once you are fluent in OpenMP, you will know why it makes sense to continue to use the default(none) clause.
Even if you reply "Yes, the code does produce the correct result", the data race exists and your program can't be trusted. Data races are funny like that: they don't show up in all the tests you run, and then, once you roll out your code into production, bang! Egg all over your face.
However, you seem to be rolling your own reduction and OpenMP provides the tools for doing this. Investigate the reduction clause in your OpenMP references. If I read your code correctly, and taking into account the advice above, you could rewrite the loop to
#pragma omp parallel for default(none) shared(arraySize, testArray) private(i) reduction(+:sum)
for (i = 0; i < arraySize; ++i)
{
sum += testArray[i];
}
In a nutshell, using the reduction clause tells OpenMP to sort out the problems of summing a single value from work distributed across threads, avoiding race conditions etc.
Since OpenMP makes loop iteration variables private by default you could omit the clause private(i) from the directive without too much risk. Even better though might be to declare it inside the for statement:
#pragma omp parallel for default(none) shared(arraySize, testArray) reduction(+:sum)
for (int i = 0; i < arraySize; ++i)
Variables declared inside parallel regions are (leaving aside some special cases) always private.
