Parallel Delaunay triangulation

I am trying to parallelize the Guibas-Stolfi Delaunay triangulation using OpenMP.
There are two things to parallelize here:
the mergesort(), which I did, and
the divide(), where I am stuck.
I have tried every approach I could think of, but in vain.
divide() follows the same divide-and-conquer structure as mergesort(), but the same parallelization technique (omp sections) works only for mergesort.
I also tried the parallelization technique shown here, but even that doesn't work.
I read about nested parallelism somewhere, but I am not sure how to implement it.
Can anybody explain how divide-and-conquer algorithms are parallelized?
CODE: I called merge_sort twice in the main function and applied the sections construct. Doing the same for the divide function doesn't work:
#pragma omp parallel
{
    #pragma omp sections nowait
    {
        #pragma omp section
        {
            merge_sort(p_sorted, p_temp, 0, n/2);
        }
        #pragma omp section
        {
            merge_sort(p_sorted, p_temp, (n/2)+1, n-1);
        }
    }
}

I was successful parallelizing this with CreateThread calls on Windows; the trick is to divide the points into 2^n buffers, process each buffer in a separate thread, and then merge adjacent edges successively until one final merge.
I have a demonstration program that creates random data, triangulates it, and displays the results (for smaller cases). It doesn't look like this site lets me upload the .zip I have of the program and display tool. If you can suggest an upload site or provide an email, I'll send it to you.

Related

pragma omp for with atomic operations on a histogram

I'm having trouble efficiently parallelizing the following loop:
#pragma omp for nowait
for (int i = 0; i < M; i++) {
    #pragma omp atomic
    centroids[points[i].cluster].points_in_cluster++;
}
This runs slower than the following, I guess due to the omp for overhead:
#pragma omp single nowait
for (int i = 0; i < M; i++) {
    centroids[points[i].cluster].points_in_cluster++;
}
Is there any way to make this go faster?
Theory
While atomics are certainly better than locks or critical regions due to their hardware implementation on most platforms, they are still in general to be avoided if possible, as they do not scale well: increasing the number of threads will create more atomic collisions and therefore more overhead. Further hardware- and implementation-specific bottlenecks due to atomics are described in the comments below the question and in this answer by @PeterCordes.
The alternative to atomics is a parallel reduction algorithm. Assuming that there are many more points than centroids, one can use OpenMP's reduction clause to give every thread a private version of centroids. These private histograms are then consolidated in an implementation-defined fashion after being filled.
There is no guarantee that this technique is faster than using atomics in every possible case. It could not only depend on the size of the two index spaces, but also on the data as it determines the number of collisions when using atomics. A proper parallel reduction algorithm is in general still expected to scale better to big numbers of threads.
Practice
The problem with using a reduction in your code is the Array-of-Structs (AoS) data layout. Specifying
# pragma omp for reduction(+: centroids[0:num_centroids])
will produce an error at build time, as the compiler does not know how to reduce the user-defined type of centroids. Specifying
# pragma omp for reduction(+: centroids[0:num_centroids].points_in_cluster)
does not work either as it is not a valid OpenMP array section.
One can try to use a custom reduction here, but I do not know how to combine a user-defined reduction with OpenMP array sections (see the edit at the end). Also, it could be very inefficient to create all the unused variables in the centroid struct on every thread.
With a Struct-of-Arrays (SoA) data layout you would just have a plain integer buffer, e.g. int *points_in_clusters, which could then be used in the following way (assuming there are num_centroids elements in both centroids and points_in_clusters):
#pragma omp for nowait reduction(+: points_in_clusters[0:num_centroids])
for (int i = 0; i < M; i++) {
    points_in_clusters[points[i].cluster]++;
}
If you cannot just change the data layout, you could still use some scratch space for the OpenMP reduction and afterwards copy the results back to the centroid structs in another loop. But this additional copy operation could eat into the savings from using reduction in the first place.
Using SoA also has benefits for (auto-) vectorization (of other loops) and potentially improves cache locality for regular access patterns. AoS on the other hand can be better for cache locality when encountering random access patterns (e.g. most sorting algorithms if the comparison makes use of multiple variables from the struct).
PS: Be careful with nowait. Does the following work really not depend on the resulting points_in_cluster?
EDIT: I removed my alternative implementation using a user-defined reduction operator as it was not working. I seem to have fixed the problem, but I do not have enough confidence in this implementation (performance- and correctness-wise) to add it back into the answer. Feel free to improve upon the linked code and post another answer.

Differences between `#pragma omp parallel for collapse` and `#pragma omp parallel for`

Firstly, the question might be slightly misleading; I understand the main differences between a parallel region with the collapse clause and one without. Say I want to transpose a matrix using the following two methods: first, a parallel for with a SIMD directive for the inner loop, and second, the collapse(2) clause:
#pragma omp parallel for
for(int i=0; i<rows; i++){
    #pragma omp simd
    for(int j=0; j<columns; j++){
        *(output + j * rows + i) = *(input + i * columns + j);
    }
}
#pragma omp parallel for collapse(2)
for(int i=0; i<rows; i++){
    for(int j=0; j<columns; j++){
        *(output + j * rows + i) = *(input + i * columns + j);
    }
}
Of the two methods above, which would be more efficient and faster, especially in terms of caching? Is there any way to ascertain that just by looking at the implementations?
And given that all the loop counters are independent of each other, can one set a basic guideline as to when to use which?
TIA
TL;DR: both implementations are quite inefficient. The second one will likely be slower than the first in practice, although it could theoretically scale better.
The first implementation is unlikely to be vectorized because the accesses are not contiguous in memory. Both GCC 10 and Clang 11 generate inefficient code.
The point is that OpenMP provides no high-level SIMD construct to deal with data transposition! Thus, if you want to do it efficiently, you probably need to get your hands dirty and do it yourself (or use an external library that does it for you).
The second implementation could be significantly slower than the first because the loop iterator is linearized, often resulting in more instructions being executed in the hot path. Some implementations (e.g. Clang 11 and ICC 19, but not GCC 10) even use a very slow modulus operation (i.e. a div instruction) to do so, resulting in a much slower loop.
The second implementation should theoretically scale better than the first because the collapse clause provides more parallelism. Indeed, in the first implementation there are only rows iterations of the outer loop to share between n threads. So if you work on massively parallel machines or wide rectangular matrices, with n not so small compared to rows, this could cause work imbalance, or even thread starvation.
Why both implementations are inefficient
The two implementations are inefficient because of the memory access pattern. Indeed, on big matrices, writes to output are not contiguous and will cause many cache misses. A full cache line (64 bytes on most common architectures) will be written while only a few bytes of it are actually modified. If columns is a power of two, cache thrashing will occur and further decrease performance.
One solution to mitigate these issues is to use tiling. Here is an example:
// Assume rows and columns are nice for sake of clarity ;)
constexpr int tileSize = 8;
assert(rows % tileSize == 0);
assert(columns % tileSize == 0);

// Note the collapse clause is needed here for scalability and
// the collapse overhead is mitigated by the inner loop.
#pragma omp parallel for collapse(2)
for(int i=0; i<rows; i+=tileSize)
{
    for(int j=0; j<columns; j+=tileSize)
    {
        for(int ti=i; ti<i+tileSize; ++ti)
        {
            for(int tj=j; tj<j+tileSize; ++tj)
            {
                output[tj * rows + ti] = input[ti * columns + tj];
            }
        }
    }
}
The above code should be faster, but not optimal. Successfully writing a fast transposition code is challenging. Here are some tips to improve it:
use a temporary tile buffer to improve the memory access pattern (so the compiler can use fast SIMD instructions)
use square tiles to improve the use of the cache
use multi-level tiling to improve the use of the L2/L3 cache or use a Z-tiling approach
Alternatively, you can simply use a fast BLAS implementation that provides well-optimized matrix transposition functions (not all do, but AFAIK OpenBLAS and the MKL do).
PS: I assumed matrices are stored in a row-major order.

Same class, 2 programs, different OpenMP speedups; MSVC2017

I have a C++ class, several of whose functions have OpenMP parallel for loops. I'm building it into two apps with MSVC2017, and find that one of those functions runs differently in the 2 apps. The function has two separate parallel for loops. In one build, the VS debugger shows them both using 7 cores for a solid second while processing a block of test data; in the other, it shows just two blips of multicore usage, presumably at the beginning of each parallel section, but only 1 processor runs most of the time.
These functions are deep inside the code for the class, which is identical in the 2 apps. The builds have the same compiler and linker options so far as I can see. I generate the projects with CMake and never modify them by hand.
Can anyone suggest possible reasons for this behavior? I am fully aware of other ways to parallelize code, so please don't tell me about those. I am just looking for expertise on OpenMP under MSVC.
I expect the two calls are passing in significantly different amounts of work. Consider code like the following (an example, trivial, typed into this post, not compiled, and not the way to write this!):
void scale(int n, double *d, double f) {
    #pragma omp parallel for
    for (int i=0; i<n; i++)
        d[i] = d[i] * f;
}
If invoked with a large vector where n == 10000, you'll get some parallelism and many threads working. If called with n == 3, there's obviously only work for three threads!
If you use #pragma omp parallel for schedule(dynamic) it's quite possible that even with ten or twenty iterations a single thread will execute most of them.
In summary: context matters.

OpenMP parallel for does not work

I'm studying OpenMP now, and I have a question. The run time of the following code is statistically equal to that of the same code without the parallel section, even though all threads are executing the function. I tried to look at some guides on the internet, but that did not help. So the question is, what is wrong with this parallel section?
int sumArrayParallel( )
{
    int i = 0;
    int sum = 0;
    #pragma omp parallel for
    for (i = 0; i < arraySize; ++i)
    {
        cout << omp_get_thread_num() << " ";
        sum += testArray[i];
    }
    return sum;
}
There are two very common causes of OpenMP codes failing to exhibit improved performance over their serial counterparts:
The work being done is not sufficient to outweigh the overhead of parallel computation. Think of there being a cost, in time, for setting up a team of threads, for distributing work to them, and for gathering results from them. Unless this cost is less than the time saved by parallelising the computation, an OpenMP code, even if correct, will not show any speed-up and may show the opposite. You haven't shown us the numbers, so do the calculations on this yourself.
The programmer imposes serial operation on the parallel program, perhaps by wrapping data access inside memory fences, perhaps by accessing platform resources which are inherently serial. I suspect (but my knowledge of C is lousy) that your writing to cout may inadvertently serialise that part of your computation.
Of course, you can have a mixture of these two problems, too much serialisation and not enough work, resulting in disappointing performance.
For further reading this page on Intel's website is useful, and not just for beginners.
I think, though, that you have a more serious problem with your code than its poor parallel performance. Does the OpenMP version produce the correct sum? Since you have made no specific provision, sum is shared by all threads and they will race for access to it. While learning OpenMP it is a very good idea to attach the clause default(none) to your parallel regions and to take responsibility for defining the shared/private status of each variable in each region. Then, once you are fluent in OpenMP, you will know why it makes sense to continue to use the default(none) clause.
Even if you reply Yes, the code does produce the correct result, the data race exists and your program can't be trusted. Data races are funny like that: they don't show up in any of the tests you run, then, once you roll out your code into production, bang! and egg all over your face.
However, you seem to be rolling your own reduction, and OpenMP provides the tools for doing this. Investigate the reduction clause in your OpenMP references. If I read your code correctly, and taking into account the advice above, you could rewrite the loop as
#pragma omp parallel for default(none) shared(arraySize, testArray) private(i) reduction(+:sum)
for (i = 0; i < arraySize; ++i)
{
    sum += testArray[i];
}
(Note that sum must not also be listed in the shared clause: the reduction clause already manages its sharing.)
In a nutshell, using the reduction clause tells OpenMP to sort out the problems of summing a single value from work distributed across threads, avoiding race conditions etc.
Since OpenMP makes loop iteration variables private by default, you could omit the clause private(i) from the directive without too much risk. Even better, though, might be to declare it inside the for statement:
#pragma omp parallel for default(none) shared(arraySize, testArray) reduction(+:sum)
for (int i = 0; i < arraySize; ++i)
Variables declared inside parallel regions are (leaving aside some special cases) always private.

Ordered 'for' loop efficiency in OpenMP

I am trying to parallelise a single MCMC chain, which is sequential in nature, and hence I need to preserve the order of the iterations being executed. For this purpose, I was thinking of using an 'ordered for' loop via OpenMP. I wanted to know how the execution of an ordered for loop in OpenMP really works: does it actually provide any speed-up in terms of parallelisation of the code?
Thanks!
If your loop contains only one block with an ordered construct, then the execution will be serial, and you will not obtain any speedup from parallel execution.
In the example below there is one block that can be executed in parallel and one that will be serialized:
void example(int b, int e, float* data)
{
    #pragma omp for schedule(static) ordered
    for (int i = b; i < e; ++i) {
        // This block can be executed in parallel
        data[i] = SomeThing(data[i]);
        if (data[i] == 0.0f)
        {
            // This block will be serialized
            #pragma omp ordered
            printf("Element %d resulted in zero\n", i);
        }
    }
}
As long as you have just a single Markov chain, the easiest way to parallelize it is to use 'embarrassing' parallelism: run a bunch of independent chains and collect the results when they are all done (or gather the results once in a while).
This way you do not incur any communication overhead whatsoever.
The main caveat here is that you need to make sure different chains get different random number generator seeds.
UPD: practicalities of collecting the results.
In a nutshell, you just mix together the results generated by all the chains. For the sake of simplicity, suppose you have three independent chains:
x1, x2, x3,...
y1, y2, y3,...
z1, z2, z3,...
From these, you make a chain x1,y1,z1,x2,y2,z2,x3,y3,z3,... This is a perfectly valid MC chain and it samples the correct distribution.
Writing out the full chain history is almost always impractical. Typically, each chain saves its binning statistics, which you then mix together and analyze with a separate program. For binning analysis see, e.g., boulder.research.yale.edu/Boulder-2010/ReadingMaterial-2010/Troyer/Article.pdf
The OpenMP ordered construct only makes sense when viewed dynamically, at run time.
The specification requires that the for directive carry the ordered clause; where the ordered block sits inside the loop body is your choice.
My understanding is that even with the ordered clause on the for directive, each thread starts its iterations in parallel. A thread that reaches an ordered block may enter it only after the ordered blocks of all previous iterations have completed. Note the key point: all previous iterations must have completed theirs.
The intuition behind this design is that an 'ordered for' that executed entirely serially would not make any sense at all.
