Must ordered be at the end? - openmp

#pragma omp parallel for ordered
for (int i = 0; i < n; ++i) {
    ... code happens nicely in parallel here ...
    #pragma omp ordered
    {
        ... one at a time in order of i, as expected, good ...
    }
    ... single threaded here but I expected parallel ...
}
I expected the next thread to enter the ordered section as soon as the current thread left it. But the next thread only enters the ordered section when the current thread finishes the whole loop body. So the code after the ordered section effectively runs serially.
The OpenMP 4.0 specification says:
The ordered construct specifies a structured block in a loop region
that will be executed in the order of the loop iterations. This
sequentializes and orders the code within an ordered region while
allowing code outside the region to run in parallel.
I'm reading "outside" to include the code that comes after the ordered section ends.
Is this expected? Must the ordered section in fact be last?
I've searched for an answer and found one other place where someone observed something similar nearly two years ago: https://stackoverflow.com/a/32078625/403310 :
Testing with gfortran 5.2, it appears everything after the ordered
region is executed in order for each loop iteration, so having the
ordered block at the beginning of the loop leads to serial performance
while having the ordered block at the end of the loop does not have
this implication as the code before the block is parallelized. Testing
with ifort 15 is not as dramatic but I would still recommend
structuring your code so your ordered block occurs after any code that
needs parallelization in a loop iteration rather than before.
I'm using gcc 5.4.0 on Ubuntu 16.04.
Many thanks.

There is no need for the ordered region to be last. The behavior you observe is implementation dependent and a known flaw in libgomp (the OpenMP runtime library from gcc). I suppose this behavior is tolerated by the standard, though it is clearly not optimal.
Technically, the compiler produces the following code from the annotations:
#pragma omp parallel for ordered
for (int i = 0; i < n; ++i) {
    ... code happens nicely in parallel here ...
    GOMP_ordered_start();
    {
        ... one at a time in order of i, as expected, good ...
    }
    GOMP_ordered_end();
    ... single threaded here but I expected parallel ...
    GOMP_loop_ordered_static_next();
}
Unfortunately, GOMP_ordered_end is implemented as follows:
/* This function is called by user code when encountering the end of an
   ORDERED block.  With the current ORDERED implementation there's nothing
   for us to do.

   However, the current implementation has a flaw in that it does not allow
   the next thread into the ORDERED section immediately after the current
   thread exits the ORDERED section in its last iteration.  The existance
   of this function allows the implementation to change.  */

void
GOMP_ordered_end (void)
{
}
I speculate that there just never was a significant use case for this, given that ordered is probably most commonly used in the sense of:
#pragma omp parallel for ordered
for (...) {
    result = expensive_computation();
    #pragma omp ordered
    {
        append(results, result);
    }
}
The OpenMP runtime from the Intel compiler does not suffer from this flaw.
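If you are stuck with libgomp, a practical workaround is the same advice as in the answer quoted above: restructure the loop body so that the ordered block comes last. A minimal sketch, assuming the work that used to follow the ordered block does not depend on what the ordered block produces (part_a, part_b and finalize_in_order are hypothetical placeholders):
#pragma omp parallel for ordered
for (int i = 0; i < n; ++i) {
    part_a(i);                 // runs in parallel, as before
    part_b(i);                 // formerly after the ordered block; still runs in parallel
    #pragma omp ordered
    {
        finalize_in_order(i);  // one thread at a time, in order of i
    }
}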

Related

Parallel programming dependency openacc

I am trying to parallelize these loops, but I get an error from the PGI compiler and I don't understand what is wrong:
#pragma acc kernels
{
    #pragma acc loop independent
    for (i = 0; i < k; i++)
    {
        for (; dt*j <= Ms[i+1].t; j++)
        {
            w = (j*dt - Ms[i].t)/(Ms[i+1].t - Ms[i].t);
            X[j] = Ms[i].x*(1-w) + Ms[i+1].x*w;
            Y[j] = Ms[i].y*(1-w) + Ms[i+1].y*w;
        }
    }
}
Error
85, Generating Multicore code
87, #pragma acc loop gang
89, Accelerator restriction: size of the GPU copy of Y,X is unknown
Complex loop carried dependence of Ms->t,Ms->x,X->,Ms->y,Y-> prevents parallelization
Loop carried reuse of Y->,X-> prevents parallelization
So what can I do to solve this dependence problem?
I see a few issues here. Also, given the output, I'm assuming that you're compiling with "-ta=multicore,tesla" (i.e. targeting both a multicore CPU and a GPU).
First, since "j" is not initialized in the "i" loop, the starting value of "j" will depend on the ending value of "j" from the previous iteration of "i". Hence, the loops are not parallelizable. By using "loop independent" you have forced parallelization on the outer loop, but you will get answers that differ from those of the sequential code. You will need to rethink your algorithm.
I would suggest making X and Y two-dimensional, with the first dimension of size "k". The second dimension can be a jagged array (i.e. each row having a different size), with the size corresponding to the "Ms[i+1].t" value.
I wrote an example of using jagged arrays as part of my Chapter (#5) of the Parallel Programming with OpenACC book. See: https://github.com/rmfarber/ParallelProgrammingWithOpenACC/blob/master/Chapter05/jagged_array.c
Alternatively, you might be able to set "j=Ms[i].t" assuming "Ms[0].t" is set.
for(j=Ms[i].t;dt*j <= Ms[i+1].t;j++)
"Accelerator restriction: size of the GPU copy of Y,X is unknown"
This is telling you that the compiler cannot implicitly copy the X and Y arrays to the device. In C/C++, unbounded pointers don't carry size information, so the compiler can't tell how big these arrays are. Often it can derive this information from the loop trip counts, but since the loop trip count is unknown here (see above), it can't in this case. To fix this, you need to add a data clause to the "kernels" directive or add a data region to your code. For example:
#pragma acc kernels copyout(X[0:size], Y[0:size])
or
#pragma acc data copyout(X[0:size], Y[0:size])
{
...
#pragma acc kernels
...
}
Another thing to keep in mind is pointer aliasing. In C/C++, pointers of the same type are allowed to point at the same object. Hence, without additional information such as the "restrict" qualifier, the "independent" clause, or the PGI compiler flag "-Msafeptr", the compiler must assume your pointers do point to the same object, which makes the loop not parallelizable.
This would most likely go away by adding "loop independent" to the inner loop as well, or by using the "collapse" clause to flatten the loop nest and applying "independent" to both loops. It might also go away if all of your arrays are passed in using "restrict", but maybe not.
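A hedged sketch that combines these suggestions: restrict-qualified output pointers, loop independent on both loops, explicit data clauses, and the inner loop restarted from Ms[i].t as suggested above. The struct layout and the array length "size" are assumptions, since the original post does not show them, and the sketch only makes sense if the j ranges of different i iterations do not overlap:
typedef struct { float t, x, y; } M;   /* assumed layout of the Ms elements */

void interpolate(int k, int size, float dt,
                 const M *restrict Ms, float *restrict X, float *restrict Y)
{
    #pragma acc kernels copyin(Ms[0:k+1]) copyout(X[0:size], Y[0:size])
    {
        #pragma acc loop independent
        for (int i = 0; i < k; i++)
        {
            #pragma acc loop independent
            for (int j = (int)Ms[i].t; dt*j <= Ms[i+1].t; j++)
            {
                float w = (j*dt - Ms[i].t) / (Ms[i+1].t - Ms[i].t);
                X[j] = Ms[i].x*(1-w) + Ms[i+1].x*w;
                Y[j] = Ms[i].y*(1-w) + Ms[i+1].y*w;
            }
        }
    }
}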

Does an OpenMP ordered for always assign parts of the loop to threads in order, too?

Background
I am relying on OpenMP parallelization and pseudo-random number generation in my program, but at the same time I would like the results to be perfectly replicable if desired (given the same number of threads).
I'm seeding a thread_local PRNG for each thread separately like this:
{
    std::minstd_rand master{};
    #pragma omp parallel for ordered
    for(int j = 0; j < omp_get_num_threads(); j++)
        #pragma omp ordered
        global::tl_rng.seed(master());
}
and I've come up with the following way of producing count elements and putting them all in an array at the end in a deterministic order (results of thread 0 first, of thread 1 next, etc.):
std::vector<Element> all{};
...
#pragma omp parallel if(parallel)
{
    std::vector<Element> tmp{};
    tmp.reserve(count/omp_get_num_threads() + 1);
    // generation loop
    #pragma omp for
    for(size_t j = 0; j < count; j++)
        tmp.push_back(generateElement(global::tl_rng));
    // collection loop
    #pragma omp for ordered
    for(int j = 0; j < omp_get_num_threads(); j++)
        #pragma omp ordered
        all.insert(all.end(),
                   std::make_move_iterator(tmp.begin()),
                   std::make_move_iterator(tmp.end()));
}
The question
This seems to work, but I'm not sure if it's reliable (read: portable). Specifically, if, for example, the second thread is done with its share of the main loop early because its generateElement() calls happened to return quickly, won't it technically be allowed to pick the first iteration of the collecting loop? With my compiler that does not happen, and it's always thread 0 doing j = 0, thread 1 doing j = 1, etc., as intended. Does that follow from the standard, or is it allowed to be compiler-specific behaviour?
I could not find much about the ordered clause on the for directive except that it is required if the loop contains an ordered directive inside. Does it always guarantee that the threads will split the loop from the start in increasing thread_num? Where does it say so in citable sources? Or do I have to make my "generation" loop ordered as well, and does that actually make a difference (performance- or logic-wise) when there is no ordered directive in it?
Please don't answer by experience, or by how OpenMP would logically be implemented. I'd like to be backed by the standard.
No, the code in its current state is not portable. It will work only if the default loop schedule is static, that is, the iteration space is divided into count / #threads contiguous chunks and then assigned to the threads in the order of their thread ID with a guaranteed mapping between chunk and thread ID. But the OpenMP specification does not prescribe any default schedule and leaves it to the implementation to pick one. Many implementations use static, but that is not guaranteed to always be the case.
If you add schedule(static) to all loop constructs, then the combination of the ordered clause and the ordered construct within each loop body will ensure that thread 0 receives the first chunk of iterations and is also the first one to execute the ordered construct. For the loops that run over the number of threads, the chunk size will be one, i.e. each thread will execute exactly one iteration and the order of the iterations of the parallel loop will match that of a sequential loop. The 1:1 mapping of iteration number to thread ID done by the static schedule will then ensure the behaviour you are aiming for.
Note that if the first loop, where you initialise the thread-local PRNGs, is in a different parallel region, you must ensure that both parallel regions execute with the same number of threads, e.g., by disabling dynamic team sizing (omp_set_dynamic(0);) or by explicit application of the num_threads clause.
As to the significance of the ordered clause + construct: it does not influence the assignment of iterations to threads, but it synchronises the threads and makes sure that the physical execution order will match the logical one. A statically scheduled loop without an ordered clause will still assign iteration 0 to thread 0, but there will be no guarantee that some other thread won't execute its loop body ahead of thread 0. Also, any code in the loop body outside of the ordered construct is still allowed to execute concurrently and out of order (see the first question above for a more detailed discussion).
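A minimal sketch of this advice applied to the collection code from the question, with schedule(static) spelled out on both worksharing loops and dynamic team sizing disabled so the seeding region uses the same number of threads:
omp_set_dynamic(0);   // same team size in every parallel region
#pragma omp parallel if(parallel)
{
    std::vector<Element> tmp{};
    tmp.reserve(count/omp_get_num_threads() + 1);
    // generation loop: static schedule, so each thread always gets the same iterations
    #pragma omp for schedule(static)
    for(size_t j = 0; j < count; j++)
        tmp.push_back(generateElement(global::tl_rng));
    // collection loop: one iteration per thread, entered in increasing thread order
    #pragma omp for schedule(static) ordered
    for(int j = 0; j < omp_get_num_threads(); j++)
        #pragma omp ordered
        all.insert(all.end(),
                   std::make_move_iterator(tmp.begin()),
                   std::make_move_iterator(tmp.end()));
}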

Reduction in Openmp returns different results with the same number of threads in my code

My code with openmp using "reduction" doesn't return the same results from run to run.
Case 1: using "reduction"
sum = 0;
omp_set_num_threads(4);
#pragma omp parallel for reduction(+:sum)
for(ii = 0; ii < 100; ii++)
    sum = sum + func(ii);
where func(ii) has side effects. In fact, func(ii) uses another function, calcul(), which can lead to a race condition in parallel execution. I think the calcul() function may be the reason for this problem. When I use "critical", the results are always the same, but this solution is not good for performance.
Case 2: using "critical"
sum = 0;
#pragma omp parallel for
for(ii = 0; ii < 100; ii++)
{
    #pragma omp critical
    sum = sum + func(ii);
}
where the func(ii) function is:
int func(int val)
{
    read_file(val);
    calcul(); /* calculate something from read_file(val) */
    return val_fin;
}
Please help me resolve this.
Thanks a lot!
The reason you're getting poor performance in the second case is that the entire loop body is inside a critical section, so nothing actually executes in parallel.
Since you say there are some race conditions in the calcul function, consider putting a critical section just on that line inside func. That way, the files can be read in parallel (which may be the I/O that is slowing down your execution anyway).
If the performance is still poor, you will need to look into the nested calcul function and try to identify the race conditions.
Basically, you want to push any critical sections down as far as possible or eliminate them entirely. If it comes down to very simple updates to shared variables, in some cases you can use the OpenMP atomic pragma instead, which has better performance but is much less flexible.
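A hedged sketch of that restructuring, using the question's names and assuming calcul() is the only racy part; the sum itself is handled with the reduction clause, so no critical section is needed around it:
int func(int val)
{
    read_file(val);      // file reads can proceed in parallel
    #pragma omp critical
    calcul();            // only the racy part is serialized
    return val_fin;      // val_fin as in the question
}

sum = 0;
#pragma omp parallel for reduction(+:sum)
for (ii = 0; ii < 100; ii++)
    sum = sum + func(ii);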
Even if everything in the code is correct, you still might get different results from the OpenMP reduction, because the order in which the partial sums are combined can change from run to run and floating-point addition is not associative.
To be able to reproduce the same result for a given number of threads, you need to implement the reduction yourself by storing the partial sum of each thread in a shared array. After the parallel region, the master thread can add these results. This approach implies that the threads always execute the same iterations, i.e. a static scheduling policy.
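A minimal sketch of such a manual reduction, assuming omp.h is included and the partial results fit in a simple array indexed by thread ID (everything except sum and func is made up here):
double partial[64] = {0};   // assumed upper bound on the number of threads
double sum = 0.0;
int nthreads = 1;
#pragma omp parallel
{
    #pragma omp single
    nthreads = omp_get_num_threads();
    int tid = omp_get_thread_num();
    #pragma omp for schedule(static)   // same iterations per thread on every run
    for (int ii = 0; ii < 100; ii++)
        partial[tid] += func(ii);
}
for (int t = 0; t < nthreads; t++)     // fixed summation order
    sum += partial[t];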
Related question:
Order of execution in Reduction Operation in OpenMP

Open mp parallel for does not work

I'm studying OpenMP now, and I have a question. The running time of the following code and of the same code without the parallel section is statistically equal, even though all threads are executing the loop. I tried to look at some guides on the internet, but that did not help. So the question is: what is wrong with this parallel section?
int sumArrayParallel( )
{
    int i = 0;
    int sum = 0;
    #pragma omp parallel for
    for (i = 0; i < arraySize; ++i)
    {
        cout << omp_get_thread_num() << " ";
        sum += testArray[i];
    }
    return sum;
}
There are two very common causes of OpenMP codes failing to exhibit improved performance over their serial counterparts:
The work being done is not sufficient to outweigh the overhead of parallel computation. Think of there being a cost, in time, for setting up a team of threads, for distributing work to them, and for gathering results from them. Unless this cost is less than the time saved by parallelising the computation, an OpenMP code, even if correct, will not show any speed-up and may show the opposite. You haven't shown us the numbers, so do the calculations on this yourself.
The programmer imposes serial operation on the parallel program, perhaps by wrapping data access inside memory fences, perhaps by accessing platform resources which are inherently serial. I suspect (but my knowledge of C is lousy) that your writing to cout may inadvertently serialise that part of your computation.
Of course, you can have a mixture of these two problems, too much serialisation and not enough work, resulting in disappointing performance.
For further reading this page on Intel's website is useful, and not just for beginners.
I think, though, that you have a more serious problem with your code than its poor parallel performance. Does the OpenMP version produce the correct sum? Since you have made no specific provision, sum is shared by all threads and they will race for access to it. While learning OpenMP it is a very good idea to attach the clause default(none) to your parallel regions and to take responsibility for defining the shared/private status of each variable in each region. Then, once you are fluent in OpenMP, you will know why it makes sense to continue to use the default(none) clause.
Even if you reply "Yes, the code does produce the correct result", the data race exists and your program can't be trusted. Data races are funny like that: they don't show up in any of the tests you run, then, once you roll out your code into production, bang! And there's egg all over your face.
However, you seem to be rolling your own reduction and OpenMP provides the tools for doing this. Investigate the reduction clause in your OpenMP references. If I read your code correctly, and taking into account the advice above, you could rewrite the loop to
#pragma omp parallel for default(none) shared(arraySize, testArray) private(i) reduction(+:sum)
for (i = 0; i < arraySize; ++i)
{
sum += testArray[i];
}
In a nutshell, using the reduction clause tells OpenMP to sort out the problems of summing a single value from work distributed across threads, avoiding race conditions etc. (Note that sum is listed only in the reduction clause: a variable must not appear in both a shared clause and a reduction clause on the same construct.)
Since OpenMP makes loop iteration variables private by default you could omit the clause private(i) from the directive without too much risk. Even better though might be to declare it inside the for statement:
#pragma omp parallel for default(none) shared(arraySize, testArray) reduction(+:sum)
for (int i = 0; i < arraySize; ++i)
since variables declared inside parallel regions are (leaving aside some special cases) always private.
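Putting the pieces together, a sketch of the corrected function (the cout line is dropped because, as noted above, it serialises the loop; arraySize and testArray are assumed to be the globals from the question):
int sumArrayParallel()
{
    int sum = 0;
    #pragma omp parallel for default(none) shared(arraySize, testArray) reduction(+:sum)
    for (int i = 0; i < arraySize; ++i)
    {
        sum += testArray[i];
    }
    return sum;
}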

Ordered 'for' loop efficiency in OpenMP

I am trying to parallelise a single MCMC chain, which is sequential in nature, and hence I need to preserve the order of the iterations being executed. For this purpose, I was thinking of using an 'ordered for' loop via OpenMP. I wanted to know how the execution of an ordered for loop in OpenMP really works, and whether it really provides any speed-up in terms of parallelisation of the code.
Thanks!
If your loop body consists of nothing but a block with an ordered construct, then execution will be effectively serial and you will not obtain any speedup from parallel execution.
In the example below there is one block that can be executed in parallel and one that will be serialized:
void example(int b, int e, float* data)
{
    #pragma omp for schedule(static) ordered
    for (int i = b; i < e; ++i) {
        // This block can be executed in parallel
        data[i] = SomeThing(data[i]);
        if (data[i] == 0.0f)
        {
            // This block will be serialized
            #pragma omp ordered
            printf("Element %d resulted in zero\n", i);
        }
    }
}
As long as you have just a single Markov chain, the easiest way to parallelize it is to use the 'embarrassing' parallelism: run a bunch of independent chains and collect the results when they are all done [or gather the results once in a while].
This way you do not incur any communication overhead whatsoever.
The main caveat here is that you need to make sure different chains get different random number generator seeds.
UPD: practicalities of collecting the results.
In a nutshell, you just mix together the results generated by all the chains. For the sake of simplicity, suppose you have three independent chains:
x1, x2, x3,...
y1, y2, y3,...
z1, z2, z3,...
From these, you make a chain x1,y1,z1,x2,y2,z2,x3,y3,z3,... This is a perfectly valid MC chain and it samples the correct distribution.
Writing out the whole chain history is almost always impractical. Typically, each chain saves its binning statistics, which you then mix together and analyse with a separate program. For binning analysis see, e.g., boulder.research.yale.edu/Boulder-2010/ReadingMaterial-2010/Troyer/Article.pdf
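A minimal sketch of that scheme in OpenMP; run_chain(), Stats, base_seed and n_steps are hypothetical placeholders for your own sampler, its accumulated binning statistics, your base seed and the chain length:
#include <omp.h>
#include <vector>

std::vector<Stats> results(omp_get_max_threads());
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    unsigned seed = base_seed + tid;          // distinct seed per chain
    results[tid] = run_chain(n_steps, seed);  // each chain runs fully independently
}
// afterwards: mix the per-chain statistics together in any fixed order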
The effect of the OpenMP ordered directive can only be understood in terms of the dynamic (runtime) behaviour of the loop.
The specification requires that, if the loop body contains an ordered construct, the for directive itself must carry the ordered clause. However, where in the loop body the ordered block goes is your choice.
My understanding is that, even with the ordered clause on the for directive, each thread starts its iterations in parallel. A thread that reaches an ordered block may enter it only once the ordered blocks of all previous iterations have completed. Please focus on that condition: the ordered blocks of all previous iterations must have completed.
The intuition behind this is that an "ordered for" that executed entirely serially would not make any sense at all.
