OpenMP Target Task reduction

I'm using OpenMP target offloading to offload some nested loops to the GPU. I use the nowait clause to run it asynchronously, which turns it into a task. With the same input values the result differs from the one computed without offloading (e.g. CPU: sum=0.99, offloading: sum=0.5).
When I remove the nowait clause it works just fine, so I think the issue is that the construct becomes an OpenMP task and I'm struggling to get that right.
#pragma omp target teams distribute parallel for reduction(+: sum) collapse(2) nowait depend(in: a, b) depend(out: sum)
for (int i = 1; i <= n; i++)
{
    for (int j = 1; j <= n; j++)
    {
        double c = 0;
        for (int k = 0; k < n; k++)
        {
            c += /* some computation */
        }
        sum += fabs(c);
    }
}

The OpenMP 5.2 specification states:
The target construct generates a target task. The generated task region encloses the target region. If a depend clause is present, it is associated with the target task. [...]. If the nowait clause is present, execution of the target task may be deferred. If the nowait clause is not present, the target task is an included task.
This means that your code is executed in a task whose execution may be deferred (because of nowait). In the worst case it can run as late as the end of the enclosing parallel region, but it always completes before any task that depends on it and before any taskwait (or construct with similar behaviour, such as taskgroup) that waits for it. Consequently, you must not modify the working arrays (nor release them), and you must not read sum, during this time span; if you do, the behaviour is undefined.
You should especially pay attention to the correctness of the synchronization points and task dependencies in your code (it is impossible for us to check that with the incomplete code provided).
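As a minimal sketch (the inner computation is just a stand-in, and map(tofrom: sum) is spelled out only for clarity), the essential point is that nothing reads sum or modifies a and b between the nowait construct and an explicit synchronization such as taskwait:

double sum = 0.0;
#pragma omp target teams distribute parallel for reduction(+: sum) collapse(2) nowait depend(in: a, b) depend(out: sum) map(tofrom: sum)
for (int i = 1; i <= n; i++)
{
    for (int j = 1; j <= n; j++)
    {
        double c = 0;
        for (int k = 0; k < n; k++)
        {
            c += 1.0; /* stand-in for the real computation on a and b */
        }
        sum += fabs(c);
    }
}

/* other, independent work may run here, but it must not touch a, b or sum */

#pragma omp taskwait   /* the deferred target task is guaranteed to have finished here */
/* only now is it safe to read sum */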

Related

How to balance the thread number in nested case when using OpenMP?

This fabulous post teaches me a lot, but I still have a question. For the following code:
double multiply(std::vector<double> const& a, std::vector<double> const& b){
    double tmp(0);
    int active_levels = omp_get_active_level();
    #pragma omp parallel for reduction(+:tmp) if(active_levels < 1)
    for(unsigned int i = 0; i < a.size(); i++){
        tmp += a[i] + b[i];
    }
    return tmp;
}
If multiply() is called from another parallel part:
#pragma omp parallel for
for (int i = 0; i < count; i++) {
    multiply(a[i], b[i]);
}
Because the number of outer loop iterations depends on the count variable, this is reasonable when count is large. But if count is only 1 and our server is a multi-core machine (e.g. with 512 cores), then the multiply() function only generates 1 thread, so the server is under-utilized. BTW, the answer also mentioned:
In any case, writing such code is a bad practice. You should simply leave the parallel regions as they are and allow the end user to choose whether nested parallelism should be enabled or not.
So how to balance the thread number in nested case when using OpenMP?
Consider using OpenMP tasks (omp taskloop within one parallel section and an intermediate omp single). This allows you to flexibly use the threads in OpenMP on different nesting levels instead of manually defining numbers of threads for each level or oversubscribing OS threads.
However, this comes at the cost of increased scheduling overhead. At the end of the day, there is no perfect solution that will always do best. Instead you will have to keep measuring and analyzing your performance on practical inputs.
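A minimal sketch of that approach, reusing the multiply() example from the question (taskloop reduction requires OpenMP 5.0; results is a hypothetical output array):

double multiply(std::vector<double> const& a, std::vector<double> const& b){
    double tmp(0);
    // Inner level: generate tasks instead of a nested thread team; idle threads
    // from the outer level can pick these tasks up.
    #pragma omp taskloop reduction(+:tmp) grainsize(4096)
    for(unsigned int i = 0; i < a.size(); i++){
        tmp += a[i] + b[i];
    }
    return tmp;
}

// Outer level: one thread creates the tasks, the whole team executes them.
#pragma omp parallel
#pragma omp single
{
    #pragma omp taskloop
    for (int i = 0; i < count; i++) {
        results[i] = multiply(a[i], b[i]);
    }
}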

parallel 'task's inside an already parallelized 'for' loop in OpenMP

[Background: OpenMP v4+ on Intel's icc compiler]
I want to parallelize tasks inside a loop that is already parallelized. I saw quite a few queries on subjects close to this one, e.g.:
Parallel sections in OpenMP using a loop
Doing a section with one thread and a for-loop with multiple threads
and others with more concentrated wisdom still.
but I could not get a definite answer other than a compile time error message when trying it.
Code:
#pragma omp parallel for private(a,bd) reduction(+:sum)
for (int i=0; i<128; i++) {
    a = i%2;
    for (int j=a; j<128; j=j+2) {
        u_n = 0.25 * ( u[ i*128 + (j-3) ]+
                       u[ i*128 + (j+3) ]+
                       u[ (i-1)*128 + j ]+
                       u[ (i+1)*128 + j ]);
        // #pragma omp single nowait
        // {
        // #pragma omp task shared(sum1) firstprivate(i,j)
        // sum1 = (u[i*128+(j-3)]+u[i*128+(j-2)] + u[i*128+(j-1)])/3;
        // #pragma omp task shared(sum2) firstprivate(i,j)
        // sum2 = (u[i*128+(j+3)]+u[i*128+(j+2)]+u[i*128+(j+1)])/3;
        // #pragma omp task shared(sum3) firstprivate(i,j)
        // sum3 = (u[(i-1)*128+j]+u[(i-2)*128+j]+u[(i-3)*128+j])/3;
        // #pragma omp task shared(sum4) firstprivate(i,j)
        // sum4 = (u[(i+1)*128+j]+u[(i+2)*128+j]+u[(i+3)*128+j])/3;
        // }
        // #pragma omp taskwait
        // {
        // u_n = 0.25*(sum1+sum2+sum3+sum4);
        // }
        bd = u_n - u[i*128 + j];
        sum += bd * bd;
        u[i*128 + j] = u_n;
    }
}
In the above code, I tried replacing the u_n = 0.25 * (...); line with the 15 commented lines, to try not only to parallelize the iterations over the two for loops, but also to achieve a degree of parallelism on each of the 4 calculations (sum1 to sum4) involving the array u[].
The compile error is fairly explicit:
error: the OpenMP "single" pragma must not be enclosed by the
"parallel for" pragma
Is there a way around this so I can optimize that calculation further with OpenMP?
The single worksharing construct within a loop worksharing construct is prohibited by the standard, but you don't need it there.
The usual parallel -> single -> task setup for tasking is there to ensure that you have a thread team set up for your tasks (parallel), but that each task is spawned only once (single). You don't need the latter in a parallel for context because each iteration is already executed only once, so you could spawn tasks directly within the loop. This seems to have the expected behavior on both the GNU and Intel compilers, i.e. threads that have completed their own loop iterations do help other threads to execute their tasks.
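For illustration only (see the next paragraph for why this is not worthwhile here), the body of the inner loop could spawn the tasks directly; sum1..sum4 are made local to the iteration so each thread works on its own copies:

for (int j = a; j < 128; j = j + 2) {
    double sum1, sum2, sum3, sum4;
    #pragma omp task shared(sum1) firstprivate(i, j)
    sum1 = (u[i*128+(j-3)] + u[i*128+(j-2)] + u[i*128+(j-1)]) / 3;
    #pragma omp task shared(sum2) firstprivate(i, j)
    sum2 = (u[i*128+(j+3)] + u[i*128+(j+2)] + u[i*128+(j+1)]) / 3;
    #pragma omp task shared(sum3) firstprivate(i, j)
    sum3 = (u[(i-1)*128+j] + u[(i-2)*128+j] + u[(i-3)*128+j]) / 3;
    #pragma omp task shared(sum4) firstprivate(i, j)
    sum4 = (u[(i+1)*128+j] + u[(i+2)*128+j] + u[(i+3)*128+j]) / 3;
    #pragma omp taskwait   /* all four child tasks must finish before u_n is computed */
    u_n = 0.25 * (sum1 + sum2 + sum3 + sum4);
    bd = u_n - u[i*128 + j];
    sum += bd * bd;
    u[i*128 + j] = u_n;
}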
However, that is a bad idea in your case: a tiny computation such as the one for sum1 is much faster on its own than the overhead of spawning a task.
Removing all pragmas except for the parallel for, this is a very reasonable parallelization. Before optimizing the calculation further, you should measure! In particular, you are interested in whether all your available threads are always computing something, or whether some threads finish early and wait for others (load imbalance). To measure, look for a parallel performance analysis tool for your platform. If there is an imbalance, you can address it with scheduling policies, or possibly with nested parallelism in the inner loop.
A full discussion of the performance of your code is more complex, and requires a minimal, complete and verifiable example, a detailed system description, and actual measured performance numbers.

Unexpected slowdown using omp

I'm using OMP to try to get some speedup in a small kernel. It's basically just querying a vector of unordered_sets for membership. I tried to make an optimization, but surprisingly I got a slowdown, and am really curious why.
My first pass was:
vector<unordered_set<uint16_t> > setList = getData();
#pragma omp parallel for default(shared) private(i, j) schedule(dynamic, 50)
for(i = 0; i < size; i++){
    for(j = 0; j < 500; j++){
        count = count + setList[i].count(val[j]);
    }
}
Then I thought I could maybe get a speedup by moving the setList[i] subexpression up one level of nesting and saving it in a temporary variable, like this:
#pragma omp parallel for default(shared) private(i, j, currSet) schedule(dynamic, 50)
for(i = 0; i < size; i++){
    currSet = setList[i];
    for(j = 0; j < 500; j++){
        count = count + currSet.count(val[j]);
    }
}
I had thought this would save a load on each iteration of the "j" loop and give a speedup, but it actually SLOWED DOWN by about 3x; the entire kernel took about three times as long to run. Any thoughts on why this would occur?
Thanks!
Adding up a few integers is really not enough work to warrant starting threads for.
If you forget to add the reduction clause, you'll suffer from true sharing: all threads want to update that count variable at the same time. This makes all cores fight for the cache line containing that variable, which will considerably impact your performance.
I just noticed that you set the schedule to dynamic. You shouldn't: this workload can already be divided evenly at compile time, so don't specify a schedule at all.
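Putting those two points together, a sketch of your first loop with a reduction and the default schedule (the type of count is not shown in your snippet, so a 64-bit integer is assumed here):

long long count = 0;                 /* assumed type; use whatever your count actually is */
#pragma omp parallel for reduction(+: count)
for (int i = 0; i < size; i++) {     /* i and j declared in the loops are automatically private */
    for (int j = 0; j < 500; j++) {
        count += setList[i].count(val[j]);
    }
}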
As has already been stated, inter-loop dependencies, i.e. threads waiting for data from other threads, or data being accessed by multiple threads in succession, can cause a parallelized program to slow down and should be avoided as a rule of thumb. Built-in mechanisms like reductions collect the individual results and combine them in an optimised fashion.
Here is a good example of reduction being used in a case similar to yours, from the University of Utah:
int array[8] = {1, 1, 1, 1, 1, 1, 1, 1};
int sum = 0, i;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < 8; i++) {
    sum += array[i];
}
printf("total %d\n", sum);
source: http://www.eng.utah.edu/~cs4960-01/lecture9.pdf
As an aside: the private clause is only needed for variables that are declared outside the parallel region but used as thread-local temporaries inside it, and loop iteration counters of OpenMP loop constructs are private by default; so in both cases it is not necessary for i to be declared private.
see wikipedia: https://en.wikipedia.org/wiki/OpenMP#Data_sharing_attribute_clauses
Data sharing attribute clauses
shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.
private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.
see stack exchange answer here: OpenMP: are local variables automatically private?

Simple openmp call for loop not working

I am writing some software that would definitely benefit from integrating OpenMP. I am new to OpenMP, and while testing some very basic test code (see below) I noticed that the execution times are much longer with OpenMP activated (the #pragma line). Any insight is much appreciated.
int main()
{
    int number = 200;
    int max = 2000000;
    for(int t = 1; t < max; t++)
    {
        double fac = 0.0;
        #pragma omp parallel for reduction(+:fac)
        for(int n = 2; n <= number; n++)
            fac += 1;
    }
    return 0;
}
As currently written the code enters the parallel region max times. The overhead of entering a parallel region in an OpenMP program is small, but you incur it 2000000 times. You don't actually tell us what the run times are, but I can readily believe that this makes them much longer than the serial version. I suggest you wrap the outer loop in a parallel region, not the inner loop.
Take care when you rewrite your code to ensure that the payload inside the parallel region is significant, and returns some value(s) to the program outside the parallel region. Absent these steps a crafty optimising compiler can determine that a loop returns nothing to the rest of the program and simply optimise it away.
Also insert some timing instructions (use omp_get_wtime), rerun your code and, if matters are still not satisfactory, update your question with the new information you gather.
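A minimal sketch of such timing with omp_get_wtime:

#include <omp.h>
#include <stdio.h>

double t0 = omp_get_wtime();
/* ... the code you want to measure ... */
double t1 = omp_get_wtime();
printf("elapsed: %f seconds\n", t1 - t0);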
This is an improved version that works as intended: it parallelizes the outer loop rather than the inner one. When compiled without OpenMP support it takes 1.49 s, with OpenMP 0.48 s.
int main()
{
    int number = 200;
    int max = 2000000;
    #pragma omp parallel for
    for(int t = 1; t < max; t++)
    {
        double fac = 0.0;
        for(int n = 2; n <= number; n++)
            fac += 1;
    }
    return 0;
}
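Following the earlier advice about giving the payload an observable result, here is a variant (a sketch, not from the original answer) that also consumes fac so a clever compiler cannot discard the work:

#include <stdio.h>

int main()
{
    int number = 200;
    int max = 2000000;
    double total = 0.0;                      /* observable result */
    #pragma omp parallel for reduction(+:total)
    for(int t = 1; t < max; t++)
    {
        double fac = 0.0;
        for(int n = 2; n <= number; n++)
            fac += 1;
        total += fac;                        /* keeps the inner loop alive */
    }
    printf("total = %f\n", total);
    return 0;
}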

Xeon-Phi asynchronous offload from host openMP parallel region

I am using Intel's offload pragmas in host OpenMP code. The code looks as follows:
int s1 = f(a,b,c);
#pragma offload signal(s1) in (...) out(x:len)
{
    for (int i = 0; i < len; ++i)
    {
        x[i] = ...
    }
}
#pragma omp parallel default(shared)
{
    #pragma omp for schedule(dynamic) nowait
    for (int i = 0; i < count; ++i)
    {
        /* code */
    }
    #pragma omp for schedule(dynamic)
    for (int j = 0; j < count2; ++j)
    {
        /* code */
    }
}
#pragma offload wait(s1)
{
    /* code */
}
The code offloads the calculation of x to the MIC and keeps the host busy by assigning some OpenMP work-sharing loops to the CPU cores. The above code works as expected. However, the first offload pragma takes a lot of time and has become the bottleneck. Nevertheless, overall it pays off to offload the computation of x to the MIC. One way I'm trying to potentially overcome this latency issue is as follows:
int s1 = f(a,b,c);
#pragma omp parallel default(shared)
{
    #pragma omp single nowait
    {
        #pragma offload signal(s1) in (...) out(x:len)
        {
            for (int i = 0; i < len; ++i)
            {
                x[i] = ...
            }
        }
    }
    #pragma omp for schedule(dynamic) nowait
    for (int i = 0; i < count; ++i)
    {
        /* code */
    }
    #pragma omp for schedule(dynamic)
    for (int j = 0; j < count2; ++j)
    {
        /* code */
    }
}
#pragma offload wait(s1)
{
    /* code */
}
So this new code assigns one thread to do the offload while the other OpenMP threads can be used for the remaining worksharing constructs. However, this code doesn't work. I get the following error message:
device 1 does not have a pending signal for wait(0x1)
The offload report points to the above piece of code as the main culprit. One temporary workaround is using a constant as the signal, i.e. signal(0), which works. However, I need a more permanent solution. Can anyone shed light on what is going wrong in my code?
Thanks
Let me complement Taylor's reply a bit.
The first offload indeed takes more time than subsequent offloads, because of the initialization stuff going on. Taylor sketched some of the things going on there. You can avoid the dummy offload by using the environment variable OFFLOAD_INIT=on_start. That should let the runtime system do all the initialization ahead of time. The overhead of this does not go away, but it moves from your first offload to the application initialization.
The problem with your second code snippet seems to be that your offloads target different devices. Signalling and waiting only works if the signal and wait happen for the same target device. Since you do not explicitly use the target(mic:0) clause with your offloads, chances are high that the runtime system selects different target devices.
One recommendation I would like to make is to not use plain integers for the signalling. Usually, the signal indicates that a certain buffer is ready. In these cases, it is good practice to use the buffer pointer as the signal handle, since it will be unique for concurrent offloads working with different buffers.
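A sketch of how the two suggestions combine, mirroring the structure of your snippet (the in/out specifiers and the loop body are elided exactly as in the question; x and len are your variables):

#pragma offload target(mic:0) signal(x) out(x:len)
{
    for (int i = 0; i < len; ++i)
    {
        x[i] = ...
    }
}

/* host-side OpenMP work runs here, as before */

#pragma offload target(mic:0) wait(x)
{
    /* code that consumes x */
}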
Cheers,
-michael
I can't comment on the 2nd code block. I have some observations about the first.
The first offload always takes longer since it also sets up the offload infrastructure. This setup includes things such as passing environment variables, copying over the MIC implementation of the OpenMP runtime (libiomp5), setting up the thread pool, etc.
The way to avoid this is to do a dummy offload first, i.e. one that doesn't really do anything and is not part of your computation block.
An excellent set of references on optimizing for the xeon phi coprocessor is under the training tab at software.intel.com/mic-developer.
Also take a look at software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture and software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization.
Sorry about the long URLs but stackoverflow doesn't allow me to include more than two links as I'm new.
