Simple openmp call for loop not working - openmp

I am writing some software that would definitely benefit from OpenMP. I am new to OpenMP, and while testing some very basic code (see below) I noticed that execution times are much longer with OpenMP activated (the #pragma line). Any insight is much appreciated.
int main()
{
    int number = 200;
    int max = 2000000;
    for (int t = 1; t < max; t++)
    {
        double fac = 0.0;
        #pragma omp parallel for reduction(+:fac)
        for (int n = 2; n <= number; n++)
            fac += 1;
    }
    return 0;
}

As currently written, the code encounters the parallel region max times. The overhead of entering a parallel region in an OpenMP program is small, but you incur it about 2000000 times, while each region performs only 199 additions. You don't actually tell us what the run times are, but I can readily believe that this makes them much longer than for the serial version. I suggest you wrap the outer loop in a parallel region, not the inner loop.
Take care when you rewrite your code to ensure that the payload inside the parallel region is significant, and returns some value(s) to the program outside the parallel region. Absent these steps a crafty optimising compiler can determine that a loop returns nothing to the rest of the program and simply optimise it away.
Also insert some timing instructions (use omp_get_wtime), rerun your code and, if matters are still not satisfactory, update your question with the new information you gather.

This is an improved version that works as intended. It parallelizes the outer loop rather than the inner one. Compiled without OpenMP support it takes 1.49s, with OpenMP 0.48s.
int main()
{
    int number = 200;
    int max = 2000000;
    #pragma omp parallel for
    for (int t = 1; t < max; t++)
    {
        double fac = 0.0;
        for (int n = 2; n <= number; n++)
            fac += 1;
    }
    return 0;
}
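A variant of the loop nest above that also follows the earlier advice (return a value from the parallel region, and time it with omp_get_wtime) could look roughly like this. It is only a sketch: the function names run_payload, timed_run and now_seconds are made up for illustration, and the clock() fallback is an assumption added so the code still builds without OpenMP.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>   /* omp_get_wtime */
#else
#include <time.h>  /* fallback timer when built without OpenMP */
#endif

/* Wall-clock seconds with OpenMP; CPU time as a fallback without it. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Same loop nest as above, but the inner sums are accumulated into
   `total` and returned, so the result is visible outside the loop. */
double run_payload(int max, int number)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int t = 1; t < max; t++) {
        double fac = 0.0;
        for (int n = 2; n <= number; n++)
            fac += 1;
        total += fac;
    }
    return total;
}

/* Hypothetical helper: time one call and print the result. */
double timed_run(int max, int number)
{
    double start = now_seconds();
    double total = run_payload(max, number);
    double elapsed = now_seconds() - start;
    printf("total = %.0f, elapsed = %.3f s\n", total, elapsed);
    return total;
}
```

Calling timed_run(2000000, 200) from main, once built with -fopenmp and once without, gives the two timings to compare.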

Related

while loop getting stuck - Openmp

I was trying to implement some piece of parallel code and tried to synchronize the threads using an array of flags as shown below
// flags array set to zero initially
#pragma omp parallel for num_threads (n_threads) schedule(static, 1)
for(int i = 0; i < n; i ++){
    for(int j = 0; j < i; j++) {
        while(!flag[j]);
        y[i] -= L[i][j]*y[j];
    }
    y[i] /= L[i][i];
    flag[i] = 1;
}
However, the code always gets stuck after a few iterations when I compile it using gcc -O3 -fopenmp <file_name>. I have tried different numbers of threads (2, 4, 8), and all of them lead to the loop getting stuck. By putting print statements inside critical sections, I figured out that even though the value of flag[i] gets updated to 1, the while loop is still stuck, or maybe there is some other problem with the code that I am not aware of.
I also figured out that if I try to do something inside the while block like printf("Hello\n") the problem goes away. I think there is some problem with the memory consistency across threads but I do not know how to resolve this. Any help would be appreciated.
Edit: The single threaded code I am trying to parallelise is
for(int i=0; i<n; i++){
    for(int j=0; j < i; j++){
        y[i]-=L[i][j]*y[j];
    }
    y[i]/=L[i][i];
}
You have a data race in your code, which is easy to fix, but the bigger problem is that you also have a loop-carried dependency: computing y[i] requires every y[j] with j < i, so the result depends on the order of execution. Try reversing the i loop without OpenMP and you will get a different result, so your code cannot be parallelized efficiently.
One possibility is to parallelize the j loop, but the workload is very small inside this loop, so the OpenMP overheads will be significantly bigger than the speed gain by parallelization.
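For reference, the data race comes from the unsynchronized spin on flag[j]: at -O3 the compiler may load flag[j] once and never reload it, which matches the hang described in the question. Below is a minimal sketch of one way to remove the race. It assumes OpenMP 4.0 or later for the seq_cst clause and a flat row-major array in place of L[i][j]; the function name and signature are made up for illustration. It only makes the loop correct, not fast.

```c
/* Forward substitution with per-row "ready" flags.
   L is n x n, row-major; flag must be zero-initialised by the caller. */
void forward_subst_flags(int n, const double *L, double *y, int *flag)
{
    /* schedule(static, 1) hands rows out round-robin. Common
       implementations run each thread's rows in increasing order, so
       the lowest unfinished row can always make progress. */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < i; j++) {
            int ready = 0;
            while (!ready) {
                /* The atomic read forces a fresh load on every pass. */
                #pragma omp atomic read seq_cst
                ready = flag[j];
            }
            y[i] -= L[i * n + j] * y[j];
        }
        y[i] /= L[i * n + i];
        /* Publish y[i] before the flag becomes visible to other threads. */
        #pragma omp atomic write seq_cst
        flag[i] = 1;
    }
}
```

Even with the race gone, every row still waits for all earlier rows, so the threads spend most of their time spinning.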
EDIT: For your updated code, I suggest forgetting about parallelization (because of the loop-carried dependency) and making sure that the inner loop is properly vectorized. I suggest the following:
for(int i = 0; i < n; i ++){
    double sum_yi=y[i];
    #pragma GCC ivdep
    for(int j = 0; j < i; j++) {
        sum_yi -= L[i][j]*y[j];
    }
    y[i] = sum_yi/L[i][i];
}
#pragma GCC ivdep tells the compiler that there is no loop-carried dependency in the loop, so it can vectorize it safely. Do not forget to tell the compiler about the vectorization capabilities of your processor (e.g. use the -mavx2 flag if your processor is AVX2 capable).
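As a self-contained function, the suggestion above might be sketched as follows. The flat row-major layout of L and the name forward_subst are assumptions for illustration; the answer itself indexes L[i][j].

```c
#include <stddef.h>

/* Forward substitution: serial over rows, vectorizable over columns.
   L is n x n, row-major; y holds the right-hand side on entry and the
   solution on return. */
void forward_subst(int n, const double *L, double *y)
{
    for (int i = 0; i < n; i++) {
        /* Accumulate in a local variable instead of updating y[i]. */
        double sum_yi = y[i];
        const double *row = L + (size_t)i * (size_t)n;
        /* GCC-specific hint; other compilers may ignore it. */
        #pragma GCC ivdep
        for (int j = 0; j < i; j++)
            sum_yi -= row[j] * y[j];
        y[i] = sum_yi / row[i];
    }
}
```

Whether the inner loop was actually vectorized can be checked with GCC's -fopt-info-vec. For a floating-point sum like this one, GCC generally also needs permission to reorder the additions (for example -ffast-math) before it will vectorize the reduction, so it is worth checking the report rather than assuming.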

OpenMP collapse parallel for with parallel max-reduction?

I have the following nested loops that I want to collapse into one for parallelization. Unfortunately the inner loop is a max-reduction rather than a standard for loop, so the collapse(2) clause apparently can't be used here. Is there any way to collapse these two loops anyway? Thanks!
(note that s is the number of sublists and n is the length of each sublist and suppose n >> s)
#pragma omp parallel for default(shared) private(i,j)
for (i=0; i<n; i++) {
    rank[i] = 0;
    for (j=0; j<s; j++)
        if (rank[i] < sublistrank[j][i])
            rank[i] = sublistrank[j][i];
}
In this code the best idea is not to parallelize the inner loop at all, but to make sure it is properly vectorized. The inner loop does not access memory contiguously, which prevents vectorization and results in poor cache utilization. You should rewrite your code to ensure contiguous memory access (e.g. change the order of indices and use sublistrank[i][j] instead of sublistrank[j][i]).
It is also beneficial to use a temporary variable for the comparisons and assign it to rank[i] after the loop.
Another comment: always declare your variables in the minimum required scope, as this also helps the compiler create more optimized code. Putting it together, your code should look something like this (assuming you use unsigned int for rank and the loop variables):
#pragma omp parallel for default(none) shared(sublistrank, rank)
for (unsigned int i=0; i<n; i++) {
    unsigned int max=0;
    for (unsigned int j=0; j<s; j++)
        if (max < sublistrank[i][j])
            max = sublistrank[i][j];
    rank[i]=max;
}
I have compared your code and this code on Compiler Explorer. You can see that the compiler is able to vectorize the new version, but not the old one.
Note also that if n is small, the parallel overhead may be bigger than the benefit of parallelization.
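Written as a complete function, the rewritten loop might look like the sketch below. The flat row-major array (n rows of length s) and the name max_rank are assumptions for illustration. Because n and s are ordinary variables here, they are listed in the shared clause, which default(none) requires.

```c
#include <stddef.h>

/* rank[i] = maximum of row i of sublistrank.
   sublistrank holds n rows of length s, stored row-major. */
void max_rank(size_t n, size_t s,
              const unsigned int *sublistrank, unsigned int *rank)
{
    #pragma omp parallel for default(none) shared(sublistrank, rank, n, s)
    for (size_t i = 0; i < n; i++) {
        /* Row i is contiguous in memory, so the inner loop can vectorize. */
        const unsigned int *row = sublistrank + i * s;
        /* Temporary maximum, written to rank[i] once after the loop. */
        unsigned int max = 0;
        for (size_t j = 0; j < s; j++)
            if (max < row[j])
                max = row[j];
        rank[i] = max;
    }
}
```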

How to balance the thread number in nested case when using OpenMP?

This fabulous post teaches me a lot, but I still have a question. For the following code:
double multiply(std::vector<double> const& a, std::vector<double> const& b){
    double tmp(0);
    int active_level = omp_get_active_level();
    #pragma omp parallel for reduction(+:tmp) if(active_level < 1)
    for(unsigned int i=0;i<a.size();i++){
        tmp += a[i]+b[i];
    }
    return tmp;
}
If multiply() is called from another parallel part:
#pragma omp parallel for
for (int i = 0; i < count; i++) {
    multiply(a[i], b[i]);
}
Because the number of outer loop iterations depends on the count variable, this is reasonable if count is a big number. But if count is only 1 and our server is a multi-core machine (e.g., has 512 cores), then the multiply() function only uses 1 thread, so the server is under-utilized. BTW, the answer also mentioned:
In any case, writing such code is a bad practice. You should simply leave the parallel regions as they are and allow the end user choose whether nested parallelism should be enabled or not.
So how to balance the thread number in nested case when using OpenMP?
Consider using OpenMP tasks (omp taskloop within one parallel region and an intermediate omp single). This allows you to flexibly use the threads in OpenMP on different nesting levels instead of manually defining numbers of threads for each level or oversubscribing OS threads.
However this comes at increased scheduling costs. At the end of the day, there is no perfect solution that will always do best. Instead you will have to keep measuring and analyzing your performance on practical inputs.
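A minimal sketch of the task-based structure described above, assuming OpenMP 4.5 or later for taskloop. It uses plain C arrays instead of std::vector, stores the count vectors back to back in one flat array, and splits each call into a fixed number of chunks; NCHUNKS, multiply_all and the flat layout are all assumptions for illustration, not part of the question.

```c
#include <stddef.h>

/* Hypothetical number of chunks per call; tune it by measurement. */
#define NCHUNKS 64

/* Sum of a[i] + b[i], as in the question's multiply(). The work is split
   into tasks, so idle threads of the enclosing team can help. Called
   outside a parallel region it simply runs on one thread. */
double multiply(const double *a, const double *b, size_t n)
{
    double partial[NCHUNKS] = {0.0};

    /* partial must be shared explicitly: as a local variable it would
       otherwise be firstprivate in the generated tasks. Without a
       nogroup clause, taskloop waits for its tasks before continuing. */
    #pragma omp taskloop shared(partial)
    for (int c = 0; c < NCHUNKS; c++) {
        size_t lo = n * (size_t)c / NCHUNKS;
        size_t hi = n * (size_t)(c + 1) / NCHUNKS;
        double s = 0.0;
        for (size_t i = lo; i < hi; i++)
            s += a[i] + b[i];
        partial[c] = s;
    }

    /* Combine the per-chunk sums serially. */
    double tmp = 0.0;
    for (int c = 0; c < NCHUNKS; c++)
        tmp += partial[c];
    return tmp;
}

/* One parallel region; one thread creates tasks, all threads run them.
   a and b each hold count vectors of length n, stored back to back. */
void multiply_all(const double *a, const double *b, size_t n,
                  int count, double *out)
{
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskloop
        for (int i = 0; i < count; i++)
            out[i] = multiply(a + (size_t)i * n, b + (size_t)i * n, n);
    }
}
```

With count equal to 1, the outer taskloop creates a single task, but the inner taskloop still produces NCHUNKS tasks that the remaining threads of the team can pick up. With a large count, the outer tasks alone keep the threads busy and the inner tasks mostly add scheduling cost, which is the trade-off mentioned above.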

Unexpected slowdown using omp

I'm using OMP to try to get some speedup in a small kernel. It's basically just querying a vector of unordered_sets for membership. I tried to make an optimization, but surprisingly I got a slowdown, and am really curious why.
My first pass was:
vector<unordered_set<uint16_t> > setList = getData();
#pragma omp parallel for default(shared) private(i, j) schedule(dynamic, 50)
for(i = 0; i < size; i++){
    for(j = 0; j < 500; j++){
        count = count + setList[i].count(val[j]);
    }
}
Then I thought I could maybe get a speedup by moving the setList[i] sub expression up one level of nesting and save it in a temp variable, by doing the following:
#pragma omp parallel for default(shared) private(i, j, currSet) schedule(dynamic, 50)
for(i = 0; i < size; i++){
    currSet = setList[i];
    for(j = 0; j < 500; j++){
        count = count + currSet.count(val[j]);
    }
}
I had thought this would maybe save a load each iteration of the "j" for loop, and get a speedup, but it actually SLOWED DOWN by about 3x. By this I mean the entire kernel took about 3 times as long to run. Thoughts on why this would occur?
Thanks!
Adding up a few integers is really not enough work to warrant starting threads for.
If you forget to add the reduction clause, you'll suffer from true sharing: all threads want to update that count variable at the same time. This makes all cores fight for the cache line containing that variable, which will considerably impact your performance. Without synchronization it is also a data race.
I just noticed that you set the schedule to be dynamic. You shouldn't. Every iteration does the same amount of work, so the iterations can be divided evenly up front. So don't specify a schedule.
As has already been stated, loop-carried dependencies, i.e. threads waiting for data from other threads, or data being updated by multiple threads in turn, can slow a parallel program down and should be avoided as a rule of thumb. Built-in features like reductions collect the per-thread results and combine them in an optimised fashion.
Here is a good example of a reduction being used in a case similar to yours, from the University of Utah:
int array[8] = { 1, 1, 1, 1, 1, 1, 1, 1};
int sum = 0, i;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < 8; i++) {
    sum += array[i];
}
printf("total %d\n", sum);
source: http://www.eng.utah.edu/~cs4960-01/lecture9.pdf
As an aside: variables declared inside a parallel region are automatically private to each thread, and the iteration variable of the loop that OpenMP parallelizes is private by default. In both of your versions it is therefore not necessary to declare i private.
see wikipedia: https://en.wikipedia.org/wiki/OpenMP#Data_sharing_attribute_clauses
Data sharing attribute clauses
shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.
private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.
see stack exchange answer here: OpenMP: are local variables automatically private?
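Applied to a loop shaped like the one in the question, the points above (a reduction on count, and loop variables declared inside the loops so that they are private automatically) might look like this. It is a C sketch only: a hypothetical byte table member, with one row per set and one column per possible value, stands in for the question's vector of unordered_sets, and count_hits is an invented name. Note also that in the question's second version, currSet = setList[i] assigns by value; the sketch takes a pointer into the shared data instead.

```c
#include <stddef.h>

/* Sum member[i * nvals + val[j]] over all rows i and queries j.
   member is a size x nvals table of 0/1 bytes (a stand-in for set
   membership); every val[j] must be smaller than nvals. */
long count_hits(size_t size, size_t nvals, const unsigned char *member,
                const unsigned short *val, size_t nqueries)
{
    long count = 0;
    /* Each thread gets its own count; they are summed at the end. */
    #pragma omp parallel for reduction(+:count)
    for (size_t i = 0; i < size; i++) {
        /* Pointer into the shared table: nothing is copied per row. */
        const unsigned char *row = member + i * nvals;
        for (size_t j = 0; j < nqueries; j++)
            count += row[val[j]];
    }
    return count;
}
```

No private or schedule clause is needed: i, j and row are declared inside the construct, and the default schedule is left to the implementation.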

Xeon-Phi asynchronous offload from host openMP parallel region

I am using Intel's offload pragmas in host OpenMP code. The code looks as follows:
int s1 = f(a,b,c);
#pragma offload signal(s1) in (...) out(x:len)
{
    for (int i = 0; i < len; ++i)
    {
        x[i] = ...
    }
}
#pragma omp parallel default(shared)
{
    #pragma omp for schedule(dynamic) nowait
    for (int i = 0; i < count; ++i)
    {
        /* code */
    }
    #pragma omp for schedule(dynamic)
    for (int j = 0; j < count2; ++j)
    {
        /* code */
    }
}
#pragma offload wait(s1)
{
    /* code */
}
The code offloads the calculation of x to the MIC. Meanwhile, the host keeps itself busy by assigning some OpenMP work to the CPU cores. The above code works as expected. However, the first offload pragma takes a lot of time and has become the bottleneck. Nevertheless, overall it pays off to offload the computation of x to the MIC. One way I am trying to overcome this latency issue is as follows:
int s1 = f(a,b,c);
#pragma omp parallel default(shared)
{
    #pragma omp single nowait
    {
        #pragma offload signal(s1) in (...) out(x:len)
        {
            for (int i = 0; i < len; ++i)
            {
                x[i] = ...
            }
        }
    }
    #pragma omp for schedule(dynamic) nowait
    for (int i = 0; i < count; ++i)
    {
        /* code */
    }
    #pragma omp for schedule(dynamic)
    for (int j = 0; j < count2; ++j)
    {
        /* code */
    }
}
#pragma offload wait(s1)
{
    /* code */
}
So this new code assigns one thread to do the offload, while the other OpenMP threads can be used for the other worksharing constructs. However, this code doesn't work. I get the following error message:
device 1 does not have a pending signal for wait(0x1)
The offload report points to the above piece of code as the main culprit. One temporary workaround is to use a constant as the signal, i.e. signal(0), which works. However, I need a more permanent solution. Can anyone shed light on what is going wrong in my code?
Thanks
Let me complement Taylor's reply a bit.
The first offload indeed takes more time than subsequent offloads, because of the initialization stuff going on. Taylor sketched some of the things going on there. You can avoid the dummy offload by using the environment variable OFFLOAD_INIT=on_start. That should let the runtime system do all the initialization ahead of time. The overhead of this does not go away, but it moves from your first offload to the application initialization.
The problem with your second code snippet seems to be that your offloads target different devices. Signalling and waiting only works if the signal and wait happen for the same target device. Since you do not explicitly use the target(mic:0) clause with your offloads, chances are high that the runtime system selects different target devices.
One recommendation I would like to make is to not use plain integers for the signalling. Usually, the signal indicates that a certain buffer is ready. In these cases, it is good practice to use the buffer pointer as the signal handle, since it will be unique for concurrent offloads working with different buffers.
Cheers,
-michael
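The last two suggestions (name the target device explicitly on both pragmas, and use the buffer pointer as the signal handle) might look roughly like this. It is a sketch only: the function name and the placeholder payload are invented, the in(...) data from the question is left out, and it assumes Intel's offload pragmas with the length modifier and the offload_wait pragma. Compilers without offload support ignore these pragmas, in which case the loop simply runs on the host.

```c
/* Fill x[0..len) on the coprocessor while the host does other work. */
void fill_async(double *x, int len)
{
    /* The same device is named on the offload and on the wait, and
       the buffer pointer x itself serves as the signal handle. */
    #pragma offload target(mic:0) signal(x) out(x : length(len))
    {
        for (int i = 0; i < len; ++i)
            x[i] = 2.0 * i;  /* placeholder payload, not the asker's computation */
    }

    /* ... host-side OpenMP worksharing loops would go here ... */

    /* Block until the offload signalled by x has finished on mic:0. */
    #pragma offload_wait target(mic:0) wait(x)
}
```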
I can't comment on the 2nd code block. I have some observations about the first.
The first offload always takes longer, since it also sets up the offload infrastructure. This includes things such as passing environment variables, copying over the MIC implementation of libiomp5, setting up the thread pool, etc.
The way to avoid this is to set up a dummy offload first, meaning one that doesn't really do anything and is not part of your computation block.
An excellent set of references on optimizing for the xeon phi coprocessor is under the training tab at software.intel.com/mic-developer.
Also take a look at software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture, software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization, and software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization.
Sorry about the long URLs but stackoverflow doesn't allow me to include more than two links as I'm new.

Resources