OpenMP for-loop chunk scheduling visualization - performance

Are there tools that visualize execution of OpenMP for-loop chunks?
For example, consider the parallel for-loop below:
#pragma omp parallel for schedule(dynamic, 10) num_threads(4)
for (int i = 1; i <= 100; i++)
{
    // do work of uneven execution time.
}
I want to visualize which thread each of the 10 chunks (say (1,10), (11,20), ..., (91,100)) executed on and how long each chunk took, without modifying the code.
I understand that only four (one per thread) parallel outline functions are started, and that each of these functions asks for chunks in a synchronized manner. I can visualize the four parallel outline functions in tools such as Intel VTune, but am unable to drill this visualization down to the chunk level.
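For reference, if modifying the code were acceptable, something like the following hand-rolled logging (not a tool, just a sketch assuming the chunk size of 10 from the example above) would approximate the per-chunk data I am after:
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel for schedule(dynamic, 10) num_threads(4)
    for (int i = 1; i <= 100; i++)
    {
        if (i % 10 == 1)   /* first iteration of a chunk of size 10 */
            printf("chunk starting at %3d taken by thread %d at t=%.6f\n",
                   i, omp_get_thread_num(), omp_get_wtime());
        /* do work of uneven execution time; logging again at i % 10 == 0
           would give the end time of each chunk */
    }
    return 0;
}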
Thanks in advance for your tips and suggestions!

Related

Is there a way to efficiently synchronize a subset of threads using OpenMP?

Because OpenMP nested parallelism often has performance problems, I was wondering if there is a way to implement a partial barrier that would synchronize only a subset of all threads.
Here is an example code structure:
#pragma omp parallel
{
    int nth = omp_get_num_threads();
    int tid = omp_get_thread_num();
    if (tid < nth/2) {
        // do some work
        ...
        // I need some synchronization here, but only for nth/2 threads
        #pragma omp partial_barrier(nth/2)
        // do some more work
        ...
    } else {
        // do some other independent work
        ...
    }
}
I did not find anything like that in the OpenMP standard, but maybe there is a way to program a similar behaviour efficiently with locks or something?
EDIT:
So my actual problem is that I have a computation kernel (a Legendre transform -- part A) that is efficiently parallelized using OpenMP with 4 to 12 threads depending on the problem size.
This computation is followed by a Fast Fourier transform (part B).
I have several independent datasets (typically about 6 to 10) that have to be processed by A followed by B.
I would like to use more parallelism with more threads (48 to 128, depending on the machines).
Since A is not efficiently parallelized with more than 4 to 12 threads, the idea is to split the threads into several groups, each group working on an independent dataset. Because the datasets are independent, I don't need to synchronize all the threads (which is quite expensive when many threads are used) before doing B, only the subset working on a given dataset.
OpenMP tasks with dependencies would do what I need, but my experience is that on some platforms (Xeon servers) the performance is significantly lower than what you can get with plain threads.
Is there a way to synchronize a subset of threads efficiently?
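One possible direction, sketched here purely for illustration: a hand-rolled sense-reversing barrier over C11 atomics that only the participating threads call. This is not a standard OpenMP construct, and the names subbarrier_t and subbarrier_wait are made up; a production version would add backoff in the spin loop.
#include <omp.h>
#include <stdatomic.h>

typedef struct {
    atomic_int count;
    atomic_int sense;
    int nthreads;                      /* number of participating threads */
} subbarrier_t;

static void subbarrier_init(subbarrier_t *b, int nthreads)
{
    atomic_init(&b->count, 0);
    atomic_init(&b->sense, 0);
    b->nthreads = nthreads;
}

/* Each participating thread passes its own local_sense, initialised to 0. */
static void subbarrier_wait(subbarrier_t *b, int *local_sense)
{
    *local_sense = 1 - *local_sense;
    if (atomic_fetch_add(&b->count, 1) == b->nthreads - 1) {
        atomic_store(&b->count, 0);               /* last thread in: reset the counter */
        atomic_store(&b->sense, *local_sense);    /* and release the waiting threads */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                     /* spin; a real version would add backoff */
    }
}

int main(void)
{
    subbarrier_t bar;

    #pragma omp parallel
    {
        int nth = omp_get_num_threads();
        int tid = omp_get_thread_num();

        #pragma omp single
        subbarrier_init(&bar, nth / 2);           /* implicit barrier after single */

        if (tid < nth / 2) {
            int local_sense = 0;
            /* do some work */
            subbarrier_wait(&bar, &local_sense);  /* synchronises only these nth/2 threads */
            /* do some more work */
        } else {
            /* do some other independent work */
        }
    }
    return 0;
}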

What is "implicit synchronization" in OpenMP

What exactly is "implicit synchronization" in OpenMP, and how can you spot one? My teacher said that
#pragma omp parallel
    printf("Hello 1\n");
has an implicit sync. Why? And how do you see it?
Synchronisation is an important issue in parallel processing and in OpenMP. In general, parallel processing is asynchronous. You know that several threads are working on a problem, but you have no way to know exactly what their actual state is, which iteration they are working on, etc. A synchronisation allows you to get control over thread execution.
There are two kinds of synchronisation in OpenMP: explicit and implicit. An explicit synchronisation is done with a specific OpenMP construct that creates a barrier: #pragma omp barrier. A barrier is a parallel construct that can only be passed by all the threads together. So after the barrier, you know exactly the state of all threads and, more importantly, how much work they have done.
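For instance, a minimal fragment (phase_one and phase_two are hypothetical work functions, used only to illustrate the construct):
#pragma omp parallel
{
    phase_one(omp_get_thread_num());   /* every thread does its share of phase one */
    #pragma omp barrier                /* nobody starts phase two until all of phase one is done */
    phase_two(omp_get_thread_num());
}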
Implicit synchronisation is done in two situations:
at the end of a parallel region. OpenMP relies on a fork-join model. When the program starts, a single thread (the master thread) is created. When you create a parallel section with #pragma omp parallel, several threads are created (fork). These threads work concurrently and are destroyed at the end of the parallel section (join). So at the end of a parallel section, you have a synchronisation and you know precisely the status of all threads (they have finished their work). This is what happens in the example that you give: the parallel section only contains the printf(), and at the end the program waits for the termination of all threads before continuing.
at the end of some OpenMP constructs, like #pragma omp for or #pragma omp sections, there is an implicit barrier. No thread can continue until all the threads have reached the barrier. This makes it possible to know exactly what work has been done by the different threads.
For instance, consider the following code.
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++)
        A[i] = f(i);            // compute values for A
    #pragma omp for
    for (int j = 0; j < N/2; j++)
        B[j] = A[j] + A[j+N/2]; // use the previously computed vector A
} // end of parallel section
As all the threads work asynchronously, you do not know which threads have finished creating their part of vector A. Without a synchronisation, there is a risk that a thread rapidly finishes its part of the first for loop, enters the second for loop, and accesses elements of vector A while the threads that are supposed to compute them are still in the first loop and have not yet computed the corresponding values of A[i].
This is the reason why OpenMP compilers add an implicit barrier to synchronize all the threads. So you are certain that all threads have finished their work and that all values of A have been computed when the second for loop starts.
But in some situations, no synchronisation is required. For instance, consider the following code:
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++)
        A[i] = f(i);            // compute values for A
    #pragma omp for
    for (int j = 0; j < N/2; j++)
        B[j] = g(j);            // compute values for B
} // end of parallel section
Obviously the two loops are completely independent, and it does not matter whether A has been fully computed before the second for loop starts. So the synchronisation adds nothing to program correctness,
and an unnecessary synchronisation barrier has two major drawbacks:
If function f() has very different running times, some threads may have finished their work while others are still computing. The synchronisation forces the former threads to wait, and this idleness wastes potential parallelism.
Synchronisations are expensive. A simple way to implement a barrier is to increment a global counter when reaching the barrier and to wait until the counter equals the number of threads (omp_get_num_threads()). To avoid races between threads, the increment of the global counter must be done with an atomic read-modify-write, which requires a large number of cycles, and waiting for the counter to reach the proper value is typically done with a spin lock, which wastes processor cycles.
So there is a clause to suppress these implicit synchronisations, and the best way to program the previous loops would be:
#pragma omp parallel
{
    #pragma omp for nowait      // nowait suppresses the implicit barrier
    for (int i = 0; i < N; i++)
        A[i] = f(i);            // compute values for A
    #pragma omp for
    for (int j = 0; j < N/2; j++)
        B[j] = g(j);            // compute values for B
} // end of parallel section
This way, as soon as a thread has finished its work in the first loop, it immediately starts processing the second for loop, and, depending on the actual program, this may significantly reduce execution time.

OpenMP output for "for" loop

I am new to OpenMP and I just tried to write a small program with the parallel for construct. I have trouble understanding the output of my program. I don't understand why thread number 3 prints the output before 1 and 2. Could someone offer me an explanation?
So, the program is:
#pragma omp parallel for
for (i = 0; i < 7; i++) {
    printf("We are in thread number %d and are printing %d\n",
           omp_get_thread_num(), i);
}
and the output is:
We are in thread number 0 and are printing 0
We are in thread number 0 and are printing 1
We are in thread number 3 and are printing 6
We are in thread number 1 and are printing 2
We are in thread number 1 and are printing 3
We are in thread number 2 and are printing 4
We are in thread number 2 and are printing 5
My processor is an Intel(R) Core(TM) i5-2410M CPU with 4 cores.
Thank you!
OpenMP makes no guarantees of the relative ordering, in time, of the execution of statements by different threads. OpenMP leaves it to the programmer to impose such ordering if it is required. In general it is not required, in many cases not even desirable, which is why OpenMP's default behaviour is as it is. The cost, in time, of imposing such an ordering is likely to be significant.
I suggest you run much larger tests several times; you should observe that the cross-thread sequencing of events is, essentially, random.
If you want to print in order then you can use the ordered construct
#pragma omp parallel for ordered
for (i = 0; i < 7; i++) {
    #pragma omp ordered
    printf("We are in thread number %d and are printing %d\n",
           omp_get_thread_num(), i);
}
I assume this requires threads handling later iterations to wait for the ones handling earlier iterations, so it will have an effect on performance. You can see it used here: http://bisqwit.iki.fi/story/howto/openmp/#ExampleCalculatingTheMandelbrotFractalInParallel
That draws the Mandelbrot set as characters using ordered. A much faster solution than using ordered is to fill an array of characters in parallel and then print them serially (try the code). Since one uses OpenMP for performance, I have never found a good reason to use ordered, but I'm sure it has its uses somewhere.
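For illustration, a minimal complete sketch of that faster approach (the buffer name and size are made up): record each iteration's result in an array inside the parallel loop, then print the array serially afterwards.
#include <omp.h>
#include <stdio.h>

int main(void)
{
    enum { N = 7 };
    int tid[N];                          /* which thread ran each iteration */

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        tid[i] = omp_get_thread_num();   /* fill the buffer in parallel, in any order */

    for (int i = 0; i < N; i++)          /* print serially, so the output is ordered */
        printf("Iteration %d was run by thread %d\n", i, tid[i]);

    return 0;
}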

OpenMP slows down program instead of speeding it up: a bug in gcc?

I will first give some background about the problem I'm having so you know what I'm trying to do. I have been helping out with the development of a certain software tool and found out that we could benefit greatly from using OpenMP to parallelize some of the biggest loops in this software. We actually parallelized the loops successfully, and with just two cores the loops executed 30% faster, which was an OK improvement. On the other hand, we noticed a weird phenomenon in a function that traverses a tree structure using recursive calls. The program actually slowed down here with OpenMP on, and the execution time of this function more than doubled. We thought that maybe the tree structure was not balanced enough for parallelization and commented out the OpenMP pragmas in this function. This appeared to have no effect on the execution time, though. We are currently using the GCC compiler 4.4.6 with the -fopenmp flag for OpenMP support. And here is the current problem:
If we don't use any omp pragmas in the code, everything runs fine. But if we add just the following to the beginning of the program's main function, the execution time of the tree traversal function more than doubles, from 35 seconds to 75 seconds:
//beginning of main function
...
#pragma omp parallel
{
    #pragma omp single
    {}
}
//main function continues
...
Does anyone have any clues about why this happens? I don't understand why the program slows down so much just from the presence of the OpenMP pragmas. If we take out all the omp pragmas, the execution time of the tree traversal function drops back to 35 seconds again. I would guess that this is some sort of compiler bug, as I have no other explanation in mind right now.
Not everything that can be parallelized should be parallelized. If you are using a single, then only one thread executes it and the rest have to wait until the region is done. They can either spin-wait or sleep. Most implementations start out with a spin-wait, hoping that the single region will not take too long and that the waiting threads will see the completion faster than if they were sleeping. Spin-waits eat up a lot of processor cycles. You can try specifying that the wait should be passive (via the OMP_WAIT_POLICY environment variable), but this is only in OpenMP V3.0 and is only a hint to the implementation (so it might not have any effect). Basically, unless you have a lot of work in the parallel region that can compensate for the single, the single is going to increase the parallel overhead substantially and may well make it too expensive to parallelize.
First, OpenMP often reduces performance on the first try. It can be tricky to use omp parallel if you don't understand it inside out. I may be able to help if you can tell me a little more about the program structure, specifically the following questions annotated by ????.
//beginning of main function
...
#pragma omp parallel
{
    ???? What goes here? Is this a loop? If so, a for loop or a while loop?
    #pragma omp single
    {
        ???? What goes here, and how long does it run?
    }
}
//main function continues
....
???? Does the performance drop in this code, or somewhere else?
Thanks.
Thank you everyone. We were able to fix the issue today by linking with TCMalloc, one of the solutions ejd offered. The execution time dropped immediately and we got around a 40% improvement in execution time over the non-threaded version. We used 2 cores. It seems that when using OpenMP on Unix with GCC, you should also pick a replacement for the standard memory allocator. Otherwise the program may just slow down.
I did some more testing and made a small test program to check whether the issue could be related to memory operations. I was unable to replicate the issue of an empty parallel-single region causing the program to slow down in my small test program, but I was able to replicate the slowdown by parallelizing some malloc calls.
When running the test program on Windows 7 64-bit with 2 CPU cores, no noticeable slowdown was caused by compiling with the -fopenmp flag in gcc (g++) compared to running the program without OpenMP support.
Doing the same on Kubuntu 11.04 64-bit on the same computer, however, raised the execution time to over 4 times that of the non-OpenMP version. This issue seems to appear only on Unix systems and not on Windows.
The source of my test program is below. I have also uploaded the zipped source for the Windows and Unix versions, as well as the assembly output for both, with and without OpenMP support. The zip can be downloaded here: http://www.2shared.com/file/0thqReHk/omp_speed_test_2011_05_11.html
#include <stdio.h>
#include <windows.h>
#include <list>
#include <sys/time.h>
//#include <cstdlib>

using namespace std;

int main(int argc, char* argv[])
{
    // #pragma omp parallel
    // #pragma omp single
    // {}
    int start = GetTickCount();
    /*
    struct timeval begin, end;
    int usecs;
    gettimeofday(&begin, NULL);
    */
    list<void *> pointers;
    #pragma omp parallel for default(shared)
    for (int i = 0; i < 10000; i++)
    {
        //void *p = calloc(20000, sizeof(void *));
        void *p = malloc(20000);   // the allocation is what is being parallelized
        #pragma omp critical       // std::list::push_back is not thread-safe
        pointers.push_back(p);
    }
    for (list<void *>::iterator i = pointers.begin(); i != pointers.end(); i++)
        free(*i);
    /*
    gettimeofday(&end, NULL);
    if (end.tv_usec < begin.tv_usec) {
        end.tv_usec += 1000000;
        begin.tv_sec += 1;
    }
    usecs = (end.tv_sec - begin.tv_sec) * 1000000;
    usecs += (end.tv_usec - begin.tv_usec);
    */
    printf("It took %d milliseconds to finish the memory operations\n", GetTickCount() - start);
    //printf("It took %d milliseconds to finish the memory operations\n", usecs/1000);
    return 0;
}
What remains unanswered now is: what can I do to avoid issues such as these on the Unix platform?

OpenMP: Huge slowdown in what should be ideal scenario

In the code below I'm trying to compare all elements of an array to all other elements in a nested for loop. (It's to run a simple n-body simulation. I'm testing with only 4 bodies for 4 threads on 4 cores.) An identical sequential version of the code without OpenMP modifications runs in around 15 seconds for 25M iterations. Last night this code ran in around 30 seconds. Now it runs in around 1 minute! I think the problem may lie in the fact that the threads must write to the array, which is passed to the function via a pointer.
The array is dynamically allocated elsewhere and is composed of structs I defined. This is just a hunch. I have verified that the 4 threads are running on 4 separate cores at 100% and that they are accessing the elements of the array properly. Any ideas?
void runSimulation (particle* particles, int numSteps){
    //particles is a pointer to an array of structs I've defined and allocated dynamically before calling the function
    //Variable Initializations
    #pragma omp parallel num_threads(4) private(//The variables inside the loop) shared(k,particles) // 4 threads for four cores
    {
        while (k < numSteps){ //Main loop.
            #pragma omp master //Check whether it is time to report progress.
            {
                //Some simple if statements
                k = k + 1; //Increment step counter; for some reason omp doesn't like k++
            }
            //Calculate new velocities
            #pragma omp for
            for (i = 0; i < numParticles; i++){ //Calculate forces by comparing each particle to all others
                Fx = 0;
                Fy = 0;
                for (j = 0; j < numParticles; j++){
                    //Calculate the cumulative force by comparing each particle to all others
                }
                //Calculate accelerations and set new velocities
                ax = Fx / particles[i].mass;
                ay = Fy / particles[i].mass;
                //ARE THESE TWO LINES THE PROBLEM?!
                particles[i].xVelocity += deltaT*ax;
                particles[i].yVelocity += deltaT*ay;
            }
            #pragma omp master
            {
                //Apply new velocities to create new positions after all forces have been calculated.
                for (i = 0; i < numParticles; i++){
                    particles[i].x += deltaT*particles[i].xVelocity;
                    particles[i].y += deltaT*particles[i].yVelocity;
                }
            }
            #pragma omp barrier
        }
    }
}
You are thrashing the cache. All the cores are writing to the same shared structure, which will be continually bouncing around between the cores via the L2 (best case), L3, or main memory/memory bus (worst case). Depending on how the data is shared, this takes anywhere from 20 to 300 cycles, while writes to private memory in L1 take 1 cycle or less in ideal conditions.
That explains your slowdown.
If you increase your number of particles the situation may become less severe, because you'll often be writing to distinct cache lines, so there will be less thrashing. btown above has the right idea in suggesting a private array.
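If cache-line sharing is indeed the issue, one common mitigation is to align each particle so that neighbouring array elements written by different threads land on different cache lines. A minimal sketch, assuming the struct contains only the five double members visible in the question's code and that cache lines are 64 bytes:
#include <stdalign.h>

/* Aligning the first member to 64 bytes makes the whole struct 64 bytes,
   so each array element sits on its own cache line (for 64-byte lines). */
typedef struct {
    alignas(64) double x;
    double y;
    double xVelocity;
    double yVelocity;
    double mass;
} particle;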
Not sure if this will fix the problem, but you might try giving each thread its own copy of the full array; the problem might be that the threads are fighting over accessing the shared memory, and you're seeing a lot of cache misses.
I'm not sure of the exact OpenMP syntax you'd use to do this, but try the following (a rough sketch of the whole scheme is given after the steps):
Allocate memory to hold the entire particles array in each thread; do this once, and save all four new pointers.
At the beginning of each main loop iteration, in the master thread, deep-copy the main array into each of those four new arrays. You can do this quickly with a memcpy().
Do the calculation such that the first thread writes to indices 0 <= i < numParticles/4, and so on.
In the master thread, before you apply the new velocities, merge the four arrays into the main array by copying over only the relevant indices. You can do this quickly with a memcpy().
Note that you can parallelize your "apply new velocities" loop without any problems because each iteration only operates on a single index; this is probably the easiest part to parallelize.
The new operations will only be O(N) compared to your calculations which are O(N^2), so they shouldn't take too much time in the long run. There are definitely ways to optimize the steps that I laid out for you, Gabe, but I'll leave those to you.
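A rough, hedged sketch of that scheme. Everything concrete here (NUM_PARTICLES, NUM_STEPS, DELTA_T, the struct layout, and the placeholder force computation) is made up for illustration, since the real kernel is not shown in the question; the point is the snapshot-compute-merge structure with disjoint slices per thread.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { double x, y, xVelocity, yVelocity, mass; } particle;

#define NUM_PARTICLES 4
#define NUM_STEPS     1000
#define DELTA_T       0.01

int main(void)
{
    particle *particles = calloc(NUM_PARTICLES, sizeof *particles);
    for (int i = 0; i < NUM_PARTICLES; i++) {
        particles[i].x = i;
        particles[i].y = -i;
        particles[i].mass = 1.0;
    }

    #pragma omp parallel num_threads(4)
    {
        int tid  = omp_get_thread_num();
        int nthr = omp_get_num_threads();
        int lo   = tid * NUM_PARTICLES / nthr;        /* this thread owns [lo, hi) */
        int hi   = (tid + 1) * NUM_PARTICLES / nthr;

        /* step 1: each thread gets its own full-size copy of the array */
        particle *local = malloc(NUM_PARTICLES * sizeof *local);

        for (int step = 0; step < NUM_STEPS; step++) {
            /* step 2: snapshot the shared array into the private copy */
            memcpy(local, particles, NUM_PARTICLES * sizeof *local);

            /* step 3: update only the owned slice, reading from the snapshot */
            for (int i = lo; i < hi; i++) {
                double Fx = 0.0, Fy = 0.0;
                for (int j = 0; j < NUM_PARTICLES; j++) {
                    if (j == i) continue;
                    Fx += local[j].x - local[i].x;    /* placeholder force, not real physics */
                    Fy += local[j].y - local[i].y;
                }
                local[i].xVelocity += DELTA_T * Fx / local[i].mass;
                local[i].yVelocity += DELTA_T * Fy / local[i].mass;
                local[i].x += DELTA_T * local[i].xVelocity;
                local[i].y += DELTA_T * local[i].yVelocity;
            }

            #pragma omp barrier   /* all threads have taken their snapshot and finished their slice */
            /* step 4: merge the disjoint slices back into the shared array */
            memcpy(&particles[lo], &local[lo], (hi - lo) * sizeof *local);
            #pragma omp barrier   /* merged array is visible to all before the next step */
        }
        free(local);
    }

    printf("particle 0 ended up at (%g, %g)\n", particles[0].x, particles[0].y);
    free(particles);
    return 0;
}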
I don't agree that the problem is cache thrashing, since the size of the particle struct must exceed the size of a cache line just from the number of members.
I think the more likely culprit is that the overhead of setting up an omp for is thousands of cycles (see http://www.ualberta.ca/CNS/RESEARCH/Courses/2001/PPandV/OpenMP.Eric.pdf) and the loop has only a few calculations in it. I'm not remotely surprised the loop is slower with only 4 bodies. If you had a few hundred bodies the situation would be different. I once worked on a loop a bit like this, and ended up using pthreads directly.
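As an aside, if the particle count can sometimes be this small, OpenMP's if clause lets the runtime fall back to serial execution below a threshold. A hedged fragment (the threshold of 100 is arbitrary, not from the original code):
/* Below the threshold the loop runs serially, avoiding the fork and
   scheduling overhead for tiny trip counts. */
#pragma omp parallel for if(numParticles > 100)
for (int i = 0; i < numParticles; i++) {
    /* force computation for particle i */
}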

Resources