Implementing OpenMP tasks in FORTRAN? - parallel-processing

I'm new to OpenMP and I'm trying to parallelize an already existing serial code. The code has about 40000 lines, so I can't really post it here.
I'm trying to implement the following code (in C) in FORTRAN:
my_pointer = listhead;
#pragma omp parallel
{
#pragma omp single nowait
{
while(my_pointer) {
#pragma omp task firstprivate(my_pointer)
{
(void) do_independent_work (my_pointer);
}
my_pointer = my_pointer->next ;
}
} // End of single - no implied barrier (nowait)
} // End of parallel region - implied barrier
In my code:
my_pointer = zi ;
listhead = z%first ;
zi%kc(zi%np) is an array of size zi%np ;
do_independent_work(my_pointer) = ALLOCATE(zi%kc(zi%np)) and initializes the vector to 0 ;
My code is the following:
!$OMP PARALLEL
!$OMP SINGLE
DO WHILE(ASSOCIATED(zi))
IF (zi%compt) THEN
!$OMP TASK
ALLOCATE(zi%kc(zi%np), STAT = AllocateStatus )
IF (AllocateStatus /= 0) STOP "*** zi%kc Allocate failed ***"
FORALL(i=1:zi%np)
zi%kc(i) = 0.0_SDP
END FORALL
!$OMP END TASK
ENDIF
zi => zi%next
ENDDO
!$OMP END SINGLE NOWAIT
!$OMP END PARALLEL
The problem is: the serial version of this code runs without any problem, while the parallel version I implemented crashes for some reason.
Am I doing something fundamentally wrong?
Also, if I put firstprivate(zi) next to "!$OMP TASK" I get "Error 1 error #7266: A F90 pointer is not permitted in an OpenMP* FIRSTPRIVATE, LASTPRIVATE or REDUCTION clause."
I'm using Parallel Studio XE 2011 with Visual Studio 2010.

Fortran pointers are allowed in FIRSTPRIVATE only as of OpenMP 3.1, so you should update your compiler (the 2011 suite is old). The crash has the same root cause: without firstprivate(zi) the task sees the shared zi, which the thread executing the single block keeps advancing along the list (and finally disassociates), so a task may dereference the wrong node or a null pointer. That is exactly what firstprivate(my_pointer) in the C template is there to prevent.

Related

gcc openmp tasks do not work

I have already used OpenMP with "pragma omp for" loops and wanted to try OpenMP tasks now.
But a simple program, which should run two tasks in parallel, does not seem to work.
Did I misunderstand the use of tasks or what is wrong here?
#include <iostream>
#include <omp.h>
#include <unistd.h> // for usleep()
// ubuntu 12.04 LTS, gcc 4.6.3
// g++ test_omp.cpp -fopenmp
int main()
{
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
{
while(true)
{
usleep(1e6);
#pragma omp critical (c_out)
std::cout<<"task1"<<std::endl;
}
}
#pragma omp task
{
while(true)
{
usleep(1e6);
#pragma omp critical (c_out)
std::cout<<"task2"<<std::endl;
}
}
}
}
}
The output is:
task1
task1
task1
.....
So the second task is not running.
From the OpenMP spec:
When a thread encounters a task construct, a task is generated from
the code for the associated structured block. The data environment of
the task is created according to the data-sharing attribute clauses on
the task construct, per-data environment ICVs, and any defaults that
apply.
The encountering thread may immediately execute the task, or
defer its execution. In the latter case, any thread in the team may be
assigned the task. Completion of the task can be guaranteed using task
synchronization constructs. A task construct may be nested inside an
outer task, but the task region of the inner task is not a part of the
task region of the outer task.
(emphasis mine)
The way I read this: a single thread starts executing your single section. When it reaches the task directive, it may decide either to run the task itself or to hand it to another thread. The problem occurs when it decides to run the task itself: since the task body is an infinite loop, it never returns, so the second task construct is never even created.
I'm not quite sure why you have task/single in your example though. What you want to do seems like a case for omp parallel sections instead:
#include <iostream>
#include <unistd.h> // for usleep()

int main()
{
#pragma omp parallel sections num_threads(2)
{
#pragma omp section
{
while(true)
{
usleep(3e5);
#pragma omp critical (c_out)
std::cout<<"task1"<<std::endl;
}
}
#pragma omp section
{
while(true)
{
usleep(5e5);
#pragma omp critical (c_out)
std::cout<<"task2"<<std::endl;
}
}
}
}

Openmp not updating

I am trying to write a C (gcc) function that will calculate the maximum of an array of doubles while running across multiple threads. I create an array of size omp_get_num_threads(), in which I store the local maxima of each thread before finally maximizing this small array. The code is (more or less) the following:
int i;
double *local_max;
double A[1e10]; //made up size
#pragma omp parallel
{
#pragma omp master
{
local_max=(double *)calloc(omp_get_num_threads(),sizeof(double));
}
#pragma omp flush //so that all threads point
//to the correct location of local_max
#pragma omp for
for(i=0;i<1e10;i++){
if(A[i]>local_max[omp_get_thread_num()])
local_max[omp_get_thread_num()]=A[i];
}
}
free(local_max);
This, however, leads to segfaults, and valgrind complains of the usage of uninitialized variables. Turns out, local_max is not actually updated throughout all threads before they enter the for construct. I thought #pragma omp flush should do that? If I replace it with #pragma omp barrier, everything works fine.
Could someone explain to me what is going on?
The easiest solution to your problem is to simply replace the master construct with a single one as it doesn't really matter which thread would make the allocation (unless you are running on a NUMA machine, but then you would also have many other things to worry about):
#pragma omp single
{
local_max=(double *)calloc(omp_get_num_threads(),sizeof(double));
}
The subtle difference between master and single is that there is an implicit barrier at the end of single, while no such barrier exists at the end of master. This implicit barrier makes all the other threads wait until the thread executing the single block has reached the end of the block (unless the nowait clause is specified, which removes the implicit barrier). With master the barrier must be added explicitly. It is beyond my comprehension why the OpenMP designers decided that master would not have an implicit barrier like single does.
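For completeness, a minimal sketch of the explicit-barrier variant that keeps master (this is meant as a drop-in replacement for the master/flush part inside the question's parallel region; the barrier is the only addition):
#pragma omp master
{
local_max = (double *)calloc(omp_get_num_threads(), sizeof(double));
}
// explicit barrier: no thread may index local_max before the master
// thread has finished the allocation (a barrier also implies a flush)
#pragma omp barrier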
You need to put a barrier to ensure the memory allocation has been completed. Memory allocation is a time-consuming operation, and when your final for loop starts running, local_max is not yet pointing to a properly allocated space. I modified your code below to demonstrate the behavior.
#include <cstdio>   // getchar()
#include <cstdlib>  // calloc(), free()
#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
int i;
double *local_max;
omp_set_num_threads(8);
#pragma omp parallel
{
#pragma omp master
{
for(int k = 0; k < 999999; k++) {} // Lazy man's sleep function
cout << "Master start allocating" << endl;
local_max=(double *)calloc(omp_get_num_threads(),sizeof(double));
cout << "Master finish allocating" << endl;
}
#pragma omp flush
#pragma omp for
for(i=0;i<10;i++){
cout << "for : " << omp_get_thread_num() << " i: " << i << endl;
}
}
free(local_max);
getchar();
return 0;
}
Better yet, just move the memory allocation before the #pragma omp parallel. No need for flush, or single, or master.
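A minimal sketch of that variant (my own illustration; the function name array_max, and A and N as stand-ins for the question's array and its length, are made up):

#include <stdlib.h>
#include <omp.h>

double array_max(const double *A, long N)
{
    long i;
    /* Allocate before the parallel region: every thread already sees a valid
       pointer, so no flush, barrier, single or master is needed for it.
       Note: like the original code, this assumes the maximum is non-negative,
       because calloc zero-initialises the per-thread slots. */
    double *local_max = (double *)calloc(omp_get_max_threads(), sizeof(double));

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (i = 0; i < N; i++)
            if (A[i] > local_max[tid])
                local_max[tid] = A[i];
    }

    /* Reduce the per-thread maxima, as described in the question. */
    double m = local_max[0];
    for (int t = 1; t < omp_get_max_threads(); t++)
        if (local_max[t] > m)
            m = local_max[t];
    free(local_max);
    return m;
}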

avoid nested parallel region

I want to write a function that uses OpenMP parallelism but should work whether or not it is called from within a parallel region. So I used the if clause to suppress nested parallelism, but this doesn't work as I thought:
#include <omp.h>
#include <stdio.h>
int m=0,s=0;
void func()
{
bool p = omp_in_parallel();
// if clause to suppress nested parallelism
#pragma omp parallel if(!p)
{
/* do some massive work in parallel */
#pragma omp master
++m;
#pragma omp single
++s;
}
}
int main()
{
fprintf(stderr,"running func() serial:\n");
m=s=0;
func();
fprintf(stderr," m=%d s=%d\n",m,s);
fprintf(stderr,"running func() parallel:\n");
m=s=0;
#pragma omp parallel
func();
fprintf(stderr," m=%d s=%d\n",m,s);
}
which creates the output
running func() serial:
m=1 s=1
running func() parallel:
m=16 s=16
Thus the first call to func() worked fine: m and s obtain the value 1 as they should, but the second call to func() from within a parallel region did create nested parallelism (with 16 teams of 1 thread each) even though this was suppressed. That is, the omp master and omp single directives bind to the preceding omp parallel if(!p) directive rather than to the outer parallel region.
Of course, one can fix this problem by the following code
void work()
{
/* do some massive work in parallel */
#pragma omp master
++m;
#pragma omp single
++s;
}
void func()
{
if(omp_in_parallel())
work();
else
#pragma omp parallel
work();
}
but this requires an additional function to be defined etc. Is it possible to do this within a single function (and without repeating code)?
The OpenMP constructs will always bind to the innermost containing construct, even if it isn't active. So I don't think it's possible while retaining the #pragma omp parallel for both code paths (at least with the information provided about the problem).
Note that it is a good thing that it behaves like this, because otherwise the use of conditionals would easily lead to very problematic (read: buggy) code. Look at the following example:
void func(void* data, int size)
{
#pragma omp parallel if(size > 1024)
{
//do some work
#pragma omp barrier
//do some more work
}
}
...
#pragma omp parallel
{
//create foo, bar; bar varies massively between different threads (so sometimes bigger, sometimes smaller than 1024)
func(foo, bar);
//do more work
}
In general a programmer should not need to know the implementation details of the called functions, only their behaviour. So I really shouldn't have to care whether func creates a nested parallel region or not, and under which exact conditions it does so. However, if the barrier bound to the outer parallel region whenever the inner one is inactive, this code would be buggy, since some threads of the outer parallel region would encounter the barrier and some wouldn't. Therefore such details stay hidden inside the innermost containing parallel, even if it isn't active.
Personally I have never encountered a situation where I wanted it to behave differently (which would go against information hiding and such), so maybe you should tell us a bit more about what you are trying to accomplish to get better answers.
I had a look at the OpenMP standard. The if clause is actually somewhat misleadingly named, since the #pragma omp parallel directive is not made conditional (as I originally thought). Instead, the if clause may restrict the number of threads to 1, thereby suppressing parallelisation.
However, this implies that omp single or omp master cannot be used for thread-safe once-per-process writes of global shared variables.
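One hedged alternative for such a once-per-process write (my own sketch, not from the answers above; the flag name done and the critical-section name func_init are made up): guard the write with a named critical section and a shared flag, which works regardless of which parallel region the constructs bind to.

#include <omp.h>

int m = 0, s = 0;

void func()
{
    static int done = 0;              // shared by the whole process
    bool p = omp_in_parallel();
    // if clause to suppress nested parallelism, as in the question
    #pragma omp parallel if(!p)
    {
        /* do some massive work in parallel */

        // named critical sections are global, so exactly one thread of
        // whichever team gets here first performs the write
        #pragma omp critical (func_init)
        {
            if (!done) { ++m; ++s; done = 1; }
        }
    }
}

Note that with this sketch the increment happens exactly once over the program's lifetime (which is what "once-per-process" means here), so the counters in the question's test harness would only change on the first call.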

How does the SECTIONS directive in OpenMP distribute work?

In OpenMP, when using omp sections, will the threads be distributed to the blocks inside the sections, or will each thread be assigned to each section?
When nthreads == 3:
#pragma omp sections
{
#pragma omp section
{
printf ("id = %d, \n", omp_get_thread_num());
}
#pragma omp section
{
printf ("id = %d, \n", omp_get_thread_num());
}
}
Output:
id=1
id=1
But when I execute the following code:
#pragma omp sections
{
#pragma omp section
{
printf ("id = %d, \n", omp_get_thread_num());
}
#pragma omp section
{
printf ("id = %d, \n", omp_get_thread_num());
}
}
#pragma omp sections
{
#pragma omp section
{
printf ("id = %d, \n", omp_get_thread_num());
}
#pragma omp section
{
printf ("id = %d, \n", omp_get_thread_num());
}
}
Output:
id=1
id=1
id=2
id=2
From this output I can't understand what the concept of sections is in OpenMP.
The code posted by the OP will never execute in parallel, because the parallel keyword does not appear. The fact that the OP got ids different from 0 shows that his code was probably embedded in a parallel directive. However, this is not clear from his post, and it might confuse beginners.
The minimum sensible example is (for the first example posted by the OP):
#pragma omp parallel sections
{
#pragma omp section
{
printf ("id = %d, \n", omp_get_thread_num());
}
#pragma omp section
{
printf ("id = %d, \n", omp_get_thread_num());
}
}
On my machine, this prints
id = 0,
id = 1,
showing that the two sections are being executed by different threads.
It's worth noting, however, that this code cannot extract more parallelism than two threads: if it is executed with more threads, the other threads don't have any work to do and will just sit idle.
The idea of parallel sections is to give the compiler a hint that the various (inner) sections can be performed in parallel, for example:
#pragma omp parallel sections
{
#pragma omp section
{
/* Executes in thread 1 */
}
#pragma omp section
{
/* Executes in thread 2 */
}
#pragma omp section
{
/* Executes in thread 3 */
}
/* ... */
}
This is a hint to the compiler and not guaranteed to happen, though it should. Your output is more or less what is expected: it says that the sections were executed by thread 1 and by thread 2. The output order is non-deterministic, as you don't know which thread will run first.
Change the first line from
#pragma omp sections
into
#pragma omp parallel sections
"parallel" directive ensures that the two sections are assigned to two threads.
Then, you will receive the following output
id = 0,
id = 1,
You are missing the parallel keyword.
The parallel keyword is what makes OpenMP run the code in parallel.
According to OpenMP standard 3.1, section 2.5.2 (emphasis mine):
The sections construct is a noniterative worksharing construct that
contains a set of structured blocks that are to be distributed among
and executed by the threads in a team. Each structured block is
executed once by one of the threads in the team in the context of its
implicit task.
...
Each structured block in the sections construct is preceded by a
section directive except possibly the first block, for which a
preceding section directive is optional. The method of scheduling the
structured blocks among the threads in the team is implementation
defined. There is an implicit barrier at the end of a sections
construct unless a nowait clause is specified.
So, applying these rules to your case, we can argue that:
the different structured blocks identified in a sections directive are executed once, by one thread. In other words, you always get four prints, whatever the number of threads
the blocks in the first sections will be executed (in a non-deterministic order) before the blocks in the second sections (also executed in a non-deterministic order). This is because of the implicit barrier at the end of the work-sharing construct
the scheduling is implementation defined, so you can't control which thread has been assigned a given section
Your output is thus due to the way your scheduler decided to assign the different blocks to the threads in the team.
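To see these rules in action, here is a minimal sketch of the OP's second test with the missing parallel region added and the prints labelled (my own variant, using three threads as in the question):

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel num_threads(3)
    {
        #pragma omp sections    // first worksharing construct
        {
            #pragma omp section
            printf("first construct, block 1, id = %d\n", omp_get_thread_num());
            #pragma omp section
            printf("first construct, block 2, id = %d\n", omp_get_thread_num());
        }                       // implicit barrier: separates the two groups of prints
        #pragma omp sections    // second worksharing construct
        {
            #pragma omp section
            printf("second construct, block 1, id = %d\n", omp_get_thread_num());
            #pragma omp section
            printf("second construct, block 2, id = %d\n", omp_get_thread_num());
        }
    }
    return 0;
}

All four lines always appear, the two "first construct" lines always come before the two "second construct" lines, and which thread prints which line is up to the implementation.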
It may be helpful to add more information to the output line and to add more sections (if you have the thread count):
#pragma omp parallel sections
{
#pragma omp section
{
printf ("section 1 id = %d, \n", omp_get_thread_num());
}
#pragma omp section
{
printf ("section 2 id = %d, \n", omp_get_thread_num());
}
#pragma omp section
{
printf ("section 3 id = %d, \n", omp_get_thread_num());
}
}
Then you may get more interesting output like this:
section 1 id = 4,
section 3 id = 3,
section 2 id = 1,
which shows how the sections may be executed in any order, by any available thread.
Note that 'nowait' tells the compiler that threads do not need to wait to exit the section. In Fortran 'nowait' goes at the end of the loop or section, which makes this more obvious.
The #pragma omp parallel is what creates (forks) the threads initially. Only once the threads have been created will the other OpenMP constructs be of significance.
Hence,
Method 1:
// this creates the threads
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{
// code here
}
#pragma omp section
{
// code here
}
}
}
or
Method 2:
// this creates the threads and creates sections in one line
#pragma omp parallel sections
{
#pragma omp section
{
// code here
}
#pragma omp section
{
// code here
}
}
If you do not want threads to wait at the end of the sections construct for the other sections to finish, the nowait clause removes that implied barrier. In C/C++ it goes on the sections directive inside the parallel region:
#pragma omp parallel
{
#pragma omp sections nowait
{
...
}
}

OpenMP in Visual Studio 2005

I am attempting to use OpenMP to create a parallel for loop in Visual Studio 2005 Professional. I have included omp.h and specified the /openmp compiler flag. However, I cannot get even the simplest parallel for loop to compile.
#pragma omp parallel for
for ( int i = 0; i < 10; ++i )
{
int a = i + i;
}
The above produces Compiler Error C3005 at the #pragma line.
Google hasn't been much help. I only found one obscure Japanese website with a user having similar issues. No mention of a resolution.
A standard parallel block compiles fine.
#pragma omp parallel
{
// Do some stuff
}
That is until you try to add a for loop.
#pragma omp parallel
{
#pragma omp for
for ( int i = 0; i < 10; ++i )
{
int a = i + i;
}
}
The above causes Compiler Error C3001. It seems 'for' is confusing to the compiler, but it shouldn't be. Any ideas?
I found the problem. Some genius defined the following macro deep within the headers:
#define for if ( false ) ; else for
My only guess is this was used to get variables declared in for loops to scope properly in Visual C++ 6. Undefining or commenting out the macro resolved the issue.
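A hedged sketch of that workaround (my own illustration, not from the answer): undefine the macro in the translation units that use OpenMP, so that the pragma is followed by a real for statement again.

#include <omp.h>

// some legacy header defines:  #define for if ( false ) ; else for
// drop that definition here so that '#pragma omp parallel for' is
// immediately followed by an actual for loop
#ifdef for
#undef for
#endif

void demo()
{
    #pragma omp parallel for
    for (int i = 0; i < 10; ++i)
    {
        int a = i + i;
        (void)a;    // silence the unused-variable warning
    }
}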
