Why can't OpenMP have sections inside a for loop?

I am trying to parallelize a range-based for loop in OpenMP, by making each iteration of the loop an OpenMP section. I want to do this:
#pragma omp parallel sections
{
    for ( auto x : range ) {
        #pragma omp section
        // LOTS OF CODE HERE
    }
}
However, this doesn't work - the compiler complains that the section must be within a #pragma omp sections construct. Am I doing something wrong, or is OpenMP not smart enough to understand that this section is within a for loop?

OpenMP sections are for somewhat unrelated bits of code that can be executed in parallel. The idea is that within your algorithm, there are different parts which aren't particularly structured, but which could be executed in any order and even in parallel. This is a very ad-hoc way of parallelizing, which you don't expect to see much in real code.
Anyway, your code isn't standard-compliant, since all blocks within a sections construct must be enclosed in a section block (except the first one, for which this is optional, as it is implicitly enclosed in a first section block). Here, if you were to add this implicit #pragma omp section right before your for loop, you'd see how little sense the code would make: the loop would sit in one section and its body in another...
Here, since you have a for loop, this structures your code very well and should permit you to use an omp parallel for directive. You only need to rewrite your for loop a bit to make the loop boundaries explicit.
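For instance, assuming range is a random-access container such as a std::vector (an assumption, since the question doesn't say), a minimal sketch of the rewrite could look like this:
#include <vector>

void process(std::vector<int> &range) {
    // Explicit loop boundaries let OpenMP split the iterations
    // among threads.
    const long n = static_cast<long>(range.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        auto &x = range[i];
        x *= 2; // stands in for "LOTS OF CODE HERE"
    }
}
For what it's worth, OpenMP 5.0 added range-based for loops to the canonical loop forms, so a recent compiler that implements it can accept the range-based loop directly under parallel for.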

Related

openacc declare issue with pow function in C++ within acc compute region

I'm trying to use OpenACC to accelerate some of my code. One portion of the code uses the pow() function from the standard library. However, there is an error during compilation:
PGCC-S-0155-Procedures called in a compute region must have acc routine information: pow
I roughly know that I need to declare #pragma acc routine seq in order to remove this type of error. But as I understand it, this needs to be added in the source code of the function (I might be wrong). So how can I work around this?
Include accelmath.h instead of cmath.
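A minimal sketch of what that looks like, assuming the PGI/NVIDIA HPC compiler (the function and parameter names are illustrative):
#include <accelmath.h> // device-capable math routines; used in place of <cmath>

void raise_all(double *x, int n, double e) {
    // pow now resolves to a device-callable version, so no
    // "#pragma acc routine" declaration is required for it.
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        x[i] = pow(x[i], e);
}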

gcc/clang: How to force ordering of items on the stack?

Consider the following code:
int a;
int b;
Is there a way to force that a precedes b on the stack?
One way to do the ordering would be to put b in a function:
void foo() {
    int b;
}
...
int a;
foo();
However, that would generally work only if foo isn't inlined.
Maybe there's a different way to do that? Putting some inline assembly between the two declarations might do the trick, but I am not sure.
Your initial question was about forcing a function call to not be inlined.
To improve on Jordy Baylac's answer, you might try declaring the function within the block calling it, perhaps using a statement expression:
#define FOO_WITHOUT_INLINING(c,i) ({ \
extern int foo (char, int) __attribute__((noinline)); \
int r = foo(c,i); \
r; })
(If the type of foo is unknown, you could use typeof.)
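A hypothetical call site (foo and its arguments stand in for whatever your real code passes):
int process(char c) {
    // Expands to a GCC statement expression: the local extern
    // declaration carries the noinline attribute, and the value of
    // the expression is r.
    int v = FOO_WITHOUT_INLINING(c, 42);
    return v;
}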
However, I still think that your question is badly formulated (and is meaningless if one avoids reading your comments, which should really go inside the question; it should also have mentioned your libmill). By definition of inlining, a compiler can inline any function it wants without changing the semantics of the program.
For example, a user of your library might legitimately compile it with -flto -O2 (at both the compile and link stages). I don't know what would happen then.
I believe you might redesign your code, perhaps using -fsplit-stack. Are you implementing some call/cc in C? Then look inside the numerous existing implementations of it, and inside Gabriel Kerneis' CPC. See also setcontext(3) & longjmp(3).
Perhaps you might need the returns_twice (and/or nothrow) function attribute of GCC somewhere, or some _Pragma such as GCC optimize.
Then you edited your question to change it completely (asking about the order of variables on the call stack), still without mentioning your libmill and its go macro in the question itself (as you should have; comments are volatile, so they should not carry most of the question).
But a C compiler is not even supposed to have a call stack in the compiled program (a hypothetical C99-conforming compiler could do whole-program optimization to avoid any call stack). GCC is certainly allowed to keep some variables outside of the call stack (e.g. only in registers), and it does exactly that. And some implementations (IA-64, probably) have two call stacks.
So your changed question is completely meaningless: a variable might not sit on the stack at all (it may live only in a register, or even disappear entirely if the compiler can prove it is useless after other optimizations), and the compiler is allowed to use the same call-stack slot for two variables (GCC does such an optimization quite often). So you cannot force any order on the call-stack layout.
If you need to be sure that two local variables a & b have a well-defined order on the call stack, put them into a struct, e.g.
struct { int _a, _b; } _locals;
#define a _locals._a
#define b _locals._b
Then be sure to store &_locals somewhere (e.g. in a volatile global or thread-local variable), since some versions of GCC (IIRC 4.7 or 4.8) had optimization passes that reorder the fields of non-escaping structs.
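A minimal sketch of that escape trick, using the struct and #defines above (locals_escape is a made-up name):
volatile void *locals_escape; // storing &_locals here makes the struct
                              // escape, blocking field reordering and
                              // elimination

void use_locals(void) {
    struct { int _a, _b; } _locals;
    locals_escape = (void *)&_locals;
    a = 1; // expands to _locals._a = 1;
    b = 2; // expands to _locals._b = 2;
}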
BTW, you might customize GCC with a MELT extension to help with that (e.g. introduce your own builtin or pragma doing part of the work).
Apparently, you are inventing some new dialect of C (à la CPC); if so, you should say that!
Below is a way, using GCC attributes:
char foo (char, int) __attribute__ ((noinline));
And, as I said, you can try the -fno-inline-functions option, but that applies to all functions being compiled.
It is still unclear to me why you want the function not to be inlined, but here is a simple solution I am proposing:
You can put this function in a separate object file, something.o.
Since you will include only the header, the compiler will have no way to inline the function.
However, the linker might still decide to inline it later, at link time.
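A sketch of that layout (file names and the signature of foo are illustrative):
/* something.h */
int foo(char c, int i);

/* something.c -- compiled separately into something.o */
int foo(char c, int i) { return c + i; }

/* main.c -- sees only the declaration, so the compiler cannot
   inline foo here (unless link-time optimization is enabled) */
#include "something.h"
int main(void) { return foo('a', 1); }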

Dynamic/Nested Parallelism of GPU with OpenMP programming model

My question is related to the declare target construct of OpenMP and the dynamic/nested parallelism feature of GPUs. OpenACC 2.0 supports dynamic parallelism in two ways: the routine directive, and using parallel/kernels directives in a nested way. But using nested device directives such as target, teams or distribute is prohibited in OpenMP. Yet the declare target directive seems to me similar to the routine directive of OpenACC.
However, its semantics and usage are still not clear to me. My sample code looks like this:
#pragma omp declare target
void foo() {
    // work-1
    #pragma omp target teams distribute parallel for
    for (...) {
        if (...) {
            foo();
            // work-2
        }
        // work-3
    }
}
#pragma omp end declare target

int main() {
    // work
    foo();
}
In that example, I am expecting that the function foo will be mapped as a device function. But since it has a target construct inside, the compiler will have to transform this code in some way. My question is: what will happen when a thread encounters the invocation of foo inside the target construct?
Currently, OpenMP does not support nesting of target regions. So your code will most likely not compile, or it may crash at runtime when the nested target construct is encountered, or produce unexpected results. Basically, the result of a non-conforming program is unspecified.
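If the intent is simply a device function callable from within a target region, a conforming restructure keeps the target construct at the call site. A sketch (work, N and the loop body are made up; the inner target construct and recursion are lifted out):
#define N 1024

#pragma omp declare target
/* A plain device function: no target/teams/distribute inside. */
int work(int i) { return i * i; }
#pragma omp end declare target

int main(void) {
    int out[N];
    /* The target construct stays at the call site; work() is mapped
       as a device function via declare target. */
    #pragma omp target teams distribute parallel for map(from: out)
    for (int i = 0; i < N; ++i)
        out[i] = work(i);
    return out[0];
}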

cython: private variables in prange

I am trying to re-write in Cython a Fortran subroutine that uses OpenMP. I have had no difficulty re-writing the Fortran subroutine itself in Cython, and the non-OpenMP version works fine. However, I am not sure what to do about the OpenMP directive:
!$omp parallel do private(x, y, z)
In Cython, I understand that you get the OpenMP parallel do using cython.parallel.prange. However, I don't see how to declare private variables for the loop.
Is this even possible?
Thanks for any insight you can provide.
Yes, it is possible. The docs say that:
If you assign to a variable in a prange block, it becomes lastprivate, meaning that the variable will contain the value from the last iteration. […] Variables assigned to in a parallel with block will be private and unusable after the block, as there is no concept of a sequentially last value.
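For reference, here is a small C/C++ sketch of the OpenMP semantics that prange maps onto (not Cython code; the variable names are illustrative):
#include <cstdio>

int main() {
    int last = 0;
    /* "x" is private because it is declared inside the loop body;
       "last" is lastprivate, keeping the value of the sequentially
       final iteration - the same treatment Cython gives to variables
       assigned inside a prange block. */
    #pragma omp parallel for lastprivate(last)
    for (int i = 0; i < 100; ++i) {
        int x = i * i;
        last = x;
    }
    std::printf("%d\n", last); /* prints 9801 (99 * 99) */
    return 0;
}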

Do #defines decrease performance?

I would like to know: if we #define a particular function-like macro like this,
#define POST_NOTIFICATION(NAME, OBJECT) [[NSNotificationCenter defaultCenter] postNotificationName:NAME object:OBJECT]
will using the above decrease performance?
No. #define is a preprocessor directive, meaning that anywhere the preprocessor sees the POST_NOTIFICATION symbol, it will replace it with the [[NSNotificationCenter ...]] code.
No, it won't decrease performance. #define directives are preprocessor directives, which are replaced in the code before compiling, so the final binary code is the same.
However, you should refrain from overusing defines. It might seem to make the code more readable, but it doesn't really do so; and for one-liners like your example, the benefit is probably not that high.
Still, to answer your question: the final binary is no different if you use this construct, so there is no performance decrease.
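To see why, in plain C (SQUARE is a made-up example; the Objective-C macro above behaves the same way):
#include <stdio.h>

#define SQUARE(x) ((x) * (x))

int main(void) {
    /* The preprocessor rewrites SQUARE(3) into ((3) * (3)) before the
       compiler runs, so both lines compile to identical code. */
    printf("%d\n", SQUARE(3));
    printf("%d\n", ((3) * (3)));
    return 0;
}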
