Dynamic/Nested Parallelism of GPU with OpenMP programming model


My question is related to the declare target construct of OpenMP and the dynamic/nested parallelism feature of GPUs. OpenACC 2.0 supports dynamic parallelism in two ways: the routine directive, and nested use of the parallel/kernels directives. In OpenMP, however, nesting device directives such as target, teams or distribute is prohibited. The declare target directive nevertheless looks to me similar to OpenACC's routine directive.
However, its semantics and usage are still not clear to me. My sample code looks like this:
#pragma omp declare target
void foo(){
    //work-1
    #pragma omp target teams distribute parallel for
    for(...)
    {
        if(...){
            foo();
        }
        //work-2
    }
    //work-3
}
#pragma omp end declare target
int main(){
    //work
    foo();
}
In this example, I expect the function foo to be mapped as a device function. But since it contains a target construct, the compiler will transform the code in some way. My question is: what happens when the threads encounter the call to foo inside the target construct?

OpenMP currently does not support nesting of target regions, so your code will most likely not compile. It may also crash at runtime when the nested target construct is encountered, or produce unexpected results. Basically, the behavior of a non-conforming program is unspecified.
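For contrast, here is a minimal sketch of a conforming use of declare target, assuming the device function itself contains no target construct. The names work and sum_of_squares are hypothetical and not taken from the question; the single target region lives in the host function and only calls the device function.

```cpp
#pragma omp declare target
// Hypothetical device-callable helper: note there is no target construct inside.
int work(int i) { return i * i; }
#pragma omp end declare target

// Host function that owns the one (non-nested) target region.
int sum_of_squares(int n) {
    int sum = 0;
    #pragma omp target teams distribute parallel for map(tofrom: sum) reduction(+: sum)
    for (int i = 0; i < n; ++i) {
        sum += work(i);  // call into the declare target function from the device
    }
    return sum;
}
```

Without OpenMP or offloading enabled, the pragmas are ignored or the region falls back to the host, and the function computes the same value.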

Related

Compile-time AVX detection when using multi-versioning

I have quite a big function compiled for two different architectures:
__attribute__ ((target ("arch=broadwell"))) void doStuff()
{
doStuffImpl();
}
__attribute__ ((target ("arch=nocona"))) void doStuff()
{
doStuffImpl();
}
__attribute__((always_inline)) void doStuffImpl()
{
(...)
}
I know this is the old way of doing multi-versioning, but I'm using gcc 4.9.3. Also, doStuffImpl() is actually not a single function but a bunch of functions with inlining, where doStuff() is the last actual function call, but I don't think that changes anything.
The function contains some code that is auto-vectorized by the compiler, but I also need to add some hand-crafted intrinsics there, obviously different in the two flavours.
The question is: how can I recognise at compile time which SIMD extensions are available?
I tried something like:
#ifdef __AVX2__
AVX_intrinsics();
#elif defined __SSE4_2__
SSE_intrinsics();
#endif
But it seems that the defines come from the "global" -march flag, not the one from the multiversioning override.
Godbolt (intrinsics are garbage, but shows my point)
I could extract this part into a separate multiversioned function, but that would add the cost of dispatching and a function call.
Is there any way to differentiate at compile time between the two multiversioned variants of a function?

As answered in the comments:
I'd recommend moving each of the CPU targets to a separate translation unit, which is compiled with the corresponding compiler flags. The common doStuffImpl function can be implemented in a header, included in each of the TUs. In that header, you can use predefined macros like __AVX__ to test for available ISA extensions. The __attribute__((target)) attributes are no longer needed and can be removed in this case.

Why can't OpenMP have sections inside of for loop?

I am trying to parallelize a range-based for loop in OpenMP, by making each iteration of the loop an OpenMP section. I want to do this:
#pragma omp parallel sections
{
for ( auto x : range ) {
#pragma omp section
// LOTS OF CODE HERE
}
}
However, this doesn't work: the compiler complains that the section must be within a #pragma omp sections construct. Am I doing something wrong, or is OpenMP not smart enough to understand that this section is within a for loop?

OpenMP sections are for somewhat unrelated bits of code that can be executed in parallel. The idea is that your algorithm has different parts which aren't highly organized, but which could be executed in any order and even in parallel. This is a very ad-hoc way of parallelizing, which you don't expect to see much in code.
Anyway, your code isn't standard compliant, since every block within a sections construct must be enclosed in a section block (except the first, for which this is optional as it is implicitly enclosed in a first section block). If you were to add this implicit #pragma omp section right before your for loop, you'd see how little sense the code makes: the loop is in one section and its body in another...
Since you have a for loop, your code is already well structured and should let you use an omp parallel for directive. You only need to rewrite your for loop a bit to make the loop bounds explicit.
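As a rough sketch of that rewrite, assuming the range is a std::vector<int> and that iterations are independent, the loop can be indexed explicitly. The function process is a hypothetical stand-in for "LOTS OF CODE HERE", and process_all is likewise an illustrative name.

```cpp
#include <vector>

// Hypothetical per-element work standing in for the loop body.
int process(int x) { return 2 * x; }

std::vector<int> process_all(const std::vector<int>& range) {
    std::vector<int> out(range.size());
    // Hoist the bound so the loop has explicit, loop-invariant boundaries.
    const int n = static_cast<int>(range.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        out[i] = process(range[i]);  // each iteration writes its own slot
    }
    return out;
}
```

Each iteration writes to a distinct element of out, so no synchronization is needed in this sketch.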

gcc/clang: How to force ordering of items on the stack?

Consider the following code:
int a;
int b;
Is there a way to force that a precedes b on the stack?
One way to do the ordering would be to put b in a function:
void foo() {
int b;
}
...
int a;
foo();
However, that would generally work only if foo isn't inlined.
Maybe there's a different way to do that? Putting inline assembler between the two declarations may do the trick, but I am not sure.

Your initial question was about forcing a function call to not be inlined.
To improve on Jordy Baylac's answer, you might try to declare the function within the block calling it, and perhaps use a statement expr:
#define FOO_WITHOUT_INLINING(c,i) ({ \
extern int foo (char, int) __attribute__((noinline)); \
int r = foo(c,i); \
r; })
(If the type of foo is unknown, you could use typeof)
However, I still think that your question is badly formulated (and is meaningless if one does not read your comments, which should really go inside the question, which should have mentioned your libmill). By definition of inlining, a compiler can inline any function it wants without changing the semantics of the program.
For example, a user of your library might legitimately compile it with -flto -O2 (both at compiling and at linking stage). I don't know what would happen then.
I believe you might redesign your code, perhaps using -fsplit-stack; are you implementing some call/cc in C? Then look inside the numerous existing implementations of it, and inside Gabriel Kerneis' CPC.... See also setcontext(3) & longjmp(3).
Perhaps you might need to use somewhere the returns_twice (and/or nothrow) function attribute of GCC, or some _Pragma like GCC optimize.
Then you edited your question to change it completely (asking about order of variables on the call stack), still without mentioning in the question your libmill and its go macro (as you should; comments are volatile so should not contain most of the question).
But the C compiler is not even supposed to have a call stack (a hypothetical C99 conforming compiler could do whole program optimization to avoid any call stack) in the compiled program. And GCC is certainly allowed to put some variables outside of the call stack (e.g. only in registers) and it is doing that. And some implementations (IA64 probably) have two call stacks.
So your changed question is completely meaningless: a variable might not sit on the stack (e.g. only be in a register, or even disappear completely if the compiler can prove it is useless after some other optimizations), and the compiler is allowed to optimize and use the same call stack slot for two variables (and GCC is doing such an optimization quite often). So you cannot force any order on the call stack layout.
If you need to be sure that two local variables a & b have some well defined order on the call stack, make them into a struct e.g.
struct { int _a, _b; } _locals;
#define a _locals._a
#define b _locals._b
Then be sure to put the &_locals somewhere (e.g. in a volatile global or thread-local variable), since some versions of GCC (IIRC 4.8 or 4.7) had optimization passes that reorder the fields of non-escaping structs.
BTW, you might customize GCC with your MELT extension to help about that (e.g. introduce your own builtin or pragma doing part of the work).
Apparently, you are inventing some new dialect of C (à la CPC); then you should say that!

Below is a way, using gcc attributes:
char foo (char, int) __attribute__ ((noinline));
And, as I said, you can try the -fno-inline-functions option, but this applies to all functions in the compilation.

It is still unclear to me why you want the function not to be inlined, but here is the non-pro solution I am proposing:
You can put this function in a separate object file, something.o.
Since you will include only the header, there will be no way for the compiler to inline the function.
However, the linker might decide to inline it later at link time.

Storing pairs in a GCC rope with c++11

I'm using a GCC extension rope to store pairs of objects in my program and am running into some C++11 related trouble. The following compiles under C++98
#include <ext/rope>
typedef std::pair<int, int> std_pair;
int main()
{
__gnu_cxx::rope<std_pair> r;
}
but not with C++11 under G++ 4.8.2 or 4.8.3.
What happens is that the uninitialized_copy_n algorithm is pulled in from two places: ext/memory and the C++11 version of the memory header. The __gnu_cxx namespace is pulled in by rope and the std namespace is pulled in by pair, so there are now two identically defined functions in scope, leading to a compile error.
I assume this is a bug in a weird use case for a rarely used library, but what would be the correct fix? You can't remove the function from ext/memory without breaking existing code, and it is now required to be in std. I've worked around it using my own pair class, but how should this be fixed properly?

If changing the libstdc++ headers is an option (and I asked in the comments whether you were looking for a way to fix it in libstdc++, or work around it in your program), then the simple solution, to me, seems to be to make sure there is only one uninitialized_copy_n function. ext/memory already includes <memory>, which provides std::uninitialized_copy_n. So instead of defining __gnu_cxx::uninitialized_copy_n, it can have using std::uninitialized_copy_n; inside the __gnu_cxx namespace. It can even conditionalize this on C++11 support, so that pre-C++11 code gets the custom implementation of those functions, and C++11 code gets the std implementation of those functions.
This way, code that attempts to use __gnu_cxx::uninitialized_copy_n, whether directly or through ADL, will continue to work, but there is no ambiguity between std::uninitialized_copy_n and __gnu_cxx::uninitialized_copy_n, because they are the very same function.
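The effect can be sketched in isolation with two stand-in namespaces. Everything here is hypothetical: core plays the role of std, ext plays the role of __gnu_cxx, and copy_count stands in for uninitialized_copy_n. This is not the libstdc++ source, only an illustration of why a using-declaration removes the ambiguity.

```cpp
namespace core {                       // stands in for std
struct Pair { int first, second; };
template <typename T>
int copy_count(const T*, int n) { return n; }
}

namespace ext {                        // stands in for __gnu_cxx
using core::copy_count;                // re-export instead of redefining
template <typename T> struct Rope {};
}

int call_via_adl() {
    ext::Rope<core::Pair>* p = nullptr;
    // Argument-dependent lookup searches both ext (the template) and
    // core (the template argument); both names denote one function.
    return copy_count(p, 3);
}
```

If ext defined its own identical copy_count template instead of the using-declaration, the unqualified call would be ambiguous, which mirrors the reported error.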

gcc - gdb - pretty print stl

I'm currently doing some research on the STL, especially for printing the STL content during debug. I know there are many different approaches.
Like:
http://sourceware.org/gdb/wiki/STLSupport
or using a shared library to print the content of a container
What I'm currently trying to find out is why g++ removes functions that are not used. For example, I have the following code and compile it with g++ -g main.cpp -o main.o.
#include <vector>
#include <iostream>
using namespace std;
int main() {
std::vector<int> vec;
vec.push_back(10);
vec.push_back(20);
vec.push_back(30);
return 0;
}
So when I debug this code, I see that I can't use print vec.front(). The message I receive is:
Cannot evaluate function -- may be inlined
Therefore I tried the -fkeep-inline-functions option, but nothing changed.
When I use nm main.o | grep front I see that there is no entry for the method .front(). Doing the same again, but with an extra vec.front() call in my code, I can use print vec.front(), and nm main.o | grep front shows the entry
0000000000401834 W _ZNSt6vectorIiSaIiEE5frontEv
Can someone explain how I can keep all functions in my code without losing them? I thought that dead functions do not get deleted as long as I don't enable optimization, or do the following:
How to tell compiler to NOT optimize certain code away?
Why I need it: the current Python implementations use the internal STL implementation to print the content of a container, but it would be much more interesting to use the functions defined by ISO/IEC 14882. I know it's possible to write a shared library that is compiled into your code before you debug it, to make sure all STL functions are available, but who wants to compile an extra lib into their code before debugging? It would also be interesting to know the advantages and disadvantages of these two approaches (shared library and Python).

What's exactly a dead function, isn't it a function which is available in my source code but isn't used?
There are two cases to consider:
int unused_function() { return 42; }
int main() { return 0; }
If you compile the above program, the unused_function is dead -- never called. However, it would still be present in the final executable (even with optimization [1]).
Now consider this:
template <typename T> int unused_function(T*) { return 42; }
int main() { return 0; }
In this case, unused_function will not be present, even when you turn off all optimizations.
Why? Because the template is not a "real" function. It's a prototype, from which the compiler can create "real" functions (called "template instantiation") -- one for each type T. Since you've never used unused_function, the compiler didn't create any "real" instances of it.
You can request that the compiler explicitly instantiate all functions in a given class, with explicit instantiation request, like so:
#include <vector>
template class std::vector<int>;
int main() { return 0; }
Now, even though none of the vector functions are used, they are all instantiated into the final binary.
[1] If you are using the GNU ld (or gold), you could still get rid of unused_function in this case, by compiling with -ffunction-sections and linking with -Wl,--gc-sections.
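The same mechanism works for a single function template rather than a whole class. A small sketch, reusing the unused_function template from above with an explicit instantiation for int (the choice of int is only an illustration):

```cpp
template <typename T>
int unused_function(T*) { return 42; }

// Explicit instantiation definition: asks the compiler to emit
// unused_function<int> even though nothing in this file calls it.
template int unused_function<int>(int*);
```

With this line present, the instantiated function is emitted into the object file, so a debugger has something to call.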

Thanks for your answer. Just to repeat: template functions don't get instantiated by gcc, because they are prototypes. Only when the function is used or explicitly instantiated will it be available in my executable.
So what we have mentioned so far is:
function definition int unusedFunc() { return 10; }
function prototype int protypeFunc(); (just to break it down)
What happens when you inline functions? I always thought that the function would be inserted into my source code, but now I read that compilers often decide on their own what to do (which sounds strange, because there must be a rule). It doesn't matter whether you use the keyword inline, for example:
inline int inlineFunc() { return 10; }
A friend of mine also told me that he couldn't access the addresses of functions, although he hadn't used inline. Are there any function types I forgot? He also told me that there should be differences in the object data format.
#edit - forgot:
nested functions
function pointers
overloaded functions
