Compile-time AVX detection when using multi-versioning - gcc

I have quite big function compiled for two different architectures:
__attribute__ ((target ("arch=broadwell"))) void doStuff()
{
doStuffImpl()
}
__attribute__ ((target ("arch=nocona"))) void doStuff()
{
doStuffImpl();
}
__attribute__((always_inline)) void doStuffImpl()
{
(...)
}
I know this is old way of doing multi-versioning, but I'm using gcc 4.9.3. Also actually doStuffImpl() is not single function, but bunch of functions with inlining, where doStuff() is last actual function call, but I don't think it changes anything.
Function contains some code that is auto-vectorized by compiler, but also I need to add some hand-crafted intrinsics there. Obviously different in two different flavours.
Question is: how can I recognise in compile-time which SIMD extensions are available?
I was trying something like:
#ifdef __AVX2__
AVX_intrinsics();
#elif defined __SSE4.2__
SSE_intrinsics();
#endif
But it seems that defines comes from "global" -march flag, not the one from multiversioning override.
Godbolt (intrinsics are garbage, but shows my point)
I could extract this part and do separate multiversioned function, but that would add cost of dispatching and function call.
Is there any way to do compile time differentiation of two multiversioning variants of function?

As answered in the comments:
I'd recommend moving each of the CPU targets to a separate translation unit, which is compiled with the corresponding compiler flags. The common doStuffImpl function can be implemented in a header, included in each of the TUs. In that header, you can use predefined macros like __AVX__ to test for available ISA extensions. The __attribute__((target)) attributes are no longer needed and can be removed in this case.

Related

gcc/clang: How to force ordering of items on the stack?

Consider the following code:
int a;
int b;
Is there a way to force that a precedes b on the stack?
One way to do the ordering would be to put b in a function:
void foo() {
int b;
}
...
int a;
foo();
However, that would generally work only if b isn't inlined.
Maybe there's a different way to do that? Putting an inline assembler between the two declarations may do a trick, but I am not sure.
Your initial question was about forcing a function call to not be inlined.
To improve on Jordy Baylac's answer, you might try to declare the function within the block calling it, and perhaps use a statement expr:
#define FOO_WITHOUT_INLINING(c,i) ({ \
extern int foo (char, int) __attribute__((noinline)); \
int r = foo(c,i); \
r; })
(If the type of foo is unknown, you could use typeof)
However, I still think that your question is badly formulated (and is meaningless, if one avoid reading your comments which should really go inside the question, which should have mentioned your libmill). By definition of inlining, a compiler can inline any function as it wants without changing the semantics of the program.
For example, a user of your library might legitimately compile it with -flto -O2 (both at compiling and at linking stage). I don't know what would happen then.
I believe you might redesign your code, perhaps using -fsplit-stack; are you implementing some call/cc in C? Then look inside the numerous existing implementations of it, and inside Gabriel Kerneis CPC.... See also setcontext(3) & longjmp(3)
Perhaps you might need to use somewhere the return_twice (and/or nothrow) function attribute of GCC, or some _Pragma like GCC optimize
Then you edited your question to change it completely (asking about order of variables on the call stack), still without mentioning in the question your libmill and its go macro (as you should; comments are volatile so should not contain most of the question).
But the C compiler is not even supposed to have a call stack (an hypothetical C99 conforming compiler could do whole program optimization to avoid any call stack) in the compiled program. And GCC is certainly allowed to put some variables outside of the call stack (e.g. only in registers) and it is doing that. And some implementations (IA64 probably) have two call stacks.
So your changed question is completely meaniningless: a variable might not sit on the stack (e.g. only be in a register, or even disappear completely if the compiler can prove it is useless after some other optimizations), and the compiler is allowed to optimize and use the same call stack slot for two variables (and GCC is doing such an optimization quite often). So you cannot force any order on the call stack layout.
If you need to be sure that two local variables a & b have some well defined order on the call stack, make them into a struct e.g.
struct { int _a, _b; } _locals;
#define a _locals._a
#define b _locals._b
then, be sure to put the &_locals somewhere (e.g. in a volatile global or thread-local variable). Since some versions of GCC (IIRC 4.8 or 4.7) had some optimization passes to reorder the fields of non-escaping struct-s
BTW, you might customize GCC with your MELT extension to help about that (e.g. introduce your own builtin or pragma doing part of the work).
Apparently, you are inventing some new dialect of C (à la CPC); then you should say that!
below there is a way, using gcc attributes:
char foo (char, int) __attribute__ ((noinline));
and, as i said, you can try -fno-inline-functions option, but this is for all functions in the compilation process
It is still unclear for me why you want function not to be inline-d, but here is non-pro solution I am proposing:
You can make this function in separate object something.o file.
Since you will include header only, there will be no way for the compiler to inline the function.
However linker might decide to inline it later at linking time.

What is the correct jmp_buf size?

I got a library compiled wit GCC for the ARM Cortex-M3 processor compiled as static lib. This library has a jmp_buf at its interface.
struct png_struct_def {
#ifdef PNG_SETJMP_SUPPORTED
jmp_buf jmpbuf;
#endif
png_error_ptr error_fn;
// a lot more members ...
};
typedef png_struct_def png_struct;
When I pass a png_struct address to the library function it stores the value not to error_fn but the last field of jmpbuf. Obviously it has been compiled with another assumptions of the size of a jmp_buf.
Both compiler versions are arm-none-eabi-gcc. Why is the code incompatible. And what is the "correct" jmp_buf size? I can see in the disassembly that only the first half of the jmp_buf is used. Why does the size changes between the versions of GCC when it is too large anyway?
Edit:
The library is compiled with another library that I can't recompile because the source code is not available. This other library uses this interface. So I can't change the structure of the interface.
You may simply re-order the data declaration. I would suggest the following,
typedef struct if {
int some_value;
union
{
jmp_buf jmpbuf;
char pad[511];
} __attribute__ ((__transparent_union__));
} *ifp;
The issue is that depending on the ARM library, different registers maybe saved. At a maximum, 16 32bit general purpose registers and 32 64bit NEON registers might be saved. This gives around 320 bytes. If you struct is not used many times, then you can over-allocate. This should work no matter which definition of jmp_buf you get.
If you can not recompile the library, you may try to use,
typedef struct if {
char pad[LIB_JMPBUF_SZ];
int some_value;
} *ifp;
where you calculated the jmp_buf size. The libc may have changed the definition of jmp_buf between versions. Also, even though the compiler names match, one may support floating point and another one does not, etc. Even if the versions match, it is conceivable that the compiler configuration can give different jmp_buf sizes.
Both suggestion are non-portable. The 2nd suggestion will not work if your code calls setjmp() or longjmp(). Ie, I assume that the library is using these functions and the caller allocates the space.

gcc - gdb - pretty print stl

I'm currently doing some research on the STL, especially for printing the STL content during debug. I know there are many different approaches.
Like:
http://sourceware.org/gdb/wiki/STLSupport
or using a shared library to print the content of a container
What I'm currently looking for is, why g++ deletes functions, which are not used for example I have following code and use the compile setting g++ -g main.cpp -o main.o.
include <vector>
include <iostream>
using namespace std;
int main() {
std::vector<int> vec;
vec.push_back(10);
vec.push_back(20);
vec.push_back(30);
return;
}
So when I debug this code I will see that I can't use print vec.front(). The message I receive is:
Cannot evaluate function -- may be inlined
Therefore I tried to use the setting -fkeep-inline-functions, but no changes.
When i use nm main.o | grep front I see that there is no line entry for the method .front(). Doing the same again but, with an extra vec.front() entry within my code I can use print vec.front(), and using nm main.o | grep front where I see the entry
0000000000401834 W _ZNSt6vectorIiSaIiEE5frontEv
Can someone explain me how I can keep all functions within my code without loosing them. I think, that dead functions do not get deleted as long as I don't set optimize settings or do following.
How to tell compiler to NOT optimize certain code away?
Why I need it: Current Python implementations use the internal STL implementation to print the content of a container, but it would be much more interesting to use functions which are defined by ISO/IEC 14882. I know it's possible to write a shared library, which can be compiled to your actual code before you debug it, to maintain that you have all STL functions, but who wants to compile an extra lib to its code, before debugging. It would also be interesting to know if there are some advantages and disadvantages of this two approaches (Shared Lib. and Python)?
What's exactly a dead function, isn't it a function which is available in my source code but isn't used?
There are two cases to consider:
int unused_function() { return 42; }
int main() { return 0; }
If you compile above program, the unused_function is dead -- never called. However, it would still be present in the final executable (even with optimization [1]).
Now consider this:
template <typename T> int unused_function(T*) { return 42; }
int main() { return 0; }
In this case, unused_function will not be present, even when you turn off all optimizations.
Why? Because the template is not a "real" function. It's a prototype, from which the compiler can create "real" functions (called "template instantiation") -- one for each type T. Since you've never used unused_function, the compiler didn't create any "real" instances of it.
You can request that the compiler explicitly instantiate all functions in a given class, with explicit instantiation request, like so:
#include <vector>
template class std::vector<int>;
int main() { return 0; }
Now, even though none of the vector functions are used, they are all instantiated into the final binary.
[1] If you are using the GNU ld (or gold), you could still get rid of unused_function in this case, by compiling with -ffunction-sections and linking with -Wl,--gc-sections.
Thanks for your answer. Just to repeat, template functions don't get initiated by the gcc, because they are prototypes. Only when the function is used or it gets explicitly initiated it will be available within my executable.
So what we have mentioned until yet is :
function definition int unusedFunc() { return 10; }
function prototype int protypeFunc(); (just to break it down)
What happens when you inline functions? I always thought, that the function will be inserted within my source code, but now I read, that compilers often decide what to do on their own. (Sounds strange, because their must be rule). It doesn't matter if you use the keyword inline, for example.
inline int inlineFunc() { return 10; }
A friend of mine also told me that he hasn't had access to addresses of functions, although he hasn't used inline. Are there any function types I forgot? He also told me that their should be differences within the object data format.
#edit - forgot:
nested functions
function pointers
overloaded functions

Equivalent for GCC's naked attribute

I've got an application written in pure C, mixed with some functions that contain pure ASM. Naked attribute isn't available for x86 (why? why?!) and my asm functions don't like when prologue and epilogue is messing with the stack. Is it somehow possible to create a pure assembler function that can be referenced from C code parts? I simply need the address of such ASM function.
Just use asm() outside a function block. The argument of asm() is simply ignored by the compiler and passed directly on to the assembler. For complex functions a separate assembly source file is the better option to avoid the awkward syntax.
Example:
#include <stdio.h>
asm("_one: \n\
movl $1,%eax \n\
ret \n\
");
int one();
int main() {
printf("result: %d\n", one());
return 0;
}
PS: Make sure you understand the calling conventions of your platform. Many times you can not just copy/past assembly code.
PPS: If you care about performance, use extended asm instead. Extended asm essentially inlines the assembly code into your C/C++ code and is much faster, especially for short assembly functions. For larger assembly functions a seperate assembly source file is preferable, so this answer is really a hack for the rare case that you need a function pointer to a small assembly function.
Good news everyone. GCC developers finally implemented attribute((naked)) for x86. The feature will be available in GCC 8.
Certainly, just create a .s file (assembly source), which is run through gas (the assembler) to create a normal object file.

GCC 4.7 fails to inline with message "function body not available"

I am trying to compile some legacy code with more modern toolchains. I have tracked down one of my issues to the switch from gcc 4.6 to gcc 4.7:
Some of the functions are annotated with the inline keyword. Gcc fails on these with the error message:
error: inlining failed in call to always_inline 'get_value_global': function body not available
What is the correct way of dealing with this issue? How could the function body not be available? Should the compiler not make sure that it is available in all situations that require it?
Edit
As requested (in a deleted comment), an example of a signature of a function resulting in the error:
inline struct a_value_fmt const *find_a_value_format(struct base_fmt *base)
{
/* the code */
}
That error is typical to inline functions declared in source files, rather than in header files, in which case the compiler is not able to inline them (as the code of the function to be inlined must be visible to the compiler in the same source file being compiled). So, first thing I would check is that all functions declared inline are indeed defined in header files.
It may be that a change in GCC diagnostics in 4.7 caused the error to surface, and that it went silent in GCC 4.6 (but that's just a speculation).
The quoted error indicates that the function is declared with __attribute__((always_inline)). Note that GCC may fail to inline and report a different (and quite obscure) error if function is declared always_inline, but not with the inline keyword - so make sure that any function declared as always_inline is also declared as inline.
Few more tips:
General advice, which may not be applicable: since this is a legacy codebase, you may want to re-evaluate which functions should be inlined, being on the critical path, and which aren't, based on updated profiling results. Sometimes, inline is used generously, even when it is not required, or redundant. In some cases, the fix may be to remove the inline keyword in places where it is not needed.
When functions are declared in header files, the compiler considers them for inlining automatically (given they are small enough, and the compiler thinks that inlining them will improve performance, based on its heuristics) - even when the inline keyword is not used. inline is sort of a "recommendation" to the compiler, and it doesn't have to obey (unless it is given along with the always_inline attribute).
Modern compilers make relatively smart inlining decisions, so it's usually best to let the compiler do it's thing, and declare functions as inline (and moving their implementations to header files) in the appliation hot spots, based on profiling results.

Resources