How do I force gcc to inline a function? - gcc

Does __attribute__((always_inline)) force a function to be inlined by gcc?

Yes.
From documentation v4.1.2
From documentation latest
always_inline
Generally, functions are not inlined unless optimization is specified. For functions declared inline, this attribute inlines the function even if no optimization level was specified.

It should. I'm a big fan of manual inlining. Sure, used in excess it's a bad thing. But often times when optimizing code, there will be one or two functions that simply have to be inlined or performance goes down the toilet. And frankly, in my experience C compilers typically do not inline those functions when using the inline keyword.
I'm perfectly willing to let the compiler inline most of my code for me. It's only those half dozen or so absolutely vital cases that I really care about. People say "compilers do a good job at this." I'd like to see proof of that, please. So far, I've never seen a C compiler inline a vital piece of code I told it to without using some sort of forced inline syntax (__forceinline on msvc __attribute__((always_inline)) on gcc).

Yes, it will. That doesn't necessarily mean it's a good idea.

According to the gcc optimize options documentation, you can tune inlining with parameters:
-finline-limit=n
By default, GCC limits the size of functions that can be inlined. This flag
allows coarse control of this limit. n is the size of functions that can be
inlined in number of pseudo instructions.
Inlining is actually controlled by a number of parameters, which may be specified
individually by using --param name=value. The -finline-limit=n option sets some
of these parameters as follows:
max-inline-insns-single is set to n/2.
max-inline-insns-auto is set to n/2.
I suggest reading more in details about all the parameters for inlining, and setting them appropriately.

I want to add here that I have a SIMD math library where inlining is absolutely critical for performance. Initially I set all functions to inline but the disassembly showed that even for the most trivial operators it would decide to actually call the function. Both MSVC and Clang showed this, with all optimization flags on.
I did as suggested in other posts in SO and added __forceinline for MSVC and __attribute__((always_inline)) for all other compilers. There was a consistent 25-35% improvement in performance in various tight loops with operations ranging from basic multiplies to sines.
I didn't figure out why they had such a hard time inlining (perhaps templated code is harder?) but the bottom line is: there are very valid use cases for inlining manually and huge speedups to be gained.
If you're curious this is where I implemented it. https://github.com/redorav/hlslpp

Yes. It will inline the function regardless of any other options set. See here.

One can also use __always_inline. I have been using that for C++ member functions for GCC 4.8.1. But could not found a good explanation in GCC doc.

Actually the answer is "no". All it means is that the function is a candidate for inlining even with optimizations disabled.

Related

Is there any performance difference between macros and functions in Rust?

In Rust Macros are executed at compile time. They generally expand
into new pieces of code that the compiler will then need to further
process.
But after macros compiled or before compiled is there any performance difference between normal function vs macros?
I assume you're talking about runtime performance. Compile-time wise macros are usually slower as they are compiled for each invocation.
Macros are like #[inline(always)] functions. This can be good or bad for performance, depending on lot of characteristics like number of calls to the code, code size or instruction cache pressure. Always benchmark before making a decision.
If you can use a function, prefer that. It can always be marked #[inline(always)] if deemed good for performance, while using more familiar syntax and faster compile times.

Do compilers take the "status quo" when optimizations produced worse results?

To my knowledge, when using optimizations there is a risk to face the "maybe will be worse" case (i.e. the performance will be degraded, or the code size will be higher, or both). However do compilers able to detect such cases and return to the "status quo" (i.e. fall back to the original non-optimized code) when optimizations produced worse results? Can someone give (if possible) a particular examples of what compilers (for example, gcc, Clang (LLVM), etc.) do in this case?
In JIT compilers there is a thing called Deoptimization. Normally the compiler will optimize heavily assuming something, but during execution some of the assumption may fail. For example the compiler will assume the inmput of a function is always an integer and produce a highly efficient code for integer manipulation, but if, and such things happen in dynamic languages, the input is suddenly and array or a string, the code should revert. See v8 turbofan speculative optimizator for example.
For non JIT there is no way to deoptimize during runtime, but the compiler may create multiple execution paths. Your question is not fully logical because how would compiler know if it created unoptimal code? It can only use the same algorithm it used to do the optimization itself. That's probably why you are downwoted.

efficiency loss due to use of function pointer in place of if-block

Suppose we have a Fortran function (for example a mathematical optimization algorithm) that takes as input, another Fortran function:
myOptimizer(func)
Now depending on the user's choice, the input function could be from a list of several different functions. This list of choices can be implemented via an if-block:
if (userChoice=='func1') then
myOptimizer(func1)
elseif (userChoice=='func2') then
myOptimizer(func2)
elseif (userChoice=='func3') then
myOptimizer(func3)
end if
Alternatively, I could also define function pointers, and write this as,
if (userChoice=='func1') then
func => func1
elseif (userChoice=='func2') then
func => func2
elseif (userChoice=='func3') then
func => func3
end if
myOptimizer(func)
Based on my tests with Intel Fortran Compiler 2017 with O2 flag, the second implementation happens to be slower by several factors (4-5 times slower than the if-block implementation). From the software development perspective, I would strongly prefer the second approach since it results in much more concise and cleaner code, at least in my problem where there is a fixed workflow, with different possible input functions to the workflow. However, performance also equally matters in the problem.
Is this loss of performance by indirect function calls, expected in all Fortran codes? or is it a compiler-dependent issue? Is there a solution to using indirect function calls without performance loss? How about other languages such as C/C++?
This is a pure guess based on how compilers generally work and what might explain the 4-5x perf difference.
In the first version, maybe the compiler is inlining myOptimizer() into each call site with func1, func2, and func3 inlined into the optimizer, so when it runs there's no actual function pointer or function call happening.
An indirect function-call isn't much more expensive than a regular function call on modern x86 hardware. It's the lack of inlining that really hurts, especially for FP code. Spilling / reloading all the floating-point registers around a function call is expensive, especially if the function is fairly small.
i.e. what's probably hurting you is that your 2nd version convinces the compiler not to undo the indirection. This would be true in C / C++ as well.
Hand-holding your compiler into making fast asm probably means you have to write it the first way, unless there's a profile-guided optimization option you can use that might make the compiler realize this is a hot spot and it's worth trying harder with the source written the 2nd way. (Sorry I don't use Fortran, and I only know a few of the options for Intel's C/C++ compiler from looking at its asm output vs. gcc and clang on http://gcc.godbolt.org/)
To see if my hypothesis is right, check the compiler-generated asm. If the first version doesn't actually pass a function pointer to a stand-alone definition of myOptimizer, but the 2nd one does, that's probably all there is to it.
See How to remove "noise" from GCC/clang assembly output? for more about looking at compiler output. Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is a good intro to reading compiler output and why you might want to.

Can using inline lower performance?

Can making function inline in some specific cases lower the overall performance of application?
Inlining (particularly large functions) can increase the code size so that cache performance is affected and overall performance decreases.
Inlining works with every language that supports it.
The functions that deserve inlining are usually short and simple, it's unuseful to inline long and complicated functions, but things like property getter or functions that performs some arithmetical calculations gets a lot of performance boost when inlined.
In languages like C and C++ the compiler choose the best inline strategy for your method when you specify inline keyword, a function marked inline is "recommended for inlining", but is the compiler that choose if is the case to inline for real or not.
However you should put the inline keyword only when the function deserves inlining for real.
Some C and C++ compilers supports special keywords like __forceinline or some other special attributes that say to the compiler "this is very recommended for inlining" and these keywords should be used with caution, since inlining big and complicated functions can lower down performances.

How do modern compilers use mmx/3dnow/sse instructions?

I've been reading up on the x86 instruction set extensions, and they only seem useful in some quite specific circumstances (eg HADDPD - (Horizontal-Add-Packed-Double) in SSE3). These require a certain register layout that needs to be either deliberately set up, or occur from the series of instructions before it. How often do general-purpose compilers like gcc actually use these instructions (or a subset thereof), or are they mainly to be used in hand-coded assembler? How does the compiler detect where it is appropriate to use SIMD instructions?
Generally, few compilers use them. GCC and Visual Studio arn't usually able to use the SIMD instructions. If you enable SSE as a compiler flag, it will use the scalar SSE instructions for regular floating-point operations, but generally, don't expect the vectorized ones to be used automatically. Recent versions of GCC might be able to use them in some cases, but didn't work last I tried. Intel's C++ compiler is the only big compiler I know of that is able to auto-vectorize some loops.
In general though, you'll have to use them yourself. Either in raw assembler, or by using compiler intrinsics. In general, I'd say intrinsics are the better approach, since they better allow the compiler to understand the code, and so schedule and optimize, but in practice, I know MSVC at least doesn't always generate very efficient code from intrinsics, so plain asm may be the best solution there. Experiment, see what works. But don't expect the compiler to use these instructions for you, unless you 1) use the right compiler, and 2) write fairly simple loops that can be trivially vectorized.
Update 2012
Ok, so three years have passed since I wrote this answer. GCC has been able to auto-vectorize (simple) code for a couple of years now, and in VS2012, MSVC finally gains the same capability. Of course, the main part of my answer still applies: compilers can still only vectorize fairly trivial code. For anything more complex, you're stuck fiddling with intrinsics or inline asm.
Mono can use SIMD extensions as long as you use its classes for vectors. You can read about it here: http://tirania.org/blog/archive/2008/Nov-03.html
GCC should do some automatic vectorisation as long as you're using -O3 or a specific flag. They have an info page here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
The question of how to exploit SSE and other small vector units automatically (without direction from the programmer in the form of special language constructs or specially blessed compiler "intrinsics") has been a topic of compiler research for some time. Most results seem to be specialized to a particular problem domain, such as digital signal processing. I have not kept up with the literature on this topic, but what I have read suggests that exploiting the vector (SSE) unit is still a topic for research, and that one should have low expectations of general-purpose compilers commonly used in the field.
Suggested search term: vectorizing compiler
I have seen gcc use sse to zero out a default std::string object. Not a particularly powerful use of sse, but it exists. In most cases, though you will have to write your own.
I know this because I had allowed the stack to become unaligned and it crashed, otherwise I probably wouldn't have noticed!
If you use the vector pascal compiler you will get efficient SIMD code for types for which SIMD gives an advantage. Basically this is anything of length less than 64 bits. ( for 64 bit reals it is actually slower to do SIMD).
Latest versions of the compiler will also automatically parallelise accross cores

Resources