Can using inline lower performance?

Can making a function inline in some specific cases lower the overall performance of an application?

Inlining (particularly of large functions) can increase code size enough that instruction-cache performance suffers and overall performance decreases.

This applies to every language that supports inlining.
The functions that deserve inlining are usually short and simple; it is rarely useful to inline long, complicated functions, but things like property getters or functions that perform small arithmetic calculations get a significant performance boost when inlined.
In languages like C and C++, the compiler chooses the best inlining strategy when you specify the inline keyword: a function marked inline is merely "recommended for inlining", and it is the compiler that decides whether to actually inline it or not.
Even so, you should apply the inline keyword only to functions that genuinely deserve inlining.
Some C and C++ compilers support special keywords like __forceinline, or other special attributes, that tell the compiler "this is strongly recommended for inlining". These should be used with caution, since inlining big, complicated functions can lower performance.
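As a minimal C++ sketch of the difference (function names are invented for illustration): the inline keyword is only a hint, while __attribute__((always_inline)) on GCC/Clang, or __forceinline on MSVC, overrides the compiler's own cost model.

#include <cstdio>

// "inline" is only a recommendation; the compiler decides.
inline int add_one(int x) {
    return x + 1;
}

// always_inline (GCC/Clang spelling) forces the issue; use sparingly.
__attribute__((always_inline)) inline int add_one_forced(int x) {
    return x + 1;
}

int main() {
    std::printf("%d %d\n", add_one(1), add_one_forced(2));
    return 0;
}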

Related

Is there any performance difference between macros and functions in Rust?

In Rust, macros are executed at compile time. They generally expand into new pieces of code that the compiler will then need to process further.
But once the macros have been expanded and compiled, is there any performance difference between a normal function and a macro?
I assume you're talking about runtime performance. Compile-time-wise, macros are usually slower, as they are compiled anew for each invocation.
Macros are like #[inline(always)] functions. This can be good or bad for performance, depending on a lot of characteristics, such as the number of calls to the code, code size, or instruction-cache pressure. Always benchmark before making a decision.
If you can use a function, prefer that. It can always be marked #[inline(always)] if that is deemed good for performance, while keeping more familiar syntax and faster compile times.
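A loose C++ analogy of the same point (names are invented): a function-like macro is always expanded at each call site, much like an #[inline(always)] function, while a plain function leaves the decision to the optimizer.

#include <cstdio>

// Expanded at every call site, unconditionally -- the compiler has no say.
#define SQUARE_MACRO(x) ((x) * (x))

// A normal function: the optimizer inlines it only if its heuristics say so.
int square_fn(int x) {
    return x * x;
}

int main() {
    std::printf("%d %d\n", SQUARE_MACRO(3), square_fn(4));
    return 0;
}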

Why is `math.Sin` disallowed in a Go constant?

According to Effective Go, the function math.Sin cannot be used to define a constant because the call to that function must happen at run time.
What is the reasoning behind this limitation? Floating-point consistency? Quirk of the Sin implementation? Something else?
There is support for this sort of thing in other languages. In C, for example: as of version 4.3, GCC supports compile-time calculation of the sine function. (See section "General Optimizer Improvements").
However, as noted in this blog post by Bruce Dawson, this can cause unexpected issues. (See section "Compile-time versus run-time sin").
Is this a relevant concern in Go? Or is this usage restricted for a different reason?
Go doesn't support initializing a constant with the result of a function. Functions are called at runtime, not at compile time. But constants are defined at compile time.
It would be possible to make exceptions for certain functions (like math.Sin for example), but that would make the spec more complicated. The Go developers generally prefer to keep the spec simple and consistent.
Go simply lacks the concept. There is no way of marking a function as pure (its return value depends only on its arguments, and it doesn't alter any kind of mutable state or perform I/O), and there is no way for the compiler to infer purity. Consequently there is no attempt to evaluate any expression containing a function call at compile time: doing so for anything except a pure function of constant arguments would be a source of weird behavior and bugs, and adding the machinery needed to make it work right would introduce quite a bit of complexity.
Yes, this is a substantial loss, and it forces a tradeoff between code with bad runtime behavior and code that is flat-out ugly. Go partisans will choose the ugly code and tell you that you are a bad human being for not finding it beautiful.
The best thing you have available to you is code generation. The integration of go generate into the toolchain and the provision of a complete Go parser in the standard library makes it relatively easy to munge code at build time, and one of the things that you can do with this ability is create more advanced constant-folding if you so choose. You still get all of the debuggability peril of code generation, but it's something.
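For contrast, here is a minimal C++ sketch of the facility Go omits: a constexpr function can initialize a true compile-time constant. (std::sin itself is not constexpr before C++26, so a small Taylor-series stand-in is used here purely for illustration.)

#include <cstdio>

// Illustrative only: first four Taylor terms of sin(x) around 0.
constexpr double sin_approx(double x) {
    double x2 = x * x;
    return x * (1.0 - x2 / 6.0 * (1.0 - x2 / 20.0 * (1.0 - x2 / 42.0)));
}

constexpr double kSinHalf = sin_approx(0.5);  // evaluated at compile time

static_assert(kSinHalf > 0.479 && kSinHalf < 0.480, "computed by the compiler");

int main() {
    std::printf("%f\n", kSinHalf);
    return 0;
}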

efficiency loss due to use of function pointer in place of if-block

Suppose we have a Fortran function (for example a mathematical optimization algorithm) that takes as input, another Fortran function:
myOptimizer(func)
Now depending on the user's choice, the input function could be from a list of several different functions. This list of choices can be implemented via an if-block:
if (userChoice == 'func1') then
    call myOptimizer(func1)
elseif (userChoice == 'func2') then
    call myOptimizer(func2)
elseif (userChoice == 'func3') then
    call myOptimizer(func3)
end if
Alternatively, I could also define function pointers, and write this as,
if (userChoice == 'func1') then
    func => func1
elseif (userChoice == 'func2') then
    func => func2
elseif (userChoice == 'func3') then
    func => func3
end if
call myOptimizer(func)
Based on my tests with Intel Fortran Compiler 2017 with the O2 flag, the second implementation turns out to be several times slower (4-5 times slower than the if-block implementation). From a software-development perspective I would strongly prefer the second approach, since it results in much more concise and cleaner code, at least in my problem, where there is a fixed workflow with different possible input functions. However, performance matters equally in this problem.
Is this performance loss from indirect function calls expected in all Fortran code, or is it a compiler-dependent issue? Is there a way to use indirect function calls without the performance loss? How about other languages such as C/C++?
This is a pure guess based on how compilers generally work and what might explain the 4-5x perf difference.
In the first version, maybe the compiler is inlining myOptimizer() into each call site with func1, func2, and func3 inlined into the optimizer, so when it runs there's no actual function pointer or function call happening.
An indirect function-call isn't much more expensive than a regular function call on modern x86 hardware. It's the lack of inlining that really hurts, especially for FP code. Spilling / reloading all the floating-point registers around a function call is expensive, especially if the function is fairly small.
i.e. what's probably hurting you is that your 2nd version convinces the compiler not to undo the indirection. This would be true in C / C++ as well.
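A rough C++ rendering of the two versions (all names are invented; checking the generated asm, as suggested below, is what actually confirms the difference):

#include <cstdio>

static double func1(double x) { return x * x; }
static double func2(double x) { return x + 1.0; }

// Toy stand-in for an optimizer that calls its argument many times.
static double myOptimizer(double (*f)(double)) {
    double acc = 0.0;
    for (int i = 0; i < 1000; ++i)
        acc += f(i * 0.001);
    return acc;
}

int main(int argc, char **) {
    // Version 1: a separate call site per choice. The compiler can clone
    // myOptimizer and inline func1/func2 into each copy.
    double r1 = (argc > 1) ? myOptimizer(func1) : myOptimizer(func2);

    // Version 2: choose a pointer first, then make one indirect call.
    // The compiler is more likely to leave the indirection in place.
    double (*f)(double) = (argc > 1) ? func1 : func2;
    double r2 = myOptimizer(f);

    std::printf("%f %f\n", r1, r2);
    return 0;
}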
Hand-holding your compiler into making fast asm probably means you have to write it the first way, unless there's a profile-guided optimization option you can use that might make the compiler realize this is a hot spot and it's worth trying harder with the source written the 2nd way. (Sorry I don't use Fortran, and I only know a few of the options for Intel's C/C++ compiler from looking at its asm output vs. gcc and clang on http://gcc.godbolt.org/)
To see if my hypothesis is right, check the compiler-generated asm. If the first version doesn't actually pass a function pointer to a stand-alone definition of myOptimizer, but the 2nd one does, that's probably all there is to it.
See How to remove "noise" from GCC/clang assembly output? for more about looking at compiler output. Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is a good intro to reading compiler output and why you might want to.

How do I force gcc to inline a function?

Does __attribute__((always_inline)) force a function to be inlined by gcc?
Yes.
From the documentation, v4.1.2 and latest:
always_inline
    Generally, functions are not inlined unless optimization is specified. For functions declared inline, this attribute inlines the function even if no optimization level was specified.
It should. I'm a big fan of manual inlining. Sure, used in excess it's a bad thing. But oftentimes when optimizing code, there will be one or two functions that simply have to be inlined or performance goes down the toilet. And frankly, in my experience C compilers typically do not inline those functions when merely given the inline keyword.
I'm perfectly willing to let the compiler inline most of my code for me. It's only those half-dozen or so absolutely vital cases that I really care about. People say "compilers do a good job at this." I'd like to see proof of that, please. So far, I've never seen a C compiler inline a vital piece of code I told it to without using some sort of forced-inline syntax (__forceinline on MSVC, __attribute__((always_inline)) on GCC).
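A common portable wrapper for those two spellings looks roughly like this (the macro name and the helper function are my own, for illustration only):

#include <cstdio>

// Pick the force-inline spelling for the compiler at hand.
#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#else
#define FORCE_INLINE inline __attribute__((always_inline))
#endif

// Hypothetical hot helper that must not become a call.
FORCE_INLINE int clamp_byte(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

int main() {
    std::printf("%d\n", clamp_byte(300));
    return 0;
}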
Yes, it will. That doesn't necessarily mean it's a good idea.
According to the gcc optimize options documentation, you can tune inlining with parameters:
-finline-limit=n
    By default, GCC limits the size of functions that can be inlined. This flag allows coarse control of this limit. n is the size of functions that can be inlined in number of pseudo instructions.
    Inlining is actually controlled by a number of parameters, which may be specified individually by using --param name=value. The -finline-limit=n option sets some of these parameters as follows:
    max-inline-insns-single is set to n/2.
    max-inline-insns-auto is set to n/2.
I suggest reading more in details about all the parameters for inlining, and setting them appropriately.
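For example (the number is arbitrary, purely for illustration), gcc -O2 -finline-limit=600 foo.c would set both max-inline-insns-single and max-inline-insns-auto to 300 pseudo instructions, per the n/2 rule quoted above.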
I want to add here that I have a SIMD math library where inlining is absolutely critical for performance. Initially I marked all functions inline, but the disassembly showed that even for the most trivial operators the compiler would decide to actually call the function. Both MSVC and Clang showed this, with all optimization flags on.
I did as suggested in other posts on SO and added __forceinline for MSVC and __attribute__((always_inline)) for all other compilers. There was a consistent 25-35% improvement in performance in various tight loops, with operations ranging from basic multiplies to sines.
I didn't figure out why they had such a hard time inlining (perhaps templated code is harder?) but the bottom line is: there are very valid use cases for inlining manually and huge speedups to be gained.
If you're curious this is where I implemented it. https://github.com/redorav/hlslpp
Yes. It will inline the function regardless of any other options set.
One can also use __always_inline. I have been using that for C++ member functions with GCC 4.8.1, but I could not find a good explanation of it in the GCC docs.
Actually the answer is "no". All it means is that the function is a candidate for inlining even with optimizations disabled.

What are good heuristics for inlining functions?

Considering that you're trying solely to optimize for speed, what are good heuristics for deciding whether to inline a function or not? Obviously code size should be important, but are there any other factors typically used when (say) gcc or icc is determining whether to inline a function call? Has there been any significant academic work in the area?
Wikipedia has a few paragraphs about this, with some links at the bottom:
In addition to memory size and cache issues, another consideration is register pressure. From the compiler's point of view "the added variables from the inlined procedure may consume additional registers, and in an area where register pressure is already high this may force spilling, which causes additional RAM accesses."
Languages with JIT compilers and runtime class loading have other tradeoffs since the virtual methods aren't known statically, yet the JIT can collect runtime profiling information, such as method call frequency:
Design, Implementation, and Evaluation of Optimizations in a Just-in-Time Compiler (for Java) talks about method inlining of static methods and dynamically loaded classes and its improvements on performance.
Practicing JUDO: Java Under Dynamic Optimizations claims that their "inlining policy is based on the code size and profiling information. If the execution frequency of a method entry is below a certain threshold, the method is then not inlined because it is regarded as a cold method. To avoid code explosion, we do not inline a method with a bytecode size of more than 25 bytes. . . . To avoid inlining along a deep call chain, inlining stops when the accumulated inlined bytecode size along the call chain exceeds 40 bytes." Although they have runtime profiling information (method call frequency) they are still careful to avoid inlining large functions or chains of functions to prevent bloat.
A search on Google Scholar reveals a number of papers, such as
The effect of code expanding optimizations on instruction cache design
Function Inlining under Code Size Constraints for Embedded Processors
A search on Google Books reveals quite a number of books with papers or chapters about function inlining in various contexts.
The Compiler Design Handbook: Optimizations and Machine Code Generation has a chapter about Statistical and Machine Learning Techniques in Compiler Design, with heuristics to set various parameters and profiling of the results. This chapter references the Vaswani et al. paper Microarchitecture Sensitive Empirical Models for Compiler Optimizations, where they propose "the use of empirical modeling techniques for building microarchitecture sensitive models for compiler optimizations".
(Some other books talk about inlining from the programmer's point of view, such as C++ for Game Programmers, which discusses the dangers of inlining functions too often and the differences between inlining and macros. Compilers often ignore the programmer's inline requests if they can determine that they would do more harm than good; this can be overridden with macros as a last resort.)
A function call implies some additional code (the function prologue, where the new stack frame is set up, and the function epilogue, where it's cleaned up). If your compiler sees that the function code is small in comparison to the prologue and epilogue, it can decide it's not worth it to make an actual call, and will inline the function.
The only benefit I see of calling a function instead of inlining it is size-related. I guess inlining a function and then unrolling a loop can result in a significant size increase.
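A sketch of the kind of function where the prologue/epilogue would dwarf the body (the type and names are invented):

#include <cstdio>

struct Point {
    int x_;
    // The body is a single load; a real call would cost more than the work.
    int x() const { return x_; }
};

int main() {
    Point p{42};
    std::printf("%d\n", p.x());  // optimizers routinely inline such accessors
    return 0;
}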
As far as I have seen, function size is the only factor compilers use to decide on inlining. However, if you do profile-guided optimization (PGO), I believe the compiler is able to use other variables, such as the number of calls or the call setup time.
In .NET it is mostly based on size. Measure the size of the parent function and the child function in compiled bytes, then measure the size of the combined function. If the combined function is smaller, then inlining is a good idea.
The reason for this is to make it possible to shove as much code into the CPU's cache as possible. Cache misses are far more expensive than function calls in modern CPUs.
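A toy sketch of that size rule (the byte counts are placeholders, not what the .NET JIT actually measures):

#include <cstddef>

// Inline if the merged function is smaller than keeping the caller
// and callee separate.
bool should_inline(std::size_t caller_bytes, std::size_t callee_bytes,
                   std::size_t combined_bytes) {
    return combined_bytes < caller_bytes + callee_bytes;
}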

Resources