What are good heuristics for inlining functions? - gcc

Considering that you're trying solely to optimize for speed, what are good heuristics for deciding whether to inline a function or not? Obviously code size should be important, but are there any other factors typically used when (say) gcc or icc is determining whether to inline a function call? Has there been any significant academic work in the area?

Wikipedia has a few paragraphs about this, with some links at the bottom:
In addition to memory size and cache issues, another consideration is register pressure. From the compiler's point of view "the added variables from the inlined procedure may consume additional registers, and in an area where register pressure is already high this may force spilling, which causes additional RAM accesses."
Languages with JIT compilers and runtime class loading have other tradeoffs since the virtual methods aren't known statically, yet the JIT can collect runtime profiling information, such as method call frequency:
Design, Implementation, and Evaluation of Optimizations in a Just-in-Time Compiler (for Java) talks about method inlining of static methods and dynamically loaded classes and its improvements on performance.
Practicing JUDO: Java Under Dynamic Optimizations claims that their "inlining policy is based on the code size and profiling information. If the execution frequency of a method entry is below a certain threshold, the method is then not inlined because it is regarded as a cold method. To avoid code explosion, we do not inline a method with a bytecode size of more than 25 bytes. . . . To avoid inlining along a deep call chain, inlining stops when the accumulated inlined bytecode size along the call chain exceeds 40 bytes." Although they have runtime profiling information (method call frequency) they are still careful to avoid inlining large functions or chains of functions to prevent bloat.
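That quoted policy is simple enough to sketch; here is a hedged C++ rendering (the names and the hotness cutoff are hypothetical, only the byte limits come from the paper):
bool shouldInline(int entryFrequency, int methodBytecodeSize, int chainBytes) {
    const int kHotThreshold = 100;                    // hypothetical "cold" cutoff
    if (entryFrequency < kHotThreshold) return false;      // cold method
    if (methodBytecodeSize > 25) return false;             // avoid code explosion
    return chainBytes + methodBytecodeSize <= 40;          // stop deep call chains
}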
A search on Google Scholar reveals a number of papers, such as
The effect of code expanding optimizations on instruction cache design
Function Inlining under Code Size Constraints for Embedded Processors
A search on Google Books reveals quite a number of books with papers or chapters about function inlining in various contexts.
The Compiler Design Handbook: Optimizations and Machine Code Generation has a chapter about Statistical and Machine Learning Techniques in Compiler Design, with heuristics to set various parameters, profiling the results. This chapter references the Vaswani et al. paper Microarchitecture Sensitive Empirical Models for Compiler Optimizations, where they propose "the use of empirical modeling techniques for building microarchitecture sensitive models for compiler optimizations".
(Some other books talk about inlining from the programmer's point of view, such as C++ for Game Programmers, which discusses the dangers of inlining functions too often and the differences between inlining and macros. Compilers often ignore the programmer's inline requests if they determine that inlining would do more harm than good; this can be overridden with macros as a last resort.)

A function call implies some additional code (the function prologue, where the new stack frame is set up, and the function epilogue, where it's cleaned up). If your compiler sees that the function code is small in comparison to the prologue and epilogue, it can decide it's not worth it to make an actual call, and will inline the function.
The only benefit I see of calling a function instead of inlining it is size-related. I guess inlining a function and then unrolling a loop can result in a significant size increase.
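A minimal C++ sketch of that tradeoff (assuming something like gcc -O2):
static int square(int x) { return x * x; }  // body: a single multiply
int sumOfSquares(int a, int b) {
    // With gcc -O2 this typically compiles to two imuls and an add,
    // with no call, prologue, or epilogue at all.
    return square(a) + square(b);
}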

As far as I have seen, function size is the main factor compilers use to decide whether to inline. However, if you do profile-guided optimization (PGO), I believe the compiler can use other variables as well, such as the number of calls and call setup time.

In .NET it is mostly based on size. Measure the size of the parent function and the child function in compiled bytes, then measure the size of the combined function. If the combined function is smaller, inlining is a good idea.
The reason for this is to make it possible to shove as much code into the CPU's cache as possible. Cache misses are far more expensive than function calls in modern CPUs.
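A hedged sketch of that size comparison (names hypothetical; real JIT heuristics combine this with other signals):
#include <cstddef>
// Inline only when merging parent and child does not grow the code
// beyond the two separate functions plus the call they replace.
bool shouldInline(std::size_t parentBytes, std::size_t childBytes,
                  std::size_t combinedBytes) {
    return combinedBytes < parentBytes + childBytes;
}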

Related

Is there any reliable data for benchmarking math libraries?

I'm going to test the performance of some math functions like pow, exp, and log. Is there any reliable test data for that?
Since those functions are highly optimized in existing modern system libraries such as glibc's libm or OpenJDK, generic random inputs may lead to quick convergence or trigger some fast path.
Call them in a loop to test throughput or latency, depending on whether or not the input to the next call depends on the output of the previous one, probably with data from a small-to-medium-sized array of random values for the throughput test.
You want your compiler to produce an asm loop that does minimal work beyond the function call, so use appropriate techniques for whatever language and compiler you choose. (Idiomatic way of performance evaluation?)
You might disassemble or single-step through their execution to look for data-dependent branching to figure out which ranges of inputs might be faster or slower. (Or for open-source math libraries like glibc, the commented source could be good to look at.)
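A hedged C++ sketch of both measurement modes (structure only; a serious benchmark also needs warm-up, repetitions, and checks against the functions being optimized away):
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>
int main() {
    using clk = std::chrono::steady_clock;
    std::vector<double> inputs(4096);
    for (int i = 0; i < 4096; ++i)
        inputs[i] = 1.0 + 0.001 * i;  // plausible, non-special values
    // Throughput: the calls are independent, so the CPU can overlap them.
    double acc = 0.0;
    auto t0 = clk::now();
    for (int rep = 0; rep < 1000; ++rep)
        for (double x : inputs) acc += std::exp(x);
    auto t1 = clk::now();
    // Latency: each input depends on the previous output, forming a serial
    // chain. The values settle near a fixed point, so vary the update if
    // input-dependent timing matters (as the question warns).
    double y = 1.0;
    auto t2 = clk::now();
    for (long i = 0; i < 4096L * 1000; ++i) y = std::exp(y) * 0.25;
    auto t3 = clk::now();
    auto us = [](clk::time_point a, clk::time_point b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    // Print results so the compiler cannot optimize the calls away.
    std::printf("throughput %lld us, latency %lld us (%g %g)\n",
                (long long)us(t0, t1), (long long)us(t2, t3), acc, y);
}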

efficiency loss due to use of function pointer in place of if-block

Suppose we have a Fortran procedure (for example a mathematical optimization routine) that takes another Fortran function as input:
call myOptimizer(func)
Now depending on the user's choice, the input function could be from a list of several different functions. This list of choices can be implemented via an if-block:
if (userChoice == 'func1') then
    call myOptimizer(func1)
elseif (userChoice == 'func2') then
    call myOptimizer(func2)
elseif (userChoice == 'func3') then
    call myOptimizer(func3)
end if
Alternatively, I could define a procedure pointer and write this as:
if (userChoice == 'func1') then
    func => func1
elseif (userChoice == 'func2') then
    func => func2
elseif (userChoice == 'func3') then
    func => func3
end if
call myOptimizer(func)
Based on my tests with Intel Fortran Compiler 2017 with the -O2 flag, the second implementation is several times slower (4-5x slower than the if-block implementation). From a software development perspective I strongly prefer the second approach, since it results in much more concise and cleaner code, at least in my problem, where there is a fixed workflow with different possible input functions. However, performance matters equally in this problem.
Is this performance loss from indirect function calls expected in all Fortran code, or is it a compiler-dependent issue? Is there a way to use indirect function calls without the performance loss? And how about other languages such as C/C++?
This is a pure guess based on how compilers generally work and what might explain the 4-5x perf difference.
In the first version, maybe the compiler is inlining myOptimizer() into each call site with func1, func2, and func3 inlined into the optimizer, so when it runs there's no actual function pointer or function call happening.
An indirect function-call isn't much more expensive than a regular function call on modern x86 hardware. It's the lack of inlining that really hurts, especially for FP code. Spilling / reloading all the floating-point registers around a function call is expensive, especially if the function is fairly small.
i.e. what's probably hurting you is that your 2nd version convinces the compiler not to undo the indirection. This would be true in C / C++ as well.
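A hedged C++ analogue of the two versions (all declarations hypothetical):
double func1(double), func2(double);     // hypothetical callees
double myOptimizer(double (*f)(double));
double version1(int choice) {
    // Direct: each call site names its callee, so the compiler can
    // inline myOptimizer together with func1 or func2.
    if (choice == 1) return myOptimizer(func1);
    return myOptimizer(func2);
}
double version2(int choice) {
    // Indirect: unless the compiler propagates the constant pointer
    // into myOptimizer, the callee stays unknown and is not inlinable.
    double (*f)(double) = (choice == 1) ? func1 : func2;
    return myOptimizer(f);
}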
Hand-holding your compiler into making fast asm probably means you have to write it the first way, unless there's a profile-guided optimization option you can use that might make the compiler realize this is a hot spot and it's worth trying harder with the source written the 2nd way. (Sorry I don't use Fortran, and I only know a few of the options for Intel's C/C++ compiler from looking at its asm output vs. gcc and clang on http://gcc.godbolt.org/)
To see if my hypothesis is right, check the compiler-generated asm. If the first version doesn't actually pass a function pointer to a stand-alone definition of myOptimizer, but the 2nd one does, that's probably all there is to it.
See How to remove "noise" from GCC/clang assembly output? for more about looking at compiler output. Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is a good intro to reading compiler output and why you might want to.

Impact of separating class definition from declaration on program size

I am working on a microcontroller with tight memory constraints. Hence I watch memory consumption.
I have a library with classes that are only visible in the .cpp file; these classes do not show up in the header file, and they were implemented directly in place. Now I have started to separate the declaration from the implementation, because I need to expose some of the classes in the header file. However, I noticed that this separation affects the program size: for some of the classes it increases memory consumption, for some it decreases it.
Why is it that separation of definition and implementation affects compiled program size? How might I leverage this to decrease compiled program size?
When a class is only used inside a single translation unit (file) the compiler is free to perform whatever optimisations it likes. It can completely get rid of the v-table, split the class up and turn it into a more procedural structure if this works better. When you export the class outside, the compiler can't make assumptions about who might be using it and so the optimisations it can perform are more limited.
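A hedged C++ illustration, assuming the usual anonymous-namespace idiom for file-local classes:
namespace {                       // visible only inside this .cpp file
struct Counter {
    virtual void tick() { ++n; }  // compiler sees every use of Counter
    int n = 0;
};
}
int run() {
    Counter c;
    c.tick();     // known dynamic type: the call can be devirtualized,
    return c.n;   // inlined, and possibly folded down to 'return 1'
}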
However, particularly on microcontrollers, there are lots of aggressive post-link optimisations, such as procedural abstraction, that can reduce code size in the finished program. Sometimes, if the compiler has optimised the separate modules less due to the situation described above, bigger gains can be achieved at this stage because there are more unoptimised repeated code blocks.
These days extra memory is so cheap that it is rarely worth trying to write your program around saving a few bytes. Having a clearer and easier-to-maintain code base will quickly pay for any BOM savings the first time you have to add new features. If you really want to control memory usage carefully, then I'd recommend moving to C (or an extremely limited subset of C++) and getting a really good understanding of how your compiler optimises.

Can using inline lower performance?

Can making a function inline in some specific cases lower the overall performance of an application?
Inlining (particularly large functions) can increase the code size so that cache performance is affected and overall performance decreases.
Inlining applies in every language that supports it.
The functions that deserve inlining are usually short and simple; it's not useful to inline long and complicated functions, but things like property getters or functions that perform a small arithmetic calculation get a big performance boost when inlined.
In languages like C and C++ the compiler chooses the inlining strategy itself: a function marked with the inline keyword is merely "recommended for inlining", and it is the compiler that decides whether to actually inline it.
You should therefore apply the inline keyword only where the function genuinely deserves inlining.
Some C and C++ compilers support special keywords like __forceinline, or other special attributes, that tell the compiler "inlining this is strongly recommended". These should be used with caution, since inlining big and complicated functions can lower performance.
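A hedged sketch of those keywords (spellings differ between MSVC and GCC/Clang):
inline int getX(const int* p) { return *p; }  // hint only; the compiler decides
#if defined(_MSC_VER)
__forceinline int addOne(int x) { return x + 1; }                           // MSVC
#else
__attribute__((always_inline)) inline int addOne(int x) { return x + 1; }   // GCC/Clang
#endif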

Does profile-guided optimization done by the compiler notably hurt cases not covered by the profiling dataset?

This question is not specific to C++. AFAIK certain runtimes, like the Java RE, can do profile-guided optimization on the fly, and I'm interested in that too.
MSDN describes PGO like this:
I instrument my program and run it under profiler, then
the compiler uses data gathered by profiler to automatically reorganize branching and loops in such way that branch misprediction is reduced and most often run code is placed compactly to improve its locality
Now obviously the profiling result will depend on the dataset used.
With normal manual profiling and optimization I'd find some bottlenecks, improve them, and likely leave all the other code untouched. PGO seems to improve often-run code at the expense of making rarely-run code slower.
Now what if that slowed-down code is run often on another dataset that the program will see in the real world? Will the program's performance degrade compared to a program compiled without PGO, and how bad is the degradation likely to be? In other words, does PGO really improve my code's performance for the profiling dataset, and possibly worsen it for other datasets? Are there any real examples with real data?
Disclaimer: I have not done more with PGO than read up on it and tried it once with a sample project for fun. A lot of the following is based on my experience with the "non-PGO" optimizations and educated guesses. TL;DR below.
This page lists the optimizations done by PGO. Let's look at them one by one (grouped by impact):
Inlining – For example, if there exists a function A that frequently calls function B, and function B is relatively small, then profile-guided optimizations will inline function B in function A.
Register Allocation – Optimizing with profile data results in better register allocation.
Virtual Call Speculation – If a virtual call, or other call through a function pointer, frequently targets a certain function, a profile-guided optimization can insert a conditionally-executed direct call to the frequently-targeted function, and the direct call can be inlined.
These apparently improve the prediction of whether or not certain optimizations pay off. There is no direct tradeoff for non-profiled code paths.
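As a hedged sketch of the virtual call speculation item above (names hypothetical), the transformation amounts to a guarded direct call:
int hot_handler(int);  // hypothetical frequent target
int dispatch(int (*handler)(int), int x) {
    // PGO observed that handler almost always points at hot_handler.
    if (handler == &hot_handler)
        return hot_handler(x);  // guarded direct call, now inlinable
    return handler(x);          // rare targets keep the indirect call
}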
Basic Block Optimization – Basic block optimization allows commonly executed basic blocks that temporally execute within a given frame to be placed in the same set of pages (locality). This minimizes the number of pages used, thus minimizing memory overhead.
Function Layout – Based on the call graph and profiled caller/callee behavior, functions that tend to be along the same execution path are placed in the same section.
Dead Code Separation – Code that is not called during profiling is moved to a special section that is appended to the end of the set of sections. This effectively keeps this section out of the often-used pages.
EH Code Separation – The EH code, being exceptionally executed, can often be moved to a separate section when profile-guided optimizations can determine that the exceptions occur only on exceptional conditions.
All of this may reduce the locality of non-profiled code paths. In my experience, the impact would be noticeable or severe if such a code path has a tight loop that exceeds the L1 code cache (and maybe even thrashes L2). That sounds exactly like a path that should have been included in a PGO profile :)
Dead Code separation can have a huge impact - both ways - because it can reduce disk access.
If you rely on exceptions being fast, you are doing it wrong.
Size/Speed Optimization – Functions where the program spends a lot of time can be optimized for speed.
The rule of thumb nowadays is to "optimize for size by default, and only optimize for speed where needed (and verify it helps)". The reason is again the code cache: in most cases, smaller code will also be faster code. So this kind of automates what you should do manually anyway. Compared to a global speed optimization, this would slow down non-profiled code paths only in very atypical cases ("weird code", or a target machine with unusual cache behavior).
Conditional Branch Optimization – With the value probes, profile-guided optimizations can find if a given value in a switch statement is used more often than other values. This value can then be pulled out of the switch statement. The same can be done with if/else instructions where the optimizer can order the if/else so that either the if or else block is placed first depending on which block is more frequently true.
I would file that under "improved prediction", too, unless you feed the wrong PGO information.
The typical case where this can pay off a lot is run-time parameter/range validation and similar paths that should never be taken in a normal execution.
The breaking case would be:
if (x > 0) DoThis(); else DoThat();
in a relevant tight loop and profiling only the x > 0 case.
Memory Intrinsics – The expansion of intrinsics can be decided better if it can be determined if an intrinsic is called frequently. An intrinsic can also be optimized based on the block size of moves or copies.
Again, mostly better information, with a small possibility of penalizing untested data.
Example: this is all an "educated guess", but I think it's quite illustrative for the entire topic.
Assume you have a memmove that is always called on well aligned non-overlapping buffers with a length of 16 bytes.
A possible optimization is to verify these conditions and use inlined MOV instructions for this case, falling back to a general memmove (handling alignment, overlap, and odd lengths) only when the conditions are not met.
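A hedged C++ sketch of that specialization, with the guard written out by hand (a compiler would emit the equivalent inline):
#include <cstdint>
#include <cstring>
// Fast path for the profiled common case: exactly 16 bytes, 16-byte
// aligned, non-overlapping; everything else takes the general routine.
void* memmove_pgo(void* dst, const void* src, std::size_t n) {
    auto d = reinterpret_cast<std::uintptr_t>(dst);
    auto s = reinterpret_cast<std::uintptr_t>(src);
    bool aligned16 = ((d | s) % 16) == 0;
    bool disjoint  = (d + n <= s) || (s + n <= d);
    if (n == 16 && aligned16 && disjoint) {
        std::memcpy(dst, src, 16);  // compilers expand this to a couple of MOVs
        return dst;
    }
    return std::memmove(dst, src, n);  // handles overlap, odd sizes, misalignment
}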
The benefits can be significant in a tight loop copying structs around, as you improve locality and reduce the expected-path instruction count, likely with more chances for pairing/reordering.
The penalty is comparatively small, though: in the general case without PGO, you would either always call the full memmove, or inline the full memmove implementation. The optimization adds a few instructions (including a conditional jump) to something rather complex; I'd assume 10% overhead at most. In most cases these 10% will be below the noise due to cache accesses.
However, there is a very slight chance of significant impact if the unexpected branch is taken frequently and the additional instructions for the expected case, together with the instructions for the default case, push a tight loop out of the L1 code cache.
Note that you are already at the limits of what the compiler could do for you. The additional instructions can be expected to be a few bytes, compared to a few KB of code cache. A static optimizer could suffer the same fate, depending on how well it can hoist invariants - and how much you let it.
Conclusion:
Many of the optimizations are neutral.
Some optimizations can have a slight negative impact on non-profiled code paths.
The impact is usually much smaller than the possible gains.
Very rarely, a small impact can be amplified by other contributing pathological factors.
A few optimizations (namely, the layout of code sections) can have a large impact, but again the possible gains significantly outweigh that.
My gut feeling would further claim that:
A static optimizer, on the whole, would be at least equally likely to create a pathological case.
It would be pretty hard to actually destroy performance, even with bad PGO input.
At that level, I would be much more afraid of PGO implementation bugs/shortcomings than of failed PGO optimizations.
PGO can most certainly affect the run time of the code that is run less frequently. After all, you are modifying the locality of some functions/blocks, making the blocks that are now together more cache friendly.
What I have seen is that teams identify their high-priority scenarios, then run those to train the optimization profiler and measure the improvement. You don't want to run all the scenarios under PGO, because if you do, you might as well not run any.
As with everything related to performance, you need to measure before you apply it. Measure your most common scenarios to see whether they improve at all with PGO training, and also measure the less common scenarios to see whether they regress at all.
