Call to DLL function performance overhead?

I'm in a situation where I will often call two functions from another DLL, and I need those functions to stay separate because I also call them individually in other places. But since I often call them together, does it make sense to create another DLL function that calls those two, and use it in the places where I need them both?

Possibly, yes, due to inlining. Calling a DLL function is relatively cheap on modern processors (especially when it is called often and the cache is warm). However, if the functions run for only a very short time, the call can still add significant overhead. Adding a DLL function that calls the two others from within the same module saves one cross-DLL call per pair. In some cases, inlining also helps the compiler optimize the code further; constant propagation, for example, is an optimization that often matters a lot here. That being said, if the two functions are not related to each other, the benefit should be small.
Note that this is a micro-optimization and its benefit should be relatively small in most cases. If the frequency of DLL calls is very high, it is usually much better to redesign your code (e.g. by working on chunks of data, or by using an optimizing JIT).
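As a rough sketch of the wrapper idea (the names stepA, stepB, callA, callB, and callBoth are hypothetical, and a Windows-style export is assumed; other toolchains use visibility attributes instead):

// Inside the DLL that already exposes the two functions; stepA/stepB stand in
// for the real ones.
static int stepA(int x) { return x + 1; }
static int stepB(int x) { return x * 2; }

extern "C" __declspec(dllexport) int callA(int x) { return stepA(x); }
extern "C" __declspec(dllexport) int callB(int x) { return stepB(x); }

// Combined entry point: callers that need both pay for one cross-DLL call,
// and the compiler can inline stepA/stepB here and optimize across the pair.
extern "C" __declspec(dllexport) int callBoth(int x) { return stepB(stepA(x)); }

The point is that stepA and stepB are ordinary internal functions inside callBoth, so the compiler is free to inline them there and optimize across the pair, which it cannot do across two separate exported calls.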

Related

Functions For One-Time Use

In any language, is it better to write out logic directly in the main function, even if it's a bit long, when you're only going to use it a few times? I heard that in Python doing so makes the function faster. Is that also true in JavaScript or any other language?
Example:
function main(){
  blahblahblah1();
  blahblahblah2();
  blahblahblah3();
  blahblahblah1();
  blahblahblah2();
  blahblahblah3();
}
setInterval(main,1);
Is it better to group the code (blahblahblah) to make it look like this:
function blahblahblah(){
  blahblahblah1();
  blahblahblah2();
  blahblahblah3();
}
function main(){
  blahblahblah();
}
setInterval(main,1);
The performance impact of this decision is too small to measure in any language I'm aware of; the interpreter/compiler optimizes this away. The only cases where this may matter are if you're writing software for extremely constrained devices like embedded systems (but you wouldn't be doing that in Python or JavaScript), or software with extreme performance requirements like graphics engines for video games (but again, you wouldn't be using Python or JavaScript for that).
Instead of optimizing for performance, I'd encourage you to optimize for readability, testability and bug resistance. Your second example is not equivalent to the first (it only executes the functions once, the first example executes them twice). Assuming you meant to have them equivalent, the second example is a little more readable, because the logic of executing the three functions is abstracted into a higher-level function.
If you use a piece of logic only once in your code, you don't need to create a function; the compiler/interpreter will optimize the code anyway. If you plan to reuse the logic, then create a function: it makes the code smaller and easier for you to maintain. Even if it is minimally slower because of the extra function lookup and call, these are minimal performance effects.

If or function pointers in fortran

As is so common with Fortran, I'm writing a massively parallel scientific code. At the beginning of my code I read a configuration file which tells me which type of solver I want to use. That means that in a subroutine (during the main run) I have:
if (solver.eq.1) then
  call solver1()
elseif (solver.eq.2) then
  call solver2()
else
  call solver3()
endif
Edit to avoid some confusion: This if is inside my time integration loop and I have one that is inside 3 nested loops.
Now my question is: wouldn't it be more efficient to use function pointers instead, since the solver variable will not change during execution except in the initialisation procedure?
Obviously function pointers are F2003. That shouldn't be a problem as long as I use gfortran 4.6. But I'm mainly using a BlueGene/P; there is an F2003 compiler for it, so I suppose it's going to work there as well, although I couldn't find any conclusive evidence on the web.
Knowing nothing about Fortran, this is my answer: the main problem with branching is that a CPU potentially cannot speculatively execute code past a branch. To mitigate this problem, branch prediction was introduced (and it is very sophisticated in modern CPUs).
Indirect calls through a function pointer can be a problem for the prediction unit of the CPU. If it can't predict where the call will actually go, this will stall the pipeline.
I am quite sure that the CPU will correctly predict that your branch will always be taken or not taken because it is a trivial case of prediction.
Maybe the CPU can speculate across the indirect call, maybe it can't. This is why you need to test which is better.
If it cannot, you will certainly notice in your benchmark.
In addition, maybe you can hoist the if test out of your inner loop so that it is evaluated far less often. That would make the actual performance of the branch irrelevant.
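Since the choice is fixed after initialisation, a C++ analog of that idea looks roughly like the sketch below (solver1/solver2/solver3 stand in for the question's solver routines): resolve the choice once before the loop, here through a function pointer, so the per-iteration test disappears.

#include <cstdio>

// Stand-ins for the solver routines selected by the configuration file.
static void solver1() { /* ... */ }
static void solver2() { /* ... */ }
static void solver3() { /* ... */ }

int main() {
    int solver = 2;              // would come from the configuration file
    void (*step)() = solver3;    // resolve the choice once, at initialisation
    if (solver == 1)      step = solver1;
    else if (solver == 2) step = solver2;

    for (long t = 0; t < 1000000; ++t) {  // time-integration loop
        // Indirect call to a target that never changes; modern branch
        // predictors handle this case very well.
        step();
    }
    std::puts("done");
    return 0;
}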
If you only plan to use the function pointers once, at initialisation, and you are running codes on a BlueGene, isn't your concern for efficiency misdirected? Generally, any initialisation that works is OK; if it takes 1 s instead of 1 ms, it's probably going to have zero impact on total execution time.
Code initialisation routines for clarity, ease of modification, that sort of thing.
EDIT
My guess is that using function pointers rather than your current code will have no impact on execution speed. But it's just a (perhaps educated) guess, and I'll be very interested in any data you gather on this question.
If your solver routines take a non-trivial amount of time, then the trivial runtime of the IF statements is likely to be immaterial. If the solver routines have a runtime comparable to the IF statement, then the total runtime is very short, so why do you care? This seems an optimization unlikely to pay off.
The first rule of runtime optimization is to profile your code to see which portions are consuming the runtime. Otherwise you are likely to optimize portions that are unimportant, which will accomplish nothing.
For what it's worth, someone else recently had a very similar concern: Fortran Subroutine Pointers for Mismatching Array Dimensions
After a brief search I couldn't find the answer to the question, so I ran a little benchmark myself (see this link for the Makefile & dependencies). The benchmark consists of:
Draw a random number to select method a, b, or c, all of which perform a simple addition on their single integer argument
Call the chosen method 100 million times, using either a procedure pointer or if-statements
Repeat the above 5 times
The result with gfortran 4.8.5 on an E5-2630 v3 CPU @ 2.40 GHz is:
Time per call (proc. pointer): 1.89 ns
Time per call (if statement): 1.89 ns
In other words, there is not much of a performance difference!
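For comparison on the C++ side, a rough analog of that kind of micro-benchmark could look like the sketch below (the method names, seed, and call count are illustrative, not taken from the linked code): it times 100 million calls through a function pointer against the same calls behind an if/else chain.

#include <chrono>
#include <cstdio>
#include <random>

static long a(long x) { return x + 1; }
static long b(long x) { return x + 2; }
static long c(long x) { return x + 3; }

int main() {
    std::mt19937 rng(12345);
    const int choice = std::uniform_int_distribution<int>(0, 2)(rng);
    long (*fp)(long) = c;
    if (choice == 0) fp = a;
    else if (choice == 1) fp = b;

    const long n = 100000000;  // 100 million calls, as in the Fortran benchmark
    long acc = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < n; ++i) acc = fp(acc);   // dispatch through the pointer
    auto t1 = std::chrono::steady_clock::now();
    for (long i = 0; i < n; ++i) {                // dispatch through if/else
        if (choice == 0)      acc = a(acc);
        else if (choice == 1) acc = b(acc);
        else                  acc = c(acc);
    }
    auto t2 = std::chrono::steady_clock::now();

    // Note: an optimizer may unswitch the second loop and inline the calls,
    // which is exactly the kind of effect such a benchmark is trying to expose.
    std::printf("pointer: %.2f ns/call, if/else: %.2f ns/call (acc=%ld)\n",
                std::chrono::duration<double, std::nano>(t1 - t0).count() / n,
                std::chrono::duration<double, std::nano>(t2 - t1).count() / n, acc);
    return 0;
}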

Does more human-logical source code tend to produce more optimized compiled code?

I'm working on a large performance-critical project that is very branch heavy. In the process of designing algorithms for this product, my employer often reminds me to write code that is more "human logical", or written in a manner that more closely aligns with the way we logically think.
While this makes sense to me from a few different perspectives (e.g. ease of understanding/remembering, code maintenance, etc.), I'm also wondering whether this approach could ever be expected to lead to more optimized compiled output.
Could this be the case due to the fact that compilers are written by humans, and optimizers are often designed to recognize familiar code blocks?
I would love to hear some thoughts on why this could/not be the case.
Consider two different kinds of code, library code and application code.
Library code (like a string class library) is likely to own the program counter a lot of the time, like this:
while (some test) {
  massage some data, while seldom calling sub-functions
}
That kind of code will benefit from compiler optimization.
(So to answer your question, people write benchmark functions like this, and the compiler-writers use those as test cases.)
On the other hand, application code tends to look like this:
if (some test) {
  do a bunch of things, including many function calls
} else if (some other test) {
  do a bunch of things, including many function calls
} else {
  do a bunch of things, including many function calls
}
In this case, the time you save by branch prediction or cycle-shaving might be 1 time unit, say, while the "do a bunch of things" parts might spend from 10^2 to 10^8 time units, with or without I/O.
So the benefit of compiler optimization of this code tends to be completely lost in the noise.
That's not to say it can't be optimized.
It's just that the compiler can't do it - it's your job.
If you want to make the latter kind of code run fast, the best way is to find out which lines of code are on the call stack a high percentage of the time and, if possible, find a way to avoid executing them.
(Here's an example of a 43x speedup.)
What is "human logical" probably varies from human to human.
For instance, if I am a newbie performing tasks according to written instructions, I will (usually), over time, learn some tasks by heart, whereas for others I will keep returning to the instructions, simply because those tasks are not performed often enough, are too boring, or both. Others in the same situation may or may not function similarly, and it is not certain that the tasks they learn will be the ones I learn.
For programming it works similarly. Some may construct a loop in one manner and perform a test inside it for the sake of readability, while I might do the test outside for performance reasons. Which is more wrong and which is more right?
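As a small, hypothetical C++ illustration of that trade-off (not code from the post):

#include <vector>

// Readable version: the loop-invariant flag is tested on every iteration.
void scale(std::vector<double>& v, bool invert, double factor) {
    for (double& x : v) {
        if (invert) x /= factor;
        else        x *= factor;
    }
}

// Hand-hoisted version: the test runs once and each loop body is branch-free.
void scaleHoisted(std::vector<double>& v, bool invert, double factor) {
    if (invert) { for (double& x : v) x /= factor; }
    else        { for (double& x : v) x *= factor; }
}

Many optimizing compilers perform this loop unswitching themselves, which is one reason the readable version and the hand-tuned version frequently compile to the same machine code.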
There is a widespread belief that compilers will optimize anything. This is true, but as I've written (drastically) in another post, GIGO (Garbage In = Garbage Out) applies. Compilers don't operate in a vacuum: given a set of rules, they'll perform safe optimizations on source code to the extent of their authors' imagination and competence in code optimization. Bloated source code will become optimized bloated machine code. In the same manner, lean and mean source code will become optimized lean and mean machine code. In critical places it is possible to feed the compiler source code that it "feels" (YES! they do have personalities) absolutely comfortable optimizing, and the resulting machine code will fly.
We've all experienced poorly performing software. If we're lucky we've experienced software that performs incredibly well. One developer can learn to write a piece of code that performs well in the same amount of time that another writes code that performs poorly.

When should I consider the performance impact of a function call?

In a recent conversation with a fellow programmer, I asserted that "if you're writing the same code more than once, it's probably a good idea to refactor that functionality such that it can be called once from each of those places."
My fellow programmer buddy instead insisted that the performance impact of making these function calls was not acceptable.
Now, I'm not looking for validation of who was right. I'm simply curious to know if there are situations or patterns where I should consider the performance impact of a function call before refactoring.
"My fellow programmer buddy instead insisted that the performance impact of making these function calls was not acceptable."
...to which the proper answer is "Prove it."
The old saw about premature optimization applies here. Anyone who isn't familiar with it needs to be educated before they do any more harm.
IMHO, if you don't have the attitude that you'd rather spend a couple hours writing a routine that can be used for both than 10 seconds cutting and pasting code, you don't deserve to call yourself a coder.
Don't even consider the effect of calling overhead if the code isn't in a loop that's being called millions of times, in an area where the user is likely to notice the difference. Once you've met those conditions, go ahead and profile to see if your worries are justified.
Modern compilers of languages such as Java will inline certain function calls anyway. My opinion is that the design is far more important than the few instructions spent on a function call. The only situation I can think of would be writing some really fine-tuned code in assembler.
You need to ask yourself several questions:
Cost of time spent on optimizing code vs cost of throwing more hardware at it.
How does this impact maintainability?
How does going in either direction impact your deadline?
Does this really call for optimization when many modern compilers will do it for you anyway? Do not try to outsmart the compiler.
And of course, which will help you sleep better at night? :)
My bet is that there was a time in which the performance cost of a call to an external method or function WAS something to be concerned with, in the same way that the lengths of variable names and such all needed to be evaluated with respect to performance implications.
With the monumental increases in processor speed and memory resources in the last two decades, I propose that these concerns are no longer as pertinent as they once were.
We have been able to use long variable names without concern for some time, and the cost of a call to external code is probably negligible in most cases.
There might be exceptions. If you place a function call within a large loop, you may see some impact, depending upon the number of iterations.
I propose that in most cases you will find that refactoring code into discrete function calls will have a negligible impact. There might be occasions in which there IS an impact; however, proper TESTING of a refactoring will reveal this. In that minority of cases, your friend might be correct. For most of the rest of the time, I propose that your friend is clinging a little too closely to practices which pre-date most modern processors and storage media.
You care about function call overhead the same time you care about any other overhead: when your performance profiling tool indicates that it's a problem.
For the C/C++ family:
The "cost" of the call is not important. If it needs to be fast, you just have to make sure the compiler is able to inline it. That means that:
The body must be visible to the compiler.
The body is small enough to be considered an inline candidate.
The method does not require dynamic dispatch.
There are a few ways to break this default ability. For example:
A huge instruction count already at the call site. Even with early inlining, the compiler may keep a trivial function out of line (even though that could generate more instructions/slower execution). Early inlining is the compiler's ability to inline a function early on, when it sees that the call costs more than the inline.
Recursion.
The inline keyword is more or less useless in this era, as far as its original intent goes. However, many compilers offer a means to restore that meaning with a compiler-specific directive. Using this directive (correctly) helps considerably, but learning how to use it correctly takes time. If in doubt, omit the directive and leave it up to the compiler.
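As a concrete C++ sketch of those conditions and directives (the macros shown are the usual MSVC and GCC/Clang spellings; the functions themselves are made up):

// Small body, visible to the caller (e.g. defined in a header), no dynamic
// dispatch: an easy inlining candidate even without any keyword.
inline int squared(int x) { return x * x; }

// Compiler-specific directives that restore a strong "please inline" request.
#if defined(_MSC_VER)
  #define FORCE_INLINE __forceinline
#else
  #define FORCE_INLINE inline __attribute__((always_inline))
#endif

FORCE_INLINE int clampToByte(int x) {
    return x < 0 ? 0 : (x > 255 ? 255 : x);
}

// A virtual call, by contrast, requires dynamic dispatch and is normally only
// inlined when the compiler can prove the concrete type (devirtualization).
struct Shape {
    virtual double area() const = 0;
    virtual ~Shape() = default;
};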
Assuming you are using a modern compiler, there is no excuse to avoid the function, unless you're also willing to go down to assembly for this particular program.
As it stands, if performance is crucial, you really have two choices:
1) Learn to write well-organized programs for speed. Downside: longer compile times.
2) Maintain a poorly written program.
I prefer 1, any day.
(Yes, I have spent a lot of time writing performance-critical programs.)

What are good heuristics for inlining functions?

Considering that you're trying solely to optimize for speed, what are good heuristics for deciding whether to inline a function or not? Obviously code size should be important, but are there any other factors typically used when (say) gcc or icc is determining whether to inline a function call? Has there been any significant academic work in the area?
Wikipedia has a few paragraphs about this, with some links at the bottom:
In addition to memory size and cache issues, another consideration is register pressure. From the compiler's point of view "the added variables from the inlined procedure may consume additional registers, and in an area where register pressure is already high this may force spilling, which causes additional RAM accesses."
Languages with JIT compilers and runtime class loading have other tradeoffs since the virtual methods aren't known statically, yet the JIT can collect runtime profiling information, such as method call frequency:
Design, Implementation, and Evaluation of Optimizations in a Just-in-Time Compiler (for Java) talks about method inlining of static methods and dynamically loaded classes and its improvements on performance.
Practicing JUDO: Java Under Dynamic Optimizations claims that their "inlining policy is based on the code size and profiling information. If the execution frequency of a method entry is below a certain threshold, the method is then not inlined because it is regarded as a cold method. To avoid code explosion, we do not inline a method with a bytecode size of more than 25 bytes. . . . To avoid inlining along a deep call chain, inlining stops when the accumulated inlined bytecode size along the call chain exceeds 40 bytes." Although they have runtime profiling information (method call frequency) they are still careful to avoid inlining large functions or chains of functions to prevent bloat.
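To make that kind of policy concrete, here is a rough C++ sketch of the decision logic described above (the 25- and 40-byte limits are the ones quoted; the hotness threshold and everything else are illustrative, not the JUDO implementation):

#include <cstddef>

struct InlineCandidate {
    std::size_t   calleeBytecodeSize;   // size of the method being considered
    std::size_t   inlinedSoFarOnChain;  // bytecode already inlined along this call chain
    unsigned long entryCount;           // profiled execution frequency of the method entry
};

bool shouldInline(const InlineCandidate& c) {
    const unsigned long kHotThreshold = 1000;  // illustrative "cold method" cutoff
    const std::size_t   kMaxCallee    = 25;    // per-method size limit quoted above
    const std::size_t   kMaxChain     = 40;    // accumulated limit along the call chain

    if (c.entryCount < kHotThreshold) return false;                              // cold method
    if (c.calleeBytecodeSize > kMaxCallee) return false;                         // avoid code explosion
    if (c.inlinedSoFarOnChain + c.calleeBytecodeSize > kMaxChain) return false;  // too deep a chain
    return true;
}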
A search on Google Scholar reveals a number of papers, such as
The effect of code expanding optimizations on instruction cache design
Function Inlining under Code Size Constraints for Embedded Processors
A search on Google Books reveals quite a number of books with papers or chapters about function inlining in various contexts.
The Compiler Design Handbook: Optimizations and Machine Code Generation has a chapter about Statistical and Machine Learning Techniques in Compiler Design, with heuristics to set various parameters and profiling of the results. This chapter references the Vaswani et al. paper Microarchitecture Sensitive Empirical Models for Compiler Optimizations, where they propose "the use of empirical modeling techniques for building microarchitecture sensitive models for compiler optimizations".
(Some other books talk about inlining from the programmer's point of view, such as C++ for Game Programmers, which talks about the dangers of inlining functions too often and the differences between inlining and macros. Compilers often ignore the programmer's inline requests if they can determine that they would do more harm than good; this can be overridden with macros as a last resort.)
A function call implies some additional code (the function prologue, where the new stack frame is set up, and the function epilogue, where it's cleaned up). If your compiler sees that the function code is small in comparison to the prologue and epilogue, it can decide it's not worth it to make an actual call, and will inline the function.
The only benefit I see of calling a function instead of inlining it is size-related. I guess inlining a function and then unrolling a loop can result in a significant size increase.
As far as I have seen, function size is the only factor compilers use to determine inlining. However, if you do profile-guided optimization (PGO), I believe the compiler is able to use other variables, such as the number of calls or call setup time.
In .NET it is mostly based on size. Measure the size of the parent function and child function in compiled bytes. Then measure the size of the combined function. If the combined function is smaller, then inlining is a good idea.
The reason for this is to make it possible to shove as much code into the CPU's cache as possible. Cache misses are far more expensive than function calls in modern CPUs.
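Expressed as a C++ sketch (the rule as described above, not the actual JIT code):

#include <cstddef>

// Inline purely by size: if the combined body compiles smaller than the two
// separate bodies (because the call/return sequence and argument shuffling
// disappear), inlining also helps the code fit in the instruction cache.
bool inlineBySize(std::size_t parentBytes, std::size_t childBytes,
                  std::size_t combinedBytes) {
    return combinedBytes < parentBytes + childBytes;
}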
