Is a statically linked executable faster than a dynamically linked executable? - performance

Since the dynamically linked libraries have to be resolved at run-time, are statically linked executables faster than dynamically linked executables?

Static linking produces a larger executable file than dynamic linking because the library code the program uses is copied directly into the executable. The benefit is a reduction in overhead, since calls to library functions no longer go through the indirection that dynamically resolved calls need, and anywhere from somewhat to noticeably faster load times.
A dynamically linked executable will be smaller because it places calls at runtime to shared code libraries. There are several benefits to this, but the ones important from a speed or optimization perspective are the reduction in the amount of disk space and memory consumed, and improved multitasking because of reduced total resource consumption (particularly in Windows).
So it's a tradeoff: there are arguments to be made why either one might be marginally faster. It would depend on a lot of different things, such as to what extent speed-critical routines in the program relied on calls to library functions. But the important point to emphasize in the above statement is that it might be marginally faster. The speed difference will be nearly imperceptible, and difficult to distinguish even from normal, expected fluctuations.
If you really care, benchmark it and see. But I advise this is a waste of time, and that there are more effective and more important ways to increase your application's speed. You will be much better off in the long run considering factors other than speed when making the "to dynamically link or to statically link" decision. For example, static linking may be worth considering if you need to make your application easier to deploy, particularly to diverse user environments. Or, dynamic linking may be a better option (particularly if those shared libraries are not your own) because your application will automatically reap the benefits of upgrades made to any of the shared libraries that it calls without having to lift a finger.
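If you do decide to benchmark, a minimal sketch might look like the following. It times repeated calls to one C library function (std::strlen, chosen arbitrarily); the helper name ns_per_strlen_call and the input string are illustrative assumptions, not taken from any real project.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>

// Hypothetical helper: average nanoseconds per call to std::strlen.
// The volatile objects keep the compiler from folding the calls away.
double ns_per_strlen_call(std::size_t iterations) {
    assert(iterations > 0);
    static const char text[] = "a moderately long string used as benchmark input";
    const char* volatile input = text;  // pointer is re-read on every iteration
    volatile std::size_t sink = 0;      // keeps the result observable

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        sink = sink + std::strlen(input);
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Call it from a small main that prints the result, then build the same file twice with GCC, once as usual and once with the -static flag, and compare the two executables over several runs. Expect the difference to be close to the run-to-run noise, and note that the compiler may still expand some library functions inline regardless of how the program is linked.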
EDIT: Microsoft makes the specific recommendation that you prefer dynamic linking over static linking:
It is not recommended to redistribute C/C++ applications that statically link to Visual C++ libraries. It is often mistakenly assumed that by statically linking your program to Visual C++ libraries it is possible to significantly improve the performance of an application. However the impact on performance of dynamically loading Visual C++ libraries is insignificant in almost all cases. Furthermore, static linking does not allow for servicing the application and its dependent libraries by either the application's author or Microsoft. For example, consider an application that is statically linked to a particular library, running on a client computer with a new version of this library. The application still uses code from the previous version of this library, and does not benefit from library improvements, such as security enhancements. Authors of C/C++ applications are strongly advised to think through the servicing scenario before deciding to statically link to dependent libraries, and use dynamic linking whenever possible.

It depends on the state of your disk and whether or not the DLLs might be used in other processes. A cold start happens when your program and its DLLs were never loaded before. An EXE without DLLs has a faster cold start since only one file needs to be found. You would need a badly fragmented disk that's almost full for this not to be the case.
A DLL can start to pay off when it is already loaded in another process. Now the code pages of the DLL are simply shared, startup overhead is very low and memory usage is efficient.
A somewhat similar case is a warm start, a startup where the DLL is still available in the file system cache from a previous time it was used. The difference between a cold and a warm start can be quite significant on a sluggish disk. That's one reason everybody likes an SSD.

No, I don't think so. In most cases, keeping only one copy of a library in memory, shared by every program that uses it, means the system as a whole uses less memory. Suppose you have 100 programs that each link libc statically, and libc is ~2-3MB: every one of those programs grows by that amount.
With dynamic linking, that code can be shared, so fewer bytes in memory means more of them fit in the caches, and more bytes in cache means faster execution.
Though dynamic linking has loading overhead, overall system performance is better.
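The arithmetic behind that argument can be sketched as follows. The numbers (100 programs, a 2.5 MiB library) and the function names are illustrative assumptions; in practice a static link usually pulls in only the parts of the library that are referenced, and the OS only loads pages that are touched, so both totals are upper bounds.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMiB = 1024 * 1024;

// Worst case for static linking: every program carries its own full copy.
constexpr std::uint64_t static_copies_bytes(std::uint64_t programs,
                                            std::uint64_t library_bytes) {
    return programs * library_bytes;
}

// Best case for dynamic linking: one copy of the code pages shared by all.
constexpr std::uint64_t shared_copy_bytes(std::uint64_t programs,
                                          std::uint64_t library_bytes) {
    return programs > 0 ? library_bytes : 0;
}

// 100 programs x 2.5 MiB = 250 MiB when nothing is shared.
static_assert(static_copies_bytes(100, 5 * kMiB / 2) == 250 * kMiB,
              "static total");
```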

Related

Why prefer distributing a shared library with executables instead of linking statically?

Scenario: two unrelated pieces of software are going to be distributed with their own copy of the same shared library. They will both be installed on the same machine (running Windows), and they're going to be run at the same time.
In this scenario - from my understanding, the two programs won't share the library in memory without somehow specifying it - which doesn't seem to be the norm (correct me if I'm wrong)... In other words, most or all of the programs that use this library will have their own copy of it, both in memory and on disk, which is the same as what statically linked programs would have - roughly speaking.
Is it preferable for the writers of each program to ship the shared library (together with their programs) over linking with the library statically, or is the difference negligible?

How compilation and linking at runtime is happening?

In a tutorial I've encountered a new concept (for me), that I never thought is possible. Actually, I thought that compilation is an entirely pre-run-time process. This is the phrase from tutorial: "Compile time occurs before link time (when the output of one or more compiled files are joined together) and runtime (when a program is executed). In some programming languages it may be necessary for some compilation and linking to occur at runtime".
My questions are:
Are the pre-run-time compilation and linking processes fundamentally different from run-time compilation and linking? If yes, please explain the main differences.
How are code sections that need to be compiled (or linked) during run-time marked, and where is that information kept? (This may differ from language to language; if possible, please give a specific example.)
Thank you very much for your time!
Runtime compilation
The best (most well known) example I'm personally aware of is the just-in-time compilation used by Java. As you might know, Java code is compiled into bytecode, which can be interpreted by the Java Virtual Machine. It's therefore different from, let's say, C++, which is first fully (preprocessed) compiled (and linked) into an executable that can be run directly by the OS without any virtual machine.
The Java bytecode is instead interpreted by the VM, which maps it to processor-specific instructions. That being said, the JVM does JIT, which takes that bytecode and compiles it (during runtime) into machine code. Here we arrive at your second question. Even in Java it can depend on which JVM you are using, but basically there are pieces of code called hotspots: the pieces of code that are run frequently and which might be compiled so the application's performance improves. This is done during runtime because the normal compiler does not have (or well might not have) all the necessary data to judge which pieces of code are in fact run frequently. Therefore JIT requires some kind of runtime statistics gathering, which is done by the JVM in parallel with program execution. What kind of statistics are gathered, what can be optimised (compiled at runtime), etc. depends on the implementation. You obviously cannot do everything a normal compiler would do due to memory and time constraints, which I guess partly answers the first question: you don't compile everything, and usually only a limited set of optimisations is supported in runtime compilation. You can try looking for such info, but in my experience it's usually very badly documented and hard to find (at least when it comes to official sources, not presentations/blogs etc.).
Runtime linking
The linker is a different pair of shoes. We cannot use the Java example anymore since it doesn't really have a linker like C or C++ (instead it has a classloader which takes care of loading files and putting it all together).
Usually linking is performed by a linker after the compilation step (static linking). This has pros (no dependencies) and cons (a higher memory footprint, as we cannot use a shared library, and when the library version changes you need to rebuild your program).
Runtime linking (dynamic/late linking) is actually performed by the OS: it's the OS loader's job to first load shared libraries and then attach them to a running process. This has the benefits of library sharing and of not having to rebuild when the library version changes, but also drawbacks: what if different programs use the same library but require different versions (look up DLL hell)? Furthermore, there are different types of dynamic linking: explicit and implicit. So yes, those two concepts are also quite different.
Again, how it's all done, and how it's decided what should be linked and how, is OS specific; for instance, Microsoft has the dynamic-link library concept.
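As a rough illustration of explicit runtime linking, here is a minimal sketch for a POSIX system using dlopen, dlsym and dlclose. The library name libm.so.6 is an assumption that holds on typical glibc-based Linux systems, and the function name call_cos_from_shared_library is made up for this example. On Windows the corresponding calls are LoadLibrary, GetProcAddress and FreeLibrary.

```cpp
#include <cassert>
#include <dlfcn.h>  // POSIX: dlopen, dlsym, dlclose

// Explicit run-time linking: load a shared library by name, look up one
// symbol, call it, and unload the library again.
// Returns false if the library or the symbol cannot be found.
bool call_cos_from_shared_library(double x, double& result) {
    // Assumed library name; it differs on non-glibc systems.
    void* handle = dlopen("libm.so.6", RTLD_LAZY);
    if (handle == nullptr) {
        return false;
    }

    using cos_fn = double (*)(double);
    // Casting the void* from dlsym to a function pointer is the usual
    // POSIX idiom, although C++ only conditionally supports it.
    auto cos_ptr = reinterpret_cast<cos_fn>(dlsym(handle, "cos"));
    const bool found = (cos_ptr != nullptr);
    if (found) {
        result = cos_ptr(x);
    }
    dlclose(handle);
    return found;
}
```

With implicit linking, by contrast, the program simply calls cos, the library is named at build time, and the loader resolves the symbol when the process starts. On older glibc versions the explicit variant also needs -ldl on the link line.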

When does the MacOSX 'free' library call madvise, and is there any way to control it?

I've got a C++ program that is notably slower on OSX 10.8.2 than on Linux. Profiling shows that the reason is that calls to free (that result from STL operations, FWIW), are much slower on OSX, because they go and call madvise, and real time gets consumed in there.
Is there any way to modulate this behavior of OS/X?
Well, yes!
I had horrible performance issues with malloc/free in Linux and started looking for a replacement.
Two options came to mind: tbbmalloc (part of Intel TBB, which is free BTW) and Google malloc.
After extensive testing it wasn't clear which of the two was faster, but both were significantly faster than libc's implementation.
I went with tbbmalloc since it worked more smoothly. Google malloc had a bug that caused virtual memory to be very large (reserved but not committed), which was very bad for my app (IT daemons would kill it).
The good:
Much better performance than libc's malloc: 3x-300x faster in an STL-heavy app.
Simple integration. No code change. Add/change 1 line in the executable's makefile. No change to SOs.
The bad:
Memory checkers will not work with replacement allocators. For memcheck/valgrind/etc., revert to the original malloc.
App would take 10-30% more memory.
The app I developed was a CAD application that used tens of GBs, building and destroying tens of millions of various structures (lots of STL maps, vectors, hash_maps).
How to do this:
In the linker command, add -ltbbmalloc and make sure the library is in the lib search path (-L flag).
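To check whether a replacement allocator helps in a case like this, an allocation-heavy workload can be timed with and without it. The sketch below is a hypothetical stand-in for such a workload (the function name and sizes are made up); it builds and destroys std::map objects, which allocate one heap node per element.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical workload: repeatedly build and destroy maps of small vectors.
// Returns the elapsed time in milliseconds.
double churn_maps_ms(std::size_t rounds, std::size_t elements_per_map) {
    assert(rounds > 0 && elements_per_map > 0);
    std::size_t checksum = 0;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t r = 0; r < rounds; ++r) {
        std::map<std::size_t, std::vector<int>> m;
        for (std::size_t i = 0; i < elements_per_map; ++i) {
            // One allocation for the map node, one for the vector buffer.
            m[i].assign(4, static_cast<int>(i));
        }
        checksum += m.size();
        // m goes out of scope here, freeing every node and buffer.
    }
    const auto stop = std::chrono::steady_clock::now();

    assert(checksum == rounds * elements_per_map);
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Build the program once as usual and once with the extra linker flags, then compare the timings. As far as I know, TBB performs its transparent replacement of malloc and free through a separate proxy library (tbbmalloc_proxy), so it is worth checking the TBB documentation for your version to see which library you actually need to link.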

What are the possible side effects of using GCC profiling flag -pg?

There is a device driver for a camera device provided to us as a .so library file by the vendor.
Only the header file with the API is available, which lists the functions we can use to work with the device. Our application links against the vendor's .so library file and uses those interface functions.
When we wanted to measure the time our application takes to handle different tasks, we added the GCC -pg flag and rebuilt our application.
But with the executable built with -pg, we observe random failures in the camera image-acquire functions. Since we only have the .so library file, we do not know what is going wrong inside those functions.
So in general I want to understand the possible reasons for such a failure mode. Any pointers or documents that explain what goes on inside profiling, and its side effects, are appreciated.
This answer is a helpful overview of how the gcc -pg flag profiler actually works. The take-home point is mostly to do with possible changes to timing. If your library has any kind of time-sensitivity in it, introducing profiler overheads might be changing the time it takes to execute parts of the code, and perhaps violating some kind of constraint.
The gprof documentation explains the implementation details:
Profiling works by changing how every function in your program is compiled so that when it is called, it will stash away some information about where it was called from. From this, the profiler can figure out what function called it, and can count how many times it was called. This change is made by the compiler when your program is compiled with the `-pg' option, which causes every function to call mcount (or _mcount, or __mcount, depending on the OS and compiler) as one of its first operations.
So the timing of your application would change quite a bit when you turn on -pg.
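To make the quoted mechanism concrete, here is a hand-written imitation of it. It is only a sketch: record_call and call_counts are hypothetical names, and the real mcount is inserted by the compiler and records caller and callee addresses rather than function names.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the profiler's bookkeeping.
std::map<std::string, unsigned long>& call_counts() {
    static std::map<std::string, unsigned long> counts;
    return counts;
}

// Hypothetical stand-in for mcount: note each call by function name.
void record_call(const char* function_name) {
    ++call_counts()[function_name];
}

// With -pg, the compiler adds the equivalent of the first line for you.
int square(int x) {
    record_call(__func__);
    return x * x;
}

int sum_of_squares(int n) {
    record_call(__func__);
    int total = 0;
    for (int i = 1; i <= n; ++i) {
        total += square(i);
    }
    return total;
}
```

Every call now pays for the bookkeeping on top of its real work. For a tiny function like square, that extra cost can easily exceed the function body itself, which is the kind of timing shift that may upset time-sensitive code.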
If you would like to profile your code without significantly affecting the timings, you could look at oprofile. It does not impose as significant an overhead as gprof does.
Another fairly recent tool that serves as a good lightweight profiling tool is perf.
The profiling tools are useful primarily in understanding the CPU bound pieces of your library/application and can help you optimize those critical pieces. Most of the time they serve to identify some culprit function/method which wastes CPU cycles. So do not use it as the sole piece for debugging any and all issues.
Most vendor libraries would also provide means to turn on extra debugging or dumping extra information during runtime. They include means such as environment variables, log files, /proc or /sys interfaces for drivers, etc. and sometimes even tools to increase debugging levels at runtime. See if you can leverage these.
If you have defined APIs in a library/driver, you should run unit-tests on them instead of trying to debug the whole application you've built.
If you find a certain unit-test fails, send the source code of the unit-test to your vendor, and ask them to fix the bug. If it is not a bug, your vendor would at least point you towards the right set of APIs or the semantics to use.

Windows malloc replacement (e.g., tcmalloc) and dynamic crt linking

A C++ program that uses several DLLs and QT should be equipped with a malloc replacement (like tcmalloc) for performance problems that can be verified to be caused by Windows malloc. With linux, there is no problem, but with windows, there are several approaches, and I find none of them appealing:
1. Put new malloc in lib and make sure to link it first (Other SO-question)
This has the disadvantage, that for example strdup will still use the old malloc and a free may crash the program.
2. Remove malloc from the static libcrt library with lib.exe (Chrome)
This is tested/used(?) for chrome/chromium, but has the disadvantage that it only works when statically linking the CRT. Static linking has the problem that if a system library is linked dynamically against msvcrt, there may be mismatches in heap allocation/deallocation. If I understand it correctly, tcmalloc could be linked dynamically such that there is a common heap for all self-compiled DLLs (which is good).
3. Patch crt-source code (firefox)
Firefox's jemalloc apparently patches the windows CRT source code and builds a new crt. This has again the static/dynamic linking problem above.
One could think of using this to generate a dynamic MSVCRT, but I think this is not possible, because the license forbids providing a patched MSVCRT with the same name.
4. Dynamically patching loaded CRT at run time
Some commercial memory allocators can do such magic. tcmalloc can do it too, but this seems rather ugly. It had some issues, but they have been fixed. Currently, with tcmalloc this does not work under 64-bit Windows.
Are there better approaches? Any comments?
Q: A C++ program that is split across several DLLs should:
A) replace malloc?
B) ensure that allocation and de-allocation happens in the same dll module?
A: The correct answer is B. A C++ application design that incorporates multiple DLLs SHOULD include a mechanism to ensure that things allocated on the heap in one DLL are freed by the same DLL module.
Why would you split a C++ program into several DLLs anyway? By C++ program I mean that the objects and types you are dealing with are C++ templates, STL objects, classes etc. You CAN'T pass C++ objects across DLL boundaries without either a lot of very careful design and lots of compiler-specific magic, or suffering from massive duplication of object code in the various DLLs, and as a result an application that is extremely version sensitive. Any small change to a class definition will force a rebuild of all EXEs and DLLs, removing at least one of the major benefits of a DLL approach to app development.
Either stick to a straight C interface between app and dll's, suffer hell, or just compile the entire c++ app as one exe.
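A straight C interface that keeps allocation and deallocation inside one module could be sketched like this. The Widget type and the widget_* functions are hypothetical, and the Windows export decorations (such as __declspec(dllexport)) are left out to keep the sketch portable.

```cpp
#include <cassert>
#include <new>

// Interface seen by callers: the type is opaque, and both creation and
// destruction are functions of the module that owns the memory.
extern "C" {
    struct Widget;                        // opaque handle
    Widget* widget_create(int initial);   // allocates inside the DLL
    int     widget_value(const Widget* w);
    void    widget_destroy(Widget* w);    // frees inside the same DLL
}

// Implementation, which would be compiled into the DLL only.
struct Widget {
    int value;
};

extern "C" Widget* widget_create(int initial) {
    return new (std::nothrow) Widget{initial};
}

extern "C" int widget_value(const Widget* w) {
    return w != nullptr ? w->value : 0;
}

extern "C" void widget_destroy(Widget* w) {
    delete w;  // released by the same runtime that allocated it
}
```

The caller never calls delete or free on a Widget itself, so it does not matter which CRT or allocator the calling module was built against.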
It's a bold claim that a C++ program "should be equipped with a malloc replacement (like tcmalloc) for performance problems...."
"[In] 6 out of 8 popular benchmarks ... [real-sized applications] replacing back the custom allocator, in which people had invested significant amounts of time and money, ... with the system-provided dumb allocator [yielded] better performance. ... The simplest custom allocators, tuned for very special situations, are the only ones that can provide gains." --Andrei Alexandrescu
Most system allocators are about as good as a general purpose allocator can be. You can do better only if you have a very specific allocation pattern.
Typically, such special patterns apply only to a portion of the program, in which case, it's better to apply the custom allocator to the specific portion that can benefit than it is to globally replace the allocator.
C++ provides a few ways to selectively replace the allocator. For example, you can provide an allocator to an STL container or you can override new and delete on a class by class basis. Both of these give you much better control than any hack which globally replaces the allocator.
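As a sketch of the first option, here is a minimal C++11 allocator attached to a single container type. CountingAllocator and counted_bytes are hypothetical names; this version only counts bytes and forwards to ::operator new, whereas a real special-purpose allocator would draw from a pool or arena tuned to the allocation pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Total bytes requested through CountingAllocator.
std::size_t& counted_bytes() {
    static std::size_t bytes = 0;
    return bytes;
}

template <typename T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        counted_bytes() += n * sizeof(T);
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p);
    }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) {
    return true;
}
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) {
    return false;
}

// Only this container type uses the custom allocator.
using CountedIntVector = std::vector<int, CountingAllocator<int>>;
```

Everything else in the program keeps using the default allocator, which is the point of replacing it selectively.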
Note also that replacing malloc and free will not necessarily change the allocator used by operators new and delete. While the global new operator is typically implemented using malloc, there is no requirement that it do so. So replacing malloc may not even affect most of the allocations.
If you're using C, chances are you can wrap or replace key malloc and free calls with your custom allocator just where it matters and leave the rest of the program to use the default allocator. (If that's not the case, you might want to consider some refactoring.)
System allocators have decades of development behind them. They are stable and well-tested. They perform extremely well for general cases (in terms of raw speed, thread contention, and fragmentation). They have debugging versions for leak detection and support for tracking tools. Some even improve the security of your application by providing defenses against heap buffer overrun vulnerabilities. Chances are, the libraries you want to use have been tested only with the system allocator.
Most of the techniques to replace the system allocator forfeit these benefits. In some cases, they can even increase memory demand (because they can't be shared with the DLL runtime possibly used by other processes). They also tend to be extremely fragile in the face of changes in the compiler version, runtime version, and even OS version. Using a tweaked version of the runtime prevents your users from getting benefits of runtime updates from the OS vendor. Why give all that up when you can retain those benefits by applying a custom allocator just to the exceptional part of the program that can benefit from it?
Where does your premise "A C++ program that uses several DLLs and QT should be equipped with a malloc replacement" come from?
On Windows, if all the DLLs use the shared MSVCRT, then there is no need to replace malloc. By default, Qt builds against the shared MSVCRT DLL.
One will run into problems if they:
1) mix dlls that use static linking vs using the shared VCRT
2) AND also free memory that was not allocated where it came from (ie, free memory in a statically linked dll that was allocated by the shared VCRT or vice versa).
Note that adding your own ref-counted wrapper around a resource can help mitigate the problems associated with resources that need to be deallocated in particular ways (i.e., a wrapper that disposes of one type of resource via a callback to the originating DLL, a different wrapper for a resource that originates from another DLL, etc.).
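One way to sketch such a wrapper is std::shared_ptr with a custom deleter that calls back into the module that produced the resource. Resource, dll_acquire, dll_release and released_count are hypothetical stand-ins; in a real program the acquire and release functions would be exported by the originating DLL.

```cpp
#include <cassert>
#include <memory>

struct Resource {
    int id;
};

// Counts releases so the behaviour can be observed in this sketch.
int& released_count() {
    static int n = 0;
    return n;
}

// Stand-ins for functions exported by the DLL that owns the resource.
Resource* dll_acquire(int id) { return new Resource{id}; }
void dll_release(Resource* r) {
    ++released_count();
    delete r;
}

// The last owner to go away calls the release function that came with
// the resource, instead of calling delete in its own module.
std::shared_ptr<Resource> wrap_resource(Resource* raw,
                                        void (*release)(Resource*)) {
    return std::shared_ptr<Resource>(raw, release);
}
```

A resource from a different DLL would be wrapped the same way but with that DLL's release function, so each resource goes back to the heap it came from.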
nedmalloc? also NB that smplayer uses a special patch to override malloc, which may be the direction you're headed in.
