I'm trying to understand how the -pg (or -p) flag works when compiling C code with GCC.
The official GCC documentation only states:
-pg
Generate extra code to write profile information suitable for the analysis program gprof. You must use this option when compiling the source files you want data about, and you must also use it when linking.
This really interests me, as I'm doing some research on profilers and trying to pick the best tool for the job.
Compiling with -pg instruments your code so that gprof can report detailed information. See gprof's manual, 9.1 Implementation of Profiling:
Profiling works by changing how every function in your program is compiled so that when it is called, it will stash away some information about where it was called from. From this, the profiler can figure out what function called it, and can count how many times it was called. This change is made by the compiler when your program is compiled with the -pg option, which causes every function to call mcount (or _mcount, or __mcount, depending on the OS and compiler) as one of its first operations.
The mcount routine, included in the profiling library, is responsible for recording in an in-memory call graph table both its parent routine (the child) and its parent's parent. This is typically done by examining the stack frame to find both the address of the child, and the return address in the original parent. Since this is a very machine-dependent operation, mcount itself is typically a short assembly-language stub routine that extracts the required information, and then calls __mcount_internal (a normal C function) with two arguments—frompc and selfpc. __mcount_internal is responsible for maintaining the in-memory call graph, which records frompc, selfpc, and the number of times each of these call arcs was traversed.
...
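To see this concretely, you can compare the assembly GCC emits for a trivial function with and without the flag (foo.c and foo are placeholder names; as the manual notes, the exact symbol may be mcount, _mcount, or __mcount depending on the OS and compiler):

    /* foo.c -- compare the two outputs:
     *   gcc -S foo.c       (no profiling call in the prologue)
     *   gcc -pg -S foo.c   (prologue now contains a call to mcount)
     */
    void foo(void) {}

In the -pg version of foo.s, the prologue of foo gains an extra call to the mcount stub before the function body executes.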
Please note that with such an instrumenting profiler, you are no longer running exactly the code you would ship in a release build without profiling instrumentation: there is an overhead associated with the instrumentation code itself, and the instrumentation may alter instruction and data cache usage.
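To gauge that overhead for your own workload, the end-to-end workflow is cheap to try (myprog is a placeholder; gmon.out is written to the current working directory when the instrumented program exits normally):

    gcc -pg -o myprog myprog.c   # -pg is needed at both compile and link time
    ./myprog                     # run normally; writes gmon.out on exit
    gprof myprog gmon.out        # print the flat profile and call graph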
Contrary to an instrumenting profiler, a sampling profiler like Intel VTune works on uninstrumented code by looking at the target program's program counter at regular intervals using operating system interrupts. It can also query special CPU registers to give you even more insight into what's going on.
See also Profilers Instrumenting Vs Sampling.
This link gives a brief explanation of how gprof works.
This link gives an extensive critique of it.
(Check my answer to the archived question.)
From "Measuring Function Duration with Ftrace":
Instrumentation comes in two main forms: explicitly declared tracepoints, and implicit tracepoints. Explicit tracepoints consist of developer-defined declarations which specify the location of the tracepoint, and additional information about what data should be collected at a particular trace site. Implicit tracepoints are placed into the code automatically by the compiler, either due to compiler flags or by developer redefinition of commonly used macros.
To instrument functions implicitly, when the kernel is configured to support function tracing, the kernel build system adds -pg to the flags used with the compiler. This causes the compiler to add code to the prologue of each function, which calls a special assembly routine called mcount. This compiler option is specifically intended to be used for profiling and tracing purposes.
Related
I have a library that is currently dynamically linked against glibc.
This library is dynamically loaded into an application that is also dynamically linked against glibc. I have no control over the application, only over the shared object.
However, sometimes loading the library causes the application to get SIGKILLd because it has pretty strict real-time requirements and rlimits set accordingly. Looking at this with a profiler tells me that most of the time is actually spent in the linker. So essentially dynamic linking is actually too slow (sometimes). Well that's not a problem I ever thought I'd have :)
I was hoping to solve this issue by producing a statically linked shared object. However, googling the issue and reading multiple other SO threads warned me against trying to statically link glibc. But those seem to be glibc-specific issues.
So my question is, if I were to statically link this shared library against musl and then let a (dynamically linked) glibc application dlopen it, would that be safe? Is there a problem in general with multiple libc's?
Looking at this with a profiler tells me that most of the time is actually spent in the linker.
Something is very wrong with your profiling methodology.
First, the "linker" does not run when the application runs; only the loader (aka rtld, aka ld-linux) does. I assume you meant the loader, not the linker.
Second, the loader does have some runtime cost at startup, but since every function you call is resolved only once, the proportion of total runtime spent in the loader should quickly approach zero for any application that runs for an appreciable time (longer than about a minute).
So essentially dynamic linking is actually too slow (sometimes).
You can ask the loader to resolve all dynamic symbols in your shared library at load time by linking with the -Wl,-z,now linker flag.
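For example (libfoo.so and foo.o are placeholder names):

    gcc -shared -fPIC -o libfoo.so foo.o -Wl,-z,now

This sets the DF_BIND_NOW flag in the library's dynamic section, so the loader binds all of its symbols eagerly at load time instead of lazily at first call.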
if I were to statically link this shared library against musl and then let a (dynamically linked) glibc application dlopen it, would that be safe?
Not only would this not be safe, it would most likely not work at all (except for the most trivial of shared libraries).
Is there a problem in general with multiple libc's?
Linking multiple libc's into a single process will cause too many problems to count.
Update:
resolving all symbols at load time is exactly the opposite of what I want, as the process gets sigkilled during loading of the shared object, after that it runs fine.
It sounds from this that you are using dlopen while the process is already executing time-critical real-time tasks.
That is not a wise thing to do: dlopen (among other things) calls malloc, reads data from disk, performs mmap calls, etc. etc. All of these require locks, and can wait arbitrarily long.
The usual solution is for the application to perform initialization (of which loading your library would be a part) before entering the time-critical loop.
Since you are not in control of the application, the only thing you can do is tell the application developers that their current requirements (if these are in fact their requirements) are not satisfiable: they must provide some way to perform initialization before entering the time-critical section, or they will always risk a SIGKILL. Making your library load faster will only make that SIGKILL appear with lower frequency; it will not remove the risk completely.
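A minimal sketch of that idea, assuming the application developers expose such an initialization phase (plugin_init and its path argument are hypothetical names, not something the application currently provides):

    /* Hypothetical entry point the host application would call once during
     * startup, before it enters the time-critical loop.
     * Link with -ldl on older glibc versions. */
    #include <dlfcn.h>
    #include <stdio.h>

    void *plugin_init(const char *path)
    {
        /* RTLD_NOW forces all symbol resolution to happen here, up front,
         * so no lazy binding (with its locks and page-ins) occurs later. */
        void *handle = dlopen(path, RTLD_NOW);
        if (handle == NULL)
            fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return handle;
    }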
Update 2:
yes, i'm aware that the best I can do is lower the frequency and not "solve" the problem, only try to mitigate it.
You should look into prelink. It can dramatically lower the time required to perform relocations. It's not a guarantee that your chosen prelink address will be available, so you may still get SIGKILLed sometimes, but this could be an effective mitigation.
It is theoretically possible to do something like that, but you will have to write a new version of the musl startup code that copes with the fact that the thread pointer and TCB have already been set up by glibc, and run that code from an ELF constructor in the shared object. Some musl functionality will be unavailable due to TCB layout differences.
I don't think it is likely that this will solve your actual problem. Even if the problem is timing-related, it is possible that this hack makes things worse, because it increases the number of run-time relocations needed.
A great comment on my answer describing how to use linker scripts to make a ctor-like function list pointed out that recent GNU ld has much improved support for grafting new sections into system linker scripts with -Wl,-T... and INSERT BEFORE/INSERT AFTER. This got me thinking about other linker script tricks.
For a network card firmware I modified the linker script to group together the runtime modules of the firmware so that they would all be in a contiguous block that could be in L1 cache without conflicts. To clean up stragglers (where I couldn't group by .o) I used section attributes on individual functions. Performance counters verified that it actually worked (reduced L1 instruction cache misses to almost nothing).
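For those stragglers, the combination looks roughly like this (the section name .hot.text and the function are illustrative, not from the original firmware):

    /* C side: pin an individual function into a named section. */
    __attribute__((section(".hot.text")))
    void fast_path(void) { /* hot-loop code */ }

    /* GNU ld script fragment, grafted in with -Wl,-T as described above:
     * collect everything placed in .hot.text into one contiguous block. */
    SECTIONS
    {
        .hot.text : { *(.hot.text) }
    } INSERT BEFORE .text;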
What other clever things have you accomplished with linker scripts?
On a certain platform, for reasons I won't go into, I needed to have a section of executable which I could discard after load. Now unfortunately unmapping the memory for the executable was not possible so I was compelled to resort to linker trickery.
What I ended up doing was introducing a section of the executable which aliased the bss. That way, presuming I could sneak some code in early enough, I could copy the data out, reinitialize the bss, and, so long as my aliased section was smaller than the total bss of the executable, pay no cost for the privilege. There are a couple of problems, in that I couldn't really change the crt at all and the earliest point I could inject code was still after TLS initialization (which used some bss), but nothing impossible to work around.
I'm still sort of surprised it worked; I would have thought that the bss was initialized by the crt after all the program sections were loaded. I haven't tried it on any platform where I have access to the loader or crt source.
Does GCC generate reentrant code for all scenarios?
No, you must write reentrant code.
Reentrancy is something that ISO C and C++ are capable of by design, so that includes GCC. It is still your responsibility to code the function for reentrancy.
A C compiler that does not generate reentrant code even when a function is coded correctly for reentrancy would be the exception rather than the rule, and would be so for reasons of architectural constraint (such as having insufficient resources to support a stack, forcing the compiler to generate static frames). In these situations the compiler documentation should make this clear.
Some articles you might read:
Jack Ganssle on Reentrancy in 1993
Same author in 2001 on the same subject
No, GCC makes no such guarantee for the code you write. Here is a good link on writing re-entrant code:
https://www.ibm.com/support/knowledgecenter/en/ssw_aix_71/generalprogramming/writing_reentrant_thread_safe_code.html
Re-entrancy is not something that the compiler has any control over - it's up to the programmer to write re-entrant code. To do this you need to avoid all the obvious pitfalls, e.g. globals (including local static variables), shared resources, threads, calls to other non-reentrant functions, etc.
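To make the static-variable pitfall concrete, here is a minimal sketch (the function names are illustrative):

    #include <stdio.h>
    #include <stddef.h>

    /* Non-reentrant: every caller shares the same static buffer, so a
     * second, overlapping call clobbers the result of the first. */
    char *itoa_bad(int v)
    {
        static char buf[16];
        snprintf(buf, sizeof buf, "%d", v);
        return buf;
    }

    /* Reentrant: the caller supplies the storage; no hidden state. */
    char *itoa_good(int v, char *buf, size_t len)
    {
        snprintf(buf, len, "%d", v);
        return buf;
    }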
Having said that, some cross-compilers for small embedded systems, e.g. 8051, may not generate reentrant code by default, and you may have to request reentrant code for specific functions via e.g. a #pragma.
GCC generates reentrant code on at least the majority of platforms it compiles for (especially if you avoid passing or returning structures by value) but it is possible that a particular language or platform ABI might dictate otherwise. You'll need to be much more specific for any more conclusive statement to be made; I know it's certainly basically reentrant on desktop processors if the code being compiled is itself basically reentrant (weird global state tricks can get you into trouble on any platform, of course).
No, GCC cannot possibly guarantee re-entrant code that you write.
However, on the major platforms, the code the compiler produces or includes, such as math intrinsics or helper function calls, is re-entrant. Since GCC doesn't support platforms where non-reentrant function calls are common, such as the 8051, there is little risk of a reentrancy issue in the compiler itself.
There are GCC ports which have bugs and issues, such as the MSP430 version.
I'm using g++ to compile and link a project consisting of about 15 c++ source files and 4 shared object files. Recently the linking time more than doubled, but I don't have the history of the makefile available to me. Is there any way to profile g++ to see what part of the linking is taking a long time?
Edit: After I noticed that the makefile was using -O3 optimizations all the time, I managed to halve the linking time just by removing that switch. Is there any good way I could have found this without trial and error?
Edit: I'm not actually interested in profiling how ld works. I'm interested in knowing how I can match increases in linking time to specific command line switches or object files.
Profiling g++ will prove futile, because g++ doesn't perform the linking; the linker, ld, does.
Profiling ld will also likely not show you anything interesting, because linking time is most often dominated by disk I/O, and if your link isn't, you won't know what to make of the profiling data unless you understand ld internals.
If your link time is noticeable with only 15 files in the link, there is likely something wrong with your development system [1]; either it has a disk that is on its last legs and is constantly retrying, or you do not have enough memory to perform the link (linking is often RAM-intensive), and your system swaps like crazy.
Assuming you are on an ELF based system, you may also wish to try the new gold linker (part of binutils), which is often several times faster than the GNU ld.
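If your GCC is recent enough, you can select gold for a single link with -fuse-ld=gold (otherwise, arrange for gold to be the ld found first on your PATH); myprog here is a placeholder:

    g++ -fuse-ld=gold -o myprog *.o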
[1] My typical links involve 1000s of objects, produce 200+MB executables, and finish in less than 60s.
If you have just hit your RAM limit, you'll probably be able to hear the disk working, and a system activity monitor will tell you so. But if linking is still CPU-bound (i.e. if CPU usage is high), that's not the issue. And if linking is I/O-bound, the most common culprit can be runtime info; have a look at the executable size anyway.
To answer your problem in a different way: are you making heavy use of templates? For each use of a template with a different type parameter, a new instance of the whole template is generated, so the linker gets more work. To make that actually noticeable, though, you'd need a library that is really heavy on templates; a lot of the ones from the Boost project qualify. I got template-based code bloat when using Boost.Spirit with a complex grammar: ~4000 lines of code compiled to a 7.7 MB executable, and changing one line doubled the number of specializations required and the size of the final executable. Inlining helped a lot, though, bringing the output down to 1.9 MB.
Shared libraries might be causing other problems; you might want to look at the documentation for -fvisibility=hidden, which will improve your code anyway. From the GCC manual for -fvisibility:
Using this feature can very substantially improve linking and load times of shared object libraries, produce more optimized code, provide near-perfect API export and prevent symbol clashes. It is *strongly* recommended that you use this in any shared objects you distribute.
In fact, the linker normally must support the possibility for the application, or for other libraries, to override symbols defined in the library, while typically this is not the intended usage. Note that using this is not free, however; it does require (trivial) code changes.
The link suggested by the docs is: http://gcc.gnu.org/wiki/Visibility
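A minimal sketch of those (trivial) code changes (file and function names are placeholders):

    /* foo.c -- build with:
     *   gcc -shared -fPIC -fvisibility=hidden -o libfoo.so foo.c */

    /* Hidden by -fvisibility=hidden: not exported and not interposable, so
     * calls to it can bypass the PLT and be optimized more aggressively. */
    int foo_internal(void) { return 41; }

    /* Explicitly marked default visibility: the library's public API. */
    __attribute__((visibility("default")))
    int foo_public(void) { return foo_internal() + 1; }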
Both gcc and g++ support the -v verbose flag, which makes them output details of the current task.
If you're interested in really profiling the tools, you may want to check out Sysprof or OProfile.
When writing C/C++ code, in order to debug the binary executable, the debug option must be enabled on the compiler/linker. In the case of GCC, the option is -g. When the debug option is enabled, how does this affect the binary executable? What additional data is stored in the file that allows the debugger to function as it does?
-g tells the compiler to store symbol table information in the executable. Among other things, this includes:
symbol names
type info for symbols
files and line numbers where the symbols came from
Debuggers use this information to output meaningful names for symbols and to associate instructions with particular lines in the source.
For some compilers, supplying -g will disable certain optimizations. For example, icc sets the default optimization level to -O0 with -g unless you explicitly indicate -O[123]. Also, even if you do supply -O[123], optimizations that prevent stack tracing will still be disabled (e.g. stripping frame pointers from stack frames; this has only a minor effect on performance).
With some compilers, -g will disable optimizations that can confuse where symbols came from (instruction reordering, loop unrolling, inlining etc). If you want to debug with optimization, you can use -g3 with gcc to get around some of this. Extra debug info will be included about macros, expansions, and functions that may have been inlined. This can allow debuggers and performance tools to map optimized code to the original source, but it's best effort. Some optimizations really mangle the code.
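For reference, the following invocations illustrate the difference (prog is a placeholder):

    gcc -g -O2 -o prog prog.c       # DWARF debug info alongside optimization
    gcc -g3 -O2 -o prog prog.c      # additionally records macro definitions
    readelf --debug-dump=info prog  # inspect the emitted DWARF entries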
For more info, take a look at DWARF, the debugging format originally designed to go along with ELF (the binary format used by Linux and other OSes).
A symbol table is added to the executable which maps function/variable names to data locations, so that debuggers can report back meaningful information rather than just pointers. This doesn't affect the speed of your program, and you can remove the symbol table with the 'strip' command.
In addition to the debugging and symbol information, google DWARF (a developer joke on ELF).
By default most compiler optimizations are turned off when debugging is enabled.
So the code is a pure translation of the source into machine code, rather than the result of the many highly specialized transformations that are applied to release binaries.
But the most important difference, in my opinion, is that memory in debug builds is usually initialized to compiler-specific values to facilitate debugging. In release builds, memory is not initialized unless the application code does so explicitly.
Check your compiler documentation for more information, but as an example, DevStudio uses:
0xCDCDCDCD Allocated in heap, but not initialized
0xDDDDDDDD Released heap memory.
0xFDFDFDFD "NoMansLand" fences automatically placed at boundary of heap memory. Should never be overwritten. If you do overwrite one, you're probably walking off the end of an array.
0xCCCCCCCC Allocated on stack, but not initialized
-g adds debugging information to the executable, such as the names of variables, the names of functions, and line numbers. This allows a debugger, such as gdb, to step through code line by line, set breakpoints, and inspect the values of variables. Because of this additional information, using -g increases the size of the executable.
Also, gcc allows you to use -g together with the -O flags, which turn on optimization. Debugging an optimized executable can be very tricky, because variables may be optimized away, or instructions may be executed in a different order. Generally, it is a good idea to turn off optimization when using -g, even though it results in much slower code.
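For example, a minimal gdb session enabled by -g (prog and some_var are placeholders):

    gcc -g -O0 -o prog prog.c
    gdb ./prog
    (gdb) break main        # set a breakpoint by function name
    (gdb) run
    (gdb) next              # step one source line at a time
    (gdb) print some_var    # inspect a variable by name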
Just as a matter of interest, you can crack open a hex editor and take a look at an executable produced with -g and one without. You can see the symbols and things that are added. It may change the assembly (-S) too, but I'm not sure.
There is some overlap with this question which covers the issue from the other side.
Some operating systems (like z/OS) produce a "side file" that contains the debug symbols. This helps avoid bloating the executable with extra information.