I'm trying to compile code on GCC that uses OpenACC to offload to an NVIDIA GPU but I haven't been able to find a similar compiler option to the one mentioned above. Is there a way to tell GCC to be more verbose on all operations related to offloading?
Unfortunately, GCC does not yet provide a user-friendly interface to such information (it's on the long TODO list...).
What you currently have to do is look at the dump files produced by -fdump-tree-[...] for the several compiler passes involved, and gather information that way, which requires an understanding of GCC internals. Clearly not quite ideal :-/ and "patches welcome" is probably not the answer you've been hoping for.
Typically, for a compiler it is rather trivial to produce diagnostic messages for wrong syntax in source code ("expected [...] before/after/instead of [...]"), but what you're looking for is diagnostic messages for failed optimizations, and similar, which is much harder to produce in a form that's actually useful for a user, and so far we (that is, the GCC developers) have not been able to spend the required amount of time on this.
I've recently played around with the target_clones attribute available from gcc 6.1 and onward. It's quite nifty, but, for now, it requires a somewhat clumsy approach; every function that one wants multi-versioned has to have an attribute declared manually. This is less than optimal because:
It puts compiler-specific stuff in the code.
It requires the developer to identify which functions should receive this treatment.
Let's take the example where I want to compile some code that will take advantage of AVX2 instructions, where available. -fopt-info-vect will tell me which functions were vectorized, if I build with -mavx2, so the compiler already knows this. Is there a way to, globally, tell the compiler: "If you find a function which you feel could be optimized with AVX2, make multiple versions, with and without AVX2, of that function."? And if not, can we have one, please?
I'm trying to create a tool similar to TraceGL, but for C-type languages:
As you can see, the tool above highlights, in red, code flows that were not executed.
In terms of building this tool for Objective-C, for example, I know that gcov (and libprofile_rt in clang) output data files that can help determine how many times a given line of code has been executed. However, would the gcov data files be able to tell me when a given line of code occurred during a program's execution?
For example, if line X is called during code paths A and B, would I be able to ascertain from the gcov data that code paths A and B called line X, given line X alone?
As far as I know, GCOV instrumentation data only tells you that a given point in the code was executed (and possibly how many times). There is no relationship recorded between the instrumented code points.
It sounds like what you want is to determine paths through the code. To do that, you either need to do static analysis of the code (requiring a full C parser, name resolver, and flow analyzer), or you need to couple the dynamic instrumentation points together in execution order.
The first requires machinery capable of processing C in all of its glory; you don't want to build that yourself. GCC, Clang, and our DMS Toolkit are choices. I know that GCC and Clang do pretty serious analysis; I'm pretty sure you could find at least intraprocedural control flow analysis; I know that DMS can do this. You'd have to customize GCC and Clang to extract this data. You'd have to configure DMS to extract this data; configuration is easier than customization because it is a design property rather than a "custom" action. YMMV.
Then, using the GCOV data, you could determine the flows between the GCOV data points. It isn't clear to me that this buys you anything beyond what you already get with just the static control flow analysis, unless your goal is to exhibit execution traces.
To do this dynamically, what you could do is force each data collection point in the instrumented code to note that it is the most recent point encountered; before doing that, it would record the most recent point encountered before it was. This would produce in effect a chain of references between points which would match the control flow. This has two problems from your point of view, I think: a) you'd have to modify GCOV or some other tool to insert this different kind of instrumentation, b) you have to worry about what and how you record "predecessors" when a data collection point gets hit more than once.
gcov (or lcov) is one option. It does produce most of the information you are looking for, though how often those files are updated depends on how often __gcov_flush() is called. It's not really intended to be real time, and does not include all of the information you are looking for (notably, the 'when'). There is a short summary of the gcov data format here and in the header file here. lcov data is described here.
For what you are looking for DTrace should be able to provide all of the information you need, and in real time. For Objective-C on Apple platforms there are dtrace probes for the runtime which allow you to trace pretty much anything. There are a number of useful guides and examples out there for learning about dtrace and how to write scripts. Brendan Gregg provides some really great examples. Big Nerd Ranch has done a series of articles on it.
I'm trying to understand how the -pg (or -p) flag works when compiling C code with GCC.
The official GCC documentation only states:
-pg
Generate extra code to write profile information suitable for the analysis program gprof. You must use this option when compiling the source files you want data about, and you must also use it when linking.
This really interests me, as I'm doing a small research on profilers. I'm trying to pick the best tool for the job.
Compiling with -pg instruments your code, so that Gprof reports detailed information. See gprof's manual, 9.1 Implementation of Profiling:
Profiling works by changing how every function in your program is compiled so that when it is called, it will stash away some information about where it was called from. From this, the profiler can figure out what function called it, and can count how many times it was called. This change is made by the compiler when your program is compiled with the -pg option, which causes every function to call mcount (or _mcount, or __mcount, depending on the OS and compiler) as one of its first operations.
The mcount routine, included in the profiling library, is responsible for recording in an in-memory call graph table both its parent routine (the child) and its parent's parent. This is typically done by examining the stack frame to find both the address of the child, and the return address in the original parent. Since this is a very machine-dependent operation, mcount itself is typically a short assembly-language stub routine that extracts the required information, and then calls __mcount_internal (a normal C function) with two arguments—frompc and selfpc. __mcount_internal is responsible for maintaining the in-memory call graph, which records frompc, selfpc, and the number of times each of these call arcs was traversed.
...
Please note that with such an instrumenting profiler, you're not profiling exactly the same code you would ship in a release build without instrumentation. There is overhead associated with the instrumentation code itself, and the instrumentation may alter instruction and data cache usage.
In contrast to an instrumenting profiler, a sampling profiler like Intel VTune works on non-instrumented code by looking at the target program's program counter at regular intervals using operating system interrupts. It can also query special CPU registers to give you even more insight into what's going on.
See also Profilers Instrumenting Vs Sampling.
This link gives a brief explanation of how gprof works.
This link gives an extensive critique of it.
(Check my answer to the archived question.)
From "Measuring Function Duration with Ftrace":
Instrumentation comes in two main forms: explicitly declared tracepoints, and implicit tracepoints. Explicit tracepoints consist of developer defined declarations which specify the location of the tracepoint, and additional information about what data should be collected at a particular trace site. Implicit tracepoints are placed into the code automatically by the compiler, either due to compiler flags or by developer redefinition of commonly used macros.
To instrument functions implicitly, when the kernel is configured to support function tracing, the kernel build system adds -pg to the flags used with the compiler. This causes the compiler to add code to the prologue of each function, which calls a special assembly routine called mcount. This compiler option is specifically intended to be used for profiling and tracing purposes.
Does GCC generate reentrant code for all scenarios?
No, you must write reentrant code yourself.
Reentrancy is something that ISO C and C++ are capable of by design, so that includes GCC. It is still your responsibility to code the function for reentrancy.
A C compiler that does not generate reentrant code even when a function is coded correctly for reentrancy would be the exception rather than the rule, and would be for reasons of architectural constraint (such as having insufficient resources to support stack, so generating static frames). In these situations the compiler documentation should make this clear.
Some articles you might read:
Jack Ganssle on Reentrancy in 1993
Same author in 2001 on the same subject
No, GCC cannot guarantee reentrancy for the code you write. Here is a good link on writing re-entrant code:
https://www.ibm.com/support/knowledgecenter/en/ssw_aix_71/generalprogramming/writing_reentrant_thread_safe_code.html
Re-entrancy is not something that the compiler has any control over - it's up to the programmer to write re-entrant code. To do this you need to avoid all the obvious pitfalls, e.g. globals (including local static variables), shared resources, threads, calls to other non-reentrant functions, etc.
Having said that, some cross-compilers for small embedded systems, e.g. 8051, may not generate reentrant code by default, and you may have to request reentrant code for specific functions via e.g. a #pragma.
GCC generates reentrant code on at least the majority of platforms it compiles for (especially if you avoid passing or returning structures by value) but it is possible that a particular language or platform ABI might dictate otherwise. You'll need to be much more specific for any more conclusive statement to be made; I know it's certainly basically reentrant on desktop processors if the code being compiled is itself basically reentrant (weird global state tricks can get you into trouble on any platform, of course).
No, GCC cannot possibly guarantee re-entrant code that you write.
However, on the major platforms, the code the compiler produces or includes, such as math intrinsics or library function calls, is re-entrant. As GCC doesn't support platforms where non-reentrant function calls are common, such as the 8051, there is little risk of a compiler issue with reentrancy.
There are GCC ports which have bugs and issues, such as the MSP430 version.
I'm using g++ to compile and link a project consisting of about 15 c++ source files and 4 shared object files. Recently the linking time more than doubled, but I don't have the history of the makefile available to me. Is there any way to profile g++ to see what part of the linking is taking a long time?
Edit: After I noticed that the makefile was using -O3 optimizations all the time, I managed to halve the linking time just by removing that switch. Is there any good way I could have found this without trial and error?
Edit: I'm not actually interested in profiling how ld works. I'm interested in knowing how I can match increases in linking time to specific command line switches or object files.
Profiling g++ will prove futile, because g++ doesn't perform the linking; the linker, ld, does.
Profiling ld will also likely not show you anything interesting, because linking time is most often dominated by disk I/O, and if your link isn't, you wouldn't know what to make of the profiling data, unless you understand ld internals.
If your link time is noticeable with only 15 files in the link, there is likely something wrong with your development system [1]; either it has a disk that is on its last legs and is constantly retrying, or you do not have enough memory to perform the link (linking is often RAM-intensive), and your system swaps like crazy.
Assuming you are on an ELF based system, you may also wish to try the new gold linker (part of binutils), which is often several times faster than the GNU ld.
[1] My typical links involve 1000s of objects, produce 200+MB executables, and finish in less than 60s.
If you have just hit your RAM limit, you'll probably be able to hear the disk working, and a system activity monitor will tell you so. But if linking is still CPU-bound (i.e. if CPU usage is still high), that's not the issue. And if linking is IO-bound, the most common culprit can be runtime info. Have a look at the executable size anyway.
To answer your problem in a different way: are you making heavy use of templates? For each use of a template with a different type parameter, a new instance of the whole template is generated, so the linker gets more work. To make that actually noticeable, though, you'd need a library that is really heavy on templates. Many libraries from the Boost project qualify; I got template-based code bloat when using Boost::Spirit with a complex grammar. About 4000 lines of code compiled to a 7.7 MB executable, and changing one line doubled the number of specializations required and the size of the final executable. Inlining helped a lot, though, bringing the output down to 1.9 MB.
Shared libraries might be causing other problems, you might want to look at documentation for -fvisibility=hidden, and it will improve your code anyway. From GCC manual for -fvisibility:
Using this feature can very substantially improve linking and load times of shared object libraries, produce more optimized code, provide near-perfect API export and prevent symbol clashes. It is *strongly* recommended that you use this in any shared objects you distribute.
In fact, the linker normally must support the possibility that the application or other libraries override symbols defined in the library, while typically this is not the intended usage. Note that using this is not free, however; it does require (trivial) code changes.
The link suggested by the docs is: http://gcc.gnu.org/wiki/Visibility
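A minimal sketch of the pattern (the macro and function names here are illustrative): the library is built with -fvisibility=hidden, and only the symbols explicitly marked with the visibility attribute are exported, which shrinks the dynamic symbol table the linker and loader must process.

```c
/* Sketch of the -fvisibility=hidden pattern.  Build the shared object with:
 *   gcc -shared -fPIC -fvisibility=hidden mylib.c -o libmylib.so
 * Only symbols explicitly marked "default" are then exported; everything
 * else stays internal to the library. */

#define API __attribute__((visibility("default")))

static int internal_helper(int x)   /* never exported */
{
    return x * 2;
}

API int mylib_compute(int x)        /* the one exported entry point */
{
    return internal_helper(x) + 1;
}
```

Checking the result with `nm -D libmylib.so` should then list mylib_compute but not internal_helper.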
Both gcc and g++ support the -v verbose flag, which makes them output details of the current task.
If you're interested in really profiling the tools, you may want to check out Sysprof or OProfile.