GCC code statistics/analysis - gcc

Does GCC/G++ have an option available to output analysis?
It would be useful to be able to compare differences between the previous code with the new one (size, sizes of classes/structures). Those can then be diff'd with the previous output for comparison, which could be useful for many purposes.
If no such output analysis is available, what is the best way to obtain such information?

GCCXML is a (GCC variant) that dumps symbol and type declaration data in an XML format. That may or may not have the "properties" you care about in them.
If you want specific information, you may be able to bend GCC to produce it. The learning curve for this is likely long and hard, because GCC wants to be a compiler, not a your-favorite-property-dumper, and it is a big, complex tool. You probably have some kind of chance with "struct size" as the compiler must compute that somewhere and it seems reasonable that it would be stored with information about the struct declaration. I'd expect that some of the command line switches do output some information and you might consider trying to emulate those. If you want really odd properties such as "register pressure inside a loop" you'll have to reach deeply inside a compiler.
If you want general properties derivable from the source code you will want to use a language-processing framework that has a strong C front end integrated into it. Clang is one. It likely has a learning curve similar to that for GCC, but is said to be better designed for tasks like yours. I have no specific experience with this.
Our DMS Software Reengineering Toolkit is explicitly designed to support such tasks. It has a full C Front End, with APIs for building full parse trees, symbol tables relating identifiers to their point of declaration, actual type, and full control and data flow analysis. DMS also has a and a full C++ Front End, with similar properties, but it does not yet provide flow analysis information. DMS lets you write arbitrary code on top of this compute whatever (arbitrary property) you like.

Related

Does a Tool for Automatically Visualizing a Project's Source Code's Control Flow In-Line Exist?

I would like to be able to use a tool that lets you visualize a program's control flow(s) in the context of its source code. To clarify, such a tool should basically show what happens in a program by spitting out a human-readable abstract syntax tree in the form of a multidigraph with nodes containing snippets of source-code translation units. The resulting graph initial node would, I presume, contain the block of code starting with a program's entry point (that'd be main for a C or C++ program.) New nodes would be created when a node needs to reference another block of code, whether that might be in the current file or in another one, and arrows would connect the nodes. Does such a tool exist, or would it have to be created from scratch?
You aren't going to get a tool that does this for arbitrary languages off the shelf. There are too many languages, each with its own syntax and semantics. You somehow need a tool per language. You might find such tools for very commonly used languages, e.g, Understand for Software.
I think that the only way to do this is to build metatools that enable the construction of language-specific tools relatively easily. Such a tool has to have the common machinery needed by all such language processing tools: strong parsers (so writing grammars for languages is relatively straightforward), AST construction machinery, symbol table support, routines to build control and data flow graphs. By providing such machinery, one can build language front ends for modest costs.
There's a class of tools that does this, program transformation. Most of them have parsing engines, but not the rest of the mechanisms I have suggested above.
I believe this enough to have invested 20 years of my life to building
such meta tools. Our DMS Software Reengineering toolkit shows its strength in being able to parse some 50+ languages, including the stunningly hard C++14 (both MS and GNU variants). It shows symbol table support and control flow graph construction for COBOL, Java, C, C++. (We can't do everything at once; pedaling as fast as practical).
[DMS builds these graphs as data structures rather than "showing" them; the examples on that page are drawn with the additional help of DOT].
One of the few other tools that tries to do this is Clang/LLVM; this covers a wide variety of popular languages. Clang doesn't have any specific support for parsing that I know about; you get to code it all yourself. I think you get control flow graphs only after you convert the language to LLVM. I don't think it has any specific support for drawing control flow graphs, either.
An older tool with a good reputation for multi-language support in this space is CoCo/R;
I don't know a lot about it. I know it parses,
and has some support for ASTs; I don't know what it does
about control flow analysis.

Compilers - stop in the middle of the process

I just started to learn about compilers, and I wondered - is there a way to tell GCC (or any other compiler) to stop in the middle of the process, and give me a file with the current state (for example - a token tree)?
In general, yes!
When building a compiler, often one of your most important debugging tools is to see the current representation of the program, or even better, the states throughout compilation, so that when a transformation goes wrong -- i.e. a compiler bug -- it can be identified.
Dumping all the state of a compiler can be a lot however; Even just the intermediate representation for a large function can be tens-of-thousands of lines of plain-text. Then, there's also all the associated data structures: Control-flow graphs, dependency information, symbols etc.
See this question for some more detail on answers for LLVM and GCC.

Static call graph generation for the Linux kernel

I'm looking for a tool to statically generate a call graph of the Linux kernel (for a given kernel configuration). The generated call graph should be "complete", in the sense that all calls are included, including potential indirect ones which we can assume are only done through the use of function pointers in the case of the Linux kernel.
For instance, this could be done by analyzing the function pointer types: this approach would lead to superfluous edges in the graph, but that's ok for me.
ncc seems to implement this idea, however I didn't succeed in making it work on the 3.0 kernel. Any other suggestions?
I'm guessing this approach could also lead to missing edges in cases where function pointer casts are used, so I'd also be interested in knowing whether this is likely in the Linux kernel.
As a side note, there seems to be other tools that are able to do semantic analysis of the source to infer potential pointer values, but AFAICT, none of them are design to be used in a project such as the Linux kernel.
Any help would be much appreciated.
We've done global points-to analysis (with indirect function pointers) and full call graph construction of monolithic C systems of 26 million lines (18,000 compilation units).
We did it using our DMS Software Reengineering Toolkit, its C Front End and its associated flow analysis machinery. The points-to analysis machinery (and the other analyses) are conservative; yes, you get some bogus points-to and therefore call edges as a consequence. These are pretty hard to avoid.
You can help such analyzers by providing certain crucial facts about key functions, and by harnessing knowledge such as "embedded systems [and OSes] tend not to have cycles in the call graph", which means you can eliminate some of these. Of course, you have to allow for exceptions; my moral: "in big systems, everything happens."
The particular problem included dynamically loaded(!) C modules using a special loading scheme specific to this particular software, but that just added to the problem.
Casts on function pointers shouldn't lose edges; a conservative analysis should simply assume that the cast pointer matches any function in the system with signature corresponding to the casted result. More problematic are casts which produce sort-of-compatible signatures; if you cast a function pointer to void* foo(uint) when the actual function being called accepts an int, the points to analysis will necessarily conservatively choose the wrong functions. You can't blame the analyzer for that; the cast lies in that case. Yes, we saw this kind of trash in the 26 million line system.
This is certainly the right scale for analyzing Linux (which I think is a mere 8 million lines or so :-). But we haven't tried it specifically on Linux.
Setting up this tool is complicated because you have to capture all the details about the compilations themselves, and in particular the configuration of the Linux kernal you want to generate. So you pretty much have to intercept the compiler calls to get the command line switches, etc.

What levels should static analyzers analyze?

I've noticed that some static analyzers operate on source code, while others operate on bytecode (e.g., FindBugs). I'm sure there are even some that work on object code.
My question is a simple one, what are the advantages and disadvantages of writing different kinds of static analyzers for different levels of analysis?
Under "static analyzers" I'm including linters, bug finders, and even full-blown verifiers.
And by levels of analysis I would include source code, high-level IRs, low-level IRs, bytecode, object code, and compiler plugins that have access to all phases.
These different facets can influence the level at which an analyzer may decide to work:
Designing a static analyzer is a lot of work. It would be a shame not to factor this work for several languages compiled to the same bytecode, especially when the bytecode retains most of the structure of the source program: Java (FindBugs), .NET (various tools related to Code Contracts). In some cases, the common target language was made up for the purpose of analysis although the compilation scheme wasn't following this path.
Related to 1, you may hope that your static analyzer will be a little less costly to write if it works on a normalized version of the program with a minimum number of constructs. When authoring static analyzers, having to write the treatment for repeat until when you have already written while do is a bother. You may structure your analyzer so that several functions are shared for these two cases, but the care-free way to handle this is to translate one to the other, or to translate the source to an intermediate language that only has one of them.
On the other hand as already pointed out in Flash Sheridan's answer, source code contains the most information. For instance, in languages with fuzzy semantics, bugs at the source level may be removed by compilation. C and C++ have numerous "undefined behaviors" where the compiler is allowed to do anything, including generating a program that works accidentally. Fine, you might think, if the bug is not in the executable it's not a problematic bug. But when you ever re-compile the program for another architecture or with the next version of the compiler, the bug may appear again. This is one reason for not doing the analysis after any phase that might potentially remove bugs.
Some properties can only be checked with reasonable precision on compiled code. That includes absence of compiler-introduced bugs as pointed out again by Flash Sheridan, but also worst-case execution time. Similarly, many languages do not let you know what floating-point code does precisely unless you look at the assembly generated by the compiler (this is because existing hardware does not make it convenient for them to guarantee more). The choice is then to write an imprecise source-level analyzer that takes into account all possibilities, or to analyze precisely one particular compilation of a floating-point program, as long as it is understood that it is that precise assembly code that will be executed.
Source code analysis is the most generally useful, of course; sometimes heuristics even need to analyze comments or formatting. But you’re right that even object code analysis can be necessary, e.g., to detect bugs introduced by GCC misfeatures. Thomas Reps, head of GrammaTech and a Wisconsin professor, gave a good talk on this at Stanford a couple of years ago: http://pages.cs.wisc.edu/~reps/#TOPLAS-WYSINWYX.

Assembly Analysis Tools

Does anyone have any suggestions for assembly file analysis tools? I'm attempting to analyze ARM/Thumb-2 ASM files generated by LLVM (or alternatively GCC) when passed the -S option. I'm particularly interested in instruction statistics at the basic block level, e.g. memory operation counts, etc. I may wind up rolling my own tool in Python, but was curious to see if there were any existing tools before I started.
Update: I've done a little searching, and found a good resource for disassembly tools / hex editors / etc here, but unfortunately it is mainly focused on x86 assembly, and also doesn't include any actual assembly file analyzers.
What you need is a tool for which you can define an assembly language syntax, and then build custom analyzers. You analyzers might be simple ("how much space does an instruction take?") or complex ("How many cycles will this isntruction take to execute?" [which depends on the preceding sequence of instructions and possibly a sophisticated model of the processor you care about]).
One designed specifically to do that is the New Jersey Machine Toolkit. It is really designed to build code generators and debuggers. I suspect it would be good at "instruction byte count". It isn't clear it is good at more sophisticated analyses. And I believe it insists you follow its syntax style, rather than yours.
One not designed specifically to do that, but good at parsing/analyzing langauges in general is our
DMS Software Reengineering Toolkit.
DMS can be given a grammar description for virtually any context free language (that covers most assembly language syntax) and can then parse a specific instance of that grammar (assembly code) into ASTs for further processing. We've done with with several assembly langauges, including the IBM 370, Motorola's 8 bit CPU line, and a rather peculiar DSP, without trouble.
You can specify an attribute grammar (computation over an AST) to DMS easily. These are great way to encode analyses that need just local information, such as "How big is this instruction?". For more complex analysese, you'll need a processor model that is driven from a series of instructions; passing such a machine model the ASTs for individual instructions would be an easy way to apply a machine model to compute more complex things as "How long does this instruction take?".
Other analyses such as control flow and data flow, are provided in generic form by DMS. You can use an attribute evaluator to collect local facts ("control-next for this instruction is...", "data from this instruction flows to,...") and feed them to the flow analyzers to compute global flow facts ("if I execute this instruction, what other instructions might be executed downstream?"..)
You do have to configure DMS for your particular (assembly) language. It is designed to be configured for tasks like these.
Yes, you can likely code all this in Python; after all, its a Turing machine. But likely not nearly as easily.
An additional benefit: DMS is willing to apply transformations to your code, based on your analyses. So you could implement your optimizer with it, too. After all, you need to connect the analysis indication the optimization is safe, to the actual optimization steps.
I have written many disassemblers, including arm and thumb. Not production quality but for the purposes of learning the assembler. For both the ARM and Thumb the ARM ARM (ARM Architectural Reference Manual) has a nice chart from which you can easily count up data operations from load/store, etc. maybe an hours worth of work, maybe two. At least up front, you would end up with data values being counted though.
The other poster may be right, as with the chart I am talking about it should be very simple to write a program to examine the ASCII looking for ldr, str, add, etc. No need to parse everything if you are interested in memory operations counts, etc. Of course the downside is that you are likely not going to be able to examine loops. One function may have a load and store, another may have a load and store but have it wrapped by a loop, causing many more memory operations once executed.
Not knowing what you really are interested in, my guess is you might want to simulate the code and count these sorts of things. I wrote a thumb simulator (thumbulator) that attempts to do just that. (and I have used it to compare llvm execution vs gcc execution when it comes to number of instructions executed, fetches, memory operations, etc) The problem may be that it is thumb only, no ARM no Thumb2. Thumb2 could be added easier than ARM. There exists an armulator from arm, which is in the gdb sources among other places. I cant remember now if it executes thumb2. My understanding is that when arm was using it would accurately tell you these sorts of statistics.
You can plug your statistics into LLVM code generator, it's quite flexible and it is already collecting some stats, which could be used as an example.

Resources