Does a Tool for Automatically Visualizing a Project's Source Code's Control Flow In-Line Exist? - static-analysis

I would like to be able to use a tool that lets you visualize a program's control flow(s) in the context of its source code. To clarify, such a tool should basically show what happens in a program by spitting out a human-readable abstract syntax tree in the form of a multidigraph with nodes containing snippets of source-code translation units. The resulting graph initial node would, I presume, contain the block of code starting with a program's entry point (that'd be main for a C or C++ program.) New nodes would be created when a node needs to reference another block of code, whether that might be in the current file or in another one, and arrows would connect the nodes. Does such a tool exist, or would it have to be created from scratch?

You aren't going to get a tool that does this for arbitrary languages off the shelf. There are too many languages, each with its own syntax and semantics; you effectively need one tool per language. You might find such tools for very commonly used languages, e.g., Understand for Software.
I think the only way to do this is to build metatools that enable the construction of language-specific tools relatively easily. Such a metatool has to provide the machinery common to all language-processing tools: strong parsers (so writing grammars for new languages is relatively straightforward), AST construction machinery, symbol table support, and routines to build control and data flow graphs. By providing such machinery, one can build language front ends at modest cost.
There's a class of tools that does this: program transformation systems. Most of them have parsing engines, but not the rest of the machinery I have suggested above.
I believe this enough to have invested 20 years of my life in building such metatools. Our DMS Software Reengineering Toolkit shows its strength in being able to parse some 50+ languages, including the stunningly hard C++14 (both MS and GNU variants). It has symbol table support and control flow graph construction for COBOL, Java, C, and C++. (We can't do everything at once; we're pedaling as fast as practical.)
[DMS builds these graphs as data structures rather than "showing" them; the examples on that page are drawn with the additional help of DOT].
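To make the DOT step concrete, here is a minimal sketch (in Python, with entirely made-up node contents) of how a control-flow graph held as a data structure can be serialized to DOT for rendering:

```python
# Sketch: rendering an in-memory control-flow graph as DOT text.
# The node ids, snippets, and edges below are invented for illustration.

def cfg_to_dot(nodes, edges):
    """nodes: {id: source snippet}; edges: [(src, dst, label)]."""
    lines = ["digraph cfg {", '  node [shape=box fontname="monospace"];']
    for node_id, snippet in nodes.items():
        # Escape quotes; "\l" left-justifies each source line in Graphviz.
        escaped = snippet.replace('"', '\\"').replace("\n", "\\l") + "\\l"
        lines.append(f'  {node_id} [label="{escaped}"];')
    for src, dst, label in edges:
        lines.append(f'  {src} -> {dst} [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)

nodes = {
    "entry": 'int main() {\n  if (argc > 1)',
    "then": '  puts("arg");',
    "exit": '  return 0;\n}',
}
edges = [("entry", "then", "true"), ("entry", "exit", "false"), ("then", "exit", "")]
print(cfg_to_dot(nodes, edges))
```

Piping the output through Graphviz's `dot -Tpng` draws the graph of code blocks the question asks about.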
One of the few other tools that tries to do this is Clang/LLVM, which covers a wide variety of popular languages. Clang doesn't have any generic parsing support that I know of; you get to code it all yourself. I believe you get control flow graphs only after lowering the language to LLVM IR, and I don't think it has any specific support for drawing control flow graphs, either.
An older tool with a good reputation for multi-language support in this space is CoCo/R; I don't know a lot about it. I know it parses and has some support for ASTs; I don't know what it does about control flow analysis.

Related

Building complete control flow graph for Linux kernel binary

Are there any tools that can build the control flow graph for an entire Linux kernel binary? For example, consider Linux kernel compiled for x86 architecture (vmlinux file).
Is it possible to determine all execution paths (disregarding indirect branches or other control flows that need runtime information) using static analysis only? Are there any tools suitable for this?
Our DMS Software Reengineering Toolkit with its C Front End can do this.
DMS provides generic parsing, control flow graph and call graph construction; the C front end provides C-specific parsing details and the logic for constructing C-specific flow graphs, including indirect gotos, as well as a points-to analysis that has been used on systems of some 16 million lines, so it should handle the Linux kernel. The flow graphs are produced one per compilation unit; the call graph is for a set of linked compilation units. All this information is available as DMS data structures, and/or exportable as XML if you insist and can stomach gigabytes of output.
You can see examples of Control flow, Data Flow, and Call graphs.
You can try CppDepend; it provides a powerful dependency graph with many features. However, you have to analyze the source code, not the binaries.
There are two tools (CodeViz and Egypt) that can generate call graphs during compilation.
I don't think they will help you much in learning the Linux kernel. Many execution paths depend on macros and runtime conditions, so the call graph generated by a static analyzer is not very practical. You will still need to use printk and dmesg to figure out what actually happens in some functions; in practice, printk is more useful than these tools.
GrammaTech CodeSonar can perform static analysis on binary code (https://www.grammatech.com/products/binary-analysis), and it allows you to visualize and navigate the control-flow graph. It is a commercial tool, though.

Process and tools used in refactoring software

I am looking for refactoring software for languages like C++/Java/C#... that provides information on how it does the refactoring.
What methods do these tools use to detect portions of code that need refactoring, and how do they preserve program integrity when applying changes? And, if possible, what tools do they use internally?
Thanks
If you want to understand how refactoring tools work, you need, as a foundation, to learn essentially how compilers work: parsing, symbol-table construction, various kinds of control and dataflow analysis, program analysis, program transformation. Refactoring engines build on top of this.
Details on how tools handle specific refactorings tend to be found in the software engineering research literature. Check out http://scholar.google.com, and use the search term "refactoring"; you'll get buried under papers that address different kinds of refactorings, and different approaches to doing them.
The question of finding a single refactoring tool for a list of languages is pretty hard to answer. Most refactoring tools are difficult to build (see the compiler technology discussion above), so you tend not to see "one" tool that does them all, but rather one tool per language/IDE. Language-specific tools are relatively easy to find: google "refactoring tool language".
One insight, however, is that the machinery behind such refactoring tasks has a lot of basic technology in common in the abstract; see my first paragraph above.
It is unfortunate that most refactoring tools are built by constructing all this machinery for just the one specific tool, which helps explain why they are hard to build, therefore expensive to build, and therefore rare. They are also built using traditional compiler techniques: traditional parsers (with limitations that cause people to continually complain) and procedural programming (after all, that's the way we've done it since the days of the pyramids, right?).
Tools called program transformation engines try to capture this shared commonality, thereby amortizing the cost of building the baseline machinery across many tasks, and make complex code transformations easier to implement by providing non-procedural means to express what needs to be done.
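As a toy illustration of that non-procedural flavor, here is a deliberately trivial rewrite expressed with Python's standard `ast` module standing in for a real transformation engine; the rule "replace `x == None` with `x is None`" is declared as a pattern match rather than as hand-written tree surgery:

```python
# Sketch: a tiny source-to-source transformation rule.
# Requires Python 3.9+ for ast.unparse.
import ast

class EqNoneToIsNone(ast.NodeTransformer):
    """Rewrite `x == None` into the idiomatic `x is None`."""
    def visit_Compare(self, node):
        self.generic_visit(node)  # transform nested comparisons first
        if (len(node.ops) == 1 and isinstance(node.ops[0], ast.Eq)
                and isinstance(node.comparators[0], ast.Constant)
                and node.comparators[0].value is None):
            node.ops = [ast.Is()]
        return node

source = "if result == None:\n    result = default"
tree = EqNoneToIsNone().visit(ast.parse(source))
print(ast.unparse(tree))
```

A real transformation engine generalizes this idea: the pattern and replacement are written in the surface syntax of the target language, and the engine handles parsing, matching, and regeneration for any language it has a grammar for.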
You can see an example of program transformations applied to Java; the message from that example is that the same transformation engine can be used to "refactor" code in other languages, which is, I think, what you were originally trying to ask about. (Full disclosure: I'm behind the tool in the example.)

GCC code statistics/analysis

Does GCC/G++ have an option available to output analysis data?
It would be useful to be able to compare the previous code with the new one (binary size, sizes of classes/structures). The output could then be diffed against the previous run, which could be useful for many purposes.
If no such analysis output is available, what is the best way to obtain this information?
GCCXML is a GCC variant that dumps symbol and type declaration data in an XML format. That may or may not include the "properties" you care about.
If you want specific information, you may be able to bend GCC to produce it. The learning curve for this is likely long and hard, because GCC wants to be a compiler, not a your-favorite-property-dumper, and it is a big, complex tool. You probably have some kind of chance with "struct size" as the compiler must compute that somewhere and it seems reasonable that it would be stored with information about the struct declaration. I'd expect that some of the command line switches do output some information and you might consider trying to emulate those. If you want really odd properties such as "register pressure inside a loop" you'll have to reach deeply inside a compiler.
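As an aside, if the property you want really is just struct sizes and field offsets under your platform's ABI, you can sometimes sidestep GCC entirely. Here is a sketch using Python's standard `ctypes`; the `Packet` struct is made up, mirroring a hypothetical C declaration:

```python
# Sketch: struct size and field offsets from the platform ABI, no GCC hacking.
# Hypothetical C declaration being mirrored:
#   struct Packet { char tag; int length; double payload; };
import ctypes

class Packet(ctypes.Structure):
    _fields_ = [
        ("tag", ctypes.c_char),
        ("length", ctypes.c_int),
        ("payload", ctypes.c_double),
    ]

print("sizeof(Packet) =", ctypes.sizeof(Packet))
for name, _ in Packet._fields_:
    # ctypes field descriptors expose the ABI-computed offset directly.
    print(f"  offsetof({name}) =", getattr(Packet, name).offset)
```

Dumping this for the structs you care about on each build, then diffing the text, gives the size-comparison workflow the question describes, without touching the compiler's internals.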
If you want general properties derivable from the source code you will want to use a language-processing framework that has a strong C front end integrated into it. Clang is one. It likely has a learning curve similar to that for GCC, but is said to be better designed for tasks like yours. I have no specific experience with this.
Our DMS Software Reengineering Toolkit is explicitly designed to support such tasks. It has a full C Front End, with APIs for building full parse trees, symbol tables relating identifiers to their point of declaration and actual type, and full control and data flow analysis. DMS also has a full C++ Front End with similar properties, but it does not yet provide flow analysis information. DMS lets you write arbitrary code on top of this to compute whatever property you like.

Assembly Analysis Tools

Does anyone have any suggestions for assembly file analysis tools? I'm attempting to analyze ARM/Thumb-2 ASM files generated by LLVM (or alternatively GCC) when passed the -S option. I'm particularly interested in instruction statistics at the basic block level, e.g. memory operation counts, etc. I may wind up rolling my own tool in Python, but was curious to see if there were any existing tools before I started.
Update: I've done a little searching, and found a good resource for disassembly tools / hex editors / etc here, but unfortunately it is mainly focused on x86 assembly, and also doesn't include any actual assembly file analyzers.
What you need is a tool in which you can define an assembly language syntax and then build custom analyzers. Your analyzers might be simple ("how much space does an instruction take?") or complex ("how many cycles will this instruction take to execute?", which depends on the preceding sequence of instructions and possibly a sophisticated model of the processor you care about).
One designed specifically to do that is the New Jersey Machine Toolkit. It is really designed to build code generators and debuggers. I suspect it would be good at "instruction byte count". It isn't clear it is good at more sophisticated analyses. And I believe it insists you follow its syntax style, rather than yours.
One not designed specifically to do that, but good at parsing/analyzing languages in general, is our DMS Software Reengineering Toolkit. DMS can be given a grammar description for virtually any context-free language (which covers most assembly-language syntax) and can then parse a specific instance of that grammar (assembly code) into ASTs for further processing. We've done this with several assembly languages, including the IBM 370, Motorola's 8-bit CPU line, and a rather peculiar DSP, without trouble.
You can specify an attribute grammar (a computation over an AST) to DMS easily. These are a great way to encode analyses that need only local information, such as "how big is this instruction?". For more complex analyses, you'll need a processor model driven by a series of instructions; passing the ASTs for individual instructions to such a machine model would be an easy way to compute things like "how long does this instruction take?".
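A sketch of that attribute-grammar flavor in plain Python; the "wide" opcode set and the 2-byte/4-byte widths are invented for illustration, not real ARM/Thumb encodings:

```python
# Sketch: a local "size" attribute per instruction, folded bottom-up into
# a per-block total, attribute-grammar style. Widths here are made up.

WIDE_OPCODES = {"bl", "movw", "movt"}   # pretend these encode as 4 bytes

def insn_size(mnemonic):
    """Local attribute: needs nothing but the instruction itself."""
    return 4 if mnemonic in WIDE_OPCODES else 2

def block_size(block):
    """Synthesized attribute: the block's value is computed from its children."""
    return sum(insn_size(mnemonic) for mnemonic, *_ in block)

block = [("push", "{lr}"), ("movw", "r0", "#100"), ("bl", "printf"), ("pop", "{pc}")]
print("block size:", block_size(block), "bytes")
```

The point is the shape of the computation: each node's attribute depends only on the node and its children, which is exactly the class of analysis attribute grammars make cheap to specify.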
Other analyses, such as control flow and data flow, are provided in generic form by DMS. You can use an attribute evaluator to collect local facts ("the control-next for this instruction is ...", "data from this instruction flows to ...") and feed them to the flow analyzers to compute global flow facts ("if I execute this instruction, what other instructions might be executed downstream?").
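That local-facts-to-global-facts step boils down to graph reachability; here is a minimal sketch over an invented "control-next" table:

```python
# Sketch: global downstream-reachability from local control-next facts.
from collections import deque

def reachable(control_next, start):
    """Which instructions might execute downstream of `start`?"""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for succ in control_next.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen - {start}

# Toy CFG over instruction ids: 1 branches to 2 or 4; 3 loops back to 1.
control_next = {1: [2, 4], 2: [3], 3: [1], 4: []}
print(sorted(reachable(control_next, 2)))
```

Note that the loop 3 -> 1 makes instruction 4 reachable from 2 even though no straight-line path connects them; this is why the worklist, rather than a single pass, is needed.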
You do have to configure DMS for your particular (assembly) language. It is designed to be configured for tasks like these.
Yes, you can likely code all this in Python; after all, it's a Turing machine. But likely not nearly as easily.
An additional benefit: DMS is willing to apply transformations to your code based on your analyses, so you could implement your optimizer with it, too. After all, you need to connect the analysis indicating an optimization is safe to the actual optimization steps.
I have written many disassemblers, including ARM and Thumb; not production quality, but written for the purpose of learning the assembler. For both ARM and Thumb, the ARM ARM (ARM Architectural Reference Manual) has a nice chart from which you can easily count up data operations, loads/stores, etc.; maybe an hour's worth of work, maybe two. One caveat, at least up front: literal data embedded in the instruction stream would end up being counted as well.
The other poster may be right: with the chart I am talking about, it should be very simple to write a program that examines the ASCII looking for ldr, str, add, etc. There is no need to parse everything if you are only interested in memory operation counts. The downside is that you will not account for loops: one function may have a load and a store, while another may have a load and a store wrapped in a loop, causing many more memory operations once executed.
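That grep-the-ASCII approach might look like this in Python; the mnemonic set and the sample listing are illustrative only, not a complete ARM/Thumb memory-op list:

```python
# Sketch: counting memory operations in a `-S` assembly listing by
# matching mnemonics, with no real parsing. Mnemonic set is illustrative.
import re
from collections import Counter

MEM_OPS = re.compile(r"^\s+(ldr|ldrb|ldrh|str|strb|strh|push|pop)\b")

def count_mem_ops(asm_text):
    counts = Counter()
    for line in asm_text.splitlines():
        match = MEM_OPS.match(line)     # labels/directives have no indent
        if match:
            counts[match.group(1)] += 1
    return counts

asm = """\
main:
\tpush\t{r4, lr}
\tldr\tr0, [r4]
\tstr\tr0, [sp, #4]
\tadd\tr0, r0, #1
\tpop\t{r4, pc}
"""
print(count_mem_ops(asm))
```

As the answer notes, this gives static counts only; an instruction inside a loop is still counted once, so dynamic behavior needs simulation instead.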
Not knowing what you are really interested in, my guess is you might want to simulate the code and count these sorts of things. I wrote a Thumb simulator (thumbulator) that attempts to do just that (and I have used it to compare LLVM execution vs. GCC execution in terms of the number of instructions executed, fetches, memory operations, etc.). The problem may be that it is Thumb only: no ARM, no Thumb2. Thumb2 could be added more easily than ARM. There is an armulator from ARM, which is in the gdb sources among other places; I can't remember whether it executes Thumb2. My understanding is that when ARM was maintaining it, it would accurately report these sorts of statistics.
You can plug your statistics into the LLVM code generator; it's quite flexible, and it already collects some stats, which could be used as an example.

Java or C for image processing

I am looking into learning a programming language (taking a course) for use in image analysis and processing, and possibly bioinformatics too. Which language should I go for, C or Java? Other languages are not an option for me. Please also explain why either language is the better option for my application.
You have to balance raw processing power against developer time. Java is getting pretty fast too, and if you finish a couple of days early, you have more time to process the data.
It all depends on volume.
More importantly, I suggest you look for the libraries and frameworks that already exist, see which fits closest to what needs to be done, and choose whatever language the library was written in, be it C, Java, or Fortran.
For Java I found BioJava.org as a starting point.
Java isn't too bad for image processing. If you manage your source objects appropriately, you'll have a chance at getting reasonable performance out of it. Some of the things I like in Java related to imaging:
Java Advanced Imaging
2D Graphics utilities (take a look at BufferedImages)
ImageJ, etc
Get it to work with JAMA
Ask someone in the field you're working in (i.e., bioinformatics).
For solar images, the majority of the work is done in IDL, Fortran, Matlab, Python, C or Perl (PDL). (Roughly in that order ... IDL is definitely first, as the majority of the instrument calibration software is written in IDL)
Because of this, there's a lot of toolkits already written in those languages for our field. Frequently, with large reference data sets, the PI releases some software package as an example of how to interpret / interact with the data format. I can only assume that Bioinformatics would be similar.
If you end up going a different route than the rest of the field, you're going to have a much harder time working with other scientists as you can't share code as easily.
Note: a number of visualization tools written in Java have been released in our field, but they assume that the images have already been prepped by some other process.
The most popular computer vision (image processing, image analysis) library is OpenCV, which is written in C++ but can also be used from Python and Java (the official OpenCV4Android and the unofficial JavaCV).
There are bioinformatics applications that are basically image processing, and OpenCV will take care of those. But there are also some that are not; they are based, for example, on machine learning, so if you need something beyond image/video processing you will need another bioinformatics-oriented library. OpenCV also has a machine learning module, but it is more focused on computer vision.
About the languages, C vs. Java: most has been said in the other answers. I should add that these libraries are now C++ based, not plain C. If your application has real-time processing needs, C++ will probably be better for that; if not, Java will be more than enough, as it is friendlier.
Ideally, you would use something like Java or (even better) Python for "high-level" stuff, and compile in C the routines that require a lot of processing power (for instance using Cython, etc).
Some scientific libraries exist for Python (SciPy and NumPy), and they are a good start, although it isn't yet straightforward to combine Python and C (you need to tweak things a bit).
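For a feel of the kind of per-pixel routine that is pleasant to prototype in a high-level language but worth pushing down into C (or NumPy) once it becomes hot, here is a pure-Python luma (grayscale) conversion over a tiny made-up raster:

```python
# Sketch: a per-pixel hot loop in pure Python, on an invented 2x2 RGB raster.
# This is exactly the kind of routine you would later rewrite in C/Cython.

def to_grayscale(rgb_rows):
    """Rec. 601 luma approximation; rgb_rows is rows of (r, g, b) tuples."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]

image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
print(to_grayscale(image))
```

The interpreted loop body runs once per pixel, which is why the per-pixel-cost argument in this thread matters: the same arithmetic compiled in C, or vectorized across the whole array, avoids that per-pixel interpreter overhead.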
Just my two pence worth: Java doesn't allow the use of pointers, as opposed to C/C++ or C#. So if you are going to manipulate pixels directly, i.e., write your own image processing functions, they will be much slower than the equivalent in C++. On the other hand, C++ is a total nightmare of a language compared to Java; it will take you at least twice as long to write the equivalent bit of code. So with all the productivity gain, you can probably afford to buy a computer that makes up for the difference in runtime ;-)
I know other languages aren't an option for you, but personally I can highly recommend C# for image processing or computer vision: it allows pointers, and hence IP functions in C# are only about half as slow as in C++ (an acceptable trade-off, I think), and it has excellent integration with native C++ and a good wrapper library for OpenCV.
Disclaimer: I work for TunaCode.
If you have to make a choice between different languages to get started in image processing, I would recommend starting with C++. You get raw pointer access, which is a must if you want to operate on individual pixels.
Next, what kind of imaging are you interested in? Just-for-fun image filters, or heavy stuff like motion estimation, tracking, and detection? For the latter, I would recommend taking a look at CUVILib, since sooner rather than later you will need performance in your imaging functionality, and that is what CUVI provides. You can use it standalone if it serves your purposes, or plug it in with other libraries like Intel IPP, ITK, OpenCV, etc.
