Apart from the fact that a source distribution allows a customized installation (including building a cross-compiler) while a precompiled binary does not, what other major differences are there between distributing GCC as source code and as precompiled binaries?
Distribution format has nothing to do with your question.
The differences are exactly the same as for any other program: if you have source code, you can debug, customize, or modify the program; if you only have binary, you can't (except to the degree that the program author has envisioned and explicitly programmed for).
Related
I'm looking into how the v8 compiler works. I read an article which states source code is tokenized, parsed, an AST is constructed, then bytecode is generated (https://medium.com/dailyjs/understanding-v8s-bytecode-317d46c94775)
Is this bytecode an intermediate representation?
Short answer: No. Usually people use the terms "bytecode" and "intermediate representation" to mean two different things.
Long answer: It depends a bit on your definition (but for most definitions, "no" is still the right answer).
"Bytecode" in virtual machines like V8 refers to a representation that is used as input for an interpreter. The article you linked to gives a good description.
"Intermediate representation" or IR usually refers to data that a compiler uses internally, as an intermediate step (hence the name) between its input (usually the AST = abstract syntax tree, i.e. parsed version of the source text) and its output (usually machine code or byte code, but it could be anything, as in a source-to-source compiler).
So in a traditional setup, you have:
source --(parser)--> AST --(compiler front-end)--> IR --(compiler back-end)--> machine code
where the IR is usually modified several times as the compiler performs various optimizations on it, before finally generating machine code from it. There can also be several different IRs; for example V8's earlier optimizing compiler ("Crankshaft") had two: high-level IR "Hydrogen" and low-level IR "Lithium", whereas V8's current optimizing compiler ("Turbofan") even has three: "JavaScript-level nodes", "Simplified nodes", and "Machine-level nodes".
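To make those stages concrete, here is a minimal, hedged sketch in C (all names are invented for illustration, and the parser is elided: the AST for 1 + 2 * 3 is built by hand). It lowers the AST to a flat three-address IR, then "emits" pseudo machine code:

    #include <stdio.h>

    /* AST node: op == 0 means a constant leaf */
    typedef struct Ast { char op; int value; struct Ast *lhs, *rhs; } Ast;

    /* IR instruction: dst = a op b; op == '#' loads the constant in 'a' */
    typedef struct { char op; int dst, a, b; } IrInstr;

    /* Front-end: lower the AST to IR, returning the result register */
    static int emit_ir(const Ast *n, IrInstr *ir, int *len, int *next_reg) {
        if (n->op == 0) {                       /* leaf: materialize constant */
            int r = (*next_reg)++;
            ir[(*len)++] = (IrInstr){ '#', r, n->value, 0 };
            return r;
        }
        int a = emit_ir(n->lhs, ir, len, next_reg);
        int b = emit_ir(n->rhs, ir, len, next_reg);
        int r = (*next_reg)++;
        ir[(*len)++] = (IrInstr){ n->op, r, a, b };
        return r;
    }

    int main(void) {
        /* AST for 1 + 2 * 3, as a parser would have built it */
        Ast one   = {0, 1, NULL, NULL};
        Ast two   = {0, 2, NULL, NULL};
        Ast three = {0, 3, NULL, NULL};
        Ast mul   = {'*', 0, &two, &three};
        Ast add   = {'+', 0, &one, &mul};

        IrInstr ir[16];
        int len = 0, reg = 0;
        emit_ir(&add, ir, &len, &reg);

        /* Back-end: print the IR as pseudo machine code */
        for (int i = 0; i < len; i++) {
            if (ir[i].op == '#')
                printf("mov r%d, %d\n", ir[i].dst, ir[i].a);
            else
                printf("%s r%d, r%d, r%d\n",
                       ir[i].op == '+' ? "add" : "mul",
                       ir[i].dst, ir[i].a, ir[i].b);
        }
        return 0;
    }

A real compiler would run its optimization passes over the IR array before emitting anything; here the "back-end" just pretty-prints it.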
Now if you wanted to draw the boxes in your whiteboard diagram of the system a little differently, then instead of having a "parser" and a "compiler" you could treat everything between source and machine code as one big "compiler" (which as a first step parses the source). In that case, the AST would be a form of intermediate representation. But, as stated above, usually when people use the term IR they mean "compiler IR", not the AST.
In a virtual machine like V8, the overall execution pipeline is more complicated than described above. It starts with:
source --(parser)--> AST --(bytecode generator)--> bytecode
This bytecode is primarily used as input for V8's interpreter.
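If you want to see it, V8 (and Node.js, which embeds it) can print the generated bytecode. A hedged sketch, since flag spellings and the output format vary by version, and the output below is abbreviated and annotated:

    // add.js
    function add(a, b) { return a + b; }
    add(1, 2);

    $ node --print-bytecode --print-bytecode-filter=add add.js
    [generated bytecode for function: add]
        Ldar a1          ; load argument 1 into the accumulator
        Add a0, [0]      ; add argument 0, recording type feedback
        Return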
As an optimization, when V8 decides to run a function through the optimizing compiler, it does not start with the source code and a parser again, but instead the optimizing compiler uses the bytecode as its input. In diagram form:
bytecode --(interpreter)--> program execution
bytecode --(compiler front-end)--> IR --(compiler back-end)--> machine code --(CPU)--> program execution
Now here's the part where your perspective comes in: since the bytecode in V8 is not only used as input for the interpreter, but also as input for the optimizing compiler and in that sense as a step on the way from source text to machine code, if you wanted to call it a special form of intermediate representation, you wouldn't technically be wrong. It would be an unusual definition of the term though. When a compiler theory textbook talks about "intermediate representation", it does not mean "bytecode".
I just started to learn about compilers, and I wondered - is there a way to tell GCC (or any other compiler) to stop in the middle of the process, and give me a file with the current state (for example - a token tree)?
In general, yes!
When building a compiler, often one of your most important debugging tools is being able to see the current representation of the program, or even better, its state throughout compilation, so that when a transformation goes wrong (i.e. a compiler bug) it can be identified.
Dumping all the state of a compiler can be a lot, however; even just the intermediate representation for a large function can be tens of thousands of lines of plain text. Then there are all the associated data structures: control-flow graphs, dependency information, symbol tables, etc.
See this question for some more detail on answers for LLVM and GCC.
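For GCC and Clang specifically, some real options along these lines (a non-exhaustive sketch; dump file names and formats vary by version):

    gcc -E file.c                                   # stop after preprocessing
    gcc -fdump-tree-original -c file.c              # dump the initial tree/GIMPLE form
    gcc -fdump-tree-all -fdump-rtl-all -c file.c    # dump every GIMPLE and RTL pass
    gcc -S file.c                                   # stop after generating assembly
    clang -fsyntax-only -Xclang -ast-dump file.c    # print Clang's AST
    clang -S -emit-llvm file.c                      # emit LLVM IR instead of an object file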
Unlike RANDOM_NUMBER and RAND() (http://gcc.gnu.org/onlinedocs/gfortran/RANDOM_005fNUMBER.html#RANDOM_005fNUMBER), gfortran 4.8.0 provides no documentation (brief or detailed) for a random number generator invoked as RANDOM(RAND). I'm using the Geany 1.23 front end for gfortran, and when I call RANDOM(RAND), "RANDOM" appears in brown while "RAND" appears in blue.
Any idea where I can find documentation for this built-in random number generator? I'm asking because although good random-number-generation practice dictates the use of portable code (and I do use it), RANDOM(RAND) appears to work equally well for my application.
I have no idea about Fortran or how to put this message as a comment, but I just wanted to point out that Geany's keywords are just a simple list in a text (config) file and don't necessarily represent a full or accurate set of language/builtin/standard keywords (though they usually do). Looking at Geany's commit history, the 'random' keyword has been in since the Fortran filetype was initially added, by someone who, as far as I know, neither speaks nor uses Fortran himself.
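For illustration only, the Fortran filetype definition is essentially a keyword list along these lines (the file location and key names vary by Geany version; this is an abbreviated sketch, not the verbatim file):

    # data/filedefs/filetypes.fortran (excerpt, illustrative)
    [keywords]
    primary=... do else end function if program random rand subroutine while ...

So a word being highlighted only means it appears in that list, not that the compiler actually documents or supports it.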
I've noticed that some static analyzers operate on source code, while others operate on bytecode (e.g., FindBugs). I'm sure there are even some that work on object code.
My question is a simple one, what are the advantages and disadvantages of writing different kinds of static analyzers for different levels of analysis?
Under "static analyzers" I'm including linters, bug finders, and even full-blown verifiers.
And by levels of analysis I would include source code, high-level IRs, low-level IRs, bytecode, object code, and compiler plugins that have access to all phases.
These different facets can influence the level at which an analyzer may decide to work:
Designing a static analyzer is a lot of work. It would be a shame not to reuse that work across several languages that compile to the same bytecode, especially when the bytecode retains most of the structure of the source program: Java (FindBugs), .NET (the various tools built around Code Contracts). In some cases, a common target language was even invented for the purpose of analysis, although the normal compilation scheme doesn't actually go through it.
Related to the previous point, you may also hope that your static analyzer will be a little less costly to write if it works on a normalized version of the program with a minimal number of constructs. When authoring a static analyzer, having to write the treatment for repeat ... until when you have already written while ... do is a bother. You may structure your analyzer so that several functions are shared between the two cases, but the carefree way to handle this is to translate one into the other, or to translate the source into an intermediate language that has only one of them.
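A small illustration in C terms (a hedged sketch; a real normalizer must also take care of break/continue and variable scoping):

    /* The analyzer implements only one loop form, so the front-end rewrites */
    do { body(); } while (cond());

    /* ...into the equivalent single construct it already handles: */
    body();
    while (cond()) { body(); }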
On the other hand, as already pointed out in Flash Sheridan's answer, source code contains the most information. For instance, in languages with fuzzy semantics, bugs at the source level may be removed by compilation. C and C++ have numerous "undefined behaviors" where the compiler is allowed to do anything, including generating a program that happens to work. Fine, you might think: if the bug is not in the executable, it's not a problematic bug. But whenever you recompile the program for another architecture, or with the next version of the compiler, the bug may reappear. This is one reason not to do the analysis after any phase that might remove bugs.
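A classic C instance of this (a hedged sketch; the exact outcome depends on the compiler and optimization level):

    #include <limits.h>
    #include <stdio.h>

    /* Intended as an overflow check, but x + 1 is undefined behavior in C
       when x == INT_MAX, so the compiler may assume it never overflows. */
    int will_overflow(int x) {
        return x + 1 < x;
    }

    int main(void) {
        /* gcc -O2 typically folds this to 0 (the check is deleted), while
           an unoptimized build on two's-complement hardware may print 1.
           Same source, different executables, bug visible in only one. */
        printf("%d\n", will_overflow(INT_MAX));
        return 0;
    }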
Some properties can only be checked with reasonable precision on compiled code. That includes absence of compiler-introduced bugs as pointed out again by Flash Sheridan, but also worst-case execution time. Similarly, many languages do not let you know what floating-point code does precisely unless you look at the assembly generated by the compiler (this is because existing hardware does not make it convenient for them to guarantee more). The choice is then to write an imprecise source-level analyzer that takes into account all possibilities, or to analyze precisely one particular compilation of a floating-point program, as long as it is understood that it is that precise assembly code that will be executed.
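For instance, in C (a hedged sketch; the outcome depends on the target and compiler flags, e.g. x87 versus SSE2 code generation):

    /* Only the generated assembly reveals whether the two products are
       rounded to 64-bit doubles before the addition (SSE2) or kept in
       80-bit x87 registers until the end; the two strategies can round
       differently, so the same source line yields different results. */
    double dot2(double a, double b, double c, double d) {
        return a * b + c * d;
    }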
Source code analysis is the most generally useful, of course; sometimes heuristics even need to analyze comments or formatting. But you’re right that even object code analysis can be necessary, e.g., to detect bugs introduced by GCC misfeatures. Thomas Reps, head of GrammaTech and a Wisconsin professor, gave a good talk on this at Stanford a couple of years ago: http://pages.cs.wisc.edu/~reps/#TOPLAS-WYSINWYX.
Does GCC/G++ have an option available to output analysis?
It would be useful to be able to compare a new build against the previous one: overall size, sizes of classes/structures, and so on. Each build's report could then be diff'd against the previous one, which could be useful for many purposes.
If no such output analysis is available, what is the best way to obtain such information?
GCCXML is a GCC variant that dumps symbol and type declaration data in an XML format. That may or may not include the "properties" you care about.
If you want specific information, you may be able to bend GCC to produce it. The learning curve for this is likely long and hard, because GCC wants to be a compiler, not a dumper of your favorite properties, and it is a big, complex tool. You probably have some chance with "struct size", as the compiler must compute that somewhere, and it seems reasonable that it would be stored with the information about the struct declaration. I'd expect some of the command-line switches to output related information, and you might consider trying to emulate those. If you want a really odd property, such as "register pressure inside a loop", you'll have to reach deep inside the compiler.
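For the specific case of struct/class sizes, though, a couple of existing hooks may get you there without modifying GCC (a hedged sketch; flag names have changed across GCC versions):

    g++ -fdump-class-hierarchy -c file.cpp   # C++ class layouts (renamed -fdump-lang-class in GCC 8+)
    gcc -g -c file.c && pahole file.o        # struct layout and padding recovered from DWARF debug info

Diffing such dumps between two builds would give the size comparison the question asks about.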
If you want general properties derivable from the source code you will want to use a language-processing framework that has a strong C front end integrated into it. Clang is one. It likely has a learning curve similar to that for GCC, but is said to be better designed for tasks like yours. I have no specific experience with this.
Our DMS Software Reengineering Toolkit is explicitly designed to support such tasks. It has a full C Front End, with APIs for building full parse trees, symbol tables relating identifiers to their points of declaration and actual types, and full control- and data-flow analysis. DMS also has a full C++ Front End with similar properties, but it does not yet provide flow-analysis information. DMS lets you write arbitrary code on top of this to compute whatever property you like.