Is this an intermediate representation? - v8

I'm looking into how the V8 compiler works. I read an article which states that source code is tokenized and parsed, an AST is constructed, and then bytecode is generated (https://medium.com/dailyjs/understanding-v8s-bytecode-317d46c94775).
Is this bytecode an intermediate representation?

Short answer: No. Usually people use the terms "bytecode" and "intermediate representation" to mean two different things.
Long answer: It depends a bit on your definition (but for most definitions, "no" is still the right answer).
"Bytecode" in virtual machines like V8 refers to a representation that is used as input for an interpreter. The article you linked to gives a good description.
"Intermediate representation" or IR usually refers to data that a compiler uses internally, as an intermediate step (hence the name) between its input (usually the AST = abstract syntax tree, i.e. parsed version of the source text) and its output (usually machine code or byte code, but it could be anything, as in a source-to-source compiler).
So in a traditional setup, you have:
source --(parser)--> AST --(compiler front-end)--> IR --(compiler back-end)--> machine code
where the IR is usually modified several times as the compiler performs various optimizations on it, before finally generating machine code from it. There can also be several different IRs; for example V8's earlier optimizing compiler ("Crankshaft") had two: high-level IR "Hydrogen" and low-level IR "Lithium", whereas V8's current optimizing compiler ("Turbofan") even has three: "JavaScript-level nodes", "Simplified nodes", and "Machine-level nodes".
Now if you wanted to draw the boxes in your whiteboard diagram of the system a little differently, then instead of having a "parser" and a "compiler" you could treat everything between source and machine code as one big "compiler" (which as a first step parses the source). In that case, the AST would be a form of intermediate representation. But, as stated above, usually when people use the term IR they mean "compiler IR", not the AST.
In a virtual machine like V8, the overall execution pipeline is more complicated than described above. It starts with:
source --(parser)--> AST --(bytecode generator)--> bytecode
This bytecode is primarily used as input for V8's interpreter.
As an optimization, when V8 decides to run a function through the optimizing compiler, it does not start with the source code and a parser again, but instead the optimizing compiler uses the bytecode as its input. In diagram form:
bytecode --(interpreter)--> program execution
bytecode --(compiler front-end)--> IR --(compiler back-end)--> machine code --(CPU)--> program execution
Now here's the part where your perspective comes in: since the bytecode in V8 is not only used as input for the interpreter, but also as input for the optimizing compiler and in that sense as a step on the way from source text to machine code, if you wanted to call it a special form of intermediate representation, you wouldn't technically be wrong. It would be an unusual definition of the term though. When a compiler theory textbook talks about "intermediate representation", it does not mean "bytecode".
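To make the distinction concrete, here is a toy sketch (in Go, and nothing like V8's actual register-based Ignition bytecode) of what "bytecode as input for an interpreter" means: a flat list of instructions consumed by a dispatch loop, as opposed to the graph-shaped IRs an optimizing compiler transforms. The linked article shows what V8's real bytecode looks like for small JavaScript functions.

package main

import "fmt"

// Toy opcodes for a stack machine; this only illustrates the shape of the data.
const (
	OpPush  = iota // push the operand onto the stack
	OpAdd          // pop two values, push their sum
	OpPrint        // pop and print the top of the stack
)

type instr struct {
	op      int
	operand int
}

// interpret is the dispatch loop: bytecode in, program behaviour out.
func interpret(code []instr) {
	var stack []int
	for _, ins := range code {
		switch ins.op {
		case OpPush:
			stack = append(stack, ins.operand)
		case OpAdd:
			a, b := stack[len(stack)-2], stack[len(stack)-1]
			stack = append(stack[:len(stack)-2], a+b)
		case OpPrint:
			fmt.Println(stack[len(stack)-1])
			stack = stack[:len(stack)-1]
		}
	}
}

func main() {
	// Roughly what a source line like "print(1 + 2)" becomes after
	// parsing and bytecode generation.
	interpret([]instr{{OpPush, 1}, {OpPush, 2}, {OpAdd, 0}, {OpPrint, 0}})
}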

Related

Compilers - stop in the middle of the process

I just started to learn about compilers, and I wondered - is there a way to tell GCC (or any other compiler) to stop in the middle of the process, and give me a file with the current state (for example - a token tree)?
In general, yes!
When building a compiler, often one of your most important debugging tools is to see the current representation of the program, or even better, the states throughout compilation, so that when a transformation goes wrong -- i.e. a compiler bug -- it can be identified.
Dumping all the state of a compiler can be a lot, however; even just the intermediate representation for a large function can be tens of thousands of lines of plain text. Then there are also all the associated data structures: control-flow graphs, dependency information, symbol tables, etc.
See this question for some more detail on answers for LLVM and GCC.
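For GCC specifically, a few flags stop the pipeline at intermediate points or dump internal state (a rough sketch; the exact set of dump files depends on the GCC version):

$ gcc -E hello.c -o hello.i        # stop after preprocessing
$ gcc -S hello.c -o hello.s        # stop after code generation, keep the assembly
$ gcc -c hello.c -o hello.o        # stop before linking, keep the object file
$ gcc -c -fdump-tree-all hello.c   # also dump the internal tree/GIMPLE IR after each pass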

How can a compiler be cross platform(hardware)?

I just realized that compilers convert source code into the binary of the destination platform. Kind of obvious... but if a compiler works that way, how can the same compiler be used for different systems like x86, ARM, MIPS, etc.?
Aren't they supposed to "know" the machine language of the hardware platform in order to build the binary? Does a compiler (like gcc) know the machine language of every single platform that it supports?
How is that system possible, and how can a compiler be optimized for that many platforms at the same time?
Yes, they have to "know" the machine language for every single platform they support. This is required to generate machine code. However, compilation is a multi-step process, and the first steps are usually common to most architectures.
Taken from Wikipedia:
Structure of a compiler

Compilers bridge source programs in high-level languages with the underlying hardware. A compiler requires determining the correctness of the syntax of programs, generating correct and efficient object code, run-time organization, and formatting output according to assembler and/or linker conventions. A compiler consists of three main parts: the frontend, the middle-end, and the backend.

The front end checks whether the program is correctly written in terms of the programming language syntax and semantics. Here legal and illegal programs are recognized. Errors are reported, if any, in a useful way. Type checking is also performed by collecting type information. The frontend then generates an intermediate representation or IR of the source code for processing by the middle-end.

The middle end is where optimization takes place. Typical transformations for optimization are removal of useless or unreachable code, discovery and propagation of constant values, relocation of computation to a less frequently executed place (e.g., out of a loop), or specialization of computation based on the context. The middle-end generates another IR for the following backend. Most optimization efforts are focused on this part.

The back end is responsible for translating the IR from the middle-end into assembly code. The target instruction(s) are chosen for each IR instruction. Register allocation assigns processor registers for the program variables where possible. The backend utilizes the hardware by figuring out how to keep parallel execution units busy, filling delay slots, and so on. Although most algorithms for optimization are in NP, heuristic techniques are well-developed.
More in this article, which describes the structure of a compiler, and in this one, which deals with cross compilers.
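As a toy illustration of that front-end/middle-end/back-end split (a sketch in Go; the "assembly" strings are made up, and this is not how GCC is organized internally), a single architecture-independent IR can be lowered by interchangeable back ends:

package main

import "fmt"

// A tiny three-address IR instruction: dst = lhs op rhs.
type irInstr struct {
	op, dst, lhs, rhs string
}

// backend maps one IR instruction to target-specific assembly text.
type backend func(irInstr) string

// x86-flavoured back end (illustrative syntax only).
func x86(i irInstr) string {
	return fmt.Sprintf("mov %s, %s\n%s %s, %s", i.dst, i.lhs, i.op, i.dst, i.rhs)
}

// ARM-flavoured back end (illustrative syntax only).
func arm(i irInstr) string {
	return fmt.Sprintf("%s %s, %s, %s", i.op, i.dst, i.lhs, i.rhs)
}

func emit(ir []irInstr, b backend) {
	for _, i := range ir {
		fmt.Println(b(i))
	}
}

func main() {
	// IR for "x = a + b", produced once by the architecture-independent part.
	ir := []irInstr{{op: "add", dst: "x", lhs: "a", rhs: "b"}}
	emit(ir, x86) // same IR, x86-flavoured output
	emit(ir, arm) // same IR, ARM-flavoured output
}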
The http://llvm.org/ project will answer all of your questions in this regard :)
In a nutshell, cross-hardware compilers emit an "intermediate representation" of the code, which is hardware-agnostic, and it is then lowered to native code by the target's toolchain.
Yes, it is possible; this is called a cross compiler. Compilers usually first generate object code, which is not runnable on the current machine but can be moved to the destination machine, where it is linked with the external libraries of the target machine.
TL;DR: Yes, the compiler knows the target machine language, but you can compile on different hardware.
I recommend reading the attached links for more information.
Every platform has its own toolchain; a toolchain includes gcc, gdb, ld, nm, etc.
Let's take gcc as a specific example. The GCC source code has many layers, including architecture-dependent and architecture-independent parts. The architecture-dependent part contains procedures to handle architecture-specific things such as the stack layout, function-call conventions, and floating-point operations. To target a specific architecture such as ARM, you build GCC as a cross compiler. You can see the steps here for reference: http://www.ailis.de/~k/archives/19-arm-cross-compiling-howto.html#toolchain.
This architecture-dependent part is responsible for handling machine-language operations.
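The Go toolchain makes the same idea easy to see in practice: one compiler ships back ends for many targets, and you pick the target with environment variables (hello.go is a placeholder):

$ GOOS=linux GOARCH=arm64 go build hello.go     # emits an ARM64 Linux executable
$ GOOS=windows GOARCH=amd64 go build hello.go   # emits an x86-64 Windows executable
$ go tool dist list                             # prints every supported OS/architecture pair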

Meaning of 'debugging symbols' with respect to GDB

Wikipedia says:
A debug symbol is information that expresses which programming-language constructs generated a specific piece of machine code in a given executable module.
Any examples of what kind of programming-language constructs are used for the purpose?
What is the meaning of "constructs" in this context? Functions?
The programming-language constructs referred to are things like if statements, while loops, assignment statements, and so on.
Debug symbols are usually data, stored in the executable or in separate files, that map the addresses of chunks of executable machine code to the original source file and line number they were generated from. This is what allows you to do things like put a breakpoint on an if statement and have the machine stop when execution reaches that particular bit of machine code.
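For example (hello.c and the line number are placeholders), compiling with -g embeds that mapping, and GDB uses it to turn a file:line location into a machine-code address:

$ gcc -g hello.c -o hello      # -g embeds debug symbols (DWARF on most platforms)
$ gdb ./hello
(gdb) break hello.c:10         # breakpoint on the machine code generated for line 10
(gdb) run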

What levels should static analyzers analyze?

I've noticed that some static analyzers operate on source code, while others operate on bytecode (e.g., FindBugs). I'm sure there are even some that work on object code.
My question is a simple one, what are the advantages and disadvantages of writing different kinds of static analyzers for different levels of analysis?
Under "static analyzers" I'm including linters, bug finders, and even full-blown verifiers.
And by levels of analysis I would include source code, high-level IRs, low-level IRs, bytecode, object code, and compiler plugins that have access to all phases.
These different facets can influence the level at which an analyzer may decide to work:
1. Designing a static analyzer is a lot of work. It would be a shame not to share that work across several languages compiled to the same bytecode, especially when the bytecode retains most of the structure of the source program: Java (FindBugs), .NET (various tools related to Code Contracts). In some cases, a common target language was made up for the purpose of analysis even though the compilation scheme didn't originally follow this path.
2. Related to 1, you may hope that your static analyzer will be a little less costly to write if it works on a normalized version of the program with a minimum number of constructs. When authoring static analyzers, having to write the treatment for "repeat ... until" when you have already written "while ... do" is a bother. You may structure your analyzer so that several functions are shared for these two cases, but the care-free way to handle this is to translate one into the other, or to translate the source to an intermediate language that only has one of them.
3. On the other hand, as already pointed out in Flash Sheridan's answer, source code contains the most information. For instance, in languages with fuzzy semantics, bugs at the source level may be removed by compilation. C and C++ have numerous "undefined behaviors" where the compiler is allowed to do anything, including generating a program that works accidentally. Fine, you might think: if the bug is not in the executable, it's not a problematic bug. But if you ever re-compile the program for another architecture or with the next version of the compiler, the bug may appear again. This is one reason not to do the analysis after any phase that might remove bugs.
4. Some properties can only be checked with reasonable precision on compiled code. That includes the absence of compiler-introduced bugs, as pointed out again by Flash Sheridan, but also worst-case execution time. Similarly, many languages do not let you know precisely what floating-point code does unless you look at the assembly generated by the compiler (this is because existing hardware does not make it convenient to guarantee more). The choice is then to write an imprecise source-level analyzer that takes all possibilities into account, or to analyze precisely one particular compilation of a floating-point program, as long as it is understood that it is that exact assembly code that will be executed.
Source code analysis is the most generally useful, of course; sometimes heuristics even need to analyze comments or formatting. But you’re right that even object code analysis can be necessary, e.g., to detect bugs introduced by GCC misfeatures. Thomas Reps, head of GrammaTech and a Wisconsin professor, gave a good talk on this at Stanford a couple of years ago: http://pages.cs.wisc.edu/~reps/#TOPLAS-WYSINWYX.
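To make "source-level" concrete, here is a minimal sketch of a source-level check written with Go's standard go/parser and go/ast packages (the file name and sample source are made up for illustration). It parses a source string and reports each function declaration, the kind of purely syntactic fact linters work from:

package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

const src = `package demo

func used()   { println("hi") }
func unused() {}
`

func main() {
	fset := token.NewFileSet()
	// Parse the source text into an AST; a source-level analyzer starts here.
	file, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		panic(err)
	}
	// Walk the AST and report every function declaration it contains.
	ast.Inspect(file, func(n ast.Node) bool {
		if fn, ok := n.(*ast.FuncDecl); ok {
			fmt.Printf("%s: function %s has %d statement(s)\n",
				fset.Position(fn.Pos()), fn.Name.Name, len(fn.Body.List))
		}
		return true
	})
}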

How does Go compile so quickly?

I've Googled and poked around the Go website, but I can't find an explanation for Go's extraordinary build times. Are they products of the language features (or lack thereof), a highly optimized compiler, or something else? I'm not trying to promote Go; I'm just curious.
Dependency analysis.
The Go FAQ used to contain the following sentence:
Go provides a model for software construction that makes dependency analysis easy and avoids much of the overhead of C-style include files and libraries.
While the phrase is not in the FAQ anymore, this topic is elaborated upon in the talk Go at Google, which compares the dependency analysis approach of C/C++ and Go.
That is the main reason for fast compilation. And this is by design.
I think it's not that Go compilers are fast, it's that other compilers are slow.
C and C++ compilers have to parse enormous amounts of headers - for example, compiling C++ "hello world" requires compiling 18k lines of code, which is almost half a megabyte of sources!
$ cpp hello.cpp | wc
18364 40513 433334
Java and C# compilers run in a VM, which means that before they can compile anything, the operating system has to load the whole VM, then they have to be JIT-compiled from bytecode to native code, all of which takes some time.
Speed of compilation depends on several factors.
Some languages are designed to be compiled fast. For example, Pascal was designed to be compiled using a single-pass compiler.
Compilers themselves can be optimized too. For example, the Turbo Pascal compiler was written in hand-optimized assembly, which, combined with the language design, resulted in a really fast compiler running on 286-class hardware. I think that even now, modern Pascal compilers (e.g. FreePascal) are faster than Go compilers.
There are multiple reasons why the Go compiler is much faster than most C/C++ compilers:
Top reason: most C/C++ compilers have designs that are exceptionally bad from a compilation-speed perspective. Also, some parts of the C/C++ ecosystem (such as the editors in which programmers write their code) aren't designed with speed of compilation in mind.
Top reason: Fast compilation speed was a conscious choice in the Go compiler and also in the Go language
The Go compiler has a simpler optimizer than C/C++ compilers
Unlike C++, Go has no templates and no inline functions. This means that Go doesn't need to perform any template or function instantiation.
The Go compiler generates low-level assembly code sooner and the optimizer works on the assembly code, while in a typical C/C++ compiler the optimization passes work on an internal representation of the original source code. The extra overhead in the C/C++ compiler comes from the fact that the internal representation needs to be generated.
Final linking (5l/6l/8l) of a Go program can be slower than linking a C/C++ program, because the Go linker goes through all of the used assembly code, and it may also be doing other extra work that C/C++ linkers aren't doing.
Some C/C++ compilers (GCC) generate instructions in text form (to be passed to the assembler), while the Go compiler generates instructions in binary form. Extra work (but not much) needs to be done in order to transform the text into binary.
The Go compiler targets only a small number of CPU architectures, while the GCC compiler targets a large number of CPUs
Compilers which were designed with the goal of high compilation speed, such as Jikes, are fast. On a 2GHz CPU, Jikes can compile 20000+ lines of Java code per second (and the incremental mode of compilation is even more efficient).
Compilation efficiency was a major design goal:
Finally, it is intended to be fast: it should take at most a few seconds to build a large executable on a single computer. To meet these goals required addressing a number of linguistic issues: an expressive but lightweight type system; concurrency and garbage collection; rigid dependency specification; and so on. FAQ
The language FAQ is pretty interesting in regards to specific language features relating to parsing:
Second, the language has been designed to be easy to analyze and can be parsed without a symbol table.
While most of the above is true, there is one very important point that was not really mentioned: dependency management.
Go only needs to include the packages that you are importing directly (as those have already imported what they need). This is in stark contrast to C/C++, where every single file starts by including x headers, which include y headers, and so on. Bottom line: Go's compilation takes time linear in the number of imported packages, whereas C/C++'s grows exponentially.
A good test for the translation efficiency of a compiler is self-compilation: how long does it take a given compiler to compile itself? For C++ it takes a very long time (hours?). By comparison, a Pascal/Modula-2/Oberon compiler would compile itself in less than one second on a modern machine [1].
Go has been inspired by these languages, but some of the main reasons for this efficiency include:
A clearly defined syntax that is mathematically sound, for efficient scanning and parsing.
A type-safe and statically-compiled language that uses separate compilation with dependency and type checking across module boundaries, to avoid unnecessary re-reading of header files and re-compiling of other modules - as opposed to independent compilation like in C/C++ where no such cross-module checks are performed by the compiler (hence the need to re-read all those header files over and over again, even for a simple one-line "hello world" program).
An efficient compiler implementation (e.g. single-pass, recursive-descent top-down parsing) - which of course is greatly helped by points 1 and 2 above.
These principles have already been known and fully implemented in the 1970s and 1980s in languages like Mesa, Ada, Modula-2/Oberon and several others, and are only now (in the 2010s) finding their way into modern languages like Go (Google), Swift (Apple), C# (Microsoft) and several others.
Let's hope that this will soon be the norm and not the exception. To get there, two things need to happen:
First, software platform providers such as Google, Microsoft and Apple should start by encouraging application developers to use the new compilation methodology, while enabling them to re-use their existing code base. This is what Apple is now trying to do with the Swift programming language, which can co-exist with Objective-C (since it uses the same runtime environment).
Second, the underlying software platforms themselves should eventually be re-written over time using these principles, while simultaneously redesigning the module hierarchy in the process to make them less monolithic. This is of course a mammoth task and may well take the better part of a decade (if they are courageous enough to actually do it - which I am not at all sure in the case of Google).
In any case, it's the platform that drives language adoption, and not the other way around.
References:
[1] http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.System.pdf, page 6: "The compiler compiles itself in about 3 seconds". This quote is for a low cost Xilinx Spartan-3 FPGA development board running at a clock frequency of 25 MHz and featuring 1 MByte of main memory. From this one can easily extrapolate to "less than 1 second" for a modern processor running at a clock frequency well above 1 GHz and several GBytes of main memory (i.e. several orders of magnitude more powerful than the Xilinx Spartan-3 FPGA board), even when taking I/O speeds into account. Already back in 1990 when Oberon was run on a 25MHz NS32X32 processor with 2-4 MBytes of main memory, the compiler compiled itself in just a few seconds. The notion of actually waiting for the compiler to finish a compilation cycle was completely unknown to Oberon programmers even back then. For typical programs, it always took more time to remove the finger from the mouse button that triggered the compile command than to wait for the compiler to complete the compilation just triggered. It was truly instant gratification, with near-zero wait times. And the quality of the produced code, even though not always completely on par with the best compilers available back then, was remarkably good for most tasks and quite acceptable in general.
Go was designed to be fast, and it shows.
Dependency management: no header files, you just need to look at the packages that are directly imported (no need to worry about what they import), so your dependencies are linear.
Grammar: the grammar of the language is simple, and thus easily parsed. The number of features is also reduced, so the compiler code itself is tight (few code paths).
No overload allowed: you see a symbol, you know which method it refers to.
It's trivially possible to compile Go in parallel because each package can be compiled independently.
Note that Go isn't the only language with such features (modules are the norm in modern languages), but they did it well.
Quoting from the book "The Go Programming Language" by Alan Donovan and Brian Kernighan:
Go compilation is notably faster than most other compiled languages, even when building from scratch. There are three main reasons for the compiler’s speed. First, all imports must be explicitly listed at the beginning of each source file, so the compiler does not have to read and process an entire file to determine its dependencies. Second, the dependencies of a package form a directed acyclic graph, and because there are no cycles, packages can be compiled separately and perhaps in parallel. Finally, the object file for a compiled Go package records export information not just for the package itself, but for its dependencies too. When compiling a package, the compiler must read one object file for each import but need not look beyond these files.
The basic idea of compilation is actually very simple. A recursive-descent parser, in principle, can run at I/O bound speed. Code generation is basically a very simple process. A symbol table and basic type system is not something that requires a lot of computation.
However, it is not hard to slow down a compiler.
If there is a preprocessor phase, with multi-level include directives, macro definitions, and conditional compilation, as useful as those things are, it is not hard to load it down. (For one example, I'm thinking of the Windows and MFC header files.) That is why precompiled headers are necessary.
In terms of optimizing the generated code, there is no limit to how much processing can be added to that phase.
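To make the earlier point about recursive descent concrete, here is a toy sketch (no real compiler's code) that parses and evaluates expressions like 1+2*3 in a single left-to-right pass, with no separate lexer and no backtracking:

package main

import "fmt"

type parser struct {
	in  string
	pos int
}

// peek returns the next input byte, or 0 at end of input.
func (p *parser) peek() byte {
	if p.pos < len(p.in) {
		return p.in[p.pos]
	}
	return 0
}

// expr = term { '+' term }
func (p *parser) expr() int {
	v := p.term()
	for p.peek() == '+' {
		p.pos++
		v += p.term()
	}
	return v
}

// term = digit { '*' digit }
func (p *parser) term() int {
	v := p.digit()
	for p.peek() == '*' {
		p.pos++
		v *= p.digit()
	}
	return v
}

// digit = '0'..'9' (single digit, to keep the sketch short)
func (p *parser) digit() int {
	c := p.peek()
	p.pos++
	return int(c - '0')
}

func main() {
	p := &parser{in: "1+2*3"}
	fmt.Println(p.expr()) // prints 7
}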
Simply (in my own words): because the syntax is very easy to analyze and to parse.
For instance, having no type inheritance means there is no complicated analysis to find out whether a new type follows the rules imposed by its base type.
For instance, with interfaces the compiler doesn't go and check whether a type implements a given interface while analyzing that type; only when (and if) the interface is used is the check performed.
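A minimal sketch of that behaviour (the type and interface names are made up):

package main

// Stringer requires a String method.
type Stringer interface {
	String() string
}

// T does not implement Stringer, yet this file compiles as-is.
type T struct{}

func main() {
	t := T{}
	_ = t
	// The check happens only where the interface is actually used;
	// uncommenting the next lines produces a compile error:
	// var s Stringer = T{} // T does not implement Stringer (missing method String)
	// _ = s
}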
Another example: the compiler tells you if you declare a variable and don't use it (or if you are supposed to capture a return value and you don't).
The following doesn't compile:
package main
func main() {
	var a int
	a = 0
}
notused.go:3: a declared and not used
These kinds of enforcements and principles make the resulting code safer, and the compiler doesn't have to perform extra validations that the programmer can do.
By and large, all these details make the language easier to parse, which results in fast compilation.
Again, in my own words.
Go imports dependencies once for all files, so the import time doesn't increase exponentially with project size.
A simpler language means less computation is needed to analyze it.
What else?
