How does a compiler build the syntax tree? - compiler-theory

At which point does the compiler build the syntax tree? How does it form the tree, and how does it translate the tree while building the executable?

A compiler that builds a syntax tree does so during the parsing step, typically by generating a tree node for each grammar rule that matches the input stream.
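As a rough illustration of "one tree node per grammar rule", here is a minimal sketch (mine, not part of the original answer) of a recursive-descent parser in Python for a toy expression grammar; the Node and Parser classes and the grammar itself are invented for the example:

# A minimal sketch, assuming a toy grammar:
#   expr -> term (('+' | '-') term)*
#   term -> NUMBER | NAME
import re

def tokenize(src):
    # Tiny tokenizer: numbers, names, and single-character operators.
    return re.findall(r"\d+|\w+|[-+*/=()]", src)

class Node:
    def __init__(self, kind, value, children=()):
        self.kind, self.value, self.children = kind, value, list(children)
    def __repr__(self):
        return f"Node({self.kind!r}, {self.value!r}, {self.children})"

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_expr(self):
        # Each match of the 'expr' rule produces a new tree node.
        node = self.parse_term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = Node("binop", op, [node, self.parse_term()])
        return node

    def parse_term(self):
        tok = self.advance()
        return Node("num" if tok.isdigit() else "name", tok)

print(Parser(tokenize("alpha + 2 - beta")).parse_expr())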
Code generation requires considerable analysis of the tree to understand types, operations, opportunities for optimization, etc. Often this is hard to do well on the tree directly, so other intermediate representations are used (triples, static single assignment, ...). Often even the intermediate stages are inappropriate for machine code generation, so some kind of representation of machine instructions (e.g. RTL) might be constructed as well.
The point is that trees aren't the only representation the compiler uses to generate code.
It is well worth your trouble to read an introductory compiler textbook (e.g. Aho, Sethi, and Ullman, "Compilers: Principles, Techniques, and Tools") to get more details.

Related

Is this an intermediate representation?

I'm looking into how the V8 compiler works. I read an article which states that source code is tokenized, parsed, an AST is constructed, and then bytecode is generated (https://medium.com/dailyjs/understanding-v8s-bytecode-317d46c94775)
Is this bytecode an intermediate representation?
Short answer: No. Usually people use the terms "bytecode" and "intermediate representation" to mean two different things.
Long answer: It depends a bit on your definition (but for most definitions, "no" is still the right answer).
"Bytecode" in virtual machines like V8 refers to a representation that is used as input for an interpreter. The article you linked to gives a good description.
"Intermediate representation" or IR usually refers to data that a compiler uses internally, as an intermediate step (hence the name) between its input (usually the AST = abstract syntax tree, i.e. parsed version of the source text) and its output (usually machine code or byte code, but it could be anything, as in a source-to-source compiler).
So in a traditional setup, you have:
source --(parser)--> AST --(compiler front-end)--> IR --(compiler back-end)--> machine code
where the IR is usually modified several times as the compiler performs various optimizations on it, before finally generating machine code from it. There can also be several different IRs; for example V8's earlier optimizing compiler ("Crankshaft") had two: high-level IR "Hydrogen" and low-level IR "Lithium", whereas V8's current optimizing compiler ("Turbofan") even has three: "JavaScript-level nodes", "Simplified nodes", and "Machine-level nodes".
Now if you wanted to draw the boxes in your whiteboard diagram of the system a little differently, then instead of having a "parser" and a "compiler" you could treat everything between source and machine code as one big "compiler" (which as a first step parses the source). In that case, the AST would be a form of intermediate representation. But, as stated above, usually when people use the term IR they mean "compiler IR", not the AST.
In a virtual machine like V8, the overall execution pipeline is more complicated than described above. It starts with:
source --(parser)--> AST --(bytecode generator)--> bytecode
This bytecode is primarily used as input for V8's interpreter.
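To make "input for an interpreter" concrete, here is a minimal sketch of a toy stack-machine interpreter in Python. The opcodes are invented for illustration and are not V8's actual instruction set (V8's real bytecode is register-based, as the linked article describes):

# Toy stack-machine interpreter; opcodes are invented for illustration.
def interpret(bytecode, env):
    stack = []
    for op, arg in bytecode:
        if op == "LOAD_CONST":
            stack.append(arg)
        elif op == "LOAD_NAME":
            stack.append(env[arg])
        elif op == "DIV":
            b, a = stack.pop(), stack.pop()
            stack.append(a / b)
        elif op == "STORE_NAME":
            env[arg] = stack.pop()
    return env

# Bytecode for: delta = alpha / beta
program = [("LOAD_NAME", "alpha"), ("LOAD_NAME", "beta"),
           ("DIV", None), ("STORE_NAME", "delta")]
print(interpret(program, {"alpha": 6, "beta": 2}))
# {'alpha': 6, 'beta': 2, 'delta': 3.0}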
As an optimization, when V8 decides to run a function through the optimizing compiler, it does not start with the source code and a parser again, but instead the optimizing compiler uses the bytecode as its input. In diagram form:
bytecode --(interpreter)--> program execution
bytecode --(compiler front-end)--> IR --(compiler back-end)--> machine code --(CPU)--> program execution
Now here's the part where your perspective comes in: the bytecode in V8 is not only used as input for the interpreter, but also as input for the optimizing compiler, and in that sense it is a step on the way from source text to machine code. So if you wanted to call it a special form of intermediate representation, you wouldn't technically be wrong. It would be an unusual definition of the term, though: when a compiler theory textbook talks about "intermediate representation", it does not mean bytecode.

How to apply a Grammatical Evolution string to a solution

I am learning Grammatical Evolution, but one thing that I can't seem to grasp is how to use the strings that are evolved from the grammar to solve an actual problem. Is the string converted into a neural network, or into an equation, or something else? How does it receive inputs and print outputs?
Grammatical Evolution (GE) makes a distinction between genotype and phenotype (the genotype–phenotype distinction), which means an evolved genotype is not a solution by itself, but it maps to a solution.
Mutations and crossover are performed on genotypes, but to evaluate fitness, a genotype must first be transformed into a phenotype. In Grammatical Evolution this means generating a string conforming to the chosen grammar. This solution string should then be executed, and the result of the execution evaluated to estimate the fitness of the solution.
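For concreteness, here is a minimal sketch of the standard GE genotype-to-phenotype mapping in Python: each codon, taken modulo the number of productions for the current non-terminal, selects which production to expand next. The grammar and the example genotype are invented for illustration:

# Minimal sketch of GE's mod-rule mapping; grammar and genotype are
# invented for the example.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["x"], ["1"]],
    "<op>":   [["+"], ["*"]],
}

def map_genotype(genotype, start="<expr>", max_wraps=2):
    codons = list(genotype) * (max_wraps + 1)  # allow wrapping
    result, i = [start], 0
    while any(sym in GRAMMAR for sym in result):
        if i >= len(codons):
            raise ValueError("ran out of codons (invalid individual)")
        # Expand the leftmost non-terminal.
        idx = next(j for j, s in enumerate(result) if s in GRAMMAR)
        productions = GRAMMAR[result[idx]]
        chosen = productions[codons[i] % len(productions)]
        result[idx:idx + 1] = chosen
        i += 1
    return " ".join(result)

print(map_genotype([0, 1, 0, 2, 1]))  # -> "x + 1", a string conforming to the grammar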
How to actually execute the generated solution?
It highly depends on the implementation of a GE system.
If it generates solutions in some real programming language, they should be compiled and/or executed with the corresponding toolchain, run with some test input, and the output evaluated to estimate the fitness.
If a GE system is able to execute a solution internally, no external toolchain is involved. It might be convenient to generate a syntax tree-like structure according to the grammar (instead of unstructured text), because it's quite easy to execute such a structure.
How to execute a syntax tree?
There exists an entire class of so-called tree-walk interpreters: not super performant, but reasonably simple to implement. Usually such an interpreter first parses a source text and builds a syntax tree, then executes it; but in a GE system it is possible to generate a syntax tree directly, so no parsing is involved.
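A minimal sketch of such a tree-walk interpreter in Python, assuming invented tuple-shaped nodes (in a GE system these trees would be produced directly from the grammar rather than parsed from text):

# Each node kind is evaluated recursively; node shapes are invented
# for the example.
def evaluate(node, env):
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "var":
        return env[node[1]]
    if kind == "add":
        return evaluate(node[1], env) + evaluate(node[2], env)
    if kind == "mul":
        return evaluate(node[1], env) * evaluate(node[2], env)
    raise ValueError(f"unknown node kind: {kind}")

# Tree for: x * (x + 1), built directly rather than parsed.
tree = ("mul", ("var", "x"), ("add", ("var", "x"), ("num", 1)))
print(evaluate(tree, {"x": 4}))  # 20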
I can suggest the "A Tree-Walk Interpreter" part of the freely available book "Crafting Interpreters" as a good example of constructing such an interpreter.

What algorithm do I use to resolve the compilation order for a given set of files?

I'm trying to implement a small problem to better understand what we're covering in my compilers class. The problem is as follows: assume I have a bunch of files to compile:
a depends on nothing
b depends on c
c depends on f
d depends on a
e depends on b
f depends on nothing
So in this case a compilation order for the files to be successfully compiled is a, f, c, b, d, e. I want to write my own algorithm to output the desired compilation order, just as an exercise. I know the linker does it automatically in C++ etc., but this is just a personal exercise. How can I go about solving this problem? Any references to algorithms/readings would be much appreciated, since I'm fairly new.
Based on the comment from @ajb, a quick Google search brings up the Wikipedia article on topological sorting. However, it seems to me that if you're going to go to the trouble of building a graph to represent the problem, there's a really easy way to do this.
First, make a node for each file you're compiling. Then add an edge from each node to each of its dependencies, and an edge to a special root node if the file has no dependencies. Once that is done, reverse the edges and emit the files in breadth-first order from that root node, with one caveat: a file must not be emitted until all of its dependencies have been emitted, which is exactly what Kahn's topological-sort algorithm ensures by tracking the remaining in-degree of each node.
If you need to worry about circular dependencies or any of that jazz, then it gets more complicated, but it's still doable: a topological sort fails to order all nodes exactly when there is a cycle, as the sketch below shows.
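Here is a minimal sketch of Kahn's algorithm in Python, using the dependency list from the question. Any valid topological order is an acceptable compilation order, so the output may legitimately differ from the one given in the question:

# Sketch of Kahn's algorithm; deps[x] lists the files x depends on.
from collections import deque

deps = {"a": [], "b": ["c"], "c": ["f"], "d": ["a"], "e": ["b"], "f": []}

def compile_order(deps):
    # in_degree[x] = number of yet-uncompiled dependencies of x
    in_degree = {f: len(d) for f, d in deps.items()}
    dependents = {f: [] for f in deps}
    for f, d in deps.items():
        for dep in d:
            dependents[dep].append(f)
    queue = deque(sorted(f for f, n in in_degree.items() if n == 0))
    order = []
    while queue:
        f = queue.popleft()
        order.append(f)
        for g in dependents[f]:
            in_degree[g] -= 1
            if in_degree[g] == 0:
                queue.append(g)
    if len(order) != len(deps):
        raise ValueError("circular dependency detected")
    return order

print(compile_order(deps))  # ['a', 'f', 'd', 'c', 'b', 'e'], a valid order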
Since you're asking for literature, there is a book called Data Structures and Algorithms in C++ that goes over all kinds of data structures and algorithms (what a surprise!) including graph algorithms in chapter 13.

Big-O for a compiler [closed]

Does anyone have insight into the typical big-O complexity of a compiler?
I know it must be at least O(n) (where n is the number of lines in the program), because it needs to scan each line at least once.
I believe it must also be at least O(n log n) for a procedural language, because the program can introduce O(n) variables, functions, procedures, types, etc., and when these are referenced within the program it takes O(log n) to look up each reference.
Beyond that my very informal understanding of compiler architecture has reached its limits and I am not sure if forward declarations, recursion, functional languages, and/or other tricks will increase the algorithmic complexity of the compiler.
So, in summary:
For a 'typical' procedural language (C, Pascal, C#, etc.), is there a limiting big-O for an efficiently designed compiler (as a function of the number of lines)?
For a 'typical' functional language (Lisp, Haskell, etc.), is there a limiting big-O for an efficiently designed compiler (as a function of the number of lines)?
This question is unanswerable in its current form. The complexity of a compiler certainly wouldn't be measured in lines of code or characters in the source file; that would describe only the complexity of the parser or lexer, since no other part of the compiler ever touches that file.
After parsing, everything is expressed in terms of various ASTs representing the source file in a more structured manner. A compiler may have many intermediate languages, each with its own AST. The complexity of each phase is expressed in terms of the size of its own AST, which doesn't necessarily correlate with the character count or even with the size of the previous AST.
Consider this: we can parse most languages in time linear in the number of characters and generate some AST. Simple operations such as type checking are generally O(n) for a tree with n leaves. But then we translate this AST into a form with potentially double, triple, or even exponentially more nodes than the original tree. Now we again run single-pass optimizations on our tree, but this might be O(2^n) relative to the original AST, and lord knows what relative to the character count!
I think you're going to find it quite impossible even to decide what n should be in some complexity f(n) for a compiler.
As a nail in the coffin, compilation of some languages is undecidable, including Java, C#, and Scala (it turns out that nominal subtyping plus variance leads to undecidable typechecking). Of course C++'s template system is Turing complete, which makes deciding whether compilation terminates equivalent to the halting problem (undecidable). Haskell with some extensions is undecidable too, and there are many others I can't think of off the top of my head. There is no worst-case complexity for these languages' compilers.
Reaching back to what I can remember from my compilers class... some of the details here may be a bit off, but the general gist should be pretty much correct.
Most compilers actually have multiple phases that they go through, so it'd be useful to narrow down the question somewhat. For example, the code is usually run through a tokenizer that pretty much just creates objects to represent the smallest possible units of text. var x = 1; would be split into tokens for the var keyword, a name, an assignment operator, and a literal number, followed by a statement finalizer (';'). Braces, parentheses, etc. each have their own token type.
The tokenizing phase is roughly O(n), though this can be complicated in languages where keywords can be contextual. For example, in C#, words like from and yield can be keywords, but they could also be used as variables, depending on what's around them. So depending on how much of that sort of thing you have going on in the language, and depending on the specific code that's being compiled, just this first phase could conceivably have O(n²) complexity. (Though that would be highly uncommon in practice.)
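A minimal sketch of such a single-pass tokenizer in Python (the token names and keyword handling are invented for the example; contextual keywords like C#'s from/yield would need lookahead that this sketch omits):

# Single linear scan over the characters, hence roughly O(n).
import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME",   r"[A-Za-z_]\w*"),   # 'var' is reclassified afterwards
    ("ASSIGN", r"="),
    ("END",    r";"),
    ("SKIP",   r"\s+"),
]
KEYWORDS = {"var"}
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(code):
    tokens = []
    for m in MASTER.finditer(code):
        kind, text = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "NAME" and text in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, text))
    return tokens

print(tokenize("var x = 1;"))
# [('KEYWORD', 'var'), ('NAME', 'x'), ('ASSIGN', '='), ('NUMBER', '1'), ('END', ';')]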
After tokenizing comes the parsing phase, where you try to match up opening/closing brackets (or the equivalent indentations in some languages), statement finalizers, and so forth, and try to make sense of the tokens. This is where you need to determine whether a given name represents a particular method, type, or variable. A wise use of data structures to track what names have been declared within various scopes can make this task pretty much O(n) in most cases, but again there are exceptions.
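A minimal sketch of the kind of data structure meant here: a scope-chain symbol table in Python, where each declaration and lookup is a hash-table operation, O(1) on average (the class and method names are invented for the example):

# A chain of dicts, one per lexical scope; innermost scope is scopes[-1].
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]
    def enter_scope(self):
        self.scopes.append({})
    def exit_scope(self):
        self.scopes.pop()
    def declare(self, name, info):
        self.scopes[-1][name] = info
    def lookup(self, name):
        # Search from the innermost scope outwards.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"undeclared name: {name}")

table = SymbolTable()
table.declare("x", "int")
table.enter_scope()
table.declare("x", "string")   # shadows the outer x
print(table.lookup("x"))       # string
table.exit_scope()
print(table.lookup("x"))       # int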
In one video I saw, Eric Lippert said that correct C# code can be compiled in the time between a user's keystrokes. But if you want to provide meaningful error and warning messages, then the compiler has to do a great deal more work.
After parsing, there can be a number of extra phases including optimizations, conversion to an intermediate format (like byte code), conversion to binary code, just-in-time compilation (and extra optimizations that can be applied at that point), etc. All of these can be relatively fast (probably O(n) most of the time), but it's such a complex topic that it's hard to answer the question even for a single language, and practically impossible to answer it for a genre of languages.
As far as I know, it depends on the type of parser the compiler uses in its parsing step.
The main types of parsers are LL and LR, and they have different complexities, although both parse the grammars they accept in linear time; it is the fully general parsing algorithms, such as Earley's or CYK, that are super-linear (up to O(n³)).

How to generate code from an AST parsed from a fictional language?

I read the article at http://parsingintro.sourceforge.net/ and decided to try to rewrite it as an exercise in Ruby. Two reasons made me do this: I wanted to learn more about how to write Ruby (my background is in Java, PHP, C, and some Python), and I wanted to learn more about parsers and compilers.
I have all the code posted at https://github.com/parse/boatcaptain. The AST is being generated; unfortunately the author of the article doesn't get into concepts such as code generation and optimization.
Can anyone help me by pointing me in the right direction on how to turn this AST into "code"? This is the AST that is generated.
I wrote a calculator in Java a few years ago; it uses much of the same terminology and many of the same techniques as this parser. But in the calculator I had methods for eval()-ing my "classes" and thereby getting output. Should I aim for something similar here? Source for the calculator: https://github.com/parse/Uppsala-University-Courses/blob/master/ImpOOP-Calculator/src/Calculator.java
I would love feedback on my way of writing Ruby as well, I believe I still write Ruby like I would write Python, missing some nice advantages of Ruby.
Code generation in its most basic form is simply traversing your intermediate form (the AST) and emitting corresponding instructions in your target language.
Firstly you'll need to choose a target language. What platform do you want your input file to run on? The main options open to you are:
A source-to-source translator
A compiler to native code
A compiler to bytecode (to be run just-in-time on a VM)
The choice of target language determines how much work you'll have to put in to map between the languages. Mapping object-oriented classes down to assembly could be tricky, for example. Mapping inherently procedural code to stack-based code could also prove a challenge.
Whichever language you choose, the problem will no doubt boil down to the following procedure: visit the nodes of your tree and, depending on their type, emit the corresponding instructions.
Say you come across the following node in your AST (as in the one you linked to):
      =
     / \
delta   /
       / \
  alpha   beta
Seeing as it's an 'assignment' node, the code generator knows it has to evaluate the RHS of the tree before sticking that value into the LHS, 'delta'. So we follow the RHS node down and see it is a division operation. We then know we have to evaluate both the LHS and RHS of this node before dividing them and sticking the result in 'delta'.
So now we move down the LHS, see it's a variable, and we emit a 'load' instruction. We go back up and then down the RHS, and likewise emit a 'load' for 'beta'. We then walk back up the tree (taking both alpha and beta with us), emit the divide instruction on both the operands, store that result, pass it up the tree to the assignment emitter, and let that store it in 'delta'.
So the resulting code for this snippet might be:
load alpha
load beta
tmp = div alpha beta
store delta tmp
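A minimal sketch of such a tree-walking code generator (in Python rather than Ruby, but the structure carries over directly; the tuple-shaped nodes and instruction names are invented to match the listing above):

# Walks the AST and emits one instruction list; node shapes are invented.
def generate(ast):
    out, counter = [], iter(range(1, 1000))
    def gen(node):
        kind = node[0]
        if kind == "var":                  # e.g. ('var', 'alpha')
            out.append(f"load {node[1]}")
            return node[1]
        if kind == "div":                  # ('div', lhs, rhs)
            a, b = gen(node[1]), gen(node[2])
            tmp = f"tmp{next(counter)}"
            out.append(f"{tmp} = div {a} {b}")
            return tmp
        if kind == "assign":               # ('assign', 'delta', rhs)
            out.append(f"store {node[1]} {gen(node[2])}")
            return node[1]
        raise ValueError(f"unknown node kind: {kind}")
    gen(ast)
    return out

ast = ("assign", "delta", ("div", ("var", "alpha"), ("var", "beta")))
print("\n".join(generate(ast)))
# load alpha
# load beta
# tmp1 = div alpha beta
# store delta tmp1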
As for pre-existing Ruby Code Generator libraries, I'm not aware of any, sorry. I hope this answer wasn't too general or simplistic for you.
