Wikipedia says:
A debug symbol is information that expresses which programming-language constructs generated a specific piece of machine code in a given executable module.
Any examples of what kind of programming-language constructs are used for the purpose?
What is the meaning of "constructs" in this context? Functions?
The programming language constructs referred to are things like if statements, while loops, assignment statements, and so on.
Debug symbols are usually files that map the addresses of chunks of executable machine code to the original source file and line number they represent. This is what allows you to put a breakpoint on an if statement and have the machine stop when execution reaches that particular bit of machine code.
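To make the address-to-line mapping concrete, here is a minimal sketch. The table contents, file names, and the helper names `address_to_source` and `breakpoint_addresses` are all made up for illustration; real debug formats encode this information far more compactly.

```python
import bisect

# Hypothetical line table: (start_address, source_file, line_number),
# sorted by start address. Each entry covers the addresses up to the next entry.
LINE_TABLE = [
    (0x1000, "main.c", 10),
    (0x1008, "main.c", 11),
    (0x1014, "main.c", 13),
    (0x1020, "util.c", 4),
]

_STARTS = [entry[0] for entry in LINE_TABLE]


def address_to_source(address):
    """Return (file, line) for the entry covering address, or None if it lies below the table."""
    i = bisect.bisect_right(_STARTS, address) - 1
    if i < 0:
        return None
    _, filename, line = LINE_TABLE[i]
    return filename, line


def breakpoint_addresses(filename, line):
    """Return every start address that was generated from the given source line."""
    return [addr for addr, f, ln in LINE_TABLE if f == filename and ln == line]


print(address_to_source(0x100A))
print(breakpoint_addresses("main.c", 13))
```

A debugger uses the first direction to show where a stopped program is, and the second direction to decide where to plant a breakpoint for a given source line.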
I'm looking into how the V8 compiler works. I read an article which states that the source code is tokenized and parsed, an AST is constructed, and then bytecode is generated (https://medium.com/dailyjs/understanding-v8s-bytecode-317d46c94775)
Is this bytecode an intermediate representation?
Short answer: No. Usually people use the terms "bytecode" and "intermediate representation" to mean two different things.
Long answer: It depends a bit on your definition (but for most definitions, "no" is still the right answer).
"Bytecode" in virtual machines like V8 refers to a representation that is used as input for an interpreter. The article you linked to gives a good description.
"Intermediate representation" or IR usually refers to data that a compiler uses internally, as an intermediate step (hence the name) between its input (usually the AST = abstract syntax tree, i.e. parsed version of the source text) and its output (usually machine code or byte code, but it could be anything, as in a source-to-source compiler).
So in a traditional setup, you have:
source --(parser)--> AST --(compiler front-end)--> IR --(compiler back-end)--> machine code
where the IR is usually modified several times as the compiler performs various optimizations on it, before finally generating machine code from it. There can also be several different IRs; for example V8's earlier optimizing compiler ("Crankshaft") had two: high-level IR "Hydrogen" and low-level IR "Lithium", whereas V8's current optimizing compiler ("Turbofan") even has three: "JavaScript-level nodes", "Simplified nodes", and "Machine-level nodes".
Now if you wanted to draw the boxes in your whiteboard diagram of the system a little differently, then instead of having a "parser" and a "compiler" you could treat everything between source and machine code as one big "compiler" (which as a first step parses the source). In that case, the AST would be a form of intermediate representation. But, as stated above, usually when people use the term IR they mean "compiler IR", not the AST.
In a virtual machine like V8, the overall execution pipeline is more complicated than described above. It starts with:
source --(parser)--> AST --(bytecode generator)--> bytecode
This bytecode is primarily used as input for V8's interpreter.
As an optimization, when V8 decides to run a function through the optimizing compiler, it does not start with the source code and a parser again, but instead the optimizing compiler uses the bytecode as its input. In diagram form:
bytecode --(interpreter)--> program execution
bytecode --(compiler front-end)--> IR --(compiler back-end)--> machine code --(CPU)--> program execution
Now here's the part where your perspective comes in: since the bytecode in V8 is not only used as input for the interpreter, but also as input for the optimizing compiler and in that sense as a step on the way from source text to machine code, if you wanted to call it a special form of intermediate representation, you wouldn't technically be wrong. It would be an unusual definition of the term though. When a compiler theory textbook talks about "intermediate representation", it does not mean "bytecode".
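As an analogy only, the source --(parser)--> AST --(bytecode generator)--> bytecode steps can be observed in CPython with standard-library calls. This is not V8, and V8's bytecode format is different; the snippet only illustrates the shape of such a pipeline.

```python
import ast
import dis

source = "x = 1 + 2\nprint(x)\n"

# source --(parser)--> AST
tree = ast.parse(source)

# AST --(bytecode generator)--> bytecode
code = compile(tree, "<example>", "exec")

print(type(tree).__name__)   # the AST root node type
print(len(code.co_code))     # size of the raw bytecode in bytes
dis.dis(code)                # human-readable listing of the bytecode

# bytecode --(interpreter)--> program execution
exec(code)
```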
Let's say I am to design a JIT interpreter that translates IL or bytecode to executable instructions at runtime. Every time a variable name is encountered in the code, the JIT interpreter has to translate that into the respective memory address, right?
What technique do JIT interpreters use in order to resolve variable references in a performant enough manner? Do they use hashing, are the variables compiled to addresses ahead of time, or am I missing something altogether?
There is a huge variety of possible answers to this question, just as there are a huge variety of answers to how to design a JIT in general.
But to take one example, consider the JVM. Java bytecode actually does not contain variable names at all, except for debugging/reflection metadata. Instead, the compiler assigns each variable an "index" from 0 to 65535 and bytecode instructions use that index. However, the VM is free to make further optimizations if it wants to. For example, it may convert everything into SSA form and then compile it into machine code, in which case variables will end up being turned into machine-registers or fixed offsets in the stack frame or optimized away entirely.
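The idea of replacing names with indices ahead of time can be sketched as follows. The instruction set and the functions `compile_names` and `run` are hypothetical and far simpler than JVM bytecode; the point is only that name lookup happens once, during compilation, and the evaluator touches nothing but numeric slots.

```python
def compile_names(program):
    """Rewrite ("load"/"store", name) instructions to use numeric slot indices."""
    slots = {}
    compiled = []
    for op, arg in program:
        if op in ("load", "store"):
            # First use of a name allocates the next free slot.
            arg = slots.setdefault(arg, len(slots))
        compiled.append((op, arg))
    return compiled, slots


def run(compiled, nslots):
    """Evaluate the compiled program; variables live in a fixed-size frame."""
    frame = [None] * nslots
    stack = []
    for op, arg in compiled:
        if op == "push":
            stack.append(arg)
        elif op == "store":
            frame[arg] = stack.pop()
        elif op == "load":
            stack.append(frame[arg])
        elif op == "add":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
    return frame


# x = 5; y = x + x
program = [
    ("push", 5), ("store", "x"),
    ("load", "x"), ("load", "x"), ("add", None), ("store", "y"),
]
compiled, slots = compile_names(program)
frame = run(compiled, len(slots))
print(slots, frame)
```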
Consider another example: CPython. Python actually maintains variable names at runtime, due to its high-level, flexible nature. However, the interpreter still performs a few optimizations. For example, classes with a __slots__ attribute will allocate a fixed size array for the fields, and use a name -> index hashmap for dynamic lookups. I am not familiar with the implementation, but I think it does something similar with local variables. Note that normal local variable accesses (not using reflection) can be converted to a fixed offset at "compile" time.
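For local variables this can be checked directly in CPython: the compiler records the local names in the code object's `co_varnames` tuple, and the bytecode refers to locals by position through the `*_FAST` family of instructions. The function `add` below is just an example; the exact opcode names vary between CPython versions.

```python
import dis


def add(a, b):
    total = a + b
    return total


# Local names are kept as metadata; a local's position in this tuple is its slot.
print(add.__code__.co_varnames)

# Instructions that touch locals by slot index rather than by name lookup.
fast_ops = [ins.opname for ins in dis.get_instructions(add) if "FAST" in ins.opname]
print(fast_ops)
```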
So in short, the answer to
Do they use hashing, are the variables compiled to addresses ahead of time, or am I missing something altogether?
is yes.
Supposedly Forth programs can be "compiled" but I don't see how that is true if they have words that are only evaluated at runtime. For example, there is the word DOES> which stores words for evaluation at runtime. If those words include an EVALUATE or INTERPRET word then there will be a runtime need for the dictionary.
To support such statements it would mean the entire word list (dictionary) would have to be embedded inside the program, essentially what interpreted programs (not compiled programs) do.
This would seem to prevent you from compiling small programs using Forth because the entire dictionary would have to be embedded in the program, even if you used only a fraction of the words in the dictionary.
Is this correct, or is there some way to compile Forth programs without embedding the dictionary? (maybe by not using runtime words at all ??)
Forth programs can be compiled with or without word headers. The headers include the word names (called "name space").
In the scenario you describe, where the program may include run-time evaluation calls such as EVALUATE, the headers will be needed.
The dictionary can be divided into three logically distinct parts: name space, code space, and data space. Code and data are needed for program execution, names are usually not.
A normal Forth program will usually not do any runtime evaluation. So in most cases, the names aren't needed in a compiled program.
The code after DOES> is compiled, so it's not evaluated at run time. It's executed at run time.
Even though names are included, they usually don't add much to program size.
Many Forths do have a way to leave out the names from a program. Some have a switch to remove word headers (the names). Others have cross compilers which keep the names in the host system during compile time, but generate target code without names.
No, the entire dictionary need not be embedded, nor compiled. All that need remain is the list of words used and their parent words (and grandparents, etc.). Even the names of the words aren't necessary; the word locations are enough. Forth code compiled by such methods can be about as compact as it gets, rivaling or even surpassing assembly language in executable size.
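Keeping only the words used, plus their parents, grandparents, and so on, amounts to a reachability walk over the word definitions. A minimal sketch, with a made-up dictionary and a hypothetical helper `words_needed` (this is not how any particular Forth compiler is implemented):

```python
def words_needed(entry, definitions):
    """Return the set of words reachable from entry through the definitions."""
    needed = set()
    pending = [entry]
    while pending:
        word = pending.pop()
        if word in needed:
            continue
        needed.add(word)
        # Primitives have no sub-words, so they contribute nothing further.
        pending.extend(definitions.get(word, ()))
    return needed


# Each word maps to the words its definition refers to.
DEFINITIONS = {
    "MAIN": ["GREET", "CR"],
    "GREET": ["TYPE"],
    "SQUARE": ["DUP", "*"],  # defined, but never reached from MAIN
    "TYPE": [], "CR": [], "DUP": [], "*": [],
}

print(sorted(words_needed("MAIN", DEFINITIONS)))
```

Everything outside the returned set (here SQUARE, DUP, and *) can be left out of the generated program.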
Proof by example: Tom Almy's ForthCMP, an '80s-'90s MSDOS compiler that shrank executable code way down. Its README says:
. Compiles Forth into machine code -- not interpreted.
. ForthCMP is written in Forth so that Forth code can be executed
during compilation, as is customary in Forth applications.
. Very fast -- ForthCMP compiles Forth code into an executable file
in a single pass.
. Generated code is extremely compact. Over 110 Forth "primitives"
are compiled in-line. ForthCMP performs constant expression
folding, strength reduction, register optimization, DO...LOOP
optimization, tail recursion, and various "peephole"
optimizations.
. Built-in assembler.
4C.COM runs under emulators like dosemu or dosbox.
A "Hello World" compiles into a 117 byte .COM file, a wc program compiles to a 3K .COM file (from 5K of source code). No dictionary or external libraries, (aside from standard MSDOS calls, i.e. the OS it runs on).
Forth can be a bear to get your head around from the outside because there is NO standard implementation of the language. Much of what people see is from the early days of Forth, when the author (Charles Moore) was still massaging his own thoughts. Or worse, homemade systems that people call Forth because they have a stack but are really not Forth.
So is Forth Interpreted or Compiled?
Short answer: both
Early years:
Forth had a text interpreter facing the programmer. So Interpreted: Check
But... The ':' character enabled the compiler which "compiled" the addresses of the words in the language so it was "compiled" but not as native machine code. It was lists of addresses where the code was in memory. The clever part was that those addresses could be run with a list "interpreter" that was only 2 or 3 instructions on most machines and a few more on an old 8 bit CPU. That meant it was still pretty fast and quite space efficient.
These systems are more of an image system so yes the system goes along with your program but some of those system kernels were 8K bytes for the entire run-time including the compiler and interpreter. Not heavy lifting.
This is what most people think of as Forth. See JonesForth for a literate example. (This was called "threaded code" at the time, not to be confused with multi-threading)
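The "list of addresses plus a tiny list interpreter" idea can be sketched in a few lines. Python function references stand in for machine addresses, and Python recursion stands in for the instruction pointer and return stack a real inner interpreter would use, so this is an illustration of the principle rather than a faithful model of any Forth.

```python
stack = []


def dup():
    stack.append(stack[-1])


def mul():
    b = stack.pop()
    a = stack.pop()
    stack.append(a * b)


# A colon definition "compiles" to a list of references to other words,
# analogous to a list of addresses.  Equivalent to   : SQUARE DUP * ;
SQUARE = [dup, mul]
# : FOURTH SQUARE SQUARE ;
FOURTH = [SQUARE, SQUARE]


def execute(word):
    """Inner interpreter: run a primitive, or walk a thread of references."""
    if callable(word):
        word()
    else:
        for ref in word:
            execute(ref)


stack.append(3)
execute(FOURTH)
print(stack)  # [81]
```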
1990ish
Forth gurus and Chuck Moore began to realize that a Forth language primitive could be as little as one machine instruction on modern machines, so why not just compile the instruction rather than the address? This became very useful with 32-bit machines, since the address was sometimes bigger than the instruction. They could then replace the little 3-instruction interpreter with the native CALL/Return instructions of the processor. This was called sub-routine threading. The front-end interpreter did not disappear. It simply kicked off native code sub-routines.
Today
Commercial Forth systems generate native code, inline many/most primitives and do many of the other optimization tricks you see in modern compilers.
They still have an interpreter facing the programmer. :-)
You can also buy (or build) Forth cross-compilers that create standalone executables for different CPUs that include multi-tasking, TCP/IP stacks and guess what, that text interpreter can be compiled into the executable as an option for remote debugging and configuration if you want it.
So is Forth Interpreted or Compiled? Still both.
You are right that a program that executes INTERPRET (EVALUATE, LOAD, INCLUDE etc.) is obliged to have a dictionary. That is hardly a disadvantage, because even a 64-bit executable is a mere 50 K for Linux or MS-Windows. Modern single board computers like the MSP430 can have the whole dictionary in flash memory. See ciforth and noforth respectively. Then there is scripting. If you use Forth as a scripting language, it is similar to Perl or Python: the script is small and doesn't contain the whole language. It does require, though, that the language is installed on your computer.
In the case of really small computers you can resort to cross compiling or to an umbilical Forth, where the dictionary resides on a host computer that communicates with and programs the target via a serial line. These are special techniques that are normally not needed. You can't use INTERPRETing code on the SBC in those cases, because obviously there is no dictionary there.
Note: mentioning the DOES> instruction doesn't serve to make the question clearer. I recommend that you edit this out.
Is it possible to distinguish between a Prolog Interpreter and Prolog Compiler from its usage or intermediary files generated?
Wikipedia has a good compilation of Prolog implementations
http://en.wikipedia.org/wiki/Comparison_of_Prolog_implementations
This is a question about the notation used in the table.
Does the column "Compiled Code" mean that the corresponding Prolog is implemented with a Prolog compiler?
(I am not sure if stackoverflow is a good place to ask about this. If not, please let me know, I will remove this thread.)
"Compiled Code" in this table means that any given Prolog program is itself compiled by the respective Prolog system, and the compiled form is executed.
Most of these systems compile Prolog programs to abstract machine code before executing it. Examples of abstract machines for Prolog (like the JVM for Java) are the WAM, ZIP, TOAM etc.
Some of these systems even compile Prolog code to native machine code, for example via JIT compilation, just like Java systems can compile Java code to native machine code.
In practice, you usually do not create intermediary files when working with Prolog: You run the Prolog system, load your source file, and the system compiles the file on the fly and in memory to abstract machine code, without creating an intermediary file. You usually can create such files manually if you need them, but you typically do not.
Thus, the creation of intermediary files is not a criterion that lets you distinguish a compiler from an interpreter.
I have a question for which I can't find an accurate answer online.
Using swipl-ld can help to combine Prolog and C code together, eventually generating one single executable binary.
But there is one thing I am confused with...
In the generated binary, does the Prolog Interpreter (Virtual Machine or others) still exist?
If so, then the original Prolog code is probably stored as a string in the .rodata section of the ELF binary, but after searching inside this section I didn't find the code. Perhaps the original code has been transformed into bytecode, and that's why I can't find it at all.
If not, then how can Prolog code be directly translated into semantically equivalent asm code by SWI-Prolog? I have read some materials about the implementation of GNU-Prolog, which is based on the WAM virtual machine; however, I haven't found any materials about the implementation of SWI-Prolog.
Could anyone give me some help?
The compiled binary contains neither your original source code nor the whole Prolog interpreter. However, it does contain your program in the form of bytecode compiled by the qsave_program/2 predicate. This bytecode is executed by the Prolog emulator, which is a subset of the Prolog interpreter used during a normal interactive session and which is also included in the compiled binary.
All relevant information can be found in the Generating Runtime Applications section of the SWI-Prolog documentation.