Basically, why not run all possible optimizations by default?
Enabling optimizations is less convenient when you're in a write-compile-test cycle: it makes compilation take longer, and it makes the resulting code harder to follow in a debugger (because the generated code may be structured very differently from what you wrote). So when trying to rapidly implement and test changes, many people prefer to keep optimizations off, and then to turn them back on for final testing and for the build that will be shipped.
(There are downsides to this, of course. There are some warnings that the compiler can only issue when optimizing, such as warnings about uninitialized variables. And there are many cases where buggy code only actually misbehaves when optimizations are on. So there is some benefit to using optimizations more often. Still, if you're trying to track down a bug that is reproducible with optimizations off, it is definitely easier to debug it that way.)
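To make the warning point concrete, here is a hedged sketch (the file name is made up): with GCC, warnings such as -Wmaybe-uninitialized typically only appear once optimization is enabled, because they rely on the optimizer's dataflow analysis.
// maybe_uninit.cpp
// g++ -Wall -c maybe_uninit.cpp       -> often silent
// g++ -Wall -O2 -c maybe_uninit.cpp   -> may warn: 'y' may be used uninitialized
int f(int x) {
    int y;              // only assigned on one branch
    if (x > 0)
        y = x * 2;
    return y;           // used uninitialized when x <= 0
}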
This means that if you were to count up the total number of times that the compiler is run, the vast majority will be runs where the user prefers not to optimize. Therefore it makes a certain amount of sense for "no optimization" to be the default.
Of course, now people are used to this behavior, and so even if it were better to have the opposite default, changing it would cause confusion.
I recently discovered that one can use JIT (just-in-time) compilation with R using the compiler package (I summarized my findings on this topic in a recent blog post).
One of the questions I was asked is:
Is there any pitfall? It sounds too good to be true: just put in one line of code and that's it.
After looking around I could find one possible issue having to do with the "start up" time for the JIT. But is there any other issue to be careful about when using JIT?
I guess that there will be some limitation having to do with R's environment architecture, but I cannot think of a simple illustration of the problem off the top of my head. Any suggestions or red flags would be of great help.
The output of a simple test with rpart suggests that enableJIT should not be used in ALL cases:
library(rpart)
fo <- function() for(i in 1:500){rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)}
system.time(fo())
#   user  system elapsed
#   2.11    0.00    2.11
require(compiler)
enableJIT(3)
system.time(fo())
#   user  system elapsed
#  35.46    0.00   35.60
Any explanation?
The rpart example given above no longer seems to be an issue:
library("rpart")
fo = function() {
for(i in 1:500){
rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
}
}
system.time(fo())
# user system elapsed
# 1.212 0.000 1.206
compiler::enableJIT(3)
# [1] 3
system.time(fo())
# user system elapsed
# 1.212 0.000 1.210
I've also tried a number of other examples, such as
growing a vector;
a function that's just a wrapper around mean.
While I don't always get a speed-up, I've never experienced a significant slow-down.
R> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04 LTS
In principle, once the byte-code is compiled and loaded, it should always be interpreted at least as fast as by the original AST interpreter. Some code will see big speedups; this is usually code with a lot of scalar operations and loops, where most of the time is spent in R interpretation (I've seen examples with a 10x speedup, but arbitrary micro-benchmarks can indeed inflate this as needed). Some code will run at the same speed; this is usually well-vectorized code that spends nearly no time in interpretation. Compilation itself, however, can be slow. Hence, the just-in-time compiler now does not compile functions when it guesses that compilation won't pay off (and the heuristics change over time; this is already in 3.4.x). The heuristics don't always guess right, so there may be situations where compilation won't pay off. Typical problematic patterns are code generation, code modification, and manipulation of bindings in environments captured by closures.
Packages can be byte-compiled at installation time so that the compilation cost is not paid (repeatedly) at run time, at least for code that is known ahead of time. This is now the default in the development version of R. While loading compiled code is much faster than compiling it, in some situations one may be loading code that won't even be executed, so there actually may be an overhead, but overall pre-compilation is beneficial. Recently some parameters of the GC have been tuned to reduce the cost of loading code that won't be executed.
My recommendation for package writers would be to use the defaults (just-in-time compilation is now on by default in released versions; byte-compilation at package installation time is now on in the development version). If you find an example where the byte-code compiler does not perform well, please submit a bug report (I've also seen a case involving rpart in earlier versions). I would recommend against code generation and code manipulation, particularly in hot loops. This includes defining closures and deleting or inserting bindings in environments captured by closures. One should definitely not do eval(parse(text= in hot loops (and this was already bad without byte-compilation). It is always better to use branches than to dynamically generate new closures (without branches). It is also better to write code with loops than to dynamically generate code with huge expressions (without loops). With the byte-code compiler, it is now often OK to write loops operating on scalars in R (the performance won't be as bad as before, so one can more often get away without switching to C for the performance-critical parts).
Further to the previous answer, experimentation shows that the problem is not with the compilation of the loop but with the compilation of closures. [enableJIT(0) or enableJIT(1) leaves the code fast, enableJIT(2) slows it down dramatically, and enableJIT(3) is slightly faster than the previous option (but still very slow).] Also, contrary to Hansi's comment, cmpfun slows execution to a similar extent.
It is obvious that, for any large codebase, separate compilation of modules can significantly cut down compilation time. Still, I wonder whether it is worth compiling a production release in a single step, on the assumption that the more code the compiler sees at once, the more optimization can take place across module boundaries. Is this presumption right?
The question is not language- or tool-specific, but I'm particularly interested in C++ and GCC.
Yes, it is. In fact, it's a compilation strategy used by almost every modern compiler out there.
You may want to look up interprocedural optimization (IPO) and link-time optimization (LTO).
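As a sketched illustration of what LTO enables (the two-file example is hypothetical): with ordinary separate compilation the call below cannot be inlined, because helper()'s body lives in another translation unit, while with GCC's -flto the optimizer sees both units at link time and can inline and constant-fold across the boundary.
// helper.cpp
int helper(int x) { return x * 2 + 1; }

// main.cpp
int helper(int x);                 // only the declaration is visible here
int main() { return helper(20); }  // with LTO this can fold to "return 41;"

// Build (GCC):
//   g++ -O2 -flto -c helper.cpp main.cpp
//   g++ -O2 -flto helper.o main.o -o app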
From what I have read, Java compilers (usually) seem to compile Java source to bytecode that is not very optimised (if at all), leaving it to the JIT to optimise. Is this true? And if it is, has there been any exploration (possibly in alternative implementations) of getting the compiler to optimise the code so the JIT has less work to do (is this possible)?
Also, many people seem to dislike native code generation (sometimes referred to as ahead-of-time compilation) for Java (and many other high-level, memory-managed languages), for reasons such as loss of portability, but also partly because (at least for languages that have a just-in-time compiler) the thinking goes that ahead-of-time compilation to machine code will miss the optimisations that a JIT compiler can perform, and may therefore be slower in the long run.
This leads me to wonder whether anyone has ever tried to implement profile-guided optimization (http://en.wikipedia.org/wiki/Profile-guided_optimization) for Java or other memory-managed languages (compiling to a binary plus some extras, then running the program and analysing the run-time information of the test run to generate a hopefully more optimised binary for real-world usage), and how this would compare to JIT-compiled code? Anyone have a clue?
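For reference, the ahead-of-time PGO workflow described above looks roughly like this with GCC (a sketch; the toy program is made up, but -fprofile-generate/-fprofile-use are the standard flags):
// app.cpp -- a branch whose bias only a profiling run can reveal to the optimizer
#include <cstdio>
int main(int argc, char**) {
    long hot = 0, cold = 0;
    for (int i = 0; i < 100000000; ++i)
        (i % 100 != 0 ? hot : cold) += argc;   // taken ~99% of the time
    std::printf("%ld %ld\n", hot, cold);
}

// 1. g++ -O2 -fprofile-generate app.cpp -o app   (build an instrumented binary)
// 2. ./app                                       (run on representative input; writes *.gcda profile data)
// 3. g++ -O2 -fprofile-use app.cpp -o app        (rebuild, optimizing with the recorded profile)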
Personally, I think the big difference is not between JIT compiling and AOT compiling, but between class-compilation and whole-program optimization.
When you run javac, it only looks at a single .java file, compiling it into a single .class file. All the interface implementations and virtual methods and overrides are checked for validity but left unresolved (because it's impossible to know the true method invocation targets without analyzing the whole program).
The JVM uses "runtime loading and linking" to assemble all of your classes into a coherent program (and any class in your program can invoke specialized behavior to change the default loading/linking behavior).
But then, at runtime, the JVM can remove the vast majority of virtual methods. It can inline all of your getters and setters, turning them into raw fields. And when those raw fields are inlined, it can perform constant-propagation to further optimize the code. (At runtime, there's no such thing as a private field.) And if there's only one thread running, the JVM can eliminate all synchronization primitives.
To make a long story short, there are a lot of optimizations that aren't possible without analyzing the whole program, and the best time for doing whole program analysis is at runtime.
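The same idea can be sketched in an ahead-of-time setting (illustrative C++, not the JVM itself): once the optimizer can see enough of the program to know the concrete type, the virtual call below can be devirtualized and inlined, which is what the JVM does at run time with the classes it has actually loaded.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Circle : Shape {
    double r;
    explicit Circle(double r) : r(r) {}
    double area() const override { return 3.141592653589793 * r * r; }
};

double total(const Circle& c) {
    const Shape& s = c;   // the dynamic type is known to be exactly Circle here,
    return s.area();      // so the virtual call can become an inlined multiply
}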
Profile-guided optimization has some caveats, one of them mentioned even in the Wiki article you linked. Its results are valid
for the given samples, representing how your code is actually used by the user or other code.
for the given platform (CPU, memory + other hardware, OS, whatever).
From the performance point of view there are quite big differences even between platforms that are usually considered (more or less) the same (e.g. compare an old single-core Athlon with 512M of RAM to a 6-core Intel with 8G, both running Linux but with very different kernel versions).
for the given JVM and its config.
If any of these change, then your profiling results (and the optimizations based on them) are not necessarily valid any more. Most likely some of the optimizations will still have a beneficial effect, but some of them may turn out to be suboptimal (or even degrade performance).
As was mentioned, JIT JVMs do something very similar to profiling, but they do it on the fly. This is also called 'HotSpot', because the VM constantly monitors the executed code, looks for hot spots that are executed frequently, and tries to optimize only those parts. At that point it can exploit more knowledge about the code (its context, how it is used by other classes, etc.), so, as mentioned by you and the other answers, it can do better optimizations than a static compiler. It continues monitoring and, if needed, does another round of optimization later, this time trying even harder (looking for more, and more expensive, optimizations).
Working on real-life data (usage statistics + platform + config), it can avoid the caveats mentioned before.
The price is the additional time spent on "profiling" and JIT-ing. Most of the time it is spent quite well.
I guess a profile-guided optimizer could still compete with it (or even beat it), but only in some special cases, if you can avoid the caveats:
you are quite sure that your samples represent the real life scenario well and they won't change too much during execution.
you know your target platform quite precisely and can do the profiling on it.
and of course you know/control the JVM and its config.
It will happen rarely and I guess in general JIT will give you better results, but I have no evidence for it.
Another possibility for getting value from profile-guided optimization is if you target a JVM that can't do JIT optimization (I think most small devices have such a JVM).
BTW one disadvantage mentioned in other answers would be quite easy to avoid: if static/profile guided optimization is slow (which is probably the case) then do it only for releases (or RCs going to testers) or during nightly builds (where time does not matter so much).
I think the much bigger problem would be to have good sample test cases. Creating and maintaining them is usually not easy and takes a lot of time. Especially if you want to be able to execute them automatically, which would be quite essential in this case.
The official Java Hot Spot compiler does "adaptive optimisation" at runtime, which is essentially the same as the profile-guided optimisation you mentioned. This has been a feature of at least this particular Java implementation for a long time.
The trade-off to performing more static analysis or optimisation passes up-front at compile time is essentially the (ever-diminishing) returns you get from this extra effort against the time it takes for the compiler to run. A compiler like MLton (for Standard ML) is a whole-program optimising compiler with a lot of static checks. It produces very good code, but becomes very, very slow on medium-to-large programs, even on a fast system.
So the Java approach seems to be to use JIT and adaptive optimisation as much as possible, with the initial compilation pass just producing an acceptable valid binary. The absolute opposite end is to use an approach like that of MLKit, which does a lot of static inference of regions and memory behaviour.
I was thinking more about the programming language I am designing, and I was wondering: what are ways I could minimize its compile time?
Your main problem today is I/O. Your CPU is many times faster than main memory and memory is about 1000 times faster than accessing the hard disk.
So unless you do extensive optimizations to the source code, the CPU will spend most of the time waiting for data to be read or written.
Try these rules:
Design your compiler to work in several, independent steps. The goal is to be able to run each step in a different thread so you can utilize multi-core CPUs. It will also help to parallelize the whole compile process (i.e. compile more than one file at the same time)
It will also allow you to load many source files in advance and preprocess them so the actual compile step can work faster.
Try to allow files to be compiled independently. For example, create a "missing symbol pool" for the project. Missing symbols should not cause compile failures as such. If you later find a definition for a missing symbol, remove it from the pool. When all files have been compiled, check that the pool is empty.
Create a cache with important information. For example: File X uses symbols from file Y. This way, you can skip compiling file Z (which doesn't reference anything in Y) when Y changes. If you want to go one step further, put all symbols which are defined anywhere in a pool. If a file changes in such a way that symbols are added or removed, you will know immediately which files are affected (without even opening them). A small sketch of such a cache follows this list.
Compile in the background. Start a compiler process which checks the project directory for changes and compiles them as soon as the user saves a file. This way, you will only have to compile a few files each time instead of everything. In the long run, you will compile much more, but for the user, turnaround times will be much shorter (= the time the user has to wait until she can run the compiled result after a change).
Use a "just in time" compiler (i.e. compile a file when it is used, for example in an import statement). Projects are then distributed in source form and compiled when run for the first time. Python does this. To make this perform well, you can precompile the library during the installation of your compiler.
Don't use header files. Keep all information in a single place and generate header files from the source if you have to. Maybe keep the header files just in memory and never save them to disk.
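A minimal sketch (hypothetical names, not from the answer above) of the dependency cache mentioned in the list: map each file to the files that use symbols it defines, so that a change to Y immediately yields the set of files to recompile, and Z is never touched.
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// users_of["Y.src"] = files that reference a symbol defined in Y.src
using DependencyCache =
    std::unordered_map<std::string, std::unordered_set<std::string>>;

std::vector<std::string> files_to_recompile(const DependencyCache& users_of,
                                            const std::string& changed) {
    std::vector<std::string> out{changed};
    auto it = users_of.find(changed);
    if (it != users_of.end())
        out.insert(out.end(), it->second.begin(), it->second.end());
    return out;
}

int main() {
    DependencyCache cache{{"Y.src", {"X.src"}}};   // X uses symbols from Y
    for (const auto& f : files_to_recompile(cache, "Y.src"))
        std::cout << f << " needs recompiling\n";  // Z.src never appears
}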
What are ways I could minimize its compile time?
No compilation (interpreted language)
Delayed (just in time) compilation
Incremental compilation
Precompiled header files
I've implemented a compiler myself, and ended up having to look at this once people started batch-feeding it hundreds of source files. I was quite surprised by what I found out.
It turns out that the most important thing you can optimize is not your grammar. It's not your lexical analyzer or your parser either. Instead, the most important thing in terms of speed is the code that reads in your source files from disk. I/O's to disk are slow. Really slow. You can pretty much measure your compiler's speed by the number of disk I/Os it performs.
So it turns out that the absolute best thing you can do to speed up a compiler is to read the entire file into memory in one big I/O, do all your lexing, parsing, etc. from RAM, and then write out the result to disk in one big I/O.
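A minimal sketch of the "slurp the whole source file in one go" approach (the function name is made up): one bulk read into memory, and the lexer then works purely from RAM.
#include <fstream>
#include <sstream>
#include <string>

std::string slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();      // one bulk read of the stream, no per-line I/O
    return buf.str();       // lex and parse from this in-memory buffer
}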
I talked with one of the head guys maintaining GNAT (GCC's Ada compiler) about this, and he told me that he actually used to put everything he could onto RAM disks so that even his file I/O was really just RAM reads and writes.
In most languages (pretty well everything other than C++), compiling individual compilation units is quite fast.
Binding/linking is often what's slow - the linker has to reference the whole program rather than just a single unit.
C++ suffers because, unless you use the pImpl idiom, it requires the implementation details of every object and all inline functions in order to compile client code.
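For readers unfamiliar with it, here is a minimal pImpl sketch (illustrative class names): client code sees only this header, so the private data members can change without forcing clients to recompile.
// widget.h
#include <memory>

class Widget {
public:
    Widget();
    ~Widget();                    // defined in widget.cpp, where Impl is complete
    void draw() const;
private:
    struct Impl;                  // defined only in widget.cpp
    std::unique_ptr<Impl> impl_;  // clients never see the members behind it
};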
Java (source to bytecode) suffers because the grammar doesn't differentiate objects and classes - you have to load the Foo class to see if Foo.Bar.Baz is the Baz field of object referenced by the Bar static field of the Foo class, or a static field of the Foo.Bar class. You can make the change in the source of the Foo class between the two, and not change the source of the client code, but still have to recompile the client code, as the bytecode differentiates between the two forms even though the syntax doesn't. AFAIK Python bytecode doesn't differentiate between the two - modules are true members of their parents.
C++ and C suffer if you include more headers than are required, as the preprocessor has to process each header many times, and the compiler has to compile them each time. Minimizing header size and complexity helps, suggesting better modularity would improve compilation time. It's not always possible to cache header compilation, as the definitions that are present when the header is preprocessed can alter its semantics, and even its syntax.
C suffers if you use the preprocessor a lot, but the actual compilation is fast; much C code uses typedef struct _X* X_ptr to hide the implementation better than C++ does - a C header can easily consist of just typedefs and function declarations, giving better encapsulation.
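The C idiom mentioned above, sketched as a header (illustrative names): the header exposes only an opaque typedef and function declarations, so the struct layout can change without recompiling clients.
/* x.h */
typedef struct _X* X_ptr;   /* struct _X is defined only in x.c */
X_ptr x_create(void);
void  x_frob(X_ptr x);
void  x_destroy(X_ptr x);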
So I'd suggest making your language hide implementation details from client code, and if you are designing an OO language with both instance members and namespaces, make the syntax for accessing the two unambiguous. Allow true modules, so client code only has to be aware of the interface rather than implementation details. Don't allow preprocessor macros or other variation mechanisms to alter the semantics of referenced modules.
Here are some performance tricks that we've learned by measuring compilation speed and what affects it:
Write a two-pass compiler: characters to IR, IR to code. (It's easier to write a three-pass compiler that goes characters -> AST -> IR -> code, but it's not as fast.)
As a corollary, don't have an optimizer; it's hard to write a fast optimizer.
Consider generating bytecode instead of native machine code. The virtual machine for Lua is a good model.
Try a linear-scan register allocator or the simple register allocator that Fraser and Hanson used in lcc (a compact linear-scan sketch follows this list).
In a simple compiler, lexical analysis is often the greatest performance bottleneck. If you are writing C or C++ code, use re2c. If you're using another language (which you will find much more pleasant), read the paper about re2c and apply the lessons learned.
Generate code using maximal munch, or possibly iburg.
Surprisingly, the GNU assembler is a bottleneck in many compilers. If you can generate binary directly, do so. Or check out the New Jersey Machine-Code Toolkit.
As noted above, design your language to avoid anything like #include. Either use no interface files or precompile your interface files. This tactic dramatically reduces the burden on the lexer, which as I said is often the biggest bottleneck.
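A compact sketch of the linear-scan idea mentioned in the list (Poletto & Sarkar style, simplified: live intervals only, and the spill policy is naive):
#include <algorithm>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Interval {
    std::string vreg;   // virtual register name
    int start, end;     // endpoints of the live range
    int reg = -1;       // assigned physical register, or -1 if spilled
};

void linear_scan(std::vector<Interval>& intervals, int num_regs) {
    std::sort(intervals.begin(), intervals.end(),
              [](const Interval& a, const Interval& b) { return a.start < b.start; });
    std::set<int> free_regs;
    for (int r = 0; r < num_regs; ++r) free_regs.insert(r);
    std::multimap<int, Interval*> active;   // currently allocated, keyed by end point
    for (auto& iv : intervals) {
        // expire intervals that ended before this one starts, freeing their registers
        while (!active.empty() && active.begin()->first < iv.start) {
            free_regs.insert(active.begin()->second->reg);
            active.erase(active.begin());
        }
        if (!free_regs.empty()) {
            iv.reg = *free_regs.begin();
            free_regs.erase(free_regs.begin());
            active.emplace(iv.end, &iv);
        }
        // else: iv.reg stays -1, i.e. spill the newcomer (a real linear-scan
        // allocator spills the active interval that ends furthest away instead)
    }
}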
Here's a shot:
Use incremental compilation if your toolchain supports it.
(make, visual studio, etc).
For example, in GCC/make, if you have many files to compile, but only make changes in one file, then only that one file is compiled.
Eiffel had an idea of different states of frozen, and recompiling didn't necessarily mean that the whole class was recompiled.
How much can you break up the compilable modules, and how much do you care to keep track of them?
Make the grammar simple and unambiguous, and therefore quick and easy to parse.
Place strong restrictions on file inclusion.
Allow compilation without full information whenever possible (e.g. predeclaration in C and C++; see the sketch after this list).
One-pass compilation, if possible.
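A tiny sketch of "compilation without full information" via predeclaration, as in C/C++ (illustrative names): the compiler can emit the call without ever seeing Foo's members or bar()'s body in this translation unit.
class Foo;                    // forward declaration: size and members unknown here
void bar(Foo& f);             // declaration only; the definition lives elsewhere
void use(Foo& f) { bar(f); }  // compiles with just the declarations above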
One thing surprisingly missing in the answers so far: make sure you're using a context-free grammar, etc. Have a good, hard look at languages designed by Wirth, such as Pascal and Modula-2. You don't have to reimplement Pascal, but the grammar design is custom-made for fast compiling. Then see if you can find any old articles about the tricks Anders pulled implementing Turbo Pascal. Hint: table-driven.
It depends on what language/platform you're programming for. For .NET development, minimise the number of projects that you have in your solution.
In the old days you could get dramatic speedups by setting up a RAM drive and compiling there. Don't know if this still holds true, though.
In C++ you could use distributed compilation with tools like Incredibuild
A simple one: make sure the compiler can natively take advantage of multi-core CPUs.
Make sure that everything can be compiled the first time you try to compile it, e.g. ban forward references.
Use a context free grammar so that you can find the correct parse tree without a symbol table.
Make sure that the semantics can be deduced from the syntax so you can construct the correct AST directly rather than by mucking with a parse tree and symbol table.
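A minimal illustration (hedged, not from the answer above) of why this matters: in C++ the same token sequence parses differently depending on what a name denotes, so the parser has to consult the symbol table, which is exactly what the advice above avoids.
struct a {};              // because 'a' names a type, the line below declares a pointer
int main() {
    a * b = nullptr;      // declaration of pointer 'b'; if 'a' were an int variable,
    (void)b;              // "a * b" would instead parse as a multiplication
}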
How serious a compiler is this?
Unless the syntax is pretty convoluted, the parser should run no more than 10-100 times slower than simply indexing through the characters of the input file.
Similarly, code generation should be limited by output formatting.
You shouldn't be hitting any performance issues unless you're doing a big, serious compiler, capable of handling mega-line apps with lots of header files.
Then you need to worry about precompiled headers, optimization passes, and linking.
I haven't seen much work done for minimizing the compile time. But some ideas do come to mind:
Keep the grammar simple. Convoluted grammar will increase your compile time.
Try making use of parallelism, using either multicore CPUs or GPUs.
Benchmark a modern compiler and see what the bottlenecks are and what you can do in your compiler/language to avoid them.
Unless you are writing a highly specialized language, compile time is not really an issue..
Make a build system that doesn't suck!
There are a huge number of programs out there with maybe 3 source files that take under a second to compile, but before you get that far you have to sit through an automake script that takes about 2 minutes checking things like the size of an int. And if you go to compile something else a minute later, it makes you sit through almost exactly the same set of tests.
So unless your compiler is doing awful things to the user like changing the size of its ints or changing basic function implementations between runs, just dump that info out to a file and let them get it in a second instead of 2 minutes.