Inlining functions in Go

Inlining functions in Go - sorting

Go has a nice generic implementation of introsort using the generic sort.Interface. But one of the advantages of C++'s std::sort is that you can specify a compare functor which is inlined and unnecessary function calls are omitted.
Can we somehow force the current native Go compilers to inline those Less and Swap calls made through sort.Interface? I am not considering gccgo because it gives terrible results compared to the native compiler.

For Go 1.x the answer is a resounding no.
The Go compiler may inline some classes of functions on some platforms in some specific ways. You might even be able to figure out some trickery through which you can make the current compiler inline the sort-"functor" on your architecture but there is absolutely no guarantee whatsoever that this will hold in future releases and it may work against you (even break your code, depending on the level of hackery involved) when the compiler changes.
A major release of Go (2.x?) might provide some guarantees about optimizations; but for Go 1.x there's nothing like that. And even if the Go authors wanted to do this it's difficult to provide such guarantees for all platforms (remember Go runs on many architectures).
Even more generally it is a bad idea to rely on compiler-specific behavior, in any language.
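For reference, here is a minimal sketch of the situation the question describes: a type implementing sort.Interface, whose Less and Swap methods sort.Sort calls through the interface. The names byValue and sortedCopy are made up for illustration. Building with `go build -gcflags=-m` prints the compiler's inlining decisions, so you can check what your particular toolchain does, with the caveat above that nothing guarantees the same result in another release.

```go
package main

import (
	"fmt"
	"sort"
)

// byValue is a hypothetical type implementing sort.Interface for []int.
type byValue []int

func (s byValue) Len() int           { return len(s) }
func (s byValue) Less(i, j int) bool { return s[i] < s[j] }
func (s byValue) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }

// sortedCopy sorts a copy of in. sort.Sort reaches Less and Swap
// through the sort.Interface value, i.e. via dynamic dispatch.
func sortedCopy(in []int) []int {
	out := append([]int(nil), in...)
	sort.Sort(byValue(out))
	return out
}

func main() {
	fmt.Println(sortedCopy([]int{3, 1, 2})) // prints [1 2 3]
}
```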

Related

Do compilers take the "status quo" when optimizations produced worse results?

To my knowledge, using optimizations carries the risk of hitting the "maybe it will be worse" case (i.e. performance is degraded, or code size grows, or both). Are compilers able to detect such cases and return to the "status quo" (i.e. fall back to the original non-optimized code) when optimizations produce worse results? Can someone give (if possible) particular examples of what compilers (for example, gcc, Clang (LLVM), etc.) do in this case?

In JIT compilers there is a thing called deoptimization. Normally the compiler optimizes heavily under some assumption, but during execution that assumption may fail. For example, the compiler will assume the input of a function is always an integer and produce highly efficient code for integer manipulation; but if the input is suddenly an array or a string (and such things happen in dynamic languages), the code has to revert. See V8's TurboFan speculative optimizer for an example.
For non-JIT compilers there is no way to deoptimize at runtime, but the compiler may create multiple execution paths. Your question is not fully logical, because how would the compiler know that it created suboptimal code? It can only use the same algorithm it used to do the optimization itself. That's probably why you were downvoted.
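To make the "guard plus fallback" idea concrete, here is a hand-written analogy in Go. It is only a sketch of the shape of such code, not what any JIT actually emits: the function name describeSum is hypothetical, and the type assertion plays the role of the guard that a speculative optimizer would insert before its specialized path.

```go
package main

import (
	"fmt"
	"strconv"
)

// describeSum is a hypothetical function with a guarded fast path.
func describeSum(a, b interface{}) string {
	// Guard: the "speculation" is that both inputs are ints.
	if x, ok := a.(int); ok {
		if y, ok := b.(int); ok {
			return strconv.Itoa(x + y) // specialized integer path
		}
	}
	// Guard failed: fall back to the generic, slower path.
	return fmt.Sprint(a) + fmt.Sprint(b)
}

func main() {
	fmt.Println(describeSum(1, 2))   // prints 3
	fmt.Println(describeSum("a", 2)) // prints a2
}
```

An ahead-of-time compiler that emits both paths and a cheap check is doing something similar in spirit; the difference with a JIT is that the JIT can discard the fast path at runtime once the guard keeps failing.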

Why is `math.Sin` disallowed in a Go constant?

According to Effective Go, the function math.Sin cannot be used to define a constant because that function must happen at run-time.
What is the reasoning behind this limitation? Floating-point consistency? Quirk of the Sin implementation? Something else?
There is support for this sort of thing in other languages. In C, for example: as of version 4.3, GCC supports compile-time calculation of the sine function. (See section "General Optimizer Improvements").
However, as noted in this blog post by Bruce Dawson, this can cause unexpected issues. (See section "Compile-time versus run-time sin").
Is this a relevant concern in Go? Or is this usage restricted for a different reason?

Go doesn't support initializing a constant with the result of a function. Functions are called at runtime, not at compile time. But constants are defined at compile time.
It would be possible to make exceptions for certain functions (like math.Sin for example), but that would make the spec more complicated. The Go developers generally prefer to keep the spec simple and consistent.
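A small sketch of the distinction, using only the standard math package. The names halfPi and sinHalfPi are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// A constant expression built from constants and operators is fine.
const halfPi = math.Pi / 2

// The next line, if uncommented, is rejected by the compiler,
// because a function call is not a constant expression:
// const sinHalfPiConst = math.Sin(halfPi)

// A package-level variable is allowed; it is initialized when the
// program starts, not at compile time.
var sinHalfPi = math.Sin(halfPi)

func main() {
	fmt.Println(sinHalfPi)
}
```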

Go simply lacks the concept. There is no way of marking a function as pure (its return value depends only on its arguments, and it doesn't alter any kind of mutable state or perform I/O), there is no way for the compiler to infer pureness, and there's no attempt to evaluate any expression containing a function call at compile-time (because doing so for anything except a pure function of constant arguments would be a source of weird behavior and bugs, and because adding the machinery needed to make it work right would introduce quite a bit of complexity).
Yes, this is a substantial loss, which forces a tradeoff between code with bad runtime behavior, and code which is flat-out ugly. Go partisans will choose the ugly code and tell you that you are a bad human being for not finding it beautiful.
The best thing you have available to you is code generation. The integration of go generate into the toolchain and the provision of a complete Go parser in the standard library makes it relatively easy to munge code at build time, and one of the things that you can do with this ability is create more advanced constant-folding if you so choose. You still get all of the debuggability peril of code generation, but it's something.
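As a rough sketch of the code-generation route: a tiny generator that evaluates math.Sin itself and emits Go source declaring the result as a constant. Everything here is an assumption for illustration (the function name constSource, the package name trig, the constant name SinOne); a real setup would write the output to a file and be invoked from a `//go:generate` directive.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// constSource returns Go source text declaring one float constant
// whose value was computed by the generator, not by the final program.
func constSource(pkg, name string, value float64) string {
	// 'g' with precision -1 gives the shortest literal that round-trips.
	lit := strconv.FormatFloat(value, 'g', -1, 64)
	return fmt.Sprintf(
		"// Code generated by a hypothetical generator. DO NOT EDIT.\n\npackage %s\n\nconst %s = %s\n",
		pkg, name, lit)
}

func main() {
	// Printed to stdout here; a real generator would write a .go file.
	fmt.Print(constSource("trig", "SinOne", math.Sin(1)))
}
```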

Pointer aliasing- in C++0x

I'm thinking about (just as an idea) disjointed pointer aliasing in C++0x. I was thinking about seeing if it could be implemented similarly to const correctness- that is, enforced by the compiler. What would be the requirements for such a thing? As this is more of a thought experiment, I'm perfectly happy to look at solutions that destroy legacy code or redefine half the language and that kind of thing.
What I'd really rather not do is have, say, restrict from C99 where the programmer just promises it. It should be enforced.
I was thinking about having unique_ptr be not part of the library, but part of the language. That way, the compiler can perform special optimizations on it and write their own unique pointer classes if they need to.

The Standard C++ Library (including std::unique_ptr) is a part of the language.
Also, conforming programs are not allowed to add declarations and definitions to the namespace std.
Upon seeing an instantiation of std::unique_ptr<T>, the compiler knows everything about the behavior of this instantiation - it's exactly that behavior which was implemented as a part of the language implementation the compiler itself is a part of and the compiler is free to perform "special optimizations" coming from the guarantees of the C++ standard.
As an example for something coming from the same line of thinking, GCC already does this with a number of standard C99 functions in hosted mode - it may replace standard function calls with inline insn sequence or with calls to other functions - precisely because GCC knows the exact semantics by just knowing the name of the function.

How does Go compile so quickly?

I've Googled and poked around the Go website, but I can't find an explanation for Go's extraordinary build times. Are they products of the language features (or lack thereof), a highly optimized compiler, or something else? I'm not trying to promote Go; I'm just curious.

Dependency analysis.
The Go FAQ used to contain the following sentence:
"Go provides a model for software construction that makes dependency analysis easy and avoids much of the overhead of C-style include files and libraries."
While the phrase is not in the FAQ anymore, this topic is elaborated upon in the talk Go at Google, which compares the dependency analysis approach of C/C++ and Go.
That is the main reason for fast compilation. And this is by design.

I think it's not that Go compilers are fast, it's that other compilers are slow.
C and C++ compilers have to parse enormous amounts of headers - for example, compiling C++ "hello world" requires compiling 18k lines of code, which is almost half a megabyte of sources!
$ cpp hello.cpp | wc
18364 40513 433334
Java and C# compilers run in a VM, which means that before they can compile anything, the operating system has to load the whole VM, then they have to be JIT-compiled from bytecode to native code, all of which takes some time.
Speed of compilation depends on several factors.
Some languages are designed to be compiled fast. For example, Pascal was designed to be compiled using a single-pass compiler.
Compilers themselves can be optimized too. For example, the Turbo Pascal compiler was written in hand-optimized assembler, which, combined with the language design, resulted in a really fast compiler working on 286-class hardware. I think that even now, modern Pascal compilers (e.g. FreePascal) are faster than Go compilers.

There are multiple reasons why the Go compiler is much faster than most C/C++ compilers:
Top reason: Most C/C++ compilers exhibit exceptionally bad designs (from a compilation-speed perspective). Also, some parts of the C/C++ ecosystem (such as the editors in which programmers write their code) aren't designed with speed of compilation in mind.
Top reason: Fast compilation speed was a conscious choice in the Go compiler and also in the Go language
The Go compiler has a simpler optimizer than C/C++ compilers
Unlike C++, Go has no templates and no inline functions. This means that Go doesn't need to perform any template or function instantiation.
The Go compiler generates low-level assembly code sooner and the optimizer works on the assembly code, while in a typical C/C++ compiler the optimization passes work on an internal representation of the original source code. The extra overhead in the C/C++ compiler comes from the fact that the internal representation needs to be generated.
Final linking (5l/6l/8l) of a Go program can be slower than linking a C/C++ program, because the Go compiler is going through all of the used assembly code and maybe it is also doing other extra actions that C/C++ linkers aren't doing
Some C/C++ compilers (GCC) generate instructions in text form (to be passed to the assembler), while the Go compiler generates instructions in binary form. Extra work (but not much) needs to be done in order to transform the text into binary.
The Go compiler targets only a small number of CPU architectures, while the GCC compiler targets a large number of CPUs
Compilers which were designed with the goal of high compilation speed, such as Jikes, are fast. On a 2GHz CPU, Jikes can compile 20000+ lines of Java code per second (and the incremental mode of compilation is even more efficient).

Compilation efficiency was a major design goal:
Finally, it is intended to be fast: it should take at most a few seconds to build a large executable on a single computer. To meet these goals required addressing a number of linguistic issues: an expressive but lightweight type system; concurrency and garbage collection; rigid dependency specification; and so on. (Go FAQ)
The language FAQ is pretty interesting in regards to specific language features relating to parsing:
Second, the language has been designed to be easy to analyze and can be parsed without a symbol table.

While most of the above is true, there is one very important point that was not really mentioned: dependency management.
Go only needs to include the packages that you are importing directly (as those have already imported what they need). This is in stark contrast to C/C++, where every single file starts by including x headers, which include y headers, and so on. Bottom line: Go's compile time is linear in the number of imported packages, whereas C/C++ takes exponential time.
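The linear-versus-exponential claim can be illustrated with a toy model. This is only a sketch under strong assumptions: a hypothetical layered dependency graph in which every file depends on two files in the next layer, and a "naive" strategy that re-expands every inclusion textually with no include guards or caching. Real C/C++ builds sit somewhere in between, so treat the numbers as an illustration of the argument, not a measurement.

```go
package main

import "fmt"

// diamondChain builds a hypothetical graph: "root" depends on a1 and b1,
// and every node in layer i depends on both nodes in layer i+1.
func diamondChain(layers int) map[string][]string {
	deps := map[string][]string{}
	prev := []string{"root"}
	for i := 1; i <= layers; i++ {
		cur := []string{fmt.Sprintf("a%d", i), fmt.Sprintf("b%d", i)}
		for _, p := range prev {
			deps[p] = cur
		}
		prev = cur
	}
	return deps
}

// readsNaive counts file reads when every inclusion is expanded again.
func readsNaive(deps map[string][]string, node string) int {
	n := 1
	for _, d := range deps[node] {
		n += readsNaive(deps, d)
	}
	return n
}

// readsOnce counts file reads when each dependency is processed once
// and its result is reused afterwards.
func readsOnce(deps map[string][]string, node string, seen map[string]bool) int {
	if seen[node] {
		return 0
	}
	seen[node] = true
	n := 1
	for _, d := range deps[node] {
		n += readsOnce(deps, d, seen)
	}
	return n
}

func main() {
	deps := diamondChain(3)
	fmt.Println(readsNaive(deps, "root"))                   // prints 15
	fmt.Println(readsOnce(deps, "root", map[string]bool{})) // prints 7
}
```

With n layers the naive count is 2^(n+1) - 1 while the process-once count is 2n + 1, which is the gap the answer is pointing at.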

A good test for the translation efficiency of a compiler is self-compilation: how long does it take a given compiler to compile itself? For C++ it takes a very long time (hours?). By comparison, a Pascal/Modula-2/Oberon compiler would compile itself in less than one second on a modern machine [1].
Go has been inspired by these languages, but some of the main reasons for this efficiency include:
1. A clearly defined syntax that is mathematically sound, for efficient scanning and parsing.
2. A type-safe and statically-compiled language that uses separate compilation with dependency and type checking across module boundaries, to avoid unnecessary re-reading of header files and re-compiling of other modules - as opposed to independent compilation like in C/C++ where no such cross-module checks are performed by the compiler (hence the need to re-read all those header files over and over again, even for a simple one-line "hello world" program).
3. An efficient compiler implementation (e.g. single-pass, recursive-descent top-down parsing) - which of course is greatly helped by points 1 and 2 above.
These principles have already been known and fully implemented in the 1970s and 1980s in languages like Mesa, Ada, Modula-2/Oberon and several others, and are only now (in the 2010s) finding their way into modern languages like Go (Google), Swift (Apple), C# (Microsoft) and several others.
Let's hope that this will soon be the norm and not the exception. To get there, two things need to happen:
First, software platform providers such as Google, Microsoft and Apple should start by encouraging application developers to use the new compilation methodology, while enabling them to re-use their existing code base. This is what Apple is now trying to do with the Swift programming language, which can co-exist with Objective-C (since it uses the same runtime environment).
Second, the underlying software platforms themselves should eventually be re-written over time using these principles, while simultaneously redesigning the module hierarchy in the process to make them less monolithic. This is of course a mammoth task and may well take the better part of a decade (if they are courageous enough to actually do it - which I am not at all sure in the case of Google).
In any case, it's the platform that drives language adoption, and not the other way around.
References:
[1] http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.System.pdf, page 6: "The compiler compiles itself in about 3 seconds". This quote is for a low cost Xilinx Spartan-3 FPGA development board running at a clock frequency of 25 MHz and featuring 1 MByte of main memory. From this one can easily extrapolate to "less than 1 second" for a modern processor running at a clock frequency well above 1 GHz and several GBytes of main memory (i.e. several orders of magnitude more powerful than the Xilinx Spartan-3 FPGA board), even when taking I/O speeds into account. Already back in 1990 when Oberon was run on a 25MHz NS32X32 processor with 2-4 MBytes of main memory, the compiler compiled itself in just a few seconds. The notion of actually waiting for the compiler to finish a compilation cycle was completely unknown to Oberon programmers even back then. For typical programs, it always took more time to remove the finger from the mouse button that triggered the compile command than to wait for the compiler to complete the compilation just triggered. It was truly instant gratification, with near-zero wait times. And the quality of the produced code, even though not always completely on par with the best compilers available back then, was remarkably good for most tasks and quite acceptable in general.

Go was designed to be fast, and it shows.
Dependency Management: no header file, you just need to look at the packages that are directly imported (no need to worry about what they import) thus you have linear dependencies.
Grammar: the grammar of the language is simple, thus easily parsed. The number of features is also small, so the compiler code itself is tight (few paths).
No overload allowed: you see a symbol, you know which method it refers to.
It's trivially possible to compile Go in parallel because each package can be compiled independently.
Note that Go isn't the only language with such features (modules are the norm in modern languages), but they did it well.

Quoting from the book "The Go Programming Language" by Alan Donovan and Brian Kernighan:
Go compilation is notably faster than most other compiled languages, even when building from scratch. There are three main reasons for the compiler’s speed. First, all imports must be explicitly listed at the beginning of each source file, so the compiler does not have to read and process an entire file to determine its dependencies. Second, the dependencies of a package form a directed acyclic graph, and because there are no cycles, packages can be compiled separately and perhaps in parallel. Finally, the object file for a compiled Go package records export information not just for the package itself, but for its dependencies too. When compiling a package, the compiler must read one object file for each import but need not look beyond these files.
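The second point in the quote, that an acyclic import graph lets packages be compiled one after another (or in parallel), can be sketched as a depth-first ordering. This is a simplified illustration, not how the go tool is implemented; the function name buildOrder and the package names app, store and util are hypothetical, and the code assumes the graph has no cycles (which Go enforces for imports).

```go
package main

import (
	"fmt"
	"sort"
)

// buildOrder returns the packages in an order where every package
// appears after all of the packages it imports.
func buildOrder(imports map[string][]string) []string {
	var order []string
	done := map[string]bool{}

	var visit func(pkg string)
	visit = func(pkg string) {
		if done[pkg] {
			return
		}
		done[pkg] = true
		for _, dep := range imports[pkg] {
			visit(dep) // dependencies first
		}
		order = append(order, pkg)
	}

	// Visit in sorted order so the result is deterministic.
	pkgs := make([]string, 0, len(imports))
	for p := range imports {
		pkgs = append(pkgs, p)
	}
	sort.Strings(pkgs)
	for _, p := range pkgs {
		visit(p)
	}
	return order
}

func main() {
	imports := map[string][]string{
		"app":   {"store", "util"},
		"store": {"util"},
		"util":  {},
	}
	fmt.Println(buildOrder(imports)) // prints [util store app]
}
```

Packages that do not depend on each other end up with no ordering constraint between them, which is what makes parallel compilation possible.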

The basic idea of compilation is actually very simple. A recursive-descent parser, in principle, can run at I/O bound speed. Code generation is basically a very simple process. A symbol table and basic type system is not something that requires a lot of computation.
However, it is not hard to slow down a compiler.
If there is a preprocessor phase, with multi-level include directives, macro definitions, and conditional compilation, as useful as those things are, it is not hard to load it down. (For one example, I'm thinking of the Windows and MFC header files.) That is why precompiled headers are necessary.
In terms of optimizing the generated code, there is no limit to how much processing can be added to that phase.

Simply (in my own words), because the syntax is very easy to analyze and to parse.
For instance, no type inheritance means no problematic analysis to find out whether a new type follows the rules imposed by its base type.
For instance, in this code example: "interfaces", the compiler doesn't go and check whether the intended type implements the given interface while analyzing that type. The check is performed only when (and IF) the type is actually used as that interface.
Another example: the compiler tells you if you declare a variable and don't use it (or if you are supposed to hold a return value and you don't).
The following doesn't compile:
package main
func main() {
	var a int
	a = 0
}
notused.go:3: a declared and not used
These kinds of enforcement and principles make the resulting code safer, and the compiler doesn't have to perform extra validations that the programmer can do.
By and large, all these details make the language easier to parse, which results in fast compilation.
Again, in my own words.
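To illustrate the interface point from earlier in this answer, here is a minimal sketch with made-up names (Shape, Square, describe). Square never declares that it implements Shape; the compiler only checks the method set at the place where a Square is used as a Shape.

```go
package main

import "fmt"

// Shape is a hypothetical interface with a single method.
type Shape interface {
	Area() float64
}

// Square does not mention Shape anywhere in its declaration.
type Square struct {
	Side float64
}

func (s Square) Area() float64 { return s.Side * s.Side }

// describe accepts any value whose type has an Area method.
func describe(s Shape) string {
	return fmt.Sprintf("area=%.1f", s.Area())
}

func main() {
	// The "does Square satisfy Shape?" check happens here, at the use site.
	fmt.Println(describe(Square{Side: 2})) // prints area=4.0
}
```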

Go imports dependencies once for all files, so the import time doesn't increase exponentially with project size.
Simpler linguistics means interpreting them takes less computing.
What else?

How do modern compilers use mmx/3dnow/sse instructions?

I've been reading up on the x86 instruction set extensions, and they only seem useful in some quite specific circumstances (eg HADDPD - (Horizontal-Add-Packed-Double) in SSE3). These require a certain register layout that needs to be either deliberately set up, or occur from the series of instructions before it. How often do general-purpose compilers like gcc actually use these instructions (or a subset thereof), or are they mainly to be used in hand-coded assembler? How does the compiler detect where it is appropriate to use SIMD instructions?

Generally, few compilers use them. GCC and Visual Studio aren't usually able to use the SIMD instructions. If you enable SSE as a compiler flag, the compiler will use the scalar SSE instructions for regular floating-point operations, but generally, don't expect the vectorized ones to be used automatically. Recent versions of GCC might be able to use them in some cases, but it didn't work the last time I tried. Intel's C++ compiler is the only big compiler I know of that is able to auto-vectorize some loops.
In general though, you'll have to use them yourself. Either in raw assembler, or by using compiler intrinsics. In general, I'd say intrinsics are the better approach, since they better allow the compiler to understand the code, and so schedule and optimize, but in practice, I know MSVC at least doesn't always generate very efficient code from intrinsics, so plain asm may be the best solution there. Experiment, see what works. But don't expect the compiler to use these instructions for you, unless you 1) use the right compiler, and 2) write fairly simple loops that can be trivially vectorized.
Update 2012
Ok, so three years have passed since I wrote this answer. GCC has been able to auto-vectorize (simple) code for a couple of years now, and in VS2012, MSVC finally gains the same capability. Of course, the main part of my answer still applies: compilers can still only vectorize fairly trivial code. For anything more complex, you're stuck fiddling with intrinsics or inline asm.

Mono can use SIMD extensions as long as you use its classes for vectors. You can read about it here: http://tirania.org/blog/archive/2008/Nov-03.html
GCC should do some automatic vectorisation as long as you're using -O3 or a specific flag. They have an info page here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html

The question of how to exploit SSE and other small vector units automatically (without direction from the programmer in the form of special language constructs or specially blessed compiler "intrinsics") has been a topic of compiler research for some time. Most results seem to be specialized to a particular problem domain, such as digital signal processing. I have not kept up with the literature on this topic, but what I have read suggests that exploiting the vector (SSE) unit is still a topic for research, and that one should have low expectations of general-purpose compilers commonly used in the field.
Suggested search term: vectorizing compiler

I have seen gcc use sse to zero out a default std::string object. Not a particularly powerful use of sse, but it exists. In most cases, though you will have to write your own.
I know this because I had allowed the stack to become unaligned and it crashed, otherwise I probably wouldn't have noticed!

If you use the Vector Pascal compiler you will get efficient SIMD code for types for which SIMD gives an advantage. Basically this is anything of length less than 64 bits (for 64-bit reals it is actually slower to do SIMD).
The latest versions of the compiler will also automatically parallelise across cores.
