Performance of Google's Go?

So has anyone used Google's Go? I was wondering how its mathematical performance (e.g., FLOPS) compares to other languages with a garbage collector, like Java or .NET.
Has anyone investigated this?

Theoretical performance: The theoretical performance of pure Go programs is somewhere between C/C++ and Java. This assumes an advanced optimizing compiler and it also assumes the programmer takes advantage of all features of the language (be it C, C++, Java or Go) and refactors the code to fit the programming language.
Practical performance (as of July 2011): The standard Go compiler (5g/6g/8g) is currently unable to generate efficient instruction streams for high-performance numerical codes, so the performance will be lower than C/C++ or Java. There are multiple reasons for this: each function call has an overhead of a couple of additional instructions (compared to C/C++ or Java), no function inlining, average-quality register allocation, an average-quality garbage collector, limited ability to erase bounds checks, no access to vector instructions from Go, no compiler support for SSE2 on 32-bit x86 CPUs, etc.
Bottom line: As a rule of thumb, expect numerical code written in pure Go and compiled by 5g/6g/8g to run at roughly half the speed of C/C++ or Java. Expect the performance to improve in the future.
Practical performance (September 2013): Compared to the old Go from July 2011, Go 1.1.2 is capable of generating more efficient numerical code, but it still runs slightly slower than C/C++ and Java. The compiler now utilizes SSE2 instructions even on 32-bit x86 CPUs, which makes 32-bit numerical code run much faster, most likely thanks to better register allocation. The compiler also implements function inlining and escape analysis. The garbage collector has been improved as well, but it remains less advanced than Java's. There is still no support for accessing vector instructions from Go.
Bottom line: The performance gap seems sufficiently small for Go to be an alternative to C/C++ and Java in numerical computing, unless the competing implementation is using vector instructions.

The Go math package is largely written in assembler for performance.
Benchmarks are often unreliable and subject to interpretation. For example, Robert Hundt's paper Loop Recognition in C++/Java/Go/Scala looks flawed. The Go blog post Profiling Go Programs dissects Hundt's claims.
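For reference, the profiling workflow that post describes uses the standard runtime/pprof package; here is a minimal sketch (hotLoop is just an illustrative stand-in for the code being measured):

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func hotLoop() {
	// Stand-in for the numerical code under investigation.
	s := 0
	for i := 0; i < 100_000_000; i++ {
		s += i
	}
	_ = s
}

func main() {
	// Write a CPU profile that `go tool pprof cpu.prof` can inspect.
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	hotLoop()
}
```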

You're actually asking several different questions. First of all, Go's math performance is going to be about as fast as anything else. Any language that compiles down to native code (which arguably includes even JIT languages like .NET) is going to perform extremely well at raw math -- as fast as the machine can go. Simple math operations are very easy to compile into a zero-overhead form. This is the area where compiled (including JIT) languages have an advantage over interpreted ones.
The other question you asked was about garbage collection. This is, to a certain extent, a bit of a side issue if you're talking about heavy math. That's not to say that GC doesn't impact performance -- actually it impacts it quite a bit. But the common solution for tight loops is to avoid or minimize GC sweeps. This is often quite simple if you're doing a tight loop -- you just re-use your old variables instead of constantly allocating and discarding them. This can speed up your code by several orders of magnitude.
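A minimal Go sketch of that reuse pattern (the sizes and names here are illustrative):

```go
package main

import "fmt"

func main() {
	const n = 1 << 16
	// Allocate the scratch buffer once, outside the hot loop,
	// instead of calling make() on every iteration.
	scratch := make([]float64, n)

	total := 0.0
	for iter := 0; iter < 1000; iter++ {
		for i := range scratch {
			x := float64(i) * 0.001
			scratch[i] = x * x // overwrite old contents; no new garbage
		}
		for _, v := range scratch {
			total += v
		}
	}
	fmt.Println(total)
}
```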
As for the GC implementations themselves -- Go and .NET both use mark-and-sweep garbage collection. Microsoft has put a lot of focus and engineering into their GC engine, and I'm inclined to think that it's quite good, all things considered. Go's GC engine is a work in progress, and while it doesn't feel any slower than .NET's, the Golang folks insist that it needs some work. The fact that Go's specification disallows destructors goes a long way toward speeding things up, which may be why it doesn't seem that slow.
Finally, in my own anecdotal experience, I've found Go to be extremely fast. I've written very simple and easy programs that have stood up in my own benchmarks against highly-optimized C code from some long-standing and well-respected open source projects that pride themselves on performance.
The catch is that not all Go code is going to be efficient, just like not all C code is efficient. You've got to build it correctly, which often means doing things differently than what you're used to from other languages. The profiling blog post mentioned here several times is a good example of that.

Google did a study comparing Go to some other popular languages (C++, Java, Scala). They concluded it was not as strong performance-wise:
https://days2011.scala-lang.org/sites/days2011/files/ws3-1-Hundt.pdf
Quote from the Conclusion, about Go:
Go offers interesting language features, which also allow for a concise and standardized notation. The compilers for this language are still immature, which reflects in both performance and binary sizes.

Related

Java vs. Julia: Differences in JIT Compiling and Resulting Performance

I've recently started reading about JIT compilation. On another note, I've read that well-written Julia code often performs on par with statically compiled languages (see, e.g., paragraph 2 of the introduction section of the Julia docs), while I've repeatedly heard that Java often does not. Why is that?
On the surface, they seem to have in common that they both run JIT-compiled bytecode in a VM (although I am aware that Java dynamically infers which code to JIT). While I can rationalize the performance difference in Julia vs. (purely) interpreted languages like (vanilla) Python, how come two JIT-compiled languages have such different reputations for performance? Speaking of performance, I am particularly referring to scientific computing applications.
Please note that this question is intentionally phrased broadly. I feel like its possible answers could give me insights into what defines fast Julia code, given the way Julia's compiler works in comparison to other JIT compiled languages.
While AFAIK there is currently only one implementation of Julia, there are several implementations of Java, and not all of them behave the same or use the same techniques internally. Thus it does not mean much to compare languages as such; one really compares implementations. For example, GCJ was a GNU compiler meant to compile Java code to native code (i.e. no JIT, no bytecode). It is now a defunct project, since the open-source JIT-based implementations superseded it (AFAIK even performance-wise).
The primary reference Java VM implementation is HotSpot (made by Oracle). HotSpot's JIT uses an adaptive strategy for compiling functions so as to reduce compilation latency. Code can be interpreted for a short period of time, and if it is executed many times, the JIT applies more aggressive optimizations at multiple levels. As a result, hot loops are very aggressively optimized while glue code executed once is mostly interpreted. Meanwhile, Julia is based on the LLVM compiler stack, which is capable of producing very efficient code (it is used by Clang, a compiler that compiles C/C++ to native code), but LLVM is not yet very well suited for very dynamic code (it works, but the latency is large compared to other existing JIT implementations).
The thing is, Java and Julia target different domains. Java is used, for example, on embedded systems where latency matters a lot. It is also used for GUI applications and web servers, where introducing high latency during execution is not very reasonable. This is especially why Java implementations have spent a lot of effort over the years optimizing the GC (garbage collector) in order to reduce collection latency. Julia mainly targets HPC/scientific applications that do not care much about latency. The main goal of Julia is to minimize wall-clock time, not the responsiveness of the application.
I've read that well-written Julia code often performs on-par with statically compiled languages
Well, optimizing JITs like Julia's or HotSpot's are very good nowadays at compiling scalar code in hot loops. Their weakness lies in applying expensive high-level transformations. For example, optimizing compilers like ICC/PGI can use the polyhedral model to completely rewrite loops and vectorize them efficiently using SIMD instructions. This is frequent in HPC (numerically intensive) applications but very rare in embedded/web/GUI ones. The use of the best specific instructions on the available platform is not always great in most JIT implementations (e.g. bit operations), though the situation is rapidly improving. On the other hand, JITs can outperform static compilers by using runtime information. For example, they can assume a value is a constant and optimize expressions based on that (e.g. a runtime-dependent stride of 1 in a multi-dimensional array does not need additional multiplications). Still, static compilers can do similar optimizations with profile-guided optimization (unfortunately rarely used in practice).
However, there is a catch: languages like C/C++ compiled natively have access to lower-level features barely available in Java. This is a bit different in Julia, since linking with native code is easier and inline assembly is possible (as pointed out by @OscarSmith), enabling skilled developers to write efficient wrappers. Julia and Java use a GC that can speed up some unoptimized code a bit but can also slow other code down a lot (typically code manipulating big data structures with a lot of references, like trees and graphs, especially in parallel code). This is why a C/C++ code can significantly outperform a Julia/Java code. While JIT implementations can sometimes (but rarely) outperform static C/C++ compilers, no compiler is perfect and there are cases where nearly all implementations perform poorly. The use of (low-level) intrinsics enables developers to write possibly faster code at the expense of lower portability and higher complexity. This is not rare for SIMD code, since auto-vectorization is far from great so far. More generally, access to lower-level features (e.g. operating-system-specific functions, parallel tools) helps skilled programmers write faster code.
The chosen algorithms and methods often matter far more than the target language implementation. The best algorithm/method in one language implementation may not be the best in another, and the best algorithms/methods of two different implementations are generally hard to compare (is it fair to compare only the performance of two codes if one is nearly impossible to maintain and very hard and slow to write without bugs?). This is partially why comparing language implementations is so difficult, even on a specific problem.
(purely) interpreted languages like (vanilla) Python
While the standard implementation of Python is the CPython interpreter, there are fast JITs for Python, like PyPy or Pyston.
Speaking of performance, I am particularly referring to scientific computing applications
Note that "scientific computing applications" is still quite broad. While physicists tend to write heavily numerically intensive applications operating on large contiguous arrays, where the use of multiple threads and SIMD instructions is critical, biologists tend to write code requiring very different optimizations. For example, genomic codes tend to do a lot of string-matching operations. They also often make use of complex data structures/algorithms (e.g. phylogenetic trees, compression).
Some Java features like boxing are performance killers for such applications, though there are often complex ways to mitigate their cost.
You may be interested in this famous language benchmark:
Julia vs C-GCC (one can see that Julia and Java are slow for binary trees, as expected, certainly due to the GC, though Java's GC is more efficient at the expense of bigger memory usage)
Julia VS Java-OpenJDK
C-GCC VS C-Clang
As you can see in the benchmark, the fastest implementations are generally the more-complex and/or bigger ones using the best algorithms and lower-level methods/tricks.

most suitable language for computationally and memory expensive algorithms

Let's say you have to implement a tool to efficiently solve an NP-hard problem, with a possible unavoidable explosion of memory usage (the output size is in some cases exponential in the input size), and you are particularly concerned about the tool's runtime performance. The source code also has to be readable and understandable once the underlying theory is known, and this requirement is as important as the efficiency of the tool itself.
I personally think that three languages could be suitable for these requirements: C++, Scala, Java.
They all provide the right abstraction on data types that makes it possible to compare different structures or apply the same algorithms (which is also important) to different data types.
C++ has the advantage of being statically compiled and optimized, and with function inlining (if the data structures and algorithms are designed carefully) and other optimisation techniques it's possible to achieve a performance close to that of pure C while maintaining a fairly good readability.
If you also put a lot of care in data representation you can optimise the cache performance, which can gain orders of magnitude in speed when the cache miss rate is low.
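The idea is language-agnostic; here is a small sketch in Go (the language this page opened with), with an invented matrix size, contrasting the cache-friendly and cache-hostile loop orders:

```go
package main

import "fmt"

const n = 2048

func main() {
	m := make([][]float64, n)
	for i := range m {
		m[i] = make([]float64, n)
	}

	// Row-major order touches memory sequentially: each cache line
	// loaded is fully used before the next one is fetched.
	sum := 0.0
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			sum += m[i][j]
		}
	}
	// The cache-hostile variant merely swaps the loops, striding
	// n*8 bytes per access and thrashing the cache:
	// for j := 0; j < n; j++ { for i := 0; i < n; i++ { sum += m[i][j] } }
	fmt.Println(sum)
}
```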
Java is instead JIT-compiled, which allows optimisations to be applied at runtime, and for this category of algorithms, which can behave differently between runs, that may be a plus. I fear instead that such an approach could suffer from the garbage collector; however, this kind of algorithm commonly allocates memory continuously, Java heap allocation is notoriously faster than C/C++'s, and if you implement your own memory manager inside the language you could even achieve good efficiency.
This approach, on the other hand, cannot always inline method invocations (which can incur a large performance penalty) and gives you no control over cache performance. Among the pros, there's a better and cleaner syntax than C++.
My concerns about Scala are more or less the same as for Java, plus the fact that I can't control how the language is optimised unless I have deep knowledge of the compiler and the standard library. But well: I get a very clean syntax :)
What's your take on the subject? Have you had to deal with this already? Would you implement an algorithm with such properties and requirements in any of these languages or would you suggest something else? How would you compare them?
Usually I’d say “C++” in a heartbeat. The secret being that C++ simply produces less (memory) garbage that needs managing.
On the other hand, your observation that
however in the case of this algorithm it's common to continuously allocate memory
is a hint that Java / Scala may actually be more suited. But then you could use a small object heap in C++ as well. Boost has one that uses the standard allocator interface, if memory serves.
Another advantage of C++ is obviously the use of abstraction without penalty through templates – i.e. that you can easily create generic algorithmic components that can interact without incurring a runtime overhead due to abstraction. In fact, you noted that
it's possible to achieve a performance close to that of pure C while maintaining a fairly good readability
– this is looking at things the wrong way: Templates allow C++ to achieve performance superior to that of C while still maintaining high abstraction.
D might be worth a look, seeing as how it tries to be a better C++.
From a superficial glance, it has better source code readability than C++ does, so that's one of your points covered.
It also has memory management, which makes playing with algorithms a bit easier.
And it has templates.
Here is a stackoverflow discussion comparing the performance of C++ and D
The languages you noticed were my first guesses as well.
Each language has a different take on how to handle specific issues like compilation, memory management and source code, but in theory, any of them should be fitting to your problem.
It is impossible to tell which is best, and there is likely no major difference if you are familiar enough with all of them to work around their respective quirks.
And obviously, if you actually find the need to optimize (I'm not sure if that's a given), that's possible in each language. Lower level languages obviously offer more options, but are also (far) more complex to actually improve.
A single note about C++ vs Java: This is really a holy war, and if you've followed the recent development you'll probably have your own opinion. I, for one, think Java offers enough good aspects to make up for its flaws, usually.
And a final note on C++ vs C: According to my knowledge, the difference usually amounts to a low enough percentage to ignore. If it doesn't make a difference for the source code, it's fine to go with C; if C++ would make for easier-to-read source code, go with C++. In any case, the choice is kind of negligible.
In the end, remember that money spent on a few hours of programming/optimizing this could as well go into slightly superior hardware to make up for missed tiny details.
It all boils down to: Any of your options is fine as long as you do it right (domain knowledge).
I would use a language which makes it very easy to work on the algorithm. Get the algorithm right and it could very easily outweigh any advantage from fine-tuning the wrong algorithm. Don't be scared to play around in a language normally thought of as slow in execution speed if that language makes it easier to express algorithmic ideas. It is usually much easier to transcribe the right algorithm into another language than it is to eke out the last dregs of speed from the wrong algorithm in the fastest-executing language.
So do it in a language you are comfortable with and which is expressive. You might surprise yourself and find that what is produced is fast enough!

Is it possible to design a dynamic language without significant performance loss?

Is it possible to design something like Ruby or Clojure without the significant performance loss in many situations compared with C/Java? Does hardware design play a role?
Edit: By significant I mean an order of magnitude, not just ten percent
Edit: I suspect that delnan is correct that I mean dynamic languages, so I changed the title
Performance depends on many things. Of course the semantics of the language have to be preserved even if we are compiling it - you can't remove dynamic dispatch from Ruby; it would speed things up dramatically, but it would totally break 95% of all the Ruby code in the world. But still, much of the performance depends on how smart the implementation is.
I assume, by "high-level", you mean "dynamic"? Haskell and OCaml are extremely high-level, yet are compiled natively and can outperform C# or Java, even C and C++ in some corner cases - especially if parallelism comes into play. And they certainly weren't designed with performance as the #1 goal. But compiler writers, especially those focused on functional languages, are very clever folk. If you or I started a high-level language, even if we used e.g. LLVM as the backend for native compilation, we wouldn't get anywhere near this performance.
Making dynamic languages run fast is harder - they delay many decisions (types, members of a class/an object, ...) to runtime instead of compile time, and while static code analysis can sometimes prove that dynamic behaviour is not possible in lines n and m, you still have to carry an advanced runtime around and do quite a few things a static language's compiler can do at compile time. Even dynamic dispatch can be optimized with a smarter VM (inline caching, anyone?), but it's a lot of work. More than a small newfangled language could afford, that is.
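To make the inline-caching idea concrete, here is a toy sketch in Go of a monomorphic inline cache at a single call site; the method-table layout and names are invented for illustration, and a real VM would key the cache on the receiver's hidden class and invalidate it when that class changes:

```go
package main

import "fmt"

// method stands in for a dynamic language's method objects.
type method func(x int) int

// methodTables simulates the per-type method dictionaries a dynamic
// language would consult on every call (the "slow path").
var methodTables = map[string]map[string]method{
	"Int":   {"double": func(x int) int { return x * 2 }},
	"Float": {"double": func(x int) int { return x + x }},
}

// callSite is a monomorphic inline cache: it remembers the receiver
// type last seen at this call site and skips the dictionary lookup
// when the same type shows up again (the overwhelmingly common case).
type callSite struct {
	cachedType string
	cachedFn   method
}

func (cs *callSite) invoke(recvType, name string, x int) int {
	if recvType != cs.cachedType || cs.cachedFn == nil {
		// Cache miss: do the full lookup and remember the result.
		cs.cachedType = recvType
		cs.cachedFn = methodTables[recvType][name]
	}
	return cs.cachedFn(x) // cache hit: one comparison, then a direct call
}

func main() {
	var site callSite
	for i := 0; i < 3; i++ {
		fmt.Println(site.invoke("Int", "double", i)) // misses once, then hits
	}
}
```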
Also see Steve Yegge's Dynamic Languages Strike Back.
And of course, what is a significant performance loss? 100 times slower than C reads like a lot, but as we all know, 80% of execution time is spent in 20% of the code = 80% of the code won't have a notable impact on the perceived performance of the whole program. For the remaining 20%, you can always rewrite it in C or C++ and call it from the dynamic language. For many applications, this suffices (for some, you don't even need to optimize). For the rest... well, if performance is that critical, you should probably write it in a language designed for performance.
Don't confuse the language design with the platform that it runs on.
For instance, Java is a high-level language. It runs on the JVM (as does Clojure - identified above, and JRuby - a Java version of Ruby). The JVM will perform byte-code analysis and optimise how the code runs (making use of escape analysis, just-in-time compilation etc.). So the platform has an effect on the performance that is largely independent of the language itself (see here for more info on Java performance and comparisons to C/C++)
Loss compared to what? If you need a garbage collector or closures then you need them, and you're going to pay the price regardless. If a language makes them easy for you to get at, that doesn't mean you have to use them when you don't need them.
If a language is interpreted instead of compiled, that's going to introduce an order of magnitude slowdown. But such a language may have compensating advantages, like ease of use, platform independence, and not having to compile. And, the programs you write in them may not run long enough for speed to be an issue.
There may be language implementations that introduce slowness for no good reason, but those don't have to be used.
You might want to look at what the DARPA HPCS initiative has come up with. There were 3 programming languages proposed: Sun's Fortress, IBM's X10 and Cray's Chapel. The latter two are still under development. Whether any of these meet your definition of high-level I don't know.
And yes, hardware design certainly does play a part. All 3 of these languages are targeted at supercomputers with very many processors and exhibit features appropriate to that domain.
It's certainly possible. For example, Objective-C is a dynamically-typed language that has performance comparable to C++ (although a wee bit slower, generally speaking, but still roughly equivalent).

Fortran's performance

Fortran's performance on the Computer Language Benchmarks Game is surprisingly bad. Today's results put Fortran 14th and 11th on the two quad-core tests, and 7th and 10th on the single cores.
Now, I know benchmarks are never perfect, but still, Fortran was (is?) often considered THE language for high-performance computing, and it seems like the type of problems used in this benchmark should be to Fortran's advantage. In a recent article on computational physics, Landau (2008) wrote:
However, [Java] is not as efficient or as well supported for HPC and parallel processing as are FORTRAN and C, the latter two having highly developed compilers and many more scientific subroutine libraries available. FORTRAN, in turn, is still the dominant language for HPC, with FORTRAN 90/95 being a surprisingly nice, modern, and effective language; but alas, it is hardly taught by any CS departments, and compilers can be expensive.
Is it only because of the compiler used by the language shootout (Intel's free compiler for Linux) ?
No, this isn't just because of the compiler.
What benchmarks like this largely measure -- since the program differs from benchmark to benchmark -- is the amount of effort (and quality of effort) that the programmer put into writing each program. I suspect that Fortran is at a significant disadvantage on that particular metric -- unlike C and C++, the pool of programmers who'd want to try their hand at making the benchmark program better is pretty small, and unlike most anything else, they likely don't feel they have something to prove either. So there's no motivation for someone to spend a few days poring over generated assembly code and profiling the program to make it go faster.
This is fairly clear from the results that were obtained. In general, with sufficient programming effort and a decent compiler, neither C, C++, nor Fortran will be significantly slower than assembly code -- certainly not more than 5-10% worse, except for pathological cases. The fact that the actual results obtained here vary more than that indicates to me that "sufficient programming effort" has not been expended.
There are exceptions when you allow the assembly to use vector instructions, but don't allow the C/C++/Fortran to use corresponding compiler intrinsics -- automatic vectorization is not even a close approximation of perfect and probably never will be. I don't know how much those are likely to apply here.
Similarly, an exception is in things like string handling, where you depend heavily on the runtime library (which may be of varying quality; Fortran is rarely a case where a fast string library will make money for the compiler vendor!), and on the basic definition of a "string" and how that's represented in memory.
Some random thoughts:
Fortran used to do very well because it was easier to identify loop invariants, which made some optimizations easier for the compiler. Since then:
Compilers have gotten much more sophisticated. Enormous effort has been put into C and C++ compilers in particular. Have the Fortran compilers kept up? I suppose gfortran uses the same back end as gcc and g++, but what of the Intel compiler? It used to be good, but is it still?
Some languages have gained specialized keywords and syntax to help the compiler (restrict and const int *const p in C, inline in C++). Not knowing Fortran 90 or 95, I can't say whether these have kept pace.
I've looked at these tests. It's not like the compiler is wrong or something. In most tests Fortran is comparable to C++, except for some where it gets beaten by a factor of 10. These tests just reflect what one should know from the beginning - Fortran is simply NOT an all-around interoperable programming language - it is suited for efficient computation and has good list operations & stuff, but for example its I/O sucks unless you do it with specific Fortran-like methods - e.g. 'unformatted' I/O.
Let me give you an example: the 'reverse-complement' program is supposed to read a large (on the order of 10^8 bytes) file from stdin line by line, do something with it, and print the resulting large file to stdout. The pretty straightforward Fortran program is about 10 times slower on a single core (~10s) than a HEAVILY optimized C++ one (~1s). When you play with the program, you'll see that the simple formatted read & write alone take more than 8 seconds. In the Fortran way, if you care about efficiency, you'd just write an unformatted structure to a file & read it back in no time (which is totally non-portable & stuff, but who cares anyway - efficient code is supposed to be fast & optimized for a specific machine, not able to run everywhere).
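The bulk-versus-formatted I/O effect is not specific to Fortran. Here is a rough sketch in Go (keeping with the language used at the top of this page) of the bulk approach: copy stdin to stdout through large buffers instead of per-line formatted reads.

```go
package main

import (
	"bufio"
	"io"
	"os"
)

func main() {
	// Large buffers turn millions of small, formatted reads into a
	// handful of bulk ones; the per-byte parsing cost disappears.
	in := bufio.NewReaderSize(os.Stdin, 1<<20)
	out := bufio.NewWriterSize(os.Stdout, 1<<20)
	defer out.Flush()

	// io.Copy moves the data in big chunks; compare with reading
	// stdin line by line through formatted input routines.
	if _, err := io.Copy(out, in); err != nil {
		panic(err)
	}
}
```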
So the short answer is - don't worry, just do your job - and if you want to write a super-efficient operating system, then sorry - Fortran is just not the way to get that kind of performance.
This benchmark is just stupid.
For example, they measure CPU time for the whole program to run. As mcmint stated (and it might actually be true), Fortran I/O sucks*. But who cares? In real-world tasks one reads input for a few seconds, then does calculations for hours/days/months, and finally writes output for a few seconds. That's why in most benchmarks I/O operations are excluded from time measurements (unless, of course, you are benchmarking I/O itself).
Norbert Wiener in his book God & Golem, Inc. wrote:
Render unto man the things which are man’s and unto the computer the things which are the computer’s.
In my opinion, applying this principle when implementing an algorithm in any programming language means:
Write code as readable and simple as you can, and let the compiler do the optimizations.
This is especially important in real-world (huge) applications. Dirty tricks (so heavily used in many benchmarks), even if they might improve efficiency to some extent (5%, maybe 10%), are not for real-world projects.
* C/C++ uses stream I/O, but Fortran traditionally uses record-based I/O. Further reading. Anyway, the I/O in those benchmarks is surprising. The use of stdin/stdout redirection might also be a source of the problem. Why not simply use the ability to read/write files provided by the language or standard library? Once again, this would be a more real-world situation.
I would like to say that even if the benchmark does not bring out the best results for FORTRAN, this language will still be used, and for a long time. The reasons for its use are not just performance but also a certain ease of programming. Lots of people who learnt to use it in the '60s and '70s are now too old to get into new stuff, and they know how to use FORTRAN pretty well. I mean, there are a lot of human factors in a language's use. The programmer also matters.
Considering they did not publish the exact compiler options they used for the Intel Fortran Compiler, I have little faith in their benchmark.
I would also remark that both Intel's math library, MKL, and AMD's math library, ACML, use the Intel Fortran Compiler.
Edit:
I did find the compilation options when you click on the benchmark's name. The result is surprising since the optimization level seems reasonable. It may come down to the efficiency of the algorithm.

Are functional languages inherently slow? [closed]

Why are functional languages always trailing behind C in benchmarks? If you have a statically typed functional language, it seems to me it could be compiled to the same code as C, or to even more optimized code, since more semantics are available to the compiler. Why does it seem like all functional languages are slower than C, and why do they always need garbage collection and excessive use of the heap?
Does anyone know of a functional language appropriate for embedded / real-time applications, where memory allocation is kept to a minimum and the produced machine code is lean and fast?
Are functional languages inherently slow?
In some sense, yes. They require infrastructure that inevitably adds overheads over what can theoretically be attained using assembler by hand. In particular, first-class lexical closures only work well with garbage collection because they allow values to be carried out of scope.
Why are functional languages always trailing behind C in benchmarks?
Firstly, beware of selection bias. C acts as a lowest common denominator in benchmark suites, limiting what can be accomplished. If you have a benchmark comparing C with a functional language then it is almost certainly an extremely simple program. Arguably so simple that it is of little practical relevance today. It is not practically feasible to solve more complicated problems using C for a mere benchmark.
The most obvious example of this is parallelism. Today, we all have multicores. Even my phone is a multicore. Multicore parallelism is notoriously difficult in C but can be easy in functional languages (I like F#). Other examples include anything that benefits from persistent data structures, e.g. undo buffers are trivial with purely functional data structures but can be a huge amount of work in imperative languages like C.
Why does it seem like all functional languages are slower than C, and why do they always need garbage collection and excessive use of the heap?
Functional languages will seem slower because you'll only ever see benchmarks comparing code that is easy enough to write well in C and you'll never see benchmarks comparing meatier tasks where functional languages start to excel.
However, you've correctly identified what is probably the single biggest bottleneck in functional languages today: their excessive allocation rates. Nice work!
The reasons why functional languages allocate so heavily can be split into historical and inherent reasons.
Historically, Lisp implementations have been doing a lot of boxing for 50 years now. This characteristic spread to many other languages which use Lisp-like intermediate representations. Over the years, language implementers have continually resorted to boxing as a quick fix for complications in language implementation. In object oriented languages, the default has been to always heap allocate every object even when it can obviously be stack allocated. The burden of efficiency was then pushed onto the garbage collector and a huge amount of effort has been put into building garbage collectors that can attain performance close to that of stack allocation, typically by using a bump-allocating nursery generation. I think that a lot more effort should be put into researching functional language designs that minimize boxing and garbage collector designs that are optimized for different requirements.
Generational garbage collectors are great for languages that heap allocate a lot because they can be almost as fast as stack allocation. But they add substantial overheads elsewhere. Today's programs are increasingly using data structures like queues (e.g. for concurrent programming) and these give pathological behaviour for generational garbage collectors. If the items in the queue outlive the first generation then they all get marked, then they all get copied ("evacuated"), then all of the references to their old locations get updated and then they become eligible for collection. This is about 3× slower than it needs to be (e.g. compared to C). Mark region collectors like Beltway (2002) and Immix (2008) have the potential to solve this problem because the nursery is replaced with a region that can either be collected as if it were a nursery or, if it contains mostly reachable values, it can be replaced with another region and left to age until it contains mostly unreachable values.
Despite the pre-existence of C++, the creators of Java made the mistake of adopting type erasure for generics, leading to unnecessary boxing. For example, I benchmarked a simple hash table running 17× faster on .NET than the JVM partly because .NET did not make this mistake (it uses reified generics) and also because .NET has value types. I actually blame Lisp for making Java slow.
All modern functional language implementations continue to box excessively. JVM-based languages like Clojure and Scala have little choice because the VM they target cannot even express value types. OCaml sheds type information early in its compilation process and resorts to tagged integers and boxing at run-time to handle polymorphism. Consequently, OCaml will often box individual floating point numbers and always boxes tuples. For example, a triple of bytes in OCaml is represented by a pointer (with an implicit 1-bit tag embedded in it that gets checked repeatedly at run-time) to a heap-allocated block with a 64 bit header and 192 bit body containing three tagged 63-bit integers (where the 3 tags are, again, repeatedly examined at run time!). This is clearly insane.
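Go (the language this page opened with) happens to illustrate the unboxed/boxed contrast nicely. A small sketch, with invented sizes, showing a value type stored inline versus boxed behind interface headers:

```go
package main

import (
	"fmt"
	"unsafe"
)

// triple is a plain value type: three bytes side by side, no header,
// no pointer, no run-time tags.
type triple struct{ a, b, c byte }

func main() {
	// One contiguous allocation; each element occupies 3 bytes
	// (plus padding) directly in the slice's backing array.
	unboxed := make([]triple, 1000)

	// Storing the same values as interface{} boxes each one behind
	// a (type, pointer) header, forcing a heap object per element --
	// the situation described above for OCaml tuples.
	boxed := make([]interface{}, 1000)
	for i := range boxed {
		boxed[i] = triple{1, 2, 3}
	}

	fmt.Println(unsafe.Sizeof(unboxed[0])) // 3: the value itself
	fmt.Println(unsafe.Sizeof(boxed[0]))   // 16: the interface header alone
}
```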
Some work has been done on unboxing optimizations in functional languages but it never really gained traction. For example, the MLton compiler for Standard ML was a whole-program optimizing compiler that did sophisticated unboxing optimizations. Sadly, it was before its time and the "long" compilation times (probably under 1s on a modern machine!) deterred people from using it.
The only major platform to have broken this trend is .NET but, amazingly, it appears to have been an accident. Despite having a Dictionary implementation very heavily optimized for keys and values that are of value types (because they are unboxed) Microsoft employees like Eric Lippert continue to claim that the important thing about value types is their pass-by-value semantics and not the performance characteristics that stem from their unboxed internal representation. Eric seems to have been proven wrong: more .NET developers seem to care more about unboxing than pass-by-value. Indeed, most structs are immutable and, therefore, referentially transparent so there is no semantic difference between pass-by-value and pass-by-reference. Performance is visible and structs can offer massive performance improvements. The performance of structs even saved Stack Overflow and structs are used to avoid GC latency in commercial software like Rapid Addition's!
The other reason for heavy allocation by functional languages is inherent. Imperative data structures like hash tables use huge monolithic arrays internally. If these were persistent then the huge internal arrays would need to be copied every time an update was made. So purely functional data structures like balanced binary trees are fragmented into many little heap-allocated blocks in order to facilitate reuse from one version of the collection to the next.
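A minimal sketch of this path-copying idea in Go (an unbalanced tree with invented names, for illustration only; real persistent collections also rebalance):

```go
package main

import "fmt"

// node is an immutable binary-search-tree node; updates copy the
// path from the root to the changed leaf and share everything else.
type node struct {
	key         int
	left, right *node
}

// insert returns a new root. Only O(depth) nodes are copied; the
// rest are shared with the old version of the tree.
func insert(n *node, key int) *node {
	if n == nil {
		return &node{key: key}
	}
	switch {
	case key < n.key:
		return &node{key: n.key, left: insert(n.left, key), right: n.right}
	case key > n.key:
		return &node{key: n.key, left: n.left, right: insert(n.right, key)}
	default:
		return n // key already present; old version returned unchanged
	}
}

func main() {
	v1 := insert(insert(insert(nil, 5), 3), 8)
	v2 := insert(v1, 4) // v1 remains valid and unchanged
	fmt.Println(v1.left.right == nil, v2.left.right.key) // true 4
}
```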
Clojure uses a neat trick to alleviate this problem when collections like dictionaries are only written to during initialization and are then read from a lot. In this case, the initialization can use mutation to build the structure "behind the scenes". However, this does not help with incremental updates and the resulting collections are still substantially slower to read than their imperative equivalents. On the up-side, purely functional data structures offer persistence whereas imperative ones do not. However, few practical applications benefit from persistence in practice so this is often not advantageous. Hence the desire for impure functional languages where you can drop to imperative style effortlessly and reap the benefits.
Does anyone know of a functional language appropriate for embedded / real-time applications, where memory allocation is kept to a minimum and the produced machine code is lean and fast?
Take a look at Erlang and OCaml if you haven't already. Both are reasonable for memory constrained systems but neither generate particularly great machine code.
Nothing is inherently anything. Here is an example where interpreted OCaml runs faster than equivalent C code, because the OCaml optimizer has different information available to it, due to differences in the language. Of course, it would be foolish to make a general claim that OCaml is categorically faster than C. The point is, it depends upon what you're doing, and how you do it.
That said, OCaml is an example of a (mostly) functional language which is actually designed for performance, in contrast to purity.
Functional languages require the elimination of mutable state that is visible at the level of the language abstraction. Therefore, data that would be mutated in place by an imperative language needs to be copied instead, with the mutation taking place on the copy. For a simple example, see a quick sort in Haskell vs. C.
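For a feel of that copying cost, here is a toy Go rendering of the Haskell-style quicksort; it allocates fresh slices at every level rather than swapping in place (illustrative only, not how one would sort in Go):

```go
package main

import "fmt"

// qsortCopy mirrors the classic two-line Haskell quicksort: instead
// of partitioning in place, it builds new slices at each level,
// which is exactly the copy-on-mutation overhead described above.
func qsortCopy(xs []int) []int {
	if len(xs) <= 1 {
		return xs
	}
	pivot := xs[0]
	var less, more []int
	for _, x := range xs[1:] {
		if x < pivot {
			less = append(less, x) // new storage, not in-place swaps
		} else {
			more = append(more, x)
		}
	}
	out := append(qsortCopy(less), pivot)
	return append(out, qsortCopy(more)...)
}

func main() {
	fmt.Println(qsortCopy([]int{5, 2, 8, 1, 9, 3})) // [1 2 3 5 8 9]
}
```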
Furthermore, garbage collection is required because free() is not a pure function, as it has side effects. Therefore, the only way to free memory that does not involve side effects at the level of the language abstraction is with garbage collection.
Of course, in principle, a sufficiently smart compiler could optimize out much of this copying. This is already done to some degree, but making the compiler sufficiently smart to understand the semantics of your code at that level is just plain hard.
The short answer: because C is fast. As in, blazingly ridiculously crazy fast. A language simply doesn't have to be 'slow' to get its rear handed to it by C.
The reason why C is fast is that it was created by really great coders, and gcc has been optimized over the course of a couple more decades and by dozens more brilliant coders than 99% of languages out there.
In short, you're not going to beat C except for specialized tasks that require very specific functional programming constructs.
The control flow of procedural languages matches the actual processing patterns of modern computers much better.
C maps very closely onto the assembly code its compilation produces, hence the nickname "cross-platform assembly". Computer manufacturers have spent a few decades making assembly code run as fast as possible, so C inherits all of this raw speed.
In comparison, the no-side-effects, inherent parallelism of functional languages does not map onto a single processor at all well. The arbitrary order in which functions can be invoked needs to be serialised down to the CPU bottleneck: without extremely clever compilation, you're going to be context switching all the time, none of the pre-fetching will work because you're constantly jumping all over the place... Basically, all the optimisation work that computer manufacturers have done for nice, predictable procedural languages is pretty much useless.
However! With the move towards lots of less powerful cores (rather than one or two turbo-charged cores), functional languages should begin to close the gap, as they naturally scale horizontally.
C is fast because it's basically a set of macros for assembler :) There is no "behind the scenes" when you are writing a program in C. You allocate memory when you decide it's time to do that, and you free it in the same fashion. This is a huge advantage when you are writing a real-time application, where predictability is important (more than anything else, actually).
Also, C compilers are generally extremely fast because the language itself is simple. It doesn't even do much type checking :) This also means that it is easier to make hard-to-find errors.
An advantage of the weak type checking is that a function can be exported simply by its name, which makes C code easy to link with other languages' code.
Well, Haskell is only 1.8 times slower than GCC's C++, which is faster than GCC's C implementation, for typical benchmark tasks.
That makes Haskell very fast, even faster than C# (Mono, that is).
relative speed  Language
1.0             C++ GNU g++
1.1             C GNU gcc
1.2             ATS
1.5             Java 6 -server
1.5             Clean
1.6             Pascal Free Pascal
1.6             Fortran Intel
1.8             Haskell GHC
2.0             C# Mono
2.1             Scala
2.2             Ada 2005 GNAT
2.4             Lisp SBCL
3.9             Lua LuaJIT
source
For the record I use Lua for Games on the iPhone, thus you could easily use Haskell or Lisp if you prefer, since they are faster.
As for now, functional languages aren't used heavily for industry projects, so not enough serious work goes into optimizers. Also, optimizing imperative code for an imperative target is probably way easier.
Functional languages have one feature that will let them outdo imperative languages really soon now: trivial parallelization.
Trivial not in the sense that it is easy, but in the sense that it can be built into the language environment, without the developer needing to think about it.
The cost of robust multithreading in a thread-agnostic language like C is prohibitive for many projects.
I disagree with tuinstoel. The important question is whether the functional language provides faster development time and results in faster code when it is used for what functional languages were meant to be used for. See the efficiency issues section on Wikipedia for a glimpse of what I mean.
One more reason for bigger executable size could be lazy evaluation and non-strictness. The compiler can't figure out at compile-time when certain expressions get evaluated, so some runtime gets stuffed into the executable to handle this (to call upon the evaluation of the so-called thunks). As for performance, laziness can be both good and bad. On one hand it allows for additional potential optimization, on the other hand the code size can be larger and programmers are more likely to make bad decisions, e.g. see Haskell's foldl vs. foldr vs. foldl' vs. foldr'.
