Java vs. Julia: Differences in JIT Compiling and Resulting Performance - performance

I've recently started reading about JIT compilation. On another note, I've read that well-written Julia code often performs on-par with statically compiled languages (see, e.g., paragraph 2 of the introduction section of the Julia docs) while I've recurrently heard Java often does not. Why is that?
On the surface, they seem to have in common that they both run JIT-compiled bytecode in a VM (although I am aware that Java dynamically infers which code to JIT). While I can rationalize the performance difference in Julia vs. (purely) interpreted languages like (vanilla) Python, how come two JIT-compiled languages have such different reputations for performance? Speaking of performance, I am particularly referring to scientific computing applications.
Please note that this question is intentionally phrased broadly. I feel like its possible answers could give me insights into what defines fast Julia code, given the way Julia's compiler works in comparison to other JIT compiled languages.

While AFAIK there is currently one implementation of Julia, there are several implementations of Java and not all behave the same nor use the same technics internally. Thus it does not mean much to compare languages. For example, GCJ is a GNU compiler meant to compile Java codes to native ones (ie. no JIT nor bytecode). It is now a defunct project since the open-source JIT-based implementations super-seeded this project (AFAIK even performance-wise).
The primary reference Java VM implementation is HotSpot (made by Oracle). The JIT of HotSpot use an adaptative strategy for compiling functions so to reduce the latency of the compilation. The code can be interpreted for a short period of time and if it is executed many times, then the JIT use more aggressive optimizations with multiple levels. As a result hot loops are very aggressively optimized while glue code executed once is mostly interpreted. Meanwhile, Julia is based on the LLVM compiler stack capable of producing very efficient code (it is used by Clang which is a compiler used to compile C/C++ code to native one), but it is also not yet very well suited for very dynamic codes (it works but the latency is pretty big compared to other existing JIT implementations).
The thing is Java and Julia target different domains. Java is used for example on embedded systems where latency matters a lot. It is also use for GUI applications and Web servers. Introducing a high latency during the execution is not very reasonable. This is especially why Java implementation spent a lot of time in the past so to optimize the GC (Garbage Collector) in order to reduce the latency of collections. Julia mainly target HPC/scientific applications that do not care much about latency. The main goal of Julia is to minimize the wall-clock time and not the responsiveness of the application.
I've read that well-written Julia code often performs on-par with statically compiled languages
Well, optimizing JITs like the one of Julia or the one of HotSpot are very good nowadays to compile scalar codes in hot loops. Their weakness lies in the capability to apply high-level expensive computations. For example, optimizing compilers like ICC/PGI can use the polyhedral model so to completely rewrite loops and vectorize them efficiently using SIMD instructions. This is frequent in HPC (numerically intensive) applications but very rare in embedded/Web/GUI ones. The use of the best specific instructions on the available platform is not always great in most JIT implementations (eg. bit operations) though the situation is rapidly improving. On the other hand, JIT can outperform static compilers by using runtime informations. For example, they can assume a value is a constant and optimize expressions based on that (eg. a runtime-dependent stride of 1 of a multi-dimensional array do not need additional multiplications). Still, static compilers can do similar optimisation with profile-guided optimizations (unfortunately rarely used in practice).
However, there is a catch: languages likes C/C++ compiled natively have access to lower-level features barely available in Java. This is a bit different in Julia since the link with native language code is easier and inline assembly is possible (as pointed out by #OscarSmith) enabling skilled developers to write efficient wrappers. Julia and Java use a GC that can speed up a bit some unoptimized codes but also slow down a lot some others (typically code manipulating big data-structures with a lot of references likes trees and graphs, especially in parallel codes). This is why a C/C++ code can significantly outperform a Julia/Java code. While JIT implementations can sometime (but rarely) outperform static C/C++ compilers, no compilers are perfect and there are case where nearly all implementations perform poorly. The use of (low-level) intrinsics enable developers to implement possibly faster codes at the expense of a lower portability and a higher complexity. This is not rare for SIMD code since auto-vectorization is far from being great so far. More generally, the access to lower-level features (eg. operating system specific functions, parallel tools) help to write faster codes for skilled programmers.
Chosen algorithms and methods matters often far more than the target language implementation. The best algorithm/method in one language implementation may not be the best in another. Two best algorithms/methods of two different implementation are generally hard to compare (it is fair to compare only the performance of codes if one is is nearly impossible to maintain and is very hard/long to write without bugs?). This is partially why comparing language implementation is so difficult, even on a specific problem to solve.
(purely) interpreted languages like (vanilla) Python
While the standard implementation of Python is the CPython interpreter, there are fast JIT for Python like PyPy or Pyston.
Speaking of performance, I am particularly referring to scientific computing applications
Note that scientific computing applications is still quite broad. While physicist tends to write heavily numerically intensive applications operating on large contiguous arrays where the use of multiple threads and SIMD instruction is critical, biologist tends to write codes requiring very different optimizations. For example, genomic codes tends to do a lot of string matching operations. They also often make use of complex data-structures/algorithms (eg. phylogenetic tree, compression).
Some Java features like boxing are performance killers for such applications though there are often complex way to mitigate their cost.
You may be interested by this famous language benchmark:
Julia VS C-GCC (one can see that Julia and Java are slow for binary trees, as expected, certainly due to the GC, though the Java's GC is more efficient at the expense of a bigger memory usage)
Julia VS Java-OpenJDK
C-GCC VS C-Clang
As you can see in the benchmark, the fastest implementations are generally the more-complex and/or bigger ones using the best algorithms and lower-level methods/tricks.

Related

Does the speed of programming language depend on compiler?

I am trying to learn how computer programs work and have this question. I often read articles like "C/C++ are faster than java" or "Java and C#: speed comparison". In all cases programs written in any language is translated to assembly language. So, what is the reason of speed differences in those languages. Does that mean, the compiler of one language generates better and faster assembly code ?
Sort of.
There are several reasons why speeds will differ between compilers/interpreters/programming languages, some of it having to do with the compiler, and some of it having to do with the language itself.
Some programming languages require more overhead.
If your language is very high level, it'll have more overhead compared to C, which is very low level. (garbage collection is a good example of this). It becomes a tradeoff. Do I want blazing fast binaries, or do I want to be able to write programs easily?
Languages are designed to do different things.
For example, PHP is designed to be used on web servers, and nobody in their right mind would try and use it to create a top-tier fps game. Different languages are better suited for different tasks, and will be faster in some areas then in others.
Not all languages compile to assembly.
While C/C++ may compile to assembly, languages like Java instead compile to bytecode and is run against the java virtual machine, for interoperability reasons. Once again, this is a tradeoff -- you gain portability at the expense of overhead.
Furthermore, C/C++ doesn't even have to compile to assembly. For example, enscriptem ultimately will compile C/C++ to Javascript so it can run on web browsers.
Compilers are not magic.
They're programs, and like all programs, have bugs and will improve (or degrade) over time. I could try writing a C compiler over the weekend, and I'd bet a million dollars that it would perform several orders of magnitude worse then a compiler/interpreter for the slowest language you can think of.
Compiler/interpreter optimization is an ongoing field of research and study.
Every year, researchers are writing and publishing papers on a new way to compile and make programs run faster. If a language is newer, it may not yet have had the time to fully apply every optimization available. (see above). Some optimizations may apply to only one kind of compiler/interpreter.
So, to summarize, the speed of a language is a mixture of the intrinsic features of the language itself, along with the maturity of the compiler/interpreter/platform used.
Compilers and interpreters are not some monolithic magical process that's constant between all programming languages -- they're all different, have different benefits and disadvantages, and are constantly in a state of flux.

Why Clojure/Lisp Programs are faster than other dynamic languages?

Based on language shootouts in a past few years, Clojure and other Lisps consistently perform better than most other dynamic languages. Why is that so?
Is it because of its homoiconicity?
Edit:
I did not know that Clojure is compiled into bytecode just like Java and Scala.
This stackoverflow thread threw light on why Clojure reaps the advantage of being both compiled and interpreted.
It is almost impossible to answer this question - it depends very much on how well the benchmark code is written, what exactly is being tested, whether you are allowed to use libraries that exploit native code, whether you are writing "idiomatic" code or optimising for performance etc.
So as always, you should treat all microbenchmarks with caution.
Having said that, the following reasons might give Clojure an advantage however in certain circumstances:
Clojure is always compiled - first down to bytecode, then down to native code by the JVM JIT compiler. This can give it a speed advantage in many cases, particularly over langauges that rely on some form of interpretation. In theory at least, you should be able to match pure Java speed in any circumstance where Clojure enables you to produce the same bytecode (which is reasonably often, though not always....)
Clojure can exploit JVM primitives and static typing - despite being a dynamic language, Clojure will compile statically types or primitive code if you give it enough hints. This can easily give a 10x boost in performance, though at the expense of making your code a bit longer/uglier.....
Clojure has heavily optimised certain data structures and operations - in particular the immutable persistent data structures and certain functional programming constructs like "reduce".
Macros enable powerful compile-time optimisations - if you use macros cleverly, you can do some quite sophisticated optimisations at compile time, effectively using code-generation to produce the code that will be most efficient at runtime. This is an advantage many Lisps share (especially Common Lisp, which was one of the big inspirations for Clojure). As nulvinge points out, homoiconicity isn't strictly necessary to achieve this (C++ also has macros!), but being a homoiconic language makes macros a lot easier.
The modern JVMs are brilliantly engineered - Clojure takes advantage of the thousands of man-years of engineering that have gone into the JVM, the Java runtime libraries, the garbage collection algorithms etc. Non-JVM languages don't get these benefits.

Performance of Google's Go?

So has anyone used Google's Go? I was wondering how the mathematical performance (e.g. flops) is compared to other languages with a garbage collector... like Java or .NET?
Has anyone investigated this?
Theoretical performance: The theoretical performance of pure Go programs is somewhere between C/C++ and Java. This assumes an advanced optimizing compiler and it also assumes the programmer takes advantage of all features of the language (be it C, C++, Java or Go) and refactors the code to fit the programming language.
Practical performance (as of July 2011): The standard Go compiler (5g/6g/8g) is currently unable to generate efficient instruction streams for high-performance numerical codes, so the performance will be lower than C/C++ or Java. There are multiple reasons for this: each function call has an overhead of a couple of additional instructions (compared to C/C++ or Java), no function inlining, average-quality register allocation, average-quality garbage collector, limited ability to erase bound checks, no access to vector instructions from Go, compiler has no support for SSE2 on 32-bit x86 CPUs, etc.
Bottom line: As a rule of thumb, expect the performance of numerical codes implemented in pure Go, compiled by 5g/6g/8g, to be 2 times lower than C/C++ or Java. Expect the performance to get better in the future.
Practical performance (September 2013): Compared to older Go from July 2011, Go 1.1.2 is capable of generating more efficient numerical codes but they remain to run slightly slower than C/C++ and Java. The compiler utilizes SSE2 instructions even on 32-bit x86 CPUs which causes 32-bit numerical codes to run much faster, most likely thanks to better register allocation. The compiler now implements function inlining and escape analysis. The garbage collector has also been improved but it remains to be less advanced than Java's garbage collector. There is still no support for accessing vector instructions from Go.
Bottom line: The performance gap seems sufficiently small for Go to be an alternative to C/C++ and Java in numerical computing, unless the competing implementation is using vector instructions.
The Go math package is largely written in assembler for performance.
Benchmarks are often unreliable and are subject to interpretation. For example, Robert Hundt's paper Loop Recognition in C++/Java/Go/Scala looks flawed. The Go blog post on Profiling Go Programs dissects Hundt's claims.
You're actually asking several different questions. First of all, Go's math performance is going to be about as fast as anything else. Any language that compiles down to native code (which arguably includes even JIT languages like .NET) is going to perform extremely well at raw math -- as fast as the machine can go. Simple math operations are very easy to compile into a zero-overhead form. This is the area where compiled (including JIT) languages have a advantage over interpreted ones.
The other question you asked was about garbage collection. This is, to a certain extent, a bit of a side issue if you're talking about heavy math. That's not to say that GC doesn't impact performance -- actually it impacts quite a bit. But the common solution for tight loops is to avoid or minimize GC sweeps. This is often quite simple if you're doing a tight loop -- you just re-use your old variables instead of constantly allocating and discarding them. This can speed your code by several orders of magnitude.
As for the GC implementations themselves -- Go and .NET both use mark-and-sweep garbage collection. Microsoft has put a lot of focus and engineering into their GC engine, and I'm obliged to think that it's quite good all things considered. Go's GC engine is a work in progress, and while it doesn't feel any slower than .NET's architecture, the Golang folks insist that it needs some work. The fact that Go's specification disallows destructors goes a long way in speeding things up, which may be why it doesn't seem that slow.
Finally, in my own anecdotal experience, I've found Go to be extremely fast. I've written very simple and easy programs that have stood up in my own benchmarks against highly-optimized C code from some long-standing and well-respected open source projects that pride themselves on performance.
The catch is that not all Go code is going to be efficient, just like not all C code is efficient. You've got to build it correctly, which often means doing things differently than what you're used to from other languages. The profiling blog post mentioned here several times is a good example of that.
Google did a study comparing Go to some other popular languages (C++, Java, Scala). They concluded it was not as strong performance-wise:
https://days2011.scala-lang.org/sites/days2011/files/ws3-1-Hundt.pdf
Quote from the Conclusion, about Go:
Go offers interesting language features, which also allow for a concise and standardized notation. The compilers for this language are still immature, which reflects in both performance and binary sizes.

Is it possible to design a dynamic language without significant performance loss?

Is it possible to design something like Ruby or Clojure without the significant performance loss in many situations compared with C/Java? Does hardware design play a role?
Edit: With significant I mean in an order of magnitudes, not just ten procent
Edit: I suspect that delnan is correct with me meaning dynamic languages so I changed the title
Performance depends on many things. Of course the semantics of the language have to be preserved even if we are compiling it - you can't remove dynamic dispatch from Ruby, it would speed things up drmatically but it would totally break 95% of the all Ruby code in the world. But still, much of the performance depends on how smart the implementation is.
I assume, by "high-level", you mean "dynamic"? Haskell and OCaml are extremely high-level, yet are is compiled natively and can outperform C# or Java, even C and C++ in some corner cases - especially if parallelism comes into play. And they certainly weren't designed with performance as #1 goal. But compiler writers, especially those focused onfunctional languages, are a very clever folk. If you or I started a high-level language, even if we used e.g. LLVM as backend for native compilation, we wouldn't get anywhere near this performance.
Making dynamic languages run fast is harder - they delay many decisions (types, members of a class/an object, ...) to runtime instead of compiletime, and while static code analysis can sometimes prove it's not possible in lines n and m, you still have to carry an advanced runtime around and do quite a few things a static language's compiler can do at compiletime. Even dynamic dispatch can be optimized with a smarter VM (Inline Cache anyone?), but it's a lot of work. More than a small new-fangeled language could do, that is.
Also see Steve Yegge's Dynamic Languages Strike Back.
And of course, what is a significant peformance loss? 100 times slower than C reads like a lot, but as we all know, 80% of execution time is spent in 20% of the code = 80% of the code won't have notable impact on the percieved performance of the whole program. For the remaining 20%, you can always rewrite it in C or C++ and call it from the dynamic language. For many applications, this suffices (for some, you don't even need to optimize). For the rest... well, if performance is that critical, you should propably write it in a language designed for performance.
Don't confuse the language design with the platform that it runs on.
For instance, Java is a high-level language. It runs on the JVM (as does Clojure - identified above, and JRuby - a Java version of Ruby). The JVM will perform byte-code analysis and optimise how the code runs (making use of escape analysis, just-in-time compilation etc.). So the platform has an effect on the performance that is largely independent of the language itself (see here for more info on Java performance and comparisons to C/C++)
Loss compared to what? If you need a garbage collector or closures then you need them, and you're going to pay the price regardless. If a language makes them easy for you to get at, that doesn't mean you have to use them when you don't need them.
If a language is interpreted instead of compiled, that's going to introduce an order of magnitude slowdown. But such a language may have compensating advantages, like ease of use, platform independence, and not having to compile. And, the programs you write in them may not run long enough for speed to be an issue.
There may be language implementations that introduce slowness for no good reason, but those don't have to be used.
You might want to look at what the DARPA HPCS initiative has come up with. There were 3 programming languages proposed: Sun's Fortress, IBM's X10 and Cray's Chapel. The latter two are still under development. Whether any of these meet your definition of high-level I don't know.
And yes, hardware design certainly does play a part. All 3 of these languages are targeted at supercomputers with very many processors and exhibit features appropriate to that domain.
It's certainly possible. For example, Objective-C is a dynamically-typed language that has performance comparable to C++ (although a wee bit slower, generally speaking, but still roughly equivalent).

Are functional languages inherently slow? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
Why are functional languages always tailing behind C in benchmarks? If you have a statically typed functional language, it seems to me it could be compiled to the same code as C, or to even more optimized code since more semantics are available to the compiler. Why does it seem like all functional languages are slower than C, and why do they always need garbage collection and excessive use of the heap?
Does anyone know of a functional language appropriate for embedded / real-time applications, where memory allocation is kept to a minimum and the produced machine code is lean and fast?
Are functional languages inherently slow?
In some sense, yes. They require infrastructure that inevitably adds overheads over what can theoretically be attained using assembler by hand. In particular, first-class lexical closures only work well with garbage collection because they allow values to be carried out of scope.
Why are functional languages always tailing behind C in benchmarks?
Firstly, beware of selection bias. C acts as a lowest common denominator in benchmark suites, limiting what can be accomplished. If you have a benchmark comparing C with a functional language then it is almost certainly an extremely simple program. Arguably so simple that it is of little practical relevance today. It is not practically feasible to solve more complicated problems using C for a mere benchmark.
The most obvious example of this is parallelism. Today, we all have multicores. Even my phone is a multicore. Multicore parallelism is notoriously difficult in C but can be easy in functional languages (I like F#). Other examples include anything that benefits from persistent data structures, e.g. undo buffers are trivial with purely functional data structures but can be a huge amount of work in imperative languages like C.
Why does it seem like all functional languages are slower than C, and why do they always need garbage collection and excessive use of the heap?
Functional languages will seem slower because you'll only ever see benchmarks comparing code that is easy enough to write well in C and you'll never see benchmarks comparing meatier tasks where functional languages start to excel.
However, you've correctly identified what is probably the single biggest bottleneck in functional languages today: their excessive allocation rates. Nice work!
The reasons why functional languages allocate so heavily can be split into historical and inherent reasons.
Historically, Lisp implementations have been doing a lot of boxing for 50 years now. This characteristic spread to many other languages which use Lisp-like intermediate representations. Over the years, language implementers have continually resorted to boxing as a quick fix for complications in language implementation. In object oriented languages, the default has been to always heap allocate every object even when it can obviously be stack allocated. The burden of efficiency was then pushed onto the garbage collector and a huge amount of effort has been put into building garbage collectors that can attain performance close to that of stack allocation, typically by using a bump-allocating nursery generation. I think that a lot more effort should be put into researching functional language designs that minimize boxing and garbage collector designs that are optimized for different requirements.
Generational garbage collectors are great for languages that heap allocate a lot because they can be almost as fast as stack allocation. But they add substantial overheads elsewhere. Today's programs are increasingly using data structures like queues (e.g. for concurrent programming) and these give pathological behaviour for generational garbage collectors. If the items in the queue outlive the first generation then they all get marked, then they all get copied ("evacuated"), then all of the references to their old locations get updated and then they become eligible for collection. This is about 3× slower than it needs to be (e.g. compared to C). Mark region collectors like Beltway (2002) and Immix (2008) have the potential to solve this problem because the nursery is replaced with a region that can either be collected as if it were a nursery or, if it contains mostly reachable values, it can be replaced with another region and left to age until it contains mostly unreachable values.
Despite the pre-existence of C++, the creators of Java made the mistake of adopting type erasure for generics, leading to unnecessary boxing. For example, I benchmarked a simple hash table running 17× faster on .NET than the JVM partly because .NET did not make this mistake (it uses reified generics) and also because .NET has value types. I actually blame Lisp for making Java slow.
All modern functional language implementations continue to box excessively. JVM-based languages like Clojure and Scala have little choice because the VM they target cannot even express value types. OCaml sheds type information early in its compilation process and resorts to tagged integers and boxing at run-time to handle polymorphism. Consequently, OCaml will often box individual floating point numbers and always boxes tuples. For example, a triple of bytes in OCaml is represented by a pointer (with an implicit 1-bit tag embedded in it that gets checked repeatedly at run-time) to a heap-allocated block with a 64 bit header and 192 bit body containing three tagged 63-bit integers (where the 3 tags are, again, repeatedly examined at run time!). This is clearly insane.
Some work has been done on unboxing optimizations in functional languages but it never really gained traction. For example, the MLton compiler for Standard ML was a whole-program optimizing compiler that did sophisticated unboxing optimizations. Sadly, it was before its time and the "long" compilation times (probably under 1s on a modern machine!) deterred people from using it.
The only major platform to have broken this trend is .NET but, amazingly, it appears to have been an accident. Despite having a Dictionary implementation very heavily optimized for keys and values that are of value types (because they are unboxed) Microsoft employees like Eric Lippert continue to claim that the important thing about value types is their pass-by-value semantics and not the performance characteristics that stem from their unboxed internal representation. Eric seems to have been proven wrong: more .NET developers seem to care more about unboxing than pass-by-value. Indeed, most structs are immutable and, therefore, referentially transparent so there is no semantic difference between pass-by-value and pass-by-reference. Performance is visible and structs can offer massive performance improvements. The performance of structs even saved Stack Overflow and structs are used to avoid GC latency in commercial software like Rapid Addition's!
The other reason for heavy allocation by functional languages is inherent. Imperative data structures like hash tables use huge monolithic arrays internally. If these were persistent then the huge internal arrays would need to be copied every time an update was made. So purely functional data structures like balanced binary trees are fragmented into many little heap-allocated blocks in order to facilitate reuse from one version of the collection to the next.
Clojure uses a neat trick to alleviate this problem when collections like dictionaries are only written to during initialization and are then read from a lot. In this case, the initialization can use mutation to build the structure "behind the scenes". However, this does not help with incremental updates and the resulting collections are still substantially slower to read than their imperative equivalents. On the up-side, purely functional data structures offer persistence whereas imperative ones do not. However, few practical applications benefit from persistence in practice so this is often not advantageous. Hence the desire for impure functional languages where you can drop to imperative style effortlessly and reap the benefits.
Does anyone know of a functional language appropriate for embedded / real-time applications, where memory allocation is kept to a minimum and the produced machine code is lean and fast?
Take a look at Erlang and OCaml if you haven't already. Both are reasonable for memory constrained systems but neither generate particularly great machine code.
Nothing is inherently anything. Here is an example where interpreted OCaml runs faster than equivalent C code, because the OCaml optimizer has different information available to it, due to differences in the language. Of course, it would be foolish to make a general claim that OCaml is categorically faster than C. The point is, it depends upon what you're doing, and how you do it.
That said, OCaml is an example of a (mostly) functional language which is actually designed for performance, in contrast to purity.
Functional languages require the elimination of mutable state that is visible at the level of the language abstraction. Therefore, data that would be mutated in place by an imperative language needs to be copied instead, with the mutation taking place on the copy. For a simple example, see a quick sort in Haskell vs. C.
Furthermore, garbage collection is required because free() is not a pure function, as it has side effects. Therefore, the only way to free memory that does not involve side effects at the level of the language abstraction is with garbage collection.
Of course, in principle, a sufficiently smart compiler could optimize out much of this copying. This is already done to some degree, but making the compiler sufficiently smart to understand the semantics of your code at that level is just plain hard.
The short answer: because C is fast. As in, blazingly ridiculously crazy fast. A language simply doesn't have to be 'slow' to get its rear handed to it by C.
The reason why C is fast is that it was created by really great coders, and gcc has been optimized over the course of a couple more decades and by dozens more brilliant coders than 99% of languages out there.
In short, you're not going to beat C except for specialized tasks that require very specific functional programming constructs.
The control flow of proceedural languages much better matches the actual processing patterns of modern computers.
C maps very closely onto the assembly code its compilation produces, hence the nickname "cross-platform assembly". Computer manufacturers have spent a few decades making assembly code run as fast as possible, so C inherits all of this raw speed.
In comparison, the no side-effects, inherent parallelism of functional languages does not map onto a single processor at all well. The arbitrary order in which functions can be invoked needs to be serialised down to the CPU bottleneck: without extremely clever compilation, you're going to be context switching all the time, none of the pre-fetching will work because you're constantly jumping all over the place, ... Basically, all the optimisation work that computer manufacturers have done for nice, predictable proceedural languages is pretty much useless.
However! With the move towards lots of less powerful cores (rather than one or two turbo-charged cores), functional languages should begin to close the gap, as they naturally scale horizontally.
C is fast because it's basically a set of macros for assembler :) There is no "behind the scene" when you are writing a program in C. You alloc memory when you decide it's time to do that and you free in the same fashion. This is a huge advantage when you are writing a real time application, where predictabily is important (more than anything else, actually).
Also, C compilers are generally extremly fast because language itself is simple. It even doesn't make any type checkings :) This also means that is easier to make hard to find errors.
Ad advantage with the lack of type checking is that a function name can just be exported with its name for example and this makes C code easy to link with other language's code
Well Haskell is only 1.8 times slower than GCC's C++, which is faster than GCC's C implementation for typical benchmark tasks.
That makes Haskell very fast, even faster than C#(Mono that is).
relative Language
speed
1.0 C++ GNU g++
1.1 C GNU gcc
1.2 ATS
1.5 Java 6 -server
1.5 Clean
1.6 Pascal Free Pascal
1.6 Fortran Intel
1.8 Haskell GHC
2.0 C# Mono
2.1 Scala
2.2 Ada 2005 GNAT
2.4 Lisp SBCL
3.9 Lua LuaJIT
source
For the record I use Lua for Games on the iPhone, thus you could easily use Haskell or Lisp if you prefer, since they are faster.
As for now, functional languages aren't used heavily for industry projects, so not enough serious work goes into optimizers. Also, optimizing imperative code for an imperative target is probably way easier.
Functional languages have one feat that will let them outdo imperative languages really soon now: trivial parallelization.
Trivial not in the sense that it is easy, but that it can be built into the language environment, without the developer needing to think about it.
The cost of robust multithreading in a thread-agnostic language like C is prohibitive for many projects.
I disagree with tuinstoel. The important question is whether the functional language provides a faster development time and results in faster code when it is used to what functional languages were meant to be used. See the efficiency issues section on Wikipedia for a glimpse of what I mean.
One more reason for bigger executable size could be lazy evaluation and non-strictness. The compiler can't figure out at compile-time when certain expressions get evaluated, so some runtime gets stuffed into the executable to handle this (to call upon the evaluation of the so-called thunks). As for performance, laziness can be both good and bad. On one hand it allows for additional potential optimization, on the other hand the code size can be larger and programmers are more likely to make bad decisions, e.g. see Haskell's foldl vs. foldr vs. foldl' vs. foldr'.

Resources