What is the most performant lisp on the JVM - performance

What is the most performant (fastest) lisp implementation on the JVM?
By lisp implementation I consider all implementations of any language in lisp family, like Common Lisp, Scheme, Clojure, ...
I know that Clojure can be made pretty fast using type hints, that ABCL is in general not considered to be fast. I don't have experience using any Scheme on JVM, but heard that Kawa is pretty fast too.

With Clojure you can get to the speed of Java (with type hints of course) and you can not get faster than java (exept in some very rare cases). I don't know about the other lisps the are maybe the same speed but not faster.
So that said about the standard speed of calls and so on.
Clojure has data structures are not always as fast as possible but the really make up for that with there other properties like thread safety, immutable and fast reading.
To make the data structures faster Rich invented transient with make them mutable in a way that they are still functional (and the are a LOT faster) and he is already working on the next big thing (read about the Emerging Languages camp talk of rich).
Its much easier to write concurrent code with clojure so that is really imported for to make fast programmes.
So the next thing is math. There are three levels of Speed on the JVM. Math with boxed Types,primitiv Types with overflow checking,or without overflow checking. Clojure provides all of those so no limit there.
So the next thing is how fast can you work with Java, if you have to use wrappers you wont perform as good and java calls are used often in most JVM languages. To implement clojure in clojure, clojure needed to add a low level construct so that you can interact with java without any overhead.
So clojure is as fast as it gets on the JVM.
P.S.
Protocols are like really fast multmethods witch are not that generic but the dispatch fast enough to use them in clojure core (and so not depend on java anymore). Look them up the are way cool.

not a whole lot of good data though this and some others seem to indicate the obvious. Immutible languages suffer slightly when doing non-immutable tasks, and non-immutable languages suffer when doing highly parallel tasks.
When considering these questions it helps to consider the "fail back option". Clojure can fall back to java for any part of your code that the profiler tells you is not pulling its wight.
in short: I vote clojure :)

I'd be surprised if it is not clojure. No other JVM lisp I know of has had as much attention paid to performance. The fastest Scheme is probably SISC - it's compiled to a FASL format, but still not "native" JVM instruction level.

I've been testing various lisps over the last few weeks. Kawa is the fastest JVM implementation I've tried so far in terms of startup time, REPL response time and running basic scripts. The author posted some performance statistics in 2010, which shows that Kawa outperforms clojure by a clear margin. YMMV.
http://per.bothner.com/blog/2010/Kawa-in-shootout/

Related

Why Clojure/Lisp Programs are faster than other dynamic languages?

Based on language shootouts in a past few years, Clojure and other Lisps consistently perform better than most other dynamic languages. Why is that so?
Is it because of its homoiconicity?
Edit:
I did not know that Clojure is compiled into bytecode just like Java and Scala.
This stackoverflow thread threw light on why Clojure reaps the advantage of being both compiled and interpreted.
It is almost impossible to answer this question - it depends very much on how well the benchmark code is written, what exactly is being tested, whether you are allowed to use libraries that exploit native code, whether you are writing "idiomatic" code or optimising for performance etc.
So as always, you should treat all microbenchmarks with caution.
Having said that, the following reasons might give Clojure an advantage however in certain circumstances:
Clojure is always compiled - first down to bytecode, then down to native code by the JVM JIT compiler. This can give it a speed advantage in many cases, particularly over langauges that rely on some form of interpretation. In theory at least, you should be able to match pure Java speed in any circumstance where Clojure enables you to produce the same bytecode (which is reasonably often, though not always....)
Clojure can exploit JVM primitives and static typing - despite being a dynamic language, Clojure will compile statically types or primitive code if you give it enough hints. This can easily give a 10x boost in performance, though at the expense of making your code a bit longer/uglier.....
Clojure has heavily optimised certain data structures and operations - in particular the immutable persistent data structures and certain functional programming constructs like "reduce".
Macros enable powerful compile-time optimisations - if you use macros cleverly, you can do some quite sophisticated optimisations at compile time, effectively using code-generation to produce the code that will be most efficient at runtime. This is an advantage many Lisps share (especially Common Lisp, which was one of the big inspirations for Clojure). As nulvinge points out, homoiconicity isn't strictly necessary to achieve this (C++ also has macros!), but being a homoiconic language makes macros a lot easier.
The modern JVMs are brilliantly engineered - Clojure takes advantage of the thousands of man-years of engineering that have gone into the JVM, the Java runtime libraries, the garbage collection algorithms etc. Non-JVM languages don't get these benefits.

most suitable language for computationally and memory expensive algorithms

Let's say you have to implement a tool to efficiently solve an NP-hard problem, with unavoidable possible explosion of memory usage (the output size in some cases exponential to the input size) and you are particularly concerned about the performances of this tool at running time. The source code has also to be readable and understandable once the underlying theory is known, and this requirement is as important as the efficiency of the tool itself.
I personally think that 3 languages could be suitable for these three requirements: c++, scala, java.
They all provide the right abstraction on data types that makes it possible to compare different structures or apply the same algorithms (which is also important) to different data types.
C++ has the advantage of being statically compiled and optimized, and with function inlining (if the data structures and algorithms are designed carefully) and other optimisation techniques it's possible to achieve a performance close to that of pure C while maintaining a fairly good readability.
If you also put a lot of care in data representation you can optimise the cache performance, which can gain orders of magnitude in speed when the cache miss rate is low.
Java is instead JIT compiled, which allows to apply optimisations during runtime, and in this category of algorithms that could have different behaviours between different runs, that may be a plus. I fear instead that such an approach could suffer from garbage collector, however in the case of this algorithm it's common to continuously allocate memory and java heap performance is notoriously better than C/C++ and if you implement your own memory manager inside the language you could even achieve good efficiency.
This approach instead is not able to inline method invocation (which induces a huge performance penalty) and doesn't give you control over the cache performance. Among the pros there's a better and cleaner syntax than C++.
My concerns about scala are more or less the same as Java, plus the fact that I can't control how the language is optimised unless I have a deep knowledge on the compiler and the standard library. But well: I get a very clean syntax :)
What's your take on the subject? Have you had to deal with this already? Would you implement an algorithm with such properties and requirements in any of these languages or would you suggest something else? How would you compare them?
Usually I’d say “C++” in a heartbeat. The secret being that C++ simply produces less (memory) garbage that needs managing.
On the other hand, your observation that
however in the case of this algorithm it's common to continuously allocate memory
is a hint that Java / Scala may actually be more suited. But then you could use a small object heap in C++ as well. Boost has one that uses the standard allocator interface, if memory serves.
Another advantage of C++ is obviously the use of abstraction without penalty through templates – i.e. that you can easily create generic algorithmic components that can interact without incurring a runtime overhead due to abstraction. In fact, you noted that
it's possible to achieve a performance close to that of pure C while maintaining a fairly good readability
– this is looking at things the wrong way: Templates allow C++ to achieve performance superior to that of C while still maintaining high abstraction.
D might be worth a look, seeing as how it tries to be a better C++.
From a superficial glance, it has better source code readability than C++ does, so that's one of your points covered.
It also has memory management, which makes playing with algorithms a bit easier.
And templates
Here is a stackoverflow discussion comparing the performance of C++ and D
The languages you noticed were my first guesses as well.
Each language has a different take on how to handle specific issues like compilation, memory management and source code, but in theory, any of them should be fitting to your problem.
It is impossible to tell which is best, and there is likely no major difference if you are familiar enough with all of them to work around their respective quirks.
And obviously, if you actually find the need to optimize (I'm not sure if that's a given), that's possible in each language. Lower level languages obviously offer more options, but are also (far) more complex to actually improve.
A single note about C++ vs Java: This is really a holy war, and if you've followed the recent development you'll probably have your own opinion. I, for one, think Java offers enough good aspects to make up for its flaws, usually.
And a final note on C++ vs C: According to my knowledge, the difference usually amounts to a sufficiently low percentage to ignore this. It it doesn't make a difference for the source code, it's fine to go with C, if C++ could make for easier-to-read source code, go with C++. In any case, the choice is kind of negligible.
In the end, remember that money spent on a few hours of programming/optimizing this could as well go into slightly superior hardware to make up for missed tiny details.
It all boils down to: Any of your options is fine as long as you do it right (domain knowledge).
I would use a language which makes it very easy to work on the algorithm. Get the algorithm right and it could very easily outweigh any advantage from fine-tuning the wrong algorithm. Don't be scared to play around in a language normally thought of as slow in execution speed if that language makes it easier to express algorithmic ideas. It is usually much easier to transcribe the right algorithm into another language than it is to eek-out the last dregs of speed from the wrong algorithm in the fastest executing language.
So do it in a language you are comfortable with and which is expressive. You might surprise yourself and find that what is produced is fast enough!

Is it possible to design a dynamic language without significant performance loss?

Is it possible to design something like Ruby or Clojure without the significant performance loss in many situations compared with C/Java? Does hardware design play a role?
Edit: With significant I mean in an order of magnitudes, not just ten procent
Edit: I suspect that delnan is correct with me meaning dynamic languages so I changed the title
Performance depends on many things. Of course the semantics of the language have to be preserved even if we are compiling it - you can't remove dynamic dispatch from Ruby, it would speed things up drmatically but it would totally break 95% of the all Ruby code in the world. But still, much of the performance depends on how smart the implementation is.
I assume, by "high-level", you mean "dynamic"? Haskell and OCaml are extremely high-level, yet are is compiled natively and can outperform C# or Java, even C and C++ in some corner cases - especially if parallelism comes into play. And they certainly weren't designed with performance as #1 goal. But compiler writers, especially those focused onfunctional languages, are a very clever folk. If you or I started a high-level language, even if we used e.g. LLVM as backend for native compilation, we wouldn't get anywhere near this performance.
Making dynamic languages run fast is harder - they delay many decisions (types, members of a class/an object, ...) to runtime instead of compiletime, and while static code analysis can sometimes prove it's not possible in lines n and m, you still have to carry an advanced runtime around and do quite a few things a static language's compiler can do at compiletime. Even dynamic dispatch can be optimized with a smarter VM (Inline Cache anyone?), but it's a lot of work. More than a small new-fangeled language could do, that is.
Also see Steve Yegge's Dynamic Languages Strike Back.
And of course, what is a significant peformance loss? 100 times slower than C reads like a lot, but as we all know, 80% of execution time is spent in 20% of the code = 80% of the code won't have notable impact on the percieved performance of the whole program. For the remaining 20%, you can always rewrite it in C or C++ and call it from the dynamic language. For many applications, this suffices (for some, you don't even need to optimize). For the rest... well, if performance is that critical, you should propably write it in a language designed for performance.
Don't confuse the language design with the platform that it runs on.
For instance, Java is a high-level language. It runs on the JVM (as does Clojure - identified above, and JRuby - a Java version of Ruby). The JVM will perform byte-code analysis and optimise how the code runs (making use of escape analysis, just-in-time compilation etc.). So the platform has an effect on the performance that is largely independent of the language itself (see here for more info on Java performance and comparisons to C/C++)
Loss compared to what? If you need a garbage collector or closures then you need them, and you're going to pay the price regardless. If a language makes them easy for you to get at, that doesn't mean you have to use them when you don't need them.
If a language is interpreted instead of compiled, that's going to introduce an order of magnitude slowdown. But such a language may have compensating advantages, like ease of use, platform independence, and not having to compile. And, the programs you write in them may not run long enough for speed to be an issue.
There may be language implementations that introduce slowness for no good reason, but those don't have to be used.
You might want to look at what the DARPA HPCS initiative has come up with. There were 3 programming languages proposed: Sun's Fortress, IBM's X10 and Cray's Chapel. The latter two are still under development. Whether any of these meet your definition of high-level I don't know.
And yes, hardware design certainly does play a part. All 3 of these languages are targeted at supercomputers with very many processors and exhibit features appropriate to that domain.
It's certainly possible. For example, Objective-C is a dynamically-typed language that has performance comparable to C++ (although a wee bit slower, generally speaking, but still roughly equivalent).

Does Scala scale better than other JVM languages?

Here is the only way I know to ask it at the moment. As Understand it Scala uses the Java Virtual Machine. I thought Jruby did also. Twitter switched its middleware to Scala. Could they have done the same thing and used Jruby?
Could they have started with Jruby to start with and not had their scaling problems that caused them to move from Ruby to Scala in the first place? Do I not understand what Jruby is? I'm assuming that because Jruby can use Java it would have scaled where Ruby would not.
Does it all boil down to the static versus dynamic types, in this case?
Scala is "scalable" in the sense that the language can be improved upon by libraries in a way that makes the extensions look like they are part of the language. That's why actors looks like part of the language, or why BigInt looks like part of the language.
This also applies to most other JVM languages. It does not apply to Java, because it has special treatment for it's basic types (Int, Boolean, etc), for operators, cumbersome syntax that makes clear what is built in the language and what is library, etc.
Now, Scala is more performatic than dynamic languages on the JVM because the JVM has no support for them. Dynamic languages on JVM have to resort t reflection, which is very slow.
No, not really. It's not that the JVM is somehow magic and makes things scale by its magic powers; it's that Scala, as a language, is architected to help people write scalable systems. That it's on top of the JVM is almost incidental.
I don't really think that the language is the biggest problem here. Twitter grew insanely fast, which always leads to a code mess. And if you do a rewrite, it is a good idea to go for a different language - that bars you from building your own mistakes again and/or to "reuse some parts". Also, Ruby is not really meant for that kind of heavy data handling that the twitter backend does.
The frontend remains Ruby, so they still use it.
You have to separate out different meanings of scaling:
Scaling in terms of growing the number of requests per second that can be handled with a proportionate increase in hardware
Scaling in terms of growing a code base without it becoming a tangled mess
Scala helps on the first point because it compiles to Java bytecode that's really similar to Java, and therefore usually has the same performance as Java. I say "usually," because Scala there are some cases where idiomatic Scala causes large amount of boxing to take place where idiomatic Java would not (this is slated to change in Scala 2.8).
Performance is of course different than scaling. Equivalent code written in JRuby would scale just as well, but the slope of the line would be steeper - you'd need more hardware to handle the same number of requests, but the shape of the line would be the same. But from a more practical perspective the performance helps because you rarely can scale in a perfectly linear fashion with respect to adding core or especially servers and having better performance slows the rate at which you have to add capacity.
Scala helps with the second point because it has an expressive, compile-time enforced type system and it provides a lot of other means for managing the complexity of your code, such as mixins. You can write spaghetti code in any language, but the Scala compiler will tell you when some of the noodles are broken while with JRuby you'll have to rely solely on tests. I've personally found that for me Python breaks down at about 1000 closely related LOCs, and which point I have to refactor to either substantially reduce of the LOCs or make the structure more modular. Of course this refactoring would be a good idea regardless of what your language, but occasionally the complexity is inherent. Dealing with a large number of tightly couple LOCs isn't easy in any language, but it is much easier in Scala than it is in Python, and I think the analogy extends to Ruby/JRuby as well.
Scala is a statically typed language. JRuby is dynamically typed. That is why Scala is faster than JRuby, even though both run on the JVM. JRuby has to do a lot of work at runtime (method resolution, etc.) that Scala does at compile-time. For what it's worth, though, JRuby is a very fast Ruby implementation.
Scalability is not an inherit language capability. You are talking about speed.
A better question to ask would be "Why is Scala faster than other JVM languages (or is it)?". As others have pointed out, it's a static vs. dynamic language thing.
There's an interesting discussion from the Twitter developers themselves in the comments of this post.
They've evaluated the different options and decided to implement the back-end in Scala because: it ran faster than the Ruby/JRuby alternatives and they felt they could benefit from static typing.

Are functional languages inherently slow? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
Why are functional languages always tailing behind C in benchmarks? If you have a statically typed functional language, it seems to me it could be compiled to the same code as C, or to even more optimized code since more semantics are available to the compiler. Why does it seem like all functional languages are slower than C, and why do they always need garbage collection and excessive use of the heap?
Does anyone know of a functional language appropriate for embedded / real-time applications, where memory allocation is kept to a minimum and the produced machine code is lean and fast?
Are functional languages inherently slow?
In some sense, yes. They require infrastructure that inevitably adds overheads over what can theoretically be attained using assembler by hand. In particular, first-class lexical closures only work well with garbage collection because they allow values to be carried out of scope.
Why are functional languages always tailing behind C in benchmarks?
Firstly, beware of selection bias. C acts as a lowest common denominator in benchmark suites, limiting what can be accomplished. If you have a benchmark comparing C with a functional language then it is almost certainly an extremely simple program. Arguably so simple that it is of little practical relevance today. It is not practically feasible to solve more complicated problems using C for a mere benchmark.
The most obvious example of this is parallelism. Today, we all have multicores. Even my phone is a multicore. Multicore parallelism is notoriously difficult in C but can be easy in functional languages (I like F#). Other examples include anything that benefits from persistent data structures, e.g. undo buffers are trivial with purely functional data structures but can be a huge amount of work in imperative languages like C.
Why does it seem like all functional languages are slower than C, and why do they always need garbage collection and excessive use of the heap?
Functional languages will seem slower because you'll only ever see benchmarks comparing code that is easy enough to write well in C and you'll never see benchmarks comparing meatier tasks where functional languages start to excel.
However, you've correctly identified what is probably the single biggest bottleneck in functional languages today: their excessive allocation rates. Nice work!
The reasons why functional languages allocate so heavily can be split into historical and inherent reasons.
Historically, Lisp implementations have been doing a lot of boxing for 50 years now. This characteristic spread to many other languages which use Lisp-like intermediate representations. Over the years, language implementers have continually resorted to boxing as a quick fix for complications in language implementation. In object oriented languages, the default has been to always heap allocate every object even when it can obviously be stack allocated. The burden of efficiency was then pushed onto the garbage collector and a huge amount of effort has been put into building garbage collectors that can attain performance close to that of stack allocation, typically by using a bump-allocating nursery generation. I think that a lot more effort should be put into researching functional language designs that minimize boxing and garbage collector designs that are optimized for different requirements.
Generational garbage collectors are great for languages that heap allocate a lot because they can be almost as fast as stack allocation. But they add substantial overheads elsewhere. Today's programs are increasingly using data structures like queues (e.g. for concurrent programming) and these give pathological behaviour for generational garbage collectors. If the items in the queue outlive the first generation then they all get marked, then they all get copied ("evacuated"), then all of the references to their old locations get updated and then they become eligible for collection. This is about 3× slower than it needs to be (e.g. compared to C). Mark region collectors like Beltway (2002) and Immix (2008) have the potential to solve this problem because the nursery is replaced with a region that can either be collected as if it were a nursery or, if it contains mostly reachable values, it can be replaced with another region and left to age until it contains mostly unreachable values.
Despite the pre-existence of C++, the creators of Java made the mistake of adopting type erasure for generics, leading to unnecessary boxing. For example, I benchmarked a simple hash table running 17× faster on .NET than the JVM partly because .NET did not make this mistake (it uses reified generics) and also because .NET has value types. I actually blame Lisp for making Java slow.
All modern functional language implementations continue to box excessively. JVM-based languages like Clojure and Scala have little choice because the VM they target cannot even express value types. OCaml sheds type information early in its compilation process and resorts to tagged integers and boxing at run-time to handle polymorphism. Consequently, OCaml will often box individual floating point numbers and always boxes tuples. For example, a triple of bytes in OCaml is represented by a pointer (with an implicit 1-bit tag embedded in it that gets checked repeatedly at run-time) to a heap-allocated block with a 64 bit header and 192 bit body containing three tagged 63-bit integers (where the 3 tags are, again, repeatedly examined at run time!). This is clearly insane.
Some work has been done on unboxing optimizations in functional languages but it never really gained traction. For example, the MLton compiler for Standard ML was a whole-program optimizing compiler that did sophisticated unboxing optimizations. Sadly, it was before its time and the "long" compilation times (probably under 1s on a modern machine!) deterred people from using it.
The only major platform to have broken this trend is .NET but, amazingly, it appears to have been an accident. Despite having a Dictionary implementation very heavily optimized for keys and values that are of value types (because they are unboxed) Microsoft employees like Eric Lippert continue to claim that the important thing about value types is their pass-by-value semantics and not the performance characteristics that stem from their unboxed internal representation. Eric seems to have been proven wrong: more .NET developers seem to care more about unboxing than pass-by-value. Indeed, most structs are immutable and, therefore, referentially transparent so there is no semantic difference between pass-by-value and pass-by-reference. Performance is visible and structs can offer massive performance improvements. The performance of structs even saved Stack Overflow and structs are used to avoid GC latency in commercial software like Rapid Addition's!
The other reason for heavy allocation by functional languages is inherent. Imperative data structures like hash tables use huge monolithic arrays internally. If these were persistent then the huge internal arrays would need to be copied every time an update was made. So purely functional data structures like balanced binary trees are fragmented into many little heap-allocated blocks in order to facilitate reuse from one version of the collection to the next.
Clojure uses a neat trick to alleviate this problem when collections like dictionaries are only written to during initialization and are then read from a lot. In this case, the initialization can use mutation to build the structure "behind the scenes". However, this does not help with incremental updates and the resulting collections are still substantially slower to read than their imperative equivalents. On the up-side, purely functional data structures offer persistence whereas imperative ones do not. However, few practical applications benefit from persistence in practice so this is often not advantageous. Hence the desire for impure functional languages where you can drop to imperative style effortlessly and reap the benefits.
Does anyone know of a functional language appropriate for embedded / real-time applications, where memory allocation is kept to a minimum and the produced machine code is lean and fast?
Take a look at Erlang and OCaml if you haven't already. Both are reasonable for memory constrained systems but neither generate particularly great machine code.
Nothing is inherently anything. Here is an example where interpreted OCaml runs faster than equivalent C code, because the OCaml optimizer has different information available to it, due to differences in the language. Of course, it would be foolish to make a general claim that OCaml is categorically faster than C. The point is, it depends upon what you're doing, and how you do it.
That said, OCaml is an example of a (mostly) functional language which is actually designed for performance, in contrast to purity.
Functional languages require the elimination of mutable state that is visible at the level of the language abstraction. Therefore, data that would be mutated in place by an imperative language needs to be copied instead, with the mutation taking place on the copy. For a simple example, see a quick sort in Haskell vs. C.
Furthermore, garbage collection is required because free() is not a pure function, as it has side effects. Therefore, the only way to free memory that does not involve side effects at the level of the language abstraction is with garbage collection.
Of course, in principle, a sufficiently smart compiler could optimize out much of this copying. This is already done to some degree, but making the compiler sufficiently smart to understand the semantics of your code at that level is just plain hard.
The short answer: because C is fast. As in, blazingly ridiculously crazy fast. A language simply doesn't have to be 'slow' to get its rear handed to it by C.
The reason why C is fast is that it was created by really great coders, and gcc has been optimized over the course of a couple more decades and by dozens more brilliant coders than 99% of languages out there.
In short, you're not going to beat C except for specialized tasks that require very specific functional programming constructs.
The control flow of proceedural languages much better matches the actual processing patterns of modern computers.
C maps very closely onto the assembly code its compilation produces, hence the nickname "cross-platform assembly". Computer manufacturers have spent a few decades making assembly code run as fast as possible, so C inherits all of this raw speed.
In comparison, the no side-effects, inherent parallelism of functional languages does not map onto a single processor at all well. The arbitrary order in which functions can be invoked needs to be serialised down to the CPU bottleneck: without extremely clever compilation, you're going to be context switching all the time, none of the pre-fetching will work because you're constantly jumping all over the place, ... Basically, all the optimisation work that computer manufacturers have done for nice, predictable proceedural languages is pretty much useless.
However! With the move towards lots of less powerful cores (rather than one or two turbo-charged cores), functional languages should begin to close the gap, as they naturally scale horizontally.
C is fast because it's basically a set of macros for assembler :) There is no "behind the scene" when you are writing a program in C. You alloc memory when you decide it's time to do that and you free in the same fashion. This is a huge advantage when you are writing a real time application, where predictabily is important (more than anything else, actually).
Also, C compilers are generally extremly fast because language itself is simple. It even doesn't make any type checkings :) This also means that is easier to make hard to find errors.
Ad advantage with the lack of type checking is that a function name can just be exported with its name for example and this makes C code easy to link with other language's code
Well Haskell is only 1.8 times slower than GCC's C++, which is faster than GCC's C implementation for typical benchmark tasks.
That makes Haskell very fast, even faster than C#(Mono that is).
relative Language
speed
1.0 C++ GNU g++
1.1 C GNU gcc
1.2 ATS
1.5 Java 6 -server
1.5 Clean
1.6 Pascal Free Pascal
1.6 Fortran Intel
1.8 Haskell GHC
2.0 C# Mono
2.1 Scala
2.2 Ada 2005 GNAT
2.4 Lisp SBCL
3.9 Lua LuaJIT
source
For the record I use Lua for Games on the iPhone, thus you could easily use Haskell or Lisp if you prefer, since they are faster.
As for now, functional languages aren't used heavily for industry projects, so not enough serious work goes into optimizers. Also, optimizing imperative code for an imperative target is probably way easier.
Functional languages have one feat that will let them outdo imperative languages really soon now: trivial parallelization.
Trivial not in the sense that it is easy, but that it can be built into the language environment, without the developer needing to think about it.
The cost of robust multithreading in a thread-agnostic language like C is prohibitive for many projects.
I disagree with tuinstoel. The important question is whether the functional language provides a faster development time and results in faster code when it is used to what functional languages were meant to be used. See the efficiency issues section on Wikipedia for a glimpse of what I mean.
One more reason for bigger executable size could be lazy evaluation and non-strictness. The compiler can't figure out at compile-time when certain expressions get evaluated, so some runtime gets stuffed into the executable to handle this (to call upon the evaluation of the so-called thunks). As for performance, laziness can be both good and bad. On one hand it allows for additional potential optimization, on the other hand the code size can be larger and programmers are more likely to make bad decisions, e.g. see Haskell's foldl vs. foldr vs. foldl' vs. foldr'.

Resources