Examples where compiler-optimized functional code performs better than imperative code - performance

One of the promises of side-effect free, referentially transparent functional programming is that such code can be extensively optimized.
To quote Wikipedia:
Immutability of data can, in many cases, lead to execution efficiency, by allowing the compiler to make assumptions that are unsafe in an imperative language, thus increasing opportunities for inline expansion.
I'd like to see examples where a functional language compiler outperforms an imperative one by producing a better optimized code.
Edit: I tried to give a specific scenario, but apparently it wasn't a good idea. So I'll try to explain it in a different way.
Programmers translate ideas (algorithms) into languages that machines can understand. At the same time, one of the most important aspects of the translation is that also humans can understand the resulting code. Unfortunately, in many cases there is a trade-off: A concise, readable code suffers from slow performance and needs to be manually optimized. This is error-prone, time consuming, and it makes the code less readable (up to totally unreadable).
The foundations of functional languages, such as immutability and referential transparency, allow compilers to perform extensive optimizations, which could replace manual optimization of code and free programmers from this trade-off. I'm looking for examples of ideas (algorithms) and their implementations, such that:
the (functional) implementation is close to the original idea and is easy to understand,
it is extensively optimized by the compiler of the language, and
it is hard (or impossible) to write similarly efficient code in an imperative language without manual optimizations that reduce its conciseness and readability.
I apologize if it is a bit vague, but I hope the idea is clear. I don't want to give unnecessary restrictions on the answers. I'm open to suggestions if someone knows how to express it better.
My interest isn't just theoretical. I'd like to use such examples (among other things) to motivate students to get interested in functional programming.
At first, I wasn't satisfied by a few examples suggested in the comments. On second thoughts I take my objections back, those are good examples. Please feel free to expand them to full answers so that people can comment and vote for them.
(One class of such examples will be most likely parallelized code, which can take advantage of multiple CPU cores. Often in functional languages this can be done easily without sacrificing code simplicity (like in Haskell by adding par or pseq in appropriate places). I' be interested in such examples too, but also in other, non-parallel ones.)

There are cases where the same algorithm will optimize better in a pure context. Specifically, stream fusion allows an algorithm that consists of a sequence of loops that may be of widely varying form: maps, filters, folds, unfolds, to be composed into a single loop.
The equivalent optimization in a conventional imperative setting, with mutable data in loops, would have to achieve a full effect analysis, which no one does.
So at least for the class of algorithms that are implemented as pipelines of ana- and catamorphisms on sequences, you can guarantee optimization results that are not possible in an imperative setting.

A very recent paper Haskell beats C using generalised stream fusion by Geoff Mainland, Simon Peyton Jones, Simon Marlow, Roman Leshchinskiy (submitted to ICFP 2013) describes such an example. Abstract (with the interesting part in bold):
Stream fusion [6] is a powerful technique for automatically transforming
high-level sequence-processing functions into efficient implementations.
It has been used to great effect in Haskell libraries
for manipulating byte arrays, Unicode text, and unboxed vectors.
However, some operations, like vector append, still do not perform
well within the standard stream fusion framework. Others,
like SIMD computation using the SSE and AVX instructions available
on modern x86 chips, do not seem to fit in the framework at
all.
In this paper we introduce generalized stream fusion, which
solves these issues. The key insight is to bundle together multiple
stream representations, each tuned for a particular class of stream
consumer. We also describe a stream representation suited for efficient
computation with SSE instructions. Our ideas are implemented
in modified versions of the GHC compiler and vector library.
Benchmarks show that high-level Haskell code written using
our compiler and libraries can produce code that is faster than both
compiler- and hand-vectorized C.

This is just a note, not an answer: the gcc has a pure attribute suggesting it can take account of purity; the obvious reasons are remarked on in the manual here.
I would think that 'static single assignment' imposes a form of purity -- see the links at http://lambda-the-ultimate.org/node/2860 or the wikipedia article.

make and various build systems perform better for large projects by assuming that various build steps are referentially transparent; as such, they only need to rerun steps that have had their inputs change.
For small to medium sized changes, this can be a lot faster than building from scratch.

Related

most suitable language for computationally and memory expensive algorithms

Let's say you have to implement a tool to efficiently solve an NP-hard problem, with unavoidable possible explosion of memory usage (the output size in some cases exponential to the input size) and you are particularly concerned about the performances of this tool at running time. The source code has also to be readable and understandable once the underlying theory is known, and this requirement is as important as the efficiency of the tool itself.
I personally think that 3 languages could be suitable for these three requirements: c++, scala, java.
They all provide the right abstraction on data types that makes it possible to compare different structures or apply the same algorithms (which is also important) to different data types.
C++ has the advantage of being statically compiled and optimized, and with function inlining (if the data structures and algorithms are designed carefully) and other optimisation techniques it's possible to achieve a performance close to that of pure C while maintaining a fairly good readability.
If you also put a lot of care in data representation you can optimise the cache performance, which can gain orders of magnitude in speed when the cache miss rate is low.
Java is instead JIT compiled, which allows to apply optimisations during runtime, and in this category of algorithms that could have different behaviours between different runs, that may be a plus. I fear instead that such an approach could suffer from garbage collector, however in the case of this algorithm it's common to continuously allocate memory and java heap performance is notoriously better than C/C++ and if you implement your own memory manager inside the language you could even achieve good efficiency.
This approach instead is not able to inline method invocation (which induces a huge performance penalty) and doesn't give you control over the cache performance. Among the pros there's a better and cleaner syntax than C++.
My concerns about scala are more or less the same as Java, plus the fact that I can't control how the language is optimised unless I have a deep knowledge on the compiler and the standard library. But well: I get a very clean syntax :)
What's your take on the subject? Have you had to deal with this already? Would you implement an algorithm with such properties and requirements in any of these languages or would you suggest something else? How would you compare them?
Usually I’d say “C++” in a heartbeat. The secret being that C++ simply produces less (memory) garbage that needs managing.
On the other hand, your observation that
however in the case of this algorithm it's common to continuously allocate memory
is a hint that Java / Scala may actually be more suited. But then you could use a small object heap in C++ as well. Boost has one that uses the standard allocator interface, if memory serves.
Another advantage of C++ is obviously the use of abstraction without penalty through templates – i.e. that you can easily create generic algorithmic components that can interact without incurring a runtime overhead due to abstraction. In fact, you noted that
it's possible to achieve a performance close to that of pure C while maintaining a fairly good readability
– this is looking at things the wrong way: Templates allow C++ to achieve performance superior to that of C while still maintaining high abstraction.
D might be worth a look, seeing as how it tries to be a better C++.
From a superficial glance, it has better source code readability than C++ does, so that's one of your points covered.
It also has memory management, which makes playing with algorithms a bit easier.
And templates
Here is a stackoverflow discussion comparing the performance of C++ and D
The languages you noticed were my first guesses as well.
Each language has a different take on how to handle specific issues like compilation, memory management and source code, but in theory, any of them should be fitting to your problem.
It is impossible to tell which is best, and there is likely no major difference if you are familiar enough with all of them to work around their respective quirks.
And obviously, if you actually find the need to optimize (I'm not sure if that's a given), that's possible in each language. Lower level languages obviously offer more options, but are also (far) more complex to actually improve.
A single note about C++ vs Java: This is really a holy war, and if you've followed the recent development you'll probably have your own opinion. I, for one, think Java offers enough good aspects to make up for its flaws, usually.
And a final note on C++ vs C: According to my knowledge, the difference usually amounts to a sufficiently low percentage to ignore this. It it doesn't make a difference for the source code, it's fine to go with C, if C++ could make for easier-to-read source code, go with C++. In any case, the choice is kind of negligible.
In the end, remember that money spent on a few hours of programming/optimizing this could as well go into slightly superior hardware to make up for missed tiny details.
It all boils down to: Any of your options is fine as long as you do it right (domain knowledge).
I would use a language which makes it very easy to work on the algorithm. Get the algorithm right and it could very easily outweigh any advantage from fine-tuning the wrong algorithm. Don't be scared to play around in a language normally thought of as slow in execution speed if that language makes it easier to express algorithmic ideas. It is usually much easier to transcribe the right algorithm into another language than it is to eek-out the last dregs of speed from the wrong algorithm in the fastest executing language.
So do it in a language you are comfortable with and which is expressive. You might surprise yourself and find that what is produced is fast enough!

Is there a functional algorithm which is faster than an imperative one?

I'm searching for an algorithm (or an argument of such an algorithm) in functional style which is faster than an imperative one.
I like functional code because it's expressive and mostly easier to read than it's imperative pendants. But I also know that this expressiveness can cost runtime overhead. Not always due to techniques like tail recursion - but often they are slower.
While programming I don't think about runtime costs of functional code because nowadays PCs are very fast and development time is more expensive than runtime. Furthermore for me readability is more important than performance. Nevertheless my programs are fast enough so I rarely need to solve a problem in an imperative way.
There are some algorithms which in practice should be implemented in an imperative style (like sorting algorithms) otherwise in most cases they are too slow or requires lots of memory.
In contrast due to techniques like pattern matching a whole program like a parser written in an functional language may be much faster than one written in an imperative language because of the possibility of compilers to optimize the code.
But are there any algorithms which are faster in a functional style or are there possibilities to setting up arguments of such an algorithm?
A simple reasoning. I don't vouch for terminology, but it seems to make sense.
A functional program, to be executed, will need to be transformed into some set of machine instructions.
All machines (I've heard of) are imperative.
Thus, for every functional program, there's an imperative program (roughly speaking, in assembler language), equivalent to it.
So, you'll probably have to be satisfied with 'expressiveness', until we get 'functional computers'.
The short answer:
Anything that can be easily made parallel because it's free of side-effects will be quicker on a multi-core processor.
QuickSort, for example, scales up quite nicely when used with immutable collections: http://en.wikipedia.org/wiki/Quicksort#Parallelization
All else being equal, if you have two algorithms that can reasonably be described as equivalent, except that one uses pure functions on immutable data, while the second relies on in-place mutations, then the first algorithm will scale up to multiple cores with ease.
It may even be the case that your programming language can perform this optimization for you, as with the scalaCL plugin that will compile code to run on your GPU. (I'm wondering now if SIMD instructions make this a "functional" processor)
So given parallel hardware, the first algorithm will perform better, and the more cores you have, the bigger the difference will be.
FWIW there are Purely functional data structures, which benefit from functional programming.
There's also a nice book on Purely Functional Data Structures by Chris Okasaki, which presents data structures from the point of view of functional languages.
Another interesting article Announcing Intel Concurrent Collections for Haskell 0.1, about parallel programming, they note:
Well, it happens that the CnC notion
of a step is a pure function. A step
does nothing but read its inputs and
produce tags and items as output. This
design was chosen to bring CnC to that
elusive but wonderful place called
deterministic parallelism. The
decision had nothing to do with
language preferences. (And indeed, the
primary CnC implementations are for
C++ and Java.)
Yet what a great match Haskell and CnC
would make! Haskell is the only major
language where we can (1) enforce that
steps be pure, and (2) directly
recognize (and leverage!) the fact
that both steps and graph executions
are pure.
Add to that the fact that Haskell is
wonderfully extensible and thus the
CnC "library" can feel almost like a
domain-specific language.
It doesn't say about performance – they promise to discuss some of the implementation details and performance in future posts, – but Haskell with its "pureness" fits nicely into parallel programming.
One could argue that all programs boil down to machine code.
So, if I dis-assemble the machine code (of an imperative program) and tweak the assembler, I could perhaps end up with a faster program. Or I could come up with an "assembler algorithm" that exploits some specific CPU feature, and therefor it really is faster than the imperative language version.
Does this situation lead to the conclusion that we should use assembler everywhere? No, we decided to use imperative languages because they are less cumbersome. We write pieces in assembler because we really need to.
Ideally we should also use FP algorithms because they are less cumbersome to code, and use imperative code when we really need to.
Well, I guess you meant to ask if there is an implementation of an algorithm in functional programming language that is faster than another implementation of the same algorithm but in an imperative language. By "faster" I mean that it performs better in terms of execution time or memory footprint on some inputs according to some measurement that we deem trustworthy.
I do not exclude this possibility. :)
To elaborate on Yasir Arsanukaev's answer, purely functional data structures can be faster than mutable data structures in some situations becuase they share pieces of their structure. Thus in places where you might have to copy a whole array or list in an imperative language, where you can get away with a fraction of the copying because you can change (and copy) only a small part of the data structure. Lists in functional languages are like this -- multiple lists can share the same tail since nothing can be modified. (This can be done in imperative languages, but usually isn't, because within the imperative paradigm, people aren't usually used to talking about immutable data.)
Also, lazy evaluation in functional languages (particularly Haskell which is lazy by default) can also be very advantageous because it can eliminate code execution when the code's results won't actually be used. (One can be very careful not to run this code in the first place in imperative languages, however.)

OOP vs PP for algorithms

Which paradigm is better for design and analysis of algorithms?
Which is faster? Because I have a subject called Design and Analysis of Algorithms in university and have a time limit for programs. Is OOP slower than Procedure programming? Or the time difference is not big?
Object-Oriented programming isn't particularly relevant to algorithms. Procedural programming you will need, but as far as algorithms are concerned, object-oriented programming is just another way to package up procedural programming. You have methods instead of functions and classes instead of records/structs, but the only relevant difference is run-time dispatch, and that's just a declarative way to handle a run-time decision that could have been handled some other way.
Object-Oriented programming is more relevant to the larger scale - design patterns etc - whereas algorithms are more relevant to the smaller scale involving a small number (often just one) of procedures.
IMO algorithms exist separat from the OO or PP issue.
Neither OO or PP are 'slow', in either design-time or program performance, they are different approaches.
I would think that Functional Programming would produce cleaner implementation of algorithms.
Having said that, you shouldn't see much of a difference whatever approach you take. An algorithm can be expressed in any language or development paradigm.
Update: (following comments)
Apparently functional programming does not lend itself to implementing algorithms as well as I thought it may. It has other strengths and I mostly mentioned it for completeness sake, as the question only mentioned OOP (object oriented programming) and PP (procedural programming).
the weak link is liekly to be your knowledge - what language & paradigm are you most comfortable with. use that
For design, analysis and development: definitely OOP. It was invented solemnly for the benefit of designers and developers.
For program runtime execution: sometimes PP is more efficient, but often OOP gets reduced to plain PP by the compiler, making them equivalent.
The difference (in execution time) is marginal at best.
Note that there is a more important factor than sheer performance: OOP provide the programmer with better means to organize his code which results in programs that are well structured, understandable, and more reliable (less bugs).
Object oriented programming abstracts many low level details from the programmer. It is designed with the goal
to make it easier to write and read (and understand) programs
to make programs look closer to the real world (and hence, easier to understand).
Procedural programming does not have many abstractions like objects, methods, virtual functions etc.
So, talking about speed: a seasoned expert who knows the internals of how an object oriented system will work can write a program that runs just as fast.
That being said, the speed advantage achieved by use PP over OOP will be very marginal. It boils down to which way you can write programs comfortably.
EDIT:
An interesting anecdote comes to my mind: in the Microsoft Foundation Classes, message passing from one object to the other was implemented using macros that looked like BEGIN_MESSAGE_MAP() and END_MESSAGE_MAP(), and the reason was that it was faster than using virtual functions.
This is one case where the library developers have used OOP, but have knowingly sidestepped a performance bottleneck.
My guess is that the difference is not big enough to worry about, and the time limit should allow using a slower language, since the algorithm used would be what's important.
The purpose of the time limit should IMO be to get you to avoid using for example a O(n3) algorithm when there is a O(n log n)
To make writing code easy and less error prone, you need a language that supports Generics - such as C++ with STL or Java with the Java Collections Framework. If you are implementing an algorithm against a deadline, you may be able to save time by not providing your algorithm with a nice O-O or Generic interface, so making the code you write yourself entirely procedural.
For run time efficiency, you would probably be best writing everything in procedural C - see e.g. the examples in "The Practice Of Programming" - but it will take a lot longer to write, and you are more likely to make mistakes. This also assumes that all the building blocks you need are available in their most up to date and efficient from in procedural C as well, which is quite an assumption these days. Most likely making use of the STL or the JFC will in practice save you cpu time as well as development time.
As for functional languages, I remember hearing functional programming enthusiasts point out how much easier to use their languages were than the competition, and then observing that those members of the class who chose a functional language were still struggling when those who wrote in Fortran 77 had finished and gone on to draw graphs of the performance of their program. I see that the claims of the functional programming community have not changed. I do not know if the underlying reality has.
Steve314 said it well. OOP is more about the design patterns and organization of large applications. It also lets you deal with unknowns better, which is great for user apps. However, for analyzing algorithms, most likely you are going to be thinking functionally about what you want to do. In that case, I'd stick to more simple PP and not try to create a fully OO design, when you care about the algorithm. I'd want to work with C or Matlab (depending on how math intensive the algorithm is). Just my opinion on it.
I once adapted the Knuth-Morris-Pratt string search algorithm so that I could have an object that would take a character at a time and return a match/no-match status. It wasn't a straight-forward translation.

Expert system for writing programs?

I am brainstorming an idea of developing a high level software to manipulate matrix algebra equations, tensor manipulations to be exact, to produce optimized C++ code using several criteria such as sizes of dimensions, available memory on the system, etc.
Something which is similar in spirit to tensor contraction engine, TCE, but specifically oriented towards producing optimized rather than general code.
The end result desired is software which is expert in producing parallel program in my domain.
Does this sort of development fall on the category of expert systems?
What other projects out there work in the same area of producing code given the constraints?
What you are describing is more like a Domain-Specific Language.
http://en.wikipedia.org/wiki/Domain-specific_language
It wouldn't be called an expert system, at least not in the traditional sense of this concept.
Expert systems are rule-based inference engines, whereby the expertise in question is clearly encapsulated in the rules. The system you suggest, while possibly encapsulating insight about the nature of the problem domain inside a linear algebra model of sorts, would act more as a black box than an expert system. One of the characteristics of expert systems is that they can produce an "explanation" of their reasoning, and such a feature is possible in part because the knowledge representation, while formalized, remains close to simple statements in a natural language; matrices and operations on them, while possibly being derived upon similar observation of reality, are a lot less transparent...
It is unclear from the description in the question if the system you propose would optimize existing code (possibly in a limited domain), or if it would produced optimized code, in that case driven bay some external goal/function...
Well production systems (rule systems) are one of four general approaches to computation (Turing machines, Church recursive functions, Post production systems and Markov algorithms [and several more have been added to that list]) which more or less have these respective realizations: imperative programming, functional programming, rule based programming - as far as I know Markov algorithms don't have an independent implementation. These are all Turing equivalent.
So rule based programming can be used to write anything at all. Also early mathematical/symbolic manipulation programs did generally use rule based programming until the problem was sufficiently well understood (whereupon the approach was changed to imperative or constraint programming - see MACSYMA - hmmm MACSYMA was written in Lisp so perhaps I have a different program in mind or perhaps they originally implemented a rule system in Lisp for this).
You could easily write a rule system to perform the matrix manipulations. You could keep a trace depending on logical support to record the actual rules fired that contributed to a solution (some rules that fire might not contribute directly to a solution afterall). Then for every rule you have a mapping to a set of C++ instructions (these don't have to be "complete" - they sort of act more like a semi-executable requirement) which are output as an intermediate language. Then that is read by a parser to link it to the required input data and any kind of fix up needed. You might find it easier to generate functional code - for one thing after the fix up you could more easily optimize the output code in functional source.
Having said that, other contributors have outlined a domain specific language approach and that is what the TED people did too (my suggestion is that too just using rules).

Are functional languages inherently slow? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
Why are functional languages always tailing behind C in benchmarks? If you have a statically typed functional language, it seems to me it could be compiled to the same code as C, or to even more optimized code since more semantics are available to the compiler. Why does it seem like all functional languages are slower than C, and why do they always need garbage collection and excessive use of the heap?
Does anyone know of a functional language appropriate for embedded / real-time applications, where memory allocation is kept to a minimum and the produced machine code is lean and fast?
Are functional languages inherently slow?
In some sense, yes. They require infrastructure that inevitably adds overheads over what can theoretically be attained using assembler by hand. In particular, first-class lexical closures only work well with garbage collection because they allow values to be carried out of scope.
Why are functional languages always tailing behind C in benchmarks?
Firstly, beware of selection bias. C acts as a lowest common denominator in benchmark suites, limiting what can be accomplished. If you have a benchmark comparing C with a functional language then it is almost certainly an extremely simple program. Arguably so simple that it is of little practical relevance today. It is not practically feasible to solve more complicated problems using C for a mere benchmark.
The most obvious example of this is parallelism. Today, we all have multicores. Even my phone is a multicore. Multicore parallelism is notoriously difficult in C but can be easy in functional languages (I like F#). Other examples include anything that benefits from persistent data structures, e.g. undo buffers are trivial with purely functional data structures but can be a huge amount of work in imperative languages like C.
Why does it seem like all functional languages are slower than C, and why do they always need garbage collection and excessive use of the heap?
Functional languages will seem slower because you'll only ever see benchmarks comparing code that is easy enough to write well in C and you'll never see benchmarks comparing meatier tasks where functional languages start to excel.
However, you've correctly identified what is probably the single biggest bottleneck in functional languages today: their excessive allocation rates. Nice work!
The reasons why functional languages allocate so heavily can be split into historical and inherent reasons.
Historically, Lisp implementations have been doing a lot of boxing for 50 years now. This characteristic spread to many other languages which use Lisp-like intermediate representations. Over the years, language implementers have continually resorted to boxing as a quick fix for complications in language implementation. In object oriented languages, the default has been to always heap allocate every object even when it can obviously be stack allocated. The burden of efficiency was then pushed onto the garbage collector and a huge amount of effort has been put into building garbage collectors that can attain performance close to that of stack allocation, typically by using a bump-allocating nursery generation. I think that a lot more effort should be put into researching functional language designs that minimize boxing and garbage collector designs that are optimized for different requirements.
Generational garbage collectors are great for languages that heap allocate a lot because they can be almost as fast as stack allocation. But they add substantial overheads elsewhere. Today's programs are increasingly using data structures like queues (e.g. for concurrent programming) and these give pathological behaviour for generational garbage collectors. If the items in the queue outlive the first generation then they all get marked, then they all get copied ("evacuated"), then all of the references to their old locations get updated and then they become eligible for collection. This is about 3× slower than it needs to be (e.g. compared to C). Mark region collectors like Beltway (2002) and Immix (2008) have the potential to solve this problem because the nursery is replaced with a region that can either be collected as if it were a nursery or, if it contains mostly reachable values, it can be replaced with another region and left to age until it contains mostly unreachable values.
Despite the pre-existence of C++, the creators of Java made the mistake of adopting type erasure for generics, leading to unnecessary boxing. For example, I benchmarked a simple hash table running 17× faster on .NET than the JVM partly because .NET did not make this mistake (it uses reified generics) and also because .NET has value types. I actually blame Lisp for making Java slow.
All modern functional language implementations continue to box excessively. JVM-based languages like Clojure and Scala have little choice because the VM they target cannot even express value types. OCaml sheds type information early in its compilation process and resorts to tagged integers and boxing at run-time to handle polymorphism. Consequently, OCaml will often box individual floating point numbers and always boxes tuples. For example, a triple of bytes in OCaml is represented by a pointer (with an implicit 1-bit tag embedded in it that gets checked repeatedly at run-time) to a heap-allocated block with a 64 bit header and 192 bit body containing three tagged 63-bit integers (where the 3 tags are, again, repeatedly examined at run time!). This is clearly insane.
Some work has been done on unboxing optimizations in functional languages but it never really gained traction. For example, the MLton compiler for Standard ML was a whole-program optimizing compiler that did sophisticated unboxing optimizations. Sadly, it was before its time and the "long" compilation times (probably under 1s on a modern machine!) deterred people from using it.
The only major platform to have broken this trend is .NET but, amazingly, it appears to have been an accident. Despite having a Dictionary implementation very heavily optimized for keys and values that are of value types (because they are unboxed) Microsoft employees like Eric Lippert continue to claim that the important thing about value types is their pass-by-value semantics and not the performance characteristics that stem from their unboxed internal representation. Eric seems to have been proven wrong: more .NET developers seem to care more about unboxing than pass-by-value. Indeed, most structs are immutable and, therefore, referentially transparent so there is no semantic difference between pass-by-value and pass-by-reference. Performance is visible and structs can offer massive performance improvements. The performance of structs even saved Stack Overflow and structs are used to avoid GC latency in commercial software like Rapid Addition's!
The other reason for heavy allocation by functional languages is inherent. Imperative data structures like hash tables use huge monolithic arrays internally. If these were persistent then the huge internal arrays would need to be copied every time an update was made. So purely functional data structures like balanced binary trees are fragmented into many little heap-allocated blocks in order to facilitate reuse from one version of the collection to the next.
Clojure uses a neat trick to alleviate this problem when collections like dictionaries are only written to during initialization and are then read from a lot. In this case, the initialization can use mutation to build the structure "behind the scenes". However, this does not help with incremental updates and the resulting collections are still substantially slower to read than their imperative equivalents. On the up-side, purely functional data structures offer persistence whereas imperative ones do not. However, few practical applications benefit from persistence in practice so this is often not advantageous. Hence the desire for impure functional languages where you can drop to imperative style effortlessly and reap the benefits.
Does anyone know of a functional language appropriate for embedded / real-time applications, where memory allocation is kept to a minimum and the produced machine code is lean and fast?
Take a look at Erlang and OCaml if you haven't already. Both are reasonable for memory constrained systems but neither generate particularly great machine code.
Nothing is inherently anything. Here is an example where interpreted OCaml runs faster than equivalent C code, because the OCaml optimizer has different information available to it, due to differences in the language. Of course, it would be foolish to make a general claim that OCaml is categorically faster than C. The point is, it depends upon what you're doing, and how you do it.
That said, OCaml is an example of a (mostly) functional language which is actually designed for performance, in contrast to purity.
Functional languages require the elimination of mutable state that is visible at the level of the language abstraction. Therefore, data that would be mutated in place by an imperative language needs to be copied instead, with the mutation taking place on the copy. For a simple example, see a quick sort in Haskell vs. C.
Furthermore, garbage collection is required because free() is not a pure function, as it has side effects. Therefore, the only way to free memory that does not involve side effects at the level of the language abstraction is with garbage collection.
Of course, in principle, a sufficiently smart compiler could optimize out much of this copying. This is already done to some degree, but making the compiler sufficiently smart to understand the semantics of your code at that level is just plain hard.
The short answer: because C is fast. As in, blazingly ridiculously crazy fast. A language simply doesn't have to be 'slow' to get its rear handed to it by C.
The reason why C is fast is that it was created by really great coders, and gcc has been optimized over the course of a couple more decades and by dozens more brilliant coders than 99% of languages out there.
In short, you're not going to beat C except for specialized tasks that require very specific functional programming constructs.
The control flow of proceedural languages much better matches the actual processing patterns of modern computers.
C maps very closely onto the assembly code its compilation produces, hence the nickname "cross-platform assembly". Computer manufacturers have spent a few decades making assembly code run as fast as possible, so C inherits all of this raw speed.
In comparison, the no side-effects, inherent parallelism of functional languages does not map onto a single processor at all well. The arbitrary order in which functions can be invoked needs to be serialised down to the CPU bottleneck: without extremely clever compilation, you're going to be context switching all the time, none of the pre-fetching will work because you're constantly jumping all over the place, ... Basically, all the optimisation work that computer manufacturers have done for nice, predictable proceedural languages is pretty much useless.
However! With the move towards lots of less powerful cores (rather than one or two turbo-charged cores), functional languages should begin to close the gap, as they naturally scale horizontally.
C is fast because it's basically a set of macros for assembler :) There is no "behind the scene" when you are writing a program in C. You alloc memory when you decide it's time to do that and you free in the same fashion. This is a huge advantage when you are writing a real time application, where predictabily is important (more than anything else, actually).
Also, C compilers are generally extremly fast because language itself is simple. It even doesn't make any type checkings :) This also means that is easier to make hard to find errors.
Ad advantage with the lack of type checking is that a function name can just be exported with its name for example and this makes C code easy to link with other language's code
Well Haskell is only 1.8 times slower than GCC's C++, which is faster than GCC's C implementation for typical benchmark tasks.
That makes Haskell very fast, even faster than C#(Mono that is).
relative Language
speed
1.0 C++ GNU g++
1.1 C GNU gcc
1.2 ATS
1.5 Java 6 -server
1.5 Clean
1.6 Pascal Free Pascal
1.6 Fortran Intel
1.8 Haskell GHC
2.0 C# Mono
2.1 Scala
2.2 Ada 2005 GNAT
2.4 Lisp SBCL
3.9 Lua LuaJIT
source
For the record I use Lua for Games on the iPhone, thus you could easily use Haskell or Lisp if you prefer, since they are faster.
As for now, functional languages aren't used heavily for industry projects, so not enough serious work goes into optimizers. Also, optimizing imperative code for an imperative target is probably way easier.
Functional languages have one feat that will let them outdo imperative languages really soon now: trivial parallelization.
Trivial not in the sense that it is easy, but that it can be built into the language environment, without the developer needing to think about it.
The cost of robust multithreading in a thread-agnostic language like C is prohibitive for many projects.
I disagree with tuinstoel. The important question is whether the functional language provides a faster development time and results in faster code when it is used to what functional languages were meant to be used. See the efficiency issues section on Wikipedia for a glimpse of what I mean.
One more reason for bigger executable size could be lazy evaluation and non-strictness. The compiler can't figure out at compile-time when certain expressions get evaluated, so some runtime gets stuffed into the executable to handle this (to call upon the evaluation of the so-called thunks). As for performance, laziness can be both good and bad. On one hand it allows for additional potential optimization, on the other hand the code size can be larger and programmers are more likely to make bad decisions, e.g. see Haskell's foldl vs. foldr vs. foldl' vs. foldr'.

Resources