Efficiency of Streams in Scheme - performance

From what I've learned, using streams in large programs is far more efficient than using normal Lisp evaluation in DrRacket. So why isn't lazy evaluation the default in DrRacket? I wrote a timer procedure that measures how long a piece of work takes to complete, and in every complex program lazy evaluation was a lot faster.

AFAIK, using streams when you are doing something like sorting is a waste of cycles, since you need to have finished the sort in order to know the first element. If you have tasks that work like a sort, so that you need to evaluate the whole set anyway, you'll end up using more time than without streams. The reason is that the whole stream machinery has costs as well as benefits.
The benefit of streams is that computations are interleaved, so the program doesn't need to finish a whole loop before processing the first element. If you have n layers of stream processing, you win whenever your program finishes while the other layers still haven't produced the whole thing.
DrRacket is not a language but an IDE. Racket is both a language (#!racket as the first line of the source) and the name of the implementation that provides it.
Racket also supports #!lazy, which is a lazy version of Racket. Basically everything works just like streams do, everywhere. You get the same benefits and the same costs.
None of the mentioned languages is Scheme, but #!racket was based on, and is a superset of, #!r5rs. Since then there are #!r6rs and the newer #!r7rs. None of the official Scheme reports is lazy. The reason is that each had an eager predecessor, and making the language lazy would change it completely and break all backwards compatibility.
The innovation of Scheme in 1975 was lexical closures. Lazy evaluation by need was added in a later report (in the form of delay and force). Other languages, like Haskell, are built to be lazy from the ground up, and they have more advanced compilers that constant-fold and otherwise make the code snappy.
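As a rough sketch of the cost/benefit trade-off described above, here is how it looks in Haskell, which is lazy by default (the names below are made up for illustration):

    import Data.List (sort)

    -- Taking a prefix of a lazily produced pipeline only does the work
    -- that is actually demanded: three elements are mapped, no more.
    firstSquares :: [Int]
    firstSquares = take 3 (map (^ 2) [1 .. 1000000])

    -- Sorting, by contrast, has to look at every input element before the
    -- smallest one is known, so laziness buys very little here.
    smallestThree :: [Int]
    smallestThree = take 3 (sort [1000000, 999999 .. 1])

In eager Racket you get the same effect by building the pipeline out of streams; with #!lazy you get it everywhere.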

Related

Scheme and Lisp best practices: recursion yes for Scheme, no for Lisp?

As I understand -- and I'm here to be corrected if wrong -- good Scheme practice is to do anything requiring looping, repeating with recursion, and, furthermore, overflow won't be a problem because tail-recursion is built-in. Lisp, however, doesn't have protection from overflow, hence, all the loop iterative macros (loop, while, etc). And so in real-world Lisp you usually don't use recursion, while Scheme wants it all but exclusively.
If my assumptions are true, is there a way to be "pure" with Lisp and not risk overflow? Or is this too much swimming against the stream to use recursion in Lisp? I recall from The Little Schemer how they give you a thorough workout with recursion. And there was an earlier edition called The Little Lisper. Did it give you the same recursion workout in Lisp? And then Land of Lisp left me confused about whether loops or recursion was "best practice."
What I'm trying to do is decide whether to use Racket inside of Emacs Org-mode or just use built-in Elisp for beginner students. I want students to stay as purely functional as possible, e.g., I don't want to explain the very new and difficult topic of recursion, then say "Oh, but we won't be using it..."
As I understand -- and I'm here to be corrected if wrong -- good Scheme practice is to do anything requiring looping, repeating with recursion, and, furthermore, overflow won't be a problem because tail-recursion is built-in.
This is correct as far as I know.
Lisp, however, doesn't have protection from overflow [...]
Not exactly. Most self-respecting Common Lisp implementations provide tail-call merging (with some restrictions, see https://0branch.com/notes/tco-cl.html). The difference is that there is no requirement in the language specification to have it. That grants compiler writers more freedom when implementing various Common Lisp features. Emacs Lisp does not have TCO, except in the form of libraries like recur or tco (self-recursion).
... hence, all the loop iterative macros (loop, while, etc). And so in real-world Lisp you usually don't use recursion, while Scheme wants it all but exclusively.
The difference is mostly cultural. Take a REPL (Read-Eval-Print Loop). I think it makes more sense to consider the interaction as a loop rather than as an infinitely tail-recursive function. Somehow, there seems to be some reluctance in purely functional languages to even consider loops as a primitive control structure, whereas iterative processes are considered more elegant. Why not have both?
If my assumptions are true, is there a way to be "pure" with Lisp and not risk overflow? Or is this too much swimming against the stream to use recursion in Lisp?
You can certainly use recursion in Lisp, provided you do not abuse it, which should not be the case with small programs. Consider for example that map in OCaml is not tail-recursive, but people still use it regularly. If you use SBCL, there is a section in the manual that explains how to enforce tail-call elimination.
What I'm trying to do is decide whether to use Racket inside of Emacs Org-mode or just use built-in Elisp for beginner students. I want students to stay as purely functional as possible, e.g., I don't want to explain the very new and difficult topic of recursion, then say "Oh, but we won't be using it..."
If you want to teach functional programming, use a more functional language. In other words, between Racket and Emacs Lisp, I'd say Racket is more appropriate for students. There are more materials for teaching functional programming with Racket, and there is also Typed Racket.
There are a lot of differences between the typical Lisp dialects and the various Scheme dialects:
Scheme
requires tail call optimization
favors tail recursion over imperative loop constructs
dislikes imperative control flow
provides functional abstractions
tries to map control structures to functions (see for example CALL-WITH-CURRENT-CONTINUATION)
implements some looping macros as extensions
Lisp
Parts of this apply to Lisp dialects like Emacs Lisp, Common Lisp or ISLisp.
has low-level control flow, for example GOTO-like constructs: don't use them in user-level code; they are sometimes useful in macros.
does not standardize or even provide tail call optimization: for maximum portability don't require TCO in your code
provides simple macros: DO, DOTIMES, DOLIST: use them where directly applicable
provides complex macros: LOOP, or ITERATE as a library
provides functional abstractions like MAP, MAPCAR, REDUCE
Implementations typically have hard limits on stack sizes, which makes non-TCO recursive functions problematic. Typically one can set a large stack size upfront or extend the stack size at runtime.
Also, tail-call optimization has some incompatibilities with dynamic-scope constructs such as special variables and the like.
For TCO support in Common Lisp see: Tail Call Optimisation in Common Lisp Implementations
Style
Some prefer recursive functions, but I would not use them as a general rule. Generally I (and that's my personal opinion) prefer explicit loop constructs over general recursive calls. Others prefer the more mathematical approach of recursive functions.
If there is a higher-order function like MAP, use that.
If the loop code gets more complex, using a powerful LOOP statement can be useful
In Racket the looping construct of choice is for.
Racket (like Scheme) also has TCO, so in principle you can write all looping constructs using function calls - it just isn't very convenient.
There is no Tail Call Optimisation (TCO) in Emacs Lisp, so even though I personally like recursion very much, I'd generally recommend against its use in Elisp (except, of course, in those cases where it's the only natural choice).
Elisp as a programming environment is fairly good, but I must agree with "coredump" and would recommend Racket instead, which has great support for beginners.

Examples where compiler-optimized functional code performs better than imperative code

One of the promises of side-effect free, referentially transparent functional programming is that such code can be extensively optimized.
To quote Wikipedia:
Immutability of data can, in many cases, lead to execution efficiency, by allowing the compiler to make assumptions that are unsafe in an imperative language, thus increasing opportunities for inline expansion.
I'd like to see examples where a functional language compiler outperforms an imperative one by producing a better optimized code.
Edit: I tried to give a specific scenario, but apparently it wasn't a good idea. So I'll try to explain it in a different way.
Programmers translate ideas (algorithms) into languages that machines can understand. At the same time, one of the most important aspects of the translation is that humans can also understand the resulting code. Unfortunately, in many cases there is a trade-off: concise, readable code suffers from slow performance and needs to be manually optimized. This is error-prone and time consuming, and it makes the code less readable (up to totally unreadable).
The foundations of functional languages, such as immutability and referential transparency, allow compilers to perform extensive optimizations, which could replace manual optimization of code and free programmers from this trade-off. I'm looking for examples of ideas (algorithms) and their implementations, such that:
the (functional) implementation is close to the original idea and is easy to understand,
it is extensively optimized by the compiler of the language, and
it is hard (or impossible) to write similarly efficient code in an imperative language without manual optimizations that reduce its conciseness and readability.
I apologize if it is a bit vague, but I hope the idea is clear. I don't want to give unnecessary restrictions on the answers. I'm open to suggestions if someone knows how to express it better.
My interest isn't just theoretical. I'd like to use such examples (among other things) to motivate students to get interested in functional programming.
At first, I wasn't satisfied by a few examples suggested in the comments. On second thoughts I take my objections back, those are good examples. Please feel free to expand them to full answers so that people can comment and vote for them.
(One class of such examples will most likely be parallelized code, which can take advantage of multiple CPU cores. Often in functional languages this can be done easily without sacrificing code simplicity (like in Haskell by adding par or pseq in appropriate places). I'd be interested in such examples too, but also in other, non-parallel ones.)
There are cases where the same algorithm will optimize better in a pure context. Specifically, stream fusion allows an algorithm that consists of a sequence of loops of widely varying form (maps, filters, folds, unfolds) to be composed into a single loop.
The equivalent optimization in a conventional imperative setting, with mutable data in loops, would have to achieve a full effect analysis, which no one does.
So at least for the class of algorithms that are implemented as pipelines of ana- and catamorphisms on sequences, you can guarantee optimization results that are not possible in an imperative setting.
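For a concrete sketch of the kind of pipeline this applies to (using the vector library; with fusion enabled under optimization, the three traversals below are typically compiled into a single loop with no intermediate vectors):

    import qualified Data.Vector.Unboxed as V

    -- A filter, a map and a fold written as separate passes.  Stream
    -- fusion rewrites the composition into one loop over the input.
    pipeline :: Int -> Int
    pipeline n = V.sum (V.map (* 2) (V.filter even (V.enumFromTo 1 n)))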
A very recent paper, Haskell beats C using generalised stream fusion by Geoff Mainland, Simon Peyton Jones, Simon Marlow and Roman Leshchinskiy (submitted to ICFP 2013), describes such an example. Abstract:
Stream fusion [6] is a powerful technique for automatically transforming high-level sequence-processing functions into efficient implementations. It has been used to great effect in Haskell libraries for manipulating byte arrays, Unicode text, and unboxed vectors. However, some operations, like vector append, still do not perform well within the standard stream fusion framework. Others, like SIMD computation using the SSE and AVX instructions available on modern x86 chips, do not seem to fit in the framework at all.
In this paper we introduce generalized stream fusion, which solves these issues. The key insight is to bundle together multiple stream representations, each tuned for a particular class of stream consumer. We also describe a stream representation suited for efficient computation with SSE instructions. Our ideas are implemented in modified versions of the GHC compiler and vector library. Benchmarks show that high-level Haskell code written using our compiler and libraries can produce code that is faster than both compiler- and hand-vectorized C.
This is just a note, not an answer: GCC has a pure function attribute, suggesting that it can take purity into account; the obvious reasons are remarked on in the manual.
I would think that 'static single assignment' imposes a form of purity -- see the links at http://lambda-the-ultimate.org/node/2860 or the wikipedia article.
make and various build systems perform better for large projects by assuming that various build steps are referentially transparent; as such, they only need to rerun steps whose inputs have changed.
For small to medium sized changes, this can be a lot faster than building from scratch.

Is there a functional algorithm which is faster than an imperative one?

I'm searching for an algorithm (or an argument for such an algorithm) in functional style which is faster than an imperative one.
I like functional code because it's expressive and mostly easier to read than its imperative counterparts. But I also know that this expressiveness can cost runtime overhead. Thanks to techniques like tail recursion it isn't always slower, but it often is.
While programming I don't think much about the runtime costs of functional code, because nowadays PCs are very fast and development time is more expensive than runtime. Furthermore, for me readability is more important than performance. Nevertheless my programs are fast enough, so I rarely need to solve a problem in an imperative way.
There are some algorithms which in practice should be implemented in an imperative style (like sorting algorithms), because otherwise, in most cases, they are too slow or require lots of memory.
In contrast, thanks to techniques like pattern matching, a whole program like a parser written in a functional language may be much faster than one written in an imperative language, because of the compiler's ability to optimize the code.
But are there any algorithms which are faster in a functional style, or are there ways to argue that such an algorithm exists?
A simple reasoning. I don't vouch for terminology, but it seems to make sense.
A functional program, to be executed, will need to be transformed into some set of machine instructions.
All machines (I've heard of) are imperative.
Thus, for every functional program, there's an imperative program (roughly speaking, in assembler language), equivalent to it.
So, you'll probably have to be satisfied with 'expressiveness', until we get 'functional computers'.
The short answer:
Anything that can be easily made parallel because it's free of side-effects will be quicker on a multi-core processor.
QuickSort, for example, scales up quite nicely when used with immutable collections: http://en.wikipedia.org/wiki/Quicksort#Parallelization
All else being equal, if you have two algorithms that can reasonably be described as equivalent, except that one uses pure functions on immutable data, while the second relies on in-place mutations, then the first algorithm will scale up to multiple cores with ease.
It may even be the case that your programming language can perform this optimization for you, as with the scalaCL plugin that will compile code to run on your GPU. (I'm wondering now if SIMD instructions make this a "functional" processor)
So given parallel hardware, the first algorithm will perform better, and the more cores you have, the bigger the difference will be.
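A minimal sketch of the Haskell version of this idea, using par and pseq from the parallel package (run with GHC's -threaded runtime; the function name is made up):

    import Control.Parallel (par, pseq)

    -- Because both halves are pure, evaluating them on different cores
    -- cannot change the result, only how fast it arrives.
    parSum :: [Int] -> Int
    parSum xs = left `par` (right `pseq` (left + right))
      where
        (as, bs) = splitAt (length xs `div` 2) xs
        left     = sum as
        right    = sum bs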
FWIW there are Purely functional data structures, which benefit from functional programming.
There's also a nice book on Purely Functional Data Structures by Chris Okasaki, which presents data structures from the point of view of functional languages.
Another interesting article, Announcing Intel Concurrent Collections for Haskell 0.1, is about parallel programming; in it they note:
Well, it happens that the CnC notion of a step is a pure function. A step does nothing but read its inputs and produce tags and items as output. This design was chosen to bring CnC to that elusive but wonderful place called deterministic parallelism. The decision had nothing to do with language preferences. (And indeed, the primary CnC implementations are for C++ and Java.)
Yet what a great match Haskell and CnC would make! Haskell is the only major language where we can (1) enforce that steps be pure, and (2) directly recognize (and leverage!) the fact that both steps and graph executions are pure.
Add to that the fact that Haskell is wonderfully extensible and thus the CnC "library" can feel almost like a domain-specific language.
It doesn't say anything about performance (they promise to discuss some of the implementation details and performance in future posts), but Haskell, with its "pureness", fits nicely into parallel programming.
One could argue that all programs boil down to machine code.
So, if I disassemble the machine code (of an imperative program) and tweak the assembler, I could perhaps end up with a faster program. Or I could come up with an "assembler algorithm" that exploits some specific CPU feature, and therefore it really is faster than the imperative-language version.
Does this situation lead to the conclusion that we should use assembler everywhere? No, we decided to use imperative languages because they are less cumbersome. We write pieces in assembler because we really need to.
Ideally we should also use FP algorithms because they are less cumbersome to code, and use imperative code when we really need to.
Well, I guess you meant to ask if there is an implementation of an algorithm in a functional programming language that is faster than another implementation of the same algorithm, but in an imperative language. By "faster" I mean that it performs better in terms of execution time or memory footprint on some inputs, according to some measurement that we deem trustworthy.
I do not exclude this possibility. :)
To elaborate on Yasir Arsanukaev's answer, purely functional data structures can be faster than mutable data structures in some situations because they share pieces of their structure. Thus, in places where you might have to copy a whole array or list in an imperative language, you can get away with a fraction of the copying, because you can change (and copy) only a small part of the data structure. Lists in functional languages are like this: multiple lists can share the same tail, since nothing can be modified. (This can be done in imperative languages, but usually isn't, because within the imperative paradigm people aren't used to talking about immutable data.)
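A small sketch of that sharing in Haskell (the names are illustrative):

    -- Both lists are built on top of the same tail.  Nothing is copied,
    -- and the sharing is safe precisely because neither list can ever be
    -- mutated afterwards.
    common :: [Int]
    common = [3, 4, 5]

    withOne, withTwo :: [Int]
    withOne = 1 : common
    withTwo = 2 : common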
Lazy evaluation in functional languages (particularly Haskell, which is lazy by default) can also be very advantageous, because it can skip executing code whose results won't actually be used. (One can be very careful not to run such code in the first place in imperative languages, however.)

How to "debug" Haskell with printfs?

Coming from the OCaml community, I'm trying to learn a bit of Haskell. The transition goes quite well, but I'm a bit confused about debugging. I used to put (lots of) "printf"s in my OCaml code to inspect intermediate values, or as flags to see where exactly the computation failed.
Since printf is an IO action, do I have to lift all my Haskell code into the IO monad to be able to do this kind of debugging? Or is there a better way to do it? (I really don't want to do it by hand if it can be avoided.)
I also found the trace function:
http://www.haskell.org/haskellwiki/Debugging#Printf_and_friends
which seems to be exactly what I want, but I don't understand its type: there is no IO anywhere!
Can someone explain the behaviour of the trace function?
trace is the easiest method to use for debugging. It's not in IO exactly for the reason you pointed out: there is no need to lift your code into the IO monad. It's implemented like this:
    trace :: String -> a -> a
    trace string expr = unsafePerformIO $ do
        putTraceMsg string
        return expr
So there is IO behind the scenes, but unsafePerformIO is used to escape out of it. That's a function which potentially breaks referential transparency, as you can guess from its type, IO a -> a, and also from its name.
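In practice you import it from Debug.Trace and drop it into otherwise pure code; the message is printed (to stderr) when, and only when, the traced expression is actually evaluated. A sketch:

    import Debug.Trace (trace)

    -- Each recursive call logs its argument before returning.
    fib :: Int -> Int
    fib n
      | n < 2     = n
      | otherwise = trace ("fib " ++ show n) (fib (n - 1) + fib (n - 2))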
trace is simply made impure. The point of the IO monad is to preserve purity (no IO unnoticed by the type system) and define the order of execution of statements, which would otherwise be practically undefined through lazy evaluation.
At your own risk, however, you can nevertheless hack together some IO a -> a, i.e. perform impure IO. This is a hack and of course "suffers" from lazy evaluation, but that's exactly what trace does for the sake of debugging.
Nevertheless, you should probably use other approaches for debugging:
Reducing the need for debugging intermediate values
Write small, reusable, clear, generic functions whose correctness is obvious.
Combine the correct pieces to greater correct pieces.
Write tests or try out pieces interactively.
Use breakpoints etc. (compiler-based debugging)
Use generic monads. If your code is monadic anyway, write it independently of a concrete monad. Use type M a = ... instead of plain IO .... You can afterwards easily combine monads through transformers and put a debugging monad on top. Even if the need for monads is gone, you could just insert Identity a for pure values.
For what it's worth, there are actually two kinds of "debugging" at issue here:
Logging intermediate values, such as the value a particular subexpression has on each call into a recursive function
Inspecting the runtime behavior of the evaluation of an expression
In a strict imperative language these usually coincide. In Haskell, they often do not:
Recording intermediate values can change the runtime behavior, such as by forcing the evaluation of terms that would otherwise be discarded.
The actual process of computation can dramatically differ from the apparent structure of an expression due to laziness and shared subexpressions.
If you just want to keep a log of intermediate values, there are many ways to do so--for instance, rather than lifting everything into IO, a simple Writer monad will suffice, this being equivalent to making functions return a 2-tuple of their actual result and an accumulator value (some sort of list, typically).
It's also not usually necessary to put everything into the monad, only the functions that need to write to the "log" value--for instance, you can factor out just the subexpressions that might need to do logging, leaving the main logic pure, then reassemble the overall computation by combining pure functions and logging computations in the usual manner with fmaps and whatnot. Keep in mind that Writer is kind of a sorry excuse for a monad: with no way to read from the log, only write to it, each computation is logically independent of its context, which makes it easier to juggle things around.
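A minimal sketch with the Writer from the mtl package (the step function here is made up):

    import Control.Monad.Writer (Writer, runWriter, tell)

    -- Each step returns its result and appends its log messages; inside
    -- the computation the log is only ever written to, never read.
    step :: Int -> Writer [String] Int
    step x = do
        tell ["step saw " ++ show x]
        return (x * 2)

    runPipeline :: Int -> (Int, [String])
    runPipeline x = runWriter (step x >>= step)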
But in some cases even that's overkill--for many pure functions, just moving subexpressions to the toplevel and trying things out in the REPL works pretty well.
If you want to actually inspect run-time behavior of pure code, however--for instance, to figure out why a subexpression diverges--there is in general no way to do so from other pure code--in fact, this is essentially the definition of purity. So in that case, you have no choice but to use tools that exist "outside" the pure language: either impure functions such as unsafePerformPrintfDebugging--errr, I mean trace--or a modified runtime environment, such as the GHCi debugger.
trace also tends to over-evaluate its argument for printing, losing a lot of the benefits of laziness in the process.
If you can wait until the program is finished before studying the output, then stacking a Writer monad is the classic approach to implementing a logger. I use this here to return a result set from impure HDBC code.
Well, since the whole of Haskell is built around the principle of lazy evaluation (so that the order of calculations is in fact non-deterministic), the use of printfs makes very little sense in it.
If the REPL plus inspecting resulting values is really not enough for your debugging, then wrapping everything in IO is the only choice (but it's not THE RIGHT WAY of Haskell programming).

Are functional languages inherently slow? [closed]

Why are functional languages always trailing behind C in benchmarks? If you have a statically typed functional language, it seems to me it could be compiled to the same code as C, or to even more optimized code, since more semantics are available to the compiler. Why does it seem like all functional languages are slower than C, and why do they always need garbage collection and excessive use of the heap?
Does anyone know of a functional language appropriate for embedded / real-time applications, where memory allocation is kept to a minimum and the produced machine code is lean and fast?
Are functional languages inherently slow?
In some sense, yes. They require infrastructure that inevitably adds overheads over what can theoretically be attained using assembler by hand. In particular, first-class lexical closures only work well with garbage collection because they allow values to be carried out of scope.
Why are functional languages always trailing behind C in benchmarks?
Firstly, beware of selection bias. C acts as a lowest common denominator in benchmark suites, limiting what can be accomplished. If you have a benchmark comparing C with a functional language then it is almost certainly an extremely simple program. Arguably so simple that it is of little practical relevance today. It is not practically feasible to solve more complicated problems using C for a mere benchmark.
The most obvious example of this is parallelism. Today, we all have multicores. Even my phone is a multicore. Multicore parallelism is notoriously difficult in C but can be easy in functional languages (I like F#). Other examples include anything that benefits from persistent data structures, e.g. undo buffers are trivial with purely functional data structures but can be a huge amount of work in imperative languages like C.
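A sketch of the undo-buffer point in Haskell (all the names here are invented for illustration): because old versions can never be mutated, keeping them around is free.

    -- Every edit remembers the previous version; undo just pops it.
    -- No snapshotting or copying is required, since the old value is
    -- guaranteed to stay valid.
    data Editor a = Editor { current :: a, history :: [a] }

    edit :: (a -> a) -> Editor a -> Editor a
    edit f (Editor d h) = Editor (f d) (d : h)

    undo :: Editor a -> Editor a
    undo e@(Editor _ [])    = e
    undo (Editor _ (d : h)) = Editor d h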
Why does it seem like all functional languages are slower than C, and why do they always need garbage collection and excessive use of the heap?
Functional languages will seem slower because you'll only ever see benchmarks comparing code that is easy enough to write well in C and you'll never see benchmarks comparing meatier tasks where functional languages start to excel.
However, you've correctly identified what is probably the single biggest bottleneck in functional languages today: their excessive allocation rates. Nice work!
The reasons why functional languages allocate so heavily can be split into historical and inherent reasons.
Historically, Lisp implementations have been doing a lot of boxing for 50 years now. This characteristic spread to many other languages which use Lisp-like intermediate representations. Over the years, language implementers have continually resorted to boxing as a quick fix for complications in language implementation. In object oriented languages, the default has been to always heap allocate every object even when it can obviously be stack allocated. The burden of efficiency was then pushed onto the garbage collector and a huge amount of effort has been put into building garbage collectors that can attain performance close to that of stack allocation, typically by using a bump-allocating nursery generation. I think that a lot more effort should be put into researching functional language designs that minimize boxing and garbage collector designs that are optimized for different requirements.
Generational garbage collectors are great for languages that heap allocate a lot because they can be almost as fast as stack allocation. But they add substantial overheads elsewhere. Today's programs are increasingly using data structures like queues (e.g. for concurrent programming) and these give pathological behaviour for generational garbage collectors. If the items in the queue outlive the first generation then they all get marked, then they all get copied ("evacuated"), then all of the references to their old locations get updated and then they become eligible for collection. This is about 3× slower than it needs to be (e.g. compared to C). Mark region collectors like Beltway (2002) and Immix (2008) have the potential to solve this problem because the nursery is replaced with a region that can either be collected as if it were a nursery or, if it contains mostly reachable values, it can be replaced with another region and left to age until it contains mostly unreachable values.
Despite the pre-existence of C++, the creators of Java made the mistake of adopting type erasure for generics, leading to unnecessary boxing. For example, I benchmarked a simple hash table running 17× faster on .NET than the JVM partly because .NET did not make this mistake (it uses reified generics) and also because .NET has value types. I actually blame Lisp for making Java slow.
All modern functional language implementations continue to box excessively. JVM-based languages like Clojure and Scala have little choice because the VM they target cannot even express value types. OCaml sheds type information early in its compilation process and resorts to tagged integers and boxing at run-time to handle polymorphism. Consequently, OCaml will often box individual floating point numbers and always boxes tuples. For example, a triple of bytes in OCaml is represented by a pointer (with an implicit 1-bit tag embedded in it that gets checked repeatedly at run-time) to a heap-allocated block with a 64 bit header and 192 bit body containing three tagged 63-bit integers (where the 3 tags are, again, repeatedly examined at run time!). This is clearly insane.
Some work has been done on unboxing optimizations in functional languages but it never really gained traction. For example, the MLton compiler for Standard ML was a whole-program optimizing compiler that did sophisticated unboxing optimizations. Sadly, it was before its time and the "long" compilation times (probably under 1s on a modern machine!) deterred people from using it.
The only major platform to have broken this trend is .NET but, amazingly, it appears to have been an accident. Despite having a Dictionary implementation very heavily optimized for keys and values that are of value types (because they are unboxed) Microsoft employees like Eric Lippert continue to claim that the important thing about value types is their pass-by-value semantics and not the performance characteristics that stem from their unboxed internal representation. Eric seems to have been proven wrong: more .NET developers seem to care more about unboxing than pass-by-value. Indeed, most structs are immutable and, therefore, referentially transparent so there is no semantic difference between pass-by-value and pass-by-reference. Performance is visible and structs can offer massive performance improvements. The performance of structs even saved Stack Overflow and structs are used to avoid GC latency in commercial software like Rapid Addition's!
The other reason for heavy allocation by functional languages is inherent. Imperative data structures like hash tables use huge monolithic arrays internally. If these were persistent then the huge internal arrays would need to be copied every time an update was made. So purely functional data structures like balanced binary trees are fragmented into many little heap-allocated blocks in order to facilitate reuse from one version of the collection to the next.
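A small illustration with the containers library's Data.Map (a balanced-tree structure of exactly this fragmented kind):

    import qualified Data.Map.Strict as M

    -- Inserting produces a new map that shares most of its tree with the
    -- old one; the old map remains valid and unchanged.
    before, after :: M.Map Int String
    before = M.fromList [(1, "one"), (2, "two"), (3, "three")]
    after  = M.insert 4 "four" before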
Clojure uses a neat trick to alleviate this problem when collections like dictionaries are only written to during initialization and are then read from a lot. In this case, the initialization can use mutation to build the structure "behind the scenes". However, this does not help with incremental updates and the resulting collections are still substantially slower to read than their imperative equivalents. On the up-side, purely functional data structures offer persistence whereas imperative ones do not. However, few practical applications benefit from persistence in practice so this is often not advantageous. Hence the desire for impure functional languages where you can drop to imperative style effortlessly and reap the benefits.
Does anyone know of a functional language appropriate for embedded / real-time applications, where memory allocation is kept to a minimum and the produced machine code is lean and fast?
Take a look at Erlang and OCaml if you haven't already. Both are reasonable for memory constrained systems but neither generate particularly great machine code.
Nothing is inherently anything. Here is an example where interpreted OCaml runs faster than equivalent C code, because the OCaml optimizer has different information available to it, due to differences in the language. Of course, it would be foolish to make a general claim that OCaml is categorically faster than C. The point is, it depends upon what you're doing, and how you do it.
That said, OCaml is an example of a (mostly) functional language which is actually designed for performance, in contrast to purity.
Functional languages require the elimination of mutable state that is visible at the level of the language abstraction. Therefore, data that would be mutated in place by an imperative language needs to be copied instead, with the mutation taking place on the copy. For a simple example, see a quick sort in Haskell vs. C.
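The often-quoted Haskell list "quicksort" makes the point: it reads almost like the algorithm's description, but it allocates fresh lists at every step instead of partitioning an array in place.

    -- Concise, but every level of recursion builds new lists rather than
    -- swapping elements inside one mutable array.
    qsort :: Ord a => [a] -> [a]
    qsort []       = []
    qsort (p : xs) = qsort [x | x <- xs, x < p] ++ [p] ++ qsort [x | x <- xs, x >= p]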
Furthermore, garbage collection is required because free() is not a pure function, as it has side effects. Therefore, the only way to free memory that does not involve side effects at the level of the language abstraction is with garbage collection.
Of course, in principle, a sufficiently smart compiler could optimize out much of this copying. This is already done to some degree, but making the compiler sufficiently smart to understand the semantics of your code at that level is just plain hard.
The short answer: because C is fast. As in, blazingly ridiculously crazy fast. A language simply doesn't have to be 'slow' to get its rear handed to it by C.
The reason why C is fast is that it was created by really great coders, and gcc has been optimized over the course of a couple more decades and by dozens more brilliant coders than 99% of languages out there.
In short, you're not going to beat C except for specialized tasks that require very specific functional programming constructs.
The control flow of procedural languages matches the actual processing patterns of modern computers much better.
C maps very closely onto the assembly code its compilation produces, hence the nickname "cross-platform assembly". Computer manufacturers have spent a few decades making assembly code run as fast as possible, so C inherits all of this raw speed.
In comparison, the side-effect-free, inherently parallel nature of functional languages does not map onto a single processor at all well. The arbitrary order in which functions can be invoked needs to be serialised down to the CPU bottleneck: without extremely clever compilation, you're going to be context switching all the time, none of the pre-fetching will work because you're constantly jumping all over the place... Basically, all the optimisation work that computer manufacturers have done for nice, predictable procedural languages is pretty much useless.
However! With the move towards lots of less powerful cores (rather than one or two turbo-charged cores), functional languages should begin to close the gap, as they naturally scale horizontally.
C is fast because it's basically a set of macros for assembler :) There is nothing "behind the scenes" when you are writing a program in C. You allocate memory when you decide it's time to do that, and you free it in the same fashion. This is a huge advantage when you are writing a real-time application, where predictability is important (more than anything else, actually).
Also, C compilers are generally extremely fast because the language itself is simple. It doesn't even do much type checking :) This also means that it is easier to make hard-to-find errors.
An advantage of the lack of type checking is that a function can simply be exported by its name, for example, and this makes C code easy to link with code in other languages.
Well, Haskell is only 1.8 times slower than GCC's C++, which is faster than GCC's C implementation for typical benchmark tasks.
That makes Haskell very fast, even faster than C# (Mono, that is).
relative speed   Language
1.0              C++ GNU g++
1.1              C GNU gcc
1.2              ATS
1.5              Java 6 -server
1.5              Clean
1.6              Pascal Free Pascal
1.6              Fortran Intel
1.8              Haskell GHC
2.0              C# Mono
2.1              Scala
2.2              Ada 2005 GNAT
2.4              Lisp SBCL
3.9              Lua LuaJIT
source
For the record I use Lua for Games on the iPhone, thus you could easily use Haskell or Lisp if you prefer, since they are faster.
As of now, functional languages aren't used heavily for industry projects, so not enough serious work goes into their optimizers. Also, optimizing imperative code for an imperative target is probably way easier.
Functional languages have one feat that will let them outdo imperative languages really soon now: trivial parallelization.
Trivial not in the sense that it is easy, but that it can be built into the language environment, without the developer needing to think about it.
The cost of robust multithreading in a thread-agnostic language like C is prohibitive for many projects.
I disagree with tuinstoel. The important question is whether the functional language provides faster development time and results in faster code when it is used for what functional languages are meant to be used for. See the efficiency issues section on Wikipedia for a glimpse of what I mean.
One more reason for a bigger executable size could be lazy evaluation and non-strictness. The compiler can't figure out at compile time when certain expressions will be evaluated, so some runtime support gets stuffed into the executable to handle this (to force the evaluation of the so-called thunks). As for performance, laziness can be both good and bad. On one hand it allows for additional potential optimization; on the other hand the code size can be larger and programmers are more likely to make bad decisions, e.g. see Haskell's foldl vs. foldr vs. foldl' vs. foldr'.
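A sketch of that last point:

    import Data.List (foldl')

    -- foldl defers every addition, building a long chain of thunks that
    -- can exhaust memory on a large list; foldl' forces the accumulator
    -- at each step and runs in constant space.
    lazySum, strictSum :: [Int] -> Int
    lazySum   = foldl  (+) 0
    strictSum = foldl' (+) 0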

Resources