I can't find any info on this online... I am also new to Prolog...
It seems to me that Prolog could be highly concurrent, perhaps trying many possibilities at once when trying to match a rule. Are modern Prolog compilers/interpreters inherently* concurrent? Which ones? Is concurrency on by default? Do I need to enable it somehow?
* I am not interested in multi-threading, just inherent concurrency.
Are modern Prolog compilers/interpreters inherently* concurrent? Which ones? Is concurrency on by default?
No. Concurrent logic programming was the main aim of the Fifth Generation Computer Systems project in Japan in the 1980s; it was expected that Prolog variants would be "easily" parallelized on massively parallel hardware. The effort largely failed, because automatic concurrency just isn't easy. Today, Prolog compilers tend to offer threading libraries instead, with which the programmer must control the amount of concurrency by hand.
To see why parallelizing Prolog is as hard as it is in any other language, consider the two main control structures the language offers: conjunction (AND, serial execution) and disjunction (OR, choice with backtracking). Let's say you have an AND construct such as
p(X) :- q(X), r(X).
and you'd want to run q(X) and r(X) in parallel. Then what happens if q partially instantiates X, say by binding it to f(Y)? r must have knowledge of this binding, so either you've got to communicate it, or you have to wait for both conjuncts to complete; in the latter case you may have wasted time if one of them fails, unless you, again, have them communicate in order to synchronize. That adds overhead and is hard to get right. Now for OR:
p(X) :- q(X).
p(X) :- r(X).
There's a finite number of choices here (in general, of course, Prolog admits infinitely many choices), so you'd want to run both of them in parallel. But then, what if one succeeds? The other branch of the computation must be suspended and its state saved. How many of these states are you going to save at once? As many as there are processors seems reasonable, but then you have to take care that computations don't create states that don't fit in memory. That means you have to guess how large the state of a computation is, something that Prolog hides from you, since it abstracts over implementation details such as processors and memory; it's not C.
In other words, automatic parallelization is hard. The 5th Gen. Computer project got around some of the issues by designing committed-choice languages, i.e., Prolog dialects without backtracking. In doing so, they drastically changed the language. It must be noted that the concurrent language Erlang is an offshoot of Prolog, and it too has traded backtracking for something that is closer to functional programming. It still requires user guidance to know which parts of a program can safely be run concurrently.
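To get a feel for what that committed-choice trade amounts to in a modern setting, here is a minimal, hedged sketch in Haskell using the async package's race (firstOf is a hypothetical helper, not anything from those systems): both alternatives run in parallel, the first answer wins, and the loser is cancelled rather than suspended for later backtracking.

import Control.Concurrent.Async (race)

-- Committed choice in miniature: run both alternatives concurrently,
-- keep whichever answers first, and cancel the loser instead of
-- saving its state for later backtracking.
firstOf :: IO a -> IO a -> IO a
firstOf q r = either id id <$> race q r

Giving up the saved state is precisely what makes this cheap, and precisely what makes it no longer Prolog.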
In theory that seems attractive, but there are various problems that make such an implementation seem unwise.
For better or worse, people are used to thinking of their programs as executing left-to-right and top-down, even when programming in Prolog. Both the order of clauses for a predicate and the order of goals within a clause are semantically meaningful in standard Prolog. Parallelizing them would change the behaviour of far too much existing code to become popular.
Non-relational language elements such as the cut operator can only be used meaningfully when you can rely on such execution orders, i.e., they would become unusable in a parallel interpreter unless very complicated dependency tracking were invented.
All existing parallelization solutions incur at least some performance overhead for inter-thread communication.
Prolog is typically used for high-level, deeply recursive problems such as graph traversal, theorem proving, etc. Parallelization on modern machines can (ideally) achieve a speedup of n for some constant n, but it cannot turn an unviable recursive solution method into a viable one, because that would require an exponential speedup. The numerical problems that Fortran and C programmers usually solve typically have a high but quite finite cost of computation; it is well worth the effort of parallelization to turn a 10-hour job into a 1-hour job. In contrast, turning a program that can look about 6 moves ahead into one that can (on average) look 6.5 moves ahead just isn't as compelling.
There are two notions of concurrency in Prolog. One is tied to multithreading, the other to suspended goals. I am not sure which you want to know about, so I will first expand a little on multithreading:
Widely available Prolog systems today can be divided into those that are multithreaded and those that are not. In a multithreaded Prolog system you can spawn multiple threads that run concurrently over the same knowledge base. This poses some problems for consult and for dynamic predicates, which these Prolog systems solve.
You can find a list of the Prolog systems that are multithreaded here:
Operating system and Web-related features
Multithreading is a prerequisite for various parallelization paradigms. Correspondingly, the individual Prolog systems provide constructs that serve certain paradigms. Typical paradigms are thread pooling, for example as used in web servers, or spawning a thread for a long-running GUI task.
Currently there is no ISO standard for a thread library, although there has been a proposal, and each Prolog system typically has rich libraries that provide thread synchronization, thread communication, thread debugging, and foreign-code threads. Progress in garbage collection in Prolog systems was necessary to allow threaded applications with potentially infinitely long-running threads.
Some existing layers even allow high-level parallelization paradigms in a Prolog-system-independent fashion. For example, Logtalk has some constructs that map to various target Prolog systems.
Now let's turn to suspended goals. From older Prolog systems (since Prolog II, 1982, in fact) we know the freeze/2 predicate and block directives. These constructs force a goal not to be expanded by existing clauses, but to be put on a sleeping list instead. The goal can then later be woken up. Since the execution of the goal is not immediate, but happens only when it is woken up, suspended goals are sometimes seen as concurrent goals, but the better notion for this form of parallelism would be coroutines.
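For readers coming from other languages, a rough analogue of such a suspended goal can be sketched in Haskell, where an empty MVar plays the role of the unbound variable (this is illustrative only, not how any Prolog system implements freeze/2):

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, readMVar)

main :: IO ()
main = do
  x <- newEmptyMVar                 -- an "unbound variable"
  _ <- forkIO $ do
        v <- readMVar x             -- the goal suspends here
        putStrLn ("woken up, x = " ++ show (v :: Int))
  putStrLn "doing other work while the goal sleeps"
  putMVar x 42                      -- binding x wakes the goal
  threadDelay 100000                -- give the woken thread time to print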
Suspended goals are useful for implementing constraint solving systems. In the simplest case, the sleeping list is an attribute of a variable. But a newer approach for constraint solving systems is constraint handling rules, where the wake-up conditions can be patterns over pairs of suspended goals. The availability of constraint solving, either via suspended goals or via constraint handling rules, can be seen here:
Overview of Prolog Systems
From a quick Google search it appears that the concurrent logic programming paradigm has only been the basis for a few research languages and is no longer actively developed. I have seen claims that concurrent logic programming is easy to do in the Mozart/Oz system.
There was great hope in the 80s/90s to bake parallelism into the language (thus making it "inherently" parallel), in particular in the context of the Fifth Generation Project. Even special hardware constructs were studied to implement a "Parallel Inference Machine" (PIM) (similar to the special hardware for Lisp machines in the "functional programming" camp). Hardware efforts were abandoned due to the continual improvement of off-the-shelf CPUs, and software efforts were abandoned due to excessive compiler complexity, lack of demand for hard-to-implement high-level features, and likely lack of payoff: parallelism that looks transparent and elegantly exploitable at the language level generally means costly inter-process communication and transactional locking "under the hood".
A good read about this is
"The Deevolution of Concurrent Logic Programming Languages"
by Evan Tick, March 1994. Appeared in "Journal of Logic Programming, Tenth Anniversary Special Issue, 1995". The Postscript file linked to is complete, unlike the PDF you get at Elsevier.
The author says:
There are two main views of concurrent logic programming and its development over the past several years [i.e. 1990-94]. Most logic programming literature views concurrent logic programming languages as a derivative or variant of logic programs, i.e., the main difference being the extensive use of "don't care" nondeterminism rather than "don't know" (backtracking) nondeterminism. Hence the name committed choice or CC languages. A second view is that concurrent logic programs are concurrent, reactive programs, not unlike other "traditional" concurrent languages such as 'C' with explicit message passing, in the sense that procedures are processes that communicate over data streams to incrementally produce answers. A cynic might say that the former view has more academic richness, whereas the latter view has more practical public relations value.
This article is a survey of implementation techniques of concurrent logic programming languages, and thus full disclosure of both of these views is not particularly relevant. Instead, a quick overview of basic language semantics, and how they relate to fundamental programming paradigms in a variety of languages within the family, will suffice. No attempt will be made to cover the many feasible programming paradigms; nor semantical nuances, nor the family history. (...).
The main point I wish to make in this article is that concurrent logic programming languages have been deevolving since their inception, about ten years ago, because of the following tatonnement:
Systems designers and compiler writers could supply only certain limited features in robust, efficient implementations. This drove the market to accept these restricted languages as, in some informal sense, de facto standards.
Programmers became aware that certain, more expressive language features were not critically important to getting applications written, and did not demand their inclusion.
Thus my stance in this article will be a third view: how the initially rich languages gradually lost their "teeth," and became weaker, but more practically implementable, and achieved faster performance.
The deevolutionary history begins with Concurrent Prolog (deep guards, atomic unification; read-only annotated variables for synchronization), and after a series of reductions (for example: GHC (input-matching synchronization), Parlog (safe), FCP (flat), Fleng (no guards), Janus (restricted communication), Strand (assignment rather than output unification)), and ends for now with PCN (flat guards, non-atomic assignments, input-matching synchronization, and explicitly-defined mutable variables). This and other terminology will be defined as the article proceeds.
This view may displease some readers because it presupposes that performance is the main driving force of the language market; and furthermore that the main "added value" of concurrent logic programs over logic programs is the ability to naturally exploit parallelism to gain speed. Certainly the reactive nature of the languages also adds value; e.g., in building complex object-oriented applications. Thus one can argue that the deevolution witnessed is a bad thing when reactive capabilities are being traded for speed.
ECLiPSe-CLP, a language "largely backward-compatible with Prolog", supports OR-parallelism, even though "this functionality is currently not actively maintained because of other priorities".
[1,2] document OR- (and AND-)parallelism in ECLiPSe-CLP.
However, I tried for some time to get it working using the code from ECLiPSe-CLP's repository, but I didn't manage to.
[1] http://eclipseclp.org/reports/book.ps.gz
[2] http://eclipseclp.org/doc/bips/kernel/compiler/parallel-1.html
Look at this statement, taken from "The examples from Tony Hoare's seminal 1978 paper":
Go's design was strongly influenced by Hoare's paper. Although Go differs significantly from the example language used in the paper, the examples still translate rather easily. The biggest difference apart from syntax is that Go models the conduits of concurrent communication explicitly as channels, while the processes of Hoare's language send messages directly to each other, similar to Erlang. Hoare hints at this possibility in section 7.3, but with the limitation that "each port is connected to exactly one other port in another process", in which case it would be a mostly syntactic difference.
I'm confused.
Processes in Hoare's language communicate directly with each other. Goroutines also communicate with each other, but via channels.
So what impact does this limitation have in Go? What is the real difference?
The answer requires a fuller understanding of Hoare's work on CSP. The progression of his work can be summarised in three stages:
Based on Dijkstra's semaphores, Hoare developed monitors. These are as used in Java, except that Java's implementation contains a mistake (see Welch's article Wot, No Chickens?). It's unfortunate that Java ignored Hoare's later work.
CSP grew out of this. Initially, CSP required direct exchange from process A to process B. This rendezvous approach is used by Ada and Erlang.
CSP was completed by 1985, when his book was first published. This final version of CSP includes channels as used in Go. Along with Hoare's team at Oxford, David May concurrently developed Occam, a language deliberately intended to blend CSP into a practical programming language. CSP and Occam influenced each other (for example in The Laws of Occam Programming). For years, Occam was only available on the Transputer processor, whose architecture was tailored to suit CSP. More recently, Occam has developed to target other processors and has also absorbed the Pi calculus, along with other general synchronisation primitives.
So, to answer the original question, it is probably helpful to compare Go with both CSP and Occam.
Channels: CSP, Go and Occam all have the same semantics for channels. In addition, Go makes it easy to add buffering into channels (Occam does not).
Choices: CSP defines both the internal and external choice. However, both Go and Occam have a single kind of selection: select in Go and ALT in Occam. The fact that there are two kinds of CSP choice proved to be less important in practical languages.
Occam's ALT allows condition guards, but Go's select does not (there is a workaround: channel aliases can be set to nil to imitate the same behaviour).
Mobility: Go allows channel ends to be sent (along with other data) via channels. This creates a dynamically-changing topology and goes beyond what is possible in CSP, but Milner's Pi calculus was developed (out of his CCS) to describe such networks.
Processes: A goroutine is a forked process; it terminates when it wants to and it doesn't have a parent. This is less like CSP / Occam, in which processes are compositional.
An example will help here: firstly Occam (n.b. indentation matters)
SEQ
  PAR
    processA()
    processB()
  processC()
and secondly Go
go processA()
go processB()
processC()
In the Occam case, processC doesn't start until both processA and processB have terminated. In Go, processA and processB fork very quickly, then processC runs straightaway.
Shared data: CSP is not really concerned with data directly. But it is interesting to note there is an important difference between Go and Occam concerning shared data. When multiple goroutines share a common set of data variables, race conditions are possible; Go's excellent race detector helps to eliminate problems. But Occam takes a different stance: shared mutable data is prevented at compilation time.
Aliases: related to the above, Go allows many pointers to refer to each data item. Such aliases are disallowed in Occam, so reducing the effort needed to detect race conditions.
The latter two points are less about Hoare's CSP and more about May's Occam. But they are relevant because they directly concern safe concurrent coding.
That's exactly the point: in the example language used in Hoare's initial paper (and also in Erlang), process A talks directly to process B, while in Go, goroutine A talks to channel C and goroutine B listens to channel C. I.e. in Go the channels are explicit while in Hoare's language and Erlang, they are implicit.
See this article for more info.
Recently, I've been working quite intensively with Go's channels, and have been working with concurrency and parallelism for many years, although I could never profess to know everything about this.
I think what you're asking is what's the subtle difference between sending a message to a channel and sending directly to each other? If I understand you, the quick answer is simple.
Sending to a channel gives the opportunity for parallelism/concurrency on both sides of the channel. Beautiful, and scalable.
We live in a concurrent world. Sending a long continuous stream of messages from A to B (asynchronously) means that B will need to process the messages at pretty much the same pace as A sends them, unless more than one instance of B has the opportunity to process a message taken from the channel, hence sharing the workload.
The good thing about channels is that you can have a number of producer/consumer goroutines which are able to push messages onto the queue, or consume from the queue and process them accordingly.
If you think linearly, like a single-core CPU, concurrency is basically like having a million jobs to do. A single-core CPU can only do one thing at a time, and yet it gives the illusion that lots of things are happening at the same time. When executing some code, the OS often needs to wait a while for something to come back from the network, disk, keyboard, or mouse, or for some process which sleeps for a while; this gives the OS the opportunity to do something else in the meantime. It all happens extremely quickly, creating the illusion of parallelism.
Parallelism, on the other hand, is different in that a job can be run on a completely different CPU, independent of what's going on with the other CPUs, and therefore doesn't run under the same constraints (although most OSs do a pretty good job of ensuring workloads are evenly distributed across all of their CPUs, perhaps with the exception of CPU-hungry, uncooperative, non-OS-yielding code; but even then the OS tames it).
The point is, having multi-core CPUs means more parallelism and more concurrency can occur.
Imagine a single queue at a bank which fans-out to a number of tellers who can help you. If no customers are being served by any teller, one teller elects to handle the next customer and becomes busy, until they all become busy. Whenever a customer walks away from a teller, that teller is able to handle the next customer in the queue.
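The same single-queue, many-tellers fan-out can be written in a few lines in most languages with channels; here is a minimal sketch in Haskell using Control.Concurrent.Chan (the teller/customer names are made up for illustration):

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (newChan, readChan, writeChan)
import Control.Monad (forM_, forever, replicateM_)

main :: IO ()
main = do
  customers <- newChan              -- the single bank queue
  served    <- newChan              -- completion notifications
  forM_ [1 .. 3 :: Int] $ \teller ->
    forkIO $ forever $ do           -- each free teller takes the next customer
      c <- readChan customers
      putStrLn ("teller " ++ show teller ++ " serves customer " ++ show c)
      writeChan served ()
  forM_ [1 .. 10 :: Int] (writeChan customers)  -- customers join the queue
  replicateM_ 10 (readChan served)              -- wait until all are served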
There's a company that is developing a massively parallel computer called Parallella. It looks like it has lots of potential, but it runs some C-style language.
Q. Has anyone written a language specifically to take advantage of massively parallel computers like this?
Clause 1. It has to be a managed garbage collected language.
Clause 2. It has to make it very easy to write parallel code without requiring the developer to look after low-level locking.
Clause 3. Bonus points for functional languages.
Clause 4. Super bonus points for languages with lambdas.
There are definitely languages that have been designed to deal with the rising popularity of parallel computing. Parallel processors have skyrocketed in popularity since the death of Moore's Law. Support for better parallel computing in programming languages has followed quickly.
My personal recommendation would be either Haskell or Clojure. Both are functional languages which have made great strides in parallel and concurrent computing leveraging their functional nature to gain advantages. Haskell has a really nice book called Parallel and Concurrent Programming in Haskell by Simon Marlow. I've read it and it's excellent. Clojure has also been built from the ground up with concurrency in mind. An interesting new player in this space is Julia, but I can't say I know much about it at all.
As for clause 1, I don't know what a managed language means. EDIT: What you're calling a managed language is more commonly called a garbage-collected language. You might want to use that term to help get more effective answers. Also, all the languages I recommended have garbage collection.
As for clause 2, Haskell definitely makes parallel computing fairly automatic without any worrying about low level concepts or locking. There is a simple function called 'par' which allows the programmer to annotate two computations to be executed in parallel. The semantics guarantee that the expressions be evaluated when they're necessary and since the computations are functional they are guaranteed not to interact in non-thread-safe ways.
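For instance, a minimal sketch using par and pseq from the parallel package (parFib is a made-up example; compile with ghc -threaded and run with +RTS -N):

import Control.Parallel (par, pseq)

-- Naive Fibonacci, used here only as a CPU-bound workload.
fib :: Int -> Integer
fib n | n < 2     = fromIntegral n
      | otherwise = fib (n - 1) + fib (n - 2)

-- 'par' sparks the left argument for possible parallel evaluation;
-- 'pseq' forces the right argument first, giving the spark time to run.
parFib :: Int -> Integer
parFib n | n < 15    = fib n        -- below this, sparking isn't worth it
         | otherwise = x `par` (y `pseq` (x + y))
  where
    x = parFib (n - 1)
    y = parFib (n - 2)

main :: IO ()
main = print (parFib 30)

Because fib is pure, annotating it with par can change only the schedule, never the answer.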
As for clause 3, you're on the right track to be looking for a functional language. Functional subcomputations have automatic thread safety which pays big dividends when it comes to ensuring parallel execution doesn't cause problems. It can't cause any if the computations are functional.
As for clause 4, good luck finding a functional language that doesn't have lambda ;) EDIT: It's not, strictly speaking, part of the definition of a functional language, because there is no formal definition of what a functional programming language is. Informally, I think a lot of people would mention it as one of the most important features. Concatenative languages, or languages based on tacit programming (aka point-free style), can be functional and get away with not having lambda. I wouldn't be surprised if the K language didn't have lambda despite being functional. Also, I know for sure that combinatory logic (which is the basis for K) does not have lambda, though combinatory logic is just a theoretical basis and not a practical programming language.
I'm searching for an algorithm (or an argument for the existence of such an algorithm) in functional style which is faster than an imperative one.
I like functional code because it's expressive and mostly easier to read than its imperative counterparts. But I also know that this expressiveness can cost runtime overhead. Not always, thanks to techniques like tail recursion, but often functional programs are slower.
While programming I don't think about runtime costs of functional code because nowadays PCs are very fast and development time is more expensive than runtime. Furthermore for me readability is more important than performance. Nevertheless my programs are fast enough so I rarely need to solve a problem in an imperative way.
There are some algorithms which in practice should be implemented in an imperative style (like sorting algorithms), because otherwise in most cases they are too slow or require lots of memory.
In contrast, thanks to techniques like pattern matching, a whole program like a parser written in a functional language may be much faster than one written in an imperative language, because of the compiler's opportunities to optimize the code.
But are there any algorithms which are faster in a functional style, and are there ways to argue for the existence of such algorithms?
A simple line of reasoning. I don't vouch for the terminology, but it seems to make sense.
A functional program, to be executed, will need to be transformed into some set of machine instructions.
All machines (I've heard of) are imperative.
Thus, for every functional program, there's an imperative program (roughly speaking, in assembler language), equivalent to it.
So, you'll probably have to be satisfied with 'expressiveness', until we get 'functional computers'.
The short answer:
Anything that can be easily made parallel because it's free of side-effects will be quicker on a multi-core processor.
QuickSort, for example, scales up quite nicely when used with immutable collections: http://en.wikipedia.org/wiki/Quicksort#Parallelization
All else being equal, if you have two algorithms that can reasonably be described as equivalent, except that one uses pure functions on immutable data, while the second relies on in-place mutations, then the first algorithm will scale up to multiple cores with ease.
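As a sketch of that idea in Haskell (a toy illustration of the safety argument, not a tuned sort):

import Control.Parallel (par, pseq)

-- The two recursive calls share no mutable state, so sparking one of
-- them for parallel evaluation is safe by construction.
quicksort :: Ord a => [a] -> [a]
quicksort []       = []
quicksort (p : xs) = left `par` (right `pseq` (left ++ p : right))
  where
    left  = quicksort [x | x <- xs, x < p]
    right = quicksort [x | x <- xs, x >= p]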
It may even be the case that your programming language can perform this optimization for you, as with the scalaCL plugin that will compile code to run on your GPU. (I'm wondering now if SIMD instructions make this a "functional" processor)
So given parallel hardware, the first algorithm will perform better, and the more cores you have, the bigger the difference will be.
FWIW there are Purely functional data structures, which benefit from functional programming.
There's also a nice book on Purely Functional Data Structures by Chris Okasaki, which presents data structures from the point of view of functional languages.
Another interesting article is Announcing Intel Concurrent Collections for Haskell 0.1, about parallel programming, where they note:
Well, it happens that the CnC notion of a step is a pure function. A step does nothing but read its inputs and produce tags and items as output. This design was chosen to bring CnC to that elusive but wonderful place called deterministic parallelism. The decision had nothing to do with language preferences. (And indeed, the primary CnC implementations are for C++ and Java.)
Yet what a great match Haskell and CnC would make! Haskell is the only major language where we can (1) enforce that steps be pure, and (2) directly recognize (and leverage!) the fact that both steps and graph executions are pure.
Add to that the fact that Haskell is wonderfully extensible and thus the CnC "library" can feel almost like a domain-specific language.
It doesn't say anything about performance – they promise to discuss some of the implementation details and performance in future posts – but Haskell with its "pureness" fits nicely into parallel programming.
One could argue that all programs boil down to machine code.
So, if I disassemble the machine code (of an imperative program) and tweak the assembler, I could perhaps end up with a faster program. Or I could come up with an "assembler algorithm" that exploits some specific CPU feature, and therefore it really is faster than the imperative-language version.
Does this situation lead to the conclusion that we should use assembler everywhere? No, we decided to use imperative languages because they are less cumbersome. We write pieces in assembler because we really need to.
Ideally we should also use FP algorithms because they are less cumbersome to code, and use imperative code when we really need to.
Well, I guess you meant to ask if there is an implementation of an algorithm in functional programming language that is faster than another implementation of the same algorithm but in an imperative language. By "faster" I mean that it performs better in terms of execution time or memory footprint on some inputs according to some measurement that we deem trustworthy.
I do not exclude this possibility. :)
To elaborate on Yasir Arsanukaev's answer, purely functional data structures can be faster than mutable data structures in some situations because they share pieces of their structure. Thus, in places where you might have to copy a whole array or list in an imperative language, you can get away with a fraction of the copying, because you can change (and copy) only a small part of the data structure. Lists in functional languages are like this: multiple lists can share the same tail, since nothing can be modified. (This can be done in imperative languages, but usually isn't, because within the imperative paradigm people aren't used to talking about immutable data.)
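A few lines of Haskell make the sharing concrete (the names are made up):

-- Both xs and ys reuse the very same tail; each cons allocates one cell
-- and copies nothing, because the shared list can never be mutated.
common :: [Int]
common = [2, 3, 4]

xs, ys :: [Int]
xs = 1 : common
ys = 0 : common

main :: IO ()
main = print (xs, ys)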
Also, lazy evaluation in functional languages (particularly Haskell which is lazy by default) can also be very advantageous because it can eliminate code execution when the code's results won't actually be used. (One can be very careful not to run this code in the first place in imperative languages, however.)
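A small sketch of that point (expensive is a made-up example):

-- Laziness: 'expensive' is never evaluated, because its value is never
-- demanded; the work is simply not done.
expensive :: Integer
expensive = sum [1 .. 1000000000]

main :: IO ()
main = print (fst (42 :: Int, expensive))   -- prints 42 immediately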
I am brainstorming an idea of developing a high level software to manipulate matrix algebra equations, tensor manipulations to be exact, to produce optimized C++ code using several criteria such as sizes of dimensions, available memory on the system, etc.
Something similar in spirit to the Tensor Contraction Engine (TCE), but specifically oriented towards producing optimized rather than general code.
The desired end result is software which is expert at producing parallel programs in my domain.
Does this sort of development fall on the category of expert systems?
What other projects out there work in the same area of producing code given the constraints?
What you are describing is more like a Domain-Specific Language.
http://en.wikipedia.org/wiki/Domain-specific_language
It wouldn't be called an expert system, at least not in the traditional sense of this concept.
Expert systems are rule-based inference engines, whereby the expertise in question is clearly encapsulated in the rules. The system you suggest, while possibly encapsulating insight about the nature of the problem domain inside a linear algebra model of sorts, would act more as a black box than an expert system. One of the characteristics of expert systems is that they can produce an "explanation" of their reasoning, and such a feature is possible in part because the knowledge representation, while formalized, remains close to simple statements in a natural language; matrices and operations on them, while possibly being derived upon similar observation of reality, are a lot less transparent...
It is unclear from the description in the question whether the system you propose would optimize existing code (possibly in a limited domain), or whether it would produce optimized code, in that case driven by some external goal/function...
Well, production systems (rule systems) are one of four general approaches to computation (Turing machines, Church's recursive functions, Post production systems, and Markov algorithms; several more have been added to that list), which more or less have these respective realizations: imperative programming, functional programming, and rule-based programming. As far as I know, Markov algorithms don't have an independent implementation. These are all Turing equivalent.
So rule based programming can be used to write anything at all. Also early mathematical/symbolic manipulation programs did generally use rule based programming until the problem was sufficiently well understood (whereupon the approach was changed to imperative or constraint programming - see MACSYMA - hmmm MACSYMA was written in Lisp so perhaps I have a different program in mind or perhaps they originally implemented a rule system in Lisp for this).
You could easily write a rule system to perform the matrix manipulations. You could keep a trace, with logical support, to record the actual rules fired that contributed to a solution (some rules that fire might not contribute directly to a solution, after all). Then for every rule you have a mapping to a set of C++ instructions (these don't have to be "complete"; they act more like a semi-executable requirement), which are output as an intermediate language. That is then read by a parser to link it to the required input data and any kind of fix-up needed. You might find it easier to generate functional code; for one thing, after the fix-up you could more easily optimize the output code in functional source.
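A toy sketch of that rule-plus-trace idea in Haskell (all names are hypothetical, and rules fire only at the root of the term to keep it short; a real system would rewrite at every subterm):

-- A miniature rewrite system for matrix expressions: each rule matches
-- a pattern, and normalize records the names of the rules that fired.
data Expr = Var String | Mul Expr Expr | Transpose Expr
  deriving Show

rules :: [(String, Expr -> Maybe Expr)]
rules =
  [ ("double-transpose", \e -> case e of
        Transpose (Transpose a) -> Just a
        _                       -> Nothing)
  , ("transpose-of-product", \e -> case e of
        Transpose (Mul a b) -> Just (Mul (Transpose b) (Transpose a))
        _                   -> Nothing)
  ]

-- Apply rules at the root until none fires, keeping a trace; the trace
-- is what a later stage could map to C++ instructions.
normalize :: Expr -> ([String], Expr)
normalize e =
  case [ (name, e') | (name, rule) <- rules, Just e' <- [rule e] ] of
    (name, e') : _ -> let (trace, e'') = normalize e' in (name : trace, e'')
    []             -> ([], e)

main :: IO ()
main = print (normalize (Transpose (Mul (Var "A") (Var "B"))))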
Having said that, other contributors have outlined a domain-specific language approach, and that is what the TCE people did too (my suggestion amounts to that as well, just using rules).
Why are functional languages always trailing behind C in benchmarks? If you have a statically typed functional language, it seems to me it could be compiled to the same code as C, or to even more optimized code, since more semantics are available to the compiler. Why does it seem like all functional languages are slower than C, and why do they always need garbage collection and excessive use of the heap?
Does anyone know of a functional language appropriate for embedded / real-time applications, where memory allocation is kept to a minimum and the produced machine code is lean and fast?
Are functional languages inherently slow?
In some sense, yes. They require infrastructure that inevitably adds overheads over what can theoretically be attained using assembler by hand. In particular, first-class lexical closures only work well with garbage collection because they allow values to be carried out of scope.
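A tiny illustration of that last point (adder is a made-up example): the closure returned below captures n, which therefore has to outlive the call frame that created it, and a garbage collector is what makes that safe and cheap in general.

adder :: Int -> (Int -> Int)
adder n = \x -> x + n            -- the closure carries n out of scope

main :: IO ()
main = print (map (adder 10) [1, 2, 3])   -- [11,12,13]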
Why are functional languages always trailing behind C in benchmarks?
Firstly, beware of selection bias. C acts as a lowest common denominator in benchmark suites, limiting what can be accomplished. If you have a benchmark comparing C with a functional language then it is almost certainly an extremely simple program. Arguably so simple that it is of little practical relevance today. It is not practically feasible to solve more complicated problems using C for a mere benchmark.
The most obvious example of this is parallelism. Today, we all have multicores. Even my phone is a multicore. Multicore parallelism is notoriously difficult in C but can be easy in functional languages (I like F#). Other examples include anything that benefits from persistent data structures, e.g. undo buffers are trivial with purely functional data structures but can be a huge amount of work in imperative languages like C.
Why does it seem like all functional languages are slower than C, and why do they always need garbage collection and excessive use of the heap?
Functional languages will seem slower because you'll only ever see benchmarks comparing code that is easy enough to write well in C and you'll never see benchmarks comparing meatier tasks where functional languages start to excel.
However, you've correctly identified what is probably the single biggest bottleneck in functional languages today: their excessive allocation rates. Nice work!
The reasons why functional languages allocate so heavily can be split into historical and inherent reasons.
Historically, Lisp implementations have been doing a lot of boxing for 50 years now. This characteristic spread to many other languages which use Lisp-like intermediate representations. Over the years, language implementers have continually resorted to boxing as a quick fix for complications in language implementation. In object oriented languages, the default has been to always heap allocate every object even when it can obviously be stack allocated. The burden of efficiency was then pushed onto the garbage collector and a huge amount of effort has been put into building garbage collectors that can attain performance close to that of stack allocation, typically by using a bump-allocating nursery generation. I think that a lot more effort should be put into researching functional language designs that minimize boxing and garbage collector designs that are optimized for different requirements.
Generational garbage collectors are great for languages that heap allocate a lot because they can be almost as fast as stack allocation. But they add substantial overheads elsewhere. Today's programs are increasingly using data structures like queues (e.g. for concurrent programming) and these give pathological behaviour for generational garbage collectors. If the items in the queue outlive the first generation then they all get marked, then they all get copied ("evacuated"), then all of the references to their old locations get updated and then they become eligible for collection. This is about 3× slower than it needs to be (e.g. compared to C). Mark region collectors like Beltway (2002) and Immix (2008) have the potential to solve this problem because the nursery is replaced with a region that can either be collected as if it were a nursery or, if it contains mostly reachable values, it can be replaced with another region and left to age until it contains mostly unreachable values.
Despite the pre-existence of C++, the creators of Java made the mistake of adopting type erasure for generics, leading to unnecessary boxing. For example, I benchmarked a simple hash table running 17× faster on .NET than the JVM partly because .NET did not make this mistake (it uses reified generics) and also because .NET has value types. I actually blame Lisp for making Java slow.
All modern functional language implementations continue to box excessively. JVM-based languages like Clojure and Scala have little choice because the VM they target cannot even express value types. OCaml sheds type information early in its compilation process and resorts to tagged integers and boxing at run-time to handle polymorphism. Consequently, OCaml will often box individual floating point numbers and always boxes tuples. For example, a triple of bytes in OCaml is represented by a pointer (with an implicit 1-bit tag embedded in it that gets checked repeatedly at run-time) to a heap-allocated block with a 64 bit header and 192 bit body containing three tagged 63-bit integers (where the 3 tags are, again, repeatedly examined at run time!). This is clearly insane.
Some work has been done on unboxing optimizations in functional languages but it never really gained traction. For example, the MLton compiler for Standard ML was a whole-program optimizing compiler that did sophisticated unboxing optimizations. Sadly, it was before its time and the "long" compilation times (probably under 1s on a modern machine!) deterred people from using it.
The only major platform to have broken this trend is .NET but, amazingly, it appears to have been an accident. Despite having a Dictionary implementation very heavily optimized for keys and values that are of value types (because they are unboxed) Microsoft employees like Eric Lippert continue to claim that the important thing about value types is their pass-by-value semantics and not the performance characteristics that stem from their unboxed internal representation. Eric seems to have been proven wrong: more .NET developers seem to care more about unboxing than pass-by-value. Indeed, most structs are immutable and, therefore, referentially transparent so there is no semantic difference between pass-by-value and pass-by-reference. Performance is visible and structs can offer massive performance improvements. The performance of structs even saved Stack Overflow and structs are used to avoid GC latency in commercial software like Rapid Addition's!
The other reason for heavy allocation by functional languages is inherent. Imperative data structures like hash tables use huge monolithic arrays internally. If these were persistent then the huge internal arrays would need to be copied every time an update was made. So purely functional data structures like balanced binary trees are fragmented into many little heap-allocated blocks in order to facilitate reuse from one version of the collection to the next.
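A small sketch with Haskell's Data.Map (a balanced tree) shows the path-copying behaviour described above:

import qualified Data.Map as Map

-- Inserting copies only the O(log n) spine above the change; m0 is
-- untouched and shares everything else with m1.
main :: IO ()
main = do
  let m0 = Map.fromList [(k, k * k) | k <- [1 .. 1000 :: Int]]
      m1 = Map.insert 0 0 m0
  print (Map.size m0, Map.size m1)   -- (1000,1001): both versions live on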
Clojure uses a neat trick to alleviate this problem when collections like dictionaries are only written to during initialization and are then read from a lot. In this case, the initialization can use mutation to build the structure "behind the scenes". However, this does not help with incremental updates and the resulting collections are still substantially slower to read than their imperative equivalents. On the up-side, purely functional data structures offer persistence whereas imperative ones do not. However, few practical applications benefit from persistence in practice so this is often not advantageous. Hence the desire for impure functional languages where you can drop to imperative style effortlessly and reap the benefits.
Does anyone know of a functional language appropriate for embedded / real-time applications, where memory allocation is kept to a minimum and the produced machine code is lean and fast?
Take a look at Erlang and OCaml if you haven't already. Both are reasonable for memory constrained systems but neither generate particularly great machine code.
Nothing is inherently anything. Here is an example where interpreted OCaml runs faster than equivalent C code, because the OCaml optimizer has different information available to it, due to differences in the language. Of course, it would be foolish to make a general claim that OCaml is categorically faster than C. The point is, it depends upon what you're doing, and how you do it.
That said, OCaml is an example of a (mostly) functional language which is actually designed for performance, in contrast to purity.
Functional languages require the elimination of mutable state that is visible at the level of the language abstraction. Therefore, data that would be mutated in place by an imperative language needs to be copied instead, with the mutation taking place on the copy. For a simple example, see a quick sort in Haskell vs. C.
Furthermore, garbage collection is required because free() is not a pure function, as it has side effects. Therefore, the only way to free memory that does not involve side effects at the level of the language abstraction is with garbage collection.
Of course, in principle, a sufficiently smart compiler could optimize out much of this copying. This is already done to some degree, but making the compiler sufficiently smart to understand the semantics of your code at that level is just plain hard.
The short answer: because C is fast. As in, blazingly ridiculously crazy fast. A language simply doesn't have to be 'slow' to get its rear handed to it by C.
The reason why C is fast is that it was created by really great coders, and gcc has been optimized over the course of a couple of decades by more brilliant coders than 99% of languages out there have had.
In short, you're not going to beat C except for specialized tasks that require very specific functional programming constructs.
The control flow of procedural languages much better matches the actual processing patterns of modern computers.
C maps very closely onto the assembly code its compilation produces, hence the nickname "cross-platform assembly". Computer manufacturers have spent a few decades making assembly code run as fast as possible, so C inherits all of this raw speed.
In comparison, the no-side-effects, inherent parallelism of functional languages does not map onto a single processor at all well. The arbitrary order in which functions can be invoked needs to be serialised down to the CPU bottleneck: without extremely clever compilation, you're going to be context switching all the time, and none of the pre-fetching will work because you're constantly jumping all over the place... Basically, all the optimisation work that computer manufacturers have done for nice, predictable procedural languages is pretty much useless.
However! With the move towards lots of less powerful cores (rather than one or two turbo-charged cores), functional languages should begin to close the gap, as they naturally scale horizontally.
C is fast because it's basically a set of macros for assembler :) There is no "behind the scenes" when you are writing a program in C. You allocate memory when you decide it's time to do that, and you free it in the same fashion. This is a huge advantage when you are writing a real-time application, where predictability is important (more than anything else, actually).
Also, C compilers are generally extremely fast because the language itself is simple. It barely does any type checking :) This also means that it is easier to make hard-to-find errors.
An advantage of the lack of type checking is that a function can be exported just by its name, for example, and this makes C code easy to link with code from other languages.
Well Haskell is only 1.8 times slower than GCC's C++, which is faster than GCC's C implementation for typical benchmark tasks.
That makes Haskell very fast, even faster than C# (Mono, that is).
relative speed   language
1.0              C++ (GNU g++)
1.1              C (GNU gcc)
1.2              ATS
1.5              Java 6 (-server)
1.5              Clean
1.6              Pascal (Free Pascal)
1.6              Fortran (Intel)
1.8              Haskell (GHC)
2.0              C# (Mono)
2.1              Scala
2.2              Ada 2005 (GNAT)
2.4              Lisp (SBCL)
3.9              Lua (LuaJIT)
source
For the record I use Lua for Games on the iPhone, thus you could easily use Haskell or Lisp if you prefer, since they are faster.
As for now, functional languages aren't used heavily for industry projects, so not enough serious work goes into optimizers. Also, optimizing imperative code for an imperative target is probably way easier.
Functional languages have one feature that will let them outdo imperative languages really soon now: trivial parallelization.
Trivial not in the sense that it is easy, but that it can be built into the language environment, without the developer needing to think about it.
The cost of robust multithreading in a thread-agnostic language like C is prohibitive for many projects.
I disagree with tuinstoel. The important question is whether the functional language provides faster development time and results in faster code when it is used for what functional languages are meant to be used for. See the efficiency issues section on Wikipedia for a glimpse of what I mean.
One more reason for bigger executable size could be lazy evaluation and non-strictness. The compiler can't figure out at compile time when certain expressions will get evaluated, so some runtime machinery gets stuffed into the executable to handle this (to call upon the evaluation of the so-called thunks). As for performance, laziness can be both good and bad. On one hand it allows for additional potential optimization; on the other hand the code size can be larger and programmers are more likely to make bad decisions, e.g. see Haskell's foldl vs. foldr vs. foldl' vs. foldr'.
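For example (a minimal sketch; lazySum and strictSum are made-up names):

import Data.List (foldl')

-- foldl builds a long chain of (+) thunks before any addition happens;
-- foldl' forces the accumulator at each step and runs in constant space.
lazySum, strictSum :: [Int] -> Int
lazySum   = foldl  (+) 0   -- risks a space blow-up on large inputs
strictSum = foldl' (+) 0

main :: IO ()
main = print (lazySum [1 .. 10], strictSum [1 .. 1000000])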