Inlining Algorithm - algorithm

Does anyone know of any papers discussion inlining algorithms? And closely related, the relationship of parent-child graph to call graph.
Background: I have a compiler written in Ocaml which aggressively inlines functions, primarily as a result of this and some other optimisations it generates faster code for my programming language than most others in many circumstances (including even C).
Problem #1: The algorithm has trouble with recursion. For this my rule is only to inline children into parents, to prevent infinite recursion, but this precludes sibling functions inlining once into each other.
Problem #2: I do not know of a simple way to optimise inlining operations. My algorithm is imperative with mutable representation of function bodies because it does not seem even remotely possible to make an efficient functional inlining algorithm. If the call graph is a tree, it is clear that a bottom up inlining is optimal.
Technical information: Inlining consists of a number of inlining steps. The problem is the ordering of the steps.
Each step works as follows:
we make a copy of the function to be inlined and beta-reduce by
replacing both type parameters and value parameters with arguments.
We then replace return statement with an assignment to a new variable
followed by a jump to the end of the function body.
The original call to the function is then replaced by this body.
However we're not finished. We must also clone all the children of
the function, beta-reducting them as well, and reparent the clones to
the calling function.
The cloning operation makes it extremely hard to inline recursive functions. The usual trick of keeping a list of what is already in progress and just checking to see if we're already processing this call does not work in naive form because the recursive call is now moved into the beta-reduced code being stuffed into the calling function, and the recursion target may have changed to a cloned child. However that child, in calling the parent, is still calling the original parent which calls its child, and now the unrolling of the recursion will not stop. As mentioned I broke this regress by only allowing inlining a recursive call to a child, preventing sibling recursions being inlined.
The cost of inlining is further complicated by the need to garbage collect unused functions. Since inlining is potentially exponential, this is essential. If all the calls to a function are inlined, we should get rid of the function if it has not been inlined into yet, otherwise we'll waste time inlining into a function which is no longer used. Actually keeping track of who calls what is extremely difficult, because when inlining we're not working with an actual function representation, but an "unravelled" one: for example, the list of instructions is being processed sequentially and a new list built up, and at any one point in time there may not be a coherent instruction list.
In his ML compiler Steven Weeks chose to use a number of small optimisations applied repeatedly, since this made the optimisations easy to write and easy to control, but unfortunately this misses a lot of optimisation opportunities compared to a recursive algorithm.
Problem #3: when is it safe to inline a function call?
To explain this problem generically: in a lazy functional language, arguments are wrapped in closures and then we can inline an application; this is the standard model for Haskell. However it also explains why Haskell is so slow. The closures are not required if the argument is known, then the parameter can be replaced directly with its argument where is occurs (this is normal order beta-reduction).
However if it is known the argument evaluation is not non-terminating, eager evaluation can be used instead: the parameter is assigned the value of the expression once, and then reused. A hybrid of these two techniques is to use a closure but cache the result inside the closure object. Still, GHC hasn't succeeded in producing very efficient code: it is clearly very difficult, especially if you have separate compilation.
In Felix, I took the opposite approach. Instead of demanding correctness and gradually improving efficiency by proving optimisations preserved semantics, I mandate that the optimisation defines the semantics. This guarantees correct operation of the optimiser at the expense of uncertainty about what how certain code will behave. The idea is to provide ways for the programmer to force the optimiser to conform to intended semantics if the default optimisation strategy is too aggressive.
For example, the default parameter passing mode allows the compiler to chose whether to wrap the argument in a closure, replace the parameter with the argument, or assign the argument to the parameter. If the programmer wants to force a closure, they can just pass in a closure. If the programmer wants to force eager evaluation, they mark the parameter var.
The complexity here is much greater than a functional programming language: Felix is a procedural language with variables and pointers. It also has Haskell style typeclasses. This makes the inlining routine extremely complex, for example, type-class instances replace abstract functions whenever possible (due to type specialisation when calling a polymorphic function, it may be possible to find an instance whilst inlining, so now we have a new function we can inline).
Just to be clear I have to add some more notes.
Inlining and several other optimisations such as user defined term reductions, typeclass instantiations, linear data flow checks for variable elimination, tail rec optimisation, are done all at once on a given function.
The ordering problem isn't the order to apply different optimisations, the problem is to order the functions.
I use a brain dead algorithm to detect recursion: I build up a list of everything used directly by a each function, find the closure, and then check if the function is in the result. Note the usage set is built up many times during optimisation, and this is a serious bottleneck.
Whether a function is recursive or not can change unfortunately. A recursive function might become non-recursive after tail rec optimisation. But there is a much harder case: instantiating a typeclass "virtual" function can make what appeared to be non-recursive recursive.
As to sibling calls, the problem is that given f and g where f calls g and g calls f I actually want to inline f into g, and g into f .. once. My infinite regress stopping rule is to only allow inlining of f into g if they're mutually recursive if f is a child of g, which excludes inlining siblings.
Basically I want to "flatten out" all code "as much as possible".

I realize you probably already know all this, but it seems important to still write it in full, at least for further reference.
In the functional community, there is some litterature mostly from the GHC people. Note that they consider inlining as a transformation in the source language, while you seem to work at a lower level. Working in the source language -- or an intermediate language of reasonably similar semantics -- is, I believe, a big help for simplicity and correctness.
GHC Wiki : Inlining (contains a bibliography)
Secrets of the Glasgow Haskell inliner
For the question of the ordering compiler passes, this is quite arcane. Still in a Haskell setting, there is the Compilation by Transformation in a Non-strict Functional Language PhD Thesis which discusses the ordering of different compiler passes (and also inlining).
There is also the quite recent paper on Compilation by Equality Saturation which propose a novel approach to optimisation passes ordering. I'm not sure it has yet demonstrated applicability at a large scale, but it's certainly an interesting direction to explore.

As for the recursion case, you could use Tarjan algorithm on your call graph to detect circular dependency clusters, and exclude them from inlining. It won't affect sibling calls.
http://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm

Related

Can every recursion function be called DFS(Depth-First Search)? [duplicate]

A reddit thread brought up an apparently interesting question:
Tail recursive functions can trivially be converted into iterative functions. Other ones, can be transformed by using an explicit stack. Can every recursion be transformed into iteration?
The (counter?)example in the post is the pair:
(define (num-ways x y)
(case ((= x 0) 1)
((= y 0) 1)
(num-ways2 x y) ))
(define (num-ways2 x y)
(+ (num-ways (- x 1) y)
(num-ways x (- y 1))
Can you always turn a recursive function into an iterative one? Yes, absolutely, and the Church-Turing thesis proves it if memory serves. In lay terms, it states that what is computable by recursive functions is computable by an iterative model (such as the Turing machine) and vice versa. The thesis does not tell you precisely how to do the conversion, but it does say that it's definitely possible.
In many cases, converting a recursive function is easy. Knuth offers several techniques in "The Art of Computer Programming". And often, a thing computed recursively can be computed by a completely different approach in less time and space. The classic example of this is Fibonacci numbers or sequences thereof. You've surely met this problem in your degree plan.
On the flip side of this coin, we can certainly imagine a programming system so advanced as to treat a recursive definition of a formula as an invitation to memoize prior results, thus offering the speed benefit without the hassle of telling the computer exactly which steps to follow in the computation of a formula with a recursive definition. Dijkstra almost certainly did imagine such a system. He spent a long time trying to separate the implementation from the semantics of a programming language. Then again, his non-deterministic and multiprocessing programming languages are in a league above the practicing professional programmer.
In the final analysis, many functions are just plain easier to understand, read, and write in recursive form. Unless there's a compelling reason, you probably shouldn't (manually) convert these functions to an explicitly iterative algorithm. Your computer will handle that job correctly.
I can see one compelling reason. Suppose you've a prototype system in a super-high level language like [donning asbestos underwear] Scheme, Lisp, Haskell, OCaml, Perl, or Pascal. Suppose conditions are such that you need an implementation in C or Java. (Perhaps it's politics.) Then you could certainly have some functions written recursively but which, translated literally, would explode your runtime system. For example, infinite tail recursion is possible in Scheme, but the same idiom causes a problem for existing C environments. Another example is the use of lexically nested functions and static scope, which Pascal supports but C doesn't.
In these circumstances, you might try to overcome political resistance to the original language. You might find yourself reimplementing Lisp badly, as in Greenspun's (tongue-in-cheek) tenth law. Or you might just find a completely different approach to solution. But in any event, there is surely a way.
Is it always possible to write a non-recursive form for every recursive function?
Yes. A simple formal proof is to show that both µ recursion and a non-recursive calculus such as GOTO are both Turing complete. Since all Turing complete calculi are strictly equivalent in their expressive power, all recursive functions can be implemented by the non-recursive Turing-complete calculus.
Unfortunately, I’m unable to find a good, formal definition of GOTO online so here’s one:
A GOTO program is a sequence of commands P executed on a register machine such that P is one of the following:
HALT, which halts execution
r = r + 1 where r is any register
r = r – 1 where r is any register
GOTO x where x is a label
IF r ≠ 0 GOTO x where r is any register and x is a label
A label, followed by any of the above commands.
However, the conversions between recursive and non-recursive functions isn’t always trivial (except by mindless manual re-implementation of the call stack).
For further information see this answer.
Recursion is implemented as stacks or similar constructs in the actual interpreters or compilers. So you certainly can convert a recursive function to an iterative counterpart because that's how it's always done (if automatically). You'll just be duplicating the compiler's work in an ad-hoc and probably in a very ugly and inefficient manner.
Basically yes, in essence what you end up having to do is replace method calls (which implicitly push state onto the stack) into explicit stack pushes to remember where the 'previous call' had gotten up to, and then execute the 'called method' instead.
I'd imagine that the combination of a loop, a stack and a state-machine could be used for all scenarios by basically simulating the method calls. Whether or not this is going to be 'better' (either faster, or more efficient in some sense) is not really possible to say in general.
Recursive function execution flow can be represented as a tree.
The same logic can be done by a loop, which uses a data-structure to traverse that tree.
Depth-first traversal can be done using a stack, breadth-first traversal can be done using a queue.
So, the answer is: yes. Why: https://stackoverflow.com/a/531721/2128327.
Can any recursion be done in a single loop? Yes, because
a Turing machine does everything it does by executing a single loop:
fetch an instruction,
evaluate it,
goto 1.
Yes, using explicitly a stack (but recursion is far more pleasant to read, IMHO).
Yes, it's always possible to write a non-recursive version. The trivial solution is to use a stack data structure and simulate the recursive execution.
In principle it is always possible to remove recursion and replace it with iteration in a language that has infinite state both for data structures and for the call stack. This is a basic consequence of the Church-Turing thesis.
Given an actual programming language, the answer is not as obvious. The problem is that it is quite possible to have a language where the amount of memory that can be allocated in the program is limited but where the amount of call stack that can be used is unbounded (32-bit C where the address of stack variables is not accessible). In this case, recursion is more powerful simply because it has more memory it can use; there is not enough explicitly allocatable memory to emulate the call stack. For a detailed discussion on this, see this discussion.
All computable functions can be computed by Turing Machines and hence the recursive systems and Turing machines (iterative systems) are equivalent.
Sometimes replacing recursion is much easier than that. Recursion used to be the fashionable thing taught in CS in the 1990's, and so a lot of average developers from that time figured if you solved something with recursion, it was a better solution. So they would use recursion instead of looping backwards to reverse order, or silly things like that. So sometimes removing recursion is a simple "duh, that was obvious" type of exercise.
This is less of a problem now, as the fashion has shifted towards other technologies.
Recursion is nothing just calling the same function on the stack and once function dies out it is removed from the stack. So one can always use an explicit stack to manage this calling of the same operation using iteration.
So, yes all-recursive code can be converted to iteration.
Removing recursion is a complex problem and is feasible under well defined circumstances.
The below cases are among the easy:
tail recursion
direct linear recursion
Appart from the explicit stack, another pattern for converting recursion into iteration is with the use of a trampoline.
Here, the functions either return the final result, or a closure of the function call that it would otherwise have performed. Then, the initiating (trampolining) function keep invoking the closures returned until the final result is reached.
This approach works for mutually recursive functions, but I'm afraid it only works for tail-calls.
http://en.wikipedia.org/wiki/Trampoline_(computers)
I'd say yes - a function call is nothing but a goto and a stack operation (roughly speaking). All you need to do is imitate the stack that's built while invoking functions and do something similar as a goto (you may imitate gotos with languages that don't explicitly have this keyword too).
Have a look at the following entries on wikipedia, you can use them as a starting point to find a complete answer to your question.
Recursion in computer science
Recurrence relation
Follows a paragraph that may give you some hint on where to start:
Solving a recurrence relation means obtaining a closed-form solution: a non-recursive function of n.
Also have a look at the last paragraph of this entry.
It is possible to convert any recursive algorithm to a non-recursive
one, but often the logic is much more complex and doing so requires
the use of a stack. In fact, recursion itself uses a stack: the
function stack.
More Details: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Functions
tazzego, recursion means that a function will call itself whether you like it or not. When people are talking about whether or not things can be done without recursion, they mean this and you cannot say "no, that is not true, because I do not agree with the definition of recursion" as a valid statement.
With that in mind, just about everything else you say is nonsense. The only other thing that you say that is not nonsense is the idea that you cannot imagine programming without a callstack. That is something that had been done for decades until using a callstack became popular. Old versions of FORTRAN lacked a callstack and they worked just fine.
By the way, there exist Turing-complete languages that only implement recursion (e.g. SML) as a means of looping. There also exist Turing-complete languages that only implement iteration as a means of looping (e.g. FORTRAN IV). The Church-Turing thesis proves that anything possible in a recursion-only languages can be done in a non-recursive language and vica-versa by the fact that they both have the property of turing-completeness.
Here is an iterative algorithm:
def howmany(x,y)
a = {}
for n in (0..x+y)
for m in (0..n)
a[[m,n-m]] = if m==0 or n-m==0 then 1 else a[[m-1,n-m]] + a[[m,n-m-1]] end
end
end
return a[[x,y]]
end

Performance of recursive function in register based compiler

I have a question of whether there will be a performance hit when we write recursive functions in Register based compilers like DVM. I'm aware that recursion isn't recommended in compilers with limited depth like compilers for python.
Being register-based does not help for recursive functions, they still have the same problem: conceptually, every call creates a new stack frame. If that is implemented literally, then a recursive call is inherently a little slower than looping, and perhaps more importantly, uses up a finite resource so the recursion depth is limited. A register-based code representation does not have the concept of an operand stack, but that concept is mostly disjoint from the concept of a call stack, which is still necessary just to have general subroutines. Subroutines can be implemented without a call stack if recursion is banned, in which case they need not be re-entrant so the local variables and the variable that holds the return address can be statically allocated.
Going through a trampoline works around the stack growth by quickly returning to a special caller that then calls the continuation, that way recursion doesn't grow the stack at all since the old frame gets deallocated before a new one is created, but it adds even more run-time overhead. Tail call elimination by rewriting the call into a jump achieves a similar effect but by reusing the same frame, with less associated overhead, this requires explicit support from the VM.
Both of those techniques apply equally to stack based and register based representations of the code, which incidentally is primarily a difference in the format in which the code is stored, and need not reflect a difference in the way the code is actually executed: a JIT compiler can turn both of them into whatever form the machine requires.

Will there be functions if there were no stacks?

I know that a stack data structure is used to store the local variables among many other things of a function that is being run.
I also understand how stack can be used to eleganlty manage recursion.
Suppose there was a machine that did not provide a stack area in memory, I don't think there will be programming languages for the machine that will support recursion. I am also wondering if programming languages for the machine would support functions without recursion.
Please, someone shread some sight on this for me.
A bit of theoretical framework is needed to understand that recursion is indeed not tied to functions at all, rather it is tied to expressiveness.
I won't dig into that, leaving Google fill any gaps.
Yes, we can have functions without a stack.
We don't need even the call/ret machinery for functions, we can just have the compiler inline every function call.
So there is no need for a stack at all.
This considers only functions in the programming sense, not mathematical sense.
A better name would be routines.
Anyway that is a simply proof of concepts that functions, intended as reusable code, don't need a stack.
However not all functions, in the mathematical sense, can implemented this way.
This is analogous to say: "We can have dogs on the bed but not all dogs can be on the bed".
You are in the right track by citing recursion, however when it comes to recursion, we need to be a lot more formal as there are various forms of recursion.
For example in-lining every function call may loop the compiler if the function being inlined is not constrained somehow.
Without digging into the theory, in order to be always sure that our compiler won't loop we can only allow primitive (bounded) recursion.
What you probably means by "recursion" is general recursion, that cannot be achieved by in-lining, we can show that we need an infinite amount of memory for GR and that is the demarcation between PR and GR, not having a stack.
So we can have function without a stack, even recursive (for some form of recursion) functions.
If your question was more practical then just consider MIPS.
There is no stack instructions or stack pointer register in the MIPS ISA, everything related to stack is just convention.
The compiler could use any memory area and treat it like a stack.

Efficient representation of functions

Function type A -> B in some sense is not very good. Though functions are first class values, one often cannot freely operate them due to efficiency problems. You can't apply too many transformations (A -> B) -> (C -> D), at some point you have to compute a value.
Obviously this is due to the non-strict nature of -> .
There are well know tricks to deal with functions of type Double -> Double. One can represent them as vectors given a certain basis, which can consist of trig functions, polynomials etc.
Are there any general tricks to get round the inefficiency of the A -> B type?
Or alternatives to -> ?
Your concern seems to be that given h === f • g, f • g is typically less efficient than h. Given a composition of functions known at compile time, there are two tricks performed by the compiler which can render f • g more efficient than you would suspect -- inlining, and fusion. Inlining avoids the extra indirection of a second function call, and opens up many new opportunities for optimizations. Stream fusion (or build/foldr fusion) can (to give a basic example) turn compositions such as map f . map g into map (f . g) thereby reducing the number of traversals of a structure by a constant factor. Fusion operates not only on lists, but other structures, and provides one reason for the efficient performance of Haskell libraries such as Vector.
Short cut fusion: http://www.haskell.org/haskellwiki/Correctness_of_short_cut_fusion
Stream fusion: What is Haskell's Stream Fusion
I cannot confirm this. As a productive user and implementor of AFRP, I am performing transformations on fully polymorphic functions a lot, deeply nested and for long running applications. Note that Haskell compilers do not use the traditional stack-based function calling paradigm. They use graph reduction algorithms. We don't have the same problems as, say, C.
One of the most general tricks is memoization - storing the value of a function after you computed it. Links: Haskellwiki, SO example, MemoCombinators. As you mentioned, the other trick is when you have a nice type of function (polynomial, vector, Taylor series etc.) - then it can be represented as a list, expression etc.
FWIW: In Felix, which is a whole program analyser relying heavily on inlining for performance, function arguments have three kinds: eager, lazy, or "let the compiler decide".
Eager arguments are evaluated and assigned to variable before the body of the function is evaluated.
Lazy arguments are evaluated by replacing the parameter with the argument expression wherever it occurs.
The default is "let the compiler decide". For a large amount of "ordinary" code (whatever that means) it doesn't make any difference whether you use eager or lazy evaluation.
Generally in Felix lazy evaluation is faster: note carefully this does NOT mean closures. It means inlining. However sometimes the compiler will chose eager evaluation, it reduces code bloat, and too much inlining is counter productive. I make no claim the algorithm is any good .. however Felix can sometimes outperform C and Ocaml (GHC didn't get into the finals).
As a simple example .. type classes. Felix has typeclasses, sort of like Haskell. No or very little performance overhead .. certainly no dictionaries!
In my view, Haskell would be a lot better if you just chucked out the archaic concept of separate compilation: whole program analysers can do so much more, and text is much faster to work with than object code (given the complete freedom to cache compilation results). It's crazy to have a lazy language use a compilation model designed for eager evaluation!
The other thing a Haskell variant might try is to drop the idea all functions are lazy, and instead adopt the idea that the evaluation strategy is irrelevant, unless otherwise specified. That may allow a lot more optimisation opportunities.

Reasoning about performance in Haskell

The following two Haskell programs for computing the n'th term of the Fibonacci sequence have greatly different performance characteristics:
fib1 n =
case n of
0 -> 1
1 -> 1
x -> (fib1 (x-1)) + (fib1 (x-2))
fib2 n = fibArr !! n where
fibArr = 1:1:[a + b | (a, b) <- zip fibArr (tail fibArr)]
They are very close to mathematically identical, but fib2 uses the list notation to memoize its intermediate results, while fib1 has explicit recursion. Despite the potential for the intermediate results to be cached in fib1, the execution time gets to be a problem even for fib1 25, suggesting that the recursive steps are always evaluated. Does referential transparency contribute anything to Haskell's performance? How can I know ahead of time if it will or won't?
This is just an example of the sort of thing I'm worried about. I'd like to hear any thoughts about overcoming the difficulty inherent in reasoning about the performance of a lazily-executed, functional programming language.
Summary: I'm accepting 3lectrologos's answer, because the point that you don't reason so much about the language's performance, as about your compiler's optimization, seems to be extremely important in Haskell - more so than in any other language I'm familiar with. I'm inclined to say that the importance of the compiler is the factor that differentiates reasoning about performance in lazy, functional langauges, from reasoning about the performance of any other type.
Addendum: Anyone happening on this question may want to look at the slides from Johan Tibell's talk about high performance Haskell.
In your particular Fibonacci example, it's not very hard to see why the second one should run faster (although you haven't specified what f2 is).
It's mainly an algorithmic issue:
fib1 implements the purely recursive algorithm and (as far as I know) Haskell has no mechanism for "implicit memoization".
fib2 uses explicit memoization (using the fibArr list to store previously computed values.
In general, it's much harder to make performance assumptions for a lazy language like Haskell, than for an eager one. Nevertheless, if you understand the underlying mechanisms (especially for laziness) and gather some experience, you will be able to make some "predictions" about performance.
Referential transparency increases (potentially) performance in (at least) two ways:
First, you (as a programmer) can be sure that two calls to the same function will always return the same, so you can exploit this in various cases to benefit in performance.
Second (and more important), the Haskell compiler can be sure for the above fact and this may enable many optimizations that can't be enabled in impure languages (if you've ever written a compiler or have any experience in compiler optimizations you are probably aware of the importance of this).
If you want to read more about the reasoning behind the design choices (laziness, pureness) of Haskell, I'd suggest reading this.
Reasoning about performance is generally hard in Haskell and lazy languages in general, although not impossible. Some techniques are covered in Chris Okasaki's Purely Function Data Structures (also available online in a previous version).
Another way to ensure performance is to fix the evaluation order, either using annotations or continuation passing style. That way you get to control when things are evaluated.
In your example you might calculate the numbers "bottom up" and pass the previous two numbers along to each iteration:
fib n = fib_iter(1,1,n)
where
fib_iter(a,b,0) = a
fib_iter(a,b,1) = a
fib_iter(a,b,n) = fib_iter(a+b,a,n-1)
This results in a linear time algorithm.
Whenever you have a dynamic programming algorithm where each result relies on the N previous results, you can use this technique. Otherwise you might have to use an array or something completely different.
Your implementation of fib2 uses memoization but each time you call fib2 it rebuild the "whole" result. Turn on ghci time and size profiling:
Prelude> :set +s
If it was doing memoisation "between" calls the subsequent calls would be faster and use no memory. Call fib2 20000 twice and see for yourself.
By comparison a more idiomatic version where you define the exact mathematical identity:
-- the infinite list of all fibs numbers.
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
memoFib n = fibs !! n
actually do use memoisation, explicit as you see. If you run memoFib 20000 twice you'll see the time and space taken the first time then the second call is instantaneous and take no memory. No magic and no implicit memoization like a comment might have hinted at.
Now about your original question: optimizing and reasoning about performance in Haskell...
I wouldn't call myself an expert in Haskell, I have only been using it for 3 years, 2 of which at my workplace but I did have to optimize and get to understand how to reason somewhat about its performance.
As mentionned in other post laziness is your friend and can help you gain performance however YOU have to be in control of what is lazily evaluated and what is strictly evaluated.
Check this comparison of foldl vs foldr
foldl actually stores "how" to compute the value i.e. it is lazy. In some case you saves time and space beeing lazy, like the "infinite" fibs. The infinite "fibs" doesn't generate all of them but knows how. When you know you will need the value you might as well just get it "strictly" speaking... That's where strictness annotation are usefull, to give you back control.
I recall reading many times that in lisp you have to "minimize" consing.
Understanding what is stricly evaluated and how to force it is important but so is understanding how much "trashing" you do to the memory. Remember Haskell is immutable, that means that updating a "variable" is actually creating a copy with the modification. Prepending with (:) is vastly more efficient than appending with (++) because (:) does not copy memory contrarily to (++). Whenever a big atomic block is updated (even for a single char) the whole block needs to be copied to represent the "updated" version. The way you structure data and update it can have a big impact on performance. The ghc profiler is your friend and will help you spot these. Sure the garbage collector is fast but not having it do anything is faster!
Cheers
Aside from the memoization issue, fib1 also uses non-tailcall recursion. Tailcall recursion can be re-factored automatically into a simple goto and perform very well, but the recursion in fib1 cannot be optimized in this way, because you need the stack frame from each instance of fib1 in order to calculate the result. If you rewrote fib1 to pass a running total as an argument, thus allowing a tail call instead of needing to keep the stack frame for the final addition, the performance would improve immensely. But not as much as the memoized example, of course :)
Since allocation is a major cost in any functional language, an important part of understanding performance is to understand when objects are allocated, how long they live, when they die, and when they are reclaimed. To get this information you need a heap profiler. It's an essential tool, and luckily GHC ships with a good one.
For more information, read Colin Runciman's papers.

Resources