Does "Call by name" slow down Haskell? - performance

I assume it doesn't.
My reason is that Haskell is pure functional programming (without I/O Monad), they could have made every "call by name" use the same evaluated value if the "name"s are the same.
I don't know anything about the implementation details but I'm really interested.
Detailed explanations will be much appreciated :)
BTW, I tried google, it was quite hard to get anything useful.

First of all, Haskell is a specification, not an implementation; the report does not actually require use of call-by-name evaluation, or lazy evaluation for that matter. Haskell implementations are only required to be non-strict, which does rule out call-by-value and similar strategies.
So, strictly (ha, ha) speaking, evaluation strategies can't slow down Haskell. I'm not sure what can slow down Haskell, though clearly something has or else it wouldn't have taken 12 years to get the next version of the Report out after Haskell 98. My guess is that it involves committees somehow.
Anyway, "lazy evaluation" refers to a "call by need" strategy, which is the most common implementation choice for Haskell. This differs from call-by-name in that if a subexpression is used in multiple places, it will be evaluated at most once.
The details of what qualifies as a subexpression that will be shared is a bit subtle and probably somewhat implementation-dependent, but to use an example from GHC Haskell: Consider the function cycle, which repeats an input list infinitely. A naive implementation might be:
cycle xs = xs ++ cycle xs
This ends up being inefficient because there is no single cycle xs expression that can be shared, so the resulting list has to be constructed continually as it's traversed, allocating more memory and doing more computation each time.
In contrast, the actual implementation looks like this:
cycle xs = xs' where xs' = xs ++ xs'
Here the name xs' is defined recursively as itself appended to the end of the input list. This time xs' is shared, and evaluated only once; the resulting infinite list is actually a finite, circular linked list in memory, and once the entire loop has been evaluated no further work is needed.
In general, GHC will not memoize functions for you: given f and x, each use of f x will be re-evaluated unless you give the result a name and use that. The resulting value will be the same in either case, but the performance can differ significantly. This is mostly a matter of avoiding pessimizations--it would be easy for GHC to memoize things for you, but in many cases this would cost large amounts of memory to gain tiny or nonexistent amounts of speed.
The flip side is that shared values are retained; if you have a data structure that's very expensive to compute, naming the result of constructing it and passing that to functions using it ensures that no work is duplicated--even if it's used simultaneously by different threads.
You can also pessimize things yourself this way--if a data structure is cheap to compute and uses lots of memory, you should probably avoid sharing references to the full structure, as that will keep the whole thing alive in memory as long as anything could possibly use it later.

Yes, it does, somewhat. The problem is that Haskell can't, in general, calculate the value too early (e.g. if it would lead to an exception), so it sometimes needs to keep a thunk (code for calculating the value) instead of the value itself, which uses more memory and slows things down. The compiler tries to detect cases where this can be avoided, but it's impossible to detect all of them.

Related

Does Prolog need GC when the occurs check is globally enabled?

As far as I can tell, with sound unification, SLD resolution should not create cyclic data structures (Is this correct?)
If so, one could, in theory, implement Prolog in such a way that it wouldn't need garbage collection (GC). But then again, one might not.
Is this true for WAM-based Prolog implementations?
Is this true for SWI-Prolog? (I believe it's not WAM-based) Is it safe to disable GC in SWI-Prolog when the occurs check is globally enabled?
Specifically:
:- set_prolog_flag(occurs_check, true).
:- set_prolog_flag(gc, false). /* is this safe? */
The creation of cyclic terms is far from being the only operation that can create garbage (as in garbage collection) in Prolog (also worth noting that not all Prolog systems provide comprehensive support for cyclic terms but most of them support some form of garbage collection).
As an example, assume that you have in your code the following sequence of calls to go from a number to an atom:
...,
number_codes(Number, Codes),
atom_codes(Atom, Codes),
...
Here, Codes is a temporary list that should be garbage collected. Another example, assume that you're calling setof/3 to get an ordered list of results where you're only interested in the first two:
...,
setof(C, x(X), [X1, X2| _]),
...
You just created another temporary list. Or that you forgot about sub_atom/5 and decide to use atom_concat/3 to check the prefix of an atom:
...,
atom_concat(Prefix, _, Atom),
...
That second argument, the atom suffix that you don't care about (hence the anonymous variable), is a temporary atom that you just created. But not all Prolog systems provide an atom garbage collector, which can lead to trouble in long running applications.
But even when you think that you have carefully written your code to avoid the creation of temporary terms, the Prolog system may still be creating garbage when running your code. Prolog systems use different memory areas for different purposes, and operations may need to make temporary copies of segments of memory between different memory areas, depending on the implementation. The Prolog system may be written in a language, e.g. Java, that may eventually take care of that garbage. But most likely it's written in C or C++ and some sort of garbage collection is used internally. Not to mention that the Prolog system may be grabbing a big block of memory to be able to prove a query and then reclaiming that memory after the query terminates.
Well, something has to free up multiply-referenced memory to which references exist that can be dropped at any step in the computation. This is the case independently of whether structures are cyclic or not.
Consider variables A and B naming the same structure in memory (they "name the same term"). The structure is referenced from 2 places. Suppose the predicate in which B is defined succeeds or fails. The Prolog Processor cannot just deallocate that structure: it is still referenced from A. This means you need at least reference counting to make sure you don't free memory too early. That's a reference counting garbage collector.
I don't know what kind of garbage collection is implemented in any specific Prolog implementation (there are many approaches, some better suited to Prolog than others ... in a not completely-unrelated context 25 years of Java have created all of these) but you need to use one, not necessarily a reference counting one.
(Cyclic structures are only special to garbage collection because reference counting garbage collection algorithms are unable to free up cyclic structures, as all the cells in a loop have a reference count of at least 1.)
(Also, an IMHO, never a trust a programming languages in which you have to call free yourself. There is probably a variation of Greenspun's tenth rule ("Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.") in that any program written in a programming language in which you have to call free yourself contains an ad hoc, informally-specified, bug-ridden, slow implementation of a garbage collection algorithm.")
(OTOH, Rust seems to take kind of a middle way, offloading some effort onto the developer but having the advantage of being able to decide whether to free memory when a variable goes out of scope. But Rust is not Prolog.)
This here would be safe:
:- set_prolog_flag(gc, false).
But if your Prolog system has garbage collection, switching it off might be not a good idea, since even with occurs check set to true, there can be still garbage through temporary results. And having garbage continuously removed can improve locality of code, i.e. your memory gets less trashed by cache misses:
p(X,Y) :- q(X,Z), r(Z,Y).
The variable Z might point to some Prolog term which is only temporarily needed. Most modern Prolog systems can remove such Prolog terms through so called environment trimming.
But an occurs check opens up the path to a special kind of garbage collection. Namely since no more cyclic terms can appear, it is possible to use reference counting. An old Prolog system that had reference counting was this beast here:
xpProlog: High Performance Extended Pure Prolog - Lüdemann, 1988
https://open.library.ubc.ca/media/download/pdf/831/1.0051961/1
Also Jekejeke Prolog does still do reference counting. A problem with reference counting are attributed variables, which can create a cyclic term nevertheless, for example a freeze/2 as follows creates a cycle through the frozen goal back to the variable:
?- freeze(X, (write(X), nl)).
Edit 04.09.2021:
What also might demand garbage collection is setarg/3. It
can create cycles which cannot be so easily removed by
reference counting.
?- X = f(0), setarg(1,X,X).
X = f(X).
Since setarg/3 is backtrackable, the cycle would go away in
backtracking, at least I guess so. But the cycle might still bother
when we are deep in backtracking and running out of memory.
Or the cycle might not go away through backtracking since we
used the non-backtracking version nb_setarg/3.

Why is computing a factorial faster/efficient in functional programming?

Here's a claim made in one of the answers at: https://softwareengineering.stackexchange.com/a/136146/18748
Try your hand at a Functional language or two. Try implementing
factorial in Erlang using recursion and watch your jaw hit the floor
when 20000! returns in 5 seconds (no stack overflow in site)
Why would it be faster/more efficient than using recursion/iteration in Java/C/C++/Python (any)? What is the underlying mathematical/theoretical concept that makes this happen? Unfortunately, I wasn't ever exposed to functional programming in my undergrad (started with 'C') so I may just lack knowing the 'why'.
A recursive solution would probably be slower and clunkier in Java or C++, but we don't do things that way in Java and C++, so. :)
As for why, functional languages invariably make very heavy usage of recursion (as by default they either have to jump through hoops, or simply aren't allowed, to modify variables, which on its own makes most forms of iteration ineffective or outright impossible). So they effectively have to optimize the hell out of it in order to be competitive.
Almost all of them implement an optimization called "tail call elimination", which uses the current call's stack space for the next call and turns a "call" into a "jump". That little change basically turns recursion into iteration, but don't remind the FPers of that. When it comes to iteration in a procedural language and recursion in a functional one, the two would prove about equal performancewise. (If either one's still faster, though, it'd be iteration.)
The rest is libraries and number types and such. Decent math code ==> decent math performance. There's nothing keeping procedural or OO languages from optimizing similarly, other than that most people don't care that much. Functional programmers, on the other hand, love to navel gaze about how easily they can calculate Fibonacci sequences and 20000! and other numbers that most of us will never use anyway.
Essentially, functional languages make recursion cheap.
Particularly via the tail call optimization.
Tail-call optimization
Lazy evaluation: "When using delayed evaluation, an expression is not evaluated as soon as it gets bound to a variable, but when the evaluator is forced to produce the expression's value."
Both concepts are illustrated in this question about various ways of implementing factorial in Clojure. You can also find a detailed discussion of lazy sequences and TCO in Stuart Halloway and Aaron Bedra's book, Programming Clojure.
The following function, adopted from Programming Clojure, creates a lazy sequence with the first 1M members of the Fibonacci sequence and realizes the one-hundred-thousandth member:
user=> (time (nth (take 1000000 (map first (iterate (fn [[a b]] [b (+ a b)]) [0N 1N]))) 100000))
"Elapsed time: 252.045 msecs"
25974069347221724166155034....
(20900 total digits)
(512MB heap, Intel Core i7, 2.5GHz)
It's not faster (than what?), and it has nothing to do with tail call optimization (it's not enough to throw a buzzword in here, one should also explain why tail call optimization should be faster than looping? it's simply not the case!)
Mind you that I am not a functional programming hater, to the contrary! But spreading myths does not serve the case for functional programming.
BTW, has any one here actually tried how long it takes to compute (and print, which should consume at least 50% of the CPU cycles needed) 20000!?
I did:
main _ = println (product [2n..20000n])
This is a JVM language compiled to Java, and it uses Java big integers (known to be slow). It also suffers of JVM startup costs. And it is not the fastest way to do it (explicit recursion could save list construction, for example).
The result is:
181920632023034513482764175686645876607160990147875264
...
... many many lines with digits
...
000000000000000000000000000000000000000000000000000000000000000
real 0m3.330s
user 0m4.472s
sys 0m0.212s
(On a Intel® Core™ i3 CPU M 350 # 2.27GHz × 4)
We can safely assume that the same in C with GMP wouldn't even use 50% of that time.
Hence, functional is faster is a myth, as well as functional is slower. It's not even a myth, it's simply nonsense as long as one does not say compared to what it is faster/slower.

Efficient representation of functions

Function type A -> B in some sense is not very good. Though functions are first class values, one often cannot freely operate them due to efficiency problems. You can't apply too many transformations (A -> B) -> (C -> D), at some point you have to compute a value.
Obviously this is due to the non-strict nature of -> .
There are well know tricks to deal with functions of type Double -> Double. One can represent them as vectors given a certain basis, which can consist of trig functions, polynomials etc.
Are there any general tricks to get round the inefficiency of the A -> B type?
Or alternatives to -> ?
Your concern seems to be that given h === f • g, f • g is typically less efficient than h. Given a composition of functions known at compile time, there are two tricks performed by the compiler which can render f • g more efficient than you would suspect -- inlining, and fusion. Inlining avoids the extra indirection of a second function call, and opens up many new opportunities for optimizations. Stream fusion (or build/foldr fusion) can (to give a basic example) turn compositions such as map f . map g into map (f . g) thereby reducing the number of traversals of a structure by a constant factor. Fusion operates not only on lists, but other structures, and provides one reason for the efficient performance of Haskell libraries such as Vector.
Short cut fusion: http://www.haskell.org/haskellwiki/Correctness_of_short_cut_fusion
Stream fusion: What is Haskell's Stream Fusion
I cannot confirm this. As a productive user and implementor of AFRP, I am performing transformations on fully polymorphic functions a lot, deeply nested and for long running applications. Note that Haskell compilers do not use the traditional stack-based function calling paradigm. They use graph reduction algorithms. We don't have the same problems as, say, C.
One of the most general tricks is memoization - storing the value of a function after you computed it. Links: Haskellwiki, SO example, MemoCombinators. As you mentioned, the other trick is when you have a nice type of function (polynomial, vector, Taylor series etc.) - then it can be represented as a list, expression etc.
FWIW: In Felix, which is a whole program analyser relying heavily on inlining for performance, function arguments have three kinds: eager, lazy, or "let the compiler decide".
Eager arguments are evaluated and assigned to variable before the body of the function is evaluated.
Lazy arguments are evaluated by replacing the parameter with the argument expression wherever it occurs.
The default is "let the compiler decide". For a large amount of "ordinary" code (whatever that means) it doesn't make any difference whether you use eager or lazy evaluation.
Generally in Felix lazy evaluation is faster: note carefully this does NOT mean closures. It means inlining. However sometimes the compiler will chose eager evaluation, it reduces code bloat, and too much inlining is counter productive. I make no claim the algorithm is any good .. however Felix can sometimes outperform C and Ocaml (GHC didn't get into the finals).
As a simple example .. type classes. Felix has typeclasses, sort of like Haskell. No or very little performance overhead .. certainly no dictionaries!
In my view, Haskell would be a lot better if you just chucked out the archaic concept of separate compilation: whole program analysers can do so much more, and text is much faster to work with than object code (given the complete freedom to cache compilation results). It's crazy to have a lazy language use a compilation model designed for eager evaluation!
The other thing a Haskell variant might try is to drop the idea all functions are lazy, and instead adopt the idea that the evaluation strategy is irrelevant, unless otherwise specified. That may allow a lot more optimisation opportunities.

Why are the functions making Vector an instance of Functor, Monad, Applicative, Alternative, Foldable and Traversable slow?

The changelog for version 0.8 of vector lists the following change with a warning:
Functor, Monad, Applicative, Alternative, Foldable and Traversable
instances for boxed vectors (WARNING: they tend to be slow and are
only provided for completeness).
Could someone explain why this is the case? Is it just the normal cost of typeclass specialization, or something more interesting?
Update: Looking at some particular instances, one sees for example:
instance Foldable.Foldable Vector where
{-# INLINE foldr #-}
foldr = foldr
and similarly for the other folds. Does this mean that folding is slow for Vectors in general? If not, what makes a non-specialized fold slow enough to warrant a warning?
I submitted the original set of these instances to Roman a year and a half ago and have maintained vector-instances since then. (I had to remove these instances from vector-instances once they migrated into vector, and now maintain it solely for the really exotic stuff). His concern was that if folks used these instances polymorphically then the RULES that make Vectors fuse away can't fire unless the polymorphic function gets inlined and monomorphized.
They exist because not every bit of code on the planet is Vector-specific and even then it is nice to sometimes use the common names.
Slow here is relative. The worst case is they perform like anybody else's folds, binds, etc. but Roman takes every single boxed value as a personal insult. :)
I've just had a quick look at the source code and the implementations don't look excessively slow. I'd argue the authors added this warning because when you're writing a program in the Vector monad, you're working from such a high-level point of view that it's easy to forget that every >>= is, in fact, a concatMap, which tends to be inherently slow.
Another thing: Vector is particularly fast for unboxed types. So a user might be attracted to use the monad notation (for convenience), while he should actually be using an unboxed type (for speed).

Reasoning about performance in Haskell

The following two Haskell programs for computing the n'th term of the Fibonacci sequence have greatly different performance characteristics:
fib1 n =
case n of
0 -> 1
1 -> 1
x -> (fib1 (x-1)) + (fib1 (x-2))
fib2 n = fibArr !! n where
fibArr = 1:1:[a + b | (a, b) <- zip fibArr (tail fibArr)]
They are very close to mathematically identical, but fib2 uses the list notation to memoize its intermediate results, while fib1 has explicit recursion. Despite the potential for the intermediate results to be cached in fib1, the execution time gets to be a problem even for fib1 25, suggesting that the recursive steps are always evaluated. Does referential transparency contribute anything to Haskell's performance? How can I know ahead of time if it will or won't?
This is just an example of the sort of thing I'm worried about. I'd like to hear any thoughts about overcoming the difficulty inherent in reasoning about the performance of a lazily-executed, functional programming language.
Summary: I'm accepting 3lectrologos's answer, because the point that you don't reason so much about the language's performance, as about your compiler's optimization, seems to be extremely important in Haskell - more so than in any other language I'm familiar with. I'm inclined to say that the importance of the compiler is the factor that differentiates reasoning about performance in lazy, functional langauges, from reasoning about the performance of any other type.
Addendum: Anyone happening on this question may want to look at the slides from Johan Tibell's talk about high performance Haskell.
In your particular Fibonacci example, it's not very hard to see why the second one should run faster (although you haven't specified what f2 is).
It's mainly an algorithmic issue:
fib1 implements the purely recursive algorithm and (as far as I know) Haskell has no mechanism for "implicit memoization".
fib2 uses explicit memoization (using the fibArr list to store previously computed values.
In general, it's much harder to make performance assumptions for a lazy language like Haskell, than for an eager one. Nevertheless, if you understand the underlying mechanisms (especially for laziness) and gather some experience, you will be able to make some "predictions" about performance.
Referential transparency increases (potentially) performance in (at least) two ways:
First, you (as a programmer) can be sure that two calls to the same function will always return the same, so you can exploit this in various cases to benefit in performance.
Second (and more important), the Haskell compiler can be sure for the above fact and this may enable many optimizations that can't be enabled in impure languages (if you've ever written a compiler or have any experience in compiler optimizations you are probably aware of the importance of this).
If you want to read more about the reasoning behind the design choices (laziness, pureness) of Haskell, I'd suggest reading this.
Reasoning about performance is generally hard in Haskell and lazy languages in general, although not impossible. Some techniques are covered in Chris Okasaki's Purely Function Data Structures (also available online in a previous version).
Another way to ensure performance is to fix the evaluation order, either using annotations or continuation passing style. That way you get to control when things are evaluated.
In your example you might calculate the numbers "bottom up" and pass the previous two numbers along to each iteration:
fib n = fib_iter(1,1,n)
where
fib_iter(a,b,0) = a
fib_iter(a,b,1) = a
fib_iter(a,b,n) = fib_iter(a+b,a,n-1)
This results in a linear time algorithm.
Whenever you have a dynamic programming algorithm where each result relies on the N previous results, you can use this technique. Otherwise you might have to use an array or something completely different.
Your implementation of fib2 uses memoization but each time you call fib2 it rebuild the "whole" result. Turn on ghci time and size profiling:
Prelude> :set +s
If it was doing memoisation "between" calls the subsequent calls would be faster and use no memory. Call fib2 20000 twice and see for yourself.
By comparison a more idiomatic version where you define the exact mathematical identity:
-- the infinite list of all fibs numbers.
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
memoFib n = fibs !! n
actually do use memoisation, explicit as you see. If you run memoFib 20000 twice you'll see the time and space taken the first time then the second call is instantaneous and take no memory. No magic and no implicit memoization like a comment might have hinted at.
Now about your original question: optimizing and reasoning about performance in Haskell...
I wouldn't call myself an expert in Haskell, I have only been using it for 3 years, 2 of which at my workplace but I did have to optimize and get to understand how to reason somewhat about its performance.
As mentionned in other post laziness is your friend and can help you gain performance however YOU have to be in control of what is lazily evaluated and what is strictly evaluated.
Check this comparison of foldl vs foldr
foldl actually stores "how" to compute the value i.e. it is lazy. In some case you saves time and space beeing lazy, like the "infinite" fibs. The infinite "fibs" doesn't generate all of them but knows how. When you know you will need the value you might as well just get it "strictly" speaking... That's where strictness annotation are usefull, to give you back control.
I recall reading many times that in lisp you have to "minimize" consing.
Understanding what is stricly evaluated and how to force it is important but so is understanding how much "trashing" you do to the memory. Remember Haskell is immutable, that means that updating a "variable" is actually creating a copy with the modification. Prepending with (:) is vastly more efficient than appending with (++) because (:) does not copy memory contrarily to (++). Whenever a big atomic block is updated (even for a single char) the whole block needs to be copied to represent the "updated" version. The way you structure data and update it can have a big impact on performance. The ghc profiler is your friend and will help you spot these. Sure the garbage collector is fast but not having it do anything is faster!
Cheers
Aside from the memoization issue, fib1 also uses non-tailcall recursion. Tailcall recursion can be re-factored automatically into a simple goto and perform very well, but the recursion in fib1 cannot be optimized in this way, because you need the stack frame from each instance of fib1 in order to calculate the result. If you rewrote fib1 to pass a running total as an argument, thus allowing a tail call instead of needing to keep the stack frame for the final addition, the performance would improve immensely. But not as much as the memoized example, of course :)
Since allocation is a major cost in any functional language, an important part of understanding performance is to understand when objects are allocated, how long they live, when they die, and when they are reclaimed. To get this information you need a heap profiler. It's an essential tool, and luckily GHC ships with a good one.
For more information, read Colin Runciman's papers.

Resources