Simple tips for Haskell performance increases (on ProjectEuler problems)? - performance

I'm new to programming and learning Haskell by reading and working through Project Euler problems. Of course, the most important thing one can do to improve performance on these problems is to use a better algorithm. However, it is clear to me that there are other simple and easy to implement ways to improve performance. A cursory search brought up this question, and this question, which give the following tips:
Use the ghc flags -O2 and -fllvm.
Use the type Int, instead of Integer, because it is unboxed (or even Integer instead of Int64). This requires typing the functions, not letting the compiler decide on the fly.
Use rem, not mod, for division testing.
Use Schwartzian transformations when appropriate.
Using an accumulator in recursive functions (a tail-recursion optimization, I believe).
Memoization (?)
(One answer also mentions worker/wrapper transformation, but that seems fairly advanced.)
Question: What other simple optimizations can one make in Haskell to improve performance on Project Euler-style problems? Are there any other Haskell-specific (or functional programming specific?) ideas or features that could be used to help speed up solutions to Project Euler problems? Conversely, what should one watch out for? What are some common yet inefficient things to be avoided?

Here are some good slides by Johan Tibell that I frequently refer to:
Haskell Performance Patterns

One easy suggestion is to use hlint which is a program that checks your source code and makes suggestions for improvements syntax wise. This might not increase speed because most likely it's already done by the compiler or the lazy evaluation. But it might help the compiler in some cases. Further more it will make you a better Haskell programmer since you will learn better ways to do things, and it might be easier to understand your program and analyze it.
examples taken from http://community.haskell.org/~ndm/darcs/hlint/hlint.htm such as:
darcs-2.1.2\src\CommandLine.lhs:94:1: Error: Use concatMap
Found:
concat $ map escapeC s
Why not:
concatMap escapeC s
and
darcs-2.1.2\src\Darcs\Patch\Test.lhs:306:1: Error: Use a more efficient monadic variant
Found:
mapM (delete_line (fn2fp f) line) old
Why not:
mapM_ (delete_line (fn2fp f) line) old
I think the largest increases you can do in Project Euler problems is to understand the problem and remove unnecessary computations. Even if you don't understand everything you can do some small fixes which will make your program run twice the speed. Let's say you are looking for primes up to 1.000.000, then you of course can do filter isPrime [1..1000000]. But if you think a bit, then you can realize that well, no even number above is a prime, there you have removed (about) half the work. Instead doing [1,2] ++ filter isPrime [3,5..999999]

There is a fairly large section of the Haskell wiki about performance.
One fairly common problem is too little (or too much) strictness (this is covered by the sections listed in the General techniques section of the performance page above). Too much laziness causes a large number of thunks to be accumulated, too much strictness can cause too much to be evaluated.
These considerations are especially important when writing tail recursive functions (i.e. those with an accumulator); And, on that note, depending on how the function is used, a tail recursive function is sometimes less efficient in Haskell than the equivalent non-tail-recursive function, even with the optimal strictness annotations.
Also, as demonstrated by this recent question, sharing can make a huge difference to performance (in many cases, this can be considered a form of memoisation).

Project Euler is mostly about finding clever algorithmic solutions to the problems. Once you have the right algorithm, micro-optimization is rarely an issue, since even a straightforward or interpreted (e.g. Python or Ruby) implementation should run well within the speed constraints. The main technique you need is understanding lazy evaluation so you can avoid thunk buildups.

Related

Examples where compiler-optimized functional code performs better than imperative code

One of the promises of side-effect free, referentially transparent functional programming is that such code can be extensively optimized.
To quote Wikipedia:
Immutability of data can, in many cases, lead to execution efficiency, by allowing the compiler to make assumptions that are unsafe in an imperative language, thus increasing opportunities for inline expansion.
I'd like to see examples where a functional language compiler outperforms an imperative one by producing a better optimized code.
Edit: I tried to give a specific scenario, but apparently it wasn't a good idea. So I'll try to explain it in a different way.
Programmers translate ideas (algorithms) into languages that machines can understand. At the same time, one of the most important aspects of the translation is that also humans can understand the resulting code. Unfortunately, in many cases there is a trade-off: A concise, readable code suffers from slow performance and needs to be manually optimized. This is error-prone, time consuming, and it makes the code less readable (up to totally unreadable).
The foundations of functional languages, such as immutability and referential transparency, allow compilers to perform extensive optimizations, which could replace manual optimization of code and free programmers from this trade-off. I'm looking for examples of ideas (algorithms) and their implementations, such that:
the (functional) implementation is close to the original idea and is easy to understand,
it is extensively optimized by the compiler of the language, and
it is hard (or impossible) to write similarly efficient code in an imperative language without manual optimizations that reduce its conciseness and readability.
I apologize if it is a bit vague, but I hope the idea is clear. I don't want to give unnecessary restrictions on the answers. I'm open to suggestions if someone knows how to express it better.
My interest isn't just theoretical. I'd like to use such examples (among other things) to motivate students to get interested in functional programming.
At first, I wasn't satisfied by a few examples suggested in the comments. On second thoughts I take my objections back, those are good examples. Please feel free to expand them to full answers so that people can comment and vote for them.
(One class of such examples will be most likely parallelized code, which can take advantage of multiple CPU cores. Often in functional languages this can be done easily without sacrificing code simplicity (like in Haskell by adding par or pseq in appropriate places). I' be interested in such examples too, but also in other, non-parallel ones.)
There are cases where the same algorithm will optimize better in a pure context. Specifically, stream fusion allows an algorithm that consists of a sequence of loops that may be of widely varying form: maps, filters, folds, unfolds, to be composed into a single loop.
The equivalent optimization in a conventional imperative setting, with mutable data in loops, would have to achieve a full effect analysis, which no one does.
So at least for the class of algorithms that are implemented as pipelines of ana- and catamorphisms on sequences, you can guarantee optimization results that are not possible in an imperative setting.
A very recent paper Haskell beats C using generalised stream fusion by Geoff Mainland, Simon Peyton Jones, Simon Marlow, Roman Leshchinskiy (submitted to ICFP 2013) describes such an example. Abstract (with the interesting part in bold):
Stream fusion [6] is a powerful technique for automatically transforming
high-level sequence-processing functions into efficient implementations.
It has been used to great effect in Haskell libraries
for manipulating byte arrays, Unicode text, and unboxed vectors.
However, some operations, like vector append, still do not perform
well within the standard stream fusion framework. Others,
like SIMD computation using the SSE and AVX instructions available
on modern x86 chips, do not seem to fit in the framework at
all.
In this paper we introduce generalized stream fusion, which
solves these issues. The key insight is to bundle together multiple
stream representations, each tuned for a particular class of stream
consumer. We also describe a stream representation suited for efficient
computation with SSE instructions. Our ideas are implemented
in modified versions of the GHC compiler and vector library.
Benchmarks show that high-level Haskell code written using
our compiler and libraries can produce code that is faster than both
compiler- and hand-vectorized C.
This is just a note, not an answer: the gcc has a pure attribute suggesting it can take account of purity; the obvious reasons are remarked on in the manual here.
I would think that 'static single assignment' imposes a form of purity -- see the links at http://lambda-the-ultimate.org/node/2860 or the wikipedia article.
make and various build systems perform better for large projects by assuming that various build steps are referentially transparent; as such, they only need to rerun steps that have had their inputs change.
For small to medium sized changes, this can be a lot faster than building from scratch.

Performance Implications of Point-Free style

I’m taking my first baby-steps in learning functional programing using F# and I’ve just come across the Forward Pipe (|>) and Forward Composition (>>) operators. At first I thought they were just sugar rather than having an effect on the final running code (though I know piping helps with type inference).
However I came across this SO article:
What are advantages and disadvantages of “point free” style in functional programming?
Which has two interesting and informative answers (that instead of simplifying things for me opened a whole can of worms surrounding “point-free” or “pointless” style) My take-home from these (and other reading around) is that point-free is a debated area. Like lambas, point-free style can make code easier to understand, or much harder, depending on use. It can help in naming things meaningfully.
But my question concerns a comment on the first answer:
AshleyF muses in the answer:
“It seems to me that composition may reduce GC pressure by making it more obvious to the compiler that there is no need to produce intermediate values as in pipelining; helping make the so-called "deforestation" problem more tractable.”
Gasche replies:
“The part about improved compilation is not true at all. In most languages, point-free style will actually decrease performances. Haskell relies heavily on optimizations precisely because it's the only way to make the cost of these things bearable. At best, those combinators are inlined away and you get an equivalent pointful version”
Can anyone expand on the performance implications? (In general and specifically for F#) I had just assumed it was a writing-style thing and the compiler would unstrangle both idioms into equivalent code.
This answer is going to be F#-specific. I don't know how the internals of other functional languages work, and the fact that they don't compile to CIL could make a big difference.
I can see three questions here:
What are the performance implications of using |>?
What are the performance implications of using >>?
What is the performance difference between declaring a function with its arguments and without them?
The answers (using examples from the question you linked to):
Is there any difference between x |> sqr |> sum and sum (sqr x)?
No, there isn't. The compiled CIL is exactly the same (here represented in C#):
sum.Invoke(sqr.Invoke(x))
(Invoke() is used, because sqr and sum are not CIL methods, they are FSharpFunc, but that's not relevant here.)
Is there any difference between (sqr >> sum) x and sum (sqr x)?
No, both samples compile to the same CIL as above.
Is there any difference between let sumsqr = sqr >> sum and let sumsqr x = (sqr >> sum) x?
Yes, the compiled code is different. If you specify the argument, sumsqr is compiled into a normal CLI method. But if you don't specify it, it's compiled as a property of type FSharpFunc with a backing field, whose Invoke() method contains the code.
But the effect of all is that invoking the point-free version means loading one field (the FSharpFunc), which is not done if you specify the argument. But I think that shouldn't measurably affect performance, except in the most extreme circumstances.

Is there a functional algorithm which is faster than an imperative one?

I'm searching for an algorithm (or an argument of such an algorithm) in functional style which is faster than an imperative one.
I like functional code because it's expressive and mostly easier to read than it's imperative pendants. But I also know that this expressiveness can cost runtime overhead. Not always due to techniques like tail recursion - but often they are slower.
While programming I don't think about runtime costs of functional code because nowadays PCs are very fast and development time is more expensive than runtime. Furthermore for me readability is more important than performance. Nevertheless my programs are fast enough so I rarely need to solve a problem in an imperative way.
There are some algorithms which in practice should be implemented in an imperative style (like sorting algorithms) otherwise in most cases they are too slow or requires lots of memory.
In contrast due to techniques like pattern matching a whole program like a parser written in an functional language may be much faster than one written in an imperative language because of the possibility of compilers to optimize the code.
But are there any algorithms which are faster in a functional style or are there possibilities to setting up arguments of such an algorithm?
A simple reasoning. I don't vouch for terminology, but it seems to make sense.
A functional program, to be executed, will need to be transformed into some set of machine instructions.
All machines (I've heard of) are imperative.
Thus, for every functional program, there's an imperative program (roughly speaking, in assembler language), equivalent to it.
So, you'll probably have to be satisfied with 'expressiveness', until we get 'functional computers'.
The short answer:
Anything that can be easily made parallel because it's free of side-effects will be quicker on a multi-core processor.
QuickSort, for example, scales up quite nicely when used with immutable collections: http://en.wikipedia.org/wiki/Quicksort#Parallelization
All else being equal, if you have two algorithms that can reasonably be described as equivalent, except that one uses pure functions on immutable data, while the second relies on in-place mutations, then the first algorithm will scale up to multiple cores with ease.
It may even be the case that your programming language can perform this optimization for you, as with the scalaCL plugin that will compile code to run on your GPU. (I'm wondering now if SIMD instructions make this a "functional" processor)
So given parallel hardware, the first algorithm will perform better, and the more cores you have, the bigger the difference will be.
FWIW there are Purely functional data structures, which benefit from functional programming.
There's also a nice book on Purely Functional Data Structures by Chris Okasaki, which presents data structures from the point of view of functional languages.
Another interesting article Announcing Intel Concurrent Collections for Haskell 0.1, about parallel programming, they note:
Well, it happens that the CnC notion
of a step is a pure function. A step
does nothing but read its inputs and
produce tags and items as output. This
design was chosen to bring CnC to that
elusive but wonderful place called
deterministic parallelism. The
decision had nothing to do with
language preferences. (And indeed, the
primary CnC implementations are for
C++ and Java.)
Yet what a great match Haskell and CnC
would make! Haskell is the only major
language where we can (1) enforce that
steps be pure, and (2) directly
recognize (and leverage!) the fact
that both steps and graph executions
are pure.
Add to that the fact that Haskell is
wonderfully extensible and thus the
CnC "library" can feel almost like a
domain-specific language.
It doesn't say about performance – they promise to discuss some of the implementation details and performance in future posts, – but Haskell with its "pureness" fits nicely into parallel programming.
One could argue that all programs boil down to machine code.
So, if I dis-assemble the machine code (of an imperative program) and tweak the assembler, I could perhaps end up with a faster program. Or I could come up with an "assembler algorithm" that exploits some specific CPU feature, and therefor it really is faster than the imperative language version.
Does this situation lead to the conclusion that we should use assembler everywhere? No, we decided to use imperative languages because they are less cumbersome. We write pieces in assembler because we really need to.
Ideally we should also use FP algorithms because they are less cumbersome to code, and use imperative code when we really need to.
Well, I guess you meant to ask if there is an implementation of an algorithm in functional programming language that is faster than another implementation of the same algorithm but in an imperative language. By "faster" I mean that it performs better in terms of execution time or memory footprint on some inputs according to some measurement that we deem trustworthy.
I do not exclude this possibility. :)
To elaborate on Yasir Arsanukaev's answer, purely functional data structures can be faster than mutable data structures in some situations becuase they share pieces of their structure. Thus in places where you might have to copy a whole array or list in an imperative language, where you can get away with a fraction of the copying because you can change (and copy) only a small part of the data structure. Lists in functional languages are like this -- multiple lists can share the same tail since nothing can be modified. (This can be done in imperative languages, but usually isn't, because within the imperative paradigm, people aren't usually used to talking about immutable data.)
Also, lazy evaluation in functional languages (particularly Haskell which is lazy by default) can also be very advantageous because it can eliminate code execution when the code's results won't actually be used. (One can be very careful not to run this code in the first place in imperative languages, however.)

Reasoning about performance in Haskell

The following two Haskell programs for computing the n'th term of the Fibonacci sequence have greatly different performance characteristics:
fib1 n =
case n of
0 -> 1
1 -> 1
x -> (fib1 (x-1)) + (fib1 (x-2))
fib2 n = fibArr !! n where
fibArr = 1:1:[a + b | (a, b) <- zip fibArr (tail fibArr)]
They are very close to mathematically identical, but fib2 uses the list notation to memoize its intermediate results, while fib1 has explicit recursion. Despite the potential for the intermediate results to be cached in fib1, the execution time gets to be a problem even for fib1 25, suggesting that the recursive steps are always evaluated. Does referential transparency contribute anything to Haskell's performance? How can I know ahead of time if it will or won't?
This is just an example of the sort of thing I'm worried about. I'd like to hear any thoughts about overcoming the difficulty inherent in reasoning about the performance of a lazily-executed, functional programming language.
Summary: I'm accepting 3lectrologos's answer, because the point that you don't reason so much about the language's performance, as about your compiler's optimization, seems to be extremely important in Haskell - more so than in any other language I'm familiar with. I'm inclined to say that the importance of the compiler is the factor that differentiates reasoning about performance in lazy, functional langauges, from reasoning about the performance of any other type.
Addendum: Anyone happening on this question may want to look at the slides from Johan Tibell's talk about high performance Haskell.
In your particular Fibonacci example, it's not very hard to see why the second one should run faster (although you haven't specified what f2 is).
It's mainly an algorithmic issue:
fib1 implements the purely recursive algorithm and (as far as I know) Haskell has no mechanism for "implicit memoization".
fib2 uses explicit memoization (using the fibArr list to store previously computed values.
In general, it's much harder to make performance assumptions for a lazy language like Haskell, than for an eager one. Nevertheless, if you understand the underlying mechanisms (especially for laziness) and gather some experience, you will be able to make some "predictions" about performance.
Referential transparency increases (potentially) performance in (at least) two ways:
First, you (as a programmer) can be sure that two calls to the same function will always return the same, so you can exploit this in various cases to benefit in performance.
Second (and more important), the Haskell compiler can be sure for the above fact and this may enable many optimizations that can't be enabled in impure languages (if you've ever written a compiler or have any experience in compiler optimizations you are probably aware of the importance of this).
If you want to read more about the reasoning behind the design choices (laziness, pureness) of Haskell, I'd suggest reading this.
Reasoning about performance is generally hard in Haskell and lazy languages in general, although not impossible. Some techniques are covered in Chris Okasaki's Purely Function Data Structures (also available online in a previous version).
Another way to ensure performance is to fix the evaluation order, either using annotations or continuation passing style. That way you get to control when things are evaluated.
In your example you might calculate the numbers "bottom up" and pass the previous two numbers along to each iteration:
fib n = fib_iter(1,1,n)
where
fib_iter(a,b,0) = a
fib_iter(a,b,1) = a
fib_iter(a,b,n) = fib_iter(a+b,a,n-1)
This results in a linear time algorithm.
Whenever you have a dynamic programming algorithm where each result relies on the N previous results, you can use this technique. Otherwise you might have to use an array or something completely different.
Your implementation of fib2 uses memoization but each time you call fib2 it rebuild the "whole" result. Turn on ghci time and size profiling:
Prelude> :set +s
If it was doing memoisation "between" calls the subsequent calls would be faster and use no memory. Call fib2 20000 twice and see for yourself.
By comparison a more idiomatic version where you define the exact mathematical identity:
-- the infinite list of all fibs numbers.
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
memoFib n = fibs !! n
actually do use memoisation, explicit as you see. If you run memoFib 20000 twice you'll see the time and space taken the first time then the second call is instantaneous and take no memory. No magic and no implicit memoization like a comment might have hinted at.
Now about your original question: optimizing and reasoning about performance in Haskell...
I wouldn't call myself an expert in Haskell, I have only been using it for 3 years, 2 of which at my workplace but I did have to optimize and get to understand how to reason somewhat about its performance.
As mentionned in other post laziness is your friend and can help you gain performance however YOU have to be in control of what is lazily evaluated and what is strictly evaluated.
Check this comparison of foldl vs foldr
foldl actually stores "how" to compute the value i.e. it is lazy. In some case you saves time and space beeing lazy, like the "infinite" fibs. The infinite "fibs" doesn't generate all of them but knows how. When you know you will need the value you might as well just get it "strictly" speaking... That's where strictness annotation are usefull, to give you back control.
I recall reading many times that in lisp you have to "minimize" consing.
Understanding what is stricly evaluated and how to force it is important but so is understanding how much "trashing" you do to the memory. Remember Haskell is immutable, that means that updating a "variable" is actually creating a copy with the modification. Prepending with (:) is vastly more efficient than appending with (++) because (:) does not copy memory contrarily to (++). Whenever a big atomic block is updated (even for a single char) the whole block needs to be copied to represent the "updated" version. The way you structure data and update it can have a big impact on performance. The ghc profiler is your friend and will help you spot these. Sure the garbage collector is fast but not having it do anything is faster!
Cheers
Aside from the memoization issue, fib1 also uses non-tailcall recursion. Tailcall recursion can be re-factored automatically into a simple goto and perform very well, but the recursion in fib1 cannot be optimized in this way, because you need the stack frame from each instance of fib1 in order to calculate the result. If you rewrote fib1 to pass a running total as an argument, thus allowing a tail call instead of needing to keep the stack frame for the final addition, the performance would improve immensely. But not as much as the memoized example, of course :)
Since allocation is a major cost in any functional language, an important part of understanding performance is to understand when objects are allocated, how long they live, when they die, and when they are reclaimed. To get this information you need a heap profiler. It's an essential tool, and luckily GHC ships with a good one.
For more information, read Colin Runciman's papers.

How long does it take for you to be comfortable with Haskell?

I'm an OK C/C++ programmer. I find Haskell very intriguing. But it seems to me, that although it's relatively easy to write clean Haskell code, as it mimics math (which I'm very comfortable with) pretty well, it's very hard to write clean code in Haskell that runs fast.
A faster version of quicksort of Haskell is very long and scary, which has no resemblance to the naive but short (two lines), clean and intuitive implementation. The long and scary version of Haskell is actually still much slower than the shorter and simpler C counter part.
Is it because the current Haskell compiler is too dumb or is it just impossible for mortals (other than SJP of course) to write fast Haskell code?
You ask two different questions: learning and performance.
It took me about a month to become comfortable with functional programming using recursion, pattern matching, map, filter, and fold. I did all that with ML but it translated to Haskell very easily.
It took me two or three years to wrap my head around monads, but that's because I read the wrong stuff. I think there are better tutorials now. But if you're beginning, avoid monads for a while.
It took me several months to get good at creating new type classes, but using the existing ones was easy.
I'm still not sure I have the hang of lazy evaluation. But I love Haskell's purity and tend to treat lazy evaluation as an unhappy accident that only a few people (like John Hughes) know how to exploit.
You've observed a performance problem only because you've adapted an algorithm loaded with mutation, which Tony Hoare designed for imperative languages, and tried to translate into Haskell. In Haskell as in any other functional language the expensive operation is allocation. Try writing a merge sort and you'll find it's simple and performs very well.
How do you avoid making similar mistakes in the future? Have a look at Chris Okasaki's book Purely Functional Data Structures. Great book, and it will help you learn the 'functional way of doing things' without giving up performance.
There is a very specific reason why quicksort ain't so quick in Haskell. It is a God-like example of an algorithm that has brilliant hackery woven into how it works - what I mean by hackery in this case is the kind of techniques that a true Haskell devotee would regard as unnecessarily dangerous and non-mathematical. The original implementation made every effort to break the rules that Haskell imposes on itself: the genuine quicksort works by overwriting storage slots with new information. This is very painful to do in Haskell, which finds it much easier to make whole new copies of existing information.
So although that naive two line Haskell version captures something of the essence of quicksort (it does the same number of key comparisons), it isn't really quicksort. It's missing a large portion of the genius that went into it, which took full advantage of the ability to tweak the state of existing values. So it makes large numbers of intermediate copies of pieces of the list.
Speculation: could a Haskell compiler analyze your code, applying the same reasoning as Hoare (inventor of quicksort) and figure out that it could optimize it by completely reimplementing it in a stateful way? Possibly.
The point is not to write fast Haskell code, but to write Haskell code fast. When you get there and you need to make your code fast(er), start optimizing (or use the FFI, you don't have to forget your C++ skills). In Haskell you are looking for elegance, reliability and maintainability first. I'd add profiling to my Haskell-fu, so you don't waste time optimizing that which is not used. And remember not to optimize prematurely.
So, to answer the question in the title, you will likely feel fairly at home with the basics of Haskell within a short time. Especially if you're already familiar with functional programming. The things that really make Haskell stand out, such as laziness, type classes, type families, and of course the dreaded monads (and arrows) are likely to require more time to understand and get used to. There are good resources for learning the language, many freely available, together with a helpful community, so I'd say you'd likely be well on your way to feeling comfortable within a week or two of semi-serious study ;-)
I think it's worth it though, just like some people argue that it's worth learning Lisp even if you'll never actually use it for anything. It's worth it because it makes you a better programmer--it makes you think differently. I'd argue Haskell has a similar effect.
The reason is that mathematical foundation of computers is different from mathematical foundation of Haskell and functional languages in general.
To get taste of problems compiler is facing... Translate Haskell program to C, keeping it as close to original as possible, then try to optimize that C code (without rewriting it from scratch in C-way)
It's not dumb, but functional languages are not made for performance, and concise mathematically simple notation doesn't come for free.

Resources