Haskell: tail-recursive function is slower? [duplicate] - performance

I discovered the "time" command in unix today and thought I'd use it to check the difference in runtimes between tail-recursive and normal recursive functions in Haskell.
I wrote the following functions:
--tail recursive
fac :: (Integral a) => a -> a
fac x = fac' x 1 where
    fac' 1 y = y
    fac' x y = fac' (x-1) (x*y)
--normal recursive
facSlow :: (Integral a) => a -> a
facSlow 1 = 1
facSlow x = x * facSlow (x-1)
These are valid only for this little experiment, so I didn't bother to check for zeroes or negative numbers.
However, upon writing a main method for each, compiling them, and running them with the "time" command, both had similar runtimes, with the normal recursive function edging out the tail-recursive one. This is contrary to what I'd heard about tail-call optimization in Lisp. What's the reason for this?

Haskell uses lazy evaluation to implement recursion, so it treats anything as a promise to provide a value when needed (this is called a thunk). Thunks get reduced only as much as necessary to proceed, no more. This resembles the way you simplify an expression mathematically, so it's helpful to think of it that way. The fact that evaluation order is not specified by your code allows the compiler to do lots of even cleverer optimisations than just the tail-call elimination you're used to. Compile with -O2 if you want optimisation!
Let's see how we evaluate facSlow 5 as a case study:
facSlow 5
5 * facSlow 4 -- Note that the `5-1` only got evaluated to 4
5 * (4 * facSlow 3) -- because it has to be checked against 1 to see
5 * (4 * (3 * facSlow 2)) -- which definition of `facSlow` to apply.
5 * (4 * (3 * (2 * facSlow 1)))
5 * (4 * (3 * (2 * 1)))
5 * (4 * (3 * 2))
5 * (4 * 6)
5 * 24
120
So just as you worried, we have a build-up of numbers before any calculations happen, but unlike what you worried about, there's no stack of facSlow function calls hanging around waiting to terminate - each reduction is applied and goes away, leaving a stack frame in its wake (that's because (*) is strict and so triggers the evaluation of its second argument).
Haskell's recursive functions aren't evaluated in a very recursive way! The only stack of calls hanging around are the multiplications themselves. If (*) is viewed as a strict data constructor, this is what's known as guarded recursion (although it is usually referred to as such with non-strict data constructors, where what's left in its wake are the data constructors - when forced by further access).
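For instance, here is a minimal sketch of guarded recursion with the lazy (:) constructor (myMap is just an illustrative name for the Prelude's map):

myMap :: (a -> b) -> [a] -> [b]
myMap _ []     = []
myMap f (x:xs) = f x : myMap f xs   -- recursive call guarded by (:)

Because the recursive call sits behind the lazy (:), each cons cell can be consumed before the next one is produced; take 3 (myMap (+1) [1..]) finishes despite the infinite input.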
Now let's look at the tail-recursive fac 5:
fac 5
fac' 5 1
fac' 4 {5*1} -- Note that the `5-1` only got evaluated to 4
fac' 3 {4*{5*1}} -- because it has to be checked against 1 to see
fac' 2 {3*{4*{5*1}}} -- which definition of `fac'` to apply.
fac' 1 {2*{3*{4*{5*1}}}}
{2*{3*{4*{5*1}}}} -- the thunk "{...}"
(2*{3*{4*{5*1}}}) -- is retraced
(2*(3*{4*{5*1}})) -- to create
(2*(3*(4*{5*1}))) -- the computation
(2*(3*(4*(5*1)))) -- on the stack
(2*(3*(4*5)))
(2*(3*20))
(2*60)
120
So you can see how the tail recursion by itself hasn't saved you any time or space. Not only does it take more steps overall than facSlow 5, it also builds a nested thunk (shown here as {...}) - needing extra space for it - which describes the future computation, the nested multiplications to be performed.
This thunk is then unraveled by traversing it to the bottom, recreating the computation on the stack. There is also a danger here of causing stack overflow with very long computations, for both versions.
If we want to hand-optimise this, all we need to do is make it strict. You could use the strict application operator $! to define
facSlim :: (Integral a) => a -> a
facSlim x = facS' x 1 where
    facS' 1 y = y
    facS' x y = facS' (x-1) $! (x*y)
This forces facS' to be strict in its second argument. (It's already strict in its first argument because that has to be evaluated to decide which definition of facS' to apply.)
Sometimes strictness can help enormously, sometimes it's a big mistake because laziness is more efficient. Here it's a good idea:
facSlim 5
facS' 5 1
facS' 4 5
facS' 3 20
facS' 2 60
facS' 1 120
120
Which is what you wanted to achieve I think.
Summary
If you want to optimise your code, step one is to compile with -O2
Tail recursion is only good when there's no thunk build-up; adding strictness usually helps to prevent it, if and where appropriate. This suits the case where the result you're building is needed all at once, later on.
Sometimes tail recursion is a bad plan and guarded recursion is a better fit, i.e. when the result you're building will be needed bit by bit, in portions. See this question about foldr and foldl for example, and test them against each other.
Try these two:
length $ foldl1 (++) $ replicate 1000 "The size of intermediate expressions is more important than tail recursion."
length $ foldr1 (++) $ replicate 1000 "The number of reductions performed is more important than tail recursion!!!"
foldl1 is tail recursive, whereas foldr1 performs guarded recursion so that the first item is immediately presented for further processing/access. (The first "parenthesizes" to the left at once, (...((s+s)+s)+...)+s, forcing its input list fully to its end and building a big thunk of future computation much sooner than its full results are needed; the second parenthesizes to the right gradually, s+(s+(...+(s+s)...)), consuming the input list bit by bit, so the whole thing is able to operate in constant space, with optimizations).
You might need to adjust the number of zeros depending on what hardware you're using.
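For reference, the two folding shapes being compared look like this (a sketch mirroring the Prelude definitions):

import Prelude hiding (foldl, foldr)

-- Tail recursive: returns nothing until the whole list is consumed, and
-- without strictness the accumulator z grows into a nested thunk.
foldl :: (b -> a -> b) -> b -> [a] -> b
foldl _ z []     = z
foldl f z (x:xs) = foldl f (f z x) xs

-- Guarded: when f is lazy in its second argument, the start of the result
-- is available before the rest of the list is even looked at.
foldr :: (a -> b -> b) -> b -> [a] -> b
foldr _ z []     = z
foldr f z (x:xs) = f x (foldr f z xs)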

It should be mentioned that the fac function is not a good candidate for guarded recursion. Tail recursion is the way to go here. Due to laziness you are not getting the effect of TCO in your fac' function because the accumulator arguments keep building large thunks, which when evaluated will require a huge stack. To prevent this and get the desired effect of TCO you need to make these accumulator arguments strict.
{-# LANGUAGE BangPatterns #-}
fac :: (Integral a) => a -> a
fac x = fac' x 1 where
    fac' 1 y = y
    fac' x !y = fac' (x-1) (x*y)
If you compile using -O2 (or just -O) GHC will probably do this on its own in the strictness analysis phase.

You should check out the wiki article on tail recursion in Haskell. In particular, because of how expressions are evaluated, the kind of recursion you want is guarded recursion. If you work out the details of what's going on under the hood (in the abstract machine for Haskell), you get the same kind of thing as with tail recursion in strict languages. Along with this, you get a uniform syntax for lazy functions (tail recursion will tie you to strict evaluation, whereas guarded recursion works more naturally).
(And in learning Haskell, the rest of those wiki pages are awesome, too!)

If I recall correctly, GHC automatically optimizes plain recursive functions into tail-recursive optimized ones.

Related

Do Haskell’s strict folds really use linear space?

I thought I understood the basics of fold performance in Haskell, as described in foldr, foldl, foldl' on the Haskell Wiki and many other places. In particular, I learned that for accumulating functions, one should use foldl', to avoid space leaks, and that the standard library functions are written to respect this. So I presumed that simple accumulators like length, applied to simple lists like replicate n 1, should require constant space (or at least sub-linear) in the length of the list. My intuition was that on sufficiently simple lists, they would behave roughly like a for loop in an imperative language.
But today I found that this seems not to hold in practice. For instance, length $ replicate n 1 seems to use space linear in n. In ghci:
ghci> :set +s
ghci> length $ replicate (10^6) 1
1000000
(0.02 secs, 56,077,464 bytes)
ghci> length $ replicate (10^7) 1
10000000
(0.08 secs, 560,078,360 bytes)
ghci> length $ replicate (10^8) 1
100000000
(0.61 secs, 5,600,079,312 bytes)
ghci> length $ replicate (10^9) 1
1000000000
(5.88 secs, 56,000,080,192 bytes)
Briefly, my question is: Do length and other strict folds really use linear space? If so, why? And is it inevitable? Below are more details of how I’ve played around trying to understand this, but they’re probably not worth reading — the tl;dr is that the linear-space usage seems to persist whatever variations I try.
(I originally used sum as the example function. As Willem Van Onsem points out, that was a badly-chosen example as default instances aren’t actually strict. However, the main question remains, since as noted below, this occurs with plenty of other functions that really are based on strict folds.)
Replacing length with foldl' (\n _ -> n+1) 0 appears to make performance worse by a constant factor; space usage still seems to be linear.
Versions defined with foldl and foldr had worse memory usage (as expected), but only by a small constant factor, not asymptotically worse (as most discussions seem to suggest).
Replacing length with sum, last, or other simple accumulators, or with the obvious definitions of these using foldl', also doesn’t seem to change the linear space usage.
Using [1..n] as the test list, and other similar variations, also seems to make no significant difference.
Switching between the general versions of sum, foldl', etc from Data.Foldable, the specialised ones in Data.List, and local versions defined directly by pattern-matching, also seems to make no difference.
Compiling instead of working in ghci also only seemed to improve space usage by a constant factor.
Switching between several recent versions of GHC — 8.8.4, 8.10.5, and 9.0.1 — also seemed to make no significant difference.
"Do they use linear space" is a slightly unclear question. Usually when we talk about the space an algorithm uses, we're talking about its working set: the maximum amount of memory it needs all at once. "If my computer only had X bytes of memory, could I run this program?" But that's not what GHCI's :set +s measures. It measures the sum of all memory allocations made, including those that were cleaned up partway through. And what is the biggest use of memory in your experiment? The list itself, of course.
So you've really just measured the number of bytes that a list of size N takes up. You can confirm this by using last instead of length, which I hope you'll agree allocates no intermediate results, and is strict. It takes the same amount of memory using your metric as length does - length does no extra allocation for the sums.
But a bigger problem is that GHCi is not an optimizing compiler. If you care about performance characteristics at all, GHCi is the wrong tool. Instead, use GHC with -O2, and turn on GHC's profiler.
import System.Environment (getArgs)
main = do
  n <- read . head <$> getArgs
  print $ length (replicate (10^n) 1)
And running it:
$ ghc -O2 -prof -fprof-auto stackoverflow.hs
$ ./stackoverflow 6 +RTS -p
1000000
$ grep "total alloc" stackoverflow.prof
total alloc = 54,856 bytes (excludes profiling overheads)
$ ./stackoverflow 9 +RTS -p
1000000000
$ grep "total alloc" stackoverflow.prof
total alloc = 55,008 bytes (excludes profiling overheads)
we can see that space usage is roughly constant despite a thousand-fold increase in input size.
Will Ness correctly points out in a comment that -s would be a better measuring tool than -p.
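For example (a sketch; the flags are standard GHC/RTS ones, and exact figures will vary by machine and GHC version):

$ ghc -O2 -rtsopts stackoverflow.hs
$ ./stackoverflow 9 +RTS -s

In the -s summary, the "maximum residency" line reports the peak live heap, which is the working-set measure the question is really after, while "bytes allocated in the heap" is the cumulative total that :set +s was showing.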
Replacing sum with foldl' (+) 0 here improves performance noticeably in both time and space (which is itself a surprise; shouldn't the standard sum be at least as efficient?), but only by a constant factor; space usage still seems to be linear.
sum is implemented as [src]:
sum :: Num a => t a -> a
sum = getSum #. foldMap Sum
It thus makes use of the Sum data type and its Monoid instance, for which mappend = (+) and mempty = 0. foldMap folds right-associatively; indeed, the documentation says:
Map each element of the structure into a monoid, and combine the results with (<>). This fold is right-associative and lazy in the accumulator. For strict left-associative folds consider foldMap' instead.
foldMap is thus implemented with foldr [src]:
foldMap :: Monoid m => (a -> m) -> t a -> m
{-# INLINE foldMap #-}
-- This INLINE allows more list functions to fuse. See #9848.
foldMap f = foldr (mappend . f) mempty
While foldl' will indeed have a (much) smaller memory footprint, and will likely be more efficient, a reason to work with foldr is that, for Peano numbers for example, one can make use of laziness: the head normal form will look like S(…), where … might not be evaluated (yet).
foldr can also terminate earlier. If for example you make a sum for a certain algebraic structure, it is possible that we can terminate the looping earlier.
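A sketch of both points with Peano numbers (the names here are made up for illustration):

data Peano = Z | S Peano

-- add yields its outermost S before recursing, so a right fold of adds
-- reaches head normal form S (...) while the rest is still unevaluated.
add :: Peano -> Peano -> Peano
add Z     m = m
add (S n) m = S (add n m)

atLeastTwo :: Peano -> Bool
atLeastTwo (S (S _)) = True
atLeastTwo _         = False

-- atLeastTwo (foldr add Z (repeat (S Z))) returns True after inspecting
-- just two constructors, even though the list of summands is infinite.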

OCaml: Stack_overflow exception in pervasives.ml

I got a Stack_overflow error in my OCaml program lately. If I turn on backtracing, I see the exception is raised by a "primitive operation" in "pervasives.ml", line 270. I went into the OCaml source code and saw that line 270 defines the function @ (i.e. list append). I don't get any other information from the backtrace, not even where the exception gets thrown in my program. I switched to bytecode and tried ocamldebug, and it doesn't help (no backtrace generated).
I thought this was an extremely weird situation. The only places in my program where I use a list are (a) building a list containing integers 1 to 1000000, (b) in-order traversing an RBT and putting the result into a list, and (c) printing a list of integers ostensibly containing 1000000 numbers. I've tested all the functions and none of them could contain an infinite loop, and I thought 1000000 isn't even a huge number. Moreover, I've tried the equivalent of my program in Haskell (GHC), Scala and SML (MLton), and all of those versions worked perfectly and in a reasonably short amount of time. So, the question is, what could be going on? Can I debug it?
The @ operator is not tail-recursive in the OCaml standard library:
let rec ( @ ) l1 l2 =
  match l1 with
    [] -> l2
  | hd :: tl -> hd :: (tl @ l2)
Thus calling it with large lists (as the left argument) will overflow your stack.
It's possible that you're building your list by appending a new element to the end of the already generated list, e.g.,
let rec init n x = if n > 0 then init (n-1) x @ [x] else []
This has O(n^2) time complexity and will consume n slots of stack space.
Concerning the general question - how to debug such stack overflows, my usual recipe is to reduce the stack size, so that the problem is triggered as soon as possible before the trace is bloated, e.g.,
OCAMLRUNPARAM=b,l=1024 ocaml ./test.ml
If you're compiling your OCaml code to the native code, then you need to pass the -g option to the compiler, so that it can produce backtraces. Also, in the native execution, the size of the stack is controlled by the operating system and should be set using the corresponding mechanism of your OS, for example with ulimit in GNU/Linux, e.g., ulimit -s 1024.
As a bonus track, the following init function is tail recursive and will have O(N) time complexity and will take O(1) stack space:
let init n x =
  let rec loop n xs =
    if n = 0 then xs else loop (n-1) (x :: xs) in
  loop n []
The idea is to use an accumulator list and build the list in the heap space.
If you don't like thinking about tail-recursiveness then you can use Jane Street's Base library (or Core), or the Batteries library. They both provide tail-recursive versions of the init function, as well as guarantees that their other functions are tail-recursive. For example:
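(A sketch using Base, assuming the library is installed; List.init takes the length and a labelled function giving the element at each index.)

open Base

(* Base's List.init is tail-recursive, so building a big list is safe. *)
let xs = List.init 1_000_000 ~f:(fun i -> i)
let () = Stdlib.Printf.printf "length = %d\n" (List.length xs)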
List functions in the standard library are optimised for small lists and are not necessarily tail-recursive, with the partial justification that lists are not an efficient data structure for storing large amounts of data (note that Haskell lists are lazy and thus quite different from OCaml's eager lists).
In particular, if you get a Stack_overflow error using @, you are quite probably implementing an algorithm with quadratic time complexity, due to the fact that @'s complexity is linear in the size of its left argument.
There are probably far better data structures than lists for your problem; if you want iteration, the sequence library or any other form of iterator would be far more efficient, for instance.
With all the caveats stated before, it is relatively straightforward to redefine a tail-recursive (but less efficient) version of the standard library function, e.g.:
let ( @ ) x y = List.rev_append (List.rev x) y
Another option is to use the containers library or any of the extended standard libraries (batteries or base essentially): all of those libraries reimplement tail-recursive version of list functions.

Lazy Evaluation: Why is it faster, advantages vs disadvantages, mechanics (why it uses less cpu; examples?) and simple proof of concept examples [closed]

Lazy evaluation is said to be a way of delaying a process until the first time it is needed. This tends to avoid repeated evaluations, and that's why I would imagine it performs a lot faster.
Functional languages like Haskell (and JavaScript..?) have this functionality built in.
However, I don't understand how and why other 'normal' approaches (that is, the same functionality but not using lazy evaluation) are slower. How and why do these other approaches perform repeated evaluations? Can someone elaborate on this by giving simple examples and explaining the mechanics of each approach?
Also, according to Wikipedia page about lazy evaluation these are said to be the advantages of this approach:
Performance increases by avoiding needless calculations, and avoiding error conditions in evaluating compound expressions
The ability to construct potentially infinite data structures
The ability to define control flow (structures) as abstractions instead of primitives
However, can we just control the calculations needed and avoid repeating the same ones? (1)
We can use e.g. a linked list to create an infinite data structure (2)
Can we do (3) already? We can define classes/templates/objects and use those instead of primitives (e.g. in JavaScript).
Additionally, it seems to me that (at least from the cases I have seen) lazy evaluation goes hand in hand with recursion and with using the 'head' and 'tail' (along with other) notions. Surely, there are cases where recursion is useful, but is lazy evaluation something more than that...? More than a recursive approach to solving a problem..? Streamjs is a JavaScript library that uses recursion along with some other simple operations (head, tail, etc.) to perform lazy evaluation.
It seems I can't get my head around it...
Thanks in advance for any contribution.
I'll show examples in both Python 2.7 and Haskell.
Say, for example, you wanted to do a really inefficient sum of all the numbers from 0 to 10,000,000. You could do this with a for loop in Python as
total = 0
for i in range(10000000):
    total += i
print total
On my computer, this takes about 1.3s to execute. If instead, I changed range to xrange (the generator form of range, lazily produces a sequence of numbers), it takes 1.2s, only slightly faster. However, if I check the memory used (using the memory_profiler package), the version with range uses about 155MB of RAM, while the xrange version uses only 1MB of RAM (both numbers not including the ~11MB Python uses). This is an incredibly dramatic difference, and we can see where it comes from with this tool as well:
Mem usage Increment Line Contents
===========================================
10.875 MiB 0.004 MiB total = 0
165.926 MiB 155.051 MiB for i in range(10000000):
165.926 MiB 0.000 MiB total += i
return total
This says that before we started we were using 10.875MB, total = 0 added 0.004MB, and then for i in range(10000000): added 155.051MB when it generated the entire list of numbers [0..9999999]. If we compare to the xrange version:
Mem usage Increment Line Contents
===========================================
11.000 MiB 0.004 MiB total = 0
11.109 MiB 0.109 MiB for i in xrange(10000000):
11.109 MiB 0.000 MiB total += i
return total
So we started with 11MB and for i in xrange(10000000): added only 0.109MB. This is a huge memory savings by only adding a single letter to the code. While this example is fairly contrived, it shows how not computing a whole list until the element is needed can make things a lot more memory efficient.
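For reference, the lazy version of the loop differs by a single word (Python 2.7, matching the snippet above):

total = 0
for i in xrange(10000000):  # xrange yields one number at a time, lazily
    total += i
print total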
Python has iterators and generators which act as a sort of "lazy" programming for when you need to yield sequences of data (although there's nothing stopping you from using them for single values), but Haskell has laziness built into every value in the language, even user-defined ones. This lets you take advantage of things like data structures that won't fit in memory without having to program complicated ways around that fact. The canonical example would be the fibonacci sequence:
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
which very elegantly expresses this famous sequence to define a recursive infinite list generating all fibonacci numbers. It's CPU efficient because all values are cached, so each element only has to be computed once (compared to a naive recursive implementation) [1], but if you calculate too many elements your computer will eventually run out of RAM, because you're now storing this huge list of numbers. This is an example where lazy programming lets you have CPU efficiency, but not RAM efficiency. There is a way around this, though. If you were to write
fib :: Int -> Integer
fib n = let fibs = 1 : 1 : zipWith (+) fibs (tail fibs) in fibs !! n
then this runs in near-constant memory, and does so very quickly, but memoization is lost as subsequent calls to fib have to recompute fibs.
A more complex example can be found here, where the author shows how to use lazy programming and recursion in Haskell to perform dynamic programming with arrays, a feat that most initially think is very difficult and requires mutation, but Haskell manages to do very easily with "tying the knot" style recursion. It results in both CPU and RAM efficiency, and does so in fewer lines than I'd expect in C/C++.
All this being said, there are plenty of cases where lazy programming is annoying. Often you can build up huge numbers of thunks instead of computing things as you go (I'm looking at you, foldl), and some strictness has to be introduced to attain efficiency. It also bites a lot of people with IO, when you read a file to a string as a thunk, close the file, and then try to operate on that string. It's only after the file is closed that the thunk gets evaluated, causing an IO error to occur and crashes your program. As with anything, lazy programming is not without its flaws, gotchas, and pitfalls. It takes time to learn how to work with it well, and to know what its limitations are.
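The file-reading gotcha looks roughly like this (a sketch; "input.txt" is a stand-in name):

import System.IO

-- Pitfall: hGetContents returns a thunk, and withFile closes the handle
-- before anything forces it, so the contents come back truncated.
bad :: IO String
bad = withFile "input.txt" ReadMode hGetContents

-- Fix: force the contents while the handle is still open.
good :: IO String
good = withFile "input.txt" ReadMode $ \h -> do
  s <- hGetContents h
  length s `seq` return s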
1) By "naive recursive implementation", I mean implementing the fibonacci sequence as
fib :: Integer -> Integer
fib 0 = 1
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
With this implementation, you can see the mathematical definition very clearly, it's very much in the style of inductive proofs, you show your base cases and then the general case. However, if I call fib 5, this will "expand" into something like
fib 5 = fib 4 + fib 3
= fib 3 + fib 2 + fib 2 + fib 1
= fib 2 + fib 1 + fib 1 + fib 0 + fib 1 + fib 0 + fib 1
= fib 1 + fib 0 + fib 1 + fib 1 + fib 0 + fib 1 + fib 0 + fib 1
= 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1
= 8
When instead we'd like to share some of those computations, that way fib 3 only gets computed once, fib 2 only gets computed once, etc.
By using a recursively defined list in Haskell, we can avoid this. Internally, this list is represented something like this:
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
     = 1 : 1 : zipWith (+) (f1:f2:fs) (f2:fs)   -- f1, f2, fs point back into fibs itself
     = 1 : 1 : 2 : zipWith (+) (f2:f3:fs) (f3:fs)
     = 1 : 1 : 2 : 3 : zipWith (+) (f3:f4:fs) (f4:fs)
So hopefully you can see the pattern forming here: as the list is built, it keeps pointers back to the last two elements generated in order to compute the next element. This means that for the nth element computed, there are n-2 additions performed. Even for the naive fib 5, you can see that there are more additions performed than that, and the number of additions will continue to grow exponentially. This definition is made possible through laziness and recursion, letting us turn an O(2^n) algorithm into an O(n) algorithm, but we have to give up RAM to do so. If this is defined at the top level, then values are cached for the lifetime of the program. It does mean that if you need to refer to the 1000th element repeatedly, you don't have to recompute it, just index it.
On the other hand, the definition
fib :: Int -> Integer
fib n =
let fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
in fibs !! n
uses a local copy of fibs every time fib is called. We don't get caching between calls to fib, but we do get local caching, leaving our complexity O(n). Additionally, GHC is smart enough to know that we don't have to keep the beginning of the list around after we've used it to calculate the next element, so as we traverse fibs looking for the nth element, it only needs to hold on to 2-3 elements and a thunk pointing at the next element. This saves us RAM while computing it, and since it isn't defined at a global level it doesn't eat up RAM over the lifetime of the program. It's a tradeoff between when we want to spend RAM and CPU cycles, and different approaches are better for different situations. These techniques are applicable to much of Haskell programming in general, not just for this sequence!
Lazy evaluation is not, in general, faster. When it's said that lazy evaluation is more efficient, it is because when you consider Lambda Calculus (which is essentially what your Haskell programs are once the compiler finishes de-sugaring them) as a system of terms and reduction rules, applying those rules in the order specified by a call-by-name-with-sharing evaluation policy always performs the same number of reductions or fewer than when you follow the rules in the order specified by call-by-value evaluation.
The reason that this theoretical result does not make lazy evaluation faster in general is that the translation to a linear sequential machine model with a memory access bottleneck tends to make all the reductions performed much more expensive! Initial attempts at implementing this model on computers led to programs that executed orders of magnitude more slowly than typical eagerly-evaluating language implementations. It has taken a lot of research and engineering into techniques for implementing lazy evaluation efficiently to get Haskell performance to where it is today. And the fastest Haskell programs take advantage of a form of static analysis called "strictness analysis" which attempts to determine at compile time which expressions will always be needed so that they can be evaluated eagerly rather than lazily.
There are still some cases where straightforward implementations of algorithms will execute faster in Haskell due to only evaluating terms that are needed for the result, but even eager languages always have some facility for evaluating some expressions by need. Conditionals and short-circuiting boolean expressions are ubiquitous examples, and in many eager languages, one can also delay evaluation by wrapping an expression in an anonymous function or some other sort of delaying form. So you can typically use these mechanisms (or even more awkward rewrites) to avoid evaluating expensive things that won't be necessary in an eager language.
The real advantage of Haskell's lazy evaluation is not a performance-related one. Haskell makes it easier to pull expressions apart, re-combine them in different ways, and generally reason about code as if it were a system of mathematical equations instead of being a sequentially-evaluated set of machine instructions. By not specifying any evaluation order, it forced the developers of the language to avoid side-effects that rely on a simple evaluation ordering, such as mutation or IO. This in turn led to a host of elegant abstractions that are generally useful and might not have been developed into usability otherwise.
The state of Haskell is now such that you can write high-level, elegant algorithms that make better re-use of existing higher-order functions and data structures than in nearly any other high-level typed language. And once you become familiar with the costs and benefits of lazy evaluation and how to control when it occurs, you can ensure that the elegant code also performs very well. But getting the elegant code to a state of high performance is not necessarily automatic and may require a bit more thought than in a similar but eagerly-evaluated language.
The concept of "lazy evaluation" is only about 1 thing, and only about that 1 thing:
The ability to postpone evaluation of something until needed
That's it.
Everything else in that wikipedia article follows from it.
Infinite data structures? Not a problem. We'll just make sure we don't actually figure out what the next element is until you actually ask for it. For instance, a sequence where each next value is just the previous one increased by 1 is infinite. If you create a list containing all those values, it's going to fill your computer's available memory. If you only figure out what the next value is when asked, not so much.
Needless calculations? Sure. You can return an object containing a lot of properties that, when asked, will provide you with some value. If you don't ask (i.e. never inspect the value of a given property), the calculation necessary to figure out the value of that property will never be done.
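In Haskell this comes for free with every field (a sketch; Report and its fields are made-up names):

data Report = Report
  { rowCount  :: Int
  , expensive :: Integer   -- stored as a thunk until someone looks at it
  }

mkReport :: Int -> Report
mkReport n = Report
  { rowCount  = n
  , expensive = product [1 .. toInteger n]   -- never computed if never inspected
  }

rowCount (mkReport 100000) returns immediately; the big product is only paid for by callers that actually inspect expensive.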
Control flow ... ? Not at all sure what that is about.
The purpose of lazy evaluation of something is exactly as I stated to begin with, to avoid evaluating something until you actually need it. Be it the next value of something, the value of a property, whatever, adding support for lazy evaluation might conserve CPU cycles.
What would the alternative be?
I want to return an object to the calling code, containing any number of properties, some of which might be expensive to calculate. Without lazy evaluation, I would have to calculate the values of all those properties either:
Before constructing the object
After constructing the object, on the first time you inspected a property
After constructing the object, every time you inspected that property
With lazy evaluation you usually end up with number 2. You postpone evaluating the value of that property until some code inspects it. Note that you might cache the value once evaluated, which would save CPU cycles when inspecting the same property more than once, but that is caching; not quite the same thing, though in the same line of work: optimization.

Haskell: Caches, memoization, and referential transparency [duplicate]

I can't figure out why m1 is apparently memoized while m2 is not in the following:
m1 = ((filter odd [1..]) !!)
m2 n = ((filter odd [1..]) !! n)
m1 10000000 takes about 1.5 seconds on the first call, and a fraction of that on subsequent calls (presumably it caches the list), whereas m2 10000000 always takes the same amount of time (rebuilding the list with each call). Any idea what's going on? Are there any rules of thumb as to if and when GHC will memoize a function? Thanks.
GHC does not memoize functions.
It does, however, compute any given expression in the code at most once per time that its surrounding lambda-expression is entered, or at most once ever if it is at top level. Determining where the lambda-expressions are can be a little tricky when you use syntactic sugar like in your example, so let's convert these to equivalent desugared syntax:
m1' = (!!) (filter odd [1..]) -- NB: See below!
m2' = \n -> (!!) (filter odd [1..]) n
(Note: The Haskell 98 report actually describes a left operator section like (a %) as equivalent to \b -> (%) a b, but GHC desugars it to (%) a. These are technically different because they can be distinguished by seq. I think I might have submitted a GHC Trac ticket about this.)
Given this, you can see that in m1', the expression filter odd [1..] is not contained in any lambda-expression, so it will only be computed once per run of your program, while in m2', filter odd [1..] will be computed each time the lambda-expression is entered, i.e., on each call of m2'. That explains the difference in timing you are seeing.
Actually, some versions of GHC, with certain optimization options, will share more values than the above description indicates. This can be problematic in some situations. For example, consider the function
f = \x -> let y = [1..30000000] in foldl' (+) 0 (y ++ [x])
GHC might notice that y does not depend on x and rewrite the function to
f = let y = [1..30000000] in \x -> foldl' (+) 0 (y ++ [x])
In this case, the new version is much less efficient because it will have to read about 1 GB from memory where y is stored, while the original version would run in constant space and fit in the processor's cache. In fact, under GHC 6.12.1, the function f is almost twice as fast when compiled without optimizations than it is compiled with -O2.
m1 is computed only once because it is a Constant Applicative Form, while m2 is not a CAF, and so is computed for each evaluation.
See the GHC wiki on CAFs: http://www.haskell.org/haskellwiki/Constant_applicative_form
There is a crucial difference between the two forms: the monomorphism restriction applies to m1 but not m2, because m2 has explicitly given arguments. So m2's type is general but m1's is specific. The types they are assigned are:
m1 :: Int -> Integer
m2 :: (Integral a) => Int -> a
Most Haskell compilers and interpreters (all of them that I know of, actually) do not memoize polymorphic structures, so m2's internal list is recreated every time it's called, whereas m1's is not.
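You can see the effect by going the other way (a hypothetical variant): give m1 a polymorphic signature and, absent specialisation, its list is no longer shared, because the expression now depends on which Integral dictionary each call supplies.

m1Poly :: Integral a => Int -> a
m1Poly = ((filter odd [1..]) !!)
-- Behaves like m2: the list is rebuilt at each use, since under the hood
-- m1Poly is a function of the Integral dictionary.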
I'm not sure, because I'm quite new to Haskell myself, but it appears that it's because the second function is parametrized and the first one is not. The nature of a function is that its result depends on its input, and in the functional paradigm especially it depends ONLY on the input. The obvious implication is that a function with no parameters always returns the same value, over and over, no matter what.
Apparently there's an optimizing mechanism in the GHC compiler that exploits this fact to compute the value of such a function only once for the whole program runtime. It does it lazily, to be sure, but does it nonetheless. I noticed it myself when I wrote the following function:
primes = filter isPrime [2..]
  where isPrime n = null [factor | factor <- [2..n-1], factor `divides` n]
          where f `divides` n = (n `mod` f) == 0
Then to test it, I entered GHCi and wrote: primes !! 1000. It took a few seconds, but finally I got the answer: 7927. Then I called primes !! 1001 and got the answer instantly. Similarly, in an instant I got the result for take 1000 primes, because Haskell had already computed the whole thousand-element list in order to return the 1001st element before.
Thus if you can write your function such that it takes no parameters, you probably want it. ;)

Mutable, (possibly parallel) Haskell code and performance tuning

I have now implemented another SHA3 candidate, namely Grøstl. This is still work in progress (very much so), but at the moment a 224-bit version passes all KATs. So now I'm wondering about performance (again :->). The difference this time is that I chose to more closely mirror the (optimized) C implementation, i.e. I made a port from C to Haskell. The optimized C version uses table lookups to implement the algorithm. Furthermore, the code is heavily based on updating an array containing 64-bit words. Thus I chose to use mutable unboxed vectors in Haskell.
My Grøstl code can be found here: https://github.com/hakoja/SHA3/blob/master/Data/Digest/GroestlMutable.hs
Short description of the algorithm: it's a Merkle-Damgård construction, iterating a compression function (f512M in my code) as long as there are 512-bit blocks of message left. The compression function is very simple: it simply runs two different independent 512-bit permutations P and Q (permP and permQ in my code) and combines their output. It's these permutations which are implemented by lookup tables.
Q1) The first thing that bothers me is that the use of mutable vectors makes my code look really fugly. This is my first time writing any major mutable code in Haskell, so I don't really know how to improve this. Any tips on how I might better structure the monadic code would be welcome.
Q2) The second is performance. Actually, it's not too bad, because at the moment the Haskell code is only 3 times slower. Using GHC-7.2.1 and compiling as such:
ghc -O2 -Odph -fllvm -optlo-O3 -optlo-loop-reduce -optlo-loop-deletion
the Haskell code takes 60 s on an input of ~1 GB, while the C version takes 21-22 s. But there are some things I find odd:
(1) If I try to inline rnd512QM, the code takes 4 times longer, but if I inline rnd512PM nothing happens! Why is this happening? These two functions are virtually identical!
(2) This is maybe more difficult. I've been experimenting with executing the two permutations in parallel. But currently to no avail. This is one example of what I tried:
f512 h m = V.force outP `par` (V.force outQ `pseq` (V.zipWith3 xor3 h outP outQ))
  where xor3 x1 x2 x3 = x1 `xor` x2 `xor` x3
        inP  = V.zipWith xor h m
        outP = permP inP
        outQ = permQ m
When checking the run-time statistics and using ThreadScope, I noticed that the correct number of SPARKS was created, but almost none were actually converted to useful parallel work. Thus I gained nothing in speedup. My question then becomes:
Are the P and Q functions just too small for the runtime to bother to run in parallel?
If not, is my use of par and pseq (and possibly Vector.Unboxed.force) wrong?
Would I gain anything by switching to strategies? And how would I go about doing that?
Thank you so much for your time.
EDIT:
Sorry for not providing any real benchmark tests. The testing code in the repo was intended just for myself. For those wanting to test the code out, you will need to compile main.hs, and then run it as:
./main "algorithm" "testvariant" "byte aligned"
For instance:
./main groestl short224 False
or
./main groestl e False
(e stands for "Extreme". It's the very long message provided with the NIST KATS).
I checked out the repo, but there's no simple benchmark to just run and play with, so my ideas are just from eyeballing the code. Numbering is unrelated to your questions.
1) I'm pretty sure force doesn't do what you want -- it actually forces a copy of the underlying vector.
2) I think the use of unsafeThaw and unsafeFreeze is sort of odd. I'd just put f512M in the ST monad and be done with it. Then run it something like so:
otherwise = \msg -> truncate G224 . outputTransformation . runST $ foldM f512M h0_224 (parseMessage dataBitLen 512 msg)
3) V.foldM' is sort of silly -- you can just use a normal (strict) foldM over a list -- folding over the vector in the second argument doesn't seem to buy anything.
4) I'm dubious about the bangs in columnM and for the unsafeReads.
Also...
a) I suspect that xoring unboxed vectors can probably be implemented at a lower level than zipWith, making use of Data.Vector internals.
b) However, it may be better not to do this as it could interfere with vector fusion.
c) On inspection, extractByte looks slightly inefficient? Rather than using fromIntegral to truncate, maybe use mod or quot and then a single fromIntegral to take you directly to an Int.
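For c), the intent is something like this (a sketch; the real extractByte's argument order and exact semantics are assumptions):

import Data.Word (Word64)

-- Pick out byte n with quot/mod, converting to Int once at the end
-- instead of truncating through a chain of fromIntegrals.
extractByte :: Int -> Word64 -> Int
extractByte n w = fromIntegral ((w `quot` (256 ^ n)) `mod` 256)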
Be sure to compile with -threaded -rtsopts and execute with +RTS -N2. Without that, you won't have more than one OS thread to perform computations.
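Using the question's own test driver, that would be (a sketch; with -threaded, the +RTS -s summary also prints a SPARKS line showing how many sparks were actually converted):

$ ghc -O2 -threaded -rtsopts main.hs
$ ./main groestl e False +RTS -N2 -s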
Try to spark computations that are referred to elsewhere, otherwise they might be collected:
f512 h m = outP `par` (outQ `pseq` (V.zipWith3 xor3 h outP outQ))
  where xor3 x1 x2 x3 = x1 `xor` x2 `xor` x3
        inP  = V.zipWith xor h m
        outP = V.force $ permP inP
        outQ = V.force $ permQ m
3) If you switch things up so parseBlock accepts strict bytestrings (or chunks and packs lazy ones when needed) then you can use Data.Vector.Storable and potentially avoid some copying.
