What's the meaning of strict version in haskell? - lazy-evaluation

Follow <Real World Haskell> , it is said foldl' are strict version of foldl.
But it's hard for me to understand , what does strict mean??
foldl f z0 xs0 = lgo z0 xs0
where
lgo z [] = z
lgo z (x:xs) = lgo (f z x) xs
foldl' f z0 xs0 = lgo z0 xs0
where lgo z [] = z
lgo z (x:xs) = let z' = f z x in z' `seq` lgo z' xs

It is not widely known, but foldl' is actually non-strict in its accumulator argument! Recall the type:
foldl' :: (a -> b -> a) -> a -> [b] -> a
Its strictness in argument 2 depends on the strictness of the function given for argument 1, as you see if you pass const:
Prelude Data.List> foldl' (const (+1)) undefined [1]
2
Prelude Data.List> foldl' (const (+1)) undefined [1..4]
5
You would have thought, naively, that "foldl' is strict" means "strict in the accumulator argument". The above contradicts that.
However, it is even more insidious, as the strictness is only on the result of the function application in the cons case of the loop. So you still get bottoms if you enter the base case, but not the inductive case:
Prelude Data.List> foldl' (const (+1)) undefined []
*** Exception: Prelude.undefined
So the strictness in argument 2 also depends on the value of argument 3!
This is how I'd write it: "fully" strict in its 2nd argument.
foldl' f z0 xs0 = go z0 xs0
where
go !z [] = z
go !z (x:xs) = go (f z x) xs
Which is truly strict in its second argument, as you can see :
Prelude Data.List.Stream> foldl' (\a b -> 1) undefined [undefined]
*** Exception: Prelude.undefined
Compared with the Haskell2010 version:
Prelude Data.List.Stream> Data.List.foldl' (\a b -> 1 ) undefined [undefined]
1
This actuall has a practical impact -- the current definition will not unbox its accumulator argument consistently.
Historical note: this was discovered when we were specifying the list library's strictness semantics for the stream fusion paper in 2007, and the approach to specifying strictness is given in Duncan Coutt's PhD thesis.

foldl and (the strict) foldl' are close to semantically equivalent. The difference is in performance, especially when you are transversing a large list. The laziness has an overhead of building a thunk and foldl' is the more efficient way to arrive at that result because it doesn't build a huge thunk.
There is a really good article explaining this in detail on Haskell Wiki
Strict functions works like functions in C or other languages in that their arguments are generally eagerly evaluated.

A strict function is a function whose arguments are evaluated before the body is.

Related

Performance gain implementing concatMap with foldl' for finite list?

I read from Foldr Foldl Foldl' that foldl' is more efficient for long finite lists because of the strictness property. I am aware that it is not suitable for infinite list.
Thus, I am limiting the comparison only for long finite lists.
concatMap
concatMap is implemented using foldr, which gives it laziness. However, using it with long finite lists will build up a long unreduced chain according to the article.
concatMap :: Foldable t => (a -> [b]) -> t a -> [b]
concatMap f xs = build (\c n -> foldr (\x b -> foldr c b (f x)) n xs)
Thus I come up with the following implementation with use of foldl'.
concatMap' :: Foldable t => (a -> [b]) -> t a -> [b]
concatMap' f = reverse . foldl' (\acc x -> f x ++ acc) []
Test it out
I have build the following two functions to test out the performance.
lastA = last . concatMap (: []) $ [1..10000]
lastB = last . concatMap' (: []) $ [1..10000]
However, I was shocked by the results.
lastA:
(0.23 secs, 184,071,944 bytes)
(0.24 secs, 184,074,376 bytes)
(0.24 secs, 184,071,048 bytes)
(0.24 secs, 184,074,376 bytes)
(0.25 secs, 184,075,216 bytes)
lastB:
(0.81 secs, 224,075,080 bytes)
(0.76 secs, 224,074,504 bytes)
(0.78 secs, 224,072,888 bytes)
(0.84 secs, 224,073,736 bytes)
(0.79 secs, 224,074,064 bytes)
Follow-up Questions
concatMap outcompetes my concatMap' in both time and memory. I wonder there are mistakes I made in my concatMap' implementation.
Thus, I doubt the articles for stating the goodness of foldl'.
Are there any black magic in concatMap to make it so efficient?
Is it true that foldl' is more efficient for long finite list?
Is it true that using foldr with long finite lists will build up a long unreduced chain and impact the performance?
Are there any black magic in concatMap to make it so efficient?
No, not really.
Is it true that foldl' is more efficient for long finite list?
Not always. It depends on the folding function.
The point is, foldl and foldl' always have to scan the whole input list before producing the output. Instead, foldr does not always have to.
As an extreme case, consider
foldr (\x xs -> x) 0 [10..10000000]
which evaluates to 10 instantly -- only the first element of the list is evaluated. The reduction goes something like
foldr (\x xs -> x) 0 [10..10000000]
= foldr (\x xs -> x) 0 (10 : [11..10000000])
= (\x xs -> x) 10 (foldr (\x xs -> x) 0 [11..10000000])
= (\xs -> 10) (foldr (\x xs -> x) 0 [11..10000000])
= 10
and the recursive call is not evaluated thanks to laziness.
In general, when computing foldr f a xs, it is important to check whether f y ys is able to construct a part of the output before evaluating ys. For instance
foldr f [] xs
where f y ys = (2*y) : ys
produces a list cell _ : _ before evaluating 2*y and ys. This makes it an excellent candidate for foldr.
Again, we can define
map f xs = foldr (\y ys -> f y : ys) [] xs
which runs just fine. It consumes one element from xs and outputs the first output cell. Then it consumes the next element, outputs the next element, and so on. Using foldl' would not output anything until the whole list is processed, making the code quite inefficient.
Instead, if we wrote
sum xs = foldr (\y ys -> y+ys) 0 xs
then we do not output anything after the first element of xs is consumed.
We build a long chain of thunks, wasting a lot of memory.
Here, foldl' would instead work in constant space.
Is it true that using foldr with long finite lists will build up a long unreduced chain and impact the performance?
Not always. It strongly depends on how the output is consumed by the caller.
As a thumb rule, if the output is "atomic", meaning that the output consumer can not observe only a part of it (e.g. Bool, Int, ...) then it's better to use foldl'. If the output is "composed" of many independent values (list, trees, ...) probably foldr is a better choice, if f can produce its output step-by-step, in a "streaming" fashion.

Performance of Foldable's default methods

I've been exploring the Foldable class and also the the Monoid class.
Firstly, lets say I want to fold over a list of the Monoid First. Like so:
x :: [First a]
fold? mappend mempty x
Then I assume in this case the most appropriate fold would be foldr, as mappend for First is lazy in it's second argument.
Conversely, for Last we'd want to foldl' (or just foldl I'm not sure).
Now moving away from lists, I've defined a simple binary tree like so:
{-# LANGUAGE GADTs #-}
data BinaryTree a where
BinaryTree :: BinaryTree a -> BinaryTree a -> BinaryTree a
Leaf :: a -> BinaryTree a
And I've made it Foldable with the most straightforward definition:
instance Foldable BinaryTree where
foldMap f (BinaryTree left right) =
(foldMap f left) `mappend` (foldMap f right)
foldMap f (Leaf x) = f x
As Foldable defines fold as simply foldMap id we can now do:
x1 :: BinaryTree (First a)
fold x1
x2 :: BinaryTree (Last a)
fold x2
Assuming our BinaryTree is balanced, and there's not many Nothing values, these operations should take O(log(n)) time I believe.
But Foldable also defines a whole lot of default methods like foldl, foldl', foldr and foldr' based on foldMap.
These default definitions seem to be implemented by composing a bunch of functions, wrapped in a Monoid called Endo, one for each element in the collection, and then composing them all.
For the purpose of this discussion I am not modifying these default definitions.
So lets now consider:
x1 :: BinaryTree (First a)
foldr mappend mempty x1
x2 :: BinaryTree (Last a)
foldl mappend mempty x2
Does running these retain O(log(n)) performance of the ordinary fold? (I'm not worried about constant factors for the moment). Does laziness result in the tree not needing to be fully traversed? Or will the default definitions of foldl and foldr require an entire traversal of the tree?
I tried to go though the algorithm step by step (much like they did on the Foldr Foldl Foldl' article) but I ended up completely confusing myself as this is a bit more complex as it involves an interaction between Foldable, Monoid and Endo.
So what I'm looking for is an explanation of why (or why not) the default definition of say foldr, would only take O(log(n)) time on a balanced binary tree like above. A step by step example like what's from the Foldr Foldl Foldl' article would be really helpful, but I understand if that's too difficult, as I totally confused myself attempting it.
Yes, it has O(log(n)) best case performance.
Endo is a wrapper around (a -> a) kind of functions that:
instance Monoid (Endo a) where
mempty = Endo id
Endo f `mappend` Endo g = Endo (f . g)
And the default implementation of foldr in Data.Foldable:
foldr :: (a -> b -> b) -> b -> t a -> b
foldr f z t = appEndo (foldMap (Endo #. f) t) z
The definition of . (function composition) in case:
(.) f g = \x -> f (g x)
Endo is defined by newtype constructor, so it only exists at compile stage, not run-time.
#. operator changes the type of it's second operand and discard the first.
The newtype constructor and #. operator guarantee that you can ignore the wrapper when considering performance issues.
So the default implementation of foldr can be reduced to:
-- mappend = (.), mempty = id from instance Monoid (Endo a)
foldr :: (a -> b -> b) -> b -> t a -> b
foldr f z t = foldMap f t z
For your Foldable BinaryTree:
foldr f z t
= foldMap f t z
= case t of
Leaf a -> f a z
-- what we care
BinaryTree l r -> ((foldMap f l) . (foldMap f r)) z
The default lazy evaluation in Haskell is ultimately simple, there are just two rules:
function application first
evaluate the arguments from left to right if the values matter
That makes it easy to trace the evaluation of the last line of the code above:
((foldMap f l) . (foldMap f r)) z
= (\z -> foldMap f l (foldMap f r z)) z
= foldMap f l (foldMap f r z)
-- let z' = foldMap f r z
= foldMap f l z' -- height 1
-- if the branch l is still not a Leaf node
= ((foldMap f ll) . (foldMap f lr)) z'
= (\z -> foldMap f ll (foldMap f lr)) z'
= foldMap f ll (foldMap f lr z')
-- let z'' = foldMap f lr z'
= foldMap f ll z'' -- height 2
The right branch of the tree is never expanded before the left has been fully expanded, and it goes one level higher after an O(1) operation of function expansion and application, therefore when it reached the left-most Leaf node:
= foldMap f leaf#(Leaf a) z'heightOfLeftMostLeaf
= f a z'heightOfLeftMostLeaf
Then f looks at the value a and decides to ignore its second argument (like what mappend will do to First values), the evaluation short-circuits, results O(height of the left-most leaf), or O(log(n)) performance when the tree is balanced.
foldl is all the same, it's just foldr with mappend flipped i.e. O(log(n)) best case performance with Last.
foldl' and foldr' are different.
foldl' :: (b -> a -> b) -> b -> t a -> b
foldl' f z0 xs = foldr f' id xs z0
where f' x k z = k $! f z x
At every step of reduction, the argument is evaluated first and then the function application, the tree will be traversed i.e. O(n) best case performance.

Haskell - foldl' in terms of foldr and performance issues

While studying fold in depth with A tutorial on the universality and expressiveness of fold
I found an amazing definition of foldl using foldr:
-- I used one lambda function inside another only to improve reading
foldl :: (b -> a -> b) -> b -> [a] -> b
foldl f z xs = foldr (\x g -> (\a -> g (f a x))) id xs z
After understanding what is going on, I thought I could even use foldr to define foldl', which would be like this:
foldl' :: (b -> a -> b) -> b -> [a] -> b
foldl' f z xs = foldr (\x g -> (\a -> let z' = a `f` x in z' `seq` g z')) id xs z
Which is parallel to this:
foldl' :: (b -> a -> b) -> b -> [a] -> b
foldl' f z (x:xs) = let z' = z `f` x
in seq z' $ foldl' f z' xs
foldl' _ z _ = z
It seems that both of them are running in constant space (not creating thunks) in simple cases like this:
*Main> foldl' (+) 0 [1..1000000]
500000500000
May I consider both definitions of foldl' equivalent in terms of performance?
In GHC 7.10+, foldl and foldl' are both defined in terms of foldr. The reason that they weren't before is that GHC didn't optimize the foldr definition well enough to participate in foldr/build fusion. But GHC 7.10 introduced a new optimization specifically to allow foldr/build fusion to succeed while using foldl' or foldl' defined that way.
The big win here is that an expression like foldl' (+) 0 [1..10] can be optimized down to never allocating a (:) constructor at all. And we all know that the absolute fastest garbage collection is when there's no garbage to collect.
See http://www.joachim-breitner.de/publications/CallArity-TFP.pdf for information on the new optimization in GHC 7.10, and why it was necessary.

Speeding up a stream like data type

I've made a type which is supposed to emulate a "stream". This is basically a list without memory.
data Stream a = forall s. Stream (s -> Maybe (a, s)) s
Basically a stream has two elements. A state s, and a function that takes the state, and returns an element of type a and the new state.
I want to be able to perform operations on streams, so I've imported Data.Foldable and defined streams on it as such:
import Data.Foldable
instance Foldable Stream where
foldr k z (Stream sf s) = go (sf s)
where
go Nothing = z
go (Just (e, ns)) = e `k` go (sf ns)
To test the speed of my stream, I've defined the following function:
mysum = foldl' (+) 0
And now we can compare the speed of ordinary lists and my stream type:
x1 = [1..n]
x2 = Stream (\s -> if (s == n + 1) then Nothing else Just (s, s + 1)) 1
--main = print $ mysum x1
--main = print $ mysum x2
My streams are about half the speed of lists (full code here).
Furthermore, here's a best case situation, without a list or a stream:
bestcase :: Int
bestcase = go 1 0 where
go i c = if i == n then c + i else go (i+1) (c+i)
This is a lot faster than both the list and stream versions.
So I've got two questions:
How to I get my stream version to be at least as fast as a list.
How to I get my stream version to be close to the speed of bestcase.
As it stands the foldl' you are getting from Foldable is defined in terms of the foldr you gave it. The default implementation is the brilliant and surprisingly good
foldl' :: (b -> a -> b) -> b -> t a -> b
foldl' f z0 xs = foldr f' id xs z0
where f' x k z = k $! f z x
But foldl' is the specialty of your type; fortunately the Foldable class includes foldl' as a method, so you can just add this to your instance.
foldl' op acc0 (Stream sf s0) = loop s0 acc0
where
loop !s !acc = case sf s of
Nothing -> acc
Just (a,s') -> loop s' (op acc a)
For me this seems to give about the same time as bestcase
Note that this is a standard case where we need a strictness annotation on the accumulator. You might look in the vector package's treatment of a similar type https://hackage.haskell.org/package/vector-0.10.12.2/docs/src/Data-Vector-Fusion-Stream.html for some ideas; or in the hidden 'fusion' modules of the text library https://github.com/bos/text/blob/master/Data/Text/Internal/Fusion .

Haskell garbage collector

I have a simple toy example that seems to disagree with the garbage collector on what data structures can be reclaimed (aka memory leak). I am not trying to come up with more memory efficient versions of this algorithm (a good collection of better algorithms is here: Haskell Wiki - Prime numbers, rather an explanation why the garbage collector is not identifying the old, out of scope and unused portions of the list to reclaim that memory.
The code is here:
import Data.List (foldl')
erat' :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
erat' (c,b) ((x,y):xs)
| c < x = (x,y) : erat' (c,b) xs
| c == x = (x+y,y) : erat' (c,True) xs
| c > x = (x+y,y) : erat' (c,b) xs
erat' (c,b) []
| b = []
| otherwise = [(c,c)]
erat :: [Integer] -> [(Integer,Integer)]
erat = foldl' (\a c -> erat' (c,False) a) []
primes :: Integer -> [Integer]
primes n = map snd $ erat [2..n]
In essence, calling primes with a positive integer will return a list of all prime numbers up to and including that number. A list of pairs of primes and their high water mark multiple is passed to erat', together with a pair including a candidate and a boolean (False for prime and True for non-prime). Every non-recursive call to erat' will pass a new list, and I would expect that the output would contain, at most, certain shared cells from the beginning of the list up to the point of the first change.
As soon as the modified cells in the list passed to erat' come out of scope, the memory should be flagged to be recovered, but as you can see when you try calling primes with a large enough number (1,000,000, for example), the memory utilization can quickly spike to tens of gigabytes.
Now, the question is: why is this happening? Shouldn't the generational garbage collector detect dereferenced list cells to reclaim them? And, shouldn't it be fairly easy for it to detect that they don't have references because:
a) nothing can have references from data structures older than itself;
b) there cannot be newer references because those cells/fragments are not even part of a referenceable data structure anymore, since it came out of scope?
Of course, a mutable data structure would take care of this, but I feel like resorting to mutability in a case like this is dropping some of the theoretical principles for Haskell on the floor.
Thanks to the people that commented (particularly Carl), I modified the algorithm slightly to add strictness (and the optimization of starting crossing the square of the new prime, since lower multiples will be crossed by multiples of lower primes too).
This is the new version:
import Data.List (foldl')
erat' :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
erat' (c,b) ((x,y):xs)
| c < x = x `seq` (x,y) : erat' (c,b) xs
| c == x = x `seq` (x+y,y) : erat' (c,True) xs
| c > x = x `seq` (x+y,y) : erat' (c,b) xs
erat' (c,b) []
| b = []
| otherwise = [(c*c,c)] -- lower multiples would be covered by multiples of lower primes
erat :: [Integer] -> [(Integer,Integer)]
erat = foldl' (\a c -> erat' (c,False) a) []
primes :: Integer -> [Integer]
primes n = map snd $ erat [2..n]
The memory consumption seems to still be quite significant. Are there any other changes to this algorithm that could help reduce the total memory utilization?
Since Will pointed out that I didn't provide full statistics, these are the numbers for a run of the updated version of primes listed just above, with 100000 as the parameter:
And after applying the changes that Will proposed, the memory usage is now down considerably. See, for example, on a run of primes for 100000 again:
And last, this is the final code after the proposed changes were incorporated:
import Data.List (foldl')
erat'' :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
erat'' (c,b) ((x,y):xs)
| c < x = (x, y) : if x==y*y then (if b then xs
else xs++[(c*c,c)])
else erat'' (c,b) xs
| c == x = (x+y,y) : if x==y*y then xs
else erat'' (c,True) xs
| c > x = (x+y,y) : erat'' (c,b) xs
erat'' (c,True) [] = []
erat'' (c,False) [] = [(c*c,c)]
primes'' :: Integer -> [Integer]
primes'' n = map snd $ foldl' (\a c -> (if null a then 0 else
case last a of (x,y) -> y) `seq` erat'' (c,False) a) [] [2..n]
And finally a run for 1,000,000 to have a feeling for performance in this new version:
Assumption a) is false in the presence of laziness. And in fact, your code consists almost entirely of generating cons cells pointed to by older cons cells. erat' consumes a list element, then produces a (:) constructor pointing to a tuple and an unevaluated thunk which will perform a recursive call to erat'. Only when that thunk is later evaluated will the (:) list constructor actually point to its tail as a data structure. So yes, nearly every (:) you allocate in erat' is in fact pointing forward in time. (The only exception is the last one - [foo] is going to point to the pre-existing [] constructor when its (:) constructor is allocated.)
Assumption b) is nonsense in the presence of laziness. Scope determines visibility in Haskell, not lifetime. Lifetime depends on evaluation and reachability.
So what happens at runtime is that you build up pipeline of erat' calls in erat. Each one of them holds on to as much of its input as has been evaluated, slowly consuming it. The interesting part is that your code doesn't evaluate anything in advance - it seems like it should actually stream pretty well - except for the fact that the pipeline is too deep. The pipeline created is approximately n stages - this is (inefficient!) trial division, not the sieve of Eratosthenes. You should only be adding prime numbers to the pipeline, not every number.
breaking update: You should use
map snd $ foldl' (\a c -> (if null a then 0 else
case last a of (x,y) -> y) `seq` erat' (c,False) a) [] [2..n]
to force the list fully on each iteration. It will consume less memory and run faster.
The reason for the above is that foldl' only forces the accumulator to weak head normal form, and even using last a isn't enough, as it would be forced just to a pair (_,_), without forcing its constituents.
But when your erat' function is changed so that it stops scanning the interim list of primes and their multiples as soon as possible, and shares its tail whenever possible (as described below), it is faster without the forcing, even if using more memory.
Your (updated) code, edited a little for legibility:
g :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
g (c,b) ((x,y):xs)
| c < x = (x, y) : g (c,b) xs -- `c < x` forces the x already,
| c == x = (x+y,y) : g (c,True) xs -- no need for `seq`
| c > x = (x+y,y) : g (c,b) xs
g (c,True) [] = []
g (c,False) [] = [(c*c,c)]
primes :: Integer -> [Integer]
primes n = map snd $ foldl' (\a c -> g (c,False) a) [] [2..n]
So, your primes n is actually a little like a right fold on the reversed [2..n] list. Writing h for flip $ foldl' (\a c -> g (c,False) a), it is
= map snd $ h [2..n] $ []
= map snd $ h [3..n] $ [(2*2,2)]
= map snd $ h [4..n] $ (4,2) :(g (3,False) [])
= map snd $ h [5..n] $ (4+2,2):(g (4,True ) $ g (3,False) [])
= map snd $ h [6..n] $ (6,2) :(g (5,False) $ g (4,True ) $ g (3,False) [])
....
The strictness of foldl' has limited effect here as the accumulator is forced only to the weak head normal form.
Folding with (\a c -> last a `seq` g (c,False) a) would give us
= map snd $ ... $ g (3,False) [(2*2,2)]
= map snd $ ... $ g (4,False) [(4,2),(3*3,3)]
= map snd $ ... $ g (5,False) [(4+2,2),(9,3)]
= map snd $ ... $ g (6,False) [(6,2),(9,3),(5*5,5)]
= map snd $ ... $ g (7,False) [(6+2,2),(9,3),(25,5)]
= map snd $ ... $ g (8,False) [(8,2),(9,3),(25,5),(7*7,7)]
= map snd $ ... $ g (9,False) [(8+2,2),(9,3),(25,5),(49,7)]
= map snd $ ... $ g (10,False) [(10,2),(9+3,3),(25,5),(49,7)]
= map snd $ ... $ g (11,False) [(10+2,2),(12,3),(25,5),(49,7)]
....
= map snd $ ... $ g (49,False)
[(48+2,2),(48+3,3),(50,5),(49,7),(121,11)...(2209,47)]
....
but all these changes will be pushed through to the list by the final print anyway, so the laziness is not the immediate problem here (it causes stack overflow for bigger inputs, but that's secondary here). The problem is that your erat' (renamed g above) eventually pushes each entry through the whole list needlessly, recreating the whole list for each candidate number. This is a very heavy memory usage pattern.
It should stop as early as possible, and share the list's tail whenever possible:
g :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
g (c,b) ((x,y):xs)
| c < x = (x, y) : if x==y*y then (if b then xs
else xs++[(c*c,c)])
else g (c,b) xs
| c == x = (x+y,y) : if x==y*y then xs
else g (c,True) xs
| c > x = (x+y,y) : g (c,b) xs
g (c,True) [] = []
g (c,False) [] = [(c*c,c)]
Compiled with -O2 and run standalone, it runs under ~ N1.9 vs your original function's ~ N2.4..2.8..and rising, producing primes up to N.
(of course a "normal" sieve of Eratosthenes should run at about ~ N1.1, ideally, its theoretical time complexity being N log (log N)).

Resources