I read in Foldr Foldl Foldl' that foldl' is more efficient for long finite lists because of its strictness. I am aware that it is not suitable for infinite lists.
Thus, I am limiting the comparison to long finite lists.
concatMap
concatMap is implemented using foldr, which gives it laziness. However, using it with long finite lists will build up a long unreduced chain according to the article.
concatMap :: Foldable t => (a -> [b]) -> t a -> [b]
concatMap f xs = build (\c n -> foldr (\x b -> foldr c b (f x)) n xs)
So I came up with the following implementation using foldl'.
concatMap' :: Foldable t => (a -> [b]) -> t a -> [b]
concatMap' f = reverse . foldl' (\acc x -> f x ++ acc) []
Test it out
I have built the following two functions to test the performance.
lastA = last . concatMap (: []) $ [1..10000]
lastB = last . concatMap' (: []) $ [1..10000]
However, I was shocked by the results.
lastA:
(0.23 secs, 184,071,944 bytes)
(0.24 secs, 184,074,376 bytes)
(0.24 secs, 184,071,048 bytes)
(0.24 secs, 184,074,376 bytes)
(0.25 secs, 184,075,216 bytes)
lastB:
(0.81 secs, 224,075,080 bytes)
(0.76 secs, 224,074,504 bytes)
(0.78 secs, 224,072,888 bytes)
(0.84 secs, 224,073,736 bytes)
(0.79 secs, 224,074,064 bytes)
Follow-up Questions
concatMap outperforms my concatMap' in both time and memory. I wonder whether I made mistakes in my concatMap' implementation.
This also makes me doubt the article's claims about the benefits of foldl'.
Is there any black magic in concatMap that makes it so efficient?
Is it true that foldl' is more efficient for long finite lists?
Is it true that using foldr with long finite lists will build up a long unreduced chain and impact the performance?
Is there any black magic in concatMap that makes it so efficient?
No, not really.
Is it true that foldl' is more efficient for long finite lists?
Not always. It depends on the folding function.
The point is that foldl and foldl' always have to scan the whole input list before producing any output. foldr, by contrast, does not always have to.
As an extreme case, consider
foldr (\x xs -> x) 0 [10..10000000]
which evaluates to 10 instantly -- only the first element of the list is evaluated. The reduction goes something like
foldr (\x xs -> x) 0 [10..10000000]
= foldr (\x xs -> x) 0 (10 : [11..10000000])
= (\x xs -> x) 10 (foldr (\x xs -> x) 0 [11..10000000])
= (\xs -> 10) (foldr (\x xs -> x) 0 [11..10000000])
= 10
and the recursive call is not evaluated thanks to laziness.
In general, when computing foldr f a xs, it is important to check whether f y ys is able to construct a part of the output before evaluating ys. For instance
foldr f [] xs
  where f y ys = (2*y) : ys
produces a list cell _ : _ before evaluating 2*y and ys. This makes it an excellent candidate for foldr.
Again, we can define
map f xs = foldr (\y ys -> f y : ys) [] xs
which runs just fine. It consumes one element from xs and outputs the first output cell. Then it consumes the next element, outputs the next element, and so on. Using foldl' would not output anything until the whole list is processed, making the code quite inefficient.
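A minimal sketch of that streaming behaviour, using only the Prelude. The names mapR and firstThree are made up for illustration. Because the folding function returns a cons cell before it looks at ys, the fold can even be run over an infinite list:

```haskell
-- map written as a right fold (mapR is an illustrative name).
mapR :: (a -> b) -> [a] -> [b]
mapR f = foldr (\y ys -> f y : ys) []

-- Only three input elements are ever consumed, although the input is infinite.
firstThree :: [Int]
firstThree = take 3 (mapR (* 2) [1 ..])
```

A left fold over the same infinite input would never return, since it has to reach the end of the list first.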
Instead, if we wrote
sum xs = foldr (\y ys -> y+ys) 0 xs
then we do not output anything after the first element of xs is consumed.
We build a long chain of thunks, wasting a lot of memory.
Here, foldl' would instead work in constant space.
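The two variants can be put side by side as a small sketch. The names sumR and sumL are hypothetical, chosen to avoid clashing with the Prelude's sum:

```haskell
import Data.List (foldl')

-- Right fold: no addition can complete until the end of the list is reached,
-- so the pending additions pile up.
sumR :: [Integer] -> Integer
sumR = foldr (\y ys -> y + ys) 0

-- Strict left fold: the running total is forced at every step.
sumL :: [Integer] -> Integer
sumL = foldl' (+) 0
```

Both return the same number; the difference is in how much memory is held while the list is being walked.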
Is it true that using foldr with long finite lists will build up a long unreduced chain and impact the performance?
Not always. It strongly depends on how the output is consumed by the caller.
As a rule of thumb, if the output is "atomic", meaning that the output consumer cannot observe only a part of it (e.g. Bool, Int, ...), then it's better to use foldl'. If the output is "composed" of many independent values (lists, trees, ...), foldr is probably a better choice, provided f can produce its output step by step, in a "streaming" fashion.
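Applied to concatMap, whose output is a list, the rule favours foldr. The sketch below is specialised to lists and uses hypothetical names (concatMapR, concatMapL). Note one assumption that differs from the question's concatMap': each chunk is reversed before being prepended, because otherwise the final reverse would scramble chunks that hold more than one element.

```haskell
import Data.List (foldl')

-- Right-fold version: the first chunk is available before the rest is touched.
concatMapR :: (a -> [b]) -> [a] -> [b]
concatMapR f = foldr (\x rest -> f x ++ rest) []

-- Left-fold version: builds the result backwards, then reverses it at the end.
-- It cannot return anything until the whole input has been consumed.
concatMapL :: (a -> [b]) -> [a] -> [b]
concatMapL f = reverse . foldl' (\acc x -> reverse (f x) ++ acc) []
```

On finite input both give the same list, but only concatMapR can answer something like take 3 on an infinite input.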
I've been exploring the Foldable class and also the Monoid class.
Firstly, let's say I want to fold over a list of the Monoid First. Like so:
x :: [First a]
fold? mappend mempty x
Then I assume that in this case the most appropriate fold would be foldr, as mappend for First is lazy in its second argument.
Conversely, for Last we'd want foldl' (or just foldl, I'm not sure).
Now moving away from lists, I've defined a simple binary tree like so:
{-# LANGUAGE GADTs #-}
data BinaryTree a where
    BinaryTree :: BinaryTree a -> BinaryTree a -> BinaryTree a
    Leaf :: a -> BinaryTree a
And I've made it Foldable with the most straightforward definition:
instance Foldable BinaryTree where
    foldMap f (BinaryTree left right) =
        (foldMap f left) `mappend` (foldMap f right)
    foldMap f (Leaf x) = f x
As Foldable defines fold as simply foldMap id we can now do:
x1 :: BinaryTree (First a)
fold x1
x2 :: BinaryTree (Last a)
fold x2
Assuming our BinaryTree is balanced and there aren't many Nothing values, these operations should take O(log(n)) time, I believe.
But Foldable also defines a whole lot of default methods like foldl, foldl', foldr and foldr' based on foldMap.
These default definitions seem to be implemented by composing a bunch of functions, wrapped in a Monoid called Endo, one for each element in the collection, and then composing them all.
For the purpose of this discussion I am not modifying these default definitions.
So let's now consider:
x1 :: BinaryTree (First a)
foldr mappend mempty x1
x2 :: BinaryTree (Last a)
foldl mappend mempty x2
Does running these retain O(log(n)) performance of the ordinary fold? (I'm not worried about constant factors for the moment). Does laziness result in the tree not needing to be fully traversed? Or will the default definitions of foldl and foldr require an entire traversal of the tree?
I tried to go through the algorithm step by step (much like they did in the Foldr Foldl Foldl' article), but I ended up completely confusing myself, as this is a bit more complex: it involves an interaction between Foldable, Monoid and Endo.
So what I'm looking for is an explanation of why (or why not) the default definition of say foldr, would only take O(log(n)) time on a balanced binary tree like above. A step by step example like what's from the Foldr Foldl Foldl' article would be really helpful, but I understand if that's too difficult, as I totally confused myself attempting it.
Yes, it has O(log(n)) best case performance.
Endo is a newtype wrapper around functions of type a -> a, with this Monoid instance:
instance Monoid (Endo a) where
    mempty = Endo id
    Endo f `mappend` Endo g = Endo (f . g)
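As a small illustration of that instance (the names composed and result are made up), combining two Endo values composes the wrapped functions, with the right-hand one applied first:

```haskell
import Data.Monoid (Endo (..))

-- mappend on Endo is function composition: the right operand runs first.
composed :: Endo Int
composed = Endo (+ 1) `mappend` Endo (* 2)

-- appEndo unwraps the function again; this computes (+ 1) ((* 2) 5).
result :: Int
result = appEndo composed 5
```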
And the default implementation of foldr in Data.Foldable:
foldr :: (a -> b -> b) -> b -> t a -> b
foldr f z t = appEndo (foldMap (Endo #. f) t) z
For reference, the definition of (.) (function composition) is:
(.) f g = \x -> f (g x)
Endo is defined with a newtype constructor, so it exists only at compile time, not at run time.
The #. operator changes the type of its second operand and discards the first.
Together, the newtype constructor and the #. operator guarantee that you can ignore the wrapper when considering performance.
So the default implementation of foldr can be reduced to:
-- mappend = (.), mempty = id from instance Monoid (Endo a)
foldr :: (a -> b -> b) -> b -> t a -> b
foldr f z t = foldMap f t z
For your Foldable BinaryTree:
foldr f z t
= foldMap f t z
= case t of
    Leaf a -> f a z
    -- the case we care about
    BinaryTree l r -> ((foldMap f l) . (foldMap f r)) z
Lazy evaluation in Haskell is ultimately simple; there are just two rules:
function application first
evaluate the arguments from left to right if the values matter
That makes it easy to trace the evaluation of the last line of the code above:
((foldMap f l) . (foldMap f r)) z
= (\z -> foldMap f l (foldMap f r z)) z
= foldMap f l (foldMap f r z)
-- let z' = foldMap f r z
= foldMap f l z' -- height 1
-- if the branch l is still not a Leaf node
= ((foldMap f ll) . (foldMap f lr)) z'
= (\z -> foldMap f ll (foldMap f lr z)) z'
= foldMap f ll (foldMap f lr z')
-- let z'' = foldMap f lr z'
= foldMap f ll z'' -- height 2
The right branch of the tree is never expanded before the left one has been fully expanded, and each O(1) step of function expansion and application takes us one level deeper. So when evaluation reaches the left-most Leaf node:
= foldMap f leaf#(Leaf a) z'heightOfLeftMostLeaf
= f a z'heightOfLeftMostLeaf
Then f looks at the value a and decides to ignore its second argument (as mappend does for a First value holding a Just), so the evaluation short-circuits. The cost is O(height of the left-most leaf), or O(log(n)) when the tree is balanced.
foldl behaves the same way: it is just foldr with mappend flipped, so it has O(log(n)) best-case performance with Last.
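One way to check this claim is to plant undefined subtrees everywhere except along the path the fold should take. The sketch below does that, with two assumptions: the tree is written in ordinary data syntax (equivalent to the GADT-style declaration in the question), and the names leftTree, rightTree, firstViaFoldr and lastViaFoldl are made up. Only foldMap is defined, so foldr and foldl are the default Endo-based definitions.

```haskell
import Data.Monoid (First (..), Last (..))

data BinaryTree a = BinaryTree (BinaryTree a) (BinaryTree a) | Leaf a

instance Foldable BinaryTree where
  foldMap f (BinaryTree l r) = foldMap f l `mappend` foldMap f r
  foldMap f (Leaf x) = f x

-- Everything to the right of the left-most leaf is undefined.
leftTree :: BinaryTree (First Int)
leftTree = BinaryTree (BinaryTree (Leaf (First (Just 1))) undefined) undefined

-- Everything to the left of the right-most leaf is undefined.
rightTree :: BinaryTree (Last Int)
rightTree = BinaryTree undefined (BinaryTree undefined (Leaf (Last (Just 2))))

-- Default foldr only walks down the left spine.
firstViaFoldr :: Maybe Int
firstViaFoldr = getFirst (foldr mappend mempty leftTree)

-- Default foldl only walks down the right spine.
lastViaFoldl :: Maybe Int
lastViaFoldl = getLast (foldl mappend mempty rightTree)
```

If either fold had to traverse the whole tree, evaluating these values would hit an undefined subtree and fail.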
foldl' and foldr' are different.
foldl' :: (b -> a -> b) -> b -> t a -> b
foldl' f z0 xs = foldr f' id xs z0
  where f' x k z = k $! f z x
At every step of the reduction the argument is evaluated first and only then is the function applied, so the whole tree will be traversed, i.e. O(n) best-case performance.
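The same undefined-subtree trick can be used to see the difference, as a sketch under stated assumptions: the tree is again written in ordinary data syntax, the names holeyTree, lazyResult and strictFoldHitsHole are hypothetical, and the undefined left subtree exists only for this demonstration.

```haskell
import Control.Exception (SomeException, evaluate, try)
import Data.Foldable (foldl')
import Data.Monoid (Last (..))

data BinaryTree a = BinaryTree (BinaryTree a) (BinaryTree a) | Leaf a

instance Foldable BinaryTree where
  foldMap f (BinaryTree l r) = foldMap f l `mappend` foldMap f r
  foldMap f (Leaf x) = f x

-- The left subtree is deliberately undefined.
holeyTree :: BinaryTree (Last Int)
holeyTree = BinaryTree undefined (Leaf (Last (Just 2)))

-- Lazy foldl never looks at the left subtree.
lazyResult :: Maybe Int
lazyResult = getLast (foldl mappend mempty holeyTree)

-- True when the strict fold runs into the undefined subtree.
strictFoldHitsHole :: IO Bool
strictFoldHitsHole = do
  r <- try (evaluate (getLast (foldl' mappend mempty holeyTree)))
         :: IO (Either SomeException (Maybe Int))
  pure (either (const True) (const False) r)
```

The lazy fold returns a result, while the strict one has to start from the left-most element and therefore fails on this tree.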
I have a simple toy example that seems to disagree with the garbage collector on what data structures can be reclaimed (aka memory leak). I am not looking for more memory-efficient versions of this algorithm (a good collection of better algorithms is here: Haskell Wiki - Prime numbers), but rather for an explanation of why the garbage collector is not identifying the old, out-of-scope and unused portions of the list and reclaiming that memory.
The code is here:
import Data.List (foldl')
erat' :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
erat' (c,b) ((x,y):xs)
  | c < x = (x,y) : erat' (c,b) xs
  | c == x = (x+y,y) : erat' (c,True) xs
  | c > x = (x+y,y) : erat' (c,b) xs
erat' (c,b) []
  | b = []
  | otherwise = [(c,c)]
erat :: [Integer] -> [(Integer,Integer)]
erat = foldl' (\a c -> erat' (c,False) a) []
primes :: Integer -> [Integer]
primes n = map snd $ erat [2..n]
In essence, calling primes with a positive integer will return a list of all prime numbers up to and including that number. A list of pairs of primes and their high water mark multiple is passed to erat', together with a pair including a candidate and a boolean (False for prime and True for non-prime). Every non-recursive call to erat' will pass a new list, and I would expect that the output would contain, at most, certain shared cells from the beginning of the list up to the point of the first change.
As soon as the modified cells in the list passed to erat' come out of scope, the memory should be flagged to be recovered, but as you can see when you try calling primes with a large enough number (1,000,000, for example), the memory utilization can quickly spike to tens of gigabytes.
Now, the question is: why is this happening? Shouldn't the generational garbage collector detect dereferenced list cells to reclaim them? And, shouldn't it be fairly easy for it to detect that they don't have references because:
a) nothing can have references from data structures older than itself;
b) there cannot be newer references because those cells/fragments are not even part of a referenceable data structure anymore, since it came out of scope?
Of course, a mutable data structure would take care of this, but I feel like resorting to mutability in a case like this is dropping some of the theoretical principles for Haskell on the floor.
Thanks to the people that commented (particularly Carl), I modified the algorithm slightly to add strictness (and the optimization of starting to cross off at the square of the new prime, since lower multiples will already be crossed off by multiples of lower primes).
This is the new version:
import Data.List (foldl')
erat' :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
erat' (c,b) ((x,y):xs)
  | c < x = x `seq` (x,y) : erat' (c,b) xs
  | c == x = x `seq` (x+y,y) : erat' (c,True) xs
  | c > x = x `seq` (x+y,y) : erat' (c,b) xs
erat' (c,b) []
  | b = []
  | otherwise = [(c*c,c)] -- lower multiples would be covered by multiples of lower primes
erat :: [Integer] -> [(Integer,Integer)]
erat = foldl' (\a c -> erat' (c,False) a) []
primes :: Integer -> [Integer]
primes n = map snd $ erat [2..n]
The memory consumption seems to still be quite significant. Are there any other changes to this algorithm that could help reduce the total memory utilization?
Since Will pointed out that I didn't provide full statistics, these are the numbers for a run of the updated version of primes listed just above, with 100000 as the parameter:
And after applying the changes that Will proposed, the memory usage is now down considerably. See, for example, on a run of primes for 100000 again:
And last, this is the final code after the proposed changes were incorporated:
import Data.List (foldl')
erat'' :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
erat'' (c,b) ((x,y):xs)
  | c < x  = (x,  y) : if x==y*y then (if b then xs
                                       else xs++[(c*c,c)])
                       else erat'' (c,b) xs
  | c == x = (x+y,y) : if x==y*y then xs
                       else erat'' (c,True) xs
  | c > x  = (x+y,y) : erat'' (c,b) xs
erat'' (c,True) [] = []
erat'' (c,False) [] = [(c*c,c)]
primes'' :: Integer -> [Integer]
primes'' n = map snd $ foldl' (\a c -> (if null a then 0 else
                 case last a of (x,y) -> y) `seq` erat'' (c,False) a) [] [2..n]
And finally a run for 1,000,000 to have a feeling for performance in this new version:
Assumption a) is false in the presence of laziness. And in fact, your code consists almost entirely of generating cons cells pointed to by older cons cells. erat' consumes a list element, then produces a (:) constructor pointing to a tuple and an unevaluated thunk which will perform a recursive call to erat'. Only when that thunk is later evaluated will the (:) list constructor actually point to its tail as a data structure. So yes, nearly every (:) you allocate in erat' is in fact pointing forward in time. (The only exception is the last one - [foo] is going to point to the pre-existing [] constructor when its (:) constructor is allocated.)
Assumption b) is nonsense in the presence of laziness. Scope determines visibility in Haskell, not lifetime. Lifetime depends on evaluation and reachability.
So what happens at runtime is that you build up a pipeline of erat' calls in erat. Each one of them holds on to as much of its input as has been evaluated, slowly consuming it. The interesting part is that your code doesn't evaluate anything in advance - it seems like it should actually stream pretty well - except for the fact that the pipeline is too deep. The pipeline created is approximately n stages - this is (inefficient!) trial division, not the sieve of Eratosthenes. You should only be adding prime numbers to the pipeline, not every number.
breaking update: You should use
map snd $ foldl' (\a c -> (if null a then 0 else
            case last a of (x,y) -> y) `seq` erat' (c,False) a) [] [2..n]
to force the list fully on each iteration. It will consume less memory and run faster.
The reason for the above is that foldl' only forces the accumulator to weak head normal form, and even using last a isn't enough, as it would be forced just to a pair (_,_), without forcing its constituents.
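The weak head normal form point can be seen in isolation with a small sketch. The names pairInWhnf and spineOnly are made up, and the erroring field is there only to show what does not get evaluated:

```haskell
import Data.List (foldl')

-- seq stops at the outermost constructor: the pair is built, its fields are not touched.
pairInWhnf :: Bool
pairInWhnf = (undefined :: Integer, undefined :: Integer) `seq` True

-- foldl' forces each accumulator only as far as its first (:) cell,
-- so the erroring first components are never evaluated here.
spineOnly :: Int
spineOnly = length (foldl' step [] [1 .. 5])
  where
    step :: [(Integer, Integer)] -> Integer -> [(Integer, Integer)]
    step acc c = (error "field forced", c) : acc
```

Both values evaluate without error, which is why forcing something deeper, such as the second component of the last pair, is needed to push the evaluation through the whole list.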
But when your erat' function is changed so that it stops scanning the interim list of primes and their multiples as soon as possible, and shares its tail whenever possible (as described below), it is faster without the forcing, even if using more memory.
Your (updated) code, edited a little for legibility:
g :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
g (c,b) ((x,y):xs)
  | c < x  = (x,  y) : g (c,b) xs     -- `c < x` forces the x already,
  | c == x = (x+y,y) : g (c,True) xs  -- no need for `seq`
  | c > x  = (x+y,y) : g (c,b) xs
g (c,True) [] = []
g (c,False) [] = [(c*c,c)]
primes :: Integer -> [Integer]
primes n = map snd $ foldl' (\a c -> g (c,False) a) [] [2..n]
So, your primes n is actually a little like a right fold on the reversed [2..n] list. Writing h for flip $ foldl' (\a c -> g (c,False) a), it is
= map snd $ h [2..n] $ []
= map snd $ h [3..n] $ [(2*2,2)]
= map snd $ h [4..n] $ (4,2) :(g (3,False) [])
= map snd $ h [5..n] $ (4+2,2):(g (4,True ) $ g (3,False) [])
= map snd $ h [6..n] $ (6,2) :(g (5,False) $ g (4,True ) $ g (3,False) [])
....
The strictness of foldl' has limited effect here as the accumulator is forced only to the weak head normal form.
Folding with (\a c -> last a `seq` g (c,False) a) would give us
= map snd $ ... $ g (3,False) [(2*2,2)]
= map snd $ ... $ g (4,False) [(4,2),(3*3,3)]
= map snd $ ... $ g (5,False) [(4+2,2),(9,3)]
= map snd $ ... $ g (6,False) [(6,2),(9,3),(5*5,5)]
= map snd $ ... $ g (7,False) [(6+2,2),(9,3),(25,5)]
= map snd $ ... $ g (8,False) [(8,2),(9,3),(25,5),(7*7,7)]
= map snd $ ... $ g (9,False) [(8+2,2),(9,3),(25,5),(49,7)]
= map snd $ ... $ g (10,False) [(10,2),(9+3,3),(25,5),(49,7)]
= map snd $ ... $ g (11,False) [(10+2,2),(12,3),(25,5),(49,7)]
....
= map snd $ ... $ g (49,False)
[(48+2,2),(48+3,3),(50,5),(49,7),(121,11)...(2209,47)]
....
but all these changes will be pushed through to the list by the final print anyway, so the laziness is not the immediate problem here (it causes stack overflow for bigger inputs, but that's secondary here). The problem is that your erat' (renamed g above) eventually pushes each entry through the whole list needlessly, recreating the whole list for each candidate number. This is a very heavy memory usage pattern.
It should stop as early as possible, and share the list's tail whenever possible:
g :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
g (c,b) ((x,y):xs)
  | c < x  = (x,  y) : if x==y*y then (if b then xs
                                       else xs++[(c*c,c)])
                       else g (c,b) xs
  | c == x = (x+y,y) : if x==y*y then xs
                       else g (c,True) xs
  | c > x  = (x+y,y) : g (c,b) xs
g (c,True) [] = []
g (c,False) [] = [(c*c,c)]
Compiled with -O2 and run standalone, it runs at about ~ N^1.9 versus your original function's ~ N^2.4..2.8 and rising, in producing primes up to N.
(Of course, a "normal" sieve of Eratosthenes should run at about ~ N^1.1, ideally, its theoretical time complexity being N log (log N).)
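Exponents like these are empirical orders of growth. One common way to estimate such an exponent from two timed runs is sketched below; the function name is made up, and the answer does not state how its own figures were obtained.

```haskell
-- Estimated exponent k in t ~ n^k, from two measurements (n1, t1) and (n2, t2):
-- k = log (t2 / t1) / log (n2 / n1).
empiricalOrder :: Double -> Double -> Double -> Double -> Double
empiricalOrder n1 t1 n2 t2 = logBase (n2 / n1) (t2 / t1)
```

For instance, if doubling the problem size quadruples the run time, the estimate comes out at about 2.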