Haskell garbage collector - algorithm

I have a simple toy example that seems to disagree with the garbage collector on what data structures can be reclaimed (aka memory leak). I am not trying to come up with more memory efficient versions of this algorithm (a good collection of better algorithms is here: Haskell Wiki - Prime numbers, rather an explanation why the garbage collector is not identifying the old, out of scope and unused portions of the list to reclaim that memory.
The code is here:
import Data.List (foldl')
erat' :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
erat' (c,b) ((x,y):xs)
| c < x = (x,y) : erat' (c,b) xs
| c == x = (x+y,y) : erat' (c,True) xs
| c > x = (x+y,y) : erat' (c,b) xs
erat' (c,b) []
| b = []
| otherwise = [(c,c)]
erat :: [Integer] -> [(Integer,Integer)]
erat = foldl' (\a c -> erat' (c,False) a) []
primes :: Integer -> [Integer]
primes n = map snd $ erat [2..n]
In essence, calling primes with a positive integer will return a list of all prime numbers up to and including that number. A list of pairs of primes and their high water mark multiple is passed to erat', together with a pair including a candidate and a boolean (False for prime and True for non-prime). Every non-recursive call to erat' will pass a new list, and I would expect that the output would contain, at most, certain shared cells from the beginning of the list up to the point of the first change.
As soon as the modified cells in the list passed to erat' come out of scope, the memory should be flagged to be recovered, but as you can see when you try calling primes with a large enough number (1,000,000, for example), the memory utilization can quickly spike to tens of gigabytes.
Now, the question is: why is this happening? Shouldn't the generational garbage collector detect dereferenced list cells to reclaim them? And, shouldn't it be fairly easy for it to detect that they don't have references because:
a) nothing can have references from data structures older than itself;
b) there cannot be newer references because those cells/fragments are not even part of a referenceable data structure anymore, since it came out of scope?
Of course, a mutable data structure would take care of this, but I feel like resorting to mutability in a case like this is dropping some of the theoretical principles for Haskell on the floor.
Thanks to the people that commented (particularly Carl), I modified the algorithm slightly to add strictness (and the optimization of starting crossing the square of the new prime, since lower multiples will be crossed by multiples of lower primes too).
This is the new version:
import Data.List (foldl')
erat' :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
erat' (c,b) ((x,y):xs)
| c < x = x `seq` (x,y) : erat' (c,b) xs
| c == x = x `seq` (x+y,y) : erat' (c,True) xs
| c > x = x `seq` (x+y,y) : erat' (c,b) xs
erat' (c,b) []
| b = []
| otherwise = [(c*c,c)] -- lower multiples would be covered by multiples of lower primes
erat :: [Integer] -> [(Integer,Integer)]
erat = foldl' (\a c -> erat' (c,False) a) []
primes :: Integer -> [Integer]
primes n = map snd $ erat [2..n]
The memory consumption seems to still be quite significant. Are there any other changes to this algorithm that could help reduce the total memory utilization?
Since Will pointed out that I didn't provide full statistics, these are the numbers for a run of the updated version of primes listed just above, with 100000 as the parameter:
And after applying the changes that Will proposed, the memory usage is now down considerably. See, for example, on a run of primes for 100000 again:
And last, this is the final code after the proposed changes were incorporated:
import Data.List (foldl')
erat'' :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
erat'' (c,b) ((x,y):xs)
| c < x = (x, y) : if x==y*y then (if b then xs
else xs++[(c*c,c)])
else erat'' (c,b) xs
| c == x = (x+y,y) : if x==y*y then xs
else erat'' (c,True) xs
| c > x = (x+y,y) : erat'' (c,b) xs
erat'' (c,True) [] = []
erat'' (c,False) [] = [(c*c,c)]
primes'' :: Integer -> [Integer]
primes'' n = map snd $ foldl' (\a c -> (if null a then 0 else
case last a of (x,y) -> y) `seq` erat'' (c,False) a) [] [2..n]
And finally a run for 1,000,000 to have a feeling for performance in this new version:

Assumption a) is false in the presence of laziness. And in fact, your code consists almost entirely of generating cons cells pointed to by older cons cells. erat' consumes a list element, then produces a (:) constructor pointing to a tuple and an unevaluated thunk which will perform a recursive call to erat'. Only when that thunk is later evaluated will the (:) list constructor actually point to its tail as a data structure. So yes, nearly every (:) you allocate in erat' is in fact pointing forward in time. (The only exception is the last one - [foo] is going to point to the pre-existing [] constructor when its (:) constructor is allocated.)
Assumption b) is nonsense in the presence of laziness. Scope determines visibility in Haskell, not lifetime. Lifetime depends on evaluation and reachability.
So what happens at runtime is that you build up pipeline of erat' calls in erat. Each one of them holds on to as much of its input as has been evaluated, slowly consuming it. The interesting part is that your code doesn't evaluate anything in advance - it seems like it should actually stream pretty well - except for the fact that the pipeline is too deep. The pipeline created is approximately n stages - this is (inefficient!) trial division, not the sieve of Eratosthenes. You should only be adding prime numbers to the pipeline, not every number.

breaking update: You should use
map snd $ foldl' (\a c -> (if null a then 0 else
case last a of (x,y) -> y) `seq` erat' (c,False) a) [] [2..n]
to force the list fully on each iteration. It will consume less memory and run faster.
The reason for the above is that foldl' only forces the accumulator to weak head normal form, and even using last a isn't enough, as it would be forced just to a pair (_,_), without forcing its constituents.
But when your erat' function is changed so that it stops scanning the interim list of primes and their multiples as soon as possible, and shares its tail whenever possible (as described below), it is faster without the forcing, even if using more memory.
Your (updated) code, edited a little for legibility:
g :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
g (c,b) ((x,y):xs)
| c < x = (x, y) : g (c,b) xs -- `c < x` forces the x already,
| c == x = (x+y,y) : g (c,True) xs -- no need for `seq`
| c > x = (x+y,y) : g (c,b) xs
g (c,True) [] = []
g (c,False) [] = [(c*c,c)]
primes :: Integer -> [Integer]
primes n = map snd $ foldl' (\a c -> g (c,False) a) [] [2..n]
So, your primes n is actually a little like a right fold on the reversed [2..n] list. Writing h for flip $ foldl' (\a c -> g (c,False) a), it is
= map snd $ h [2..n] $ []
= map snd $ h [3..n] $ [(2*2,2)]
= map snd $ h [4..n] $ (4,2) :(g (3,False) [])
= map snd $ h [5..n] $ (4+2,2):(g (4,True ) $ g (3,False) [])
= map snd $ h [6..n] $ (6,2) :(g (5,False) $ g (4,True ) $ g (3,False) [])
....
The strictness of foldl' has limited effect here as the accumulator is forced only to the weak head normal form.
Folding with (\a c -> last a `seq` g (c,False) a) would give us
= map snd $ ... $ g (3,False) [(2*2,2)]
= map snd $ ... $ g (4,False) [(4,2),(3*3,3)]
= map snd $ ... $ g (5,False) [(4+2,2),(9,3)]
= map snd $ ... $ g (6,False) [(6,2),(9,3),(5*5,5)]
= map snd $ ... $ g (7,False) [(6+2,2),(9,3),(25,5)]
= map snd $ ... $ g (8,False) [(8,2),(9,3),(25,5),(7*7,7)]
= map snd $ ... $ g (9,False) [(8+2,2),(9,3),(25,5),(49,7)]
= map snd $ ... $ g (10,False) [(10,2),(9+3,3),(25,5),(49,7)]
= map snd $ ... $ g (11,False) [(10+2,2),(12,3),(25,5),(49,7)]
....
= map snd $ ... $ g (49,False)
[(48+2,2),(48+3,3),(50,5),(49,7),(121,11)...(2209,47)]
....
but all these changes will be pushed through to the list by the final print anyway, so the laziness is not the immediate problem here (it causes stack overflow for bigger inputs, but that's secondary here). The problem is that your erat' (renamed g above) eventually pushes each entry through the whole list needlessly, recreating the whole list for each candidate number. This is a very heavy memory usage pattern.
It should stop as early as possible, and share the list's tail whenever possible:
g :: (Integer, Bool) -> [(Integer,Integer)] -> [(Integer,Integer)]
g (c,b) ((x,y):xs)
| c < x = (x, y) : if x==y*y then (if b then xs
else xs++[(c*c,c)])
else g (c,b) xs
| c == x = (x+y,y) : if x==y*y then xs
else g (c,True) xs
| c > x = (x+y,y) : g (c,b) xs
g (c,True) [] = []
g (c,False) [] = [(c*c,c)]
Compiled with -O2 and run standalone, it runs under ~ N1.9 vs your original function's ~ N2.4..2.8..and rising, producing primes up to N.
(of course a "normal" sieve of Eratosthenes should run at about ~ N1.1, ideally, its theoretical time complexity being N log (log N)).

Related

Is runInBoundThread the best tool for parallelism?

Say, I want to fold monoids in parallel. My computer has 8 cores. I have this function to split a list into equal-sized smaller lists (with bounded modulo-bias):
import Data.List
parallelize :: Int -> [a] -> [[a]]
parallelize 0 _ = []
parallelize n [] = replicate n []
parallelize n xs = let
(us,vs) = splitAt (quot (length xs) n) xs
in us : parallelize (n-1) vs
The first version of parallel fold I made was:
import Control.Concurrent
import Control.Concurrent.QSemN
import Data.Foldable
import Data.IORef
foldP :: Monoid m => [m] -> IO m
foldP xs = do
result <- newIORef mempty
sem <- newQSemN 0
n <- getNumCapabilities
let yss = parallelize n xs
for_ yss (\ys -> forkIO (modifyIORef result (fold ys <>) >> signalQSemN sem 1))
waitQSemN sem n
readIORef result
But usage of IORefs and semaphores seemed ugly to me. So I made another version:
import Data.Traversable
foldP :: Monoid m => [m] -> IO m
foldP xs = do
n <- getNumCapabilities
let yss = parallelize n xs
rs <- for yss (\ys -> runInUnboundThread (return (fold ys)))
return (fold rs)
The test code I used is:
import Data.Monoid
import System.CPUTime
main :: IO ()
main = do
start <- getCPUTime
Product result <- foldP (fmap Product [1 .. 100])
end <- getCPUTime
putStrLn ("Time took: " ++ show (end - start) ++ "ps.")
putStrLn ("Result: " ++ show result)
The second version of foldP outperformed the first version. When I used runInBoundThread instead of runInUnboundThread, it became even faster.
By what are these performance differences made?
TLDR; Use fold function from massiv package and you will likely get the most efficient solution in Haskell.
I would like to start by saying that the first thing that people forget when trying to implement concurrent patterns like this is exception handling. In the solution from the question the exception handling is non-existent thus it is totally wrong. Therefore I'd recommend to use existing implementations for common concurrency patterns. async is the goto library for concurrency, but for such use case it will not be the most efficient solution.
This particular example can easily be solved with scheduler package, in fact it is exactly the kind of stuff it was designed for. Here is how you can use it to achieve folding of monoids:
import Control.Scheduler
import Control.Monad.IO.Unlift
foldP :: (MonadUnliftIO m, Monoid n) => Comp -> [n] -> m n
foldP comp xs = do
rs <-
withScheduler comp $ \scheduler ->
mapM_ (scheduleWork scheduler . pure . fold) (parallelize (numWorkers scheduler) xs)
pure $ fold rs
See the Comp type for explanation on best parallelization strategies. From what I found in practice Par will usually work best, because it will use pinned threads created with forkOn
Note that the parallelize function is implemented inefficiently and dangerously as well, it is better to write it this way:
parallelize :: Int -> [a] -> [[a]]
parallelize n' xs' = go 0 id xs'
where
n = max 1 n'
-- at least two elements make sense to get benefit of parallel fold
k = max 2 $ quot (length xs') n
go i acc xs
| null xs = acc []
| i < n =
case splitAt k xs of
(ls, rs) -> go (i + 1) (acc . (ls :)) rs
| otherwise = acc . (xs:) $ []
One more bit of advise is that list is far from ideal data structure for parallelization and efficiency in general. In order to split the lists into chunks before parallelizing computation you already have to go through the data structure with parallelize, which can be avoided if you were to use an array. What I am getting at is use an array instead, as suggested in the beginning of this answer.

Is a lazy, breadth-first monadic rose tree unfold possible?

Data.Tree includes unfoldTreeM_BF and unfoldForestM_BF functions to construct trees breadth-first using the results of monadic actions. The tree unfolder can be written easily using the forest unfolder, so I'll focus on the latter:
unfoldForestM_BF :: Monad m =>
(b -> m (a, [b])) -> [b] -> m [Tree a]
Starting with a list of seeds, it applies a function to each, generating actions that will produce the tree roots and the seeds for the next level of unfolding. The algorithm used is somewhat strict, so using unfoldForestM_BF with the Identity monad is not exactly the same as using the pure unfoldForest. I've been trying to figure out if there's a way to make it lazy without sacrificing its O(n) time bound. If (as Edward Kmett suggested to me) this is impossible, I wonder if it would be possible to do it with a more constrained type, specifically requiring MonadFix rather than Monad. The concept there would be to (somehow) set up the pointers to the results of future computations while adding those computations to the to-do list, so if they are lazy in the effects of earlier computations they will be available immediately.
I previously claimed that the third solution presented below has the same strictness as the depth-first unfoldForest, which is not correct.
Your intuition that trees can be lazily unfolded breadth first is at least partially correct, even if we don't require a MonadFix instance. Solutions exist for the special cases when the branching factor is known to be finite and when the branching factor is known to be "large". We will start with a solution that runs in O(n) time for trees with finite branching factors including degenerate trees with only one child per node. The solution for finite branching factors will fail to terminate on trees with infinite branching factors, which we will rectify with a solution that that runs in O(n) time for trees with "large" branching factors greater than one including trees with infinite branching factor. The solution for "large" branching factors will run in O(n^2) time on degenerate trees with only one child or no children per node. When we combine the methods from both steps in an attempt to make a hybrid solution that runs in O(n) time for any branching factor we will get a solution that is lazier than the first solution for finite branching factors but cannot accommodate trees that make a rapid transition from an infinite branching factor to having no branches.
Finite Branching Factor
The general idea is that we will first build all the labels for an entire level and the seeds for the forests for the next level. We will then descend into the next level, building all of it. We will collect together the results from the deeper level to build the forests for the outer level. We will put the labels together with the forests to build the trees.
unfoldForestM_BF is fairly simple. If there are no seeds for the level it returns. After building all the labels, it takes the seeds for each forest and collects them together into one list of all the seeds to build the next level and unfolds the entire deeper level. Finally it constructs the forest for each tree from the structure of the seeds.
import Data.Tree hiding (unfoldTreeM_BF, unfoldForestM_BF)
unfoldForestM_BF :: Monad m => (b->m (a, [b])) -> [b] -> m [Tree a]
unfoldForestM_BF f [] = return []
unfoldForestM_BF f seeds = do
level <- sequence . fmap f $ seeds
let (labels, bs) = unzip level
deeper <- unfoldForestM_BF f (concat bs)
let forests = trace bs deeper
return $ zipWith Node labels forests
trace reconstructs the structure of nested lists from a flattened list.It is assumed that there is an item in [b] for each of the items anywhere in [[a]]. The use of concat ... trace to flatten all the information about ancestor levels prevents this implementation from working on trees with infinite children for a node.
trace :: [[a]] -> [b] -> [[b]]
trace [] ys = []
trace (xs:xxs) ys =
let (ys', rem) = takeRemainder xs ys
in ys':trace xxs rem
where
takeRemainder [] ys = ([], ys)
takeRemainder (x:xs) (y:ys) =
let ( ys', rem) = takeRemainder xs ys
in (y:ys', rem)
Unfolding a tree is trivial to write in terms of unfolding a forest.
unfoldTreeM_BF :: MonadFix m => (b->m (a, [b])) -> b -> m (Tree a)
unfoldTreeM_BF f = (>>= return . head) . unfoldForestMFix_BF f . (:[])
Large Branching Factor
The solution for large branching factor proceeds in much the same way as the solution for finite branching factor, except it keeps the entire structure of the tree around instead of concatenating the branches in a level to a single list and traceing that list. In addition to the imports used in the previous section, we will be using Compose to compose the functors for multiple levels of a tree together and Traversable to sequence across multi-level structures.
import Data.Tree hiding (unfoldForestM_BF, unfoldTreeM_BF)
import Data.Foldable
import Data.Traversable
import Data.Functor.Compose
import Prelude hiding (sequence, foldr)
Instead of flattening all of the ancestor structures together with concat we will wrap with Compose the ancestors and the seeds for the next level and recurse on the entire structure.
unfoldForestM_BF :: (Traversable t, Traceable t, Monad m) =>
(b->m (a, [b])) -> t b -> m (t (Tree a))
unfoldForestM_BF f seeds
| isEmpty seeds = return (fmap (const undefined) seeds)
| otherwise = do
level <- sequence . fmap f $ seeds
deeper <- unfoldForestM_BF f (Compose (fmap snd level))
return $ zipWithIrrefutable Node (fmap fst level) (getCompose deeper)
zipWithIrrefutable is a lazier version of zipWith that relies on the assumption that there is an item in the second list for each item in the first list. The Traceable structures are the Functors that can provide a zipWithIrrefutable. The laws for Traceable are for every a, xs, and ys if fmap (const a) xs == fmap (const a) ys then zipWithIrrefutable (\x _ -> x) xs ys == xs and zipWithIrrefutable (\_ y -> y) xs ys == ys. Its strictness is given for every f and xs by zipWithIrrefutable f xs ⊥ == fmap (\x -> f x ⊥) xs.
class Functor f => Traceable f where
zipWithIrrefutable :: (a -> b -> c) -> f a -> f b -> f c
We can combine two lists lazily if we already know they have the same structure.
instance Traceable [] where
zipWithIrrefutable f [] ys = []
zipWithIrrefutable f (x:xs) ~(y:ys) = f x y : zipWithIrrefutable f xs ys
We can combine the composition of two functors if we know that we can combine each functor.
instance (Traceable f, Traceable g) => Traceable (Compose f g) where
zipWithIrrefutable f (Compose xs) (Compose ys) =
Compose (zipWithIrrefutable (zipWithIrrefutable f) xs ys)
isEmpty checks for an empty structure of nodes to expand like the pattern match on [] did in the solution for finite branching factors.
isEmpty :: Foldable f => f a -> Bool
isEmpty = foldr (\_ _ -> False) True
The astute reader may notice that zipWithIrrefutable from Traceable is very similar to liftA2 which is half of the definition of Applicative.
Hybrid Solution
The hybrid solution combines the approaches of the finite solution and the "large" solution. Like the finite solution, we will compress and decompress the tree representation at each step. Like the solution for "large" branching factors we will use a data structure that allows stepping over complete branches. The finite branching factor solution used a data type that is flattened everywhere, [b]. The "large" branching factor solution used a data type that was flattened nowhere: increasingly nested lists starting with [b] then [[b]] then [[[b]]] and so on. In between these structures would be nested lists that either stop nesting and just hold a b or keep nesting and hold [b]s. That pattern of recursion is described in general by the Free monad.
data Free f a = Pure a | Free (f (Free f a))
We will be working specifically with Free [] which looks like.
data Free [] a = Pure a | Free [Free [] a]
For the hybrid solution we will repeat all of its imports and components so that the code below here should be complete working code.
import Data.Tree hiding (unfoldTreeM_BF, unfoldForestM_BF)
import Data.Traversable
import Prelude hiding (sequence, foldr)
Since we will be working with Free [], we will provide it with a zipWithIrrefutable.
class Functor f => Traceable f where
zipWithIrrefutable :: (a -> b -> c) -> f a -> f b -> f c
instance Traceable [] where
zipWithIrrefutable f [] ys = []
zipWithIrrefutable f (x:xs) ~(y:ys) = f x y : zipWithIrrefutable f xs ys
instance (Traceable f) => Traceable (Free f) where
zipWithIrrefutable f (Pure x) ~(Pure y ) = Pure (f x y)
zipWithIrrefutable f (Free xs) ~(Free ys) =
Free (zipWithIrrefutable (zipWithIrrefutable f) xs ys)
The breadth first traversal will look very similar to the original version for the finitely branching tree. We build the current labels and seeds for the current level, compress the structure of the remainder of the tree, do all the work for the remaining depths, and decompress the structure of the results to get the forests to go with the labels.
unfoldFreeM_BF :: (Monad m) => (b->m (a, [b])) -> Free [] b -> m (Free [] (Tree a))
unfoldFreeM_BF f (Free []) = return (Free [])
unfoldFreeM_BF f seeds = do
level <- sequence . fmap f $ seeds
let (compressed, decompress) = compress (fmap snd level)
deeper <- unfoldFreeM_BF f compressed
let forests = decompress deeper
return $ zipWithIrrefutable Node (fmap fst level) forests
compress takes a Free [] holding the seeds for forests [b] and flattens the [b] into the Free to get a Free [] b. It also returns a decompress function that can be used to undo the flattening to get the original structure back. We compress away branches with no remaining seeds and branches that only branch one way.
compress :: Free [] [b] -> (Free [] b, Free [] a -> Free [] [a])
compress (Pure [x]) = (Pure x, \(Pure x) -> Pure [x])
compress (Pure xs ) = (Free (map Pure xs), \(Free ps) -> Pure (map getPure ps))
compress (Free xs) = wrapList . compressList . map compress $ xs
where
compressList [] = ([], const [])
compressList ((Free [],dx):xs) = let (xs', dxs) = compressList xs
in (xs', \xs -> dx (Free []):dxs xs)
compressList ( (x,dx):xs) = let (xs', dxs) = compressList xs
in (x:xs', \(x:xs) -> dx x:dxs xs)
wrapList ([x], dxs) = (x, \x -> Free (dxs [x]))
wrapList (xs , dxs) = (Free xs, \(Free xs) -> Free (dxs xs ))
Each compression step also returns a function that will undo it when applied to a Free [] tree with the same structure. All of these functions are partially defined; what they do to Free [] trees with a different structure is undefined. For simplicity we also define partial functions for the inverses of Pure and Free.
getPure (Pure x) = x
getFree (Free xs) = xs
Both unfoldForestM_BF and unfoldTreeM_BF are defined by packaging their argument up into a Free [] b and unpackaging the results assuming they are in the same structure.
unfoldTreeM_BF :: MonadFix m => (b->m (a, [b])) -> b -> m (Tree a)
unfoldTreeM_BF f = (>>= return . getPure) . unfoldFreeM_BF f . Pure
unfoldForestM_BF :: MonadFix m => (b->m (a, [b])) -> [b] -> m [Tree a]
unfoldForestM_BF f = (>>= return . map getPure . getFree) . unfoldFreeM_BF f . Free . map Pure
A more elegant version of this algorithm can probably be made by recognizing that >>= for a Monad is grafting on trees and both Free and FreeT provide monad instances. Both compress and compressList probably have more elegant presentations.
The algorithm presented above is not lazy enough to allow querying trees that branch an infinite number of ways and then terminate. A simple counter example is the following generating function unfolded from 0.
counterExample :: Int -> (Int, [Int])
counterExample 0 = (0, [1, 2])
counterExample 1 = (1, repeat 3)
counterExample 2 = (2, [3])
counterExample 3 = (3, [])
This tree would look like
0
|
+- 1
| |
| +- 3
| |
| `- 3
| |
| ...
|
`- 2
|
+- 3
Attempting to descend the second branch (to 2) and inspect the remaining finite sub-tree will fail to terminate.
Examples
The following examples demonstrate that all implementations of unfoldForestM_BF run actions in breadth first order and that runIdentity . unfoldTreeM_BF (Identity . f) has the same strictness as unfoldTree for trees with finite branching factor. For trees with inifinite branching factor, only the solution for "large" branching factors has the same strictness as unfoldTree. To demonstrate laziness we'll define three infinite trees - a unary tree with one branch, a binary tree with two branches, and an infinitary tree with an infinite number of branches for each node.
mkUnary :: Int -> (Int, [Int])
mkUnary x = (x, [x+1])
mkBinary :: Int -> (Int, [Int])
mkBinary x = (x, [x+1,x+2])
mkInfinitary :: Int -> (Int, [Int])
mkInfinitary x = (x, [x+1..])
Together with unfoldTree, we will define unfoldTreeDF in terms of unfoldTreeM to check that unfoldTreeM really is lazy like you claimed and unfoldTreeBF in terms of unfoldTreeMFix_BF to check that the new implementation is just as lazy.
import Data.Functor.Identity
unfoldTreeDF f = runIdentity . unfoldTreeM (Identity . f)
unfoldTreeBF f = runIdentity . unfoldTreeM_BF (Identity . f)
To get finite pieces of these infinite trees, even the infinitely branching one, we'll define a way to take from a tree as long as its labels match a predicate. This could be written more succinctly in terms of the ability to apply a function to every subForest.
takeWhileTree :: (a -> Bool) -> Tree a -> Tree a
takeWhileTree p (Node label branches) = Node label (takeWhileForest p branches)
takeWhileForest :: (a -> Bool) -> [Tree a] -> [Tree a]
takeWhileForest p = map (takeWhileTree p) . takeWhile (p . rootLabel)
This lets us define nine example trees.
unary = takeWhileTree (<= 3) (unfoldTree mkUnary 0)
unaryDF = takeWhileTree (<= 3) (unfoldTreeDF mkUnary 0)
unaryBF = takeWhileTree (<= 3) (unfoldTreeBF mkUnary 0)
binary = takeWhileTree (<= 3) (unfoldTree mkBinary 0)
binaryDF = takeWhileTree (<= 3) (unfoldTreeDF mkBinary 0)
binaryBF = takeWhileTree (<= 3) (unfoldTreeBF mkBinary 0)
infinitary = takeWhileTree (<= 3) (unfoldTree mkInfinitary 0)
infinitaryDF = takeWhileTree (<= 3) (unfoldTreeDF mkInfinitary 0)
infinitaryBF = takeWhileTree (<= 3) (unfoldTreeBF mkInfinitary 0)
All five methods have the same output for the unary and binary trees. The output comes from putStrLn . drawTree . fmap show
0
|
`- 1
|
`- 2
|
`- 3
0
|
+- 1
| |
| +- 2
| | |
| | `- 3
| |
| `- 3
|
`- 2
|
`- 3
However, the breadth first traversal from the finite branching factor solution is not sufficiently lazy for a tree with an infinite branching factor. The other four methods output the entire tree
0
|
+- 1
| |
| +- 2
| | |
| | `- 3
| |
| `- 3
|
+- 2
| |
| `- 3
|
`- 3
The tree generated with unfoldTreeBF for the finite branching factor solution can never be completely drawn past its first branches.
0
|
+- 1
| |
| +- 2
| | |
| | `- 3
| |
| `- 3
The construction is definitely breadth first.
mkDepths :: Int -> IO (Int, [Int])
mkDepths d = do
print d
return (d, [d+1, d+1])
mkFiltered :: (Monad m) => (b -> Bool) -> (b -> m (a, [b])) -> (b -> m (a, [b]))
mkFiltered p f x = do
(a, bs) <- f x
return (a, filter p bs)
binaryDepths = unfoldTreeM_BF (mkFiltered (<= 2) mkDepths) 0
Running binaryDepths outputs the outer levels before the inner ones
0
1
1
2
2
2
2
From Lazy to Downright Slothful
The hybrid solution from the earlier section is not lazy enough to have the same strictness semantics as Data.Tree's unfoldTree. It is the first in a series of algorithms, each slightly lazier than their predecessor, but none lazy enough to have the same strictness semantics as unfoldTree.
The hybrid solution does not provide a guarantee that exploring a portion of a tree doesn't demand exploring other portions of the same tree. Nor will the code presented below. In one particular yet common case identified by dfeuer exploring only a log(N) sized slice of a finite tree forces the entirety of the tree. This happens when exploring the last descendant of each branch of a tree with constant depth. When compressing the tree we throw out every trivial branch with no descendants, which is necessary to avoid O(n^2) running time. We can only lazily skip over this portion of compression if we can quickly show that a branch has at least one descendant and we can therefore reject the pattern Free []. At the greatest depth of the tree with constant depth, none of the branches have any remaining descendants, so we can never skip a step of the compression. This results in exploring the entire tree to be able to visit the very last node. When the entire tree to that depth is non-finite due to infinite branching factor, exploring a portion of the tree fails to terminate when it would terminate when generated by unfoldTree.
The compression step in the hybrid solution section compresses away branches with no descendants in the first generation that they can be discovered in, which is optimal for compression but not optimal for laziness. We can make the algorithm lazier by delaying when this compression occurs. If we delay it by a single generation (or even any constant number of generations) we will maintain the O(n) upper bound on time. If we delay it by a number of generations that somehow depends on N we would necessarily sacrifice the O(N) time bound. In this section we will delay the compression by a single generation.
To control how compression happens, we will separate stuffing the innermost [] into the Free [] structure from compressing away degenerate branches with 0 or 1 descendants.
Because part of this trick doesn't work without a lot of laziness in the compression, we will adopt a paranoid level of excessively slothful laziness everywhere. If anything about a result other than the tuple constructor (,) could be determined without forcing part of its input with a pattern match we will avoid forcing it until it is necessary. For tuples, anything pattern matching on them will do so lazily. Consequently, some of the code below will look like core or worse.
bindFreeInvertible replaces Pure [b,...] with Free [Pure b,...]
bindFreeInvertible :: Free [] ([] b) -> (Free [] b, Free [] a -> Free [] ([] a))
bindFreeInvertible = wrapFree . go
where
-- wrapFree adds the {- Free -} that would have been added in both branches
wrapFree ~(xs, dxs) = (Free xs, dxs)
go (Pure xs) = ({- Free -} (map Pure xs), Pure . map getPure . getFree)
go (Free xs) = wrapList . rebuildList . map bindFreeInvertible $ xs
rebuildList = foldr k ([], const [])
k ~(x,dx) ~(xs, dxs) = (x:xs, \(~(x:xs)) -> dx x:dxs xs)
wrapList ~(xs, dxs) = ({- Free -} xs, \(~(Free xs)) -> Free (dxs xs)))
compressFreeList removes occurrences of Free [] and replaces Free [xs] with xs.
compressFreeList :: Free [] b -> (Free [] b, Free [] a -> Free [] a)
compressFreeList (Pure x) = (Pure x, id)
compressFreeList (Free xs) = wrapList . compressList . map compressFreeList $ xs
where
compressList = foldr k ([], const [])
k ~(x,dx) ~(xs', dxs) = (x', dxs')
where
x' = case x of
Free [] -> xs'
otherwise -> x:xs'
dxs' cxs = dx x'':dxs xs''
where
x'' = case x of
Free [] -> Free []
otherwise -> head cxs
xs'' = case x of
Free [] -> cxs
otherwise -> tail cxs
wrapList ~(xs, dxs) = (xs', dxs')
where
xs' = case xs of
[x] -> x
otherwise -> Free xs
dxs' cxs = Free (dxs xs'')
where
xs'' = case xs of
[x] -> [cxs]
otherwise -> getFree cxs
The overall compression will not bind the Pure []s into Frees until after the degenerate Frees have been compressed away, delaying compression of degenerate Frees introduced in one generation to the next generation's compression.
compress :: Free [] [b] -> (Free [] b, Free [] a -> Free [] [a])
compress xs = let ~(xs' , dxs' ) = compressFreeList xs
~(xs'', dxs'') = bindFreeInvertible xs'
in (xs'', dxs' . dxs'')
Out of continued paranoia, the helpers getFree and getPure are also made irrefutably lazy.
getFree ~(Free xs) = xs
getPure ~(Pure x) = x
This very quickly solves the problematic example dfeuer discovered
print . until (null . subForest) (last . subForest) $
flip unfoldTreeBF 0 (\x -> (x, if x > 5 then [] else replicate 10 (x+1)))
But since we only delayed the compression by 1 generation, we can recreate exactly the same problem if the very last node of the very last branch is 1 level deeper than all of the other branches.
print . until (null . subForest) (last . subForest) $
flip unfoldTreeBF (0,0) (\(x,y) -> ((x,y),
if x==y
then if x>5 then [] else replicate 9 (x+1, y) ++ [(x+1, y+1)]
else if x>4 then [] else replicate 10 (x+1, y)))

Slower execution when using an infinite list

I'm beginning to try and get my head round haskell performance, and what makes things fast and slow, and I'm a little confused by this.
I have two implementations of a function that generates a list of primes up to a certain value. The first is straight off the Haskell wiki:
primesTo :: (Ord a, Num a, Enum a) => a -> [a]
primesTo m = eratos [2..m] where
eratos [] = []
eratos (p:xs) = p : eratos (xs `minus` [p*p, p*p+p..m])
The second is the same, but using an infinite list internally:
primes2 :: (Ord a, Num a, Enum a) => a -> [a]
primes2 m = takeWhile (<= m) (eratos [2..]) where
eratos [] = []
eratos (p:xs) = p : eratos (xs `minus` [p*p, p*p+p..])
In both cases, the minus function is:
minus :: (Ord a) => [a] -> [a] -> [a]
minus (x:xs) (y:ys) = case (compare x y) of
LT -> x : minus xs (y:ys)
EQ -> minus xs ys
GT -> minus (x:xs) ys
minus xs _ = xs
The latter implementation is significantly (~100x) slower than the former, and I don't get why. I would have thought that haskell's lazy evalutation would make them fairly equivalent under the hood.
This is obviously a reduced test case for the purposes of the question - in real life the optimisation would be no problem (although I don't understand why it is needed), but to me a function that just generates an infinite list of primes is more generically useful than a finite list, but appears slower to work with.
Looks like to me that there's a big difference between
(xs `minus` [p*p, p*p+p..m]) -- primesTo
(xs `minus` [p*p, p*p+p..]) -- primes2
The function minus steps through lists pairwise and terminates when one list reaches the end. In the first minus expression above, this occurs in no more than (m-p*p)/p steps when the latter list is exhausted. In the second one, it will always take steps on the order of length xs.
So your infinite lists have disabled at least one meaningful optimization.
One difference is that in the second case you need to generate one extra prime. You need to generate the first prime greater than m before takeWhile knows its time to stop.
Additionally, the [..m] bounds on both the list to filter and the lists of multiples help reduce the number of calculations. Whenever one of these lists gets empty minus immediately returns via its secons clause while in the infinite case the minus gets stuck in the first case. You can explore this a bit better if you also test the cases where only one of the lists is infinite:
--this is also slow
primes3 :: (Ord a, Num a, Enum a) => a -> [a]
primes3 m = takeWhile (<= m) (eratos [2..m]) where
eratos [] = []
eratos (p:xs) = p : eratos (xs `minus` [p*p, p*p+p..])
--this fast
primes4 :: (Ord a, Num a, Enum a) => a -> [a]
primes4 m = takeWhile (<= m) (eratos [2..]) where
eratos [] = []
eratos (p:xs) = p : eratos (xs `minus` [p*p, p*p+p..m])

Writing infinite list to skip every factor of p?

How can I efficiently represent the list [0..] \\ [t+0*p, t+1*p ..]?
I have defined:
Prelude> let factors p t = [t+0*p, t+1*p ..]
I want to efficiently represent an infinite list that is the difference of [0..] and factors p t, but using \\ from Data.List requires too much memory for even medium-sized lists:
Prelude Data.List> [0..10000] \\ (factors 5 0)
<interactive>: out of memory
I know that I can represent the values between t+0*p and t+1*p with:
Prelude> let innerList p1 p2 t = [t+p1+1, t+p1+2 .. t+p2-1]
Prelude> innerList 0 5 0
[1,2,3,4]
However, repeatedly calculating and concatenating innerList for increasing intervals seems clumsy.
Can I efficiently represent [0..] \\ (factors p t) without calculating rem or mod for each element?
For the infinite list [0..] \\ [t,t+p..],
yourlist t p = [0..t-1] ++ [i | m <- [0,p..], i <- [t+m+1..t+m+p-1]]
Of course this approach doesn't scale, at all, if you'd want to remove some other factors, like
[0..] \\ [t,t+p..] \\ [s,s+q..] \\ ...
in which case you'll have to remove them in sequence with minus, mentioned in Daniel Fischer's answer. There is no magic bullet here.
But there's also a union, with which the above becomes
[0..] \\ ( [t,t+p..] `union` [s,s+q..] `union` ... )
the advantage is, we can arrange the unions in a tree, and get algorithmic improvement.
You can't use (\\) for that, because
(\\) :: (Eq a) => [a] -> [a] -> [a]
(\\) = foldl (flip delete)
the list of elements you want to remove is infinite, and a left fold never terminates when the list it folds over is infinite.
If you rather want to use something already written than write it yourself, you can use minus from the data-ordlist package.
The performance should be adequate.
Otherwise,
minus :: Ord a => [a] -> [a] -> [a]
minus xxs#(x:xs) yys#(y:ys)
| x < y = x : minus xs yys
| x == y = minus xs ys
| otherwise = minus xss ys
minus xs _ = xs
You can use a list comprehesion with a predicate, using rem:
>>> let t = 0
>>> let p = 5
>>> take 40 $ [ x | x <- [1..], x `rem` p /= t ]
[1,2,3,4,6,7,8,9,11,12,13,14,16,17,18,19,21,22,23,24,26,27,28,29,31,32,33,34,36,37,38,39,41,42,43,44,46,47,48,49]
If you want efficiency, why does your solution have to use list comprehension syntax?
Why not something like this?
gen' n i p | i == p = gen' (n + p) 1 p
gen' n i p = (n+i) : gen' n (i+1) p
gen = gen' 0 1
and then do
gen 5
Because you have ascending lists, you can simply lazily merge them:
nums = [1..]
nogos = factors p t
result = merge nums (dropWhile (<head nums) nogos) where
merge (a:as) (b:bs)
| a < b = a : merge as (b:bs)
| a == b = merge as bs
| otherwise = error "should not happen"
Writing this in a general way so that we have a function that builds the difference of two infinite lists, provided only that they are in ascending order, is left as exercise. In the end, the following should be possible
[1..] `infiniteDifference` primes `infiniteDifference` squares
For this, make it a left associative operator.

Two simple codes to generate divisors of a number. Why is the recursive one faster?

While solving a problem, I had to calculate the divisors of a number. I have two implementations that produce all divisors > 1 for a given number.
The first is using simple recursion:
divisors :: Int64 -> [Int64]
divisors k = divisors' 2 k
where
divisors' n k | n*n > k = [k]
| n*n == k = [n, k]
| k `mod` n == 0 = (n:(k `div` n):result)
| otherwise = result
where result = divisors' (n+1) k
The second one uses list processing functions from the Prelude:
divisors2 :: Int64 -> [Int64]
divisors2 k = k : (concatMap (\x -> [x, k `div` x]) $!
filter (\x -> k `mod` x == 0) $!
takeWhile (\x -> x*x <= k) [2..])
I find that the first implementation is faster (I printed the whole list returned, so that no part of the result remains unevaluated due to laziness). The two implementations produce differently ordered divisors, but that is not a problem for me. (In fact, if k is a perfect square, the square root is output twice in the second implementation - again not a problem).
In general are such recursive implementations faster in Haskell? Also, I would appreciate any pointers to make either of these codes faster. Thanks!
EDIT:
Here is the code I am using to compare these two implementations for performance: https://gist.github.com/3414372
Here are my timing measurements:
Using divisor2 with strict evaluation ($!)
$ ghc --make -O2 div.hs
[1 of 1] Compiling Main ( div.hs, div.o )
Linking div ...
$ time ./div > /tmp/out1
real 0m7.651s
user 0m7.604s
sys 0m0.012s
Using divisors2 with lazy evaluation ($):
$ ghc --make -O2 div.hs
[1 of 1] Compiling Main ( div.hs, div.o )
Linking div ...
$ time ./div > /tmp/out1
real 0m7.461s
user 0m7.444s
sys 0m0.012s
Using function divisors
$ ghc --make -O2 div.hs
[1 of 1] Compiling Main ( div.hs, div.o )
Linking div ...
$ time ./div > /tmp/out1
real 0m7.058s
user 0m7.036s
sys 0m0.020s
The recursive version is not in general faster than the list-based version. This is because the GHC compiler employs List fusion optimizations when a computation follows a certain pattern. This means that list generators and "list transformers" might be fused into one big generator instead.
However, when you use $!, you basically tell the compiler to "Please produce the first cons of this list before performing the next step." This means that GHC is forced to at least compute one intermediate list element, which disables the whole fusion optimization entirely.
So, the second algorithm is slower, because you produce intermediate lists that have to be constructed and destructed, while the recursive algorithm simply produces a single list straight away.
Since you asked, to make it faster a different algorithm should be used. Simple and straightforward is to find a prime factorization first, then construct the divisors from it somehow.
Standard prime factorization by trial division is:
factorize :: Integral a => a -> [a]
factorize n = go n (2:[3,5..]) -- or: `go n primes`
where
go n ds#(d:t)
| d*d > n = [n]
| r == 0 = d : go q ds
| otherwise = go n t
where (q,r) = quotRem n d
-- factorize 12348 ==> [2,2,3,3,7,7,7]
Equal prime factors can be grouped and counted:
import Data.List (group)
primePowers :: Integral a => a -> [(a, Int)]
primePowers n = [(head x, length x) | x <- group $ factorize n]
-- primePowers = map (head &&& length) . group . factorize
-- primePowers 12348 ==> [(2,2),(3,2),(7,3)]
Divisors are usually constructed, though out of order, with:
divisors :: Integral a => a -> [a]
divisors n = map product $ sequence
[take (k+1) $ iterate (p*) 1 | (p,k) <- primePowers n]
Hence, we have
numDivisors :: Integral a => a -> Int
numDivisors n = product [ k+1 | (_,k) <- primePowers n]
The product here comes from the sequence in the definition above it, because sequence :: Monad m => [m a] -> m [a] for list monad m ~ [] constructs lists of all possible combinations of elements picked by one from each member list, sequence_lists = foldr (\xs rs -> [x:r | x <- xs, r <- rs]) [[]], so that length . sequence_lists === product . map length, and or course length . take n === n for infinite argument lists.
In-order generation is possible, too:
ordDivisors :: Integral a => a -> [a]
ordDivisors n = foldr (\(p,k)-> foldi merge [] . take (k+1) . iterate (map (p*)))
[1] $ reverse $ primePowers n
foldi :: (a -> a -> a) -> a -> [a] -> a
foldi f z (x:xs) = f x (foldi f z (pairs xs)) where
pairs (x:y:xs) = f x y:pairs xs
pairs xs = xs
foldi f z [] = z
merge :: Ord a => [a] -> [a] -> [a]
merge (x:xs) (y:ys) = case (compare y x) of
LT -> y : merge (x:xs) ys
_ -> x : merge xs (y:ys)
merge xs [] = xs
merge [] ys = ys
{- ordDivisors 12348 ==>
[1,2,3,4,6,7,9,12,14,18,21,28,36,42,49,63,84,98,126,147,196,252,294,343,441,588,
686,882,1029,1372,1764,2058,3087,4116,6174,12348] -}
This definition is productive, too, i.e. it starts producing the divisors right away, without noticeable delay:
{- take 20 $ ordDivisors $ product $ concat $ replicate 5 $ take 11 primes
==> [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
(0.00 secs, 525068 bytes)
numDivisors $ product $ concat $ replicate 5 $ take 11 primes
==> 362797056 -}

Resources