Space leak in list program - performance

I am solving some Project Euler problems in Haskell. I wrote a program for one of its riddles, and it did not work as I expected.
When I looked at the task manager while the program was running, I saw that it was using over 1 GB of RAM under GHC. A friend of mine wrote a program with the same meaning in Java and succeeded in 7 seconds.
import Data.List

opl = find vw $ map (\x -> fromDigits (x ++ [0,0,9]))
    $ sequence [[1],re,[2],re,[3],re,[4],re,[5],re,[6],re,[7],re,[8],re]

vw x = hh^2 == x
  where hh = (round . sqrt . fromIntegral) x

re = [0..9]

fromDigits x = foldl1 (\n m -> 10*n + m) x
I know this program would output the number I want given enough RAM and time, but there has to be a better-performing way.

The main problem here is that sequence has a space leak. It is defined like this:
sequence [] = [[]]
sequence (xs:xss) = [ y:ys | y <- xs, ys <- sequence xss ]
so the list produced by the recursive call sequence xss is re-used for each element of xs, and therefore it can't be discarded until the end. A version without the space leak is
myseq :: [[a]] -> [[a]]
myseq xs = go (reverse xs) []
  where
    go []       acc = [acc]
    go (xs:xss) acc = concat [ go xss (x:acc) | x <- xs ]
PS. the answer seems to be Just 1229314359627783009
Edit: a version avoiding the concat:
seqlists :: [[a]] -> [[a]]
seqlists xss = go (reverse xss) [] []
  where
    go []       acc rest = acc : rest
    go (xs:xss) acc rest = foldr (\y r -> go xss (y:acc) r) rest xs
Note that both of these versions generate the results in a different order from the standard sequence, so while they work for this problem, we can't use either as a specialised version of sequence.
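For a concrete illustration of the order difference (a quick check using the definitions above; myseq enumerates in the same order as seqlists):
ghci> sequence [[1,2],[3,4]]
[[1,3],[1,4],[2,3],[2,4]]
ghci> seqlists [[1,2],[3,4]]
[[1,3],[2,3],[1,4],[2,4]]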

Following on from the answer given by Simon Marlow, here's a version of sequence that avoids the space leak while otherwise working just like the original, including preserving the order.
It still uses the nice, simple list comprehension of the original sequence - the only difference is that a fake data dependency is introduced that prevents the recursive call from being shared.
sequenceDummy d [] = d `seq` [[]]
sequenceDummy _ (xs:xss) = [ y:ys | y <- xs, ys <- sequenceDummy (Just y) xss ]
sequenceUnshared = sequenceDummy Nothing
I think this is a better way of avoiding the sharing that leads to the space leak.
I'd blame the excessive sharing on the "full laziness" transformation. Normally this does a great job of creating sharing that avoids recomputation, but sometimes recomputation is much more efficient than storing shared results.
It'd be nice if there were a more direct way to tell the compiler not to share a specific expression - the dummy Maybe argument above works and is efficient, but it's basically a hack that's just complicated enough that GHC can't tell there's no real dependency. (In a strict language you don't have these issues, because you only get sharing where you explicitly bind a variable to a value.)
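(A coarser, module-wide alternative I'm aware of, not mentioned above: GHC's full-laziness transformation can be switched off for a whole module, which prevents this kind of sharing everywhere, at the cost of also losing its beneficial sharing:)
{-# OPTIONS_GHC -fno-full-laziness #-}
-- Disables the full-laziness transformation for this entire module.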

EDIT: I think I'm wrong here - changing the type signature to :: Maybe Word64 (which would be enough bits for this problem I think) also takes forever / has a space leak, so it couldn't be the old Integer bug.
Your problem seems to be an old GHC bug (that I thought was fixed) with Integer causing a space leak. The below code finishes in about 150 ms when compiled with -O2.
import Data.List
import Data.Word

main = print opl

opl :: Maybe Word32
opl = find vw $ map (\x -> fromDigits (x ++ [0,0,9]))
    $ sequence [[1],re,[2],re,[3],re,[4],re,[5],re,[6],re,[7],re,[8],re]

vw x = hh^2 == x
  where hh = (round . sqrt . fromIntegral) x

re = [0..9]

fromDigits x = foldl1 (\n m -> 10*n + m) x

Since you're looking for a nineteen-digit number with the characteristics tested in vw, I'd try to simplify the construction in the mapped function: just say fromDigits x * 1000 + 9 for starters. Appending to a list is O(length of the left list), so tacking those last three digits onto the end hurts the computation time quite a bit.
As an aside (to you both), using the strict version of the fold (foldl1') will also help.
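A sketch of what those two changes might look like together (my rewrite, untested, with re and vw as defined in the question):
import Data.List (find, foldl1')

-- Add the trailing 009 arithmetically instead of appending [0,0,9],
-- and use the strict left fold so the accumulator stays evaluated.
opl = find vw $ map (\x -> fromDigits x * 1000 + 9)
    $ sequence [[1],re,[2],re,[3],re,[4],re,[5],re,[6],re,[7],re,[8],re]

fromDigits = foldl1' (\n m -> 10*n + m)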

Related

Optimize a list function that creates too much garbage (not stack overflow)

I have a Haskell function that's causing more than 50% of all the allocations of my program and making 60% of my run time be taken up by the GC. I run with a small stack (-K10K) so there is no stack overflow, but can I make this function faster, with less allocation?
The goal here is to calculate the product of a matrix and a vector. I cannot use hmatrix, for example, because this is part of a bigger function using the ad automatic differentiation package, so I need to use lists of Num. At runtime I suppose the use of the Numeric.AD module means my types must be Scalar Double.
listMProd :: (Num a) => [a] -> [a] -> [a]
listMProd mdt vdt = go mdt vdt 0
  where
    go []     _      s  = [s]
    go ls     []     s  = s : go ls vdt 0
    go (y:ys) (x:xs) ix = go ys xs (y*x + ix)
Basically we loop through the matrix, multiplying and adding into an accumulator until we reach the end of the vector, store the result, and then restart with the vector again. I have a QuickCheck test verifying that I get the same result as the matrix/vector product from hmatrix.
I have tried with foldl, foldr, etc. Nothing I've tried makes the function faster (and some things like foldr cause a space leak).
Running with profiling tells me, on top of the fact that this function is where most of the time and allocation is spent, that there are loads of Cells being created, Cells being a data type from the ad package.
A simple test to run:
import Numeric.AD

main = do
    let m, v :: [Double]
        m = replicate 400 0.2
        v = replicate 4 0.1
        mycost v m = sum $ listMProd m v
        mygrads = gradientDescent (mycost (map auto v)) (map auto m)
    print $ mygrads !! 1000
This on my machine tells me GC is busy 47% of the time.
Any ideas?
A very simple optimization is to make the go function strict in its accumulator parameter, because the accumulator is small, can be unboxed if a is primitive, and always needs to be fully evaluated:
{-# LANGUAGE BangPatterns #-}

listMProd :: (Num a) => [a] -> [a] -> [a]
listMProd mdt vdt = go mdt vdt 0
  where
    go []     _      !s  = [s]
    go ls     []     !s  = s : go ls vdt 0
    go (y:ys) (x:xs) !ix = go ys xs (y*x + ix)
On my machine, it gives 3-4x speedup (compiled with -O2).
On the other hand, the intermediate lists should not be made strict, so that they can still be fused away.
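As an aside of my own (not part of the answer above): since go is polymorphic in a, GHC can only unbox the accumulator once the function is specialised to a concrete type, so a SPECIALIZE pragma next to the definition may help for the types you actually use (Double here is just an illustration; the ad types from the question would need their own lines):
{-# SPECIALIZE listMProd :: [Double] -> [Double] -> [Double] #-}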

Space leak with wrapper around Data.ByteString.Builder

I have a wrapper type around Data.ByteString.Builder that allows me to track the length of the ByteString being built (cf. my previous question):
import Data.Monoid
import qualified Data.ByteString.Builder as B
import System.IO (stdout)
data LBuilder = LBuilder { toBuilder :: !B.Builder
                         , lbLength  :: !Int }

instance Monoid LBuilder where
    mempty = LBuilder mempty 0
    (LBuilder x1 l1) `mappend` (LBuilder x2 l2) =
        LBuilder (x1 <> x2) (l1 + l2)

char c = LBuilder (B.char7 c) 1

hPutLBuilder h = B.hPutBuilder h . toBuilder
As far as I understand it, this should be roughly as efficient as using Builder directly. But trying the following test case seems to reveal a space leak:
parts = replicate 10000000 $ char 'x'
main = hPutLBuilder stdout $ mconcat parts
Running this code takes a few seconds and consumes around 250MB of memory. Doing the same task with Builder is far faster and needs only 40KB. The memory profile shows that all the extra space is taken up by instances of BuildStep and Builder, which does not happen when using Builder directly.
What makes this code so inefficient? Why does it not happen when using Builder?
Edit:
Michael's answer below led me to look at how parts is actually evaluated.
After playing around some more, I rewrote the test code in the following way:
makeStuff !acc 0 = acc
makeStuff !acc i = makeStuff (acc <> char 'x') (i - 1)
stuff = makeStuff mempty 10000000
-- stuffOld = mconcat $ replicate 10000000 $ char 'x'
main = hPutLBuilder stdout stuff
Using this definition, the performance and memory usage for Builder and LBuilder are exactly the same (i.e. horrible :-). So it looks like the original version is so fast when using Builder because the compiler can somehow rewrite mconcat $ replicate n $ char c into something like B.lazyByteString $ L.replicate n (toAscii c) at compile time, instead of composing 10000000 functions on the heap at runtime. I tried to confirm this by looking at the generated core. I could tell that:
The definition for stuffOld is a call to a relatively short function that does something with types in Data.ByteString.Builder.Internal.
The definition for stuff is a call to makeStuff.
Said core was not meant to be understood by mere mortals.
So I guess this is just the benchmark being a pathological case, and the actual performance problem in my application lies somewhere else.
One issue is that forcing evaluation of the mconcat parts expression forces evaluation of its lbLength field, which in turn forces evaluation of all the individual char 'x' values; that is where the space leak seems to come from. However, the only way I've found to make the performance of your code match the original Builder is to use a newtype around B.Builder. Even just data LBuilder = LBuilder !B.Builder introduces significant overhead.
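For reference, a minimal sketch of the newtype variant (my own code, assuming a GHC of that era where Monoid can be derived directly with GeneralizedNewtypeDeriving; note that it necessarily gives up the length tracking):
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import qualified Data.ByteString.Builder as B

-- Same runtime representation as B.Builder, so no wrapper overhead,
-- but there is no longer a length field to force.
newtype LBuilder = LBuilder { toBuilder :: B.Builder }
  deriving (Monoid)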

Haskell: performance of IORefs

I have been trying to encode an algorithm in Haskell that requires using lots of mutable references, but it is (perhaps not surprisingly) very slow in comparison to purely lazy code.
Consider a very simple example:
module Main where
import Data.IORef
import Control.Monad
import Control.Monad.Identity
list :: [Int]
list = [1..10^6]
main1 = mapM newIORef list >>= mapM readIORef >>= print
main2 = print $ map runIdentity $ map Identity list
Running GHC 7.8.2 on my machine, main1 takes 1.2s and uses 290MB of memory, while main2 takes only 0.4s and uses a mere 1MB. Is there any trick to prevent this growth, especially in space? I often need IORefs for non-primitive types, unlike the Int here, and assumed that an IORef would add just one extra pointer, much like a regular thunk, but my intuition seems to be wrong.
I have already tried a specialized list type with an unpacked IORef, but with no significant difference.
The problem is your use of mapM, which always performs poorly on large lists both in time and space. The correct solution is to fuse away the intermediate lists by using mapM_ and (>=>):
import Data.IORef
import Control.Monad
list :: [Int]
list = [1..10^6]
main = mapM_ (newIORef >=> readIORef >=> print) list
This runs in constant space and gives excellent performance, running in 0.4 seconds on my machine.
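For reference, (>=>) is Kleisli composition from Control.Monad; it chains the monadic actions directly instead of materialising an intermediate list between them:
(>=>) :: Monad m => (a -> m b) -> (b -> m c) -> (a -> m c)
(f >=> g) x = f x >>= g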
Edit: In answer to your question, you can also do this with pipes to avoid having to manually fuse the loop:
import Data.IORef
import Pipes
import qualified Pipes.Prelude as Pipes
list :: [Int]
list = [1..10^6]
main = runEffect $
    each list >-> Pipes.mapM newIORef >-> Pipes.mapM readIORef >-> Pipes.print
This runs in constant space in about 0.7 seconds on my machine.
This is very likely not about IORef, but about strictness. Actions in the IO monad are serial -- all previous actions must complete before the next one can be started. So
mapM newIORef list
generates a million IORefs before anything is read.
However,
map runIdentity . map Identity
    = map (runIdentity . Identity)
    = map id
which streams very nicely, so we print one element of the list, then generate the next one, etc.
If you want a fairer comparison, use a strict map:
map' :: (a -> b) -> [a] -> [b]
map' f [] = []
map' f (x:xs) = (f x:) $! map' f xs
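For example (my own usage sketch, not from the answer), using map' on both sides builds the entire spine of the result list before printing starts, which makes the comparison with main1 fairer:
main2' = print $ map' runIdentity $ map' Identity list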
I have found that a hack towards a solution is to use a lazy mapM instead, defined as follows (unsafeInterleaveIO comes from System.IO.Unsafe):
import System.IO.Unsafe (unsafeInterleaveIO)

lazyMapM :: (a -> IO b) -> [a] -> IO [b]
lazyMapM f [] = return []
lazyMapM f (x:xs) = do
    y  <- f x
    ys <- unsafeInterleaveIO $ lazyMapM f xs
    return (y:ys)
This allows the monadic version to run within the same 1MB and in similar time. I would expect that a lazy ST monad could solve this problem more elegantly, without resorting to unsafeInterleaveIO, as in:
main = print $ runST (mapM newSTRef list >>= mapM readSTRef)
but that does not work (you also need to use unsafeInterleaveST), which leaves me wondering how lazy Control.Monad.ST.Lazy really is. Does anyone know? :)

Breaking lists at index

I have a performance question today.
I am making a (Haskell) program and, when profiling, I saw that most of the time is spent in the function you can find below. Its purpose is to take the nth element of a list and return both that element and the list without it. My current (slow) definition is as follows:
breakOn :: Int -> [a] -> (a,[a])
breakOn 1 (x:xs) = (x, xs)
breakOn n (x:xs) = (y, x:ys)
  where
    (y,ys) = breakOn (n-1) xs
The Int argument is known to be in the range 1..n, where n is the length of the (never null) list (x:xs), so the function never raises an error.
However, I get poor performance here. My first guess is that I should swap lists for another data structure. But before I start picking different structures and testing code (which will take me a lot of time), I wanted to ask here for a third-party opinion. Also, I'm pretty sure that I'm not doing it in the best way. Any pointers are welcome!
Please note that the type a may not be an instance of Eq.
Solution
I adapted my code to use Sequences from the Data.Sequence module. The result is here:
import Data.Monoid ((<>))
import Data.Sequence (Seq)
import qualified Data.Sequence as S

breakOn :: Int -> Seq a -> (a, Seq a)
breakOn n xs = (S.index zs 0, ys <> S.drop 1 zs)
  where
    (ys,zs) = S.splitAt (n-1) xs
However, I still welcome further suggestions for improvement!
Yes, this is inefficient. You can do a bit better by using splitAt (which unboxes the number during the recursive bit), a lot better by using a data structure with efficient splitting, e.g. a fingertree, and best by massaging the context to avoid needing this operation. If you post a bit more context, it may be possible to give more targeted advice.
Prelude functions are generally pretty efficient. You could rewrite your function using splitAt, as so:
breakOn :: Int -> [a] -> (a,[a])
breakOn n xs = (z, ys ++ zs)
  where
    (ys, z:zs) = splitAt (n-1) xs
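A quick sanity check of this version (my own example):
ghci> breakOn 3 [10,20,30,40,50]
(30,[10,20,40,50])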

is this implementation of merge sort good?

I just started learning Haskell last night and I've never used a functional programming language before.
I just want to know if my implementation of merge sort is good or bad, and what exactly is good or bad about it.
Maybe it's even wrong - well, it does sort, but maybe the algorithm is not what I think merge sort is.
Just tell me everything I could improve here. I myself think it's a pretty clear and simple implementation.
Thanks for your advice, here's the code :)
merge [] ys = ys
merge xs [] = xs
merge xs ys = sorted : merge left right
  where
    sorted = if head xs < head ys then head xs else head ys
    left   = if head xs <= head ys then tail xs else xs
    right  = if head xs > head ys then tail ys else ys

msort []  = []
msort [x] = [x]
msort xs  = merge (msort left) (msort right)
  where
    left  = take (div (length xs) 2) xs
    right = drop (div (length xs) 2) xs
Well, first of all, we can rewrite merge to be a little more elegant using pattern matching
merge [] ys = ys
merge xs [] = xs
merge xs@(x:xs1) ys@(y:ys1)
    | x <= y    = x : merge xs1 ys
    | otherwise = y : merge xs ys1
In general you should avoid using head and tail, since they are a bit unsafe (they raise an error for the empty list); use pattern matching whenever possible.
The implementation of msort is pretty much spot on, except that we can split the list in a more efficient way. That's because length xs takes O(N) to complete. The compiler might save you by caching the result of the length call so that the second call won't traverse the list again, but take and drop will still cause another two traversals, so the split costs 3 traversals in total, which may prove expensive. We can do better by splitting the list into two lists - the first containing the elements at the odd positions, the second the elements at the even positions - like so:
msort []  = []
msort [x] = [x]
msort xs  = merge (msort first) (msort second)
  where
    (first, second) = splitInHalves xs

splitInHalves []  = ([], [])
splitInHalves [x] = ([x], [])
splitInHalves (x:y:xs) =
    let (xs1, ys1) = splitInHalves xs
    in  (x:xs1, y:ys1)
This gets you the same merge sort in O(N log N) time. It feels different because you would probably implement it in place (by modifying the original list) in an imperative language such as C. This version is slightly more costly in memory, but it has its advantages - it is easier to reason about, so it is more maintainable, and it is also very easy to parallelize without worrying about anything except the algorithm itself - which is exactly what a good programming language should provide for the developers that use it.
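To illustrate that last point, here is a hedged sketch of one way to parallelise it (my own code, assuming the parallel package's Control.Parallel; it is simplistic, since par only sparks evaluation to weak head normal form, so a serious version would force the sublists more deeply):
import Control.Parallel (par, pseq)

-- Spark the left half in parallel, evaluate the right half, then merge.
msortPar :: Ord a => [a] -> [a]
msortPar []  = []
msortPar [x] = [x]
msortPar xs  = l `par` (r `pseq` merge l r)
  where
    (first, second) = splitInHalves xs
    l = msortPar first
    r = msortPar second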
EDIT 1 :
If the syntax is a bit much, here are some resources:
Pattern Matching - the bit with the @ symbol is called an as-pattern. You'll find it in there.
let is a keyword used to declare a variable to be used in the expression that follows it (whereas where binds a variable in the expression that precedes it). More on Haskell syntax, including guards (the things with | condition = value), can be found here, in this chapter of Learn You a Haskell.
EDIT 2 :
@is7s proposed a far more concise version of splitInHalves using the foldr function:
splitInHalves = foldr (\x (l,r) -> (x:r,l)) ([],[])
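A quick check that it splits the same way as the recursive version (my own example):
ghci> splitInHalves [1,2,3,4,5]
([1,3,5],[2,4])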
EDIT 3 :
Here is another answer which provides an alternative implementation of merge sort, which also has the property of being stable:
Lazy Evaluation and Time Complexity
Hope this helps, and welcome to the wonderful world of functional programming!
