Foldl memory performance in GHC 8.0.x - performance

I got into a weird issue while checking memory usage of some code I was working on.
Using foldl to sum the elements of a very big list, I get a constant memory usage.
Using foldl' I get a constant memory usage as well (as expected).
Using foldr, the memory grows and brings my system to its knees (no stack overflow exception, which is what I would have expected).
The minimum code needed to trigger it is:
main = print $ foldx (+) 0 [1..100000000000000000000]
where foldx is foldl, foldr or foldl'
I was under the impression (as per Foldr Foldl Foldl') that the opposite would be true.
I set up a repo with the aforementioned code:
https://github.com/framp/hs-fold-perf-test
What's going on here? Is it GHC 8.0.x being too smart?
I'm on macOS Sierra
Thanks

foldl and foldl'
In this case, GHC sees that foldl can be made strict and essentially rewrites it to use foldl'. See below how GHC optimizes the foldl construct.
Note that this only applies because you compiled with optimizations (-O). Without optimizations, the foldl program consumes all my memory and crashes.
Looking at the output of ghc -O -fforce-recomp -ddump-simpl foldl.hs we can see that GHC eliminates the huge list used entirely and optimizes the expression to a tail recursive function:
Rec {
-- RHS size: {terms: 20, types: 5, coercions: 0, joins: 0/0}
Main.main_go [Occ=LoopBreaker] :: Integer -> Integer -> Integer
[GblId, Arity=2, Str=<S,U><S,1*U>]
Main.main_go
= \ (x_a36m :: Integer) (eta_B1 :: Integer) ->
case integer-gmp-1.0.0.1:GHC.Integer.Type.gtInteger#
x_a36m lim_r4Yv
of wild_a36n
{ __DEFAULT ->
case GHC.Prim.tagToEnum# @ Bool wild_a36n of {
False ->
Main.main_go
(integer-gmp-1.0.0.1:GHC.Integer.Type.plusInteger
x_a36m 1)
(integer-gmp-1.0.0.1:GHC.Integer.Type.plusInteger eta_B1 x_a36m);
True -> eta_B1
}
}
end Rec }
Which explains why it runs with constant memory usage.
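For reference, a hand-written equivalent of that loop might look like this (a sketch with illustrative names, not GHC's actual output; the strictness GHC inferred is spelled out with seq):

sumTo :: Integer -> Integer
sumTo lim = go 1 0
  where
    -- force the accumulator before recursing, so no thunk chain builds up
    go x acc
      | x > lim   = acc
      | otherwise = acc `seq` go (x + 1) (acc + x)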
Why does foldr need that much memory?
foldr builds up a lot of thunks: unfinished computations which will eventually hold the correct value. When trying to evaluate the foldr expression, this is what happens:
foldr (+) 0 [1..100]
== (+) 1 $ foldr (+) 0 [2..100]
== (+) 1 $ (+) 2 $ foldr (+) 0 [3..100]
...
== (+) 1 $ (+) 2 $ .. $ (+) 99 $ (+) 100 0 -- at this point there are 100
== (+) 1 $ (+) 2 $ .. $ (+) 99 $ 100 -- unevaluated computations, which
== (+) 1 $ (+) 2 $ .. $ (+) 199 -- take up a lot of memory
...
== (+) 1 $ 5049
== 5050
The limit of 100000000000000000000 is just big enough for the thunks to take up more space than your RAM, and your program crashes.
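For comparison, here are simplified versions of the two folds (sketches, not the exact library code); the strict left fold forces its accumulator at every step, which is exactly what prevents the thunk build-up described above:

foldrS :: (a -> b -> b) -> b -> [a] -> b
foldrS _ z []     = z
foldrS f z (x:xs) = f x (foldrS f z xs)           -- thunks of f pile up for (+)

foldlS' :: (b -> a -> b) -> b -> [a] -> b
foldlS' _ z []     = z
foldlS' f z (x:xs) = let z' = f z x
                     in z' `seq` foldlS' f z' xs  -- accumulator forced each step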

Related

Optimize a list function that creates too much garbage (not stack overflow)

I have this Haskell function, which is causing more than 50% of all the allocations of my program and causing 60% of my run time to be taken by the GC. I run with a small stack (-K10K), so there is no stack overflow, but can I make this function faster, with less allocation?
The goal here is to calculate the product of a matrix and a vector. I cannot use hmatrix, for example, because this is part of a bigger function using the ad automatic differentiation package, so I need to use lists of Num. At runtime I suppose the use of the Numeric.AD module means my types must be Scalar Double.
listMProd :: (Num a) => [a] -> [a] -> [a]
listMProd mdt vdt = go mdt vdt 0
  where
    go [] _ s = [s]
    go ls [] s = s : go ls vdt 0
    go (y:ys) (x:xs) ix = go ys xs (y*x+ix)
Basically, we loop through the matrix, multiplying and adding into an accumulator until we reach the end of the vector, storing the result, then restarting with the vector again. I have a QuickCheck test verifying that I get the same result as the matrix/vector product in hmatrix.
I have tried with foldl, foldr, etc. Nothing I've tried makes the function faster (and some things like foldr cause a space leak).
Running with profiling tells me, on top of the fact that this function is where most of the time and allocation is spent, that there are loads of Cells being created, Cells being a data type from the ad package.
A simple test to run:
import Numeric.AD

main = do
  let m :: [Double] = replicate 400 0.2
      v :: [Double] = replicate 4 0.1
      mycost v m = sum $ listMProd m v
      mygrads = gradientDescent (mycost (map auto v)) (map auto m)
  print $ mygrads !! 1000
This on my machine tells me GC is busy 47% of the time.
Any ideas?
A very simple optimization is to make the go function strict in its accumulator parameter, because it is small, can be unboxed if a is primitive, and always needs to be fully evaluated:
{-# LANGUAGE BangPatterns #-}

listMProd :: (Num a) => [a] -> [a] -> [a]
listMProd mdt vdt = go mdt vdt 0
  where
    go [] _ !s = [s]
    go ls [] !s = s : go ls vdt 0
    go (y:ys) (x:xs) !ix = go ys xs (y*x+ix)
On my machine, it gives 3-4x speedup (compiled with -O2).
On the other hand, the intermediate lists should not be made strict, so that they can still be fused away.
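A small usage sketch (made-up numbers, reusing listMProd from above) to see the shape of the output: a 2x3 matrix flattened row-major, multiplied by a length-3 vector, yields one sum per row:

main :: IO ()
main = print (listMProd [1,2,3,4,5,6] [1,1,1 :: Int])
-- prints [6,15]: 1*1+2*1+3*1 = 6, then 4*1+5*1+6*1 = 15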

Why is summing native lists slower than summing church-encoded lists with `GHC -O2`?

In order to test how Church-encoded lists perform against user-defined lists and native lists, I've prepared 3 benchmarks:
User-defined lists
data List a = Cons a (List a) | Nil deriving Show

lenumTil n = go n Nil where
  go 0 result = result
  go n result = go (n-1) (Cons (n-1) result)

lsum Nil = 0
lsum (Cons h t) = h + (lsum t)

main = print (lsum (lenumTil (100000000 :: Int)))
Native lists
main = print $ sum ([0..100000000-1] :: [Int])
Church lists
fsum = (\ a -> (a (+) 0))

fenumTil n cons nil = go n nil where
  go 0 result = result
  go n result = go (n-1) (cons (n-1) result)

main = print $ (fsum (fenumTil (100000000 :: Int)) :: Int)
The benchmark results are unexpected:
User-defined lists
-- 4999999950000000
-- real 0m22.520s
-- user 0m59.815s
-- sys 0m20.327s
Native Lists
-- 4999999950000000
-- real 0m0.999s
-- user 0m1.357s
-- sys 0m0.252s
Church Lists
-- 4999999950000000
-- real 0m0.010s
-- user 0m0.002s
-- sys 0m0.003s
One would expect that, with the huge number of optimizations targeted specifically at native lists, they would perform best. Yet Church lists outperform them by a factor of 100, and outperform user-defined ADTs by a factor of 2250. I compiled all programs with GHC -O2. I tried replacing sum with foldl'; same result. I also tried adding user input to make sure the Church-list version wasn't being optimized down to a constant. arkeet pointed out on #haskell that, by inspecting the Core, the native version builds an intermediate list; but why? Forcing allocation with an additional reverse, all 3 perform roughly the same.
GHC 7.10 has call-arity analysis, which lets us define foldl in terms of foldr and thus lets left folds, including sum, participate in fusion. GHC 7.8 also defines sum with foldl, but it can't fuse the list away. Thus GHC 7.10 performs optimally and identically to the Church version.
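For illustration, this is the foldl-via-foldr encoding that call-arity analysis makes efficient (a sketch, not the exact library code):

-- A left fold written as a right fold: foldr builds a chain of
-- accumulator-transformers, which is then applied to the initial value z.
foldlViaFoldr :: (b -> a -> b) -> b -> [a] -> b
foldlViaFoldr f z xs = foldr (\x k acc -> k (f acc x)) id xs z

Since the list is now consumed by foldr, an enumeration like [0..n-1] can be fused away by the usual foldr/build rules; call-arity analysis is what lets GHC compile the chain of accumulator-transformers into a tight loop instead of allocating closures.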
The Church version is child's play to optimize in either GHC version. We just have to inline (+) and 0 into fenumTil, and then we have a patently tail-recursive go which can readily be unboxed and then turned into a loop by the code generator.
The user-defined version is not tail-recursive, and it works in linear space, which of course wrecks performance.
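For comparison, an accumulator-based, tail-recursive variant of lsum (a sketch, reusing the List type above) runs in constant space:

lsum' :: List Int -> Int
lsum' = go 0
  where
    -- force the accumulator before recursing
    go acc Nil        = acc
    go acc (Cons h t) = let acc' = acc + h
                        in acc' `seq` go acc' t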

Improving performance on chunked lists

I have a simple problem: Given a list of integers, read the first line as N. Then, read the next N lines and return the sum of them. Repeat until N = 0.
My first approach was using this:
main = interact $ unlines . f . (map read) . lines

f :: [Int] -> [String]
f (n:ls)
  | n == 0 = []
  | otherwise = [show rr] ++ (f rest)
  where (xs, rest) = splitAt n ls
        rr = sum xs
f _ = []
But it's relatively slow. I've profiled it using
ghc -O2 --make test.hs -prof -auto-all -caf-all -fforce-recomp -rtsopts
time ./test +RTS -hc -p -i0.001 < input.in
Where input.in is a test input whose first line is 100k, followed by 100k random numbers, followed by 0. The heap profile shows that it uses O(N) memory.
EDITED: My original question compared 2 similarly slow approaches. I've updated it to compare against an optimized approach below.
Now, if I do the sum iteratively, instead of calling sum, I get a constant amount of memory:
{-# LANGUAGE BangPatterns #-}

main = interact $ unlines . g . (map read) . lines

g :: [Int] -> [String]
g (n:ls)
  | n == 0 = []
  | otherwise = g' n ls 0
g _ = []

g' n (l:ls) !cnt
  | n == 0 = [show cnt] ++ (g (l:ls))
  | otherwise = g' (n-1) ls (cnt + l)
I'm trying to understand what is causing the performance loss in the first example. I would guess everything there could be lazily evaluated?
I don't know precisely what is causing the difference. But I can show you this:
Data.Map> sum [1 .. 1e8]
Out of memory.
Data.Map> foldl' (+) 0 [1 .. 1e8]
5.00000005e15
For some reason, sum = foldl (+) 0, rather than foldl' (with the apostrophe). The difference is that the latter function is more strict, so it uses virtually no memory. The lazy version, by contrast, does this:
sum [1..100]
1 + sum [2..100]
1 + 2 + sum [3..100]
1 + 2 + 3 + sum [4..100]
...
In other words, it creates a giant expression that says 1 + 2 + 3 + ... And then, right at the end, it tries to evaluate it all. Well, obviously, that's going to eat a lot of RAM. By using foldl' instead of foldl, you make it do the additions immediately, rather than pointlessly storing them in RAM.
You probably also want to do I/O using ByteString rather than String; but the laziness difference will probably give you a big speed boost on its own.
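For the I/O side, here is a minimal sketch of the ByteString approach (it just sums every number on stdin rather than implementing the chunked protocol, to keep it short):

import qualified Data.ByteString.Char8 as B
import Data.List (foldl')

main :: IO ()
main = do
  ls <- fmap B.lines B.getContents
  -- B.readInt parses a leading Int and returns the rest of the string
  let nums = [ n | Just (n, _) <- map B.readInt ls ]
  print (foldl' (+) 0 nums)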
I think that laziness is what prevents your first and second version from being equivalent.
Consider the result created from the input "numbers"
1
garbage_here
2
3
5
0
The first version would give the result list [error "...some parse error", 8], whose second element you can still safely look at, while the second version errors out almost immediately. It seems hard to achieve the first behaviour in a streaming way.
Even without laziness, though, getting from the first to the second version may be more than GHC can handle: it would need fusion rewrite rules combining foldl/foldl' on the first element of a tuple with splitAt. And GHC has only recently got to the point where it can fuse foldl/foldl' at all.

Haskell Fibonacci sequence performance depending on methodology

I was trying out different approaches to getting a number at a given index of the Fibonacci sequence and they could basically be divided into two categories:
building a list and querying an index
using variables (might be separate or tupled, without a list)
I picked an example of both:
fibs1 :: Int -> Integer
fibs1 n = fibs1' !! n
  where fibs1' = 0 : scanl (+) 1 fibs1'

fib2 :: Int -> Integer
fib2 n = fib2' 1 1 n where
  fib2' _ b 2 = b
  fib2' a b n = fib2' b (a + b) (n - 1)
fibs1:
real 0m2.356s
user 0m2.310s
sys 0m0.030s
fib2:
real 0m0.671s
user 0m0.667s
sys 0m0.000s
Both were compiled with 64bit GHC 7.6.1 and -O2 -fllvm. Their core dumps are very similar in length, but they differ in the parts that I'm not very proficient at interpreting.
I was not surprised that fibs1 failed for n = 350000 (Stack space overflow). However, I am not comfortable with the fact that it used that much memory.
I would like to clear some things up:
Why does the GC not take care of the beginning of the list throughout computation even though most of it quickly becomes useless?
Why does GHC not optimize the list version to a variable version since only two of its elements are required at once?
EDIT: Sorry, I mixed up the speed results; fixed. Two of my three doubts are still valid, though ;).
Why does the GC not take care of the beginning of the list throughout computation even though most of it quickly becomes useless?
fibs1 uses a lot of memory and is slow because scanl is lazy: it doesn't evaluate the list elements, so
fibs1' = 0 : scanl (+) 1 fibs1'
produces
0 : scanl (+) 1 (0 : more)
0 : 1 : let f2 = 1+0 in scanl (+) f2 (1 : more')
0 : 1 : let f2 = 1+0 in f2 : let f3 = f2+1 in scanl (+) f3 (f2 : more'')
0 : 1 : let f2 = 1+0 in f2 : let f3 = f2+1 in f3 : let f4 = f3+f2 in scanl (+) f4 (f3 : more''')
etc. So you rather quickly get a huge nested thunk. When that thunk is evaluated, it is pushed on the stack, and at some point between 250000 and 350000, it becomes too big for the default stack.
And since each list element holds a reference to the previous while it is not evaluated, the beginning of the list cannot be garbage-collected.
If you use a strict scan,

fibs1 :: Int -> Integer
fibs1 n = fibs1' !! n
  where
    fibs1' = 0 : scanl' (+) 1 fibs1'
    scanl' f a (x:xs) = let x' = f a x
                        in x' `seq` (a : scanl' f x' xs)
    scanl' _ a [] = [a]
when the k-th list cell is produced, its value is already evaluated, so it doesn't refer to previous cells, and hence the list can be garbage-collected as it is traversed (assuming nothing else holds a reference to it).
With that implementation, the list version is about as fast and lean as fib2. It still needs to allocate list cells, so it allocates a little more and is possibly a tiny bit slower, but the difference is minute, since the Fibonacci numbers become so large that the list-construction overhead is negligible.
The idea of scanl is that its result is incrementally consumed, so that the consumption forces the elements and prevents the build-up of large thunks.
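A small demonstration of that point (not from the original answer): consuming the scan with a strict fold forces each element as it is produced, so memory stays flat:

import Data.List (foldl')

main :: IO ()
main = print (foldl' (+) 0 (take 1000000 (scanl (+) 1 (repeat (1 :: Integer)))))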
Why does GHC not optimize the list version to a variable version since only two of its elements are required at once?
Its optimiser can't see through the algorithm to determine that. scanl is opaque to the compiler; it doesn't know what scanl does.
If we take the exact source code for scanl (renaming it or hiding scanl from the Prelude, I opted for renaming),
scans :: (b -> a -> b) -> b -> [a] -> [b]
scans f q ls = q : (case ls of
                      [] -> []
                      x:xs -> scans f (f q x) xs)
and compile the module exporting it (with -O2), and then look at the generated interface file with
ghc --show-iface Scan.hi
we get (for example, minor differences between compiler versions)
Magic: Wanted 33214052, got 33214052
Version: Wanted [7, 0, 6, 1], got [7, 0, 6, 1]
Way: Wanted [], got []
interface main:Scan 7061
  interface hash: ef57dac14815e2f1f897b42a007c0c81
  ABI hash: 8cfc8dab79de6a51fcad666f1869574f
  export-list hash: 57d6805e5f0b5f76f0dd8dfb228df988
  orphan hash: 693e9af84d3dfcc71e640e005bdc5e2e
  flag hash: 1e8135cb44ef6dd330f1f56943d1f463
  used TH splices: False
  where
exports:
  Scan.scans
module dependencies:
package dependencies: base* ghc-prim integer-gmp
orphans: base:GHC.Base base:GHC.Float base:GHC.Real
family instance modules:
import -/ base:Prelude 1cb4b618cf45281dc97748b1831bf0cd
d79ca4e223c0de0a770a3b88a5e67687
  scans :: forall b a. (b -> a -> b) -> b -> [a] -> [b]
    {- Arity: 3, HasNoCafRefs, Strictness: LLL -}
vectorised variables:
vectorised tycons:
vectorised reused tycons:
scalar variables:
scalar tycons:
trusted: safe-inferred
require own pkg trusted: False
and see that the interface file doesn't expose the unfolding of the function, only its type, arity, strictness and that it doesn't refer to CAFs.
When a module importing that is compiled, all that the compiler has to go by is the information exposed by the interface file.
Here, there is no information exposed that would allow the compiler to do anything else but emit a call to the function.
If the unfolding were exposed, the compiler would have a chance to inline it and analyse the code, knowing the types and the combining function, to produce more eager code that doesn't build thunks.
The semantics of scanl, however, are maximally lazy, each element of the output is emitted before the input list is inspected. That has the consequence that GHC can't make the addition strict, since that would change the result if the list contained any undefined values:
scanl (+) 1 [undefined] = 1 : scanl (+) (1 + undefined) [] = 1 : (1 + undefined) : []
while
scanl' (+) 1 [undefined] = let x' = 1 + undefined in x' `seq` 1 : scanl' (+) x' []
= *** Exception: Prelude.undefined
One could make a variant
scanl'' f b (x:xs) = b `seq` b : scanl'' f (f b x) xs
scanl'' _ b []     = b `seq` [b]
that would produce 1 : *** Exception: Prelude.undefined for the above input. But any strictness would indeed change the result if the list contained undefined values, so even if the compiler knew the unfolding, it couldn't make the evaluation strict, unless it could prove that there are no undefined values in the list. That fact is obvious to us, but not to the compiler (and I don't think it would be easy to teach a compiler to recognize it and prove the absence of undefined values).

Laziness and tail recursion in Haskell, why is this crashing?

I have this fairly simple function to compute the mean of elements of a big list, using two accumulators to hold the sum so far and the count so far:
mean = go 0 0
where
go s l [] = s / fromIntegral l
go s l (x:xs) = go (s+x) (l+1) xs
main = do
putStrLn (show (mean [0..10000000]))
Now, in a strict language, this would be tail-recursive, and there would be no problem. However, as Haskell is lazy, my googling has led me to understand that (s+x) and (l+1) will be passed down the recursion as thunks. So this whole thing crashes and burns:
Stack space overflow: current size 8388608 bytes.
After further googling, I found seq and $!, which it seems I don't understand, because all my attempts at using them in this context proved futile, with error messages saying something about infinite types.
Finally I found -XBangPatterns, which solves it all by changing the recursive call:
go !s !l (x:xs) = go (s+x) (l+1) xs
But I'm not happy with this, as -XBangPatterns is currently an extension. I would like to know how to make the evaluation strict without the use of -XBangPatterns. (And maybe learn something too!)
Just so you understand my lack of understanding, here's what I tried (the only try that compiled, that is):
go s l (x:xs) = go (seq s (s+x)) (seq l (l+1)) xs
From what I could understand, seq should here force the evaluation of the s and l argument, thus avoiding the problem caused by thunks. But I still get a stack overflow.
I've written extensively on this:
Real World Haskell, ch 24: controlling evaluation
On recursion and strictness in Haskell
Firstly, yes, if you want to require strict evaluation of the accumulators use seq and stay in Haskell 98:
mean = go 0 0
  where
    go s l [] = s / fromIntegral l
    go s l (x:xs) = s `seq` l `seq`
                    go (s+x) (l+1) xs

main = print $ mean [0..10000000]

*Main> main
5000000.0
Secondly: strictness analysis will kick in if you give some type annotations, and compile with -O2:
mean :: [Double] -> Double
mean = go 0 0
  where
    go :: Double -> Int -> [Double] -> Double
    go s l [] = s / fromIntegral l
    go s l (x:xs) = go (s+x) (l+1) xs

main = print $ mean [0..10000000]
$ ghc -O2 --make A.hs
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A ...
$ time ./A
5000000.0
./A 0.46s user 0.01s system 99% cpu 0.470 total
Because 'Double' is a wrapper over the strict, atomic type Double#, with optimizations on and a precise type to work with, GHC runs strictness analysis and infers that the strict version will be ok.
Thirdly: going further, using an unboxed, fusible array representation with a strict pair for the accumulator:

import Data.Array.Vector

main = print (mean (enumFromToFracU 1 10000000))

data Pair = Pair !Int !Double

mean :: UArr Double -> Double
mean xs = s / fromIntegral n
  where
    Pair n s = foldlU k (Pair 0 0) xs
    k (Pair n s) x = Pair (n+1) (s+x)
$ ghc -O2 --make A.hs -funbox-strict-fields
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A ...
$ time ./A
5000000.5
./A 0.03s user 0.00s system 96% cpu 0.038 total
As described in the RWH chapter above.
The seq function forces evaluation of its first parameter when the seq call itself is evaluated. When you pass seq s (s+x) as a parameter, the seq call is not evaluated immediately, because there is no need for the value of that parameter yet. You want the call to seq to be evaluated before the recursive call, so that it in turn can force its parameter to be evaluated.
Usually this is done like this:
go s l (x:xs) = s `seq` l `seq` go (s+x) (l+1) xs
This is a syntactic variation of seq s (seq l (go (s+x) (l+1) xs)). Here the calls to seq are the outermost function calls in the expression. Because of Haskell's laziness, this causes them to be evaluated first: seq is called with the still-unevaluated parameters s and seq l (go (s+x) (l+1) xs); evaluating those parameters is deferred until somebody actually tries to access their values.
Now seq can force its first parameter to be evaluated before returning the rest of the expression. Then the next step in the evaluation would be the second seq. If the calls to seq are buried somewhere in some parameter they might not be executed for a long time, defeating their purpose.
With the changed positions of the seqs the program executes fine, without using excessive amounts of memory.
Another solution to the problem would be to simply enable optimizations in GHC when the program is compiled (-O or -O2). The optimizer recognizes the dispensable laziness and produces code that doesn't allocate unnecessary memory.
You are right in your understanding that seq s (s+x) forces the evaluation of s. But it doesn't force s+x, therefore you're still building up thunks.
By using $! you can force the evaluation of the addition (twice, once for each accumulator). This achieves the same effect as using the bang patterns:

mean = go 0 0
  where
    go s l [] = s / fromIntegral l
    go s l (x:xs) = ((go $! s+x) $! l+1) xs
The use of the $! function will translate go $! (s+x) to the equivalent of:
let y = s+x
in seq y (go y)
Thus y is first forced into weak head normal form, which means that the outermost function is applied. In the case of y, the outermost function is +, thus y is fully evaluated to a number before being passed to go.
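For reference, ($!) is essentially the classic Haskell 98 definition below (the Prelude's version is hidden here so the block stands alone):

import Prelude hiding (($!))

infixr 0 $!

-- Force x to weak head normal form before applying f to it.
($!) :: (a -> b) -> a -> b
f $! x = x `seq` f x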
Oh, and you probably got the infinite-type error message because you didn't have the parentheses in the right place. I got the same error when I first wrote your program down :-)
Because the $! operator is right-associative, without parentheses go $! (s+x) $! (l+1) means the same as go $! ((s+x) $! (l+1)), which is obviously wrong.
