Space leak with wrapper around Data.ByteString.Builder - performance

I have a wrapper type around Data.ByteString.Builder that allows me to track the length of the ByteString being built (cf. my previous question):
import Data.Monoid
import qualified Data.ByteString.Builder as B
import System.IO (stdout)
data LBuilder = LBuilder { toBuilder :: !B.Builder
, lbLength :: !Int }
instance Monoid LBuilder where
  mempty = LBuilder mempty 0
  (LBuilder x1 l1) `mappend` (LBuilder x2 l2) =
    LBuilder (x1 <> x2) (l1 + l2)
char c = LBuilder (B.char7 c) 1
hPutLBuilder h = B.hPutBuilder h . toBuilder
As far as I understand it, this should be roughly as efficient as using Builder directly. But trying the following test case seems to reveal a space leak:
parts = replicate 10000000 $ char 'x'
main = hPutLBuilder stdout $ mconcat parts
Running this code takes a few seconds and consumes around 250MB of memory. Doing the same task with Builder is far faster and needs only 40KB. The memory profile shows that all the extra space is taken up by instances of BuildStep and Builder, which does not happen when using Builder directly.
What makes this code so inefficient? Why does it not happen when using Builder?
Edit:
Michael's answer below led me to look at how parts is actually evaluated.
After playing around some more, I rewrote the test code in the following way:
makeStuff !acc 0 = acc
makeStuff !acc i = makeStuff (acc <> char 'x') (i - 1)
stuff = makeStuff mempty 10000000
-- stuffOld = mconcat $ replicate 10000000 $ char 'x'
main = hPutLBuilder stdout stuff
Using this definition the performance and memory usage for Builder and LBuilder is exactly the same (i.e. horrible :-). So it looks like the original version is so fast when using Builder because the compiler can somehow rewrite mconcat $ replicate n $ char c into something like B.lazyByteString $ L.replicate n (toAscii c) at compile time, instead of composing 10000000 functions at runtime on the heap. I tried to confirm this by looking at the generated core. I could tell that:
The definition for stuffOld is a call to a relatively short function that does something with types in Data.ByteString.Builder.Internal.
The definition for stuff is a call to makeStuff.
Said core was not meant to be understood by mere mortals.
So I guess this is just the benchmark being a pathological case, and the actual performance problem in my application lies somewhere else.

One issue is that forcing evaluation of the mconcat parts expression forces evaluation of its lbLength, which in turn will force evaluation of all the individual char 'x' values, which is where the space leak seems to come from. However, the only way I've found to get performance of your code to be the same as the original Builder is to use a newtype around B.Builder. Even just data LBuilder = LBuilder !B.Builder introduces significant overhead.
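For anyone reproducing the question's test on a current GHC, here is a minimal sketch of the same wrapper. It assumes base-4.11 or later, where Semigroup is a superclass of Monoid, so the combining function moves into a Semigroup instance. The small main is only a sanity check of the length tracking, not a benchmark.

```haskell
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy.Char8 as L

-- Same wrapper as in the question: a Builder plus the number of bytes in it.
data LBuilder = LBuilder { toBuilder :: !B.Builder
                         , lbLength  :: !Int }

-- On base >= 4.11 the append operation is defined in Semigroup.
instance Semigroup LBuilder where
  LBuilder x1 l1 <> LBuilder x2 l2 = LBuilder (x1 <> x2) (l1 + l2)

instance Monoid LBuilder where
  mempty = LBuilder mempty 0

char :: Char -> LBuilder
char c = LBuilder (B.char7 c) 1

main :: IO ()
main = do
  let b = mconcat (replicate 5 (char 'x'))
  print (lbLength b)                             -- tracked length
  L.putStrLn (B.toLazyByteString (toBuilder b))  -- built bytes
```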

Efficiency of unfoldr versus zipWith

Over on Code Review, I answered a question about a naive Haskell fizzbuzz solution by suggesting an implementation that iterates forward, avoiding the quadratic cost of the increasing number of primes and discarding modulo division (almost) entirely. Here's the code:
fizz :: Int -> String
fizz = const "fizz"
buzz :: Int -> String
buzz = const "buzz"
fizzbuzz :: Int -> String
fizzbuzz = const "fizzbuzz"
fizzbuzzFuncs = cycle [show, show, fizz, show, buzz, fizz, show, show, fizz, buzz, show, fizz, show, show, fizzbuzz]
toFizzBuzz :: Int -> Int -> [String]
toFizzBuzz start count =
    let offsetFuncs = drop (mod (start - 1) 15) fizzbuzzFuncs
    in take count $ zipWith ($) offsetFuncs [start..]
As a further prompt, I suggested rewriting it using Data.List.unfoldr. The unfoldr version is an obvious, simple modification to this code so I'm not going to type it here unless people seeking to answer my question insist that is important (no spoilers for the OP over on Code Review). But I do have a question about the relative efficiency of the unfoldr solution compared to the zipWith one. While I am no longer a Haskell neophyte, I am no expert on Haskell internals.
An unfoldr solution does not require the [start..] infinite list, since it can simply unfold from start. My thoughts are
The zipWith solution does not memoize each successive element of [start..] as it is asked for. Each element is used and discarded because no reference to the head of [start..] is kept. So there is no more memory consumed there than with unfoldr.
Concerns about the performance of unfoldr and recent patches to make it always inlined are conducted at a level which I have not yet reached.
So I think the two are equivalent in memory consumption but have no idea about the relative performance. Hoping more informed Haskellers can direct me towards an understanding of this.
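To make the comparison concrete without giving away the withheld fizzbuzz solution, here is a small sketch of just the counter: an unfoldr whose seed is the next number produces the same values as [start..]. The name countFrom is hypothetical and not part of the Code Review answer.

```haskell
import Data.List (unfoldr)

-- Unfolds the same values as [start ..]; the seed is the next number to emit.
countFrom :: Int -> [Int]
countFrom = unfoldr (\n -> Just (n, n + 1))

main :: IO ()
main = print (take 5 (countFrom 3) == take 5 [3 ..])
```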
unfoldr seems a natural thing to use to generate sequences, even if other solutions are more expressive. I just know I need to understand more about its actual performance. (For some reason I find foldr much easier to comprehend on that level.)
Note: unfoldr's use of Maybe was the first potential performance issue that occurred to me, before I even started investigating the issue (and the only bit of the optimisation/inlining discussions that I fully understood). So I was able to stop worrying about Maybe right away (given a recent version of Haskell).
As the one responsible for the recent changes in the implementations of zipWith and unfoldr, I figured I should probably take a stab at this. I can't really compare them so easily, because they're very different functions, but I can try to explain some of their properties and the significance of the changes.
unfoldr
Inlining
The old version of unfoldr (before base-4.8/GHC 7.10) was recursive at the top level (it called itself directly). GHC never inlines recursive functions, so unfoldr was never inlined. As a result, GHC could not see how it interacted with the function it was passed. The most troubling effect of this was that the function passed in, of type (b -> Maybe (a, b)), would actually produce Maybe (a, b) values, allocating memory to hold the Just and (,) constructors. By restructuring unfoldr as a "worker" and a "wrapper", the new code allows GHC to inline it and (in many cases) fuse it with the function passed in, so the extra constructors are stripped away by compiler optimizations.
For example, under GHC 7.10, the code
module Blob where
import Data.List
bloob :: Int -> [Int]
bloob k = unfoldr go 0 where
  go n | n == k    = Nothing
       | otherwise = Just (n * 2, n+1)
compiled with ghc -O2 -ddump-simpl -dsuppress-all -dno-suppress-type-signatures leads to the core
$wbloob :: Int# -> [Int]
$wbloob =
  \ (ww_sYv :: Int#) ->
    letrec {
      $wgo_sYr :: Int# -> [Int]
      $wgo_sYr =
        \ (ww1_sYp :: Int#) ->
          case tagToEnum# (==# ww1_sYp ww_sYv) of _ {
            False -> : (I# (*# ww1_sYp 2)) ($wgo_sYr (+# ww1_sYp 1));
            True -> []
          }; } in
    $wgo_sYr 0
bloob :: Int -> [Int]
bloob =
  \ (w_sYs :: Int) ->
    case w_sYs of _ { I# ww1_sYv -> $wbloob ww1_sYv }
Fusion
The other change to unfoldr was rewriting it to participate in "fold/build" fusion, an optimization framework used in GHC's list libraries. The idea of both "fold/build" fusion and the newer, differently balanced, "stream fusion" (used in the vector library) is that if a list is produced by a "good producer", transformed by "good transformers", and consumed by a "good consumer", then the list conses never actually need to be allocated at all. The old unfoldr was not a good producer, so if you produced a list with unfoldr and consumed it with, say, foldr, the pieces of the list would be allocated (and immediately become garbage) as computation proceeded. Now, unfoldr is a good producer, so you can write a loop using, say, unfoldr, filter, and foldr, and not (necessarily) allocate any memory at all.
For example, given the above definition of bloob, and a stern {-# INLINE bloob #-} (this stuff is a bit fragile; good producers sometimes need to be inlined explicitly to be good), the code
hooby :: Int -> Int
hooby = sum . bloob
compiles to the GHC core
$whooby :: Int# -> Int#
$whooby =
  \ (ww_s1oP :: Int#) ->
    letrec {
      $wgo_s1oL :: Int# -> Int# -> Int#
      $wgo_s1oL =
        \ (ww1_s1oC :: Int#) (ww2_s1oG :: Int#) ->
          case tagToEnum# (==# ww1_s1oC ww_s1oP) of _ {
            False -> $wgo_s1oL (+# ww1_s1oC 1) (+# ww2_s1oG (*# ww1_s1oC 2));
            True -> ww2_s1oG
          }; } in
    $wgo_s1oL 0 0
hooby :: Int -> Int
hooby =
  \ (w_s1oM :: Int) ->
    case w_s1oM of _ { I# ww1_s1oP ->
    case $whooby ww1_s1oP of ww2_s1oT { __DEFAULT -> I# ww2_s1oT }
    }
which has no lists, no Maybes, and no pairs; the only allocation it performs is the Int used to store the final result (the application of I# to ww2_s1oT). The entire computation can reasonably be expected to be performed in machine registers.
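The two changes described above (a non-recursive wrapper around a local recursive worker, and production through build) can be sketched together as follows. This is an illustration in the spirit of the base-4.8 definition rather than a verbatim copy of the library source; the name unfoldrSketch is hypothetical, and build is taken from GHC.Exts.

```haskell
import GHC.Exts (build)

-- The wrapper is not recursive, so GHC can inline it at call sites;
-- the recursion lives in the local worker 'go'. Producing the list
-- through 'build' lets fold/build fusion remove it when the consumer
-- is a good consumer.
unfoldrSketch :: (b -> Maybe (a, b)) -> b -> [a]
{-# INLINE unfoldrSketch #-}
unfoldrSketch f b0 = build (\c n ->
  let go b = case f b of
        Just (a, b') -> a `c` go b'
        Nothing      -> n
  in go b0)

-- Same step function as 'bloob' above, with the bound fixed at 5.
step :: Int -> Maybe (Int, Int)
step n | n == 5    = Nothing
       | otherwise = Just (n * 2, n + 1)

main :: IO ()
main = do
  print (unfoldrSketch step 0)
  print (sum (unfoldrSketch step 0))
```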
zipWith
zipWith has a bit of a weird story. It fits into the fold/build framework a bit awkwardly (I believe it works quite a bit better with stream fusion). It is possible to make zipWith fuse with either its first or its second list argument, and for many years, the list library tried to make it fuse with either, if either was a good producer. Unfortunately, making it fuse with its second list argument can make a program less defined under certain circumstances. That is, a program using zipWith could work just fine when compiled without optimization, but produce an error when compiled with optimization. This is not a great situation. Therefore, as of base-4.8, zipWith no longer attempts to fuse with its second list argument. If you want it to fuse with a good producer, that good producer had better be in the first list argument.
Specifically, the reference implementation of zipWith leads to the expectation that, say, zipWith (+) [1,2,3] (1 : 2 : 3 : undefined) will give [2,4,6], because it stops as soon as it hits the end of the first list. With the previous zipWith implementation, if the second list looked like that but was produced by a good producer, and if zipWith happened to fuse with it rather than the first list, then it would go boom.
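The definedness point can be checked with a small sketch. zipWithRef below is the Haskell Report style reference definition written out under a hypothetical name, so that no fusion rules apply to it. The first call stops at the end of the first list and never touches the undefined tail; with the arguments swapped, the undefined tail is forced.

```haskell
import Control.Exception (SomeException, evaluate, try)

-- Reference-style zipWith: the first list is examined first, so once it
-- is exhausted the second list is not inspected any further.
zipWithRef :: (a -> b -> c) -> [a] -> [b] -> [c]
zipWithRef f (x:xs) (y:ys) = f x y : zipWithRef f xs ys
zipWithRef _ _      _      = []

main :: IO ()
main = do
  -- Fine: ends when the first list ends.
  print (zipWithRef (+) [1, 2, 3] (1 : 2 : 3 : undefined) :: [Int])
  -- Swapped: the undefined tail of the first argument is forced.
  r <- try (evaluate (length (zipWithRef (+) (1 : 2 : 3 : undefined) [1, 2, 3 :: Int])))
  putStrLn (either (\e -> const "undefined tail was forced" (e :: SomeException)) show r)
```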

Haskell: partially drop lazy evaluated results

I have a very large decision tree. It is used as follows:
-- once per application start
t :: Tree
t = buildDecisionTree
-- done several times
makeDecision :: Something -> Decision
makeDecision something = search t something
This decision tree is way too large to fit in memory. But, thanks to lazy evaluation, it is only partially evaluated.
The problem is that there are scenarios where all possible decisions are tried, causing the whole tree to be evaluated. This is not going to terminate, but should not cause a memory overflow either. Further, if this process is aborted, the memory usage does not decrease, as a huge subtree remains evaluated.
A solution would be to reevaluate the tree every time makeDecision is called, but this would lose the benefits of caching decisions and significantly slow down makeDecision.
I would like to go a middle course. In particular it is very common in my application to do successive decisions with common path prefix in the tree. So I would like to cache the last used path but drop the others, causing them to reevaluate the next time they are used. How can I do this in Haskell?
It is not possible in pure Haskell; see the question Can a thunk be duplicated to improve memory performance? (as pointed out by @shang). You can, however, do this with IO.
We start with the module header and list only the type and the functions that should make this module (which will use unsafePerformIO) safe. It is also possible to do this without unsafePerformIO, but that would mean that the user has to keep more of their code in IO.
{-# LANGUAGE ExistentialQuantification #-}
module ReEval (ReEval, newReEval, readReEval, resetReEval) where
import Data.IORef
import System.IO.Unsafe
We start by defining a data type that stores a value in a way that prevents all sharing, by keeping the function and the argument away from each other, and only apply the function when we want the value. Note that the value returned by unsharedValue can be shared, but not with the return value of other invocations (assuming the function is doing something non-trivial):
data Unshared a = forall b. Unshared (b -> a) b
unsharedValue :: Unshared a -> a
unsharedValue (Unshared f x) = f x
Now we define our data type of resettable computations. We need to store the computation and the current value. The latter is stored in an IORef, as we want to be able to reset it.
data ReEval a = ReEval {
calculation :: Unshared a,
currentValue :: IORef a
}
To wrap a value in a ReEval box, we need to have a function and an argument. Why not just a -> ReEval a? Because then there would be no way to prevent the parameter from being shared.
newReEval :: (b -> a) -> b -> ReEval a
newReEval f x = unsafePerformIO $ do
    let c = Unshared f x
    ref <- newIORef (unsharedValue c)
    return $ ReEval c ref
Reading is simple: just get the value from the IORef. This use of unsafePerformIO is safe because we will always get the value of unsharedValue c, although a different “copy” of it.
readReEval :: ReEval a -> a
readReEval r = unsafePerformIO $ readIORef (currentValue r)
And finally the resetting. I left it in the IO monad, not because it would be any less safe than the other function to be wrapped in unsafePerformIO, but because this is the easiest way to give the user control over when the resetting actually happens. You don’t want to risk that all your calls to resetReEval are lazily delayed until your memory has run out or even optimized away because there is no return value to use.
resetReEval :: ReEval a -> IO ()
resetReEval r = writeIORef (currentValue r) (unsharedValue (calculation r))
This is the end of the module. Here is example code:
import Debug.Trace
import ReEval
main = do
    let func a = trace ("func " ++ show a) negate a
    let l = [ newReEval func n | n <- [1..5] ]
    print (map readReEval l)
    print (map readReEval l)
    mapM_ resetReEval l
    print (map readReEval l)
And here you can see that it does what is expected:
$ runhaskell test.hs
func 1
func 2
func 3
func 4
func 5
[-1,-2,-3,-4,-5]
[-1,-2,-3,-4,-5]
func 1
func 2
func 3
func 4
func 5
[-1,-2,-3,-4,-5]

How does one write efficient Dynamic Programming algorithms in Haskell?

I've been playing around with dynamic programming in Haskell. Practically every tutorial I've seen on the subject gives the same, very elegant algorithm based on memoization and the laziness of the Array type. Inspired by those examples, I wrote the following algorithm as a test:
-- pascal n returns the nth entry on the main diagonal of pascal's triangle
-- (mod a million for efficiency)
pascal :: Int -> Int
pascal n = p ! (n,n) where
  p = listArray ((0,0),(n,n)) [f (i,j) | i <- [0 .. n], j <- [0 .. n]]
  f :: (Int,Int) -> Int
  f (_,0) = 1
  f (0,_) = 1
  f (i,j) = (p ! (i, j-1) + p ! (i-1, j)) `mod` 1000000
My only problem is efficiency. Even using GHC's -O2, this program takes 1.6 seconds to compute pascal 1000, which is about 160 times slower than an equivalent unoptimized C++ program. And the gap only widens with larger inputs.
It seems like I've tried every possible permutation of the above code, along with suggested alternatives like the data-memocombinators library, and they all had the same or worse performance. The one thing I haven't tried is the ST Monad, which I'm sure could be made to run the program only slightly slower than the C version. But I'd really like to write it in idiomatic Haskell, and I don't understand why the idiomatic version is so inefficient. I have two questions:
Why is the above code so inefficient? It seems like a straightforward iteration through a matrix, with an arithmetic operation at each entry. Clearly Haskell is doing something behind the scenes I don't understand.
Is there a way to make it much more efficient (at most 10-15 times the runtime of a C program) without sacrificing its stateless, recursive formulation (vis-a-vis an implementation using mutable arrays in the ST Monad)?
Thanks a lot.
Edit: The array module used is the standard Data.Array
Well, the algorithm could be designed a little better. Using the vector package and being smart about only keeping one row in memory at a time, we can get something that's idiomatic in a different way:
{-# LANGUAGE BangPatterns #-}
import Data.Vector.Unboxed
import Prelude hiding (replicate, tail, scanl)
pascal :: Int -> Int
pascal !n = go 1 ((replicate (n+1) 1) :: Vector Int) where
  go !i !prevRow
    | i <= n    = go (i+1) (scanl f 1 (tail prevRow))
    | otherwise = prevRow ! n
  f x y = (x + y) `rem` 1000000
This optimizes down very tightly, especially because the vector package includes some rather ingenious tricks to transparently optimize array operations written in an idiomatic style.
1 Why is the above code so inefficient? It seems like a straightforward iteration through a matrix, with an arithmetic operation at each entry. Clearly Haskell is doing something behind the scenes I don't understand.
The problem is that the code writes thunks to the array. Then when entry (n,n) is read, the evaluation of the thunks jumps all over the array again, recurring until finally a value not needing further recursion is found. That causes a lot of unnecessary allocation and inefficiency.
The C++ code doesn't have that problem: the values are written and read directly, without requiring further evaluation. The same would happen with an STUArray. Does
p = runSTUArray $ do
    arr <- newArray ((0,0),(n,n)) 1
    forM_ [1 .. n] $ \i ->
        forM_ [1 .. n] $ \j -> do
            a <- readArray arr (i,j-1)
            b <- readArray arr (i-1,j)
            writeArray arr (i,j) $! (a+b) `rem` 1000000
    return arr
really look so bad?
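Filled out into a complete function, the STUArray fragment above might look like the following sketch. The imports and the name pascalST are assumptions added for illustration (the array package is assumed to be available); the loop body is the one from the fragment.

```haskell
import Control.Monad (forM_)
import Data.Array.ST (newArray, readArray, writeArray, runSTUArray)
import Data.Array.Unboxed (UArray, (!))

-- Entry (n,n) of the table, computed with an unboxed mutable array:
-- every cell holds an evaluated Int, never a thunk.
pascalST :: Int -> Int
pascalST n = p ! (n, n)
  where
    p :: UArray (Int, Int) Int
    p = runSTUArray $ do
      arr <- newArray ((0, 0), (n, n)) 1   -- first row and column stay 1
      forM_ [1 .. n] $ \i ->
        forM_ [1 .. n] $ \j -> do
          a <- readArray arr (i, j - 1)
          b <- readArray arr (i - 1, j)
          writeArray arr (i, j) $! (a + b) `rem` 1000000
      return arr

main :: IO ()
main = print (pascalST 10)
```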
2 Is there a way to make it much more efficient (at most 10-15 times the runtime of a C program) without sacrificing its stateless, recursive formulation (vis-a-vis an implementation using mutable arrays in the ST Monad)?
I don't know of one. But there might be.
Addendum:
Once one uses STUArrays or unboxed Vectors, there's still a significant difference to the equivalent C implementation. The reason is that gcc replaces the % by a combination of multiplications, shifts and subtractions (even without optimisations), since the modulus is known. Doing the same by hand in Haskell (since GHC doesn't [yet] do that),
-- fast modulo 1000000
-- for nonnegative Ints < 2^31
-- requires 64-bit Ints
fastMod :: Int -> Int
fastMod n = n - 1000000*((n*1125899907) `shiftR` 50)
gets the Haskell versions on par with C.
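For readers wondering where the constant comes from: 1125899907 is 2^50 divided by 10^6, rounded up, so multiplying by it and shifting right by 50 bits approximates division by 10^6. The sketch below recomputes the constant and compares fastMod with mod on a handful of sample values. Like the original, it assumes 64-bit Int and nonnegative inputs below 2^31; the sample list is arbitrary.

```haskell
import Data.Bits (shiftR)

-- fast modulo 1000000 for nonnegative Ints < 2^31 (needs 64-bit Int)
fastMod :: Int -> Int
fastMod n = n - 1000000 * ((n * 1125899907) `shiftR` 50)

main :: IO ()
main = do
  -- the magic constant: ceiling (2^50 / 10^6)
  print ((2 ^ (50 :: Int) + 999999) `div` 1000000 :: Int)
  -- spot-check against the ordinary modulus
  let samples = [0, 1, 999999, 1000000, 1999999, 123456789, 2 ^ (31 :: Int) - 1]
  print (all (\n -> fastMod n == n `mod` 1000000) samples)
```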
The trick is to think about how to write the whole damn algorithm at once, and then use unboxed vectors as your backing data type. For example, the following runs about 20 times faster on my machine than your code:
import qualified Data.Vector.Unboxed as V
combine :: Int -> Int -> Int
combine x y = (x+y) `mod` 1000000
pascal n = V.last $ go n where
  go 0 = V.replicate (n+1) 1
  go m = V.scanl1 combine (go (m-1))
I then wrote two main functions that called out to yours and mine with an argument of 4000; these ran in 10.42s and 0.54s respectively. Of course, as I'm sure you know, they both get blown out of the water (0.00s) by the version that uses a better algorithm:
pascal' :: Integer -> Integer
pascal :: Int -> Int
pascal' n = product [n+1..n*2] `div` product [2..n]
pascal = fromIntegral . (`mod` 1000000) . pascal' . fromIntegral

Space leak in list program

I am solving some problems of Project Euler in Haskell. I wrote a program for a riddle in it and it did not work as I expected.
When I looked in the task manager while running the program, I saw that it was using more than 1 gigabyte of RAM under GHC. A friend of mine wrote an equivalent program in Java, and it finished in 7 seconds.
import Data.List
opl = find vw $ map (\x-> fromDigits (x++[0,0,9]) )
  $ sequence [[1],re,[2],re,[3],re,[4],re,[5],re,[6],re,[7],re,[8],re]
vw x = hh^2 == x
  where hh = (round.sqrt.fromIntegral) x
re = [0..9]
fromDigits x = foldl1 (\n m->10*n+m) x
I know this program would output the number I want given enough RAM and time, but there has to be a better-performing way.
The main problem here is that sequence has a space leak. It is defined like this:
sequence [] = [[]]
sequence (xs:xss) = [ y:ys | y <- xs, ys <- sequence xss ]
so the problem is that the list produced by the recursive call sequence xss is re-used for each of the elements of xs, so it can't be discarded until the end. A version without the space leak is
myseq :: [[a]] -> [[a]]
myseq xs = go (reverse xs) []
  where
    go [] acc = [acc]
    go (xs:xss) acc = concat [ go xss (x:acc) | x <- xs ]
PS. the answer seems to be Just 1229314359627783009
Edit: a version avoiding the concat:
seqlists :: [[a]] -> [[a]]
seqlists xss = go (reverse xss) [] []
  where
    go [] acc rest = acc : rest
    go (xs:xss) acc rest = foldr (\y r -> go xss (y:acc) r) rest xs
Note that both of these versions generate the results in a different order from the standard sequence, so while they work for this problem, we can't use one as a specialised version of sequence.
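The ordering difference is easy to see on a tiny input. The sketch below repeats the concat-free version (with the inner variables renamed to avoid shadowing) and compares it with the standard sequence: the same combinations come out, but the last list varies slowest instead of fastest.

```haskell
import Data.List (sort)

seqlists :: [[a]] -> [[a]]
seqlists xss = go (reverse xss) [] []
  where
    -- acc is the combination built so far, rest is the output that follows
    go []       acc rest = acc : rest
    go (ys:yss) acc rest = foldr (\y r -> go yss (y:acc) r) rest ys

main :: IO ()
main = do
  let input = [[1, 2], [3, 4]] :: [[Int]]
  print (sequence input)   -- standard order
  print (seqlists input)   -- same elements, different order
  print (sort (seqlists input) == sort (sequence input))
```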
Following on from the answer given by Simon Marlow, here's a version of sequence that avoids the space leak while otherwise working just like the original, including preserving the order.
It still uses the nice, simple list comprehension of the original sequence - the only difference is that a fake data dependency is introduced that prevents the recursive call from being shared.
sequenceDummy d [] = d `seq` [[]]
sequenceDummy _ (xs:xss) = [ y:ys | y <- xs, ys <- sequenceDummy (Just y) xss ]
sequenceUnshared = sequenceDummy Nothing
I think this is a better way of avoiding the sharing that leads to the space leak.
I'd blame the excessive sharing on the "full laziness" transformation. Normally this does a great job of creating sharing that avoids recomputations, but sometimes recomputation is very much more efficient than storing shared results.
It'd be nice if there was a more direct way to tell the compiler not to share a specific expression - the above dummy Maybe argument works and is efficient, but it's basically a hack that's just complicated enough that ghc can't tell that there's no real dependency. (In a strict language you don't have these issues because you only have sharing where you explicitly bind a variable to a value.)
EDIT: I think I'm wrong here - changing the type signature to :: Maybe Word64 (which would be enough bits for this problem I think) also takes forever / has a space leak, so it couldn't be the old Integer bug.
Your problem seems to be an old GHC bug (that I thought was fixed) with Integer causing a space leak. The below code finishes in about 150 ms when compiled with -O2.
import Data.List
import Data.Word
main = print opl
opl :: Maybe Word32
opl = find vw $ map (\x-> fromDigits (x++[0,0,9]) ) $ sequence [[1],re,[2],re,[3],re,[4],re,[5],re,[6],re,[7],re,[8],re]
vw x = hh^2 == x
  where hh = (round.sqrt.fromIntegral) x
re = [0..9]
fromDigits x = foldl1 (\n m->10*n+m) x
Since you're looking for a nineteen-digit number with the characteristics checked in vw, I'd try to simplify the construction in the mapped function: just say fromDigits x*1000+9 for starters. Appending to a list is O(length-of-the-left-list), so throwing those last three digits on the end hurts the computation time a bunch.
As an aside (to you both), using the strict version of the fold (foldl1') will also help.
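Both suggestions can be sketched together. The name withSuffix is hypothetical, and Integer is chosen here only for illustration (the thread discusses several numeric types); the final line checks that the arithmetic shortcut agrees with appending the digits [0,0,9].

```haskell
import Data.List (foldl1')

-- Strict left fold over the digits, most significant first.
fromDigits :: [Integer] -> Integer
fromDigits = foldl1' (\n m -> 10 * n + m)

-- Equivalent to fromDigits (ds ++ [0,0,9]) without rebuilding the list.
withSuffix :: [Integer] -> Integer
withSuffix ds = fromDigits ds * 1000 + 9

main :: IO ()
main = print (withSuffix [1, 2, 3] == fromDigits ([1, 2, 3] ++ [0, 0, 9]))
```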

Haskell mutable map/tree

I am looking for a mutable (balanced) tree/map/hash table in Haskell, or a way to simulate one inside a function, i.e. when I call the same function several times, the structure is preserved. So far I have tried Data.HashTable (which is OK, but somewhat slow) and Data.Array.Judy, but I was unable to make the latter work with GHC 6.10.4. Are there any other options?
If you want mutable state, you can have it. Just keep passing the updated map around, or keep it in a state monad (which turns out to be the same thing).
import qualified Data.Map as Map
import Control.Monad.ST
import Data.STRef
memoize :: Ord k => (k -> ST s a) -> ST s (k -> ST s a)
memoize f = do
    mc <- newSTRef Map.empty
    return $ \k -> do
        c <- readSTRef mc
        case Map.lookup k c of
            Just a -> return a
            Nothing -> do a <- f k
                          writeSTRef mc (Map.insert k a c) >> return a
You can use this like so. (In practice, you might want to add a way to clear items from the cache, too.)
import Control.Monad
main :: IO ()
main = do
    fib <- stToIO $ fixST $ \fib -> memoize $ \n ->
        if n < 2 then return n else liftM2 (+) (fib (n-1)) (fib (n-2))
    mapM_ (print <=< stToIO . fib) [1..10000]
At your own risk, you can unsafely escape from the requirement of threading state through everything that needs it.
import System.IO.Unsafe
unsafeMemoize :: Ord k => (k -> a) -> k -> a
unsafeMemoize f = unsafePerformIO $ do
    f' <- stToIO $ memoize $ return . f
    return $ unsafePerformIO . stToIO . f'
fib :: Integer -> Integer
fib = unsafeMemoize $ \n -> if n < 2 then n else fib (n-1) + fib (n-2)
main :: IO ()
main = mapM_ (print . fib) [1..1000]
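The other option mentioned at the start of this answer, passing the updated map around by hand, can be sketched without ST or unsafePerformIO. The function name fibM and the choice of Fibonacci as the example are illustrative assumptions; the point is only that each call takes a cache and returns the result together with the extended cache.

```haskell
import qualified Data.Map as Map

-- Memoised Fibonacci that threads the cache explicitly:
-- every call receives a cache and returns a (possibly larger) one.
fibM :: Int -> Map.Map Int Integer -> (Integer, Map.Map Int Integer)
fibM n cache
  | n < 2 = (fromIntegral n, cache)
  | otherwise =
      case Map.lookup n cache of
        Just v  -> (v, cache)
        Nothing ->
          let (a, c1) = fibM (n - 1) cache
              (b, c2) = fibM (n - 2) c1
              v       = a + b
          in (v, Map.insert n v c2)

main :: IO ()
main = print (fst (fibM 50 Map.empty))
```

Wrapping the cache argument and result in a state monad hides the plumbing but performs the same threading.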
Building on @Ramsey's answer, I also suggest you reconceive your function to take a map and return a modified one. Then code using good ol' Data.Map, which is pretty efficient at modifications. Here is a pattern:
import qualified Data.Map as Map
-- | takes input and a map, and returns a result and a modified map
myFunc :: a -> Map.Map k v -> (r, Map.Map k v)
myFunc a m = … -- put your function here
-- | run myFunc over a list of inputs, gathering the outputs
mapFuncWithMap :: [a] -> Map.Map k v -> ([r], Map.Map k v)
mapFuncWithMap as m0 = foldr step ([], m0) as
    where step a (rs, m) = let (r, m') = myFunc a m in (r:rs, m')
-- this starts with an initial map, uses successive versions of the map
-- on each iteration, and returns a tuple of the results, and the final map
-- | run myFunc over a list of inputs, gathering the outputs
mapFunc :: [a] -> [r]
mapFunc as = fst $ mapFuncWithMap as Map.empty
-- same as above, but starts with an empty map, and ignores the final map
It is easy to abstract this pattern and make mapFuncWithMap generic over functions that use maps in this way.
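One possible abstraction, as a sketch: take the map-using function as a parameter. The names mapWithMap and countSeen are hypothetical. As in the pattern above, foldr hands the map from the last input towards the first, which the example makes visible.

```haskell
import qualified Data.Map as Map

-- Generic version of mapFuncWithMap: the step function is an argument.
mapWithMap :: (a -> Map.Map k v -> (r, Map.Map k v))
           -> [a] -> Map.Map k v -> ([r], Map.Map k v)
mapWithMap f xs m0 = foldr step ([], m0) xs
  where step a (rs, m) = let (r, m') = f a m in (r : rs, m')

-- Example step: report how often a character was already seen, then count it.
countSeen :: Char -> Map.Map Char Int -> (Int, Map.Map Char Int)
countSeen c m = (Map.findWithDefault 0 c m, Map.insertWith (+) c 1 m)

main :: IO ()
main = print (mapWithMap countSeen "abca" Map.empty)
```

If the map should instead flow from the first input to the last, Data.List.mapAccumL is the standard function with that shape (with the accumulator and element arguments in the opposite order).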
Although you ask for a mutable type, let me suggest that you use an immutable data structure and that you pass successive versions to your functions as an argument.
Regarding which data structure to use,
There is an implementation of red-black trees at Kent
If you have integer keys, Data.IntMap is extremely efficient.
If you have string keys, the bytestring-trie package from Hackage looks very good.
The problem is that I cannot use (or I don't know how to use) a non-mutable type.
If you're lucky, you can pass your table data structure as an extra parameter to every function that needs it. If, however, your table needs to be widely distributed, you may wish to use a state monad where the state is the contents of your table.
If you are trying to memoize, you can try some of the lazy memoization tricks from Conal Elliott's blog, but as soon as you go beyond integer arguments, lazy memoization becomes very murky—not something I would recommend you try as a beginner. Maybe you can post a question about the broader problem you are trying to solve? Often with Haskell and mutability the issue is how to contain the mutation or updates within some kind of scope.
It's not so easy learning to program without any global mutable variables.
If I read your comments right, then you have a structure with possibly ~500k total values to compute. The computations are expensive, so you want them done only once, and on subsequent accesses, you just want the value without recomputation.
In this case, use Haskell's laziness to your advantage! ~500k is not so big: Just build a map of all the answers, and then fetch as needed. The first fetch will force computation, subsequent fetches of the same answer will reuse the same result, and if you never fetch a particular computation - it never happens!
You can find a small implementation of this idea using 3D point distances as the computation in the file PointCloud.hs. That file uses Debug.Trace to log when the computation actually gets done:
> ghc --make PointCloud.hs
[1 of 1] Compiling Main ( PointCloud.hs, PointCloud.o )
Linking PointCloud ...
> ./PointCloud
(1,2)
(<calc (1,2)>)
Just 1.0
(1,2)
Just 1.0
(1,5)
(<calc (1,5)>)
Just 1.0
(1,2)
Just 1.0
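PointCloud.hs itself is not reproduced here, so the following is only a sketch of the same idea under assumed names (Point, distance, distances) and made-up sample points. Data.Map is lazy in its values, so each distance stays an unevaluated thunk until it is first looked up; the trace message (written to stderr) shows when a computation actually runs.

```haskell
import qualified Data.Map as Map
import Debug.Trace (trace)

type Point = (Double, Double, Double)

distance :: Point -> Point -> Double
distance (x1, y1, z1) (x2, y2, z2) = sqrt (sq (x1 - x2) + sq (y1 - y2) + sq (z1 - z2))
  where sq d = d * d

-- Table of all pairwise distances; values are computed only on demand.
distances :: [(Int, Point)] -> Map.Map (Int, Int) Double
distances pts = Map.fromList
  [ ((i, j), trace ("<calc " ++ show (i, j) ++ ">") (distance p q))
  | (i, p) <- pts, (j, q) <- pts, i < j ]

main :: IO ()
main = do
  let table = distances [(1, (0, 0, 0)), (2, (1, 0, 0)), (5, (0, 1, 0))]
  print (Map.lookup (1, 2) table)  -- first access forces the computation
  print (Map.lookup (1, 2) table)  -- second access reuses the result
  print (Map.lookup (1, 5) table)
```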
Are there any other options?
A mutable reference to a purely functional dictionary like Data.Map.
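A minimal sketch of that suggestion, assuming an IORef holding a Data.Map; the Table alias and the helper names are hypothetical. The map itself stays immutable, and only the reference is updated to point at each new version.

```haskell
import Data.IORef (IORef, newIORef, modifyIORef', readIORef)
import qualified Data.Map as Map

-- A mutable cell holding an immutable dictionary.
type Table = IORef (Map.Map String Int)

insertKey :: Table -> String -> Int -> IO ()
insertKey ref k v = modifyIORef' ref (Map.insert k v)

lookupKey :: Table -> String -> IO (Maybe Int)
lookupKey ref k = Map.lookup k <$> readIORef ref

main :: IO ()
main = do
  table <- newIORef Map.empty
  insertKey table "a" 1
  insertKey table "b" 2
  lookupKey table "b" >>= print
  lookupKey table "c" >>= print
```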
