Squeezing more performance out of monadic streams in Haskell

The most straightforward monadic 'stream' is just a list of monadic actions, Monad m => [m a]. The sequence :: [m a] -> m [a] function runs each monadic action and collects the results. As it turns out, though, sequence is not very efficient, because it operates on lists, and the monad is an obstacle to achieving fusion in anything but the simplest cases.
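For illustration, a minimal sketch of that naive representation (using randomIO from the random package as a stand-in monadic action):

import System.Random (randomIO)

-- A 'stream' as a plain list of monadic actions, consumed with sequence.
actions :: [IO Int]
actions = replicate 5 randomIO

main :: IO ()
main = do
  xs <- sequence actions -- run each action in order, collect the results
  print (sum xs)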
The question is: What is the most efficient approach for monadic streams?
To investigate this, I provide a toy problem along with a few attempts to improve performance. The source code can be found on github. The singular benchmark presented below may be misleading for more realistic problems, although I think it is a worst-case scenario of sorts, i.e. the most possible overhead per useful computation.
The toy problem
is a maximum-length 16-bit Linear Feedback Shift Register (LFSR), implemented in C in a somewhat over-elaborate way, with a Haskell wrapper in the IO monad. 'Over-elaborate' refers to the unnecessary use of a struct and its malloc - the purpose of this complication is to make it more similar to realistic situations where all you have is a Haskell wrapper around an FFI to a C struct with OO-ish new, set, get, operate semantics (i.e. very much the imperative style). A typical Haskell program looks like this:
import LFSR

main = do
  lfsr <- newLFSR        -- make an LFSR object
  setLFSR lfsr 42        -- initialise it with 42
  stepLFSR lfsr          -- do one update
  getLFSR lfsr >>= print -- extract the new value and print
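A sketch of what such a wrapper might look like (my reconstruction, not the real src/LFSR.hs from github; the C-side names new_lfsr and get_lfsr are assumptions, and the withForeignPtr calls here are exactly the overhead that Update 1 below tries to remove):

{-# LANGUAGE ForeignFunctionInterface #-}
module LFSR where

import Data.Word (Word32)
import Foreign.ForeignPtr
import Foreign.Ptr

data LFSRStruct -- opaque C-side state

newtype LFSR = LFSR (ForeignPtr LFSRStruct)

foreign import ccall unsafe "lfsr.h new_lfsr"
  c_new_lfsr :: IO (Ptr LFSRStruct)
foreign import ccall unsafe "lfsr.h set_lfsr"
  c_set_lfsr :: Ptr LFSRStruct -> Word32 -> IO ()
foreign import ccall unsafe "lfsr.h step_lfsr"
  c_step_lfsr :: Ptr LFSRStruct -> IO ()
foreign import ccall unsafe "lfsr.h get_lfsr"
  c_get_lfsr :: Ptr LFSRStruct -> IO Word32

newLFSR :: IO LFSR -- 'new' (finalizer for the C-side free omitted here)
newLFSR = LFSR <$> (c_new_lfsr >>= newForeignPtr_)

setLFSR :: LFSR -> Word32 -> IO () -- 'set'
setLFSR (LFSR fp) v = withForeignPtr fp (`c_set_lfsr` v)

stepLFSR :: LFSR -> IO () -- 'operate'
stepLFSR (LFSR fp) = withForeignPtr fp c_step_lfsr

getLFSR :: LFSR -> IO Word32 -- 'get'
getLFSR (LFSR fp) = withForeignPtr fp c_get_lfsr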
The default task is to calculate the average of the values (doubles) of 10'000'000 iterations of the LFSR. This task could be part of a suite of tests to analyse the 'randomness' of this stream of 16-bit integers.
0. Baseline
The baseline is the C implementation of the average over n iterations:
double avg(state_t* s, int n) {
    double sum = 0;
    for (int i = 0; i < n; i++, sum += step_get_lfsr(s));
    return sum / (double)n;
}
The C implementation isn't meant to be particularly good, or fast. It just provides a meaningful computation.
1. Haskell lists
Compared to the C baseline, on this task Haskell lists are 73x slower.
=== RunAvg =========
Baseline: 1.874e-2
IO: 1.382488
factor: 73.77203842049093
This is the implementation (RunAvg.hs):
step1 :: LFSR -> IO Word32
step1 lfsr = stepLFSR lfsr >> getLFSR lfsr

avg :: LFSR -> Int -> IO Double
avg lfsr n = mean <$> replicateM n (step1 lfsr) where
  mean :: [Word32] -> Double
  mean vs = (sum $ fromIntegral <$> vs) / fromIntegral n
2. Using the streaming library
This gets us to within 9x of the baseline,
=== RunAvgStreaming ===
Baseline: 1.9391e-2
IO: 0.168126
factor: 8.670310969006241
(Note that the benchmarks are rather inaccurate at these short execution times.)
This is the implementation (RunAvgStreaming.hs):
import qualified Streaming.Prelude as S
import Streaming (Of (..)) -- for the :> pattern

avg :: LFSR -> Int -> IO Double
avg lfsr n = do
  let stream = S.replicateM n (fromIntegral <$> step1 lfsr :: IO Double)
  (mySum :> _) <- S.sum stream
  return (mySum / fromIntegral n)
3. Using Data.Vector.Fusion.Stream.Monadic
This gives the best performance so far, within 3x of baseline,
=== RunVector =========
Baseline: 1.9986e-2
IO: 4.9146e-2
factor: 2.4590213149204443
As usual, here is the implementation (RunAvgVector.hs):
import qualified Data.Vector.Fusion.Stream.Monadic as V

avg :: LFSR -> Int -> IO Double
avg lfsr n = do
  let stream = V.replicateM n (step1' lfsr) -- step1', presumably step1 plus fromIntegral, yields IO Double
  s <- V.foldl (+) 0.0 stream
  return (s / fromIntegral n)
I didn't expect to find a good monadic stream implementation under Data.Vector. Other than providing fromVector and concatVectors, Data.Vector.Fusion.Stream.Monadic has very little to do with Vector from Data.Vector.
A look at the profiling report suggests a considerable space leak in Data.Vector.Fusion.Stream.Monadic, but that doesn't sound right.
4. Lists aren't necessarily slow
For very simple operations lists aren't terrible at all:
=== RunRepeat =======
Baseline: 1.8078e-2
IO: 3.6253e-2
factor: 2.0053656377917912
Here, the for loop is done in Haskell instead of pushing it down to C (RunRepeat.hs):
do
  setLFSR lfsr 42
  replicateM_ nIter (stepLFSR lfsr)
  getLFSR lfsr
This is just a repetition of calls to stepLFSR without passing the result back to the Haskell layer. It gives an indication of what impact the overhead for calling the wrapper and the FFI has.
Analysis
The repeat example above shows that most, but not all (?), of the performance penalty comes from the overhead of calling the wrapper and/or the FFI. But I'm not sure where to look for further tweaks. Maybe this is as good as it gets with regards to monadic streams, and in fact it is all about trimming down the FFI from here on...
Sidenotes
The fact that LFSRs are chosen as a toy problem does not imply that Haskell isn't able to do these efficiently - see the SO question "Efficient bit-fiddling in a LFSR implementation".
Iterating a 16-bit LFSR 10M times is a rather silly thing to do. It will take at most 2^16-1 iterations to reach the starting state again. In a maximum length LFSR it will take exactly 2^16-1 iterations.
Update 1
An attempt to remove the withForeignPtr calls can be made by introducing a Storable instance and then using alloca :: Storable a => (Ptr a -> IO b) -> IO b:
repeatSteps :: Word32 -> Int -> IO Word32
repeatSteps start n = alloca rep where
  rep :: Ptr LFSRStruct' -> IO Word32
  rep p = do
    setLFSR2 p start
    (sequence_ . replicate n) (stepLFSR2 p)
    getLFSR2 p
where LFSRStruct' is
data LFSRStruct' = LFSRStruct' CUInt
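The Storable instance itself is not shown here; a minimal sketch, assuming the C struct holds nothing but a single unsigned int:

import Foreign.C.Types (CUInt)
import Foreign.Ptr (castPtr)
import Foreign.Storable

-- Mirror the single-field C struct field-for-field.
instance Storable LFSRStruct' where
  sizeOf _    = sizeOf (undefined :: CUInt)
  alignment _ = alignment (undefined :: CUInt)
  peek p      = LFSRStruct' <$> peek (castPtr p)
  poke p (LFSRStruct' v) = poke (castPtr p) v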
and the wrapper is
foreign import ccall unsafe "lfsr.h set_lfsr"
  setLFSR2 :: Ptr LFSRStruct' -> Word32 -> IO ()
-- likewise for stepLFSR2, getLFSR2, ...
See RunRepeatAlloca.hs and src/LFSR.hs. Performance-wise this makes no difference (within timing variance).
=== RunRepeatAlloca =======
Baseline: 0.19811199999999998
IO: 0.33433
factor: 1.6875807623970283

After deciphering GHC's assembly output for RunRepeat.hs I come to this conclusion: GHC won't inline the call to the C function step_lfsr(state_t*), whereas the C compiler will, and that makes all the difference for this toy problem.
I can demonstrate this by forbidding inlining with the __attribute__ ((noinline)) attribute. Overall, the C executable gets slower, hence the gap between Haskell and C closes.
Here are the results:
=== RunRepeat =======
#iter: 100000000
Baseline: 0.334414
IO: 0.325433
factor: 0.9731440669349967
=== RunRepeatAlloca =======
#iter: 100000000
Baseline: 0.330629
IO: 0.333735
factor: 1.0093942152684732
=== RunRepeatLoop =====
#iter: 100000000
Baseline: 0.33195399999999997
IO: 0.33791
factor: 1.0179422450098508
I.e. there is no penalty for FFI calls to step_lfsr anymore.
=== RunAvg =========
#iter: 10000000
Baseline: 3.4072e-2
IO: 1.3602589999999999
factor: 39.92307466541442
=== RunAvgStreaming ===
#iter: 50000000
Baseline: 0.191264
IO: 0.666438
factor: 3.484388070938598
Good old lists don't fuse, hence the huge performance hit, and the streaming library also isn't optimal. But Data.Vector.Fusion.Stream.Monadic gets within 20% of the C performance:
=== RunVector =========
#iter: 200000000
Baseline: 0.705265
IO: 0.843916
factor: 1.196594188000255
It has been observed before that GHC doesn't inline FFI calls: see the SO question "How to force GHC to inline FFI calls?".
For situations where the benefit of inlining is so high, i.e. the workload per FFI call is so low, it might be worth looking into inline-c.
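A minimal sketch of that route (self-contained rather than linking the article's lfsr.h: the 16-bit LFSR step is written directly inside the inline-c block, so the C compiler sees the whole loop and can inline everything; the extra build setup that inline-c requires is omitted):

{-# LANGUAGE QuasiQuotes     #-}
{-# LANGUAGE TemplateHaskell #-}
import qualified Language.C.Inline as C

-- Average over n steps of a maximum-length 16-bit Fibonacci LFSR
-- (taps 16,14,13,11); the entire loop runs on the C side.
avgInline :: C.CInt -> IO C.CDouble
avgInline n = [C.block| double {
    unsigned lfsr = 42;
    double sum = 0;
    for (int i = 0; i < $(int n); i++) {
      unsigned bit = ((lfsr >> 0) ^ (lfsr >> 2) ^ (lfsr >> 3) ^ (lfsr >> 5)) & 1u;
      lfsr = ((lfsr >> 1) | (bit << 15)) & 0xFFFFu;
      sum += (double)lfsr;
    }
    return sum / (double)$(int n);
  } |]

main :: IO ()
main = avgInline 10000000 >>= print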

Related

Iterate State Monad and Collect Results in Sequence with Good Performance

I implemented the following function:
iterateState :: Int -> (a -> State s a) -> (a -> State s [a])
iterateState 0 f a = return []
iterateState n f a = do
  b <- f a
  xs <- iterateState (n - 1) f b
  return $ b : xs
My primary use case is for a = Double. It works, but it is very slow. It allocates 528MB of heap space to produce a list of 1M Double values and spends most of its time doing garbage collection.
I have experimented with implementations that work on the type s -> (a, s) directly as well as with various strictness annotations. I was able to reduce the heap allocation somewhat, but not even close to what one would expect from a reasonable implementation. I suspect that the resulting ([a], s) being a combination of something to be consumed lazily ([a]) and something whose WHNF forces the entire computation (s) makes optimization difficult for GHC.
Assuming that the iterative nature of lists would be unsuitable for this situation, I turned to the vector package. To my delight, it already contains
iterateNM :: (Monad m, Unbox a) => Int -> (a -> m a) -> a -> m (Vector a)
Unfortunately, this is only slightly faster than my list implementation, still allocating 328MB of heap space. I assumed that this is because it uses unstreamM, whose description reads
Load monadic stream bundle into a newly allocated vector. This function goes through a list, so prefer using unstream, unless you need to be in a monad.
Looking at its behavior for the list monad, it is understandable that there is no efficient implementation for general monads. Luckily, I only need the state monad, and I found another function that almost fits the signature of the state monad.
unfoldrExactN :: Unbox a => Int -> (b -> (a, b)) -> b -> Vector a
This function is blazingly fast and performs no excess heap allocation beyond the 8MB needed to hold the resulting unboxed vector of 1M Double values. Unfortunately, it does not return the final state at the end of the computation, so it cannot be wrapped in the State type.
I looked at the implementation of unfoldrExactN to see if I could adjust it to expose the final state at the end of the computation. Unfortunately, this seems to be difficult, as the stream constructed by
unfoldrExactN :: Monad m => Int -> (s -> (a, s)) -> s -> Stream m a
which is eventually expanded into a vector by unstream has already forgotten the state type s.
I imagine I could circumvent the entire Stream infrastructure and implement iterateState directly on mutable vectors in the ST monad (similarly to how unstream expands a stream into a vector). However, I would lose all the benefits of stream fusion, as well as turning a computation that is easily expressed as a pure function into imperative low-level mush just for performance reasons. This is particularly frustrating while knowing that the existing unfoldrExactN already calculates all the values I want, but I have no access to them.
Is there a better way?
Can this function be implemented in a purely functional way with reasonable performance and no excess heap allocations? Preferably in a way that ties into the vector package and its stream fusion infrastructure.
The following program has 12MB max residency on my computer when compiled with optimizations:
import Data.Vector.Unboxed
import Data.Vector.Unboxed.Mutable

iterateNState :: Unbox a => Int -> (a -> s -> (s, a)) -> (a -> s -> (s, Vector a))
iterateNState n f a0 s0 = createT (unsafeNew n >>= go 0 a0 s0) where
  go i a s arr
    | i >= n    = pure (s, arr)
    | otherwise = do
        unsafeWrite arr i a
        case f a s of
          (s', a') -> go (i+1) a' s' arr

main = id
  . print
  . Data.Vector.Unboxed.sum
  . snd
  $ iterateNState 1000000 (\a s -> (s+1, a+s :: Int)) 0 0
(It continues to have a nice low residency even when the final two 0s are read from input dynamically.)

Performance of replicateM in IO monad

I have an IO action called action in which I am doing a fairly heavy computation. I am using the IO monad in order to have easy access to random numbers further down in the computation.
I also have the below functions which replicate the action and take the mean of the result. Because the action takes a fair amount of time to complete, I was wondering what the effect on the performance is from doing the sampling in this manner. Would it be better to do the sampling later in the evaluation so that parts of the program that are the same for each sample are not repeated or does the Haskell compiler optimise this already?
samp :: (Fractional b) => Int -> IO b -> IO b
samp n action = do
  samples <- replicateM n action
  return $ mean samples

mean :: (Fractional a) => [a] -> a
mean as = s / genericLength as
  where s = sum as
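(For what it's worth, the list of samples can be avoided entirely by accumulating a strict running sum while the actions execute; a minimal sketch, not from the original question:)

{-# LANGUAGE BangPatterns #-}
import Control.Monad (foldM)

-- One-pass mean: run the action n times, keeping only a strict sum.
sampStrict :: Fractional b => Int -> IO b -> IO b
sampStrict n action = do
  !s <- foldM (\ !acc _ -> (acc +) <$> action) 0 [1 .. n]
  return (s / fromIntegral n)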

Efficiency of unfoldr versus zipWith

Over on Code Review, I answered a question about a naive Haskell fizzbuzz solution by suggesting an implementation that iterates forward, avoiding the quadratic cost of the increasing number of primes and discarding modulo division (almost) entirely. Here's the code:
fizz :: Int -> String
fizz = const "fizz"

buzz :: Int -> String
buzz = const "buzz"

fizzbuzz :: Int -> String
fizzbuzz = const "fizzbuzz"

fizzbuzzFuncs = cycle [show, show, fizz, show, buzz, fizz, show, show, fizz, buzz, show, fizz, show, show, fizzbuzz]

toFizzBuzz :: Int -> Int -> [String]
toFizzBuzz start count =
  let offsetFuncs = drop (mod (start - 1) 15) fizzbuzzFuncs
  in take count $ zipWith ($) offsetFuncs [start..]
As a further prompt, I suggested rewriting it using Data.List.unfoldr. The unfoldr version is an obvious, simple modification to this code so I'm not going to type it here unless people seeking to answer my question insist that is important (no spoilers for the OP over on Code Review). But I do have a question about the relative efficiency of the unfoldr solution compared to the zipWith one. While I am no longer a Haskell neophyte, I am no expert on Haskell internals.
An unfoldr solution does not require the [start..] infinite list, since it can simply unfold from start. My thoughts are
The zipWith solution does not memoize each successive element of [start..] as it is asked for. Each element is used and discarded because no reference to the head of [start..] is kept. So there is no more memory consumed there than with unfoldr.
Discussions about the performance of unfoldr, and the recent patches to make it always inline, are conducted at a level which I have not yet reached.
So I think the two are equivalent in memory consumption but have no idea about the relative performance. Hoping more informed Haskellers can direct me towards an understanding of this.
unfoldr seems a natural thing to use to generate sequences, even if other solutions are more expressive. I just know I need to understand more about its actual performance. (For some reason I find foldr much easier to comprehend on that level.)
Note: unfoldr's use of Maybe was the first potential performance issue that occurred to me, before I even started investigating the issue (and the only bit of the optimisation/inlining discussions that I fully understood). So I was able to stop worrying about Maybe right away (given a recent version of Haskell).
As the one responsible for the recent changes in the implementations of zipWith and unfoldr, I figured I should probably take a stab at this. I can't really compare them so easily, because they're very different functions, but I can try to explain some of their properties and the significance of the changes.
unfoldr
Inlining
The old version of unfoldr (before base-4.8/GHC 7.10) was recursive at the top level (it called itself directly). GHC never inlines recursive functions, so unfoldr was never inlined. As a result, GHC could not see how it interacted with the function it was passed. The most troubling effect of this was that the function passed in, of type (b -> Maybe (a, b)), would actually produce Maybe (a, b) values, allocating memory to hold the Just and (,) constructors. By restructuring unfoldr as a "worker" and a "wrapper", the new code allows GHC to inline it and (in many cases) fuse it with the function passed in, so the extra constructors are stripped away by compiler optimizations.
For example, under GHC 7.10, the code
module Blob where

import Data.List

bloob :: Int -> [Int]
bloob k = unfoldr go 0 where
  go n | n == k    = Nothing
       | otherwise = Just (n * 2, n + 1)
compiled with ghc -O2 -ddump-simpl -dsuppress-all -dno-suppress-type-signatures leads to the core
$wbloob :: Int# -> [Int]
$wbloob =
  \ (ww_sYv :: Int#) ->
    letrec {
      $wgo_sYr :: Int# -> [Int]
      $wgo_sYr =
        \ (ww1_sYp :: Int#) ->
          case tagToEnum# (==# ww1_sYp ww_sYv) of _ {
            False -> : (I# (*# ww1_sYp 2)) ($wgo_sYr (+# ww1_sYp 1));
            True -> []
          }; } in
    $wgo_sYr 0

bloob :: Int -> [Int]
bloob =
  \ (w_sYs :: Int) ->
    case w_sYs of _ { I# ww1_sYv -> $wbloob ww1_sYv }
Fusion
The other change to unfoldr was rewriting it to participate in "fold/build" fusion, an optimization framework used in GHC's list libraries. The idea of both "fold/build" fusion and the newer, differently balanced, "stream fusion" (used in the vector library) is that if a list is produced by a "good producer", transformed by "good transformers", and consumed by a "good consumer", then the list conses never actually need to be allocated at all. The old unfoldr was not a good producer, so if you produced a list with unfoldr and consumed it with, say, foldr, the pieces of the list would be allocated (and immediately become garbage) as computation proceeded. Now, unfoldr is a good producer, so you can write a loop using, say, unfoldr, filter, and foldr, and not (necessarily) allocate any memory at all.
For example, given the above definition of bloob, and a stern {-# INLINE bloob #-} (this stuff is a bit fragile; good producers sometimes need to be inlined explicitly to be good), the code
hooby :: Int -> Int
hooby = sum . bloob
compiles to the GHC core
$whooby :: Int# -> Int#
$whooby =
  \ (ww_s1oP :: Int#) ->
    letrec {
      $wgo_s1oL :: Int# -> Int# -> Int#
      $wgo_s1oL =
        \ (ww1_s1oC :: Int#) (ww2_s1oG :: Int#) ->
          case tagToEnum# (==# ww1_s1oC ww_s1oP) of _ {
            False -> $wgo_s1oL (+# ww1_s1oC 1) (+# ww2_s1oG (*# ww1_s1oC 2));
            True -> ww2_s1oG
          }; } in
    $wgo_s1oL 0 0

hooby :: Int -> Int
hooby =
  \ (w_s1oM :: Int) ->
    case w_s1oM of _ { I# ww1_s1oP ->
      case $whooby ww1_s1oP of ww2_s1oT { __DEFAULT -> I# ww2_s1oT }
    }
which has no lists, no Maybes, and no pairs; the only allocation it performs is the Int used to store the final result (the application of I# to ww2_s1oT). The entire computation can reasonably be expected to be performed in machine registers.
zipWith
zipWith has a bit of a weird story. It fits into the fold/build framework a bit awkwardly (I believe it works quite a bit better with stream fusion). It is possible to make zipWith fuse with either its first or its second list argument, and for many years, the list library tried to make it fuse with either, if either was a good producer. Unfortunately, making it fuse with its second list argument can make a program less defined under certain circumstances. That is, a program using zipWith could work just fine when compiled without optimization, but produce an error when compiled with optimization. This is not a great situation. Therefore, as of base-4.8, zipWith no longer attempts to fuse with its second list argument. If you want it to fuse with a good producer, that good producer had better be in the first list argument.
Specifically, the reference implementation of zipWith leads to the expectation that, say, zipWith (+) [1,2,3] (1 : 2 : 3 : undefined) will give [2,4,6], because it stops as soon as it hits the end of the first list. With the previous zipWith implementation, if the second list looked like that but was produced by a good producer, and if zipWith happened to fuse with it rather than the first list, then it would go boom.

Why summing native lists is slower than summing church-encoded lists with `GHC -O2`?

In order to test how church-encoded lists perform against user-defined lists and native lists, I've prepared 3 benchmarks:
User-defined lists
data List a = Cons a (List a) | Nil deriving Show

lenumTil n = go n Nil where
  go 0 result = result
  go n result = go (n-1) (Cons (n-1) result)

lsum Nil = 0
lsum (Cons h t) = h + lsum t

main = print (lsum (lenumTil (100000000 :: Int)))
Native lists
main = print $ sum ([0..100000000-1] :: [Int])
Church lists
fsum = (\ a -> (a (+) 0))

fenumTil n cons nil = go n nil where
  go 0 result = result
  go n result = go (n-1) (cons (n-1) result)

main = print $ (fsum (fenumTil (100000000 :: Int)) :: Int)
The benchmark results are unexpected:
User-defined lists
-- 4999999950000000
-- real 0m22.520s
-- user 0m59.815s
-- sys 0m20.327s
Native Lists
-- 4999999950000000
-- real 0m0.999s
-- user 0m1.357s
-- sys 0m0.252s
Church Lists
-- 4999999950000000
-- real 0m0.010s
-- user 0m0.002s
-- sys 0m0.003s
One would expect that, with the huge amount of specific optimizations targeted at native lists, they would perform the best. Yet church lists outperform them by a 100x factor, and outperform user-defined ADTs by a 2250x factor. I've compiled all programs with GHC -O2. I've tried replacing sum by foldl', with the same result. I've attempted adding user input to make sure the church-list version wasn't optimized down to a constant. arkeet pointed out on #haskell that, by inspecting the Core, the native version allocates an intermediate list - but why? Forcing allocation with an additional reverse, all 3 perform roughly the same.
GHC 7.10 has call arity analysis, which lets us define foldl in terms of foldr and thus let left folds, including sum, participate in fusion. GHC 7.8 also defines sum with foldl but it can't fuse the lists away. Thus GHC 7.10 performs optimally and identically to the Church version.
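The key definition is roughly the following (a sketch of the idea, not the exact base source): foldl is expressed through foldr so that it can take part in fold/build fusion, and call arity analysis lets GHC turn the resulting chain of closures into a tight loop.

-- Each element contributes a function that transforms the accumulator;
-- the composed function is finally applied to the initial value z.
foldlViaFoldr :: (b -> a -> b) -> b -> [a] -> b
foldlViaFoldr k z xs = foldr (\x fn -> \acc -> fn (k acc x)) id xs z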
The Church version is child's play to optimize in either GHC versions. We just have to inline (+) and 0 into fenumTil, and then we have a patently tail-recursive go which can be readily unboxed and then turned into a loop by the code generator.
The user-defined version is not tail-recursive and it works in linear space, which wrecks performance, of course.
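For comparison, an accumulator-based variant (mine, not from the original question) is tail-recursive and runs in constant space, at the cost of a bang to keep the accumulator strict:

{-# LANGUAGE BangPatterns #-}

-- Tail-recursive sum over the user-defined List; the strict accumulator
-- avoids both the deep stack of the naive lsum and a chain of thunks.
lsum' :: Num a => List a -> a
lsum' = go 0 where
  go !acc Nil        = acc
  go !acc (Cons h t) = go (acc + h) t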

How does one write efficient Dynamic Programming algorithms in Haskell?

I've been playing around with dynamic programming in Haskell. Practically every tutorial I've seen on the subject gives the same, very elegant algorithm based on memoization and the laziness of the Array type. Inspired by those examples, I wrote the following algorithm as a test:
-- pascal n returns the nth entry on the main diagonal of pascal's triangle
-- (mod a million for efficiency)
pascal :: Int -> Int
pascal n = p ! (n,n) where
  p = listArray ((0,0),(n,n)) [f (i,j) | i <- [0 .. n], j <- [0 .. n]]
  f :: (Int,Int) -> Int
  f (_,0) = 1
  f (0,_) = 1
  f (i,j) = (p ! (i, j-1) + p ! (i-1, j)) `mod` 1000000
My only problem is efficiency. Even using GHC's -O2, this program takes 1.6 seconds to compute pascal 1000, which is about 160 times slower than an equivalent unoptimized C++ program. And the gap only widens with larger inputs.
It seems like I've tried every possible permutation of the above code, along with suggested alternatives like the data-memocombinators library, and they all had the same or worse performance. The one thing I haven't tried is the ST monad, which I'm sure could be made to run only slightly slower than the C version. But I'd really like to write it in idiomatic Haskell, and I don't understand why the idiomatic version is so inefficient. I have two questions:
Why is the above code so inefficient? It seems like a straightforward iteration through a matrix, with an arithmetic operation at each entry. Clearly Haskell is doing something behind the scenes I don't understand.
Is there a way to make it much more efficient (at most 10-15 times the runtime of a C program) without sacrificing its stateless, recursive formulation (vis-a-vis an implementation using mutable arrays in the ST Monad)?
Thanks a lot.
Edit: The array module used is the standard Data.Array
Well, the algorithm could be designed a little better. Using the vector package and being smart about only keeping one row in memory at a time, we can get something that's idiomatic in a different way:
{-# LANGUAGE BangPatterns #-}
import Data.Vector.Unboxed
import Prelude hiding (replicate, tail, scanl)

pascal :: Int -> Int
pascal !n = go 1 (replicate (n+1) 1 :: Vector Int) where
  go !i !prevRow
    | i <= n    = go (i+1) (scanl f 1 (tail prevRow))
    | otherwise = prevRow ! n
  f x y = (x + y) `rem` 1000000
This optimizes down very tightly, especially because the vector package includes some rather ingenious tricks to transparently optimize array operations written in an idiomatic style.
1. Why is the above code so inefficient? It seems like a straightforward iteration through a matrix, with an arithmetic operation at each entry. Clearly Haskell is doing something behind the scenes I don't understand.
The problem is that the code writes thunks to the array. Then when entry (n,n) is read, the evaluation of the thunks jumps all over the array again, recurring until finally a value not needing further recursion is found. That causes a lot of unnecessary allocation and inefficiency.
The C++ code doesn't have that problem: the values are written and then read directly, without requiring further evaluation - just as would happen with an STUArray. Does
p = runSTUArray $ do
  arr <- newArray ((0,0),(n,n)) 1
  forM_ [1 .. n] $ \i ->
    forM_ [1 .. n] $ \j -> do
      a <- readArray arr (i,j-1)
      b <- readArray arr (i-1,j)
      writeArray arr (i,j) $! (a+b) `rem` 1000000
  return arr
really look so bad?
2. Is there a way to make it much more efficient (at most 10-15 times the runtime of a C program) without sacrificing its stateless, recursive formulation (vis-a-vis an implementation using mutable arrays in the ST Monad)?
I don't know of one. But there might be.
Addendum:
Once one uses STUArrays or unboxed Vectors, there's still a significant difference to the equivalent C implementation. The reason is that gcc replaces the % by a combination of multiplications, shifts and subtractions (even without optimisations), since the modulus is known. Doing the same by hand in Haskell (since GHC doesn't [yet] do that),
-- fast modulo 1000000
-- for nonnegative Ints < 2^31
-- requires 64-bit Ints
fastMod :: Int -> Int
fastMod n = n - 1000000 * ((n * 1125899907) `shiftR` 50)
gets the Haskell versions on par with C.
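A quick sanity check of the magic constant (a QuickCheck sketch, assuming a 64-bit Int and the fastMod definition above):

import Data.Bits (shiftR)
import Test.QuickCheck

-- fastMod must agree with `mod` 1000000 on the documented input range.
prop_fastMod :: Property
prop_fastMod = forAll (choose (0, 2 ^ 31 - 1)) $ \n ->
  fastMod (n :: Int) === n `mod` 1000000

main :: IO ()
main = quickCheck prop_fastMod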
The trick is to think about how to write the whole damn algorithm at once, and then use unboxed vectors as your backing data type. For example, the following runs about 20 times faster on my machine than your code:
import qualified Data.Vector.Unboxed as V

combine :: Int -> Int -> Int
combine x y = (x+y) `mod` 1000000

pascal n = V.last $ go n where
  go 0 = V.replicate (n+1) 1
  go m = V.scanl1 combine (go (m-1))
I then wrote two main functions that called out to yours and mine with an argument of 4000; these ran in 10.42s and 0.54s respectively. Of course, as I'm sure you know, they both get blown out of the water (0.00s) by the version that uses a better algorithm:
pascal' :: Integer -> Integer
pascal' n = product [n+1..n*2] `div` product [2..n]

pascal :: Int -> Int
pascal = fromIntegral . (`mod` 1000000) . pascal' . fromIntegral
