Profiling a Haskell program - performance

I have a piece of code that repeatedly samples from a probability distribution using sequence. Morally, it does something like this:
sampleMean :: MonadRandom m => Int -> m Float -> m Float
sampleMean n dist = do
xs <- sequence (replicate n dist)
return (sum xs)
Except that it's a bit more complicated. The actual code I'm interested in is the function likelihoodWeighting at this Github repo.
I noticed that the running time scales nonlinearly with n. In particular, once n exceeds a certain value it hits the memory limit, and the running time explodes. I'm not certain, but I think this is because sequence is building up a long list of thunks which aren't getting evaluated until the call to sum.
Once I get past about 100,000 samples, the program slows to a crawl. I'd like to optimize this (my feeling is that 10 million samples shouldn't be a problem) so I decided to profile it - but I'm having a little trouble understanding the output of the profiler.
Profiling
I created a short executable in a file main.hs that runs my function with 100,000 samples. Here's the output from doing
$ ghc -O2 -rtsopts main.hs
$ ./main +RTS -s
First things I notice - it allocates nearly 1.5 GB of heap, and spends 60% of its time on garbage collection. Is this generally indicative of too much laziness?
1,377,538,232 bytes allocated in the heap
1,195,050,032 bytes copied during GC
169,411,368 bytes maximum residency (12 sample(s))
7,360,232 bytes maximum slop
423 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 2574 collections, 0 parallel, 2.40s, 2.43s elapsed
Generation 1: 12 collections, 0 parallel, 1.07s, 1.28s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.92s ( 1.94s elapsed)
GC time 3.47s ( 3.70s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.23s ( 0.23s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 5.63s ( 5.87s elapsed)
%GC time 61.8% (63.1% elapsed)
Alloc rate 716,368,278 bytes per MUT second
Productivity 34.2% of total user, 32.7% of total elapsed
Here are the results from
$ ./main +RTS -p
The first time I ran this, it turned out that there was one function being called repeatedly, and it turned out I could memoize it, which sped things up by a factor of 2. It didn't solve the space leak, however.
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
main Main 434 4 0.0 0.0 100.0 100.0
likelihoodWeighting AI.Probability.Bayes 445 1 0.0 0.3 100.0 100.0
distributionLW AI.Probability.Bayes 448 1 0.0 2.6 0.0 2.6
getSampleLW AI.Probability.Bayes 446 100000 20.0 50.4 100.0 97.1
bnProb AI.Probability.Bayes 458 400000 0.0 0.0 0.0 0.0
bnCond AI.Probability.Bayes 457 400000 6.7 0.8 6.7 0.8
bnVals AI.Probability.Bayes 455 400000 20.0 6.3 26.7 7.1
bnParents AI.Probability.Bayes 456 400000 6.7 0.8 6.7 0.8
bnSubRef AI.Probability.Bayes 454 800000 13.3 13.5 13.3 13.5
weightedSample AI.Probability.Bayes 447 100000 26.7 23.9 33.3 25.3
bnProb AI.Probability.Bayes 453 100000 0.0 0.0 0.0 0.0
bnCond AI.Probability.Bayes 452 100000 0.0 0.2 0.0 0.2
bnVals AI.Probability.Bayes 450 100000 0.0 0.3 6.7 0.5
bnParents AI.Probability.Bayes 451 100000 6.7 0.2 6.7 0.2
bnSubRef AI.Probability.Bayes 449 200000 0.0 0.7 0.0 0.7
Here's a heap profile. I don't know why it claims the runtime is 1.8 seconds - this run took about 6 seconds.
Can anyone help me to interpret the output of the profiler - i.e. to identify where the bottleneck is, and provide suggestions for how to speed things up?

A huge improvement has already been achieved by incorporating JohnL's suggestion of using foldM in likelihoodWeighting. That reduced memory usage about tenfold here, and brought down the GC times significantly to almost or actually negligible.
A profiling run with the current source yields
probabilityIO AI.Util.Util 26.1 42.4 413 290400000
weightedSample.go AI.Probability.Bayes 16.1 19.1 255 131200080
bnParents AI.Probability.Bayes 10.8 1.2 171 8000384
bnVals AI.Probability.Bayes 10.4 7.8 164 53603072
bnCond AI.Probability.Bayes 7.9 1.2 125 8000384
ndSubRef AI.Util.Array 4.8 9.2 76 63204112
bnSubRef AI.Probability.Bayes 4.7 8.1 75 55203072
likelihoodWeighting.func AI.Probability.Bayes 3.3 2.8 53 19195128
%! AI.Util.Util 3.3 0.5 53 3200000
bnProb AI.Probability.Bayes 2.5 0.0 40 16
bnProb.p AI.Probability.Bayes 2.5 3.5 40 24001152
likelihoodWeighting AI.Probability.Bayes 2.5 2.9 39 20000264
likelihoodWeighting.func.x AI.Probability.Bayes 2.3 0.2 37 1600000
and 13MB memory usage reported by -s, ~5MB maximum residency. That's not too bad already.
Still, there remain some points we can improve. First, a relatively minor thing, in the grand scheme, AI.UTIl.Array.ndSubRef:
ndSubRef :: [Int] -> Int
ndSubRef ns = sum $ zipWith (*) (reverse ns) (map (2^) [0..])
Reversing the list, and mapping (2^) over another list is inefficient, better is
ndSubRef = L.foldl' (\a d -> 2*a + d) 0
which doesn't need to keep the entire list in memory (probably not a big deal, since the lists will be short) as reversing it does, and doesn't need to allocate a second list. The reduction in allocation is noticeable, about 10%, and that part runs measurably faster,
ndSubRef AI.Util.Array 1.7 1.3 24 8000384
in the profile of the modified run, but since it takes only a small part of the overall time, the overall impact is small. There are potentially bigger fish to fry in weightedSample and likelihoodWeighting.
Let's add a bit of strictness in weightedSample to see how that changes things:
weightedSample :: Ord e => BayesNet e -> [(e,Bool)] -> IO (Map e Bool, Prob)
weightedSample bn fixed =
go 1.0 (M.fromList fixed) (bnVars bn)
where
go w assignment [] = return (assignment, w)
go w assignment (v:vs) = if v `elem` vars
then
let w' = w * bnProb bn assignment (v, fixed %! v)
in go w' assignment vs
else do
let p = bnProb bn assignment (v,True)
x <- probabilityIO p
go w (M.insert v x assignment) vs
vars = map fst fixed
The weight parameter of go is never forced, nor is the assignment parameter, thus they can build up thunks. Let's enable {-# LANGUAGE BangPatterns #-} and force updates to take effect immediately, also evaluate p before passing it to probabilityIO:
go w assignment (v:vs) = if v `elem` vars
then
let !w' = w * bnProb bn assignment (v, fixed %! v)
in go w' assignment vs
else do
let !p = bnProb bn assignment (v,True)
x <- probabilityIO p
let !assignment' = M.insert v x assignment
go w assignment' vs
That brings a further reduction in allocation (~9%) and a small speedup (~%13%), but the total memory usage and maximum residence haven't changed much.
I see nothing else obvious to change there, so let's look at likelihoodWeighting:
func m _ = do
(a, w) <- weightedSample bn fixed
let x = a ! e
return $! x `seq` w `seq` M.adjust (+w) x m
In the last line, first, w is already evaluated in weightedSample now, so we don't need to seq it here, the key x is required to evaluate the updated map, so seqing that isn't necessary either. The bad thing on that line is M.adjust. adjust has no way of forcing the result of the updated function, so that builds thunks in the map's values. You can force evaluation of the thunks by looking up the modified value and forcing that, but Data.Map provides a much more convenient way here, since the key at which the map is updated is guaranteed to be present, insertWith':
func !m _ = do
(a, w) <- weightedSample bn fixed
let x = a ! e
return (M.insertWith' (+) x w m)
(Note: GHC optimises better with a bang-pattern on m than with return $! ... here). That slightly reduces the total allocation and doesn't measurably change the running time, but has a great impact on total memory used and maximum residency:
934,566,488 bytes allocated in the heap
1,441,744 bytes copied during GC
68,112 bytes maximum residency (1 sample(s))
23,272 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
The biggest improvement in running time to be had would be by avoiding randomIO, the used StdGen is very slow.
I am surprised how much time the bn* functions take, but don't see any obvious inefficiency in those.

I have trouble digesting these profiles, but I have gotten my ass kicked before because the MonadRandom on Hackage is strict. Creating a lazy version of MonadRandom made my memory problems go away.
My colleague has not yet gotten permission to release the code, but I've put Control.Monad.LazyRandom online at pastebin. Or if you want to see some excerpts that explain a fully lazy random search, including infinite lists of random computations, check out Experience Report: Haskell in Computational Biology.

I put together a very elementary example, posted here: http://hpaste.org/71919. I'm not sure if it's anything like your example.. just a very minimal thing that seemed to work.
Compiling with -prof and -fprof-auto and running with 100000 iterations yielded the following head of the profiling output (pardon my line numbers):
8 COST CENTRE MODULE %time %alloc
9
10 sample AI.Util.ProbDist 31.5 36.6
11 bnParents AI.Probability.Bayes 23.2 0.0
12 bnRank AI.Probability.Bayes 10.7 23.7
13 weightedSample.go AI.Probability.Bayes 9.6 13.4
14 bnVars AI.Probability.Bayes 8.6 16.2
15 likelihoodWeighting AI.Probability.Bayes 3.8 4.2
16 likelihoodWeighting.getSample AI.Probability.Bayes 2.1 0.7
17 sample.cumulative AI.Util.ProbDist 1.7 2.1
18 bnCond AI.Probability.Bayes 1.6 0.0
19 bnRank.ps AI.Probability.Bayes 1.1 0.0
And here are the summary statistics:
1,433,944,752 bytes allocated in the heap
1,016,435,800 bytes copied during GC
176,719,648 bytes maximum residency (11 sample(s))
1,900,232 bytes maximum slop
400 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.40s ( 1.41s elapsed)
GC time 1.08s ( 1.24s elapsed)
Total time 2.47s ( 2.65s elapsed)
%GC time 43.6% (46.8% elapsed)
Alloc rate 1,026,674,336 bytes per MUT second
Productivity 56.4% of total user, 52.6% of total elapsed
Notice that the profiler pointed its finger at sample. I forced the return in that function by using $!, and here are some summary statistics afterwards:
1,776,908,816 bytes allocated in the heap
165,232,656 bytes copied during GC
34,963,136 bytes maximum residency (7 sample(s))
483,192 bytes maximum slop
68 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.42s ( 2.44s elapsed)
GC time 0.21s ( 0.23s elapsed)
Total time 2.63s ( 2.68s elapsed)
%GC time 7.9% (8.8% elapsed)
Alloc rate 733,248,745 bytes per MUT second
Productivity 92.1% of total user, 90.4% of total elapsed
Much more productive in terms of GC, but not much changed on the time. You might be able to keep iterating in this profile/tweak fashion to target your bottlenecks and eke out some better performance.

I think your initial diagnosis is correct, and I've never seen a profiling report that's useful once memory effects kick in.
The problem is that you're traversing the list twice, once for sequence and again for sum. In Haskell, multiple list traversals of large lists are really, really bad for performance. The solution is generally to use some type of fold, such as foldM. Your sampleMean function can be written as
{-# LANGUAGE BangPatterns #-}
sampleMean2 :: MonadRandom m => Int -> m Float -> m Float
sampleMean2 n dist = foldM (\(!a) mb -> liftM (+a) mb) 0 $ replicate n dist
for example, traversing the list only once.
You can do the same sort of thing with likelihoodWeighting as well. In order to prevent thunks, it's important to make sure that the accumulator in your fold function has appropriate strictness.

Related

Using list generator for memory efficient code in Haskell

I would like to get a handle on writing memory efficient haskell code. One thing I ran across is that there is no dead easy way to make python style list generators/iterators (that I could find).
Small example:
Finding the sum of the integers from 1 to 100000000 without using the closed form formula.
Python that can be done quickly with minimal use of memory as sum(xrange(100000000). In Haskell the analogue would be sum [1..100000000]. However this uses up a lot of memory. I thought using foldl or foldr would be fine but even that uses a lot of memory and is slower than python. Any suggestions?
TL;DR - I think the culprit in this case is - defaulting of GHC to Integer.
Admittedly I do not know enough about python, but my first guess would be that python switches to "bigint" only if necessary - therefore all calculations are done with Int a.k.a. 64-bit integer on my machine.
A first check with
$> ghci
GHCi, version 7.10.3: http://www.haskell.org/ghc/ :? for help
Prelude> maxBound :: Int
9223372036854775807
reveals that the result of the sum (5000000050000000) is less than that number so we have no fear of Int overflow.
I have guessed your example programs to look roughly like this
sum.py
print(sum(xrange(100000000)))
sum.hs
main :: IO ()
main = print $ sum [1..100000000]
To make things explicit I added the type annotation (100000000 :: Integer), compiling it with
$ > stack build --ghc-options="-O2 -with-rtsopts=-sstderr"
and ran your example,
$ > stack exec -- time sum
5000000050000000
3,200,051,872 bytes allocated in the heap
208,896 bytes copied during GC
44,312 bytes maximum residency (2 sample(s))
21,224 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6102 colls, 0 par 0.013s 0.012s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 1.725s ( 1.724s elapsed)
GC time 0.013s ( 0.012s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 1.739s ( 1.736s elapsed)
%GC time 0.7% (0.7% elapsed)
Alloc rate 1,855,603,449 bytes per MUT second
Productivity 99.3% of total user, 99.4% of total elapsed
1.72user 0.00system 0:01.73elapsed 99%CPU (0avgtext+0avgdata 4112maxresident)k
and indeed the ~3GB of memory consumption is reproduced.
Changing the annotation to (100000000 :: Int) - altered the behaviour drastically
$ > stack build
$ > stack exec -- time sum
5000000050000000
51,872 bytes allocated in the heap
3,408 bytes copied during GC
44,312 bytes maximum residency (1 sample(s))
17,128 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.034s ( 0.034s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.036s ( 0.035s elapsed)
%GC time 0.2% (0.2% elapsed)
Alloc rate 1,514,680 bytes per MUT second
Productivity 99.4% of total user, 102.3% of total elapsed
0.03user 0.00system 0:00.03elapsed 91%CPU (0avgtext+0avgdata 3496maxresident)k
0inputs+0outputs (0major+176minor)pagefaults 0swaps
For the interested
The behaviour of the haskell version does not change a lot if you use libraries like conduit or vector (both boxed and unboxed).
sample programs
sumC.hs
import Data.Conduit
import Data.Conduit.List as CL
main :: IO ()
main = do res <- CL.enumFromTo 1 100000000 $$ CL.fold (+) (0 :: Int)
print res
sumV.hs
import Data.Vector.Unboxed as V
{-import Data.Vector as V-}
main :: IO ()
main = print $ V.sum $ V.enumFromTo (1::Int) 100000000
funny enough the version using
main = print $ V.sum $ V.enumFromN (1::Int) 100000000
is doing worse than the above - even though the documentation says otherwise.
enumFromN :: (Unbox a, Num a) => a -> Int -> Vector a
O(n) Yield a vector of the given length containing the values x, x+1
etc. This operation is usually more efficient than enumFromTo.
Update
#Carsten's comment made me curious - so I had a look into the sources for integer - well integer-simple to be precise, because for Integer there exists other versions integer-gmp and integer-gmp2 using libgmp.
data Integer = Positive !Positive | Negative !Positive | Naught
-------------------------------------------------------------------
-- The hard work is done on positive numbers
-- Least significant bit is first
-- Positive's have the property that they contain at least one Bit,
-- and their last Bit is One.
type Positive = Digits
type Positives = List Positive
data Digits = Some !Digit !Digits
| None
type Digit = Word#
data List a = Nil | Cons a (List a)
so when using Integer there is quite a bit of memory overhead compared to Int or rather unboxed Int# - I guess as this should be optimized, (though I have not confirmed that).
So Integer is (if I calculate correctly)
1 x Word for the sum-type-tag (here Positive
n x (Word + Word) for Some and the Digit part
1 x Word for the last None
a memory overhead of (2 + floor(log_10(n)) for each Integer in that calculation + a bit more for the accumulator.

Haskell: Data.HashSet (from unordered-container) Performance for Large Sets

The data
First of all, let's generate some input so we have concrete data to talk about:
python -c 'for f in xrange(4000000): print f' > input.txt
this will generate a file input.txt containing the numbers from 0 to 3999999, each on its own line. That means we should have a file with 4,000,000 lines, adding up to 30,888,890 bytes, roughly 29 MiB.
Everything as a list
Right, let's load everything into memory as a [Text]:
import Data.Conduit
import Data.Text (Text)
import Control.Monad.Trans.Resource (runResourceT)
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.List as CL
main :: IO ()
main = do
hs <- (runResourceT
$ CB.sourceFile "input.txt"
$$ CT.decode CT.utf8
=$ CT.lines
=$ CL.fold (\b a -> a `seq` b `seq` a:b) [])
print $ head hs
and run it:
[1 of 1] Compiling Main ( Test.hs, Test.o )
Linking Test ...
"3999999"
2,425,996,328 bytes allocated in the heap
972,945,088 bytes copied during GC
280,665,656 bytes maximum residency (13 sample(s))
5,120,528 bytes maximum slop
533 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 4378 colls, 0 par 0.296s 0.309s 0.0001s 0.0009s
Gen 1 13 colls, 0 par 0.452s 0.661s 0.0508s 0.2511s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.460s ( 0.465s elapsed)
GC time 0.748s ( 0.970s elapsed)
EXIT time 0.002s ( 0.034s elapsed)
Total time 1.212s ( 1.469s elapsed)
%GC time 61.7% (66.0% elapsed)
Alloc rate 5,271,326,694 bytes per MUT second
Productivity 38.3% of total user, 31.6% of total elapsed
real 0m1.481s
user 0m1.212s
sys 0m0.232s
runs in 1.4s, takes 533 MB of memory. As of Haskell Wiki's Memory Footprint, the 4M Text instances should take 6 words + 2N bytes of memory. I'm on 64 bit so one word is 8 bytes. That means it should be (6 * 8 bytes * 4000000) + (2*26888890) bytes = 234 MiB. (The 26888890 are all the bytes in input.txt that are not newline characters). For the list which will hold them all, we'll need additional memory of (1 + 3N) words + N * sizeof(v). sizeof(v) should be 8 because it'll be a pointer to the Text. The list should then use (1 + 3 * 4000000) * 8 bytes + 4000000 * 8 bytes = 122MiB. So in total (list + strings) we'd expect 356 MiB of memory used. I don't know where difference of 177 MiB (50%) of our memory went but let's ignore that for now.
The large hash set
Finally, we shall come to the use case that I'm actually interested in: Storing all the words in a large Data.HashSet. For that, I changed the program ever so slightly
import Data.Conduit
import Data.Text (Text)
import Control.Monad.Trans.Resource (runResourceT)
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.List as CL
import qualified Data.HashSet as HS
main :: IO ()
main = do
hs <- (runResourceT
$ CB.sourceFile "input.txt"
$$ CT.decode CT.utf8
=$ CT.lines
=$ CL.fold (\b a -> a `seq` b `seq` HS.insert a b) HS.empty)
print $ HS.size hs
if we run that again
$ ghc -fforce-recomp -O3 -rtsopts Test.hs && time ./Test +RTS -sstderr
[1 of 1] Compiling Main ( Test.hs, Test.o )
Linking Test ...
4000000
6,544,900,208 bytes allocated in the heap
6,314,477,464 bytes copied during GC
442,295,792 bytes maximum residency (26 sample(s))
8,868,304 bytes maximum slop
1094 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 12420 colls, 0 par 5.756s 5.869s 0.0005s 0.0034s
Gen 1 26 colls, 0 par 3.068s 3.633s 0.1397s 0.6409s
INIT time 0.000s ( 0.000s elapsed)
MUT time 3.567s ( 3.592s elapsed)
GC time 8.823s ( 9.502s elapsed)
EXIT time 0.008s ( 0.097s elapsed)
Total time 12.399s ( 13.192s elapsed)
%GC time 71.2% (72.0% elapsed)
Alloc rate 1,835,018,578 bytes per MUT second
Productivity 28.8% of total user, 27.1% of total elapsed
real 0m13.208s
user 0m12.399s
sys 0m0.646s
it's quite bad: 13s and 1094MiB of memory used. The memory footprint page lists 4.5N words + N * sizeof(v) for a hash set, that should become (4.5 * 4000000 * 8bytes) + (4000000 * 8bytes) = 167 MiB. Adding the storage for the stings (234 MiB), I'd expect 401 MiB which is more than double, and it feels quite slow on top of that :(.
Thought experiment: manually managing the memory
As a thought experiment: Using a language where we can manually control memory layout and implement the HashSet with Open addressing I'd expect the following to be the sizes. For fairness, I'll expect the strings to still be in UTF-16 (which is what Data.Text does). Given it's 26888890 characters in total (without newlines), the strings in UTF-16 should roughly be 53777780 bytes (2 * 26888890) = 51 MiB. We will need to store the length for every string, which will be 8 bytes * 4000000 = 30 MiB. And we will need space for the hash set (4000000 * 8 bytes), again 30 MiB. Given that the hash sets are normally increased exponentially, one would maybe expect 32 MiB or 64 MiB worst case. Let's go with the worst case: 64 MiB for the table + 30 MiB for the string lengths + 51 MiB for the actual string data, grand total of 145 MiB.
So given that Data.HashSet is not a specialised implementation for storing strings, the calculated 401 MiB would not be too bad but the actually used 1094 MiB seem a bit much waste.
The questions finally :)
So we finally got there:
Where is the error in my calculations?
Is there some problem in my implementation or is 1094 MiB really the best we can get?
Versions and stuff
I should probably use ByteStrings instead of Text as I only need ascii characters
I'm on GHC 7.10.1 and unordered-containers-0.2.5.1
For comparison: 4,000,000 Ints:
import Data.List (foldl')
import qualified Data.HashSet as HS
main = do
let hs = foldl' (\b a -> a `seq` b `seq` HS.insert a b) (HS.empty :: HS.HashSet Int) [1..4000000]
print $ HS.size hs
doesn't look any better:
[...]
798 MB total memory in use (0 MB lost due to fragmentation)
[...]
real 0m9.956s
that's almost 800 MiB for 4M Ints!

Packed large bit vector with efficient xor and bit count in Haskell

I am looking for an efficient (in both space and time) data type which can hold a 384 bit vector and supports efficient XOR and "bit count" (number of bits set to 1) operations.
Below, please find my demo program. The operations I need are all in the SOQuestionOps type class and I have implemented it for Natural and Data.Vector.Unboxed.Bit. Especially the latter seems perfect as it has a zipWords operation which should allow me to do operations like "bit count" and XOR word-by-word instead of bit-by-bit. Also it claims to store the bits packed (8 bits per byte).
{-# LANGUAGE FlexibleInstances #-}
import Data.Bits
import Data.List (foldl')
import Numeric.Natural
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed.Bit as BV
class SOQuestionOps a where
soqoXOR :: a -> a -> a
soqoBitCount :: a -> Int
soqoFromList :: [Bool] -> a
alternating :: Int -> [Bool]
alternating n =
let c = n `mod` 2 == 0
in if n == 0
then []
else c : alternating (n-1)
instance SOQuestionOps Natural where
soqoXOR = xor
soqoBitCount = popCount
soqoFromList v =
let oneIdxs = map snd $ filter fst (zip v [0..])
in foldl' (\acc n -> acc `setBit` n) 0 oneIdxs
instance SOQuestionOps (BV.Vector BV.Bit) where
soqoXOR = BV.zipWords xor
soqoBitCount = BV.countBits
soqoFromList v = BV.fromList (map BV.fromBool v)
main =
let initialVec :: BV.Vector BV.Bit
initialVec = soqoFromList $ alternating 384
lotsOfVecs = V.replicate 10000000 (soqoFromList $ take 384 $ repeat True)
xorFolded = V.foldl' soqoXOR initialVec lotsOfVecs
sumBitCounts = V.foldl' (\n v -> n + soqoBitCount v) 0 lotsOfVecs
in putStrLn $ "folded bit count: " ++ show (soqoBitCount xorFolded) ++ ", sum: " ++ show sumBitCounts
So let's calculate numbers for the best case: lotsOfVecs shouldn't need to allocate much because it's just 10,000,000 times the same vector initialVec. The foldl obviously creates one of these vectors per fold operation, so it should create 10,000,000 bit vectors. The bit counting should create anything but 10,000,000 Ints. So in the best case, my program should use very little (and constant) memory and the total allocations should roughly be 10,000,000 * sizeof(bit vector) + 10,000,000 * sizeof(int) = 520,000,000 bytes .
Ok, let's run the program for Natural:
let's make initialVec :: Natural, compile with
ghc --make -rtsopts -O3 MemStuff.hs
result (this is with GHC 7.10.1):
$ ./MemStuff +RTS -sstderr
folded bit count: 192, sum: 3840000000
1,280,306,112 bytes allocated in the heap
201,720 bytes copied during GC
80,106,856 bytes maximum residency (2 sample(s))
662,168 bytes maximum slop
78 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 2321 colls, 0 par 0.056s 0.059s 0.0000s 0.0530s
Gen 1 2 colls, 0 par 0.065s 0.069s 0.0346s 0.0674s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.579s ( 0.608s elapsed)
GC time 0.122s ( 0.128s elapsed)
EXIT time 0.000s ( 0.002s elapsed)
Total time 0.702s ( 0.738s elapsed)
%GC time 17.3% (17.3% elapsed)
Alloc rate 2,209,576,763 bytes per MUT second
Productivity 82.7% of total user, 78.7% of total elapsed
real 0m0.754s
user 0m0.704s
sys 0m0.037s
which has 1,280,306,112 bytes allocated in the heap, that's in the ballpark (2x) of the expected figure. Btw on GHC 7.8 this allocates 353,480,272,096 bytes and runs for absolute ages as popCount isn't very efficient on GHC 7.8's Naturals.
EDIT: I changed the code a bit. In the original version, every other vector was 0 in the fold. Which gave a lot better allocation figures for the Natural version. I changed it so the vector alternates between to different representations (with many bits set) and now we see 2x allocations of the expected. That's another downside of Natural (and Integer): The allocation rate depends on the values.
But maybe we can do better, let's try the densely packed Data.Vector.Unboxed.Bit:
That's initialVec :: BV.Vector BV.Bit and re-compile and re-run with the same options.
$ time ./MemStuff +RTS -sstderr
folded bit count: 192, sum: 1920000000
75,120,306,536 bytes allocated in the heap
54,914,640 bytes copied during GC
80,107,368 bytes maximum residency (2 sample(s))
664,128 bytes maximum slop
78 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 145985 colls, 0 par 0.543s 0.627s 0.0000s 0.0577s
Gen 1 2 colls, 0 par 0.065s 0.070s 0.0351s 0.0686s
INIT time 0.000s ( 0.000s elapsed)
MUT time 27.679s ( 28.228s elapsed)
GC time 0.608s ( 0.698s elapsed)
EXIT time 0.000s ( 0.002s elapsed)
Total time 28.288s ( 28.928s elapsed)
%GC time 2.1% (2.4% elapsed)
Alloc rate 2,714,015,097 bytes per MUT second
Productivity 97.8% of total user, 95.7% of total elapsed
real 0m28.944s
user 0m28.290s
sys 0m0.456s
That's very slow and roughly 100 times the allocations :(.
Ok, then lets recompile and profile both runs (ghc --make -rtsopts -O3 -prof -auto-all -caf-all -fforce-recomp MemStuff.hs):
The Natural version:
COST CENTRE MODULE %time %alloc
main.xorFolded Main 51.7 76.0
main.sumBitCounts.\ Main 25.4 16.0
main.sumBitCounts Main 12.1 0.0
main.lotsOfVecs Main 10.4 8.0
The Data.Vector.Unboxed.Bit version:
COST CENTRE MODULE %time %alloc
soqoXOR Main 96.7 99.3
main.sumBitCounts.\ Main 1.9 0.2
Is Natural really the best option for a fixed size bit vector? And what about GHC 6.8? And is there anything better which can implement my SOQuestionOps type class?
Have a look at the Data.LargeWord module in the Crypto package:
http://hackage.haskell.org/package/Crypto-4.2.5.1/docs/Data-LargeWord.html
It provides Bits instances for large words of various sizes, e.g. 96 through 256 bits.

Debugging performance bottlenecks for longest common subsequence algorithm

I am writing a longest common subsequence algorithm in Haskell using vector library and state monad (to encapsulate the very imperative and mutable nature of Miller O(NP) algorithm). I have already written it in C for some project I needed it for, and am now writing it in Haskell as a way to explore how to write those imperative grid-walk algorithms with good performance that match C. The version I wrote with unboxed vectors is about 4-times slower than C version for same inputs (and compiled with right optimization flags - I used both system clock time and Criterion methods to validate the relative time measurements between Haskell and C versions, and same data types, both large and tiny inputs). I have been trying to figure out where the performance issues might be, and will appreciate feedback - it is possible there is some well-known performance issue I might have run into here, especially in vector library which I am heavily using here.
In my code, I have one function called gridWalk that is called most often, and also, does most of the work. The performance slowdown is likely to be there, but I can't figure out what it could be. Complete Haskell code is here. Snippets of the code below:
import Data.Vector.Unboxed.Mutable as MU
import Data.Vector.Unboxed as U hiding (mapM_)
import Control.Monad.ST as ST
import Control.Monad.Primitive (PrimState)
import Control.Monad (when)
import Data.STRef (newSTRef, modifySTRef, readSTRef)
import Data.Int
type MVI1 s = MVector (PrimState (ST s)) Int
cmp :: U.Vector Int32 -> U.Vector Int32 -> Int -> Int -> Int
cmp a b i j = go 0 i j
where
n = U.length a
m = U.length b
go !len !i !j| (i<n) && (j<m) && ((unsafeIndex a i) == (unsafeIndex b j)) = go (len+1) (i+1) (j+1)
| otherwise = len
-- function to find previous y on diagonal k for furthest point
findYP :: MVI1 s -> Int -> Int -> ST s (Int,Int)
findYP fp k offset = do
let k0 = k+offset-1
k1 = k+offset+1
y0 <- MU.unsafeRead fp k0 >>= \x -> return $ 1+x
y1 <- MU.unsafeRead fp k1
if y0 > y1 then return (k0,y0)
else return (k1,y1)
{-#INLINE findYP #-}
gridWalk :: Vector Int32 -> Vector Int32 -> MVI1 s -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> ST s ()
gridWalk a b fp !k cmp = {-#SCC gridWalk #-} do
let !offset = 1+U.length a
(!kp,!yp) <- {-#SCC findYP #-} findYP fp k offset
let xp = yp-k
len = {-#SCC cmp #-} cmp a b xp yp
x = xp+len
y = yp+len
{-#SCC "updateFP" #-} MU.unsafeWrite fp (k+offset) y
return ()
{-#INLINE gridWalk #-}
-- The function below executes ct times, and updates furthest point as they are found during furthest point search
findSnakes :: Vector Int32 -> Vector Int32 -> MVI1 s -> Int -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> (Int -> Int -> Int) -> ST s ()
findSnakes a b fp !k !ct cmp op = {-#SCC findSnakes #-} U.forM_ (U.fromList [0..ct-1]) (\x -> gridWalk a b fp (op k x) cmp)
{-#INLINE findSnakes #-}
I added some cost center annotations, and ran profiling with a certain LCS input for testing. Here is what I get:
total time = 2.39 secs (2394 ticks # 1000 us, 1 processor)
total alloc = 4,612,756,880 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
gridWalk Main 67.5 52.7
findSnakes Main 23.2 27.8
cmp Main 4.2 0.0
findYP Main 3.5 19.4
updateFP Main 1.6 0.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 64 0 0.0 0.0 100.0 100.0
main Main 129 0 0.0 0.0 0.0 0.0
CAF Main 127 0 0.0 0.0 100.0 100.0
findSnakes Main 141 0 0.0 0.0 0.0 0.0
main Main 128 1 0.0 0.0 100.0 100.0
findSnakes Main 138 0 0.0 0.0 0.0 0.0
gridWalk Main 139 0 0.0 0.0 0.0 0.0
cmp Main 140 0 0.0 0.0 0.0 0.0
while Main 132 4001 0.1 0.0 100.0 100.0
findSnakes Main 133 12000 23.2 27.8 99.9 99.9
gridWalk Main 134 16004000 67.5 52.7 76.7 72.2
cmp Main 137 16004000 4.2 0.0 4.2 0.0
updateFP Main 136 16004000 1.6 0.0 1.6 0.0
findYP Main 135 16004000 3.5 19.4 3.5 19.4
newVI1 Main 130 1 0.0 0.0 0.0 0.0
newVI1.\ Main 131 8004 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 112 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 104 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 102 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 95 0 0.0 0.0 0.0 0.0
If I am interpreting the profiling output correctly (and assuming there are not too much distortions due to profiling), gridWalk takes most of the time, but the main functions cmp and findYP that do the heavy lifting in gridWalk, seem to take very little time in profiling report. So, perhaps the bottleneck is in forM_ wrapper that findSnakes function uses to call gridWalk? The heap profile too seems very clean:
Reading the core, nothing really jumps out. I thought some values in inner loops might be boxed, but I don't spot them in the core. I hope the performance issue is due to something simple I have missed.
Update
Following suggestion of #DanielFischer, I replaced forM_ of Data.Vector.Unboxed with that of Control.Monad in findSnakes function which improved the performance from 4x to 2.5x of C version. Haskell and C versions are now posted here if you want to try them out.
I am still digging through the core to see where the bottlenecks are. gridWalk is most frequently called function, and for it to perform well, lcsh should reduce whileM_ loop to a nice iterative inner loop of condition check and inlined findSnakes code. I suspect that in assembly, this is not the case for whileM_ loop, but since I am not very knowledgeable about translating core, and locating name-mangled GHC functions in assembly, I guess it is just a matter of patiently plugging away at the problem until I figure it out. Meanwhile if there are any pointers on performance fixes, they will be appreciated.
Another possibility that I can think of is overhead of heap checks during function calls. As seen in profiling report, gridWalk is called 16004000 times. Assuming 6 cycles for heap check (I am guessing it is less, but still let us assume that), it is ~0.02 seconds on a 3.33GHz box for 96024000 cycles.
Also, some performance numbers:
Haskell code (GHC 7.6.1 x86_64): It was ~0.25s before forM_ fix.
time ./T
1
real 0m0.150s
user 0m0.145s
sys 0m0.003s
C code (gcc 4.7.2 x86_64):
time ./test
1
real 0m0.065s
user 0m0.063s
sys 0m0.000s
Update 2:
Updated code is here. Using STUArray doesn't change the numbers either. The performance is about 1.5x on Mac OS X (x86_64,ghc7.6.1), pretty similar to what #DanielFischer reported on Linux.
Haskell code:
$ time ./Diff
1
real 0m0.087s
user 0m0.084s
sys 0m0.003s
C code:
$ time ./test
1
real 0m0.056s
user 0m0.053s
sys 0m0.002s
Glancing at the cmm, the call is tail-recursive, and is turned into a loop by llvm. But each new iteration seems to allocate new values which invokes heap check too, and so, might explain the difference in performance. I have to think about how to write the tail recursion in such a way such that no values are allocated across iterations, avoiding heap-check and allocation overhead.
You take a huge hit at
U.forM_ (U.fromList [0..ct-1])
in findSnakes. I'm convinced that isn't supposed to happen (ticket?), but that allocates a new Vector to traverse every time findSnakes is called. If you use
Control.Monad.forM_ [0 .. ct-1]
instead, the running time roughly halves, and the allocation drops by a factor of about 500 here. (GHC optimises C.M.forM_ [0 :: Int .. limit] well, the list is eliminated, and what remains is basically a loop.) You can do slightly better by writing the loop yourself.
Some things that cause gratuitous allocation/code size bloat without hurting performance much are
the unused Bool argument of lcsh
the cmp argument to findSnakes and gridWalk; if these are never called with a different comparison than the top-level cmp, that argument leads to unnecessary code duplication.
the general type of while; specialising it to the used type ST s Bool -> ST s () -> ST s () reduces allocation (much), and also running time (slightly, but noticeably, here).
A general word about profiling: Compiling a programme for profiling inhibits many optimisations. In particular for libraries like vector, bytestring or text that make heavy use of fusion, profiling often produces misleading results.
For example, your original code produces here
total time = 3.42 secs (3415 ticks # 1000 us, 1 processor)
total alloc = 4,612,756,880 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc ticks bytes
gridWalk Main 63.7 52.7 2176 2432608000
findSnakes Main 20.0 27.8 682 1281440080
cmp Main 9.2 0.0 313 16
findYP Main 4.2 19.4 144 896224000
updateFP Main 2.7 0.0 91 0
Just adding a bang on the binding of len in gridWalk changes nothing at all in the non-profiling version, but for the profiling version
total time = 2.98 secs (2985 ticks # 1000 us, 1 processor)
total alloc = 3,204,404,880 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc ticks bytes
gridWalk Main 63.0 32.0 1881 1024256000
findSnakes Main 22.2 40.0 663 1281440080
cmp Main 7.2 0.0 214 16
findYP Main 4.7 28.0 140 896224000
updateFP Main 2.7 0.0 82 0
it makes a lot of difference. For the version including the changes mentioned above (and the bang on len in gridWalk), the profiling version says
total alloc = 1,923,412,776 bytes (excludes profiling overheads)
but the non-profiling version
1,814,424 bytes allocated in the heap
10,808 bytes copied during GC
49,064 bytes maximum residency (2 sample(s))
25,912 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 2 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.12s ( 0.12s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.12s ( 0.12s elapsed)
says it allocated 1000-fold less than the profiling version.
For vector and friends code, more reliable for identifying bottlenecks than profiling (unfortunately also much much more time-consuming and difficult) is studying the generated core (or assembly, if you are proficient in reading that).
Concerning the update, the C runs a little slower on my box (gcc-4.7.2, -O3)
$ time ./miltest1
real 0m0.074s
user 0m0.073s
sys 0m0.001s
but the Haskell about the same
$ time ./hsmiller
1
real 0m0.151s
user 0m0.149s
sys 0m0.001s
That is a little faster when compiling via the LLVM backend:
$ time ./hsmiller1
real 0m0.131s
user 0m0.129s
sys 0m0.001s
And when we replace the forM_ with a manual loop,
findSnakes a b fp !k !ct op = go 0
where
go x
| x < ct = gridWalk a b fp (op k x) >> go (x+1)
| otherwise = return ()
it gets a bit faster,
$ time ./hsmiller
1
real 0m0.124s
user 0m0.121s
sys 0m0.002s
resp. via LLVM:
$ time ./hsmiller
1
real 0m0.108s
user 0m0.107s
sys 0m0.000s
By and large, the generated core looks fine, one small annoyance was
Main.$wa
:: forall s.
GHC.Prim.Int#
-> GHC.Types.Int
-> GHC.Prim.State# s
-> (# GHC.Prim.State# s, Main.MVI1 s #)
and a slightly roundabout implementation. That is fixed by making newVI1 strict in its second argument,
newVI1 n !x = do
Since that isn't called often, the effect on performance is of course negligible.
The meat is the core for lcsh, and that doesn't look too bad. The only boxed things in that are the Ints read from /written to the STRef, and that is inevitable. What's not so pleasant is that the core contains a lot of code duplication, but in my experience, that rarely is a real performance problem, and not all duplicated code survives the code generation.
and for it to perform well, lcsh should reduce whileM_ loop to a nice iterative inner loop of condition check and inlined findSnakes code.
You get an inner loop when you add an INLINE pragma to whileM_, but that loop is not nice, and, in this case it is much slower than having the whileM_ out-of-line (I'm not sure whether it's solely due to code size, but it could be).

Haskell: unexpected time-complexity in the computation involving large lists

I am dealing with the computation which has as an intermediate result a list A=[B], which is a list of K lists of the length L. The time-complexity to compute an element of B is controlled by the parameter M and is theoretically linear in M. Theoretically I would expect the time-complexity for the computation of A to be O(K*L*M). However, this is not the case and I don't understand why?
Here is the simple complete sketch program which exhibits the problem I have explained
import System.Random (randoms, mkStdGen)
import Control.Parallel.Strategies (parMap, rdeepseq)
import Control.DeepSeq (NFData)
import Data.List (transpose)
type Point = (Double, Double)
fmod :: Double -> Double -> Double
fmod a b | a < 0 = b - fmod (abs a) b
| otherwise = if a < b then a
else let q = a / b in b * (q - fromIntegral (floor q))
standardMap :: Double -> Point -> Point
standardMap k (q, p) = (fmod (q + p) (2 * pi), fmod (p + k * sin(q)) (2 * pi))
trajectory :: (Point -> Point) -> Point -> [Point]
trajectory map initial = initial : (trajectory map $ map initial)
justEvery :: Int -> [a] -> [a]
justEvery n (x:xs) = x : (justEvery n $ drop (n-1) xs)
justEvery _ [] = []
subTrace :: Int -> Int -> [a] -> [a]
subTrace n m = take (n + 1) . justEvery m
ensemble :: Int -> [Point]
ensemble n = let qs = randoms (mkStdGen 42)
ps = randoms (mkStdGen 21)
in take n $ zip qs ps
ensembleTrace :: NFData a => (Point -> [Point]) -> (Point -> a) ->
Int -> Int -> [Point] -> [[a]]
ensembleTrace orbitGen observable n m =
parMap rdeepseq ((map observable . subTrace n m) . orbitGen)
main = let k = 100
l = 100
m = 100
orbitGen = trajectory (standardMap 7)
observable (p, q) = p^2 - q^2
initials = ensemble k
mean xs = (sum xs) / (fromIntegral $ length xs)
result = (map mean)
$ transpose
$ ensembleTrace orbitGen observable l m
$ initials
in mapM_ print result
I am compiling with
$ ghc -O2 stdmap.hs -threaded
and runing with
$ ./stdmap +RTS -N4 > /dev/null
on the intel Q6600, Linux 3.6.3-1-ARCH, with GHC 7.6.1 and get the following results
for the different sets of the parameters K, L, M (k, l, m in the code of the program)
(K=200,L=200,N=200) -> real 0m0.774s
user 0m2.856s
sys 0m0.147s
(K=2000,L=200,M=200) -> real 0m7.409s
user 0m28.102s
sys 0m1.080s
(K=200,L=2000,M=200) -> real 0m7.326s
user 0m27.932s
sys 0m1.020s
(K=200,L=200,M=2000) -> real 0m10.581s
user 0m38.564s
sys 0m3.376s
(K=20000,L=200,M=200) -> real 4m22.156s
user 7m30.007s
sys 0m40.321s
(K=200,L=20000,M=200) -> real 1m16.222s
user 4m45.891s
sys 0m15.812s
(K=200,L=200,M=20000) -> real 8m15.060s
user 23m10.909s
sys 9m24.450s
I don't quite understand where the problem of such a pure scaling might be. If I understand correctly the lists are lazy and should not be constructed, since they are consumed in the head-tail direction? As could be observed from the measurements there is a correlation between the excessive real-time consumption and the excessive system-time consumption as the excess would be on the system account. But if there is some memory management wasting time, this should still scale linearly in K, L, M.
Help!
EDIT
I made changes in the code according to the suggestions given by Daniel Fisher, which indeed solved the bad scaling with respect to M. As pointed out, by forcing the strict evaluation in the trajectory, we avoid the construction of large thunks. I understand the performance improvement behind that, but I still don't understand the bad scaling of the original code, because (if I understand correctly) the space-time-complexity of the construction of the thunk should be linear in M?
Additionally, I still have problems understanding the bad scaling with respect to K (the size of the ensemble). I performed two additional measurements with the improved code for K=8000 and K=16000, keeping L=200, M=200. Scaling up to K=8000 is as expected but for K=16000 it is already abnormal. The problem seems to be in the number of overflowed SPARKS, which is 0 for K=8000 and 7802 for K=16000. This probably reflects in the bad concurrency which I quantify as a quotient Q = (MUT cpu time) / (MUT real time) which would be ideally equal to the number of CPU-s. However, Q ~ 4 for K = 8000 and Q ~ 2 for K = 16000.
Please help me understand the origin of this problem and the possible solutions.
K = 8000:
$ ghc -O2 stmap.hs -threaded -XBangPatterns
$ ./stmap +RTS -s -N4 > /dev/null
56,905,405,184 bytes allocated in the heap
503,501,680 bytes copied during GC
53,781,168 bytes maximum residency (15 sample(s))
6,289,112 bytes maximum slop
151 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 27893 colls, 27893 par 7.85s 1.99s 0.0001s 0.0089s
Gen 1 15 colls, 14 par 1.20s 0.30s 0.0202s 0.0558s
Parallel GC work balance: 23.49% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 8000 (8000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 95.90s ( 24.28s elapsed)
GC time 9.04s ( 2.29s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 104.95s ( 26.58s elapsed)
Alloc rate 593,366,811 bytes per MUT second
Productivity 91.4% of total user, 360.9% of total elapsed
gc_alloc_block_sync: 315819
and
K = 16000:
$ ghc -O2 stmap.hs -threaded -XBangPatterns
$ ./stmap +RTS -s -N4 > /dev/null
113,809,786,848 bytes allocated in the heap
1,156,991,152 bytes copied during GC
114,778,896 bytes maximum residency (18 sample(s))
11,124,592 bytes maximum slop
300 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 135521 colls, 135521 par 22.83s 6.59s 0.0000s 0.0190s
Gen 1 18 colls, 17 par 2.72s 0.73s 0.0405s 0.1692s
Parallel GC work balance: 18.05% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 16000 (8198 converted, 7802 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 221.77s (139.78s elapsed)
GC time 25.56s ( 7.32s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 247.34s (147.10s elapsed)
Alloc rate 513,176,874 bytes per MUT second
Productivity 89.7% of total user, 150.8% of total elapsed
gc_alloc_block_sync: 814824
M. A. D.'s point about fmod is a good one, but it is not necessary to call out to C, and we can do better staying in Haskell land (the ticket the linked thread was about is meanwhile fixed). The trouble in
fmod :: Double -> Double -> Double
fmod a b | a < 0 = b - fmod (abs a) b
| otherwise = if a < b then a
else let q = a / b in b * (q - fromIntegral (floor q))
is that type defaulting leads to floor :: Double -> Integer (and consequently fromIntegral :: Integer -> Double) being called. Now, Integer is a comparatively complicated type, with slow operations, and the conversion from Integer to Double is also relatively complicated. The original code (with parameters k = l = 200 and m = 5000) produced the stats
./nstdmap +RTS -s -N2 > /dev/null
60,601,075,392 bytes allocated in the heap
36,832,004,184 bytes copied during GC
2,435,272 bytes maximum residency (13741 sample(s))
887,768 bytes maximum slop
9 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46734 colls, 46734 par 41.66s 20.87s 0.0004s 0.0058s
Gen 1 13741 colls, 13740 par 23.18s 11.62s 0.0008s 0.0041s
Parallel GC work balance: 60.58% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 34.99s ( 17.60s elapsed)
GC time 64.85s ( 32.49s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 99.84s ( 50.08s elapsed)
Alloc rate 1,732,048,869 bytes per MUT second
Productivity 35.0% of total user, 69.9% of total elapsed
on my machine (-N2 because I have only two physical cores). Simply changing the code to use a type signature floor q :: Int brings that down to
./nstdmap +RTS -s -N2 > /dev/null
52,105,495,488 bytes allocated in the heap
29,957,007,208 bytes copied during GC
2,440,568 bytes maximum residency (10481 sample(s))
893,224 bytes maximum slop
8 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 36979 colls, 36979 par 32.96s 16.51s 0.0004s 0.0066s
Gen 1 10481 colls, 10480 par 16.65s 8.34s 0.0008s 0.0018s
Parallel GC work balance: 68.64% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.01s ( 0.01s elapsed)
MUT time 29.78s ( 14.94s elapsed)
GC time 49.61s ( 24.85s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 79.40s ( 39.80s elapsed)
Alloc rate 1,749,864,775 bytes per MUT second
Productivity 37.5% of total user, 74.8% of total elapsed
a reduction of about 20% in elapsed time, 13% in MUT time. Not bad. If we look at the code for floor that you get with optimisations, we can see why:
floorDoubleInt :: Double -> Int
floorDoubleInt (D# x) =
case double2Int# x of
n | x <## int2Double# n -> I# (n -# 1#)
| otherwise -> I# n
floorDoubleInteger :: Double -> Integer
floorDoubleInteger (D# x) =
case decodeDoubleInteger x of
(# m, e #)
| e <# 0# ->
case negateInt# e of
s | s ># 52# -> if m < 0 then (-1) else 0
| otherwise ->
case TO64 m of
n -> FROM64 (n `uncheckedIShiftRA64#` s)
| otherwise -> shiftLInteger m e
floor :: Double -> Int just uses the machine conversion, while floor :: Double -> Integer needs an expensive decodeDoubleInteger and more branches. But where floor is called here, we know that all involved Doubles are nonnegative, hence floor is the same as truncate, which maps directly to the machine conversion double2Int#, so let's try that instead of floor:
INIT time 0.00s ( 0.00s elapsed)
MUT time 29.29s ( 14.70s elapsed)
GC time 49.17s ( 24.62s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 78.45s ( 39.32s elapsed)
a really small reduction (to be expected, the fmod isn't really the bottleneck). For comparison, calling out to C:
INIT time 0.01s ( 0.01s elapsed)
MUT time 31.46s ( 15.78s elapsed)
GC time 54.05s ( 27.06s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 85.52s ( 42.85s elapsed)
is a bit slower (unsurprisingly, you can execute a number of primops in the time calling out to C takes).
But that's not where the big fish swim. The bad thing is that picking only every m-th element of the trajectories leads to large thunks that cause a lot of allocation and take long to evaluate when the time comes. So let's eliminate that leak and make the trajectories strict:
{-# LANGUAGE BangPatterns #-}
trajectory :: (Point -> Point) -> Point -> [Point]
trajectory map !initial#(!a,!b) = initial : (trajectory map $ map initial)
That reduces the allocations and GC time drastically, and as a consequence also the MUT time:
INIT time 0.00s ( 0.00s elapsed)
MUT time 21.83s ( 10.95s elapsed)
GC time 0.72s ( 0.36s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 22.55s ( 11.31s elapsed)
with the original fmod,
INIT time 0.00s ( 0.00s elapsed)
MUT time 18.26s ( 9.18s elapsed)
GC time 0.58s ( 0.29s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 18.84s ( 9.47s elapsed)
with floor q :: Int, and within measuring accuracy the same times for truncate q :: Int (the allocation figures are a bit lower for truncate).
The problem seems to be in the number of overflowed SPARKS, which is 0 for K=8000 and 7802 for K=16000. This probably reflects in the bad concurrency
Yes (though as far as I know the more correct term here would be parallelism instead of concurrency), there is a spark pool, and when that's full, any further sparks are not scheduled for being evaluated in whatever thread next has time when its turn comes, the computation is then done without parallelism, from the parent thread. In this case that means after an initial parallel phase, the computation falls back to sequential.
The size of the spark pool is apparently about 8K (2^13).
If you watch the CPU load via top, you will see that it drops from (close to 100%)*(number of cores) to a much lower value after a while (for me, it was ~100% with -N2 and ~130% with -N4).
The cure is to avoid sparking too much, and letting each spark do some more work. With the quick-and-dirty modification
ensembleTrace orbitGen observable n m =
withStrategy (parListChunk 25 rdeepseq) . map ((map observable . subTrace n m) . orbitGen)
I'm back to 200% with -N2 for practically the entire run and a good productivity,
INIT time 0.00s ( 0.00s elapsed)
MUT time 57.42s ( 29.02s elapsed)
GC time 5.34s ( 2.69s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 62.76s ( 31.71s elapsed)
Alloc rate 1,982,155,167 bytes per MUT second
Productivity 91.5% of total user, 181.1% of total elapsed
and with -N4 it's also fine (even a wee bit faster on the wall-clock - not much because all threads do basically the same, and I have only 2 physical cores),
INIT time 0.00s ( 0.00s elapsed)
MUT time 99.17s ( 26.31s elapsed)
GC time 16.18s ( 4.80s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 115.36s ( 31.12s elapsed)
Alloc rate 1,147,619,609 bytes per MUT second
Productivity 86.0% of total user, 318.7% of total elapsed
since now the spark pool doesn't overflow.
The proper fix is to make the size of the chunks a parameter that is computed from the number of trajectories and available cores so that the number of sparks doesn't exceed the pool size.
After doing some quick profiling I found that these are the serial offenders:
ghc --make -O2 MainNonOpt.hs -threaded -prof -auto-all -caf-all -fforce-recomp
./MainNonOpt +RTS -N4 -p > /dev/null
>>>
COST CENTRE MODULE %time %alloc
fmod Main 46.3 33.3
standardMap Main 28.5 0.0
trajectory Main 23.8 66.6
What's surprising about fmod is the large number of allocations it does considering it is mostly a numerical function. So the next step would be to annotate fmod to see where is the problem:
fmod :: Double -> Double -> Double
fmod a b | a < 0 = {-# SCC "negbranch" #-} b - fmod (abs a) b
| otherwise = {-# SCC "posbranch" #-} if a < b then a
else let q = {-# SCC "division" #-} a / b in {-# SCC "expression" #-} b * (q - {-# SCC "floor" #-} fromIntegral (floor q))
This gives us:
ghc --make -O2 MainNonOpt.hs -threaded -prof -caf-all -fforce-recomp
./MainNonOpt +RTS -N4 -p > /dev/null
COST CENTRE MODULE %time %alloc
MAIN MAIN 61.5 70.0
posbranch Main 16.6 0.0
floor Main 14.9 30.0
expression Main 4.5 0.0
negbranch Main 1.9 0.0
So the bit with floor is the one which causes the issues. After looking around it turns out that the Prelude does not implement some Double RealFrac functions as best as it should(see here), probably causing some boxing/unboxing.
So by following the advice from the link I used a modified version of floor which also made the call to fromIntegral unnecessary:
floor' :: Double -> Double
floor' x = c_floor x
{-# INLINE floor' #-}
foreign import ccall unsafe "math.h floor"
c_floor :: Double -> Double
fmod :: Double -> Double -> Double
fmod a b | a < 0 = {-# SCC "negbranch" #-} b - fmod (abs a) b
| otherwise = {-# SCC "posbranch" #-} if a < b then a
else let q = {-# SCC "division" #-} a / b in {-# SCC "expression" #-} b * (q - ({-# SCC "floor" #-} floor' q))
EDIT:
As Daniel Fisher Points out, there is no need to inline C code to improve the performance. An analogous Haskell function already exists. I'll leave the answer anyway, for further reference.
This does make a difference. On my machine, for k=l=200, M=5000 here are the number for the non-optimized and the optimized version:
Non optimized:
real 0m20.635s
user 1m17.321s
sys 0m4.980s
Optimized:
real 0m14.858s
user 0m55.271s
sys 0m3.815s
The trajectory function may have similar problems and you can use profiling like it was used above to pin-point the issue.
A great starting point for profiling in Haskell can be found in this chapter of Real World Haskell.

Resources